Text Mining Terminologies
1. Document - A single unit of text to be analyzed, such as a sentence. For example, "Four score and seven years ago our fathers brought forth on this continent, a new nation, conceived in Liberty, and dedicated to the proposition that all men are created equal."
2. Tokens - The individual words in a document. For example: "nation", "Liberty", "men".
3. Terms - Single words or multiword units, such as "civil war".
4. Corpus - A collection of documents (a database of text). For example, a corpus may contain 16 documents (16 txt files).
5. Document Term Matrix - A matrix with documents in rows and terms in columns.
Example of a document term matrix:
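As a minimal sketch, such a matrix can be built in plain Python. The three short documents below are made up for illustration, and the names `corpus`, `terms` and `dtm` are assumptions rather than part of any library. Tokenization here is a naive whitespace split.

```python
from collections import Counter

# Hypothetical toy corpus: three short documents, made up for illustration.
corpus = [
    "a new nation conceived in liberty",
    "a great civil war",
    "a new birth of freedom for the nation",
]

# Naive tokenization on whitespace; real tokenizers also handle punctuation and case.
tokenized = [doc.split() for doc in corpus]

# Vocabulary: every distinct term, sorted so the column order is stable.
terms = sorted({token for doc in tokenized for token in doc})

# One row per document, one column per term; each cell holds a term count.
dtm = []
for doc in tokenized:
    counts = Counter(doc)
    dtm.append([counts[term] for term in terms])

# Print the header row, then one row per document.
print("doc", *terms)
for i, row in enumerate(dtm, start=1):
    print(f"d{i}", *row)
```

Most cells are zero because each document uses only a few of the terms. A term such as "war", which appears in a single document, is what the next item calls a sparse term.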
6. Sparse terms - Terms occurring in only very few documents (sentences).
7. Tokenization - The process of dividing unstructured text into tokens such as words, phrases, or keywords.
8. Stemming - The process of reducing words to their root form (stem). For example, "interesting", "interest" and "interested" are all stemmed to "interest". After that, the stems can be completed back to a full word form (stem completion), so that the words look "normal".
9. Polarity - Whether a document or sentence is positive, negative or neutral. This term is commonly used in sentiment analysis.
10. Bag-of-words - Each sentence (or document) is treated as a bag of words, ignoring grammar and even word order. The terms 'make India' and 'India make' have the same probability score.
11. Part of Speech Tagging - The process of tagging every word in the document with its part of speech: noun, verb, adjective, pronoun, singular noun, plural noun, etc.
12. N-grams - Sets of N co-occurring words within a given window, usually taken as contiguous sequences of N words.
- N-gram of size 1 - unigram
- N-gram of size 2 - bigram
- N-gram of size 3 - trigram
For example, for the sentence "The cow jumps over the moon", the trigrams (N = 3) are:
the cow jumps, cow jumps over, jumps over the, over the moon
How many N-grams are in a sentence?
If X is the number of words in a given sentence K, the number of N-grams for sentence K is: N-grams = X - (N - 1). For the six-word sentence above with N = 3, this gives 6 - 2 = 4 trigrams.
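The counting rule can be checked with a short sketch. The helper `ngrams` below is a hypothetical function written for this illustration, not a library call. It lowercases the sentence, splits on whitespace, and slides a window of N words across it.

```python
def ngrams(sentence, n):
    """Return every contiguous n-word sequence in the sentence as a string."""
    words = sentence.lower().split()
    # A window of n words fits at len(words) - n + 1 starting positions.
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

sentence = "The cow jumps over the moon"
x = len(sentence.split())  # X = number of words in the sentence

for n in (1, 2, 3):
    grams = ngrams(sentence, n)
    # The number of n-grams matches X - (N - 1).
    assert len(grams) == x - (n - 1)
    print(n, len(grams), grams)
```

When N is larger than the number of words, the helper returns an empty list, so in practice the count is better read as max(0, X - (N - 1)).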
Check out the detailed documentation: Trigrams and Bigrams Explained