Channel: ListenData

Text Mining Basics

Text Mining Terminology
  1. A document is a single piece of text, such as a sentence. For example, "Four score and seven years ago our fathers brought forth on this continent, a new nation, conceived in Liberty, and dedicated to the proposition that all men are created equal."
  2. Tokens represent words. For example: "nation", "Liberty", "men".
  3. Terms may represent single words or multiword units, such as "civil war".
  4. A corpus is a collection of documents (a database). For example, a corpus may contain 16 documents (16 txt files).
  5. A document term matrix is a matrix with documents in rows and terms in columns.
Example of a document term matrix:
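The definition above can be sketched in plain Python. The three-document mini-corpus below is a hypothetical example (not from the article); each row of the matrix is a document, each column a term, and each cell counts how often that term occurs in that document.

```python
from collections import Counter

# Hypothetical mini-corpus of three documents
corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog",
]

# Vocabulary = sorted set of all terms (these become the columns)
vocab = sorted({term for doc in corpus for term in doc.split()})

# Document-term matrix: one row per document, one count per term
dtm = [[Counter(doc.split())[term] for term in vocab] for doc in corpus]

print(vocab)   # ['cat', 'chased', 'dog', 'log', 'mat', 'on', 'sat', 'the']
for row in dtm:
    print(row)
```

Libraries such as scikit-learn build the same structure with `CountVectorizer`, but the idea is exactly this counting loop.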
6. Sparse terms - Terms occurring in only a very few documents (sentences).
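Sparse terms are usually dropped before modelling. A minimal sketch, assuming a hypothetical threshold `min_docs` (the corpus and threshold below are illustrative, not from the article):

```python
# Hypothetical corpus; "a", "rare", "term" each appear in only one document
corpus = [
    "the cat sat",
    "the dog sat",
    "a rare term",
]
min_docs = 2  # hypothetical cutoff: keep terms appearing in at least 2 documents

vocab = {t for doc in corpus for t in doc.split()}
doc_freq = {t: sum(t in doc.split() for doc in corpus) for t in vocab}
kept = sorted(t for t, df in doc_freq.items() if df >= min_docs)

print(kept)  # ['sat', 'the']
```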

7. Tokenization - The process of dividing unstructured text into tokens such as words, phrases and keywords.
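A minimal tokenizer can be written with a regular expression that pulls out runs of word characters (real tokenizers, e.g. in NLTK or spaCy, handle contractions, hyphens and punctuation more carefully):

```python
import re

doc = ("Four score and seven years ago our fathers brought forth on this "
       "continent, a new nation, conceived in Liberty, and dedicated to the "
       "proposition that all men are created equal.")

# \w+ matches maximal runs of word characters, discarding punctuation
tokens = re.findall(r"\w+", doc)

print(tokens[:7])  # ['Four', 'score', 'and', 'seven', 'years', 'ago', 'our']
```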

8. Stemming - Reducing inflected words to a common root form so that related words are treated as one term. For example, "interesting", "interest" and "interested" are all stemmed to "interest".
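A toy suffix-stripping stemmer makes the idea concrete. This is only a sketch, not the Porter algorithm that real stemmers (e.g. NLTK's `PorterStemmer`) implement:

```python
def toy_stem(word):
    # Naive suffix stripping -- illustrative only, not a real stemmer
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print([toy_stem(w) for w in ["interesting", "interested", "interest"]])
# ['interest', 'interest', 'interest']
```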

9. Polarity - Whether a document or sentence is positive, negative or neutral. This term is commonly used in sentiment analysis.
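A lexicon-based polarity scorer is the simplest way to see this. The two word lists below are tiny hypothetical stand-ins; production systems use curated lexicons or trained models:

```python
# Hypothetical sentiment lexicon (illustrative only)
POSITIVE = {"good", "great", "excellent"}
NEGATIVE = {"bad", "poor", "terrible"}

def polarity(sentence):
    words = sentence.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

print(polarity("the service was great"))   # positive
print(polarity("the food was terrible"))   # negative
print(polarity("the sky is blue"))         # neutral
```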

10. Bag-of-words - Each sentence (or document) is treated as a bag of words, ignoring grammar and even word order. The phrases "make India" and "India make" have the same probability score.
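The "make India" / "India make" point can be checked directly: once word order is discarded, both phrases collapse to the same multiset of counts.

```python
from collections import Counter

# A bag of words is just a multiset of tokens -- order is ignored
bag1 = Counter("make India".lower().split())
bag2 = Counter("India make".lower().split())

print(bag1)          # Counter({'make': 1, 'india': 1})
print(bag1 == bag2)  # True
```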

11. Part of Speech Tagging - Tagging every word in a document with its part of speech - noun, verb, adjective, pronoun, singular noun, plural noun, etc.
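A toy lookup tagger shows the output shape. The mini-lexicon and the NOUN fallback are hypothetical; real taggers such as NLTK's `pos_tag` use trained statistical models rather than a dictionary:

```python
# Hypothetical word-to-tag lexicon (illustrative only)
LEXICON = {"the": "DET", "cow": "NOUN", "jumps": "VERB", "over": "ADP", "moon": "NOUN"}

def tag(sentence):
    # Look each word up; default unknown words to NOUN (a crude but common fallback)
    return [(w, LEXICON.get(w, "NOUN")) for w in sentence.lower().split()]

print(tag("The cow jumps over the moon"))
# [('the', 'DET'), ('cow', 'NOUN'), ('jumps', 'VERB'), ('over', 'ADP'), ('the', 'DET'), ('moon', 'NOUN')]
```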

12. N-grams - A set of co-occurring words within a given window.
    • N-gram of size 1 - unigram 
    • N-gram of size 2 - bigram 
    • N-gram of size 3 - trigram

    For example, take the sentence "The cow jumps over the moon".

      I. If N=2 (known as bigrams), the n-grams would be:
        the cow, cow jumps, jumps over, over the, the moon
        So you have 5 n-grams in this case. Notice that we moved from "the cow" to "cow jumps" to "jumps over", etc., essentially moving one word forward to generate the next bigram.

      II. If N=3 (trigrams), the n-grams would be:
        the cow jumps, cow jumps over, jumps over the, over the moon
        So you have 4 n-grams in this case.
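The sliding-window idea above is a few lines of Python: slice the word list at every starting position and join each slice back into a phrase.

```python
def ngrams(sentence, n):
    # Slide a window of n words over the sentence, one word at a time
    words = sentence.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

sentence = "The cow jumps over the moon"
print(ngrams(sentence, 2))  # ['The cow', 'cow jumps', 'jumps over', 'over the', 'the moon']
print(ngrams(sentence, 3))  # ['The cow jumps', 'cow jumps over', 'jumps over the', 'over the moon']
```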

How many n-grams are in a sentence?
If X is the number of words in a given sentence K, the number of n-grams for sentence K is: N-grams = X - (N - 1)
N-grams let you use tokens such as bigrams in the feature space instead of just unigrams (single words). However, various research papers warn that using bigrams and trigrams in your feature space may not necessarily yield any significant improvement.
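The counting formula can be checked directly against the example sentence, which has X = 6 words:

```python
def ngram_count(sentence, n):
    x = len(sentence.split())  # X = number of words in the sentence
    return x - (n - 1)         # N-grams = X - (N - 1)

sentence = "The cow jumps over the moon"  # X = 6
print(ngram_count(sentence, 2))  # 5 bigrams
print(ngram_count(sentence, 3))  # 4 trigrams
```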
Trigrams vs. Bigrams
Trigrams do have an advantage over bigrams, but it is small.

Check out the detailed documentation: Trigrams and Bigrams Explained

