Text Mining Basics

Text Mining Terminologies

1. Document - A document is a single unit of text, such as a sentence or a text file. For example, "Four score and seven years ago our fathers brought forth on this continent, a new nation, conceived in Liberty, and dedicated to the proposition that all men are created equal."
2. Tokens - Tokens represent words. For example: "nation", "Liberty", "men".
3. Terms - Terms may represent single words or multiword units, such as "civil war".
4. Corpus - A corpus is a collection of documents (a database). For example, a corpus may contain 16 documents (16 txt files).
5. Document Term Matrix - A matrix with documents as rows and terms as columns. Each cell typically holds the number of times the term occurs in the document.

Example of a document term matrix:
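A small document term matrix can be built by hand to make the idea concrete. The sketch below uses only the Python standard library; the three-document corpus and the helper names `tokenize` and `document_term_matrix` are made up for illustration and are not from any particular text mining package.

```python
from collections import Counter
import re


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def document_term_matrix(documents):
    """Return (terms, rows) where rows[i][j] counts terms[j] in documents[i]."""
    counts = [Counter(tokenize(doc)) for doc in documents]
    terms = sorted(set().union(*counts))  # every distinct term, one column each
    rows = [[c[t] for t in terms] for c in counts]
    return terms, rows


# Hypothetical three-document corpus, made up for illustration.
corpus = [
    "the cow jumps over the moon",
    "the moon is bright",
    "a new nation",
]
terms, rows = document_term_matrix(corpus)

# Print documents as rows and terms as columns.
print(" " * 6 + " ".join(f"{t:>6}" for t in terms))
for i, row in enumerate(rows, start=1):
    print(f"doc{i:<3}" + " ".join(f"{n:>6}" for n in row))
```

Most cells are zero because each term appears in only one or two documents, which is why document term matrices are usually described as sparse.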

6. Sparse terms - Terms that occur in only a few documents (sentences).

7. Tokenization - The process of dividing unstructured text into tokens such as words, phrases, and keywords.

8. Stemming - The process of reducing words to their root form. For example, "interesting", "interest" and "interested" are all stemmed to "interest". After that, the stems can be completed back to a full word form (stem completion), so that the words look "normal".

9. Polarity - Whether a document or sentence is positive, negative or neutral. This term is commonly used in sentiment analysis.

10. Bag-of-words - Each sentence (or document) is treated as a bag of words, ignoring grammar and even word order. The terms 'make India' and 'India make' therefore have the same probability score.
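A quick way to see why word order does not matter is to count the words in each phrase and compare the counts. This is a minimal sketch using the standard library `Counter`; the helper name `bag_of_words` is hypothetical.

```python
from collections import Counter


def bag_of_words(text):
    """Map each lowercase word to its count; word order is discarded."""
    return Counter(text.lower().split())


bag_a = bag_of_words("make India")
bag_b = bag_of_words("India make")

# Both phrases contain the same words the same number of times.
print(bag_a == bag_b)  # True
```

Any model that sees only these counts cannot tell the two phrases apart.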

11. Part of Speech Tagging - Tagging every word in the document with its part of speech: noun, verb, adjective, pronoun, singular noun, plural noun, etc.

12. N-grams - Sequences of co-occurring words within a given window.

N-gram of size 1 - unigram
N-gram of size 2 - bigram
N-gram of size 3 - trigram

For example, take the sentence "The cow jumps over the moon".

I. If N=2 (known as bigrams), then the n-grams would be:

the cow, cow jumps, jumps over, over the, the moon

So you have 5 n-grams in this case. Notice that we moved from the->cow to cow->jumps to jumps->over, etc., moving one word forward each time to generate the next bigram.

II. If N=3 (trigrams), the n-grams would be:

the cow jumps, cow jumps over, jumps over the, over the moon

So you have 4 n-grams in this case.
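The sliding-window idea above can be sketched in a few lines of Python. The function name `ngrams` is hypothetical, and splitting on whitespace is a simplifying assumption; a real tokenizer would also handle punctuation.

```python
def ngrams(sentence, n):
    """Return the word n-grams of a sentence, sliding one word at a time."""
    words = sentence.lower().split()
    # Each window starts one word after the previous one.
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


sentence = "The cow jumps over the moon"
bigrams = ngrams(sentence, 2)
trigrams = ngrams(sentence, 3)

print(bigrams)   # 5 bigrams, from 'the cow' to 'the moon'
print(trigrams)  # 4 trigrams, from 'the cow jumps' to 'over the moon'
```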

How many N-grams in a sentence?

If X is the number of words in a given sentence K, the number of n-grams for sentence K is: N-grams = X - (N - 1)
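The formula can be checked against the example sentence with a short sketch. The function name `ngram_count` is hypothetical, and clamping the result at zero for sentences shorter than N is an added assumption that the formula itself does not state.

```python
def ngram_count(num_words, n):
    """Number of n-grams in a sentence of num_words words: X - (N - 1)."""
    # Assumption: a sentence with fewer than n words has no n-grams.
    return max(num_words - (n - 1), 0)


x = len("The cow jumps over the moon".split())  # X = 6 words
for n in (1, 2, 3):
    print(n, ngram_count(x, n))
```

For the six-word sentence this gives 6 unigrams, 5 bigrams and 4 trigrams, matching the lists written out earlier.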

N-grams let you use tokens such as bigrams in the feature space instead of just unigrams (single words). However, various research papers warn that adding bigrams and trigrams to your feature space may not necessarily yield any significant improvement.

Trigrams vs. Bigrams

Trigrams do have an advantage over bigrams, but it is small.

Check out the detailed documentation: Trigrams and Bigrams Explained
