I am watching lectures on Natural Language Processing. In the week 3 lectures, the professor talks about text classification. I found that writing the formulas out in words helps.
The task: given a training set, assign each document in the test set to a class.
We need to calculate two probabilities.
Say we have a training set of 5 documents, with 3 documents of class ‘A’ and 2 documents of class ‘B’.
1) Probability of class ‘A’, given the training set
= number of documents classified as ‘A’ / total number of documents in the training set
= 3 / 5 in this example
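To make the prior concrete, here is a minimal sketch in Python. The list train_labels and the function class_priors are names I made up for illustration; the labels just mirror the 3-versus-2 split in the example.

```python
from collections import Counter

# Hypothetical labels for the 5 training documents in the example:
# 3 documents of class 'A' and 2 documents of class 'B'.
train_labels = ["A", "A", "A", "B", "B"]


def class_priors(labels):
    """Return {class: documents in that class / total documents}."""
    counts = Counter(labels)
    total = len(labels)
    return {label: n / total for label, n in counts.items()}


priors = class_priors(train_labels)
print(priors)
```

With these labels, class ‘A’ gets 3/5 and class ‘B’ gets 2/5.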
2) Probability of each word in the vocabulary, given a class
- Vocabulary (V): the unique words in the training set
- Assume there are 3 unique words in the training set: hello, world, goodbye (so the size of V is 3)
- All documents of class ‘A’ are merged into one, and the same is done for class ‘B’
- Probability of the word ‘hello’ given class ‘A’
= (number of times ‘hello’ occurs in documents classified as ‘A’ + 1) / (total number of words in documents classified as ‘A’ + size of V)
- The + 1 keeps a word that never occurs in a class from getting a probability of zero
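The word probabilities can be sketched the same way. The documents in train_docs are invented for illustration (the lecture does not give any), and word_likelihoods is a hypothetical helper name. The + 1 in the numerator is what is usually called add-one (Laplace) smoothing.

```python
from collections import Counter

# Made-up training documents using only the three vocabulary words.
train_docs = [
    ("hello world", "A"),
    ("hello hello", "A"),
    ("world hello", "A"),
    ("goodbye world", "B"),
    ("goodbye goodbye", "B"),
]


def word_likelihoods(docs):
    """Return {class: {word: smoothed probability of word given class}}."""
    # Vocabulary: unique words across the whole training set.
    vocab = {word for text, _ in docs for word in text.split()}

    # "Merge" the documents of each class by pooling their word counts.
    counts = {}
    for text, label in docs:
        counts.setdefault(label, Counter()).update(text.split())

    likelihoods = {}
    for label, counter in counts.items():
        total_words = sum(counter.values())
        likelihoods[label] = {
            # (count of word in class + 1) / (total words in class + size of V)
            word: (counter[word] + 1) / (total_words + len(vocab))
            for word in vocab
        }
    return likelihoods


likelihoods = word_likelihoods(train_docs)
print(likelihoods["A"]["hello"])
```

In this toy data ‘hello’ occurs 4 times among the 6 words of class ‘A’, so its probability given ‘A’ is (4 + 1) / (6 + 3) = 5/9. ‘goodbye’ never occurs in class ‘A’ but still gets (0 + 1) / (6 + 3) = 1/9 instead of zero.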
Then we tackle the test set and figure out which class gets the highest score, a value proportional to its probability. How?
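1) Start with the prior probability of a class, say ‘A’
2) For each word in the test document, take the probability of that word given the class (‘A’) and raise it to the power of the word's frequency in the test document
3) Multiply all of these together, repeat for every class, and pick the class with the largest result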
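The three steps can be sketched in Python as follows. The numbers in priors and likelihoods are hypothetical: the priors come from the 3-versus-2 split, and the word probabilities are smoothed values from made-up counts. The function name classify and the choice to skip words outside the vocabulary are my assumptions, not something from the lecture.

```python
# Hypothetical inputs, not taken from the lecture.
priors = {"A": 3 / 5, "B": 2 / 5}
likelihoods = {
    "A": {"hello": 5 / 9, "world": 3 / 9, "goodbye": 1 / 9},
    "B": {"hello": 1 / 7, "world": 2 / 7, "goodbye": 4 / 7},
}


def classify(text, priors, likelihoods):
    """Return (best class, {class: score}) for one test document."""
    scores = {}
    for label, prior in priors.items():
        score = prior  # step 1: start with the class prior
        for word in text.split():
            # Assumption of this sketch: ignore words not in the vocabulary.
            if word in likelihoods[label]:
                # Steps 2 and 3: multiplying once per occurrence is the
                # same as raising the probability to the word's frequency.
                score *= likelihoods[label][word]
        scores[label] = score
    best = max(scores, key=scores.get)
    return best, scores


label, scores = classify("hello hello world", priors, likelihoods)
print(label, scores)
```

For the test document "hello hello world", class ‘A’ scores 3/5 × (5/9)^2 × 3/9 and class ‘B’ scores 2/5 × (1/7)^2 × 2/7, so ‘A’ wins. For long documents the product gets very small, so in practice it is common to add the logarithms of the probabilities instead of multiplying; the winning class is the same.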