PicScout's Engineering Blog: May 2012

Machine Learning Approach to Documents Classification

Let's say for the sake of the argument that we would like classify different web sites into some predefined categories or classes. There are different approaches one can use in order to do it.
Below I will give a toy example that you can enhance further more for your needs.

1. Solution Approaches:

Rule based classification - a rule captures a certain combination of keywords that indicates a class. Hand-coded rules have good scaling properties, but creating and maintaining them over time is labor intensive. A technically skilled person (e.g., a domain expert who is good at writing regular expressions) can create rule sets that will rival or exceed the accuracy of the automatically generated classifiers taking advantage of the fact that classification criteria can be easily controlled when the number of rules is small. However hand-crafted rules approach suffer from several disadvantages:

- Sometimes, rules conflicts each other.

- Maintenance of rules becomes more difficult as the number of rules increases.

- The rules have to be manually reconstructed when a target domain changes.

- Low coverage because of a wide variety of expressions.

Machine learning classification - In machine learning, the set of rules or, more generally, the decision criterion of the text classifier, is learned automatically from training data. In machine text classification, we require a number of good example documents (or training documents) for each class. The need for manual classification is not eliminated because the training documents come from a person who has labeled them - where labeling refers to the process of annotating each document with its class. But labeling is arguably an easier task than writing rules. Almost anybody can look at a document and decide whether or not it is related to some topic. Moreover once the training data is ready the learning process is fully automatic and domain independent.

2. Machine Learning Workflow:

1. Prepare a set of training data – collection of documents labeled by well defined topics.

2. Represent the training data as numerical feature vectors.

3. Use the training feature-vectors to learn a classification function.

4. Classify new documents by transforming them into feature vectors and applying

the classification function.

5. Evaluate the accuracy.

Fig.1 - Machine Classification Workflow:

(a) - learning a classification function on training documents set.

(b) - classification of new query documents

3. Feature – vectors representation:

Bag-of-Word approach – a document is regarded as a set of words regardless of the word order and grammar. Can be extended to include n-grams – contiguous sequences of n words.

To build BOW based vector representation we usually use the following flow:

Normalize - Convert words into a normalized forms applying text processing techniques such as down-case, lemmatization, stemming and etc.

Remove stop words – ignore predefined common words, e.g., “the”,” a”, “to”, “with”, ‘that...

Build a vocabulary – vocabulary is constructed as a unified set of distinct words from the entire training set collection after normalizing and removing stop words. Each word in the vocabulary is assigned a running index from 1 to number of words (it is not important what will be the order of the words).

Represent a document as a vector - Use the vocabulary to construct the BOW vector representation of each document. The dimension of the BOW vector space will be equal to the number of words in the vocabulary. Each index of the vector refers to the corresponding word index of the vocabulary. The entry corresponding to the index i in the vector measure the occurrences of the corresponding i-th word in the vocabulary. There are several common options to measure a word occurrence:

Binary: occur (1) or not-occur (0)

Term Frequency (tf) : the number of times where word appears in a document

Inverse document frequency ( idf): the inverted rate of documents that contain word against a whole set of documents

Tf-Idf: Multiplication of tf and idf measures. Frequent words that appear only in a small number of documents achieve high value

Consider, for example, two simple documents:

Document 1: John likes to watch movies. Mary likes too.

Document 2: John also likes to watch football games.

Based on these two text documents, a vocabulary is constructed as:

Vocabulary = {1:"John", 2:"likes", 3:"to", 4:"watch", 5:"movies", 6:"also", 7:"football", 8 :"games", 9:"Mary", 10:"too"},

which has 10 distinct words. Using the indexes of the dictionary, each document is represented by a 10-entry vector:

Document 1: x₁ = [1, 2, 1, 1, 1, 0, 0, 0, 1, 1]

Document 2 : x₂ = [1, 1, 1, 1, 0, 1, 1, 1, 0, 0]

4. Accuracy Evaluation:

The analysis of classification results is based on the following characteristics:

True Positives (TP) - documents correctly labeled as belonging to a class

False Negatives (FN) – Items which were not labeled as belonging to a class but should have been (also often called missed alarms or type II errors)

False Positives (FP) – Items incorrectly labeled as belonging to a class (also often called false alarms or type I errors).

These characteristic compare the results of the classifier under test with trusted external judgments. The terms positive and negative refer to the classifier's prediction (sometimes known as the observation), and the terms true and false refer to whether that prediction corresponds to the external judgment (sometimes known as the expectation).

Using the above definitions the following important measures are constructed:

Precision – Percentage of documents correctly identified as belonging to the class

Recall - Percentage of documents belonging to the class that where correctly identified

A precision score of 1.0 for a class C means that every item labeled as belonging to class C does indeed belong to class C (but says nothing about the number of items from class C that were not labeled correctly) whereas a recall of 1.0 means that every item from class C was labeled as belonging to class C (but says nothing about how many other items were incorrectly also labeled as belonging to class C). Thus, Precision is used when probability that a positive result is correct is important

Often, there is an inverse relationship between precision and recall, where it is possible to increase one at the cost of reducing the other. For example, a classification system for deciding whether or not, say, a fruit is an orange, can achieve high precision by only classifying fruits with the exact right shape and color as oranges, but at the cost of low recall due to the number of false negatives from oranges that did not quite match the specification.

F-measure - Weighted harmonic mean of precision and recall

Why use harmonic mean?

Harmonic mean emphasizes the importance of small values, whereas the arithmetic mean is affected more by outliers that are unusually large.

Confusion matrix - a specific table layout that allows visualization of the performance of a classification. Each column of the matrix represents the instances in a predicted class, while each row represents the instances in an actual class.

Fig.3. Confusion matrix example. Of the 69 actual Sports documents 51 where labeled correctly while 6 where wrongly labeled as Animals and 12 as arts. Similarly, of 736 Arts documents 699 where labeled correctly while 10 where wrongly labeled as sports, 25 as Animals and 2 as Adults.

5. Refernces:

[1] http://en.wikipedia.org/wiki/Precision_and_recall

[2] http://en.wikipedia.org/wiki/Confusion_matrix

Eli Goz

Thursday, May 10, 2012