Machine Learning Approach to Document Classification
Let's say, for the sake of argument, that we would like to classify different web sites into some predefined categories or classes. There are different approaches one can use to do it.
Below I will give a toy example that you can enhance further for your needs.
1. Solution
Approaches:
- Rule-based classification - a rule captures a certain combination of keywords that indicates a class. Hand-coded rules have good scaling properties, but creating and maintaining them over time is labor intensive. A technically skilled person (e.g., a domain expert who is good at writing regular expressions) can create rule sets that rival or exceed the accuracy of automatically generated classifiers, taking advantage of the fact that classification criteria are easy to control when the number of rules is small. However, the hand-crafted rules approach suffers from several disadvantages:
- Rules sometimes conflict with each other.
- Maintenance of rules becomes more difficult as the number of rules increases.
- The rules have to be manually reconstructed when the target domain changes.
- Coverage is low because of the wide variety of expressions.
- Machine learning classification - in machine learning, the set of rules or, more generally, the decision criterion of the text classifier is learned automatically from training data. In machine learning text classification, we require a number of good example documents (or training documents) for each class. The need for manual classification is not eliminated, because the training documents come from a person who has labeled them - where labeling refers to the process of annotating each document with its class. But labeling is arguably an easier task than writing rules: almost anybody can look at a document and decide whether or not it is related to some topic. Moreover, once the training data is ready, the learning process is fully automatic and domain-independent.
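For illustration, the rule-based approach can be sketched as an ordered list of regular-expression rules, each mapped to a class (the rules and class names below are made up for this toy example):

```python
import re

# Each rule pairs a hand-written keyword pattern with a class label.
# A domain expert would curate and maintain this list by hand.
rules = [
    (re.compile(r"\b(goal|match|referee)\b", re.I), "sports"),
    (re.compile(r"\b(movie|film|actor)\b", re.I), "arts"),
]

def classify_by_rules(text):
    # First matching rule wins; unmatched documents get no class.
    for pattern, label in rules:
        if pattern.search(text):
            return label
    return "unknown"

print(classify_by_rules("The referee allowed the goal"))  # sports
print(classify_by_rules("A new film premiere"))           # arts
```

Note how the disadvantages listed above show up directly: the first-match ordering is itself a potential rule conflict, and every new domain means rewriting the pattern list.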
2. Machine Learning Workflow:
1. Prepare a set of training data - a collection of documents labeled by well-defined topics.
2. Represent the training data as numerical feature vectors.
3. Use the training feature vectors to learn a classification function.
4. Classify new documents by transforming them into feature vectors and applying the classification function.
5. Evaluate the accuracy.
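The five steps above can be sketched end-to-end in pure Python. A real system would typically use a library such as scikit-learn; here a simple nearest-centroid rule stands in for the learned classification function, and all documents and labels are invented for illustration:

```python
import re
from collections import defaultdict

def tokenize(text):
    # Normalization for step 2: down-case and keep alphabetic tokens only.
    return re.findall(r"[a-z]+", text.lower())

def build_vocabulary(docs):
    vocab = {}
    for doc in docs:
        for word in tokenize(doc):
            vocab.setdefault(word, len(vocab))
    return vocab

def vectorize(doc, vocab):
    # Step 2: term-frequency bag-of-words vector.
    vec = [0] * len(vocab)
    for word in tokenize(doc):
        if word in vocab:
            vec[vocab[word]] += 1
    return vec

def train_centroids(docs, labels, vocab):
    # Step 3: the "learned function" here is one mean vector per class.
    sums = defaultdict(lambda: [0.0] * len(vocab))
    counts = defaultdict(int)
    for doc, label in zip(docs, labels):
        v = vectorize(doc, vocab)
        sums[label] = [a + b for a, b in zip(sums[label], v)]
        counts[label] += 1
    return {c: [x / counts[c] for x in s] for c, s in sums.items()}

def classify_nearest_centroid(doc, centroids, vocab):
    # Step 4: pick the class whose centroid best matches the new document.
    v = vectorize(doc, vocab)
    return max(centroids,
               key=lambda c: sum(a * b for a, b in zip(v, centroids[c])))

# Step 1: toy labeled training set.
train_docs = ["football match score goal", "movie actor film scene",
              "goal penalty referee", "film director movie award"]
train_labels = ["sports", "arts", "sports", "arts"]

vocab = build_vocabulary(train_docs)
centroids = train_centroids(train_docs, train_labels, vocab)
print(classify_nearest_centroid("the referee awarded a goal",
                                centroids, vocab))  # sports
```

Step 5, accuracy evaluation, is covered in section 4 below and would be run on a held-out labeled set rather than the training documents.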
Fig.1 - Machine Classification Workflow:
(a) - learning a classification function on the training document set.
(b) - classification of new query documents.
3. Feature-Vector Representation:
Bag-of-Words approach - a document is regarded as a set of words, regardless of word order and grammar. It can be extended to include n-grams - contiguous sequences of n words.
To build a BOW-based vector representation we usually use the following flow:
- Normalize - convert words into a normalized form by applying text-processing techniques such as down-casing, lemmatization, stemming, etc.
- Remove stop words - ignore predefined common words, e.g., "the", "a", "to", "with", "that"...
- Build a vocabulary - the vocabulary is constructed as the unified set of distinct words from the entire training collection after normalization and stop-word removal. Each word in the vocabulary is assigned a running index from 1 to the number of words (the order of the words is not important).
- Represent a document as a vector - use the vocabulary to construct the BOW vector representation of each document. The dimension of the BOW vector space equals the number of words in the vocabulary. Each index of the vector refers to the corresponding word index in the vocabulary, and the entry at index i measures the occurrences of the i-th vocabulary word in the document. There are several common options to measure a word occurrence:
- Binary: the word occurs (1) or does not occur (0)
- Term frequency (tf): the number of times the word appears in the document
- Inverse document frequency (idf): the inverse of the fraction of documents in the whole set that contain the word
- tf-idf: the product of the tf and idf measures; frequent words that appear in only a small number of documents achieve a high value
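These weighting options can be illustrated on a tiny tokenized corpus (the documents are made up, and idf is computed here as log(N/df), one common variant):

```python
import math

# Three toy documents, already normalized and tokenized.
docs = [["john", "likes", "movies", "movies"],
        ["john", "likes", "football"],
        ["football", "games"]]

def tf(word, doc):
    # Term frequency: raw count of the word in one document.
    return doc.count(word)

def idf(word, docs):
    # Inverse document frequency: log(total docs / docs containing the word).
    df = sum(1 for d in docs if word in d)
    return math.log(len(docs) / df)

def tf_idf(word, doc, docs):
    return tf(word, doc) * idf(word, docs)

print(tf("movies", docs[0]))            # 2
print(idf("john", docs))                # in 2 of 3 docs -> log(3/2)
print(tf_idf("movies", docs[0], docs))  # 2 * log(3/1): frequent AND rare
```

"movies" gets the highest tf-idf because it is both frequent in its document and absent from the others, exactly the behavior described above.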
Consider, for example, two simple documents:
- Document 1: John likes to watch movies. Mary likes too.
- Document 2: John also likes to watch football games.
Based on these
two text documents, a vocabulary is constructed as:
- Vocabulary = {1:"John", 2:"likes", 3:"to", 4:"watch", 5:"movies", 6:"also", 7:"football", 8 :"games", 9:"Mary", 10:"too"},
which has 10
distinct words. Using the indexes of the dictionary, each document is
represented by a 10-entry vector:
- Document 1: x1 = [1, 2, 1, 1, 1, 0, 0, 0, 1, 1]
- Document 2: x2 = [1, 1, 1, 1, 0, 1, 1, 1, 0, 0]
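The vocabulary indexing and the two vectors above can be reproduced with a few lines of Python (tokenization here is a simple lower-cased alphabetic split; a real pipeline would also normalize and remove stop words):

```python
import re

doc1 = "John likes to watch movies. Mary likes too."
doc2 = "John also likes to watch football games."

# Vocabulary in the same order as the dictionary above (indexes 1..10).
vocab = ["john", "likes", "to", "watch", "movies",
         "also", "football", "games", "mary", "too"]

def bow(text, vocab):
    # Term-frequency BOW vector: one entry per vocabulary word.
    words = re.findall(r"[a-z]+", text.lower())
    return [words.count(w) for w in vocab]

print(bow(doc1, vocab))  # [1, 2, 1, 1, 1, 0, 0, 0, 1, 1]
print(bow(doc2, vocab))  # [1, 1, 1, 1, 0, 1, 1, 1, 0, 0]
```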
4. Accuracy Evaluation:
The analysis of classification results is based on the
following characteristics:
- True Positives (TP) - documents correctly labeled as belonging to a class
- False Negatives (FN) - documents that were not labeled as belonging to a class but should have been (also often called missed alarms or type II errors)
- False Positives (FP) - documents incorrectly labeled as belonging to a class (also often called false alarms or type I errors)
These characteristics compare the results of the classifier
under test with trusted external judgments. The terms positive and negative refer
to the classifier's prediction (sometimes known as the observation), and
the terms true and false refer to whether that prediction
corresponds to the external judgment (sometimes known as the expectation).
Using the above definitions the following important measures
are constructed:
- Precision - the percentage of documents identified as belonging to the class that actually belong to it
- Recall - the percentage of documents belonging to the class that were correctly identified
A precision score of 1.0 for a class C means that every item
labeled as belonging to class C does indeed belong to class C (but says nothing
about the number of items from class C that were not labeled correctly) whereas
a recall of 1.0 means that every item from class C was labeled as belonging to
class C (but says nothing about how many other items were incorrectly also
labeled as belonging to class C). Thus, precision is used when the probability that a positive result is correct is important.
Often, there is an inverse relationship between precision and
recall, where it is possible to increase one at the cost of reducing the other.
For example, a classification system for deciding whether or not, say, a fruit
is an orange, can achieve high precision by only classifying fruits with the
exact right shape and color as oranges, but at the cost of low recall due to
the number of false negatives from oranges that did not quite match
the specification.
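As a worked example, precision and recall follow directly from the TP, FP, and FN counts (the counts below are illustrative):

```python
# Illustrative counts for one class: 51 correct positives, 10 false alarms,
# 18 documents of the class that the classifier missed.
tp, fp, fn = 51, 10, 18

precision = tp / (tp + fp)  # fraction of positive predictions that were right
recall = tp / (tp + fn)     # fraction of actual class members that were found

print(round(precision, 3))  # 0.836
print(round(recall, 3))     # 0.739
```

With these counts the classifier is trustworthy when it says "positive" (high precision) but misses roughly a quarter of the class (lower recall), the trade-off discussed above.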
- F-measure - the weighted harmonic mean of precision and recall; the balanced F1 variant is 2 * precision * recall / (precision + recall)
Why use the harmonic mean? The harmonic mean emphasizes the importance of small values, whereas the arithmetic mean is affected more by outliers that are unusually large.
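A quick numerical illustration of this point, using the balanced F1 form of the F-measure:

```python
def f1(precision, recall):
    # Balanced F-measure: harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)

# A degenerate classifier: perfect precision but terrible recall.
p, r = 1.0, 0.1

print((p + r) / 2)  # arithmetic mean looks respectable: 0.55
print(f1(p, r))     # harmonic mean exposes the weak recall: ~0.182
```

The arithmetic mean hides the near-useless recall, while the harmonic mean is dragged down toward the smaller of the two values.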
- Confusion matrix - a specific table layout that allows visualization of the performance of a classifier. Each column of the matrix represents the instances in a predicted class, while each row represents the instances in an actual class.
Fig.3 - Confusion matrix example. Of the 69 actual Sports documents, 51 were labeled correctly, while 6 were wrongly labeled as Animals and 12 as Arts. Similarly, of the 736 Arts documents, 699 were labeled correctly, while 10 were wrongly labeled as Sports, 25 as Animals and 2 as Adults.
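A confusion matrix like the one in Fig.3 can be built by simply counting (actual, predicted) pairs; the class names below echo the figure, but the tiny label lists are made up:

```python
from collections import Counter

# Trusted labels vs. classifier predictions for six toy documents.
actual    = ["sports", "sports", "arts", "arts", "arts", "animals"]
predicted = ["sports", "arts",   "arts", "arts", "sports", "animals"]

# Keyed by (actual, predicted): rows are actual classes, columns predicted.
matrix = Counter(zip(actual, predicted))

labels = sorted(set(actual) | set(predicted))
print("actual\\predicted", *labels)
for a in labels:
    print(a, *[matrix[(a, p)] for p in labels])
```

The diagonal entries are the correctly classified documents (the TP counts per class); off-diagonal entries show exactly which classes get confused with which.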
Eli Goz