Thursday, May 10, 2012

Machine Learning Approach to Documents Classification



Let's say, for the sake of argument, that we would like to classify different web sites into some predefined categories or classes. There are different approaches one can use in order to do it.
Below I will give a toy example that you can enhance further for your needs.


1. Solution Approaches:

  • Rule based classification - a rule captures a certain combination of keywords that indicates a class. Hand-coded rules have good scaling properties, but creating and maintaining them over time is labor intensive. A technically skilled person (e.g., a domain expert who is good at writing regular expressions) can create rule sets that will rival or exceed the accuracy of automatically generated classifiers, taking advantage of the fact that classification criteria are easy to control when the number of rules is small. However, the hand-crafted rules approach suffers from several disadvantages:
-          Rules sometimes conflict with each other.
-          Maintenance of the rules becomes more difficult as their number increases.
-          The rules have to be manually reconstructed when the target domain changes.
-          Coverage is low because of the wide variety of possible expressions.

  • Machine learning classification - in machine learning, the set of rules or, more generally, the decision criterion of the text classifier is learned automatically from training data. This approach requires a number of good example documents (training documents) for each class. The need for manual classification is not eliminated, because the training documents come from a person who has labeled them, where labeling refers to the process of annotating each document with its class. But labeling is arguably an easier task than writing rules: almost anybody can look at a document and decide whether or not it is related to some topic. Moreover, once the training data is ready, the learning process is fully automatic and domain independent.


2. Machine Learning Workflow:
           1. Prepare a set of training data – a collection of documents labeled with well-defined topics.
           2. Represent the training data as numerical feature vectors.
           3. Use the training feature vectors to learn a classification function.
           4. Classify new documents by transforming them into feature vectors and applying the classification function.
           5. Evaluate the accuracy.
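To make the workflow concrete, here is a minimal sketch in Python using the scikit-learn library. The toy documents, the "Sports"/"Arts" labels, and the choice of tf-idf features with a Naive Bayes classifier are illustrative assumptions only; any vectorizer and learning algorithm can be plugged into the same flow.

# Minimal sketch of the workflow above (illustrative toy data).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report

# 1. Training data: documents labeled with well-defined topics.
train_docs = [
    "the team won the championship game",
    "football match results and final scores",
    "the film festival opened yesterday",
    "a new drama movie was released in theaters",
]
train_labels = ["Sports", "Sports", "Arts", "Arts"]

# 2. Represent the training documents as numerical feature vectors (tf-idf).
vectorizer = TfidfVectorizer(stop_words="english")
X_train = vectorizer.fit_transform(train_docs)

# 3. Learn a classification function from the training vectors.
classifier = MultinomialNB()
classifier.fit(X_train, train_labels)

# 4. Classify new documents by mapping them into the same vector space.
query_docs = ["the goalkeeper saved a penalty", "a touching movie about painters"]
print(classifier.predict(vectorizer.transform(query_docs)))

# 5. Evaluate accuracy -- shown here on the training set only to keep the
#    sketch short; a real evaluation requires a held-out labeled test set.
print(classification_report(train_labels, classifier.predict(X_train)))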




Fig.1 - Machine Classification Workflow:
           (a) - learning a classification function on the training document set.
           (b) - classification of new query documents.




3. Feature-vector representation:

Bag-of-Words approach – a document is regarded as a set of words, regardless of word order and grammar. The approach can be extended to include n-grams – contiguous sequences of n words.
To build a BOW-based vector representation we usually use the following flow:
  • Normalize – convert words into normalized forms by applying text-processing techniques such as down-casing, lemmatization, stemming, etc.
  • Remove stop words – ignore predefined common words, e.g., “the”, “a”, “to”, “with”, “that”, etc.
  • Build a vocabulary – the vocabulary is constructed as the unified set of distinct words from the entire training collection after normalization and stop-word removal. Each word in the vocabulary is assigned a running index from 1 to the number of words (the order of the words is not important).
  • Represent a document as a vector – use the vocabulary to construct the BOW vector representation of each document. The dimension of the BOW vector space equals the number of words in the vocabulary, and each index of the vector refers to the corresponding word index in the vocabulary. The entry at index i measures the occurrence of the i-th vocabulary word in the document. There are several common options to measure a word occurrence:
  • Binary: occurs (1) or does not occur (0)
  • Term frequency (tf): the number of times the word appears in the document
  • Inverse document frequency (idf): the inverse of the fraction of documents in the whole collection that contain the word
  • Tf-idf: the product of the tf and idf measures; words that are frequent within a document but appear in only a small number of documents achieve a high value (see the formulas below)
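Formally, for a term t, a document d and a collection D of N documents, one common formulation of these weights (several smoothed or log-scaled variants exist) is:

\mathrm{tf}(t,d) = \text{number of occurrences of } t \text{ in } d

\mathrm{idf}(t) = \log \frac{N}{\left|\{ d \in D : t \in d \}\right|}

\text{tf-idf}(t,d) = \mathrm{tf}(t,d) \cdot \mathrm{idf}(t)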


Consider, for example,  two simple documents:
  • Document 1: John likes to watch movies. Mary likes too.
  •  Document 2: John also likes to watch football games.
Based on these two text documents, a vocabulary is constructed as:
  •  Vocabulary = {1:"John", 2:"likes", 3:"to", 4:"watch", 5:"movies", 6:"also", 7:"football", 8:"games", 9:"Mary", 10:"too"},
which has 10 distinct words. Using the indexes of the dictionary, each document is represented by a 10-entry vector:
  • Document 1: x1 = [1, 2, 1, 1, 1, 0, 0, 0, 1, 1]
  • Document 2: x2 = [1, 1, 1, 1, 0, 1, 1, 1, 0, 0]
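The same construction can be reproduced in a few lines of plain Python; the vocabulary below is hard-coded in the order listed above (recall that the order itself is not important):

# Term-frequency (bag-of-words) vectors for the two example documents.
from collections import Counter

vocabulary = ["John", "likes", "to", "watch", "movies",
              "also", "football", "games", "Mary", "too"]

docs = ["John likes to watch movies. Mary likes too.",
        "John also likes to watch football games."]

def tf_vector(doc, vocab):
    # Strip periods, split on whitespace and count each vocabulary word.
    counts = Counter(doc.replace(".", "").split())
    return [counts[word] for word in vocab]

for i, doc in enumerate(docs, start=1):
    print("x%d = %s" % (i, tf_vector(doc, vocabulary)))
# x1 = [1, 2, 1, 1, 1, 0, 0, 0, 1, 1]
# x2 = [1, 1, 1, 1, 0, 1, 1, 1, 0, 0]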



4. Accuracy Evaluation:

The analysis of classification results is based on the following characteristics:
  • True Positives (TP) – documents correctly labeled as belonging to a class
  • False Negatives (FN) – items which were not labeled as belonging to a class but should have been (also often called missed alarms or type II errors)
  • False Positives (FP) – items incorrectly labeled as belonging to a class (also often called false alarms or type I errors)
These characteristic compare the results of the classifier under test with trusted external judgments. The terms positive and negative refer to the classifier's prediction (sometimes known as the observation), and the terms true and false refer to whether that prediction corresponds to the external judgment (sometimes known as the expectation).
Using the above definitions the following important measures are constructed:

  • Precision – the percentage of documents labeled as belonging to the class that indeed belong to it
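In terms of the counts defined above:

\text{Precision} = \frac{TP}{TP + FP}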


  • Recall – the percentage of documents belonging to the class that were correctly identified
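Expressed with the same counts:

\text{Recall} = \frac{TP}{TP + FN}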

A precision score of 1.0 for a class C means that every item labeled as belonging to class C does indeed belong to class C (but says nothing about the number of items from class C that were not labeled correctly), whereas a recall of 1.0 means that every item from class C was labeled as belonging to class C (but says nothing about how many other items were incorrectly also labeled as belonging to class C). Thus, precision is used when the probability that a positive result is correct is important.
Often, there is an inverse relationship between precision and recall, where it is possible to increase one at the cost of reducing the other. For example, a classification system for deciding whether or not, say, a fruit is an orange, can achieve high precision by only classifying fruits with the exact right shape and color as oranges, but at the cost of low recall due to the number of false negatives from oranges that did not quite match the specification.

  • F-measure – the weighted harmonic mean of precision and recall
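In terms of precision P and recall R, the general weighted form and the commonly used balanced F1 score are:

F_\beta = \frac{(1+\beta^2)\, P\, R}{\beta^2 P + R}, \qquad F_1 = \frac{2 P R}{P + R}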


Why use the harmonic mean?
The harmonic mean emphasizes the importance of small values, whereas the arithmetic mean is affected more by unusually large values. For example, with precision 1.0 and recall 0.1, the arithmetic mean is 0.55 while the harmonic mean is only about 0.18, which better reflects the poor recall.

  • Confusion matrix – a specific table layout that allows visualization of the performance of a classifier. Each column of the matrix represents the instances of a predicted class, while each row represents the instances of an actual class.
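As a small illustration of how such a matrix can be produced in practice, here is a sketch using scikit-learn; the class names and label lists below are invented purely for the example:

# Computing a confusion matrix with scikit-learn (toy labels).
from sklearn.metrics import confusion_matrix

y_true = ["Sports", "Sports", "Sports", "Arts", "Arts", "Animals"]
y_pred = ["Sports", "Arts",   "Sports", "Arts", "Arts", "Sports"]

labels = ["Sports", "Arts", "Animals"]
cm = confusion_matrix(y_true, y_pred, labels=labels)

# Rows correspond to actual classes, columns to predicted classes.
print(labels)
print(cm)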



Fig.3. Confusion matrix example. Of the 69 actual Sports documents, 51 were labeled correctly while 6 were wrongly labeled as Animals and 12 as Arts. Similarly, of the 736 Arts documents, 699 were labeled correctly while 10 were wrongly labeled as Sports, 25 as Animals and 2 as Adults.




5. References:

[1] http://en.wikipedia.org/wiki/Precision_and_recall
[2] http://en.wikipedia.org/wiki/Confusion_matrix

Eli Goz

Tuesday, April 17, 2012

SQL Temporary Tables vs. Table Variables

If you have ever encountered the need to perform some calculations in a stored procedure, this one’s for you :)
Today I am going to talk about the options we have for performing such calculations using side tables in the database.

Basically, we have three alternative ways to store such data:
Temporary Tables (#Table), Table Variables (@Table) and Permanent Tables (Table).

Usually, the most intuitive thing to do is to use a temporary table, without giving it much thought in terms of performance. I must admit that I do it myself from time to time. Perhaps the reason for that is that temporary tables have been around longer. But, are they really the best thing to use?
As you’ve probably already guessed, the answer here is not unequivocal. All 3 alternatives have their advantages and disadvantages. So let’s look into it…

Some Basic Facts:
Table variables are pretty similar to temporary tables in concept, but in spite of these similarities, there are also some notable differences between them.
Let’s have a look:

·         Temp tables are intended mostly for data manipulation, whereas table variables were designed primarily for use in table-valued functions.
·         Temp tables can be altered with DDL statements but table variables can't (for example: You can’t create a Nonclustered Index on a table variable).
·         Transactions on table variables are usually shorter than transactions involving temp tables and require less locking.
·         SQL Server creates statistics for temp tables, which helps the query optimizer choose a relevant execution plan; no statistics are created for table variables. As a result, stored procedures involving table variables cause fewer recompiles.

Great! So let’s always use table variables. Seems the least expensive, doesn’t it?
Mmm… Not exactly. I once heard that when something sounds too good to be true, it probably is. As a "rule of thumb" I can say that in simple queries we would rather use table variables, whereas in long/complex ones we prefer temp tables.
In the following example I will show you why:

Example:
I’ve performed 2 tests, which count the number of even values in each table and measure the execution time.
I did this for 2 types of calculations – Simple and Joined.
These tests are executed first on small amounts of data and then on large amounts of data, so we can see the difference more clearly.

1.  First we create our test data (1 million random values):
CREATE TABLE TestData (Id INT IDENTITY PRIMARY KEY, VALUE INT)

-- Filling TestData with 1 million random rows to work with
INSERT INTO TestData (VALUE)
SELECT TOP (1000000) ABS(CHECKSUM(NEWID()))/1000
FROM sys.all_columns CROSS JOIN sys.all_columns c1

--Creating also our permanent table on which we will be performing calculations
CREATE TABLE dbo.Test (VALUE INT)
                                          

2.  We will also create a table that will hold the final test results:
CREATE TABLE TestResults
(TestType VARCHAR(10), TableType VARCHAR(20), NumOfRows INT, ExecTime_ms INT)

3.  Now we create the “Simple” stored procedures.
All they do is calculate the number of even values within the tables:
CREATE PROCEDURE usp_SimplePermanent
   @NumOfRows INT
AS
BEGIN
   TRUNCATE TABLE dbo.Test
   INSERT INTO dbo.Test SELECT TOP (@NumOfRows) VALUE FROM TestData
   --Calculate amount of even numbers
   SELECT COUNT(1) FROM dbo.Test WHERE (VALUE % 2 = 0)
END
-------------------------------------------------------------------------
CREATE PROCEDURE usp_SimpleTemporary
   @NumOfRows INT
AS
BEGIN
     CREATE TABLE #Test (VALUE INT)
     INSERT INTO #Test SELECT TOP (@NumOfRows) VALUE FROM TestData
     --Calculate amount of even numbers
     SELECT COUNT(1) FROM #Test WHERE (VALUE % 2 = 0)
END
-------------------------------------------------------------------------
CREATE PROCEDURE usp_SimpleTableVar
   @NumOfRows INT
AS
BEGIN
     DECLARE @Test TABLE (VALUE INT)
     INSERT INTO @Test SELECT TOP (@NumOfRows) VALUE FROM TestData
     --Calculate amount of even numbers
     SELECT COUNT(1) FROM @Test WHERE (VALUE % 2 = 0)
END

4.  In the same manner, we create the “Joined” stored procedures (same as above, but with a more complex query – this one joins our table type with the full 1M-row table):
CREATE PROCEDURE usp_JoinedPermanent
   @NumOfRows INT
AS
BEGIN
     TRUNCATE TABLE dbo.Test
     INSERT INTO dbo.Test SELECT TOP (@NumOfRows) VALUE FROM TestData
     --Calculate amount of even numbers
     SELECT COUNT(1)
     FROM dbo.Test t
       INNER JOIN dbo.TestData td ON t.VALUE = td.Id --forcing a table scan
     WHERE (t.VALUE % 2 = 0)
END
-------------------------------------------------------------------------
CREATE PROCEDURE usp_JoinedTemporary   
   @NumOfRows INT
AS
BEGIN
     CREATE TABLE #Test (VALUE INT)
     INSERT INTO #Test SELECT TOP (@NumOfRows) VALUE FROM TestData
     --Calculate amount of even numbers
     SELECT COUNT(1)
     FROM #Test t
       INNER JOIN dbo.TestData td ON t.VALUE = td.Id --forcing a table scan
     WHERE (t.VALUE % 2 = 0)
END
-------------------------------------------------------------------------
CREATE PROCEDURE usp_JoinedTableVar
   @NumOfRows INT
AS
BEGIN
     DECLARE @Test TABLE (VALUE INT)
     INSERT INTO @Test SELECT TOP (@NumOfRows) VALUE FROM TestData
     --Calculate amount of even numbers
     SELECT COUNT(1)
     FROM @Test t
       INNER JOIN dbo.TestData td ON t.VALUE = td.Id --forcing a table scan
     WHERE (t.VALUE % 2 = 0)
END

5.  Now we will create the procedures that will activate our tests:
CREATE PROC usp_ExecuteSimpleTests
(
     @NumOfRows INT, -- 10 / 1000000
     @TableType VARCHAR(20) -- 'Temporary' / 'TableVar' / 'Permanent'
)     
AS
BEGIN
   DECLARE @sql NVARCHAR(2000)

   SET @sql =
   'DECLARE
          @i INT = 1,
          @n INT = 10, --run 10 iterations on every table type and use AVG results
          @Begin DATETIME2 ,
          @End DATETIME2 ,
          @TotalTime_ms  INT = 0

   WHILE (@i <= @n)
   BEGIN
          DBCC FREEPROCCACHE;
          SET @Begin = SYSDATETIME()

           EXEC usp_Simple' + @TableType + ' @NumOfRows = ' + CAST(@NumOfRows AS VARCHAR) + '

          SET @End = SYSDATETIME()
          SET @TotalTime_ms += DATEDIFF(ms, @Begin, @End)
          SET @i = @i + 1
   END

   INSERT INTO TestResults (TestType, TableType, NumOfRows, ExecTime_ms)
   SELECT ''Simple'', '''+@TableType+''', '+CAST(@NumOfRows AS VARCHAR)+
          ', @TotalTime_ms/@n'

   EXEC sp_executesql @sql
END
-------------------------------------------------------------------------
CREATE PROC usp_ExecuteJoinedTests
(
     @NumOfRows INT, -- 10 / 1000000
     @TableType VARCHAR(20) -- 'Temporary' / 'TableVar' / 'Permanent'
)     
AS
BEGIN
   DECLARE @sql NVARCHAR(2000)

   SET @sql =
   'DECLARE
          @i INT = 1,
          @n INT = 10, --run 10 iterations on every table type and use AVG results
          @Begin DATETIME2 ,
          @End DATETIME2 ,
          @TotalTime_ms  INT = 0

   WHILE (@i <= @n)
   BEGIN
          SET @Begin = SYSDATETIME()

           EXEC usp_Joined' + @TableType + ' @NumOfRows = ' + CAST(@NumOfRows AS VARCHAR) + '

          SET @End = SYSDATETIME()
          SET @TotalTime_ms += DATEDIFF(ms, @Begin, @End)
          SET @i = @i + 1
   END

   INSERT INTO TestResults (TestType, TableType, NumOfRows, ExecTime_ms)
   SELECT ''Joined'', '''+@TableType+''', '+CAST(@NumOfRows AS VARCHAR)+
          ', @TotalTime_ms/@n'

   EXEC sp_executesql @sql
END



6. Let’s execute the tests…
NOTE: For the first test we don’t want to use the cache (it would affect our results); we want a “clean” execution plan for every run, so we clear the cache before each run.
The second test, obviously, will perform differently when relying on the cache (it will use the pre-calculated plans and statistics), so we do not clear the cache there:
TRUNCATE TABLE TestResults
DBCC FREEPROCCACHE;

EXEC usp_ExecuteSimpleTests @NumOfRows = 10, @TableType = 'Permanent'
DBCC FREEPROCCACHE;
EXEC usp_ExecuteSimpleTests @NumOfRows = 10, @TableType = 'Temporary'
DBCC FREEPROCCACHE;
EXEC usp_ExecuteSimpleTests @NumOfRows = 10, @TableType = 'TableVar'
DBCC FREEPROCCACHE;
EXEC usp_ExecuteSimpleTests @NumOfRows = 1000000, @TableType = 'Permanent'
DBCC FREEPROCCACHE;
EXEC usp_ExecuteSimpleTests @NumOfRows = 1000000, @TableType = 'Temporary'
DBCC FREEPROCCACHE;
EXEC usp_ExecuteSimpleTests @NumOfRows = 1000000, @TableType = 'TableVar'
GO

EXEC usp_ExecuteJoinedTests @NumOfRows = 10, @TableType = 'Permanent'
EXEC usp_ExecuteJoinedTests @NumOfRows = 10, @TableType = 'Temporary'
EXEC usp_ExecuteJoinedTests @NumOfRows = 10, @TableType = 'TableVar'
EXEC usp_ExecuteJoinedTests @NumOfRows = 1000000, @TableType = 'Permanent'
EXEC usp_ExecuteJoinedTests @NumOfRows = 1000000, @TableType = 'Temporary'
EXEC usp_ExecuteJoinedTests @NumOfRows = 1000000, @TableType = 'TableVar'
GO

SELECT * FROM TestResults


  Results:
Simple query:

TableType    NumOfRows    ExecTime (ms)
Permanent    10           215
Temporary    10           21
TableVar     10           18
Permanent    1,000,000    6,447
Temporary    1,000,000    2,188
TableVar     1,000,000    573

It is pretty clear that the table variable has much better performance than the others.

Joined Query:

TableType    NumOfRows    ExecTime (ms)
Permanent    10           11
Temporary    10           0
TableVar     10           0
Permanent    1,000,000    6,182
Temporary    1,000,000    3,024
TableVar     1,000,000    5,644

In this case, as you can see, the temporary table is the winner.

So what is going on?
Simple Query:
Since the execution plan in this case is always the same, statistics have no meaning here and updating them creates an overhead. This is where the table variable has an advantage over the other tables.
Joined Query:
Now statistics have a huge influence on the execution plan. Since we joined our table types with the large table (1M rows), the optimizer needs statistics to figure out which is the best plan to use.

Conclusion:
According to the tests above, it is pretty clear that the permanent table is not a good choice for such calculations in any of these cases – it carries too much overhead.
As stated before, the table variable seems to be a better choice for simple calculations when there is only one way to execute the procedure, and the temp table would be the best choice when the procedure can be executed in more than one way.

To conclude, there is no universal rule for when and where to use temp tables or table variables. Try them both and experiment on both sides of the spectrum – small and large amounts of records.
From my personal experience, I would rather go for temp tables. Usually we already have the plan in cache, so we don’t have to worry about recompilation. Moreover, the advantage the table variable has over the temp table in simple calculations is minor compared to the advantage the temporary table has for complex ones.

Good Luck! :)