Thursday, December 27, 2012

JavaScript - MVVM and TDD.

Back in 2000, when I started my programming career, JavaScript was a fairly negligible language, used only to support a rather poor client-side UI.
As a result, most programmers didn't pay as much attention to the quality of their JS code as they did to their server-side ASP/JSP code.

Over the years, the nature of web applications changed constantly and the direction was obvious: richer client-side applications, which meant much bigger JavaScript codebases and a lot more logic on the client side.

Unfortunately, most programmers haven't absorbed this change and still treat JavaScript as they did for the last decade - no OOP principles, no continuous integration, no code conventions, no TDD, no MV{X} pattern and no design. In that sense, JavaScript is still a stepson.

Our small contribution to changing this situation is this wonderful series of posts about JavaScript best practices.
In this post I would like to add two more very important aspects that will greatly push your JavaScript forward: MVVM and TDD.

MVVM
Rich client applications encapsulate a huge amount of logic - UI logic. This logic must be tested, especially in a dynamic language such as JavaScript, which has no compilation step. Luckily, many tools for JavaScript unit testing have emerged lately and they are all capable of testing web UI. But if your client-side code is strongly coupled with UI elements, you're going to face many difficulties testing your code with those tools. The easiest way to work with them is to supply them pure JavaScript that involves only logic and no UI elements.

Suppose we have the following requirement: in a travel agency, to search for a vacation, a user fills in two date fields - From Date and To Date.
When the user presses the search button, the application checks whether From Date is later than To Date; if so, it shows a validation message, otherwise it performs an ajax query.

The code will probably look something like this:
search = function(){
   // read the raw values from the DOM and parse them as dates
   var fromDate = new Date($('#txtFromDate').val());
   var toDate = new Date($('#txtToDate').val());
   if(fromDate > toDate){
       $('#spnDatesValidation').css('display', 'inline');
   } else {
       $.ajax(...)
   }
}

This code can only run within a container that holds all the DOM elements - most probably a web browser. You will also have to supply the relevant HTML to that container.
As stated, you're going to face many difficulties automating a test for this code.

Here comes MVVM to the rescue. MVVM stands for Model-View-ViewModel and it's an MV{X} pattern. This family of patterns also includes MVC and MVP and they all give you the ability to decouple the UI logic from the View. In the MVVM pattern, the UI logic resides in the ViewModel.
To achieve this decoupling, MVVM relies heavily on a binding mechanism. A binding mechanism usually binds a UI control (e.g. a TextBox or ComboBox) to a property of the ViewModel.
In our example it binds txtFromDate (which is a TextBox) to ReservationViewModel.fromDate. So whenever the user changes txtFromDate, the binding mechanism automatically updates ReservationViewModel.fromDate behind the scenes. Of course, the same goes the other way around.
This means that whenever the user presses the search button, ReservationViewModel.fromDate and ReservationViewModel.toDate are already updated.

And so, our new code can now look something like this:
ReservationViewModel = function(){
   this.search = function(){   
       //at this stage, fromDate and toDate are already updated
       if(this.fromDate > this.toDate){
           this.showDatesValidation = true;   
       } else {
           $.ajax(...)
       }  
   }
}

In the code above we also assume that the visibility of spnDatesValidation (which is probably a SPAN that shows a validation error) is bound to ReservationViewModel.showDatesValidation.
As you can see, this code is no longer coupled to any DOM element - it's plain JavaScript that can run on any JavaScript engine without supplying any HTML to that engine.

A great tool that provides a binding mechanism for JavaScript is Knockout.js. It's probably the most popular MVVM tool these days. We have been working with it for several months now and it is robust, simple and mature, with great documentation.

TDD
In this post I assume that the readers are already familiar with TDD and its concepts, such as stubs, mocks and dependency injection.

Stubs & Mocks
Looking again at the last code sample above, we can see that the search method is not quite testable yet - it performs an ajax call and probably has some logic for handling a successful or a failed response.
Suppose we have a test called "search with valid dates should make an ajax call". This test supplies valid dates to the ViewModel and expects to see that an ajax call has been made. The problem is that with the current code, there is no clean way to assert that an ajax call has been made.

Dependency Injection might help us out here. What we can do is inject an object that encapsulates all ajax calls related to the ReservationViewModel. Thus, we will be able to inject a stub object that records any call to its own methods.

So let's change the method a little bit and use Dependency Injection:
ReservationViewModel = function(reservationAjax){   
   
   this.reservationAjax = reservationAjax;
   
   this.search = function(){   
       if(this.fromDate > this.toDate){
           this.showDatesValidation = true;   
       } else {
           this.reservationAjax.search(...);
       }  
   }
}

And the test will look something like this (here I use Buster.js for testing):
buster.testCase("ReservationViewModel tests", {
    "search with valid dates should make an ajax call": function () {
        var ajaxStub = new Object();
        var ajaxCalled = false;
        ajaxStub.search = function(){
            ajaxCalled = true;
        }
        var vm = new ReservationViewModel(ajaxStub);
        vm.fromDate = new Date(2000, 1, 1);
        vm.toDate = new Date(2001, 1, 1);
    
        vm.search();
      
        assert.equals(ajaxCalled, true);
    }
});

As you can see, I dynamically override the search method of the ajaxStub object and adjust it to my needs - raising a flag whenever the method is called.

Conclusions 

  • TDD is a very good practice and it's almost a must for JavaScript, for two reasons: 
    • JS is a dynamic language with no compilation step. TDD should compensate for this.
    • UI logic is a lot of logic, and quite complicated logic at that. It must be tested. 
  • You can test your JS code even without any MV{X} pattern, but this will probably cause you a lot of headaches. Use MVVM - it will save you a lot of code and it will improve your design and testability.
  • Use Dependency Injection for things like ajax and message boxes - it will give you greater flexibility.


Wednesday, October 10, 2012

High Volume Redis


If you’ve been reading our previous posts, you already know that we are using Redis as a messaging framework. In this post I’ll show you how we use Redis for a different solution.

The problem

As you might know, we are dealing with images. The number of images varies from product to product, but usually we are dealing with millions to hundreds of millions of images. One of the common questions that keeps rising is how we can store and efficiently access such a large amount of data.
There are different solutions to that problem and we use different techniques, from our own distributed, proprietary data structure to… Redis. Redis provides very simple in-place updates as well as very efficient queries when dealing with complex data structures as values, such as lists and sets.

Redis

Redis caught our eye as a super-fast NoSQL DB.
Though it was promising at the beginning, when we used Redis lists to push our data and then query (read) it by range, the results were quite disappointing.
Compared to our own proprietary distributed data structure, Redis lists consumed twice as much memory (even after we ensured we were avoiding fragmentation).
The read results were very poor as well – ten times slower than our implementation (data structure).

We were expecting some overheads, but not like that.

Memory Optimizations

After some digging, we managed to apply some memory optimizations, following two rules of thumb:
- Use hashes when possible – Redis hashes are much more memory-optimized than lists, sets or other Redis-supported structures.
- Reduce the number of keys and don’t exceed hash-max-zipmap-entries (thanks to the guys at Instagram).
We also applied compression techniques on some of our data before storing it in Redis (we obviously “pay” by decompressing after the read).
Those optimizations yielded great results, very similar to our proprietary data structure.
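
To make the bucketing idea concrete, here is a minimal, hypothetical sketch in Python with redis-py (the key naming scheme and bucket size are assumptions, not our production code):

import redis

r = redis.StrictRedis(host='localhost', port=6379)

BUCKET_SIZE = 1000  # chosen to stay below hash-max-zipmap-entries (value is an assumption)

def save_image_meta(image_id, payload):
    # store many small values as fields of a few hashes instead of one key per value
    bucket = image_id // BUCKET_SIZE
    r.hset('images:%d' % bucket, image_id, payload)

save_image_meta(1234567, 'compressed-metadata-blob')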

Query Optimizations

The memory optimizations we’ve made also resulted in a significant query performance improvement.
Instead of using LRANGE, which returns a multi-bulk reply with O(S+N) time complexity, we use HGET to read a hash field with O(1) time complexity.
On top of that, one of the coolest optimizations we can make is to use Redis’s UNIX domain sockets instead of TCP/IP, to reduce network overhead.
Oh, and of course, we are using Redis pipelines whenever possible.
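
The read side, again as an illustrative sketch only (the Unix socket path depends on your redis.conf):

import redis

# connect over a Unix domain socket instead of TCP/IP (path is an assumption)
r = redis.StrictRedis(unix_socket_path='/tmp/redis.sock')

BUCKET_SIZE = 1000

def load_image_meta(image_ids):
    # batch many O(1) HGET lookups into a single round trip using a pipeline
    pipe = r.pipeline()
    for image_id in image_ids:
        pipe.hget('images:%d' % (image_id // BUCKET_SIZE), image_id)
    return pipe.execute()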

Results

After applying the optimizations we got a very efficient NoSQL store:
Time-wise: 1.1 times the latency of our own implementation.
Memory-wise: 0.95 times the memory consumption of our own implementation.

Side Effect Bonus

Using Redis, we can now separate our algorithms from the data representation.
The data itself can grow faster and support the various operations provided by the NoSQL store.
And it is much easier to write and execute sets of automated integration tests. Yay!

Tuesday, August 14, 2012

WADL


This post brings some information about WADL: what it is, how to read it, and why you should create this document when publishing a company's APIs.
So what is WADL…

Web Application Description Language – an XML-based file format that provides a description of HTTP-based web applications, the best known being REST web services. WADL is a way to make your methods, headers and parameters explicit in one readable format.

Here is the explanation of WADL from Wikipedia:
“The purpose of WADL is to allow services on the Internet (or any other IP network) to be described in a machine processable way, to make it easier to create Web 2.0 style applications and create a dynamic way of creating and configuring services. Prior to this, it was necessary to go to an existing web service, study it and write the application manually. WADL can be thought of as the REST equivalent of Web Services Description Language version 1.1. WADL models the resources provided by a service, and the relationships between them.”

For this post we chose the D&B REST API.
Here is the full WADL document; you can download and try it: dnbDnBAPI-10 



Let's take a look at the example:
<application xmlns="http://research.sun.com/wadl/2006/10">
  .....
    <resources base="http://dnbdirect-sandbox.dnb.com/DnBAPI-10/rest">
        <resource path="/">
            ..........              
            <resource path="search/company">
                ........               
                <resource path="/">
                    <resource path="{keyword}">
                        <param xmlns:xs="http://www.w3.org/2001/XMLSchema" type="xs:string" style="template" name="keyword"/>
                        <method name="POST" id="findCompanyJSONRequest"/>                         
                        <method name="GET" id="findCompany">
                            <request>
                                <param xmlns:xs="http://www.w3.org/2001/XMLSchema" type="xs:long" style="query" name="duns_from"/>
                                <param xmlns:xs="http://www.w3.org/2001/XMLSchema" type="xs:long" style="query" name="duns_to"/>
                                .......
                            </request>
                            <response>
                                <representation mediaType="application/json;charset=UTF-8"/>
                            </response>
                        </method>
                    </resource>
                </resource>                 
            .........       
        </resource>
    </resources>
</application>
So, what we are getting here:

This service is described using a set of resource elements. Each of these contains param elements describing the inputs, and method elements describing the request and response of a resource. The request element specifies how to represent the input and what types are required.
Putting the resource paths together with the base address gives the following URL:
http://dnbdirect-sandbox.dnb.com/DnBAPI-10/rest/search/company/{keyword}
It is an API URL for searching companies by keyword, where {keyword} is the search query (a string).
 
This REST service can be reached with both the GET and POST methods; look at the name attribute of each method element. A POST request reaches the “findCompanyJSONRequest” function, while a GET request reaches “findCompany”. The document also shows the request parameters each of them accepts and that both return the same response structure.
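
To make this concrete, here is a hypothetical call to the GET findCompany resource using Python's requests library; the keyword and parameter values are made up, and the sandbox may require credentials that are not shown here:

import requests

base = 'http://dnbdirect-sandbox.dnb.com/DnBAPI-10/rest'

# GET /search/company/{keyword} with the optional duns_from/duns_to query parameters
response = requests.get(base + '/search/company/acme',
                        params={'duns_from': 1, 'duns_to': 999999999})
print(response.status_code)
print(response.json())  # the WADL declares an application/json representation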

The example environment has more than one version, as we know, based on two different WADL documents. You can get the second WADL document from here (dnbDnBAPI-11). Compare them and you'll find the differences.

In conclusion, WADL will help you:
· to version your REST API
· to make your REST API clear and readable
· to reduce your customer service expenses.


Good Luck!!

Thursday, July 12, 2012

Why you should consider the asynchronous programming model (APM) when writing a web server in Python.



These days there is increasing interest in asynchronous technologies, particularly as a result of the huge success of Nginx and NodeJS. But there are also many people who still don't get what this is all about and what the differences are between synchronous, asynchronous and threaded programming. In this article I will try to clarify these things…

TCP server internals

Let’s look at how a server application receives a request. When a client connects, the server accepts a new connection and starts listening for incoming data. Data usually arrives in chunks, and the server tries to find the end of the request by looking for delimiter characters or by a specified length, which might be indicated in the first few bytes.

A typical CPU can handle data transfer rates that are orders of magnitude faster than a network link is capable of sustaining. Thus, a server that is doing lots of I/O will spend much of its time blocked while the network catches up. In order not to block the other clients while waiting, the server has to handle request reading concurrently.

A popular way to do so is to use a thread per client. But there are some problems with threads. Well, Python has no real multithreading support at all. Instead it has the GIL (Global Interpreter Lock). The GIL is necessary mainly because Python's memory management is not thread-safe. It prevents multiple native threads from executing Python bytecode at once.

On the one hand this makes threaded programming in Python fairly simple: to add an item to a list or set a dictionary key, no locks are required. But on the other hand it leads to relatively big chunks of code that are executed sequentially, blocking each other for an undetermined amount of time.
An in-depth explanation of this problem is given in this video by David Beazley.

Setting Python aside, there is a much wider problem with threads. I was actually surprised by how many cons there are to using them. Apparently, the cons vary from threads being a bad design (as described here) to more pragmatic issues, such as consuming a fair amount of memory, since each thread needs its own stack. The stack size may vary between OSs. 

On .NET it's usually 1 MB on a 32-bit OS and 4 MB on a 64-bit OS. On Linux it might be up to 10 MB per thread. Also, the context switches between many threads will degrade performance significantly. Commonly, it's not recommended to have more than 100 threads. Not surprisingly, it is also always difficult to write code that is thread-safe. You have to care about things such as race conditions, deadlocks, livelocks and starvation!

Fortunately there is a better way to handle concurrency, and Python excels in a very specific area: asynchronous (a.k.a. non-blocking) network servers.
Back in the days before multithreading was invented, asynchronous designs were the only available mechanism for managing more than one connection in a single process.
I'd like to illustrate this principle with an example published in a Linux Journal article by Ken Kinder:
Have you ever been standing in the express lane of a grocery store, buying a single bottle of water, only to have the customer in front of you challenge the price of an item, causing you and everyone behind you to wait five minutes for the price to be verified?
Plenty of explanations of asynchronous programming exist, but I think the best way to understand its benefits is to wait in line with an idle cashier. If the cashier were asynchronous, he or she would put the person in front of you on hold and conduct your transaction while waiting for the price check. Unfortunately, cashiers are seldom asynchronous. In the world of software, however, event-driven servers make the best use of available resources, because there are no threads holding up valuable memory waiting for traffic on a socket. Following the grocery store metaphor, a threaded server solves the problem of long lines by adding more cashiers, while an asynchronous model lets each cashier help more than one customer at a time.

The APM basic flow is visualized below:



The module waits for an event. Once one occurs, it reacts (hence the name reactor) by calling the appropriate callback.
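
To make the reactor idea concrete, here is a minimal, illustrative sketch of a single-process, non-blocking echo server built directly on select() (this is not Medusa or Twisted code, just the underlying principle; error handling is omitted):

import select
import socket

server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
server.bind(('0.0.0.0', 8000))
server.listen(50)
server.setblocking(0)

sockets = [server]
while True:
    # the "reactor": block until at least one socket has an event, then dispatch
    readable, _, _ = select.select(sockets, [], [])
    for s in readable:
        if s is server:
            conn, addr = server.accept()   # new client connected
            conn.setblocking(0)
            sockets.append(conn)
        else:
            data = s.recv(4096)
            if data:
                s.sendall(data)            # echo the chunk back (the "callback")
            else:
                sockets.remove(s)          # client closed the connection
                s.close()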


Python has had a high-performance asynchronous server framework since as early as 1995, called Medusa.
It became the archetype of the now well-known Zope and Twisted. It was initially built to address the C10K problem, which is a simple one: how to service 10,000 simultaneous network connections. I refer you to the C10K website for enormously detailed technical information on this complex problem.
It is sufficient to say that asynchronous architectures, with their much smaller memory usage and lack of need for locking, synchronization and context switching, are generally considered to be far more performant than threaded architectures.


So if you ever need to handle high traffic, or just want to have fun trying a different way of thinking about programming, you should consider writing an asynchronous application.





Thursday, July 5, 2012

API Best Practices - Introducing PicScout’s API


In the past few months, since we announced our new technological blog, we’ve published many posts presenting our technological point of view. Posts like "Redis as a Messaging Framework", "JavaScript Best Practices" and "Machine Learning Approach to Document Classification" are just a few samples of the posts we’ve published since then.

One thing we came to realize is that you, our devoted readers, can’t really check whether what we post about is actually what we do. I mean, what Aran wrote in his post about disposing is really nice, but do we, the PicScout team, really follow those guidelines? Well, the answer is clearly yes! But you can’t really know that, can you?

As a result, we’ve decided that it is mandatory for us to publish a post about something with which our readers can actually put us to the test. In the following few minutes we would like to share with you our point of view about API best practices and show you how they are applied in our Web API. Hopefully you’ll accept the challenge to put us to the test.

Let's get started!

There are many ways you can implement your API. Yet not many of them can guarantee a lightweight, flexible and user-friendly API. By following a few simple guidelines, as shown in Uri Lavi’s “API Best Practices” slides here, we’ve implemented our API in a way that ensures the above qualities.

Let’s discuss a few of those guidelines.

Is this line secure?

The web is full of data going from one point to another. In PicScout’s case, the data that is transferred to and from our API shouldn’t be visible to anybody besides our trusted partners.

In order to support the secure transfer of our data, we’ve decided that our API should support only a secure communication scheme - in other words, HTTPS. As a result, each request sent to our API should comply with the following form:

https://api.picscout.com/


What version is this?

Like many APIs, PicScout’s API goes through many changes and modifications during its life cycle. The framework and endpoints are examples of things that can change during that time. Such changes and modifications should be applied without causing third-party tools that use our API to break. As a result, versioning is one of the crucial features we’ve implemented in our API.

This feature is easily achieved by specifying the API's version as part of the request URI. More specifically, the API version is the first segment of the URI after the base address. For example, the URI for sending requests to our API will look as follows:

https://api.picscout.com/v1/

This feature will allow us in the future to make major changes, if needed, to our API without the fear that any third-party tool that uses our API will break.

“English ******, do you speak it?”

At the end of the day, the data is handled by two machines, the client and the server. Yet one thing any API developer should keep in mind - APIs are for humans!

Keeping that in mind, we’ve decided to keep our API's URI formats as readable as possible. In order to achieve this feature we use basic English grammatical terms such as nouns, verbs and relationships. No more programmers’ favorite method names in the URI.

For example, let’s say you want to get the details of an image that has 12345 as its id. One way we could achieve that is by sending a request as follows:

https://api.picscout.com/v1/getImageDetails?id=12345

This seems to us a bit… well… ugly. As you might have guessed, there is a clear relationship between an image and its id. Furthermore, there is also one between our API and all the images in our storage. Considering this, it is only logical that the request should have the following format:

https://api.picscout.com/v1/images/12345

Translated into English: you want access to all of our images' details, but only to the one whose id is 12345. Simplicity in action.

“Hey! You promised verbs! You cheated!” – “Well, allow me to retort”. Another operation we expose through our API is searching for similar images in our storage based on an image URL or its binary data. So in order to use that ability, all you have to do is set the URI format in the following manner:

https://api.picscout.com/v1/search?url=<imageURL>

Translated into English: you want to search for images that are similar to the one you provided. Yet again, simplicity in action.

Don’t reinvent the wheel

You might have noticed that I “forgot” to show you an example of how you can search for images based on an image’s binary data. Well, I had to “forget” about it in order to illustrate the following concept. So please, forgive me.

Not all images are stored online - think of the ones you have stored on your PC - so no URL can be provided to access them. Exactly for these types of cases, we at PicScout decided that it is mandatory for us to support searching for similar images based on an image’s binary data.

But wait, we already used the “search” verb to search for images based on URL! Well lucky for us we can always add another endpoint like: 

https://api.picscout.com/v1/search_binary?data=<imageBinaryData>

And there you have it, minor additions create major abilities right? NO! Why on earth would you want to add another endpoint to an operation you already support? And pass binary data as part of the URI?

Luckily for us, there is more than one HTTP method we can use to access endpoints. In fact, considering how lucky we are, why not just use those methods and thereby make our API much more readable and user-friendly?

So to make a long story short, since the operation is the same operation (“search”), we decided that it would be better if what distinguishes the two requests is the HTTP method. For searching by image URL we use the GET method. For binary-data-based search we use the POST method, where the binary data itself is passed in the body of the request.

As a result, all you have to do in order to use our binary data based search is to attach the image file to the POST request’s body and send it over to:

https://api.picscout.com/v1/search

Why do I need to know all of this?

As you might know, the amount of data that is transferred over the web is enormous. In addition, this amount only keeps getting larger and larger. To put it simply, more cargo means more weight, and more weight means more time spent moving it, unless new and improved trucks are constructed. Most of us don’t have control over the truck construction committee, but we do have control (or at least partial control) over the cargo.

Using that knowledge, one should always try to find more efficient ways to transfer one's cargo - or data, in our case. So without further ado, the PicScout team is proud to present one of our API’s major features, and the coolest one in my opinion – the Field Selector!!!

On the client side, the field selector allows you, our trusted partner, to specify exactly which information you want to retrieve. On the server side, which is our eco-friendly API, it allows us to send back only a relatively small amount of data, which transfers much faster.

“How?” you say? Well that’s really simple; just name the fields you want to include in the response and you’re good to go. For example, let’s say you only want to know where you can buy the image with 12345 as its id. All you have to do is send the following request:

https://api.picscout.com/v1/images/12345?fields=purchaseUrl
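
As an illustration of how short such a client can be, here is a hypothetical Python sketch against the endpoints shown above (authentication details are omitted, since key handling is not described in this post):

import requests

BASE = 'https://api.picscout.com/v1'

def get_image_details(image_id, fields=None):
    # GET /images/{id}, optionally trimming the response with the field selector
    params = {'fields': fields} if fields else {}
    return requests.get('%s/images/%s' % (BASE, image_id), params=params).json()

def search_by_url(image_url):
    # GET /search?url=<imageURL>
    return requests.get('%s/search' % BASE, params={'url': image_url}).json()

def search_by_binary(path):
    # POST /search with the image bytes in the request body
    with open(path, 'rb') as f:
        return requests.post('%s/search' % BASE, data=f.read()).json()

print(get_image_details('12345', fields='purchaseUrl'))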


In conclusion, following the few simple guidelines we discussed in this post helped us at PicScout reach our goal of creating a simple, readable and flexible API. To support this claim, we implemented 3 clients in 3 different languages: Node.JS, Python and C#.

As exciting as it is to implement the same code in 3 different languages, the interesting part was a tiny rule we agreed on: each implementation shouldn’t take more than 10 minutes. To be fair, each took us around 5 minutes.

But that’s not such a big deal, considering we know our API from top to bottom. So this is where you, if you’re up for it, step in. We challenge you to implement a client for our API, in any language you like. The same rule applies here: 10 minutes and that’s it! No need to send us your code or anything like that, just share your experience. Those of you who are not interested in the challenge are more than welcome to try out our API’s abilities.

One last thing before you go playing around with our API: like many other Web APIs, our API only accepts requests sent from known users. In order to identify yourself as one, contact us to request a key and we’ll issue one for you along with our API documentation. 

We hope you enjoyed reading this post, and we look forward to adding more exciting new features to our API based on your feedback.

Sunday, June 24, 2012


Using Read/Write Through In Your Data Access Layer

When thinking about accessing data, we usually think of one layer. It can be a DB, a cache or even a file on our file system. However, when we want to scale our system, we need to think about using multiple layers. Read-through and write-through logic can help us balance our load and even sync data as we go. In the next few paragraphs I will try to explain the way we chose to deal with this kind of data access.

First, let's explain the logic of read-through and write-through.
Write-through means that the application writes data to one layer and that layer is responsible for writing the data to another layer if necessary.
Read-through means that the application reads from one layer of data (usually cache) and that layer is responsible for getting the data from another layer if necessary. It is also responsible for updating the layers that had missing data.
Note: There's also another way of handling data access that should be mentioned in this context. It's called "write-behind". It means the data is always saved to one layer, and once in a while (not on every write) this layer updates the other layers with the latest changes.

Getting Started

To get started, let's assume we implement the following interface when writing our data access code:

public interface IDataAccess<Key, Value>
{
    void Save(Key key, Value value);

    Value Load(Key key);

    void Delete(Key key);
}

It is very important to use this kind of interface when accessing data; it will help us later when we want to enhance it for multiple layers.
Now let's implement this interface for each one of our layers. In this example I will use 2 layers: DB and Cache.

public class DBDataAccess<Key, Value> : IDataAccess<Key, Value>
{
    //Implement Save, Load and Delete for database
}

public class CacheDataAccess<Key, Value> : IDataAccess<Key, Value>
{
    //Implement Save, Load and Delete for Cache
}

Note: In the examples above and throughout the post, I used generics. In your own implementation you will probably have to specify concrete types.
    

Combining the Layers

Now that we have our implementation for each layer, let's start putting it all together.
For this, let's create another implementation which will use the 2 layers above to provide read/write-through behavior. This implementation will be used to save, load and delete, and will be responsible for executing the read/write-through logic.
To do that, we will store our layers in a list or an array in the order we want the data to be read. In our example, if we want to read data first from the cache and then from the DB, the order will be:
1) CacheDataAccess
2) DBDataAccess

For example:
public class DataAccess<Key, Value> : IDataAccess<Key, Value>
{
    //the list of our data access implementations
    private List<IDataAccess<Key, Value>> dataAccessProviders;
    public DataAccess ()
    {
        //create the list with the desired order
        dataAccessProviders = new List<IDataAccess<Key, Value>>()
        {
            new CacheDataAccess<Key, Value>(),
            new DBDataAccess<Key, Value>()
        };
    }
    .
    .
}
Note: In the example I created the providers list myself. In your implementation you should create it through a better mechanism, such as injection or a factory.
Note: This example uses the "Decorator" pattern, which dynamically adds functionality to an object.

Add Some Logic

Now we just have to write the read- and write-through logic. It will be pretty easy since we used the same interface for each one of our layers. The Save method is very simple - it just calls Save on each provider in the dataAccessProviders list:

public void Save(Key key, Value value)
{
    dataAccessProviders.ForEach(provider => provider.Save(key, value));
}

The Load method is a bit more complicated. It needs to return the data from the first layer that has a result, but also update the layers that we went through which had no result. This is done in the following code:
public Value Load(Key key)
{
    Value value = LoadFromProvider(key);
    List<IDataAccess<Key, Value>> staleProviders = FindStaleProviders(key);
    UpdateStaleProviders(key, value, staleProviders);
    return value;
}

private Value LoadFromProvider(Key key)
{
    foreach (IDataAccess<Key, Value> provider in dataAccessProviders)
    {
        Value value = provider.Load(key);
        if (Exists(value))
        {
            return value;
        }
    }
    return default(Value);
}

private List<IDataAccess<Key, Value>> FindStaleProviders(Key key)
{
    List<IDataAccess<Key, Value>> staleProviders = new List<IDataAccess<Key, Value>>();
    foreach (IDataAccess<Key, Value> provider in dataAccessProviders)
    {
        Value value = provider.Load(key);
        if (Exists(value))
        {
            break;
        }
        staleProviders.Add(provider);
    }
    return staleProviders;
}

private void UpdateStaleProviders(Key key, Value value,
List<IDataAccess<Key, Value>> staleProviders)
{
    if (Exists(value))
    {
        staleProviders.ForEach(provider => provider.Save(key, value));
    }
}

private bool Exists(Value value)
{
    return !Equals(value, default(Value));
}
Note: Pay attention - the code is optimized for readability, not for performance.  

As you can see, we first try to load from each one of our layers (the order will be the order you specified earlier). Then we create a sub-list of all the layers that are missing the value. The last thing we do before returning the value is update those layers that are missing it.
The Delete method has the same logic as the Save method. I didn't implement it here; you can try it yourself…

That's it, we're done. We finally have a simple implementation for our read/write-through problem. Coding against an interface has made it possible for us to work with any kind of provider, regardless of its actual implementation. We can easily replace the behavior of the code by using a different module which implements the same interface. 

Thursday, May 10, 2012

Machine Learning Approach to Document Classification



Let's say, for the sake of argument, that we would like to classify different web sites into some predefined categories or classes. There are different approaches one can use in order to do this.
Below I will give a toy example that you can enhance further for your needs. 


1. Solution Approaches:

  • Rule based classification - a rule captures a certain combination of keywords that indicates a class. Hand-coded rules have good scaling properties, but creating and maintaining them over time is labor intensive. A technically skilled person (e.g., a domain expert who is good at writing regular expressions) can create rule sets that will rival or exceed the accuracy of automatically generated classifiers, taking advantage of the fact that classification criteria can be easily controlled when the number of rules is small. However, the hand-crafted rules approach suffers from several disadvantages:
- Sometimes rules conflict with each other.
- Maintenance of rules becomes more difficult as the number of rules increases.
- The rules have to be manually reconstructed when the target domain changes.
- Low coverage because of the wide variety of expressions.

  • Machine learning classification - In machine learning, the set of rules or, more generally, the decision criterion of the text classifier is learned automatically from training data. In machine text classification, we require a number of good example documents (or training documents) for each class. The need for manual classification is not eliminated, because the training documents come from a person who has labeled them - where labeling refers to the process of annotating each document with its class. But labeling is arguably an easier task than writing rules. Almost anybody can look at a document and decide whether or not it is related to some topic. Moreover, once the training data is ready, the learning process is fully automatic and domain independent.


2. Machine Learning Workflow:
   1. Prepare a set of training data – a collection of documents labeled with well-defined topics.
   2. Represent the training data as numerical feature vectors.
   3. Use the training feature vectors to learn a classification function.
   4. Classify new documents by transforming them into feature vectors and applying the classification function.
   5. Evaluate the accuracy. (A minimal code sketch of this workflow is shown below.)
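
As a rough illustration of steps 1-5 (not part of the original post), a scikit-learn based sketch might look like this; the documents, labels and model choice are made up:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report

# 1. labeled training data (toy example)
train_docs = ["football game score", "movie actor award",
              "tennis match win", "film director cast"]
train_labels = ["Sports", "Arts", "Sports", "Arts"]

# 2-3. represent the documents as tf-idf feature vectors and learn a classifier
model = Pipeline([("vectors", TfidfVectorizer(stop_words="english")),
                  ("classifier", MultinomialNB())])
model.fit(train_docs, train_labels)

# 4. classify new documents
print(model.predict(["the match ended with a great score"]))

# 5. evaluate accuracy on a held-out labeled set
test_docs = ["the actor won an award", "the team won the game"]
test_labels = ["Arts", "Sports"]
print(classification_report(test_labels, model.predict(test_docs)))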




Fig.1 - Machine Classification Workflow:
           (a) learning a classification function on a training document set;
           (b) classification of new query documents.




3. Feature-vector representation:

Bag-of-Words approach – a document is regarded as a set of words, regardless of word order and grammar. It can be extended to include n-grams – contiguous sequences of n words.
To build a BOW-based vector representation we usually use the following flow:
  • Normalize – convert words into normalized forms by applying text processing techniques such as down-casing, lemmatization, stemming, etc.
  • Remove stop words – ignore predefined common words, e.g., “the”, “a”, “to”, “with”, “that”...
  • Build a vocabulary – the vocabulary is constructed as a unified set of distinct words from the entire training collection after normalizing and removing stop words. Each word in the vocabulary is assigned a running index from 1 to the number of words (the order of the words is not important).
  • Represent a document as a vector – use the vocabulary to construct the BOW vector representation of each document. The dimension of the BOW vector space equals the number of words in the vocabulary. Each index of the vector refers to the corresponding word index in the vocabulary, and the entry at index i measures the occurrences of the i-th vocabulary word in the document. There are several common options for measuring a word occurrence:
  • Binary: occurs (1) or does not occur (0)
  • Term frequency (tf): the number of times the word appears in the document
  • Inverse document frequency (idf): the inverse of the rate of documents that contain the word out of the whole document set
  • Tf-idf: the product of the tf and idf measures. Frequent words that appear in only a small number of documents achieve a high value
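
In one common formulation (not spelled out in the original post), with N the total number of documents and df(t) the number of documents containing term t:
\[
\text{tf-idf}(t,d) = \text{tf}(t,d) \times \log\frac{N}{\text{df}(t)}
\]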


Consider, for example, two simple documents:
  • Document 1: John likes to watch movies. Mary likes too.
  • Document 2: John also likes to watch football games.
Based on these two text documents, a vocabulary is constructed as:
  • Vocabulary = {1:"John", 2:"likes", 3:"to", 4:"watch", 5:"movies", 6:"also", 7:"football", 8:"games", 9:"Mary", 10:"too"},
which has 10 distinct words. Using the indexes of the vocabulary, each document is represented by a 10-entry vector:
  • Document 1: x1 = [1, 2, 1, 1, 1, 0, 0, 0, 1, 1]
  • Document 2: x2 = [1, 1, 1, 1, 0, 1, 1, 1, 0, 0]
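
The same term-frequency vectors can be reproduced with a few lines of Python (illustrative only):

documents = ["John likes to watch movies. Mary likes too.",
             "John also likes to watch football games."]

vocabulary = ["John", "likes", "to", "watch", "movies",
              "also", "football", "games", "Mary", "too"]

def bow_vector(doc):
    # strip the period, split on whitespace and count term frequencies
    words = doc.replace('.', '').split()
    return [words.count(term) for term in vocabulary]

for doc in documents:
    print(bow_vector(doc))
# [1, 2, 1, 1, 1, 0, 0, 0, 1, 1]
# [1, 1, 1, 1, 0, 1, 1, 1, 0, 0]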



4. Accuracy Evaluation:

The analysis of classification results is based on the following characteristics:
  • True Positives (TP) – documents correctly labeled as belonging to a class.
  • False Negatives (FN) – documents which were not labeled as belonging to a class but should have been (also often called missed alarms or type II errors).
  • False Positives (FP) – documents incorrectly labeled as belonging to a class (also often called false alarms or type I errors).
These characteristics compare the results of the classifier under test with trusted external judgments. The terms positive and negative refer to the classifier's prediction (sometimes known as the observation), and the terms true and false refer to whether that prediction corresponds to the external judgment (sometimes known as the expectation).
Using the above definitions the following important measures are constructed:

  • Precision – the percentage of documents labeled as belonging to the class that indeed belong to it


  • Recall – the percentage of documents belonging to the class that were correctly identified
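
In terms of the counts defined above, the standard formulas (shown here in place of the original formula images) are:
\[
\text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN}
\]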

A precision score of 1.0 for a class C means that every item labeled as belonging to class C does indeed belong to class C (but says nothing about the number of items from class C that were not labeled correctly), whereas a recall of 1.0 means that every item from class C was labeled as belonging to class C (but says nothing about how many other items were incorrectly also labeled as belonging to class C). Thus, precision is used when the probability that a positive result is correct is important.
Often there is an inverse relationship between precision and recall, where it is possible to increase one at the cost of reducing the other. For example, a classification system for deciding whether or not, say, a fruit is an orange can achieve high precision by only classifying fruits with the exact right shape and color as oranges, but at the cost of low recall due to the number of false negatives from oranges that did not quite match the specification.

  • F-measure  - Weighted harmonic mean of precision and recall
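
In its balanced form (F1), this is:
\[
F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
\]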


Why use harmonic mean?
Harmonic mean emphasizes the importance of small values, whereas the arithmetic mean is affected more by outliers that are unusually large.

  • Confusion matrix - a specific table layout that allows visualization of the performance of a classification. Each column of the matrix represents the instances in a predicted class, while each row represents the instances in an actual class.



Fig.3. Confusion matrix example. Of the 69 actual Sports documents, 51 were labeled correctly, while 6 were wrongly labeled as Animals and 12 as Arts. Similarly, of the 736 Arts documents, 699 were labeled correctly, while 10 were wrongly labeled as Sports, 25 as Animals and 2 as Adults.




5. References:

[1] http://en.wikipedia.org/wiki/Precision_and_recall
[2] http://en.wikipedia.org/wiki/Confusion_matrix

Eli Goz