PicScout's Engineering Blog: July 2012

Nowadays there is growing interest in asynchronous technologies, largely as a result of the huge success of Nginx and Node.js. But many people still don't get what this is all about, or how synchronous, asynchronous and threaded programming differ. In this article I will try to clarify these things…

TCP server internals

Let’s look at how a server application receives a request. When a client connects, the server accepts a new connection and starts listening for incoming data. Data usually arrives in chunks, and the server finds the end of the request either by looking for delimiter characters or by using a specified length, which might be indicated in the first few bytes.
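The length-based variant can be sketched roughly as follows. This is only an illustration: the 4-byte big-endian length prefix and the names FrameBuffer and encode are assumptions made for the example, not part of any particular protocol.

```python
import struct

# Assumed framing for this sketch: a 4-byte big-endian length, then the payload.
HEADER = struct.Struct("!I")


class FrameBuffer:
    """Accumulates raw chunks and returns complete length-prefixed messages."""

    def __init__(self):
        self._buf = bytearray()

    def feed(self, chunk):
        """Add one chunk; return the list of messages it completed."""
        self._buf.extend(chunk)
        messages = []
        while True:
            if len(self._buf) < HEADER.size:
                break  # the length prefix itself has not fully arrived
            (length,) = HEADER.unpack_from(self._buf)
            end = HEADER.size + length
            if len(self._buf) < end:
                break  # the body has not fully arrived
            messages.append(bytes(self._buf[HEADER.size:end]))
            del self._buf[:end]  # keep any bytes that belong to the next message
        return messages


def encode(payload):
    """Prefix a payload with its length, matching FrameBuffer's framing."""
    return HEADER.pack(len(payload)) + payload
```

The server calls feed() with whatever each read returns, and a message is handed over only once all of its bytes are in, no matter how the network split it.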

A typical CPU can handle data transfer rates that are orders of magnitude faster than a network link is capable of sustaining. Thus, a server that does a lot of I/O will spend much of its time blocked while the network catches up. To avoid blocking the other clients while waiting, the server has to read requests concurrently.

A popular way to do so is to use a thread per client. But there are some problems with threads. To begin with, Python (more precisely CPython, the reference implementation) does not run threads truly in parallel. Its threads are real OS threads, but the GIL (Global Interpreter Lock) prevents multiple native threads from executing Python bytecode at once. The GIL is necessary mainly because CPython's memory management is not thread-safe.

On the one hand, this makes threaded programming in Python fairly simple: adding an item to a list or setting a dictionary key requires no locks. On the other hand, it leads to relatively big chunks of code being executed sequentially, blocking each other for an undetermined amount of time.

An in-depth explanation of this problem is given in this video by David Beazley.

Python aside, there is a much wider problem with threads. I was actually surprised by how many drawbacks they have. These range from threads being a bad design (as described here) to more pragmatic ones, such as consuming a fair amount of memory, since each thread needs its own stack. The stack size varies between operating systems.

On .NET it's usually 1 MB on a 32-bit OS and 4 MB on a 64-bit OS. On Linux it might be up to 10 MB per thread. Also, context switches between many threads degrade performance significantly. A common rule of thumb is not to run more than about 100 threads. Not surprisingly, it is also always difficult to write thread-safe code. You have to care about such things as race conditions, deadlocks, livelocks and starvation!
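For reference, the thread-per-client model discussed above looks roughly like this. It is a minimal sketch using only the standard library; the echo behaviour and the function names are chosen for illustration.

```python
import socket
import threading


def handle_client(conn):
    """Echo everything back; a blocking recv() stalls only this thread."""
    with conn:
        while True:
            data = conn.recv(4096)
            if not data:  # empty read: the client closed the connection
                break
            conn.sendall(data)


def serve(listener):
    """Accept loop: start one new thread per accepted connection."""
    while True:
        try:
            conn, _addr = listener.accept()
        except OSError:  # the listening socket was closed
            break
        threading.Thread(target=handle_client, args=(conn,), daemon=True).start()


def start_echo_server(host="127.0.0.1", port=0):
    """Run the accept loop in the background; port 0 lets the OS pick a port."""
    listener = socket.create_server((host, port))
    threading.Thread(target=serve, args=(listener,), daemon=True).start()
    return listener
```

Every connected client costs one thread and therefore one stack, which is exactly where the memory and context-switching overhead described above comes from.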

Fortunately, there is a better way to handle concurrency. Python excels in a very specific area: asynchronous (aka non-blocking) network servers.

Back in the days before multi-threading was invented, asynchronous designs were the only available mechanism for managing more than one connection in a single process.

I'd like to illustrate this principle with an example from the Linux Journal article by Ken Kinder:

Have you ever been standing in the express lane of a grocery store, buying a single bottle of water, only to have the customer in front of you challenge the price of an item, causing you and everyone behind you to wait five minutes for the price to be verified?

Plenty of explanations of asynchronous programming exist, but I think the best way to understand its benefits is to wait in line with an idle cashier. If the cashier were asynchronous, he or she would put the person in front of you on hold and conduct your transaction while waiting for the price check. Unfortunately, cashiers are seldom asynchronous. In the world of software, however, event-driven servers make the best use of available resources, because there are no threads holding up valuable memory waiting for traffic on a socket. Following the grocery store metaphor, a threaded server solves the problem of long lines by adding more cashiers, while an asynchronous model lets each cashier help more than one customer at a time.

The basic flow of the asynchronous programming model (APM) is simple: the module waits for an event. Once one arrives, it reacts (hence the name reactor) by calling the appropriate callback.
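This wait-and-react loop can be sketched with Python's standard selectors module (available since Python 3.4, so newer than this post). The Reactor class and the make_echo_server helper are illustrative names, not the API of Medusa, Twisted or any other framework.

```python
import selectors
import socket


class Reactor:
    """Waits for sockets to become readable and calls their callbacks."""

    def __init__(self):
        self._selector = selectors.DefaultSelector()
        self._running = False

    def register(self, sock, callback):
        sock.setblocking(False)
        self._selector.register(sock, selectors.EVENT_READ, callback)

    def unregister(self, sock):
        self._selector.unregister(sock)

    def poll(self, timeout=None):
        """Wait for events once and dispatch each to its callback."""
        for key, _events in self._selector.select(timeout):
            callback = key.data
            callback(key.fileobj)

    def run(self):
        self._running = True
        while self._running:
            self.poll(1.0)

    def stop(self):
        self._running = False


def make_echo_server(reactor, host="127.0.0.1", port=0):
    """Register an echo server on the reactor; a single thread serves everyone."""
    listener = socket.create_server((host, port))

    def on_readable(conn):
        data = conn.recv(4096)
        if data:
            # Simplification: assumes the send buffer has room for the reply.
            conn.send(data)
        else:
            reactor.unregister(conn)
            conn.close()

    def on_accept(sock):
        conn, _addr = sock.accept()
        reactor.register(conn, on_readable)

    reactor.register(listener, on_accept)
    return listener
```

Calling reactor.run() then serves all clients from one thread: no per-client stack, and no locks around shared state.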

Python has had a high-performance asynchronous server framework, called Medusa, since as early as 1995.

It became the archetype for the now well-known Zope and Twisted. It was initially built to address the C10K problem, which is simple to state: how to service 10,000 simultaneous network requests. I refer you to the C10K website for enormously detailed technical information on this complex problem.

Suffice it to say that asynchronous architectures, with their much smaller memory usage and no need for locking, synchronization or context switching, are generally considered to be far more performant than threaded architectures.

So if you ever need to handle high traffic, or just want to have fun trying a different way of thinking about programming, you should consider writing an asynchronous application.

Acknowledgments:

http://google.com

http://aboutsimon.com/tag/david-beazley

http://www.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-1.pdf

http://www.linuxjournal.com/article/7871

http://www.nightmare.com/medusa/

http://www.zope.org/

http://twistedmatrix.com/

http://www.kegel.com/c10k.html

http://krondo.com/?p=1209

https://ep2012.europython.eu/conference/talks/asynchronous-programming-with-twisted

http://jython.xhaus.com/twisted-and-zope-high-performance-asynchronous-network-servers-in-jython/

http://www.guptamayank.com/why-node.js-single-thread-event-loop-javascript

In the past few months, since we announced our new technological blog, we’ve published many posts regarding our technological point of view. Posts like “Redis as a Messaging Framework”, “Javascript Best Practices” and “Machine Learning Approach to Document Classification” are just a few samples of the posts we’ve published since then.

One thing we came to realize is that you, our devoted readers, can’t really check if what we post about is actually what we do. I mean, what Aran wrote in his post about disposing is really nice, but do we, the PicScout team, really follow those guidelines? Well, the answer is clearly yes! But you can’t really know that, can you?

As a result, we’ve decided that it is mandatory for us to publish a post about something on which our readers can actually put us to the test. In the following few minutes we would like to share with you our point of view about API Best Practices and show you how they are applied in our Web API. Hopefully you’ll accept the challenge to put us to the test.

Let's get started!

There are many ways you can implement your API. Yet not many of them guarantee a lightweight, flexible and user-friendly API. By following a few simple guidelines, as shown in Uri Lavi’s “API Best Practices” slides here, we’ve implemented our API in a way that gave us those features.

Let’s discuss a few of those guidelines.

Is this line secure?

The web is full of data going from one point to another. In PicScout’s case, the data that is transferred from and to our API shouldn’t be visible to anybody besides our trusted partners.

In order to support the secure transfer of our data, we’ve decided that our API should support only a secure communication scheme, in other words HTTPS. As a result, each request sent to our API must have the following form:

https://api.picscout.com/

What version is this?

Like many APIs, PicScout’s API goes through many changes and modifications during its life cycle. Frameworks and endpoints are examples of things that can change during that time. Such changes should be applied without breaking third-party tools that use our API. As a result, versioning is one of the crucial features we’ve implemented in our API.

This feature is easily achieved by specifying the API's version as part of the request URI. More specifically, the API version is the first segment of the URI after the base address. For example, the URI for sending requests to our API will look as follows:

https://api.picscout.com/v1/

This feature will allow us in the future to make major changes, if needed, to our API without the fear that any third-party tool that uses our API will break.

“English ******, do you speak it?”

At the end of the day, the data is handled by two machines, the client and the server. Yet there is one thing any API developer should keep in mind: APIs are for humans!

Keeping that in mind, we’ve decided to keep our API's URI formats as readable as possible. To achieve this, we use basic English grammatical terms such as nouns, verbs and relationships. No more programmers’ favorite method names in the URI.

For example, let’s say you want to get the details of an image that has 12345 as its id. One way we could achieve that is by sending a request as follows:

https://api.picscout.com/v1/getImageDetails?id=12345

This seems to us a bit… well… ugly. As you might have guessed, there is a clear relationship between an image and its id. Furthermore, there is also one between our API and all the images in our storage. Considering this, it is only logical that the request should have the following format:

https://api.picscout.com/v1/images/12345

Translated to English: out of all our image details, you want only those of the image with 12345 as its id. Simplicity in action.

“Hey! You promised verbs! You cheated!” – “Well allow me to retort”. Another operation we expose through our API is searching for similar images in our storage based on an image URL or its binary data. To use that ability, all you have to do is format the URI in the following manner:

https://api.picscout.com/v1/search?url=<imageURL>

In translation to English, you want to search for images that are similar to the one you provided. Yet again, simplicity in action.
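On the client side, the two URI formats shown so far could be built with helpers like the ones below. This is only a sketch: the function names are made up for illustration, and authentication is left out because its mechanics are not described in this post.

```python
from urllib.parse import quote, urlencode

BASE_URL = "https://api.picscout.com"  # base address from this post
API_VERSION = "v1"  # first URI segment after the base address


def image_url(image_id):
    """URI of a single image resource (noun plus id)."""
    return f"{BASE_URL}/{API_VERSION}/images/{quote(str(image_id), safe='')}"


def search_url(image_location):
    """URI of a GET search for images similar to the one at image_location."""
    # urlencode escapes the nested URL so it is safe inside a query string.
    query = urlencode({"url": image_location})
    return f"{BASE_URL}/{API_VERSION}/search?{query}"


print(image_url(12345))
print(search_url("http://example.com/cat.jpg"))
```

Note that the image URL passed to search has to be percent-encoded, since it travels as a query parameter inside another URI.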

Don’t reinvent the wheel

You might have noticed that I “forgot” to show you an example of how you can search for images based on an image's binary data. Well, I had to “forget” about it in order to illustrate the following concept. So please, forgive me.

Not all images are stored online; think of the ones stored on your PC. No URL can be provided to access them. Exactly for this type of case, we at PicScout decided that it is mandatory for us to support searching for similar images based on an image's binary data.

But wait, we already used the “search” verb to search for images based on URL! Well lucky for us we can always add another endpoint like:

https://api.picscout.com/v1/search_binary?data=<imageBinaryData>

And there you have it, minor additions create major abilities right? NO! Why on earth would you want to add another endpoint to an operation you already support? And pass binary data as part of the URI?

Luckily for us, there is more than one method we can use to access endpoints. In fact, considering how lucky we are, why not just use those methods and thereby make our API much more readable and user-friendly?

So to make a long story short, since the operation is the same (“search”), we decided that it is better for the method to be what distinguishes the two requests. To search for an image based on a URL we use the GET method. For a binary-data-based search we use the POST method, where the binary data itself is passed in the body of the request.

As a result, all you have to do in order to use our binary-data-based search is to attach the image file to the POST request’s body and send it over to:

https://api.picscout.com/v1/search
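In Python, such a request could be prepared roughly as follows. Treat it as a sketch under assumptions: this post does not say how the body must be encoded, so sending the raw bytes with an application/octet-stream content type is a guess, the function name is invented for the example, and authentication is again omitted.

```python
import urllib.request

SEARCH_ENDPOINT = "https://api.picscout.com/v1/search"  # same endpoint as the GET search


def build_binary_search_request(image_bytes, content_type="application/octet-stream"):
    """Prepare (but do not send) a POST search carrying the image in the body.

    Assumption: the raw image bytes are the whole body. The real API may
    expect a different encoding, for example multipart form data.
    """
    return urllib.request.Request(
        SEARCH_ENDPOINT,
        data=image_bytes,
        method="POST",
        headers={"Content-Type": content_type},
    )


request = build_binary_search_request(b"...image bytes read from a local file...")
print(request.get_method(), request.full_url)
```

Passing the prepared request to urllib.request.urlopen() would actually send it. The endpoint is identical to the one used for the URL-based search, and only the HTTP method and the body differ.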

Why do I need to know all of this?

As you might know, the amount of data that is transferred over the web is enormous. In addition, this amount only keeps getting larger and larger. To put it simply, more cargo means more weight, and more weight means more time spent moving it, unless new and improved trucks are constructed. Most of us don’t have control over the truck construction committee, but we do have control (or at least partial control) over the cargo.

Using that knowledge, one should always try to find more efficient ways to transfer one's cargo, or data in our case. So without further ado, the PicScout team is proud to present one of our API’s major features, and the coolest in my opinion: the Field Selector!!!

On the client side, the field selector allows you, our trusted partner, to specify exactly which information you want to retrieve. On the server side, which is our eco-friendly API, it allows us to send back only a relatively small amount of data, which transfers much faster.

“How?” you say? Well that’s really simple; just name the fields you want to include in the response and you’re good to go. For example, let’s say you only want to know where you can buy the image with 12345 as its id. All you have to do is send the following request:

https://api.picscout.com/v1/images/12345?fields=purchaseUrl
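A client-side helper for the field selector might look like the sketch below. The function name is illustrative, and joining several field names with commas is an assumption: this post only shows a request for a single field.

```python
from urllib.parse import urlencode

IMAGES_ENDPOINT = "https://api.picscout.com/v1/images"


def image_fields_url(image_id, fields):
    """URI for an image, restricted to the named response fields.

    Assumption: multiple fields are comma-separated. Only the single-field
    form appears in this post.
    """
    query = urlencode({"fields": ",".join(fields)}, safe=",")
    return f"{IMAGES_ENDPOINT}/{image_id}?{query}"


print(image_fields_url(12345, ["purchaseUrl"]))
```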

In conclusion, following the few simple guidelines we discussed in this post helped us at PicScout reach our goal of creating a simple, readable and flexible API. To support this claim, we implemented 3 clients in 3 different languages: Node.js, Python and C#.

As exciting as it is to implement the same code in 3 different languages, the interesting part was a tiny rule we agreed on: each implementation shouldn’t take more than 10 minutes. To be fair, it took us around 5 minutes each.

But that’s not such a big deal, considering we know our API from top to bottom. So this is where you, if you’re up for it, step in. We challenge you to implement a client for our API, in any language you like. The same rule applies here: 10 minutes and that’s it! No need to send us your code or anything like that, just share your experience. Those of you who are not interested in the challenge are more than welcome to try out our API’s abilities.

One last thing before you go playing around with our API. Like many other Web APIs, our API supports only requests that are sent from known users. To identify yourself as one, contact us to request a key and we’ll issue one for you along with our API documentation.

We hope you enjoyed reading this post, and we look forward to adding more exciting new features to our API based on your feedback.


Thursday, July 12, 2012

Why you should consider asynchronous programming model (APM) when writing web server in Python.

Thursday, July 5, 2012

API Best Practices - Introducing PicScout’s API