PicScout's Engineering Blog: 2011

Thursday, December 29, 2011

ILTechTalks Week

During the second week of January, PicScout will host a week of IL Tech Talks.
We will have 2 talks a day starting from 16:30.

Please follow the IL Tech Talks meetup for more details.

Here's the agenda:

Sun 8/1: Web Performance 101 (16:30); Building a web infrastructure to support more than 10M users (18:00)
Mon 9/1: Advanced Unit Testing (16:00); Continuous Deployment @ Outbrain (17:10)
Tue 10/1: All about Scala (16:30); Scaling without re-writing your core (18:00)
Wed 11/1: Thrift - Facebook's inter-language IDL and networking framework (16:30); Outbrain's view of Scalability (18:00)
Thu 12/1: Lean software development (16:30); Team Up (18:00)

The number of places is limited, so hurry up if you are interested.

Sunday, December 11, 2011

Disposing - doing it the right way

.NET’s Garbage-Collection (GC) is a well-documented and discussed feature of the Common-Language-Runtime (CLR). Plenty of posts and tutorials can be found online.
The topic is almost always followed by another popular one: the “Finalize\Dispose Pattern”.

However, there are dark corners that are rarely mentioned, although they play an important role in performance. While disposing looks like simple plug-and-play, overusing it can hurt the performance of CPU-bound applications.

This post will try to shed some light on the dangers of disposing and show some best practices.

GC and disposing in a nutshell

The .NET CLR maintains a managed heap where instances of reference types (objects) are stored, while the variables pointing to them are located on the stack. Those variables are often referred to as “roots” (static fields act as roots too). Every once in a while (according to specific rules), the CLR runs a collection that scans the heap, identifies unreachable objects (those no root leads to any more) and clears them from memory.

Microsoft published a GC optimization technique, the “Finalize\Dispose Pattern”, which we recommend reading about before moving on with this post.
With it, objects can release both managed and unmanaged members before their root vanishes, so the GC won’t need to spend precious resources on finalizing them later.
The technique is even enforced by code-analysis tools (such as FxCop), but the bad news is that applying it mechanically often causes programmers to miss important key points.

Here are the basic rules to watch out for:

DO NOT implement Finalizer automatically

DO NOT implement Dispose automatically

Safety Comes First

Optimize collections if possible

DO NOT implement Finalizer automatically

A major mistake, and it happens quite a lot, is implementing a needless finalizer.
A class must implement one if and only if it directly owns unmanaged resources, which do not reside on the managed heap.

Implementing a needless Finalize method means that allocating the object takes extra time, since a reference to it must be inserted into the finalization queue.
Moreover, every object of this type will require two garbage collections plus an additional finalization step, since the finalization queue still holds a reference to it when the first collection runs!

Those aren’t things to underestimate – imagine scenarios where hundreds of objects are allocated & deallocated per second!

DO NOT implement Dispose automatically

Two other common misconceptions are that 1) the Dispose method is called by the GC to clear memory, and 2) calling Dispose manually keeps objects from going through the GC at all.
Both are totally wrong!

Dispose can be called only by the programmer’s code, either directly or via a “using” block.
The GC, on the other hand, clears memory by detecting and removing dead objects and, if needed, executing their finalizers later.
So every object must go through at least one GC.

Implementing Dispose is recommended only when the type has a finalizer or at least one of its members is IDisposable. Such early disposal (together with suppressing finalization) spares the object and its members a second GC plus finalization.

Do not make your class IDisposable unnecessarily, since that forces the classes which hold it to become IDisposable as well, and so on...
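The two rules above can be sketched with a pair of classes. This is a minimal illustration rather than code from our system: UnmanagedBuffer and BufferOwner are hypothetical names, and Marshal.AllocHGlobal simply stands in for "some unmanaged resource". The first class owns unmanaged memory directly, so it gets both a finalizer and Dispose; the second only holds an IDisposable member, so it forwards Dispose and has no finalizer.

```csharp
using System;
using System.Runtime.InteropServices;

// Owns unmanaged memory directly, so it needs BOTH a finalizer and Dispose.
public sealed class UnmanagedBuffer : IDisposable
{
    private IntPtr m_Buffer;

    public UnmanagedBuffer(int size)
    {
        m_Buffer = Marshal.AllocHGlobal(size);
    }

    public bool IsReleased
    {
        get { return m_Buffer == IntPtr.Zero; }
    }

    public void Dispose()
    {
        Release();
        // Cleanup is done, so remove the object from finalization:
        // it can now be reclaimed by a single collection.
        GC.SuppressFinalize(this);
    }

    // Safety net that runs only if Dispose was never called.
    ~UnmanagedBuffer()
    {
        Release();
    }

    private void Release()
    {
        if (m_Buffer != IntPtr.Zero)
        {
            Marshal.FreeHGlobal(m_Buffer);
            m_Buffer = IntPtr.Zero;
        }
    }
}

// Holds only managed state plus one IDisposable member:
// Dispose is needed to forward the call, a finalizer is not.
public sealed class BufferOwner : IDisposable
{
    private readonly UnmanagedBuffer m_Inner = new UnmanagedBuffer(1024);

    public bool IsReleased
    {
        get { return m_Inner.IsReleased; }
    }

    public void Dispose()
    {
        m_Inner.Dispose();
    }
}

public static class Program
{
    public static void Main()
    {
        var owner = new BufferOwner();
        using (owner)
        {
            // work with the buffer here
        }
        Console.WriteLine(owner.IsReleased); // True
    }
}
```

A class with neither unmanaged resources nor IDisposable members would implement neither method.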

Safety Comes First

An additional best practice, which is usually forgotten, is to ensure that disposing doesn’t throw exceptions. We use a nice extension method for that:

using System;

public static class IDisposableExtensions
{
    public static void SafeDispose(this IDisposable disposedObj)
    {
        try
        {
            if (disposedObj != null)
                disposedObj.Dispose();
        }
        catch { } // log some stats here…
    }
}
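To make the "log some stats here" part concrete, here is a hedged variation of the extension above. TrySafeDispose, its onError callback and the FaultyResource class are hypothetical names introduced only for this sketch; the idea is simply that the caller decides what to do with a swallowed exception.

```csharp
using System;

public static class DisposableExtensions
{
    // Same idea as SafeDispose; the onError callback is a hypothetical addition.
    // Returns true when nothing was thrown (a null reference counts as success).
    public static bool TrySafeDispose(this IDisposable disposedObj, Action<Exception> onError)
    {
        if (disposedObj == null)
            return true;

        try
        {
            disposedObj.Dispose();
            return true;
        }
        catch (Exception ex)
        {
            if (onError != null)
                onError(ex); // e.g. write to a log or bump a counter
            return false;
        }
    }
}

// A resource whose cleanup always fails, used only to exercise the extension.
public sealed class FaultyResource : IDisposable
{
    public void Dispose()
    {
        throw new InvalidOperationException("cleanup failed");
    }
}

public static class Program
{
    public static void Main()
    {
        IDisposable missing = null;
        Console.WriteLine(missing.TrySafeDispose(null)); // True: null is ignored

        int failures = 0;
        bool ok = new FaultyResource().TrySafeDispose(ex => failures++);
        Console.WriteLine(ok + " " + failures); // False 1
    }
}
```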

Optimize collections if possible

The last common misconception (but not the least) is that calling the Dispose method sets objects to null, i.e. removes their root reference.
Such a thing never happens!
A root is removed only when the variable pointing to the object expires (at the end of a method, for example), and only then does the GC become aware of it and mark the object for cleaning. So, to optimize the GC for disposed objects, we recommend the following nice trick: set reference variables to “null” right after disposing them. That way, if a GC collection is about to run, such objects can be collected right away and won’t wait for their variable to be eliminated by the CLR.

Integrating the two techniques described above, our virtual disposing methods usually look like this (a simple snippet can be used):

private bool m_IsDisposed = false;

protected virtual void Dispose(bool disposing)
{
    if (!m_IsDisposed)
    {
        if (disposing)
        {
            // clean managed resources...
            m_ClassMember.SafeDispose();
            m_ClassMember = null;
        }

        // clean unmanaged resources here...

        m_IsDisposed = true;
    }
}
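For completeness, here is a small self-contained sketch of how such a method might sit inside a class, and of the point that Dispose never clears the caller's variable. ResourceHolder is a hypothetical class, and a MemoryStream stands in for whatever disposable member the class really owns.

```csharp
using System;
using System.IO;

public class ResourceHolder : IDisposable
{
    private bool m_IsDisposed = false;
    private MemoryStream m_ClassMember = new MemoryStream();

    public bool IsDisposed
    {
        get { return m_IsDisposed; }
    }

    public void Dispose()
    {
        Dispose(true);
        // No finalizer here, so this is a no-op; it matters once a derived type adds one.
        GC.SuppressFinalize(this);
    }

    protected virtual void Dispose(bool disposing)
    {
        if (!m_IsDisposed)
        {
            if (disposing && m_ClassMember != null)
            {
                // clean managed resources and drop the reference to them
                m_ClassMember.Dispose();
                m_ClassMember = null;
            }

            m_IsDisposed = true;
        }
    }
}

public static class Program
{
    public static void Main()
    {
        var holder = new ResourceHolder();
        holder.Dispose();

        Console.WriteLine(holder != null);    // True: Dispose did not clear the variable
        Console.WriteLine(holder.IsDisposed); // True

        holder = null;                        // the caller removes the root explicitly
        Console.WriteLine(holder == null);    // True
    }
}
```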

Thursday, November 10, 2011

Redis as a messaging framework plus a free UI admin tool

As part of our crawling infrastructure, we wanted to enhance our messaging framework.
The crawlers use a collection of dedicated “workers”; each worker implements a unique piece of business logic, such as downloading, validating or parsing the content.

At first we used NServiceBus (based on MSMQ by default), and the system worked as well as expected.
Unfortunately, when we tried to speed up the crawling process by running more processes, we noticed that MSMQ hindered our ability to scale. In essence, we had to bear a huge performance hit caused by MSMQ's heavy I/O operations.

Based on Redis's reputation as a fast key-value store (more information can be found in Dvir Volk's presentations), we decided to give it a shot.

Redis works on Linux, Solaris and most other POSIX systems. Although Windows builds are not officially supported, we had to try one out because our system runs on Windows (it is written in C#).

Unfortunately, with a lot of Redis connections open, the system stopped working due to timeouts (operations timed out even though the messages had already been popped, which caused data loss).
Eventually, after we switched to running Redis on Ubuntu, those issues were gone and everything started to work.

To work with Redis from C#, we reviewed the various available clients.
We tested Sider, Booksleeve and ServiceStack.Redis, focusing on ease of use, functionality and connection management; in the end we chose ServiceStack.Redis.
ServiceStack.Redis provides typed clients, which let you bind a client to a specific type, and a native client, which lets you work with byte arrays.
For our purposes we worked with the native client because of:

Problems deserializing complex types with the typed clients (had we used only primitives / simple types, they would have worked without a problem).
Control over serialization, to reduce the amount of data on the network and improve performance (we did not want to use a wasteful serialization format such as XML).
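To illustrate the second point, here is a rough sketch of hand-rolled binary serialization compared with XML. It is not our actual message format: CrawlMessage, its fields and CrawlMessageCodec are hypothetical, and only standard .NET classes (BinaryWriter, BinaryReader, XmlSerializer) are used. The resulting byte array is the kind of payload a byte-oriented client can push as-is.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml.Serialization;

// Hypothetical message type, used only for illustration.
public class CrawlMessage
{
    public string Url { get; set; }
    public int Depth { get; set; }
}

public static class CrawlMessageCodec
{
    // Compact encoding: a length-prefixed UTF-8 string followed by a 4-byte integer.
    public static byte[] ToBytes(CrawlMessage message)
    {
        using (var stream = new MemoryStream())
        {
            using (var writer = new BinaryWriter(stream, Encoding.UTF8))
            {
                writer.Write(message.Url);
                writer.Write(message.Depth);
            }
            return stream.ToArray();
        }
    }

    // Reads the fields back in the same order they were written.
    public static CrawlMessage FromBytes(byte[] payload)
    {
        using (var reader = new BinaryReader(new MemoryStream(payload), Encoding.UTF8))
        {
            string url = reader.ReadString();
            int depth = reader.ReadInt32();
            return new CrawlMessage { Url = url, Depth = depth };
        }
    }

    // The same message as XML, for size comparison only.
    public static byte[] ToXmlBytes(CrawlMessage message)
    {
        var serializer = new XmlSerializer(typeof(CrawlMessage));
        using (var stream = new MemoryStream())
        {
            serializer.Serialize(stream, message);
            return stream.ToArray();
        }
    }
}

public static class Program
{
    public static void Main()
    {
        var message = new CrawlMessage { Url = "http://example.com/a.jpg", Depth = 3 };

        byte[] binary = CrawlMessageCodec.ToBytes(message);
        byte[] xml = CrawlMessageCodec.ToXmlBytes(message);

        Console.WriteLine(binary.Length < xml.Length); // True
        Console.WriteLine(CrawlMessageCodec.FromBytes(binary).Url);
    }
}
```

The trade-off is that a hand-written format must be versioned by hand: adding a field means changing both the writer and the reader.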

While adjusting the framework to use Redis, we found some useful functionality:

BLPop / BRPop block a client until a message is added to the queue (no queue polling is required, which means that the system uses less networking and leaves Redis free to process other requests).
For better performance (and if persistence is not required), Redis can be configured to keep its data in memory only.
You can use Redis "Set" object for getting a random message from a queue.
You can use Redis "Hash" object in order to verify uniqueness of a message.
We had some memory fragmentation issues on Linux that were fixed after moving to a Redis version that supports jemalloc (supported from Redis 2.4).
You can run more than one Redis process on a machine (by specifying a different port for each process).
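The blocking-pop behaviour from the first item can be sketched as a worker loop. To keep the sketch runnable without a Redis server, the pop operation is passed in as a delegate and an in-memory stand-in is used in Main. QueueWorker, Drain and the "crawl:jobs" queue name are hypothetical, and the shape of the delegate (list name plus timeout in seconds, returning the list name and the payload as byte arrays, or null / an empty array on timeout) is an assumption about the client's blocking-pop call rather than a documented contract.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class QueueWorker
{
    // blockingPop stands in for the client's blocking pop call (assumed shape:
    // list name + timeout in seconds -> [listName, payload], or null / empty on timeout).
    public static int Drain(Func<string, int, byte[][]> blockingPop, string queue,
                            int timeoutSecs, Action<byte[]> handle)
    {
        int processed = 0;
        while (true)
        {
            // The call waits inside the client, so there is no polling loop here.
            byte[][] reply = blockingPop(queue, timeoutSecs);
            if (reply == null || reply.Length < 2)
                break; // timed out: the queue stayed empty

            handle(reply[1]);
            processed++;
        }
        return processed;
    }
}

public static class Program
{
    public static void Main()
    {
        // In-memory stand-in so the sketch runs without a Redis server.
        var pending = new Queue<string>(new[] { "download:1", "parse:2" });
        Func<string, int, byte[][]> fakePop = (queue, timeout) =>
            pending.Count == 0
                ? null
                : new[] { Encoding.UTF8.GetBytes(queue), Encoding.UTF8.GetBytes(pending.Dequeue()) };

        int count = QueueWorker.Drain(fakePop, "crawl:jobs", 1,
            payload => Console.WriteLine(Encoding.UTF8.GetString(payload)));
        Console.WriteLine(count); // 2
    }
}
```

A long-running worker would typically keep waiting after a timeout instead of leaving the loop; the sketch stops so that it terminates.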

Queue information can be very useful for development and testing.
Therefore, we created a tool called "Redis Administration" (this time we used the Sider C# client):

Press here to download

Thursday, October 27, 2011

Welcome to PicScout's Engineering Blog

If you are in the business of constructing software, you probably share the same values as we do.

I am sure that you are seeking better ideas and solutions, looking for continuous personal and organizational improvement, and thirsting for more knowledge.

Needless to say, technology cannot prosper without innovation.
Sometimes the innovation is in how a specific technology (or a solution) is combined to support the business and sometimes the innovation is the technology itself.

These are the values that PicScout's engineering team shares and cherishes.

That's why I am incredibly happy to announce PicScout's engineering blog! It will be the place where we share our technological experience and opinions, while always keeping those values in mind.

Enjoy the reading!