Wednesday, October 10, 2012

High Volume Redis


If you’ve been reading our previous posts, you already know that we use Redis as a messaging framework. In this post I’ll show you how we use Redis to solve a different problem.

The problem

As you might know, we deal with images. The number of images varies from product to product, but it usually ranges from millions to hundreds of millions. One of the questions that keeps coming up is how we can store and efficiently access such a large amount of data.
There are different solutions to that problem and we use different techniques, from our own distributed, proprietary data structure to… Redis. Redis offers simple in-place updates as well as very efficient queries when the values are complex data structures such as lists and sets.

Redis

Redis caught our eye as a super-fast NoSQL DB.
It looked promising at first, but when we used Redis lists to push our data and then query (read) it by range, the results were quite disappointing.
Compared to our own proprietary distributed data structure, Redis lists consumed twice as much memory (even after we made sure we were avoiding fragmentation).
The read results were very poor as well: ten times slower than our implementation (data structure).

We expected some overhead, but not that much.

Memory Optimizations

After some digging, we managed to apply some memory optimizations, following two rules of thumb:
- Use hashes when possible: small Redis hashes are stored far more compactly than lists, sets or the other structures Redis supports.
- Reduce the number of keys, and don’t let a hash exceed hash-max-zipmap-entries (with thanks to the guys at Instagram).
We also compressed some of our data before storing it in Redis (we obviously “pay” for that by decompressing after each read).
Those optimizations yielded great results, very similar to our proprietary data structure.
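The two rules above can be sketched as follows. This is a minimal illustration and not our production code: the `img` prefix, the bucket size of 1000 and the helper names are assumptions made for the example, and `client` is expected to be a redis-py style object exposing `hset` and `hget`. The bucket size should stay below whatever hash-max-zipmap-entries is set to on the server.

```python
from typing import Tuple

# Hypothetical bucket size: keep it below the server's
# hash-max-zipmap-entries so every hash stays in the compact encoding.
BUCKET_SIZE = 1000


def bucket_for(image_id: int, prefix: str = "img") -> Tuple[str, str]:
    """Map a numeric ID to a (hash key, field) pair.

    IDs that share the same quotient land in the same hash, so the
    number of top-level keys drops by a factor of BUCKET_SIZE.
    """
    return f"{prefix}:{image_id // BUCKET_SIZE}", str(image_id % BUCKET_SIZE)


def store(client, image_id: int, payload: bytes) -> None:
    """Write one value into its bucket (client is redis-py compatible)."""
    key, field = bucket_for(image_id)
    client.hset(key, field, payload)


def load(client, image_id: int):
    """Read one value back from its bucket; None if it is missing."""
    key, field = bucket_for(image_id)
    return client.hget(key, field)
```

With these assumed settings, ID 1155315 is stored in the hash `img:1155` under the field `315`.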
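The compression trade-off can be illustrated with a small sketch. The post does not depend on a particular algorithm, so the use of zlib here is an assumption, as are the helper names `pack` and `unpack`.

```python
import zlib


def pack(raw: bytes, level: int = 6) -> bytes:
    """Compress a value before writing it to Redis (costs CPU on write)."""
    return zlib.compress(raw, level)


def unpack(stored: bytes) -> bytes:
    """Decompress a value after reading it (the price paid on each read)."""
    return zlib.decompress(stored)
```

Whether this pays off depends on the data: repetitive values shrink a lot, while already-compressed payloads may even grow slightly.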

Query Optimizations

The memory optimizations we made also resulted in a significant improvement in query performance.
Instead of LRANGE, which returns a multi-bulk reply in O(S+N) time (S being the offset of the start element and N the number of elements returned), we use HGET to read from a hash bucket in O(1) time.
On top of that, one of the coolest optimizations available is to use Redis’s Unix domain sockets instead of TCP/IP, which cuts the networking overhead when the client runs on the same machine as the server.
Oh, and of course, we use Redis pipelines whenever possible.
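A pipelined read might look like the sketch below. It assumes a redis-py style client (`pipeline()`, `hget()`, `execute()`); the `img` prefix, the bucket size and the `fetch_many` helper are made up for illustration. With redis-py, such a client could be created over a Unix domain socket with `redis.Redis(unix_socket_path="/tmp/redis.sock")`, where the socket path is a placeholder.

```python
from typing import Dict, Iterable, Optional

BUCKET_SIZE = 1000  # hypothetical, see hash-max-zipmap-entries


def fetch_many(client, image_ids: Iterable[int]) -> Dict[int, Optional[bytes]]:
    """Read many values in one round trip instead of one per ID."""
    ids = list(image_ids)
    # transaction=False: plain pipelining, no MULTI/EXEC wrapper.
    pipe = client.pipeline(transaction=False)
    for image_id in ids:
        # Each call only queues the command; nothing is sent yet.
        pipe.hget(f"img:{image_id // BUCKET_SIZE}", str(image_id % BUCKET_SIZE))
    # execute() sends the queued commands and returns replies in order.
    return dict(zip(ids, pipe.execute()))
```

The replies come back in the order the commands were queued, which is what lets the sketch zip them back onto the IDs.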

Results

After applying the optimizations we got a very efficient NoSQL store:
Time-wise: 1.1 times the latency of our own implementation.
Memory-wise: 0.95 times the memory consumption of our own implementation.

Side Effect Bonus

Using Redis, we can now separate our algorithms from our data representations.
The data itself can grow faster and supports the various operations provided by the NoSQL store.
And it is much easier to write and run sets of automated integration tests. Yeah!!!