Sunday, March 3, 2013

Large object serialization with C#


In this post I’m going to talk about serializing large objects (hundreds of MBs or even GBs in size). It's better to keep things small, but that isn't always possible without large architecture changes, so we decided to push it to the limit, where the only real constraint is the PC’s physical memory.

Let’s say we have the classes:
using System;
using System.Collections.Generic;

[Serializable]
public class Result
{
    public string Uri { get; set; }
    public List<Data> AData { get; set; }
}

[Serializable]
public class Data
{
    public string Data1 { get; set; }
    public string Data2 { get; set; }
}
We want to binary-serialize a Result instance containing, for example, 10 million Data objects in order to persist it to storage. Later, it should be deserialized back.

First, we used the .NET binary serializer and got:
System.Runtime.Serialization.SerializationException: The internal array cannot expand to greater than Int32.MaxValue elements.
You can find an explanation of that issue here.

The next step was to implement the ISerializable interface and handle the serialization of the AData collection explicitly. We used the Newtonsoft Json serializer:
using System;
using System.Collections.Generic;
using System.Runtime.Serialization;
using Newtonsoft.Json;

[Serializable]
public class Result : ISerializable
{
    public string Uri { get; set; }
    public List<Data> AData { get; set; }

    public Result()
    {
    }

    // Deserialization constructor required by ISerializable.
    protected Result(SerializationInfo info, StreamingContext context)
    {
        Uri = info.GetString("Uri");
        AData = JsonConvert.DeserializeObject<List<Data>>(info.GetString("AData"));
    }

    public void GetObjectData(SerializationInfo info, StreamingContext context)
    {
        info.AddValue("Uri", Uri, typeof(string));
        // The whole collection is turned into one JSON string here.
        info.AddValue("AData", JsonConvert.SerializeObject(AData, Formatting.None));
    }
}
It didn't work either: 
System.OutOfMemoryException: Exception of type 'System.OutOfMemoryException' was thrown.
   at System.Text.StringBuilder.ToString()
   at Newtonsoft.Json.JsonConvert.SerializeObject(Object value, Formatting formatting, JsonSerializerSettings settings) in JsonConvert.cs:line 755

The next candidate was protobuf-net. It requires adding attributes to your classes:
using System;
using ProtoBuf;

[Serializable]
[ProtoContract]
public class Data
{
    [ProtoMember(1)]
    public string Data1 { get; set; }

    [ProtoMember(2)]
    public string Data2 { get; set; }
}
In the Result class we also added compression with GZipStream:
using System;
using System.Collections.Generic;
using System.IO;
using System.IO.Compression;
using System.Runtime.Serialization;
using ProtoBuf;

[Serializable]
public class Result : ISerializable
{
    // Uri, AData and the parameterless constructor are the same as before.

    public void GetObjectData(SerializationInfo info, StreamingContext context)
    {
        info.AddValue("Uri", Uri, typeof(string));
        PopulateFieldWithData(info, "AData", AData);
    }

    protected Result(SerializationInfo info, StreamingContext context)
    {
        Uri = info.GetString("Uri");
        AData = GetObjectsByField<List<Data>>(info, "AData");
    }

    // Serializes obj with protobuf-net, gzips the bytes and stores them under fieldName.
    private static void PopulateFieldWithData<T>(SerializationInfo info, string fieldName, T obj)
    {
        // When one using statement declares several variables, the type is written only once.
        using (MemoryStream compressedStream = new MemoryStream(),
                            byteStream = new MemoryStream())
        {
            Serializer.Serialize<T>(byteStream, obj);
            byteStream.Position = 0;

            using (GZipStream zipStream = new GZipStream(compressedStream, CompressionMode.Compress))
            {
                byteStream.CopyTo(zipStream);
            }

            // ToArray still works after the GZipStream has closed compressedStream.
            info.AddValue(fieldName, compressedStream.ToArray());
        }
    }

    // Reads the gzipped bytes stored under dataField and deserializes them with protobuf-net.
    private static T GetObjectsByField<T>(SerializationInfo info, string dataField)
    {
        byte[] byteArray = (byte[])info.GetValue(dataField, typeof(byte[]));

        using (MemoryStream compressedStream = new MemoryStream(byteArray))
        using (MemoryStream dataStream = new MemoryStream())
        using (GZipStream uncompressedStream = new GZipStream(compressedStream, CompressionMode.Decompress))
        {
            uncompressedStream.CopyTo(dataStream);

            dataStream.Position = 0;
            return Serializer.Deserialize<T>(dataStream);
        }
    }
}
This didn't work either. It didn't crash, but it appeared to enter an endless loop.
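When diagnosing a hang like this, it can help to check the compression step on its own with a small buffer. The following is a minimal sketch of the same GZipStream wrapping used by the helpers, relying only on System.IO.Compression. The class and method names (GZipBytes, Compress, Decompress) are illustrative and not part of the original code.

```csharp
using System.IO;
using System.IO.Compression;

public static class GZipBytes
{
    // Compresses a byte array into a gzip-framed byte array.
    public static byte[] Compress(byte[] raw)
    {
        using (var output = new MemoryStream())
        {
            using (var zip = new GZipStream(output, CompressionMode.Compress))
            {
                zip.Write(raw, 0, raw.Length);
            } // disposing the GZipStream writes the final compressed block

            // MemoryStream.ToArray is still valid after the stream has been closed.
            return output.ToArray();
        }
    }

    // Restores the original bytes from a gzip-framed byte array.
    public static byte[] Decompress(byte[] compressed)
    {
        using (var input = new MemoryStream(compressed))
        using (var zip = new GZipStream(input, CompressionMode.Decompress))
        using (var output = new MemoryStream())
        {
            zip.CopyTo(output);
            return output.ToArray();
        }
    }
}
```

Note that every MemoryStream here is backed by a single byte array, so each intermediate buffer has to fit into one contiguous allocation. That is one more reason to keep the amount of data handled per call bounded.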

At this point we realized that we needed to split the AData collection during serialization and deserialization.
The main idea is to take, say, 1 million Data objects at a time, serialize them, and add them to the SerializationInfo as a separate field. During deserialization these fields are read one by one and merged back into a single collection. I updated the Result class with a few more functions:

// Members of the Result class; Skip and Take require "using System.Linq;".
private const string ADataCountField = "ADataCountField";
private const int NumOfDataObjectsPerSerializedPage = 1000000;

public void GetObjectData(SerializationInfo info, StreamingContext context)
{
    info.AddValue("Uri", Uri, typeof(string));
    SerializeAData(info);
}

private void SerializeAData(SerializationInfo info)
{
    // Number of pages, rounded up so that the last partial page is included.
    int numOfADataFields = AData == null ? 0 :
        (int)Math.Ceiling(AData.Count / (double)NumOfDataObjectsPerSerializedPage);

    info.AddValue(ADataCountField, numOfADataFields);

    for (int i = 0; i < numOfADataFields; i++)
    {
        List<Data> page = AData.Skip(NumOfDataObjectsPerSerializedPage * i).Take(NumOfDataObjectsPerSerializedPage).ToList();
        // Each page is stored under its own field: "AData0", "AData1", ...
        PopulateFieldWithData(info, "AData" + i, page);
    }
}

protected Result(SerializationInfo info, StreamingContext context)
{
    Uri = info.GetString("Uri");
    DeserializeAData(info);
}

private void DeserializeAData(SerializationInfo info)
{
    AData = new List<Data>();
    int aDataFieldsCount = info.GetInt32(ADataCountField);

    for (int i = 0; i < aDataFieldsCount; i++)
    {
        List<Data> dataObjects = GetObjectsByField<List<Data>>(info, "AData" + i);
        AData.AddRange(dataObjects);
    }
}
Finally, it worked!
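
The paging logic itself does not depend on the serializer, so it can be tried out separately. Below is a minimal sketch of the split-and-merge step using only the base class library. The Paging class and its method names are hypothetical helpers introduced for illustration, and List<T>.GetRange is used here in place of the Skip/Take calls from the post.

```csharp
using System;
using System.Collections.Generic;

public static class Paging
{
    // Number of pages needed for `count` items at `pageSize` items per page.
    public static int PageCount(int count, int pageSize)
    {
        // Integer ceiling; long arithmetic avoids overflow for very large counts.
        return (int)(((long)count + pageSize - 1) / pageSize);
    }

    // Splits the list into consecutive pages; the last page may be shorter.
    public static List<List<T>> Split<T>(List<T> items, int pageSize)
    {
        var pages = new List<List<T>>();
        for (int start = 0; start < items.Count; start += pageSize)
        {
            int length = Math.Min(pageSize, items.Count - start);
            pages.Add(items.GetRange(start, length)); // copies only this page
        }
        return pages;
    }

    // Merges the pages back into one list, preserving the original order.
    public static List<T> Merge<T>(IEnumerable<List<T>> pages)
    {
        var result = new List<T>();
        foreach (List<T> page in pages)
        {
            result.AddRange(page);
        }
        return result;
    }
}
```

With a page size of 1,000,000, a collection of 2,500,000 items yields three pages: two full ones and one holding the remaining 500,000 items. The page size is a tuning knob: smaller pages keep each intermediate buffer small at the cost of more fields in the SerializationInfo.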

4 comments:

  1. How do we change to see these C# for website?

  2. Where to check the correct templates to see if they work out?

  3. I ran into a similar problem. I need to take snapshots every few hours, but I can't do it synchronously because there is too much data. And if I do it asynchronously, the data keeps changing.

  4. how I can use the source code? can you tell me step by step?
