Monday, March 4, 2013

Access your organizational data securely and transparently


What do I have to hide?

If you are dealing with an information system, you always have something to hide. Maybe not at first, but as your system grows, it starts to accommodate more users of various roles and positions inside or outside the organization. Perhaps the types of information you store diversify as well. At some point, you're going to need some boundaries to protect that information, or your data management will end up looking like a Harlem Shake.
For instance, say you have a software system for managing scientific experiments and results. Its purpose is to allow the public easy access to research and to allow scientists to share their knowledge. So why hide anything? It may seem at first sight that you would not need any security here, but let us consider some example use cases:
  1. A general user wants information about children's safe attachment
  2. A biologist wants to see the latest news about stem cell research
  3. A member of a cross-disciplinary research project wants to see its progress
You can see how all of these users are out for the same type of information (results of research), but as the architect of the system, you would want to give them different levels of access to it. You wouldn't want to expose the details of an ongoing project to anyone but the team, and a biologist can probably see some partial and unverified results that perhaps you wouldn't want to share with the general public at this point.
Let's try to describe a simple security layer to address those needs.

Building a protection layer

Unknown location

The easiest and simplest way to keep a secret: you want to hide something? Don't tell anyone where it is. This is similar to how Google allows you to share photo albums with "Anyone with the link". Given that the link is random enough, it would be pretty hard for someone to guess it. However, once the link is leaked you lose all control: you give one colleague the link to your research results, he sends it to his journalist friend, and you have no idea what happens next.
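As a side note, minting such an unguessable link is easy. Here is a minimal sketch; the URL scheme and method name are made up for illustration:

using System;
using System.Security.Cryptography;

static string MintShareLink(string baseUrl)
{
    // 32 random bytes give roughly 256 bits of entropy, which is infeasible to guess.
    var bytes = new byte[32];
    using (var rng = RandomNumberGenerator.Create())
    {
        rng.GetBytes(bytes);
    }

    // URL-safe Base64 keeps the token copy-paste friendly.
    string token = Convert.ToBase64String(bytes).TrimEnd('=').Replace('+', '-').Replace('/', '_');
    return baseUrl + "/shared/" + token;
}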

Password protecting

So maybe just put a password on anything you want to keep safe? Again, the password could be leaked, and even if it isn't, or you manage to change it in time, this approach is problematic: you now need to manage a whole range of passwords for all your data and remember everyone else's passwords. That is cumbersome, inefficient, and doesn't provide real security.

User identity

Why don't we approach this from a different angle: identify the user first, and then let him access the application? Let's assume our program has some sort of login mechanism that identifies the user; the user then tries to find the relevant information. This approach works better than the previous ones since there is no need for user interaction after the first identification (which can sometimes even be done automatically using Single Sign-On). Permissions are also granular and dynamic per user, so we get lots of control.
But, as Uncle Ben once said, with great control comes great complexity. We now have the responsibility to assign (and maintain!) the different permissions for all users and data, which can amount to thousands or even millions of combinations.

A simple, scalable solution

So we come to the conclusion that while user-based permissions are secure enough, we need a way to treat users as groups with common properties rather than as individuals. If we can label the data in a smart way, users will have an easier time accessing it.
How can we do this? There are several common ways (a code sketch of these mechanisms follows the list):
  1. Flat labeling – marking pieces of data with some sort of label or tag is the basis of Web 2.0. We can mark each project with a unique label, if needed.
  2. User groups – we can assign the user to various groups according to his affiliation. For example, "biologists from Oxford" can be such a group if we want to allow them to share information only they will see. Another option is to separate "Biologists" and "Oxford" and use some sort of combination logic.
  3. Roles – If you wish to expose only some of the data to general users, this can be accomplished by assigning a hierarchy of roles and only allowing "scientists or above" to view it.
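For illustration, here is one possible shape for the data items that combines all three mechanisms. Every name in this sketch is an assumption for this post, not an established API:

using System.Collections.Generic;

public enum Role { Public = 0, Scientist = 1, ProjectMember = 2 }

public class ResearchData
{
    public ISet<string> Labels { get; set; }      // flat labels, e.g. "project-x"
    public ISet<string> Groups { get; set; }      // groups allowed in, e.g. "Biologists"
    public Role MinimumRole { get; set; }         // "scientists or above"
}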
Accessing the data can be done via a (simplified) method such as:


[Secured(true)]
IEnumerable<ResearchData> GetResearchData()
{
    return DAL.GetData<ResearchData>();
}

Notice a couple of things about this method:

  • The DAL object is some object that allows fetching data from a DB or a service without being aware of security limitations. Don't allow direct access to it.
  • It has a "Secured" attribute with the value set to true, which means it will trigger some code using an AOP technique.
This is what the implementation of "Secured" should look like:


public class SecuredAttribute : AOPAttribute
{
    public override void OnSuccess(MethodExecutionArgs args)
    {
        // Filter the method's return value according to the current user's credentials.
        Credentials creds = GetUserCredentials();
        args.ReturnValue = creds.Filter(args.ReturnValue);
    }
}


GetUserCredentials() returns an object that captures the user groupings we talked about before, along with the logic to filter the returned data.
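To make the filtering concrete, here is a minimal, hypothetical sketch of Credentials that builds on the labeled data model sketched earlier (again, all names here are assumptions):

using System.Collections.Generic;
using System.Linq;

public class Credentials
{
    public ISet<string> Groups { get; set; }   // e.g. { "Biologists", "Oxford" }
    public Role Role { get; set; }             // e.g. Role.Scientist

    // Keep only the items this user is allowed to see; takes/returns object
    // so it can plug straight into args.ReturnValue in the attribute.
    public object Filter(object returnValue)
    {
        var items = (IEnumerable<ResearchData>)returnValue;
        return items.Where(item =>
            item.MinimumRole <= Role &&
            (item.Groups.Count == 0 || item.Groups.Overlaps(Groups))).ToList();
    }
}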

Summary

In this post, I gave an example of a security layer implementation. Of course, every system has its unique needs and should be considered independently. However, I feel there are general principles that hold true for all cases and should be followed as guidelines:

  • Keep your security flexible – you never know when a security level for data or users will change.
  • Separate concerns – let your business logic do its thing; don't mix in security. Try to do no more than marking a method with an attribute, and perhaps not even that. Notice how in the example the logic of filtering by credentials is centralized in a single location that is easy to understand and change. NEVER CHECK FOR USER CREDENTIALS IN DOMAIN CODE.
  • Identify your roles – This is important because once you've managed to map credential types to use cases, you've solved the main logical challenge. Don't get too stuck on this, though; if you separate concerns properly, you will be able to change it later on.
Good luck!

Sunday, March 3, 2013

Large object serialization with C#


Preface: I'm going to talk about serializing large objects (hundreds of MBs or even GBs in size). It's better to keep things small, but that isn't always possible without large architecture changes, so we decided to take serialization to its limits (where we are actually limited only by the PC's physical memory).

Let's say we have the classes:

[Serializable]
public class Result
{
    public string Uri { get; set; }
    public List<Data> AData { get; set; }
}

[Serializable]
public class Data
{
    public string Data1 { get; set; }
    public string Data2 { get; set; }
}
We want to binary-serialize the Result class with, for example, 10 million Data objects inside, in order to persist it to storage. Later, it should be deserialized back.

First, we used the .NET binary serializer (BinaryFormatter) and got:
System.Runtime.Serialization.SerializationException: The internal array cannot expand to greater than Int32.MaxValue elements. You can find an explanation of that issue here.
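For reference, the straightforward attempt looked roughly like this (a sketch, not our original code; the file name is made up):

using System.IO;
using System.Runtime.Serialization.Formatters.Binary;

var formatter = new BinaryFormatter();
using (FileStream stream = File.Create("result.bin"))
{
    // Throws SerializationException: the formatter's internal object-tracking
    // array cannot grow past Int32.MaxValue elements.
    formatter.Serialize(stream, result);
}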

The next step was to implement the ISerializable interface and handle the serialization of the AData collection explicitly. We used the Newtonsoft Json.NET serializer:
[Serializable]
public class Result : ISerializable
{
    public string Uri { get; set; }
    public List<Data> AData { get; set; }

    public Result()
    {
    }

    protected Result(SerializationInfo info, StreamingContext context)
    {
        Uri = info.GetString("Uri");
        AData = JsonConvert.DeserializeObject<List<Data>>(info.GetString("AData"));
    }

    public void GetObjectData(SerializationInfo info, StreamingContext context)
    {
        info.AddValue("Uri", Uri, typeof(string));
        info.AddValue("AData", JsonConvert.SerializeObject(AData, Formatting.None));
    }
}
It didn't work either; JsonConvert.SerializeObject builds the entire JSON document as one huge string in memory, so it blew up:
System.OutOfMemoryException: Exception of type 'System.OutOfMemoryException' was thrown.
   at System.Text.StringBuilder.ToString()
   at Newtonsoft.Json.JsonConvert.SerializeObject(Object value, Formatting formatting, JsonSerializerSettings settings) in JsonConvert.cs:line 755

Next we tried protobuf-net. You have to add attributes to your classes:
[Serializable]
[ProtoContract]
public class Data
{
    [ProtoMember(1)]
    public string Data1 { get; set; }

    [ProtoMember(2)]
    public string Data2 { get; set; }
}
We also added GZip compression support in the Result class via GZipStream:
[Serializable]
public class Result : ISerializable
{
    // Uri and AData properties as before...

    public void GetObjectData(SerializationInfo info, StreamingContext context)
    {
        info.AddValue("Uri", Uri, typeof(string));
        PopulateFieldWithData(info, "AData", AData);
    }

    protected Result(SerializationInfo info, StreamingContext context)
    {
        Uri = info.GetString("Uri");
        AData = GetObjectsByField<List<Data>>(info, "AData");
    }

    private static void PopulateFieldWithData<T>(SerializationInfo info, string fieldName, T obj)
    {
        using (MemoryStream compressedStream = new MemoryStream())
        using (MemoryStream byteStream = new MemoryStream())
        {
            // Serialize with protobuf-net first, then compress the raw bytes.
            Serializer.Serialize<T>(byteStream, obj);
            byteStream.Position = 0;

            using (GZipStream zipStream = new GZipStream(compressedStream, CompressionMode.Compress))
            {
                byteStream.CopyTo(zipStream);
            }

            info.AddValue(fieldName, compressedStream.ToArray());
        }
    }

    private static T GetObjectsByField<T>(SerializationInfo info, string dataField)
    {
        byte[] byteArray = (byte[])info.GetValue(dataField, typeof(byte[]));

        using (MemoryStream compressedStream = new MemoryStream(byteArray))
        using (MemoryStream dataStream = new MemoryStream())
        using (GZipStream uncompressedStream = new GZipStream(compressedStream, CompressionMode.Decompress))
        {
            // Decompress, then let protobuf-net rebuild the object graph.
            uncompressedStream.CopyTo(dataStream);
            dataStream.Position = 0;
            return Serializer.Deserialize<T>(dataStream);
        }
    }
}
This didn't work either. Even though it didn't crash, it apparently entered an endless loop.

Here, we realized that we need to split the AData collection during serialization/deserialization.
The main idea is to take, say, 1 million Data objects at a time, serialize them, and add each chunk to the SerializationInfo as a separate field. During deserialization these chunks are read back separately and merged into one collection. I updated the Result class with a few more functions:

private const string ADataCountField = "ADataCountField";
private const int NumOfDataObjectsPerSerializedPage = 1000000;

public void GetObjectData(SerializationInfo info, StreamingContext context)
{
    info.AddValue("Uri", Uri, typeof(string));
    SerializeAData(info);
}

private void SerializeAData(SerializationInfo info)
{
    // Number of pages needed to cover the whole collection.
    int numOfADataFields = AData == null ? 0 :
        (int)Math.Ceiling(AData.Count / (double)NumOfDataObjectsPerSerializedPage);

    info.AddValue(ADataCountField, numOfADataFields);

    for (int i = 0; i < numOfADataFields; i++)
    {
        // Serialize each page of up to 1 million objects as its own field.
        List<Data> page = AData.Skip(NumOfDataObjectsPerSerializedPage * i).Take(NumOfDataObjectsPerSerializedPage).ToList();
        PopulateFieldWithData(info, "AData" + i, page);
    }
}

protected Result(SerializationInfo info, StreamingContext context)
{
    Uri = info.GetString("Uri");
    DeserializeAData(info);
}

private void DeserializeAData(SerializationInfo info)
{
    AData = new List<Data>();
    int aDataFieldsCount = info.GetInt32(ADataCountField);

    for (int i = 0; i < aDataFieldsCount; i++)
    {
        // Read each page back and merge into a single collection.
        List<Data> dataObjects = GetObjectsByField<List<Data>>(info, "AData" + i);
        AData.AddRange(dataObjects);
    }
}
Finally, it worked!
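For completeness, here is a hypothetical end-to-end usage of the final version (the file name and the BuildHugeList helper are made up):

using System.IO;
using System.Runtime.Serialization.Formatters.Binary;

var result = new Result { Uri = "http://example.com/run42", AData = BuildHugeList() };

var formatter = new BinaryFormatter();
using (FileStream stream = File.Create("result.bin"))
{
    // GetObjectData pages AData into 1M-object, GZip-compressed chunks.
    formatter.Serialize(stream, result);
}

using (FileStream stream = File.OpenRead("result.bin"))
{
    // The protected constructor reads the chunks back and merges them.
    result = (Result)formatter.Deserialize(stream);
}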