Object Storage for Big Unstructured Data

Big Data is Big, but it also causes a lot of confusion. Big Data is used for anything related to storage these days, so people no longer know what it actually is. Is it Hadoop? Is it analytics? It doesn’t need to be that complicated though. There are two kinds of Big Data: Big Data (for analytics) and Big Unstructured Data.

Big Data for analytics is a paradigm that became popular in the previous decade. A lot of the innovation came from research projects. New technology enabled researchers in many different domains to capture data in a way they had never been able to do before. In agriculture, for example, ploughs were fitted with sensors that would send little bits of information to a central system (over satellite). Every couple of feet these sensors would measure what’s in the ground (minerals, for example), how humid the soil is, and so on. Based on that, large agriculture companies could then make better decisions on where to grow which crop.

The problem was that the traditional systems for storing this massive amount of small data (relational databases) were no longer adequate. Systems like MapReduce and Hadoop were created as an alternative and would store these massive volumes of files as concatenated “Big” files. Big Data was born: Big Data for semi-structured data.

Today we are seeing a similar trend with unstructured data. Studies show that data storage requirements will grow 30X over the next decade. 80% of that data consists of large files: office documents, movies, music, pictures. Just as with databases in the previous decade, traditional storage – file systems – is not the best way to store this data. File systems will not scale sufficiently and will actually become obsolete as applications take over the role of the file system.

A nice example is what Google Picasa does for us: in the old days we would store pictures nicely organized in a file system (hopefully with some backups). One folder per year, one per month in each year, one per holiday or party. Today, we just dump all the pictures in one folder and Picasa will sort them for us based on date, location, face recognition (!) or other metadata. With an intelligent query, we can display the right pictures very fast, much faster than browsing the file system. We don’t even have to worry about backups as we can store copies in the cloud automatically.

The new paradigm that will help us store these massive amounts of unstructured data is Object Storage. Object Storage systems are uniformly scalable pools of storage that are accessible through a REST interface. Files – objects – are dumped into the pool and an identifier is kept to locate the object when it is needed. Applications that are designed to run on top of object storage use these identifiers through the REST interface. A good analogy is valet parking vs. self-parking your car. When you self-park you have to remember the lot, the floor, the aisle and so on (the file system); with valet parking you get a receipt when you hand over your keys and later use that receipt to get your car back (the object identifier).
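To make the REST idea concrete, here is a minimal sketch in Python using the requests library against a hypothetical, S3-style endpoint. The endpoint URL, bucket name and key are made up for illustration and real services would also require authentication; the point is simply that the application keeps an identifier (the “valet receipt”) instead of a path.

```python
import requests

# Hypothetical object storage endpoint (illustrative only).
ENDPOINT = "https://objectstore.example.com"
BUCKET = "photos"

def put_object(key, data):
    """Store an object; the key is the 'valet receipt' the application keeps."""
    url = f"{ENDPOINT}/{BUCKET}/{key}"
    # A real system would add authentication headers here.
    resp = requests.put(url, data=data)
    resp.raise_for_status()
    return key

def get_object(key):
    """Retrieve the object later using only its identifier."""
    url = f"{ENDPOINT}/{BUCKET}/{key}"
    resp = requests.get(url)
    resp.raise_for_status()
    return resp.content

# Usage: the application never cares where the bytes physically live.
with open("holiday-2011.jpg", "rb") as f:
    receipt = put_object("holiday-2011.jpg", f.read())
photo = get_object(receipt)
```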

So what is needed to build an object storage system? Basically just lots of disks, a REST API and a way to provide durability. This could be done with traditional systems like RAID, but the problem is that RAID requires a huge amount of overhead to provide acceptable availability. The more data we store, the more painful it is to need 200% overhead, as some systems do. The smarter way to provide durability for object storage is erasure coding.

Erasure coding stores objects as equations, which are spread over the entire storage pool: data objects are split up into sub-blocks, from which equations are calculated. According to the availability policy, a surplus of equations is calculated and the equations are spread over as many disks as possible (also policy-defined). As a result, when a disk breaks, the system still has sufficient equations to restore the original data block. When a disk fails, the system can recalculate equations as a background task to bring the number of available equations back to a healthy level. A pioneer of this technology is Amplidata, which uses low-power Atom processors in its hardware to reduce power costs. Because the entire system – all storage nodes – can recalculate missing equations as a background task, Amplidata figured out it was not necessary to use the high-end nodes that RAID systems need (to speed up rebuilds and avoid performance losses).
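As a rough illustration of the “equations” idea, here is a toy sketch in Python: a data block is split into sub-blocks and one extra XOR parity block is computed, so any single lost sub-block can be rebuilt from the survivors. This is only the simplest possible case; production systems such as Amplidata’s use Reed–Solomon-style codes with several parity blocks and policy-driven spreading, not this one-parity scheme.

```python
def xor_blocks(a, b):
    """Byte-wise XOR of two equally sized blocks."""
    return bytes(x ^ y for x, y in zip(a, b))

def encode(data, k=4):
    """Split data into k sub-blocks and add one XOR parity block.
    (Real erasure codes compute several independent 'equations'.)"""
    size = -(-len(data) // k)  # sub-block size, rounded up
    blocks = [data[i * size:(i + 1) * size].ljust(size, b"\0") for i in range(k)]
    parity = blocks[0]
    for b in blocks[1:]:
        parity = xor_blocks(parity, b)
    return blocks + [parity]   # k+1 fragments, spread over k+1 disks

def decode(fragments, lost_index):
    """Rebuild a single missing fragment from the surviving ones."""
    survivors = [f for i, f in enumerate(fragments) if i != lost_index]
    rebuilt = survivors[0]
    for f in survivors[1:]:
        rebuilt = xor_blocks(rebuilt, f)
    return rebuilt

# Usage: lose any one of the 5 fragments and recompute it.
frags = encode(b"a large unstructured object...")
frags[2] = decode(frags, lost_index=2)  # 'disk 2' failed; rebuild in background
```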

Apart from providing a more efficient and more scalable way to store data, object storage based on erasure coding can save up to 70% on the overall TCO thanks to reduced raw storage needs and reduced power needs (less hardware plus low-power devices saves on power and cooling). Also, uniformly scalable storage systems with an automated healing mechanism drastically reduce the management effort and cost.
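As a back-of-the-envelope illustration of where the raw storage savings come from (the policy parameters below are assumptions for the sake of the example, not any vendor’s actual numbers): triple replication carries 200% overhead, while a scheme that can rebuild an object from any 16 of 20 fragments carries only 25% overhead.

```python
def raw_storage_needed(usable_tb, data_fragments, total_fragments):
    """Raw capacity required for a given usable capacity and coding policy."""
    return usable_tb * total_fragments / data_fragments

usable = 1000  # TB of user data

# Triple replication: every object stored 3 times (200% overhead).
replication = raw_storage_needed(usable, 1, 3)

# Example erasure coding policy: 16 data + 4 parity fragments;
# any 16 of the 20 fragments are enough to rebuild an object (25% overhead).
erasure = raw_storage_needed(usable, 16, 20)

print(f"Replication:     {replication:.0f} TB raw")            # 3000 TB
print(f"Erasure coding:  {erasure:.0f} TB raw")                 # 1250 TB
print(f"Raw capacity saved: {1 - erasure / replication:.0%}")   # ~58%
```

The remaining savings toward the 70% TCO figure quoted above would come from power, cooling and management, not from raw capacity alone.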

So what are the use cases for object storage? As data needs grow, object storage will become the storage paradigm of choice in more and more environments, but already today we see the need in a number of situations:

Building live archives

Object storage enables companies to re-activate their data. Currently, most companies see data more as a burden than anything else: the data will never be used again but needs to be archived for a whole host of reasons. Yet this data actually has a lot of value. With live archives, employees have faster access to older data and can put those valuable resources to use. With traditional storage it would never be feasible to build disk-based archives for this purpose, as the overhead would make it too costly.

Online applications

Most of the data-intensive online – cloud – applications are built on public clouds such as Amazon S3, which are early implementations of Object Storage. The benefits for the application providers are many: a simple programming interface, low cost and fast time to market. As their data sets grow, those companies might move to private Object Storage implementations to reduce costs even further.

Media and entertainment

Traditionally, the M&E industry has been very much file-oriented, but we are seeing growing interest in object storage, partly to optimize efficiency and reduce costs, and partly because this industry is already hitting the limits of its file systems.

These are just a few examples of Object Storage implementations for Big Unstructured Data. Object Storage was not built to replace any of the current storage architectures. Much like NAS filers were designed in the ’90s because block storage (SAN was designed when databases were king) was not optimized for unstructured data, Object Storage will find its place next to those two for Big Unstructured Data.


~ by tomleyden on January 16, 2012.

