Saturday, November 17, 2012

Structured vs Unstructured Data

Persistent data seems to fall into one of two categories: 1) structured data (like cells in a spreadsheet or a row/column intersection in a database table), which must adhere to fairly strict rules regarding type, size, or valid ranges; or 2) unstructured data (like photos, documents, or software), where the data can be much more free-form.

Databases are well equipped to handle structured data but generally do a poor job of managing large amounts of unstructured data (or blobs in database speak). File systems, on the other hand, were designed for large numbers of unstructured data objects, each wrapped in a metadata package called a file, but they generally do a poor job of handling structured data (although technically, databases themselves are almost always stored as a set of files in a file system volume).

When I first designed the Didget Management System, I concentrated solely on improving the handling of unstructured data. It was designed to be a replacement for file systems. Databases could be stored in a set of Didgets just as easily as in a set of files, but I planned to largely ignore structured data the way file systems do.

But with the introduction of the Didget Tags, I had to figure out how to handle large amounts of structured data as part of Didget metadata since each tag is defined with a schema and each tag value must adhere to this definition. I had to be able to assign each Didget a bunch of tags and then make it so I could query against the whole set of Didgets based on specific tag values. For example, "Find all Photo Didgets where .event.Vacation = Hawaii" would need to return a list of all photos that had been assigned this tag value. This feature is strikingly similar to executing an SQL query against a relational database.
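To make the tag query concrete, here is a minimal Python sketch of schema-checked tags and a lookup like the Hawaii example. It is only an illustration: the `Chamber` and `Didget` classes, their method names, and the "Aspen" value are hypothetical and are not the actual Didget API, and the linear scan stands in for whatever indexing a real implementation would use.

```python
from dataclasses import dataclass, field


@dataclass
class Didget:
    didget_id: int
    didget_type: str                          # e.g. "Photo"
    tags: dict = field(default_factory=dict)  # tag name -> value


class Chamber:
    """Toy in-memory stand-in for a Didget Chamber (hypothetical API)."""

    def __init__(self):
        self.tag_schema = {}  # tag name -> required Python type
        self.didgets = []

    def define_tag(self, name, value_type):
        self.tag_schema[name] = value_type

    def add(self, didget_type):
        didget = Didget(len(self.didgets) + 1, didget_type)
        self.didgets.append(didget)
        return didget

    def assign_tag(self, didget, name, value):
        # Every tag value must adhere to the tag's definition.
        if name not in self.tag_schema:
            raise KeyError(f"undefined tag: {name}")
        if not isinstance(value, self.tag_schema[name]):
            raise TypeError(f"{name} expects {self.tag_schema[name].__name__}")
        didget.tags[name] = value

    def find(self, didget_type, name, value):
        # Linear scan for clarity only (assumption: a real system would index).
        return [d for d in self.didgets
                if d.didget_type == didget_type and d.tags.get(name) == value]


chamber = Chamber()
chamber.define_tag(".event.Vacation", str)

beach = chamber.add("Photo")
chamber.assign_tag(beach, ".event.Vacation", "Hawaii")
slopes = chamber.add("Photo")
chamber.assign_tag(slopes, ".event.Vacation", "Aspen")
itinerary = chamber.add("Document")
chamber.assign_tag(itinerary, ".event.Vacation", "Hawaii")

# "Find all Photo Didgets where .event.Vacation = Hawaii"
hawaii_photos = chamber.find("Photo", ".event.Vacation", "Hawaii")
```

In this sketch the query matches only the first photo: the second photo carries a different value, and the document has the right value but the wrong Didget type.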

I still didn't make the connection of how this feature could add a whole new dimension to the Didget Management System until one of the programmers helping me with this project pointed out how similar a Didget is to a row in a NoSQL database table. In fact, the entire Didget Chamber could be thought of as a huge table of columns and rows where every column is a tag and every row is a Didget. In our system there can be tens of thousands of different tags defined (columns) and billions of Didgets (rows). Each Didget can have up to 255 different tag/value assignments.
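The table analogy can be sketched in a few lines of Python. Each Didget is stored as a sparse row holding only the tags it actually has, and the full table is just the union of all tags as columns. The function names and the `.person.Name` tag are made up for illustration; only the 255-assignment limit comes from the description above.

```python
MAX_TAGS_PER_DIDGET = 255  # limit stated in the post


def assign_tag(row, tag, value):
    """Add one tag/value pair to a sparse row, enforcing the per-Didget limit."""
    if tag not in row and len(row) >= MAX_TAGS_PER_DIDGET:
        raise ValueError("a Didget can hold at most 255 tag/value assignments")
    row[tag] = value


def as_table(rows):
    """Expand sparse rows into a header plus dense rows (None = no value)."""
    columns = sorted({tag for row in rows.values() for tag in row})
    table = [[didget_id] + [row.get(tag) for tag in columns]
             for didget_id, row in sorted(rows.items())]
    return ["didget_id"] + columns, table


# Three Didgets (rows), two tags (columns); not every row has every tag.
rows = {1: {}, 2: {}, 3: {}}
assign_tag(rows[1], ".event.Vacation", "Hawaii")
assign_tag(rows[1], ".person.Name", "Alice")
assign_tag(rows[2], ".event.Vacation", "Hawaii")
assign_tag(rows[3], ".person.Name", "Bob")

header, table = as_table(rows)
for line in [header] + table:
    print(line)
```

Storing rows sparsely matters here because, with tens of thousands of possible columns and at most 255 values per row, almost every cell in the dense table would be empty.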

Since each Didget can also have a data stream assigned to it, this data stream could be thought of as just another column in the table (although it is a very special column in that its contents are not defined in a schema and its value can be unstructured and up to 16 TB in length). The fields of the Didget metadata record, likewise, could be thought of as special columns in this huge table. We can query based on Didget type, stream length, event stamps, attributes, and the like.
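A rough Python sketch of that idea follows: metadata fields, tag values, and the data stream are flattened into one row so a single predicate can filter on any of them. The field names `type` and `stream_length`, and the `as_row` and `select` helpers, are assumptions made for this example, not the real metadata layout.

```python
from dataclasses import dataclass, field


@dataclass
class Didget:
    didget_type: str
    stream: bytes = b""                       # unstructured data stream
    tags: dict = field(default_factory=dict)  # schema-defined tag values


def as_row(didget):
    """Flatten metadata, tags, and the stream into one row (a dict)."""
    row = {
        "type": didget.didget_type,            # metadata "column"
        "stream_length": len(didget.stream),   # metadata "column"
        "stream": didget.stream,               # the special unstructured column
    }
    row.update(didget.tags)                    # one column per assigned tag
    return row


def select(didgets, predicate):
    """Return the rows for which the predicate holds."""
    return [row for row in map(as_row, didgets) if predicate(row)]


didgets = [
    Didget("Photo", b"\x00" * 2048, {".event.Vacation": "Hawaii"}),
    Didget("Photo", b"\x00" * 100),
    Didget("Document", b"x" * 5000),
]

# Query on metadata columns: photos with a stream longer than 1 KB.
large_photos = select(
    didgets, lambda r: r["type"] == "Photo" and r["stream_length"] > 1024
)
```

The same `select` call could just as easily filter on a tag column, which is the point of treating everything as one table.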

What this means is that every Didget can be treated either like a file or like a row in a database table. Applications can perform operations against a set of Didgets using an API that is very file oriented or by using one more familiar to database operators.

Since the Didget Management System was designed to scale out by breaking a single chamber into multiple pieces and distributing them across a set of servers (local or remote), it could compete directly against large distributed NoSQL systems like CouchDB, MongoDB, Cassandra, or Bigtable just as easily as it could against Hadoop (HDFS) in the distributed file system arena.
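As a rough illustration of scaling out, the Python sketch below splits a chamber into shards and answers a tag query by asking every shard and merging the results. How the real system partitions a chamber is not described here, so the modulo placement rule, the class names, and the sample values are all assumptions, and each `Shard` object merely stands in for a server.

```python
class Shard:
    """One piece of a chamber, standing in for a server (hypothetical)."""

    def __init__(self):
        self.rows = {}  # didget_id -> {tag: value}

    def find(self, tag, value):
        return [didget_id for didget_id, tags in self.rows.items()
                if tags.get(tag) == value]


class DistributedChamber:
    def __init__(self, shard_count):
        self.shards = [Shard() for _ in range(shard_count)]

    def _shard_for(self, didget_id):
        # Assumed placement rule: simple modulo partitioning by Didget ID.
        return self.shards[didget_id % len(self.shards)]

    def put(self, didget_id, tags):
        self._shard_for(didget_id).rows[didget_id] = dict(tags)

    def find(self, tag, value):
        # Scatter the query to every shard, then gather and merge the hits.
        hits = []
        for shard in self.shards:
            hits.extend(shard.find(tag, value))
        return sorted(hits)


chamber = DistributedChamber(shard_count=3)
for didget_id in range(1, 10):
    place = "Hawaii" if didget_id % 2 else "Aspen"
    chamber.put(didget_id, {".event.Vacation": place})

hawaii_ids = chamber.find(".event.Vacation", "Hawaii")
```

In a real deployment the per-shard lookups would run in parallel over the network; the loop here only shows the scatter-and-gather shape of the query.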

Companies or individuals that work with large amounts of "Big Data" would no longer need two separate systems, one to handle their unstructured data and another to handle their structured data. With the Didget Management System, all their data (structured and unstructured) could be handled in a single distributed system and managed with the same set of tools and policies.
