Sunday, November 18, 2012

The Big Picture

So far, I have posted several entries that each explain a piece of the Didget Management System and how that feature adds specific benefits over conventional file system or database architectures. I thought I would devote this 20th post to describing the entire system with all the pieces put together, to give the reader an idea of how it will look once completed.

The Didget Realm represents a world-wide collection of individual Didget containers called Chambers. Each Chamber is managed by its own instance of the Didget Manager, and together they form a single node in this global data storage network. Each node can communicate with every other node to exchange Didget information. With the use of Policy Didgets, this information can be exchanged automatically, without direct commands from a running application. Nodes can be grouped into domains or federations, allowing them to exchange more information with each other than two nodes outside a common domain could.

Each Chamber can store several billion individual Didgets. The system is designed to manage huge numbers of Didgets without sacrificing speed. Simple queries against a Chamber holding over 10 million Didgets are designed to execute in under one second, and even the most complex queries in under ten seconds, with the Didget Manager running on a single desktop system. Chambers with hundreds of millions or billions of Didgets can be split into many individual pieces and managed by many separate systems in a distributed environment, using map-reduce algorithms to perform lightning-fast queries.

A Chamber that has been converted to a distributed system looks exactly the same, to an application or to another node in the global network, as one that has not been split into pieces. In other words, applications do not need to know whether they are communicating with a single-piece Chamber running on a laptop or with a Chamber that has been split into 100 different pieces managed by 1000 different servers. The only difference is the speed at which a query or other command executes when the number of Didgets in the Chamber is extraordinarily large.

Using Policy Didgets and Security Didgets, operations against all the Didgets within a Chamber can be tightly controlled. Sensitive information can be protected, and a whole host of data management functions can happen automatically, either when a certain amount of time has elapsed or when certain events occur.

Individual Didgets can be classified, tagged, and grouped together in ways files or database rows never could. Copying or moving a Didget from one Chamber to another does not cause it to lose any of its metadata or to become any less secure than the original. Special attributes can be assigned to each Didget that enable it to be managed by the Didget Manager in very specific ways. Several of these attributes represent unique features that I have not seen on any other system.

Applications can query for a set of Didgets based on any of these metadata fields and perform operations against the whole set (if permissions allow).

Didgets can represent either structured or unstructured data. All the management functions work the same, regardless of the data type. Didgets can be accessed using file-like APIs or database-like queries.
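To illustrate the dual access style, here is a minimal sketch in Python. The Didget Manager's real API is not public, so the class and method names here (`Chamber`, `open_stream`, `query`) are hypothetical stand-ins; only the idea of one container serving both file-like reads and database-like queries comes from the post.

```python
# Illustrative sketch only: 'Chamber', 'open_stream', and 'query' are
# hypothetical names, not the actual Didget Manager API.

class Chamber:
    def __init__(self):
        self._didgets = {}  # id -> {"tags": {...}, "stream": bytes}

    def create(self, didget_id, tags=None, stream=b""):
        self._didgets[didget_id] = {"tags": dict(tags or {}), "stream": stream}

    # File-like access: read a Didget's data stream by id.
    def open_stream(self, didget_id):
        return self._didgets[didget_id]["stream"]

    # Database-like access: select Didgets whose tags match all criteria.
    def query(self, **criteria):
        return [did for did, d in self._didgets.items()
                if all(d["tags"].get(k) == v for k, v in criteria.items())]

chamber = Chamber()
chamber.create("photo-1", tags={"type": "Photo", "event.Vacation": "Hawaii"},
               stream=b"\x89PNG...")
chamber.create("doc-1", tags={"type": "Document"}, stream=b"hello")

print(chamber.open_stream("doc-1"))   # file-like read of the stream
print(chamber.query(type="Photo"))    # database-like query over tags
```

The point of the sketch is that both calls go through the same container; the management functions below would apply identically no matter which access style an application prefers.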

Inventory, search, backup, recovery, synchronization, organization, version control, and licensing are just a few of the management functions provided by the system. In every case, these functions are designed to perform faster, and with simpler mechanisms, than their conventional counterparts.

In summary, I think this system offers a far superior data management environment to conventional file systems or NoSQL database environments. Once data is created as Didgets (or converted from legacy systems), it will be far easier to manage and will provide significantly greater value to the end user than it would as files or database rows.

The Didget Management system will revolutionize the way the whole world looks at data going forward. (You heard it here first!)

Saturday, November 17, 2012

Structured vs Unstructured Data

Persistent data seems to fall into one of two categories: 1) structured data (like cells in a spreadsheet or a row/column intersection in a database table), which must adhere to fairly strict rules regarding type, size, or valid ranges; or 2) unstructured data, like photos, documents, or software, where the data can be much more free-form.

Databases are well equipped to handle structured data but generally do a poor job of managing large amounts of unstructured data (or blobs, in database speak). File systems, on the other hand, were designed for large numbers of pieces of unstructured data, each wrapped in a metadata package called a file, but generally do a poor job of handling structured data (although technically, databases themselves are almost always stored as a set of files in a file system volume).

When I first designed the Didget Management System, I concentrated solely on improving the handling of unstructured data. It was designed to be a replacement for file systems. Databases could be stored in a set of Didgets just as easily as in a set of files, but I planned to largely ignore structured data the way file systems do.

But with the introduction of Didget Tags, I had to figure out how to handle large amounts of structured data as part of Didget metadata, since each tag is defined with a schema and each tag value must adhere to that definition. I had to be able to assign each Didget a set of tags and then query the whole set of Didgets based on specific tag values. For example, "Find all Photo Didgets where .event.Vacation = Hawaii" would need to return a list of all photos that had been assigned this tag value. This feature is strikingly similar to executing an SQL query against a relational database.
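The similarity to SQL can be made concrete. As a rough analogy (the mapping is mine, not part of any Didget specification), the tag `.event.Vacation` becomes a column in a SQLite table, and the tag query above becomes an ordinary SELECT:

```python
# Rough analogy only: modeling the tag query above as an SQL query.
# The table layout and column names are my own illustration.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE didgets (id TEXT, type TEXT, event_vacation TEXT)")
conn.executemany("INSERT INTO didgets VALUES (?, ?, ?)", [
    ("photo-1", "Photo", "Hawaii"),
    ("photo-2", "Photo", "Paris"),
    ("doc-1",   "Document", None),
])

# 'Find all Photo Didgets where .event.Vacation = Hawaii'
rows = conn.execute(
    "SELECT id FROM didgets WHERE type = 'Photo' AND event_vacation = 'Hawaii'"
).fetchall()
print(rows)  # [('photo-1',)]
```

The schema attached to each tag plays the same role the column type plays here: it constrains what values a tag assignment may hold.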

I still hadn't made the connection that this feature could add a whole new dimension to the Didget Management System until one of the programmers helping me with this project pointed out how similar a Didget is to a row in a NoSQL database table. In fact, the entire Didget Chamber could be thought of as a huge table of columns and rows, where every column is a tag and every row is a Didget. In our system there can be tens of thousands of different tags defined (columns) and billions of Didgets (rows). Each Didget can have up to 255 different tag/value assignments.

Since each Didget can also have a data stream assigned to it, that data stream could be thought of as just another column in the table (although a very special column, in that its contents are not defined by a schema and its value can be unstructured and up to 16 TB in length). The Didget metadata record, likewise, could be thought of as a set of special columns in this huge table. We can query based on Didget type, stream length, event stamps, attributes, and the like.
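The "huge sparse table" view described above can be sketched as a small data structure. The 255-tag limit, the metadata record, and the unstructured stream come from the post; the class and field names are my own illustration of how such a row might be modeled:

```python
# Sketch of one 'row' in the chamber-as-table view. Only the limits and
# column roles come from the post; names here are hypothetical.

MAX_TAGS = 255  # each Didget may carry at most 255 tag/value assignments

class DidgetRow:
    def __init__(self, didget_type, stream=None):
        # Metadata record: a set of special, always-present columns.
        self.meta = {"type": didget_type, "stream_length": len(stream or b"")}
        # Sparse tag columns: only tags actually assigned are stored.
        self.tags = {}
        # The data stream: one special unschematized column, up to 16 TB.
        self.stream = stream

    def set_tag(self, name, value):
        if name not in self.tags and len(self.tags) >= MAX_TAGS:
            raise ValueError("a Didget may carry at most 255 tag assignments")
        self.tags[name] = value

row = DidgetRow("Photo", stream=b"...jpeg bytes...")
row.set_tag(".event.Vacation", "Hawaii")
print(row.meta, row.tags)
```

Storing tags sparsely is what makes "tens of thousands of columns" tractable: a row only pays for the handful of tags it actually uses.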

What this means is that every Didget can be treated somewhat like a file, or somewhat like a row in a database. Applications can perform operations against a set of Didgets using an API that is very file-oriented, or using one more familiar to database operators.

Since the Didget Management System was designed to scale out by breaking a single chamber into multiple pieces and distributing them across a set of servers (local or remote), it could compete directly against large distributed NoSQL systems like CouchDB, MongoDB, Cassandra, or BigTable just as easily as it could against Hadoop in the distributed file system arena.

Companies or individuals that work with large amounts of "Big Data" would no longer need two separate systems, one to handle their unstructured data and another to handle their structured data. With the Didget Management System, all their data (structured and unstructured) could be handled in a single distributed system and managed with the same set of tools and policies.

Monday, November 12, 2012

Policies

Conventional file systems treat all files as black boxes and almost never perform any direct manipulation of them. If a file is created, modified, moved, or deleted, it is done as a direct command from either the operating system or an application. All file management functions, such as organization, backup, synchronization, or cleanup, are performed by something other than the file system itself.

In the Didget system, many of these management tasks can be performed by the Didget Manager itself, independent of any running program. Programs can schedule specific tasks to execute at specific times, or when certain events occur, with the use of Policy Didgets. These Didgets are somewhat similar to database triggers. They can cause the Didget Manager to manipulate data even when the application that scheduled the task is no longer running on the system.

Just like all the other Didgets in the system, Policy Didgets can be created, protected, queried, synchronized, and deleted. They can have tags attached to them to help in finding or organizing different policies. They can have a data stream that contains specific instructions or program extensions or that logs results as the policy executes. Just about any conceivable data management function could be implemented or at least facilitated using these special Didgets.

For example, an application could create a policy that automatically adds any new photo with a .event.Vacation tag to a List Didget called "Vacation Photo Album". At the same time, it could search for another List Didget with a name matching the tag value (e.g. if .event.Vacation = "Hawaii", it would look for a list where .didget.Name = "Hawaii Photo Album") and add the photo to that list, creating the list first if it did not exist.
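The photo-album policy just described can be sketched as a small trigger function. The album names and the .event.Vacation tag come from the example above; the trigger wiring and data structures are assumptions about how such a policy body might look, modeled here with plain dicts and lists:

```python
# Sketch of the photo-album policy. List Didgets are modeled as plain
# Python lists keyed by name; the trigger mechanism is assumed.

lists = {"Vacation Photo Album": []}   # list name -> member Didget ids

def on_new_photo(didget_id, tags):
    """Policy body: runs whenever a new Photo Didget is created."""
    place = tags.get(".event.Vacation")
    if place is None:
        return  # policy only applies to vacation-tagged photos
    # Always add the photo to the master vacation album.
    lists["Vacation Photo Album"].append(didget_id)
    # Add it to the per-destination album (e.g. 'Hawaii Photo Album'),
    # creating that list first if it does not exist yet.
    album = f"{place} Photo Album"
    lists.setdefault(album, []).append(didget_id)

on_new_photo("photo-1", {".event.Vacation": "Hawaii"})
on_new_photo("photo-2", {".event.Vacation": "Hawaii"})
print(lists)
```

Because the Didget Manager itself would run this body, the albums keep filling in even after the application that installed the policy has exited.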

In another example, an application could create a policy that automatically backs up all new or modified Private Didgets to a Chamber located in the cloud every Monday morning. This would create an incremental backup of everything the user created on that system during the week.

In yet another example, an application could create a policy that automatically synchronizes all new photos and documents with a Chamber located on a phone every time the phone is connected to the desktop.

Policy Didgets could be built and maintained to enforce company policies governing data protection, retention, and validation. Entire workflow systems could be driven by carefully crafted Policy Didgets by having data created, tagged, and organized as each step in the workflow progresses.

Saturday, November 3, 2012

How Does It Scale?

The Didget Manager is designed to perform a variety of data management functions against a set of storage containers that may be attached to a single system or spread across several separate systems.

These functions include:

1) Backup
2) Synchronization
3) Replication
4) Inventory
5) Search
6) Classification
7) Grouping
8) Activation (licensing)
9) Protection
10) Archiving
11) Configuration
12) Versioning
13) Ordering
14) Data Retention

In order to properly perform each of these functions, a system is needed that can operate against all kinds of data sets consisting of structured and/or unstructured data, from very small sets to extremely large sets (i.e. "Big Data"). A legitimate question for any system is "How does it scale?"

When it comes to the term "scale", I define it in three dimensions: "Scale In", "Scale Out", and "Scale Up".

"Scale In" refers to the ability of the system's algorithms to properly handle large amounts of data within a single storage container, given a fixed amount of hardware resources on a single system. File systems have a limited ability to scale in this manner. For example, the NTFS file system was designed to hold just over 4 billion files in a single volume. However, each file requires a File Record Segment (FRS) that is 1024 bytes long. This means that if you have 1 billion files in a volume, you must read approximately 1 TB of data from that volume just to access all the file metadata. If you want to keep all that metadata in system memory in order to perform multiple searches through it quickly, you would need a terabyte of RAM. Regular file searches through that metadata can be painfully slow even when it is all in RAM, due to the antiquated algorithms of file system design.

The Didget system was designed to handle billions of Didgets and perform fast searches through the metadata even when limited RAM is available. If the same 1 billion files were converted from files to Didgets, the system would only need to read 64 GB of metadata off the disk, and would need only 64 GB of RAM to keep it all in system memory. This is only 1/16 of the requirements for NTFS. Searches through that metadata would be hundreds of times faster than with file systems.
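The back-of-the-envelope numbers above are easy to check. Using the post's figures (a 1024-byte FRS per NTFS file, and the 64 bytes of metadata per Didget implied by 64 GB for 1 billion Didgets):

```python
# Checking the metadata-size arithmetic from the post.
files = 1_000_000_000
ntfs_frs_bytes = 1024    # NTFS File Record Segment per file
didget_meta_bytes = 64   # per-Didget metadata implied by the post

ntfs_total = files * ntfs_frs_bytes        # total NTFS metadata
didget_total = files * didget_meta_bytes   # total Didget metadata

print(ntfs_total / 1e12, "TB of NTFS metadata")      # 1.024 TB
print(didget_total / 1e9, "GB of Didget metadata")   # 64.0 GB
print(ntfs_total // didget_total, "x reduction")     # 16 x reduction
```

So "approximately 1 TB" versus 64 GB, and the 1/16 ratio, all follow directly from the two per-record sizes.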

"Scale Out" refers to the ability of the system to improve performance by adding additional resources and performing operations in parallel. This can be accomplished in two ways: multiple computing systems can operate against a single container, or a single container can be split into multiple pieces and distributed out to those systems. Hadoop's HDFS is a popular open-source distributed file system that spreads file data across many separate systems in order to service data requests in parallel. It has a serious limitation in that file metadata is stored on a single "NameNode", which has both availability and performance ramifications. It was designed more for smaller sets of extremely large files than for extremely large sets of smaller files. Most other traditional file systems were never designed either to operate in parallel or to be split up.

The Didget system was designed for both kinds of parallel processing. Multiple systems can operate largely in parallel against a single container since all the metadata structures were designed for locking at the block level. When a system needs to update a piece of metadata, it does not need to establish a "global lock" on the container. It only needs to lock a small portion of the metadata where the update is applicable. This means that thousands of systems can be creating, deleting, and updating Didgets within a single container at the same time. Each container was also designed to be split up and distributed across multiple systems. Both the data streams and the Didget metadata can be split up and distributed. Map-Reduce algorithms are used to query against many of these container pieces in parallel.
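The map-reduce query style described above can be sketched in a few lines: each piece of a split container is scanned independently (the map step), and the partial results are merged (the reduce step). This runs in-process with threads; a real deployment would fan the map step out to separate servers. The shard layout and function names are my own illustration:

```python
# Minimal map-reduce sketch of a query over a chamber split into pieces.
# In-process threads stand in for the separate servers described above.
from concurrent.futures import ThreadPoolExecutor

pieces = [  # each piece holds a shard of (didget_id, tags) records
    [("p1", {"type": "Photo"}), ("d1", {"type": "Document"})],
    [("p2", {"type": "Photo"})],
    [("d2", {"type": "Document"}), ("p3", {"type": "Photo"})],
]

def map_scan(piece, **criteria):
    # Map step: scan one piece for Didgets matching all criteria.
    return [did for did, tags in piece
            if all(tags.get(k) == v for k, v in criteria.items())]

def distributed_query(**criteria):
    # Fan out one scan per piece, then merge (reduce) the partial results.
    with ThreadPoolExecutor() as pool:
        partials = pool.map(lambda p: map_scan(p, **criteria), pieces)
    return sorted(did for part in partials for did in part)

print(distributed_query(type="Photo"))  # ['p1', 'p2', 'p3']
```

Because each piece is scanned independently, query time is governed by the largest piece rather than the whole container, which is what makes the approach attractive for billions of Didgets.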

"Scale Up" refers to the ability of a single management system to manage data from small sets on simple devices to extremely large data sets on very complex hardware systems. Most data management systems today don't scale up very well. For example, backup programs that work well on pushing data from a single PC to the cloud do not generally work well as enterprise solutions. Users typically need separate data management systems for their home environment and for their work environment. As a business grows from a small business to a medium sized business to a large enterprise, it often must abandon old systems and adopt new systems as its data set grows.

The Didget system was designed to work essentially the same whether it is managing a set of a few hundred Didgets on a mobile phone or it is managing billions of Didgets spread across thousands of different servers. Additional modules may be required and enhanced policies would need to be implemented for the larger environment to function effectively, but the two systems would function nearly identically from the user's (or application's) point of view. Applications that use the Didget system to store their data would not need to know which of the two environments was in play.