Saturday, July 13, 2013

Data Managers

Over the years, a number of systems have been created to help users manage their data. I call these systems "Data Managers". There are two types: primary data managers and secondary data managers.

Primary data managers are very general-purpose in nature and are widely adopted in the computing world. File systems, databases, and web servers fall into this category. More recent members of this category include distributed file systems like Hadoop's HDFS and cloud offerings like Amazon S3. These newer systems are gaining greater acceptance as "Big Data" becomes more pervasive and as users demand more mobile access to all their data.

Secondary data managers are generally more specialized in the types of data they manage. They almost always utilize the services of a primary data manager to store their underlying data. Examples of these kinds of data managers include Apple's iTunes for managing music and Google's Picasa for managing photos. They typically keep most of their unstructured data as files in a file system and create a proprietary database for storing extra metadata. These data managers may also integrate with cloud services to give the user a virtual view of their data even when it may be spread across several systems. Unfortunately, these secondary data managers are nearly always in danger of interference from other programs and must rely upon the security measures offered by the primary data manager. If another application deletes, moves, or renames one or more of the files it manages, the secondary data manager can often have trouble reconciling those changes. If another program deletes one or more of its core metadata files (i.e. its database), then the secondary data manager can fail completely.
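
To make that fragility concrete, here is a minimal Python sketch of a secondary data manager. It is only an illustration under my own assumptions, not how iTunes or Picasa actually work: the `tracks` table, the file names, and the functions `open_catalog`, `add_track`, `find_orphans`, and `demo` are all hypothetical. The media files sit in an ordinary directory, the metadata sits in a separate SQLite database, and a rename by "another program" leaves the catalog pointing at a file that is no longer there.

```python
import os
import sqlite3
import tempfile


def open_catalog(db_path):
    """Open (or create) the hypothetical metadata database."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS tracks (path TEXT PRIMARY KEY, title TEXT)"
    )
    return conn


def add_track(conn, path, title):
    """Record metadata for a file that lives in the ordinary file system."""
    conn.execute("INSERT OR REPLACE INTO tracks VALUES (?, ?)", (path, title))
    conn.commit()


def find_orphans(conn):
    """Return catalog entries whose underlying file no longer exists."""
    rows = conn.execute("SELECT path, title FROM tracks ORDER BY path").fetchall()
    return [(path, title) for path, title in rows if not os.path.exists(path)]


def demo():
    """Simulate another program renaming a managed file behind the manager's back."""
    with tempfile.TemporaryDirectory() as library:
        song_a = os.path.join(library, "a.mp3")
        song_b = os.path.join(library, "b.mp3")
        for p in (song_a, song_b):
            with open(p, "wb") as f:
                f.write(b"audio bytes")

        conn = open_catalog(os.path.join(library, "catalog.db"))
        add_track(conn, song_a, "Song A")
        add_track(conn, song_b, "Song B")

        # The file system happily allows this; the catalog is never told.
        os.rename(song_b, os.path.join(library, "renamed.mp3"))

        orphans = find_orphans(conn)
        conn.close()
        return [title for _, title in orphans]


if __name__ == "__main__":
    print(demo())
```

The sketch can detect that "Song B" has gone missing, but it has no way to know that `renamed.mp3` is the same song. And if the `catalog.db` file itself were deleted, every title would be lost even though the audio files remained.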

The Didget Management System is a primary data manager. It not only provides new functionality that previous data managers lack, but it has also been designed to supplant them. This is very different from other primary data managers. Databases, for example, were designed to manage structured data in ways that file systems never could, but were never designed to handle unstructured data well enough to make file systems unnecessary. A consequence of that strategy is that as each of the other primary data managers entered the market, we ended up with yet another "silo" holding a portion of our data.

That is why I designed the Didget Management System to manage both structured and unstructured data well. It is designed to manage that data in both simple configurations and in distributed cluster environments. When the amount of data grows from a few thousand pieces of information to billions of pieces utilizing petabytes of storage, there will not be a costly transition point where all the existing data must be migrated to an entirely new system. If we are successful, not only will new data be created as Didgets instead of as files or traditional database tables, but all the old data will be converted to Didgets as well. Our goal is to replace those other primary data managers completely.

In order to realize that goal, the Didget Management System has to perform all the critical data management functions of the systems it is replacing in addition to its new feature set. It cannot just be 5%, 10%, or even 50% better either. It has to be at least TWICE as good as the old systems. When I designed it, that was my minimum threshold. If I couldn't make it dramatically better, it would not gain widespread adoption; it would likely end up as a very narrow niche product and not be worth the effort.

Fortunately, the design has proven to work so well that I think we have not only met that 2x threshold but greatly exceeded it. I would not be surprised if, once all the features are fully implemented, we have a system that is 10x better than those other systems. That does not mean that we will do everything 10x better than every feature found in those other systems (for example, we will not be able to read a Didget from disk ten times faster than a file system can read a file), but rather that overall it will be that much better when all the factors of performance, feature set, ease of use, security, and flexibility are considered.