Saturday, January 19, 2013

Silos of Information

As I stated earlier, the Didget Management System was designed to offer an alternative to conventional data management systems that tend to manage just a subset of all data and to build walls around any extra metadata they may generate. With such systems, a given set of a few million pieces of data (files and/or database rows) will often be fragmented into several of these "silos of information".

To illustrate this using a real-world example, consider the following:

A user has a 2TB hard drive nearly full of data. Since the average size of each piece of data on the drive is about 1 million bytes, this represents nearly 2 million different files. Out of all that data, there exist three important pieces of information that are from the user's friend Bob. Bob has sent the user an email; the user has taken a picture of Bob and transferred that picture from his camera to the hard drive; and Bob has also authored a document that the user downloaded from Bob's web site.

The email was transferred from the email server to the user's computer running Windows 7 by Microsoft Outlook and stored in a .pst file somewhere in the file system hierarchy. The picture was imported into Google's Picasa photo manager and was "tagged" with "Bob" using its facial recognition feature. This tag was embedded within the .jpg file using the EXIF space reserved for such tags. The document was also stored somewhere on the file system and the user set an extended attribute of "Author=Bob" on the document file using a special document manager program.

Now the user wants to do a general search for everything on his drive that has to do with his friend Bob and hopes to come up with all three pieces of information. He wants a program that will comb through all his data and find those pieces.

1) The program will prompt the user for what to base the search on. The user just types "Bob" since there is no standard schema that helps identify "Bob" as a person's first name.

2) The program must now be able to do a complete folder hierarchy traversal, looking for any instances of the string "Bob". It might find a bunch of files that contain "Bob" in their file name, like "bobset.exe" or "stuff.bob". It would need to show those to the user since it doesn't know if they might be relevant.

3) For every file the program searches, it would need to peek at its extended attributes to see if any contained the word "Bob". As with matching file names, it would need to display a file called "Photo1.jpg" that had an extended attribute "Activity=Bobbing for apples", since it cannot tell whether the match is relevant. For every .jpg file, it would need to know how to open and search the EXIF data portions, also looking for any tags that might have "Bob" in them.

4) It would need to be able to parse through any .pst files by following the Microsoft specification, looking for any emails that might come from, or be about, Bob.

Each of these things represents a different "silo" of information that would need to be accessed and understood by the program doing the search. The .pst database file, the file extended attributes, and the EXIF data inside .jpg files are examples of these silos. There are many other silos, such as .db files, HTML or XML files, registry files, INI files, and .doc files. Accessing each of them requires knowledge about their format and the rules for parsing them.
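The search steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the reader functions (filename_silo, text_content_silo) and the demo files are made up for this example, and plain .txt files stand in for formats like .pst or EXIF, which would each need their own parser.

```python
import os
import tempfile

def filename_silo(path):
    """Yield searchable strings from the file name itself."""
    yield os.path.basename(path)

def text_content_silo(path):
    """Yield the contents of plain-text files; other formats need their own parser."""
    if path.endswith(".txt"):
        with open(path, encoding="utf-8", errors="ignore") as f:
            yield f.read()

# Hypothetical reader list. Real readers for .pst files, EXIF data, or
# extended attributes would be added here, each with format-specific code.
SILO_READERS = [filename_silo, text_content_silo]

def search_silos(root, term):
    """Walk the whole folder tree and report (path, reader name) for every hit."""
    term = term.lower()
    hits = []
    for folder, _subfolders, files in os.walk(root):
        for name in files:
            path = os.path.join(folder, name)
            for reader in SILO_READERS:
                if any(term in text.lower() for text in reader(path)):
                    hits.append((os.path.relpath(path, root), reader.__name__))
    return hits

def demo():
    """Build a tiny throwaway folder and search it for "Bob"."""
    samples = [("bobset.exe", ""), ("notes.txt", "Lunch with Bob"), ("todo.txt", "Buy milk")]
    with tempfile.TemporaryDirectory() as root:
        for name, body in samples:
            with open(os.path.join(root, name), "w", encoding="utf-8") as f:
                f.write(body)
        return sorted(search_silos(root, "Bob"))
```

Calling demo() reports both "notes.txt" (a real hit) and "bobset.exe" (an irrelevant one), and every file in the tree had to be visited to get there. Each additional silo means another reader to write and maintain.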

If Microsoft's Outlook, Google's Picasa, and the document manager were all built on top of the Didget Management System instead of using those systems to store data, things could be much simpler. The email could be stored in a "Message Didget", the picture could be stored in a "Photo Didget", and the document would be stored in a "Document Didget". Each of these Didgets would have a tag ".person.FirstName = Bob" attached to it.

Now any application could look for stuff about Bob without missing anything or getting all kinds of unintentional results. It would also find all three items in less than 1 second instead of the painfully slow search in the previous example.
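To make the contrast concrete, here is a toy in-memory model of tagged lookup. It is not the actual Didget API: the Container class, its add and find methods, and the ".activity" tag are all assumptions invented for this sketch. Only the ".person.FirstName" tag and the three Didget types come from the example above.

```python
from collections import defaultdict

class Container:
    """Toy container: Didget ID -> (type, tags), plus an index on tag values."""

    def __init__(self):
        self.didgets = {}
        self.tag_index = defaultdict(set)  # (tag name, value) -> set of IDs
        self.next_id = 1

    def add(self, didget_type, tags):
        """Store a Didget and index every tag attached to it."""
        didget_id = self.next_id
        self.next_id += 1
        self.didgets[didget_id] = (didget_type, dict(tags))
        for tag, value in tags.items():
            self.tag_index[(tag, value)].add(didget_id)
        return didget_id

    def find(self, tag, value):
        """Exact-match lookup: no tree walk and no per-format parsing."""
        return sorted(self.tag_index.get((tag, value), set()))

store = Container()
email = store.add("Message", {".person.FirstName": "Bob"})
photo = store.add("Photo", {".person.FirstName": "Bob"})
doc = store.add("Document", {".person.FirstName": "Bob"})
# Hypothetical tag: contains "Bob" as a substring but is not about Bob.
apples = store.add("Photo", {".activity": "Bobbing for apples"})

about_bob = store.find(".person.FirstName", "Bob")
```

Here about_bob holds the IDs of the email, the photo, and the document, while the "Bobbing for apples" photo is left out because the query matches a typed tag rather than a raw substring.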

Sunday, January 6, 2013

A Non-Hierarchical Data Management System

Those of you who have been following this blog may be wondering why I have never called the Didget Management System an "Object Store". This has been intentional, since I believe that Didgets offer many features that other kinds of persistent objects simply do not, and I wanted to avoid confusion. Once someone hears the word "Object" they tend to get all kinds of notions in their head about what a Didget is and what it should do.

But the reality is that a Didget has more in common with an object than with a traditional file. One of the things that really sets our system apart from other kinds of object stores, like Amazon S3, is that we are designing it to be a replacement for local file systems as well as a cloud storage system. If we are successful, in another ten years all the data stored on your laptop, desktop, mobile device, and in various cloud storage containers will be in the form of Didgets.

Although it will be very easy to overlay our system with a traditional hierarchical namespace to provide backward compatibility with legacy systems, our native storage design is anything but hierarchical. This is somewhat like having a file system volume with just a single folder (i.e. root) where every file in the volume is stored. Since files use simple names as their unique identifier, such a system is impractical (if not impossible) for a traditional file system with millions of files stored on it. With Didgets it is not only possible, but very practical to store a hundred million of them within a single container without a hierarchical naming model.

Every Didget within a container has its own unique 64-bit ID that is used to access it. For systems that interact with data without needing its identifier to be in human readable form, it is easier and faster to store IDs as numbers rather than names like "C:\Windows\System\Drivers\adpahci.sys". With our system, it is also very easy to find groups of Didgets that match given criteria or to narrow down a simple search to find the single Didget you are looking for.

But we have hierarchical namespaces for a reason, so it is worth reviewing how we got here.

With file systems, a file's identifier is its name. You create, open, move, copy, and delete a file by passing its name into a file API function. Since its name is its identifier, you can't have two different files with the exact same name. This means you have to come up with a unique name for every file - a task that gets increasingly harder as the number of files increases. Early file systems like FAT, which were not case sensitive and restricted names to 8.3 format, made this task even harder. Even with long file name support that allows very specific and descriptive names, devices like cameras tend to want to create your pictures with names like "Photo_001.jpg", "Photo_002.jpg", "Photo_003.jpg", etc.

To get around naming conflicts and to add a simple categorization facility, file system designers came up with a hierarchical directory (or folder) model. A file's name only needed to be unique within a given folder, and its full path name became its unique identifier. The file name and folder name could be easily human readable and provide clues for navigation that were intuitive for many users. The folder system also made it easy to copy, move, or delete whole folders or entire folder trees using simple commands.

But file names and folder hierarchies have a number of problems associated with them. Changing the name of any file or any folder in its path will change its unique identifier and thus invalidate any stored references to it. The human readable names cannot be translated from one language to another without causing the same problem. An unprotected file might have its contents overwritten by a completely unrelated file that just happens to have the same name. If I want to store photos I have downloaded, I might have both a "/home/photos/download" and a "/home/download/photos" folder and have files in both - causing confusion.
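The first of these problems is easy to reproduce. The sketch below saves a full path, renames one folder on that path, and checks whether the saved reference still resolves. The second function shows the contrasting idea with a plain dictionary keyed by a numeric ID. The dictionary layout and the ID value 42 are assumptions made for this example, not how Didgets are actually stored.

```python
import os
import tempfile

def stored_path_survives_rename():
    """Return whether a saved full path still resolves after a folder rename."""
    with tempfile.TemporaryDirectory() as root:
        os.makedirs(os.path.join(root, "photos", "download"))
        saved_reference = os.path.join(root, "photos", "download", "Photo_001.jpg")
        with open(saved_reference, "w") as f:
            f.write("image data")
        # Renaming any folder on the path changes the file's identifier.
        os.rename(os.path.join(root, "photos"), os.path.join(root, "pictures"))
        return os.path.exists(saved_reference)

def stored_id_survives_retag():
    """Return whether a saved numeric ID still resolves after its tags change."""
    # Hypothetical in-memory stand-in for a container of tagged items.
    items = {42: {"name": "Photo_001.jpg", "folder_tags": {"photos", "download"}}}
    saved_reference = 42
    # Changing a folder tag leaves the identifier untouched.
    items[42]["folder_tags"] = {"pictures", "download"}
    return saved_reference in items
```

The first function returns False because the path was the identifier; the second returns True because the identifier never depended on a name.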

Didgets operate in a completely different manner. Each Didget can have a name (or multiple names) attached to it as a name tag. When a file is converted to a Didget, each folder name may be attached to it as a separate folder tag. Unlike file paths, the ordering of tags doesn't make any difference. So if we were to overlay a hierarchical namespace to the Didget system, a command like "ls /home/andy/documents/projects/projectX/*" would give the same results as "ls /documents/andy/projectX/projects/home/*".

You could leave out folder tags in a search and just get more results. For example: "ls /andy/*.jpg" would return all the JPG photos that were stored in any path that had "andy" as one of its folders. New folder tags could be added or existing tags deleted at any time without having "moved" the Didget or changing its unique identifier in any way. Existing tags can also be modified or translated to another language with the same lack of consequences.
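One way to picture this is to treat the folder part of a query as an unordered set of tags that must all be present. The following is a minimal sketch under that assumption. The parse_query and ls functions, the dictionary layout, and the sample Didgets are invented for illustration and are not the real Didget interface.

```python
from fnmatch import fnmatchcase

def parse_query(query):
    """Split "/andy/*.jpg" into a set of folder tags and a name pattern."""
    *folders, pattern = query.strip("/").split("/")
    return set(folders), pattern

def ls(didgets, query):
    """Return names whose folder tags include every folder in the query."""
    folders, pattern = parse_query(query)
    return sorted(
        d["name"]
        for d in didgets
        if folders <= d["folder_tags"] and fnmatchcase(d["name"], pattern)
    )

# Hypothetical sample data: each Didget carries a name and a set of folder tags.
didgets = [
    {"name": "plan.doc", "folder_tags": {"home", "andy", "documents", "projects", "projectX"}},
    {"name": "beach.jpg", "folder_tags": {"home", "andy", "photos"}},
    {"name": "cat.jpg", "folder_tags": {"home", "sue", "photos"}},
]

in_order = ls(didgets, "/home/andy/documents/projects/projectX/*")
shuffled = ls(didgets, "/documents/andy/projectX/projects/home/*")
andy_jpgs = ls(didgets, "/andy/*.jpg")
```

Because the folder tags are compared as a set, in_order and shuffled are identical, and andy_jpgs picks up "beach.jpg" from any location tagged "andy" without the query spelling out the rest of the path.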

Such a system provides a much more flexible mechanism for categorizing and finding data. As previous posts have shown, we can find all matching Didgets much faster than conventional file systems can find matching files.