Thursday, May 10, 2012

What is a Didget?

This new data management architecture uses individual objects called Didgets. A Didget (short for Data Widget) has some properties of a conventional file, some properties of items stored in an object store, and a bunch of properties for which I can find no equivalent in any other system.

A Didget has a variable-length data stream just like a file. Any kind of serialized data can be written to this data stream. It can contain a photo, some software, a video stream, or any other structured or unstructured data that can be saved to a file. The number of bytes for this stream can range from zero bytes to just over 18 trillion bytes.

A Didget also has a small set of required metadata that is stored as a fixed-size record within a table. Just like file records like iNodes (Ext2, Ext3, Ext4) and FRS structures (NTFS) that file systems use to track files, the Didget Manager keeps track of all the Didgets within a Chamber using these Didget table entries.

The size of each entry is intentionally small. It is only 64 bytes in size. This allows extremely large numbers of Didgets to be managed using a minimal amount of disk reads and RAM for caching. By contrast, the default iNode size is 256 bytes and NTFS's records are 1024 bytes (4096 bytes on all the new advanced format hard disk drives). Some other file systems have even larger metadata records. In order for the NTFS file system to read in the entire MFT and store it into memory for quick searches, it would need to read in 10 GB from disk and have a 10 GB of RAM if the volume contained 10 million files. For a 100 million files, it would need ten times that much memory.

The Didget Manager, on the other hand, could read and store the entire Didget table in just 640 MB when 10 million Didgets are present. Even for 100 million Didgets, it is a very manageable 6.4 GB.

With only 64 bytes to work with, every single bit is important. Painstaking care was taken to insure that every byte was necessary and yet every field was sufficient size to ensure good limits. Fields within this structure include the Didget's ID, its type information, its attributes, its security keys, a tag count, and three separate event counter values (analogous to date and time stamps).

Absent from this structure is the name of the Didget. It is stored in another structure if the Didget even has one. Unlike files, a Didget does not need to have a name. Its unique identifier is a number. This 64 bit number (the Didget ID) is assigned during Didget creation; it never changes over the life of the Didget; and the number is never recycled if the Didget is deleted and purged from the chamber. This means that if the ID is stored within some other data stream for use by a program, it will always point to the right Didget (unless of course that Didget has been deleted).

A Didget can have a name. In fact it can have lots of them. The name is simply a tag that has been attached to a Didget. A tag is a simple Key:Value pair that is stored in such a way as to enable database-like speeds when searching for Didgets that have certain tags. Each Didget can have up to 255 tags attached to it.

Each tag within the system is defined in a schema. The user or an application can create a new tag definition using a simple API. Once defined, tags can be created and attached to any Didget(s) using that definition.

For example: a Didget containing a photograph taken of Bob in New York City in 2011 may have 3 tags attached to it (.person.FirstName = Bob, .place.City = "New York City", .date.Year = 2011). Any application can issue a query to the Didget Manager for a list of all photographs with these three tags and this Didget will be in the list. If the user wanted to later attach a new tag (e.g. .device.Camera) to this photograph, he could define the tag and then attach the value (.device.Camera = "Cannon EOS 7D") to it.

The Didget Manager is designed to be lightning quick at finding all Didgets that match a given query. For example: if a chamber contained 20 million Didgets and 5 million of them were photographs, it could return a list of the Didget IDs of all 5 million photographs in less than 1 second. If the query required matching several tag values, it would still take less than 10 seconds to return the complete list even if every photograph had dozens of tags attached to them.

The Didget Manager is able to accomplish this task without needing a separate database that can become out of sync with the Didget metadata. The entire system has been built around a "10 second rule". That is to say that the algorithms and structures of the system have been designed such that with the right hardware setup, no query should ever require more than 10 seconds to complete even if the chamber contains billions of Didgets.

No comments:

Post a Comment