Tuesday, May 15, 2012

The Didget Record

As stated earlier, every Didget has a 64-byte metadata record used to track it. The Didget Manager is the software that manages all the Didgets in the system. Unlike a file system, the Didget Manager is able to distinguish between different kinds of Didgets.

A file has only one mechanism (outside of the actual bits stored in its data stream) used for classification. That mechanism is the file extension (typically a three or four character string appended to the end of the file name). The file extension may symbolize the format of the data stream but the file system does not try to interpret its meaning. It is completely up to applications to interpret which file extensions belong to a particular category.

For example, there are dozens of different data stream formats that are used to represent a still image (e.g. a photograph). JPG, PNG, GIF, TIF, BMP, and ICO are all examples of file extensions used to represent images. If a user wanted to know how many image files were on a system, they would have to run an application that was programmed to find every file extension applicable to images. Since there is no way to ask a file system for a list or a count of "all image files", the application would need to perform a separate search for every file extension. If a volume contained millions of files, this simple search could take an hour or more to complete. If a new file extension was created to represent a new image format, the application would need to be updated so that it would look for files with that new extension.
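To make that cost concrete, here is a minimal Python sketch of the extension-based count just described. The extension set and the function name count_image_files are illustrative assumptions, not part of any file system API. The point is that the list lives in the application and has to be edited whenever a new image format appears.

```python
import os

# Assumed, application-maintained list of image extensions (not exhaustive).
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".gif", ".tif", ".tiff", ".bmp", ".ico"}


def count_image_files(root):
    """Walk every directory under root and count files by extension."""
    count = 0
    for _dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            # The file system offers no category; only the name suffix is available.
            ext = os.path.splitext(name)[1].lower()
            if ext in IMAGE_EXTENSIONS:
                count += 1
    return count
```

Every file name on the volume has to be visited, and a file with an unlisted extension is silently missed.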

Didgets, on the other hand, have several mechanisms that are used to classify data. Every Didget has a Didget type and a Didget subtype. If the type is File Didget, then it also has a File Didget format assigned. The Didget type and subtype fields are bit fields of 16 bits each, and only a single bit is set in each field. Since each of the 16 Didget types can have 16 different subtypes, there are 256 possible kinds of Didgets in the system.
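As a rough illustration of the one-bit-per-value encoding, the Python sketch below models the two fields and counts the combinations. The bit position chosen for the File type and the helper name is_one_hot are assumptions for illustration; this post does not say which bit stands for which type.

```python
FILE_TYPE = 1 << 0  # assumed bit position for the "File" Didget type


def is_one_hot(value):
    """True if value fits in 16 bits and has exactly one bit set."""
    return 0 < value <= 0xFFFF and value & (value - 1) == 0


# One bit per value gives 16 choices per field, so 16 * 16 combinations.
TYPES = [1 << bit for bit in range(16)]
SUBTYPES = [1 << bit for bit in range(16)]
KINDS = [(t, s) for t in TYPES for s in SUBTYPES]
```

Here len(KINDS) is 256, matching the count of possible kinds given above.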

One Didget type is "File". When files are converted into Didgets, they are assigned to be File Didgets. The other 15 Didget types have special purposes that apply only within the Didget Realm and I will discuss them in further posts.

Of the 16 File Didget subtypes, only 8 have been defined so far. They are Audio, Document, Image, Script, Software, Structured Data, Text, and Video. Each File Didget subtype can be further categorized into its various formats. Unlike the other two-byte fields, this two-byte field is not a bit field that can only have a single bit set. Instead it is an unsigned short int and can hold up to 65,534 different format types (zero is reserved).
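The three classification fields could be stored as three consecutive two-byte values. The sketch below uses Python's struct module to show the idea. The little-endian byte order, the field order, the subtype bit positions, and the format number used in any example are all assumptions, since this post does not describe the actual record layout.

```python
import struct

# Three unsigned 16-bit fields: type, subtype, format (layout is assumed).
CLASSIFICATION = struct.Struct("<HHH")

FILE_TYPE = 1 << 0  # assumed bit for the "File" Didget type

# The eight File Didget subtypes named in the post; bit positions are assumed.
SUBTYPE_NAMES = ["Audio", "Document", "Image", "Script",
                 "Software", "Structured Data", "Text", "Video"]
SUBTYPE_BITS = {name: 1 << i for i, name in enumerate(SUBTYPE_NAMES)}


def pack_classification(subtype_name, format_id):
    """Pack a File Didget's type, subtype, and format into 6 bytes."""
    # Zero is reserved; the post does not say whether other values are too.
    if not 1 <= format_id <= 0xFFFF:
        raise ValueError("format_id out of range")
    return CLASSIFICATION.pack(FILE_TYPE, SUBTYPE_BITS[subtype_name], format_id)


def unpack_classification(data):
    """Return (type_bits, subtype_bits, format_id) from 6 packed bytes."""
    return CLASSIFICATION.unpack(data)
```

The type and subtype are stored as single bits, while the format is stored as a plain number, which is why it can distinguish tens of thousands of formats in the same two bytes.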

Audio File Didgets include every format where the data stream is interpreted as sound. Formats for music, audio books, speeches, instruments, voice mail, and other noises all have the "Audio" bit set in the File Didget subtype field.

Software File Didgets include every kind of compiled computer code. Executable files, shared libraries, device drivers, and every other kind of software, regardless of targeted CPU or operating system, all have the "Software" bit set in the File Didget subtype field. Code that must be interpreted, such as Python, Ruby, or Perl programs, system commands, or shell scripts, is categorized as "Script".

The other File Didget subtypes are used to categorize the various document formats, still image formats, video formats, database formats, and plain text data formats.

Unlike file systems, the Didget Management System provides simple APIs used to search for all the Didgets that match a given set of search criteria. What this means is that an application can make a single call to the Didget Manager for a list of all the Video File Didgets and get a complete and accurate list very quickly no matter how many different kinds of video formats may be present.
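A single-call query of that kind might look like the following Python sketch. The DidgetRecord class, the find function, and the bit assignments are hypothetical stand-ins, not the Didget Manager's real API, and the format numbers are arbitrary.

```python
from dataclasses import dataclass

FILE_TYPE = 1 << 0      # assumed bit for the "File" Didget type
IMAGE_SUBTYPE = 1 << 2  # assumed bit for the "Image" subtype
VIDEO_SUBTYPE = 1 << 7  # assumed bit for the "Video" subtype


@dataclass
class DidgetRecord:
    didget_id: int
    type_bits: int
    subtype_bits: int
    format_id: int


def find(records, type_mask, subtype_mask):
    """Return records whose type and subtype bits fall inside the masks."""
    return [r for r in records
            if r.type_bits & type_mask and r.subtype_bits & subtype_mask]


records = [
    DidgetRecord(1, FILE_TYPE, VIDEO_SUBTYPE, 10),  # one video format
    DidgetRecord(2, FILE_TYPE, VIDEO_SUBTYPE, 11),  # a different video format
    DidgetRecord(3, FILE_TYPE, IMAGE_SUBTYPE, 20),  # an image
]
videos = find(records, FILE_TYPE, VIDEO_SUBTYPE)
```

The query never looks at format_id, so both video records match regardless of format. OR-ing masks together, for example VIDEO_SUBTYPE | IMAGE_SUBTYPE, would select several subtypes in the same pass.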

Because the Didget Manager is able to quickly check bits in the bit fields described for every Didget Record in the system, it is able to sort out all the matching Didgets for any particular query in record time. On a system with a quad-core processor and 4 GB of RAM, I am able to sort through about 25 million Didget Records per second. This means I can find 9 million photographs mixed in with 16 million other kinds of Didgets in one second or less.
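As a quick sanity check on those numbers, the arithmetic below works out how much record data 25 million Didgets represent. It assumes every record is exactly 64 bytes with no extra per-record overhead, which this post does not confirm.

```python
RECORD_SIZE = 64       # bytes per Didget Record, from the post
RECORDS = 25_000_000   # 9 million photographs + 16 million other Didgets

total_bytes = RECORD_SIZE * RECORDS
total_gb = total_bytes / 10**9  # decimal gigabytes
```

That comes to 1,600,000,000 bytes, or 1.6 GB, so under this assumption the full set of records fits in the 4 GB of RAM on the test system.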
