Welcome to the Realm, The World of Didgets: June 2012

Saturday, June 16, 2012

Synchronization

Didget Management is much more than just managing lots of Didgets within a given Chamber. It is about managing all the Didgets within a given user's Domain. Each Chamber within the global Didget Realm is a member of one and only one Domain. Since each Chamber within a user's Domain is probably located on a completely separate storage device, the data must be managed across those devices.

Unlike file systems, the Didget Manager can perform operations against a set of Didgets without explicit commands from a running application. Policy Didgets created by the user can direct the Didget Manager to perform those operations automatically when certain events occur or when a specified amount of time has passed. Tasks like backup, replication, and synchronization can all be controlled using Policy Didgets.

One of the biggest challenges for existing applications that synchronize data between two separate file system volumes is determining exactly which files are the same and which are different. If each volume has a large number of files, this task can take a very long time. Even if two files have the same name, extra metadata and even the full contents of the data stream must be checked to make sure there are no differences between them. The challenge is even harder if most of the files are the same but located in different folders on each system.

For example, suppose an application wanted to make sure two separate volumes both had the exact same copies of all photographs stored within them. It would need to first find every photograph in each volume and then compare it with each photograph in the other volume. If Volume A had some photographs that Volume B did not (or vice versa), then it would need to copy them. What should it do if all the pictures on Volume A were located under a /photos file folder hierarchy and all the pictures on Volume B were located under a /pictures folder? Should it synchronize by trying to replicate the folder structures or instead try to copy files to existing folders?
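To make the cost of the file system approach concrete, here is a minimal sketch of one common workaround: index every photograph by a hash of its contents so that folder names and file names stop mattering. The function names and the list of extensions are assumptions made for illustration. Note that the sketch still has to read every byte of every photograph on both volumes, which is exactly the slow part described above.

```python
import hashlib
from pathlib import Path

def content_index(root, extensions=(".jpg", ".jpeg", ".png")):
    """Map SHA-256 digest -> relative paths of every matching file under root."""
    index = {}
    for path in Path(root).rglob("*"):
        if path.is_file() and path.suffix.lower() in extensions:
            # The whole data stream is read; names and folders are ignored.
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            index.setdefault(digest, []).append(path.relative_to(root).as_posix())
    return index

def missing_from(source_index, target_index):
    """One path per photograph whose content appears nowhere in the target."""
    return sorted(
        paths[0]
        for digest, paths in source_index.items()
        if digest not in target_index
    )
```

Calling `missing_from(content_index("/photos"), content_index("/pictures"))` lists what would need to be copied, but it says nothing about where the copies should go, so the folder question remains open.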

Synchronization between any two Chambers in the Didget Realm is almost trivial. The Didget Managers can quickly compare the two Chambers and find all the differences between them. The event counters and Marker Didgets discussed in an earlier post are the tools the Didget Manager uses to figure out what has changed and in what order the changes happened. Didgets can be copied between two Chambers without needing to worry about folder structures.

For example, two Chambers that each have a million Didgets in them can be compared in just a few seconds and a complete list of all new or modified Didgets since the last synchronization event can be generated. Following the synchronization policy (or policies), the Didget Manager can copy any changes between the two Chambers so that they are completely in sync with each other.
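The idea of finding changes with a counter can be sketched in a few lines. This is only an illustration under stated assumptions, not the Didget Manager's actual design: it assumes each Chamber stamps a Didget with an ever-increasing event number whenever the Didget is created or modified, and that the sync remembers the source Chamber's counter as a marker. The `Chamber` class, `changed_since`, and `sync_one_way` are hypothetical names. The sketch copies in one direction only and does not handle conflicts or deletions.

```python
from dataclasses import dataclass, field

@dataclass
class Chamber:
    """Toy Chamber: every write is stamped with the next event number."""
    counter: int = 0
    didgets: dict = field(default_factory=dict)  # id -> (event_number, payload)

    def put(self, didget_id, payload):
        self.counter += 1
        self.didgets[didget_id] = (self.counter, payload)

    def changed_since(self, marker):
        """Didgets created or modified after the given event number."""
        return {
            didget_id: payload
            for didget_id, (event, payload) in self.didgets.items()
            if event > marker
        }

def sync_one_way(source, target, marker):
    """Copy everything changed in source after marker; return the new marker."""
    for didget_id, payload in source.changed_since(marker).items():
        target.put(didget_id, payload)
    return source.counter
```

Because only the event numbers are compared, no data stream has to be read to decide whether a Didget changed. A real implementation would presumably keep the entries ordered by event number rather than scanning all of them as this toy does.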

Sunday, June 10, 2012

Public vs Private Data

Within the storage systems of any individual, small business, or large enterprise, there are two kinds of data: data that was created by the user(s) of that system, and data that was created somewhere else and copied into that system.

In the Didget Realm, Didgets can be classified as either Public or Private. Public data is that which was "published" by its creators for public consumption. Examples of public data are songs, movies, books, and software. Often their creators want the consumers of such data to pay for the privilege. Private data, on the other hand, was created within the data domain of the creator for their own private consumption.

File systems have no way to distinguish between the two types of data. File1.doc may be a popular document that I downloaded off the Internet and I have one of a million copies. File2.doc may be my own personal document that I spent 50 hours working on and I have the only copy. (Of course, it would not be wise for me to work so many hours on a document without making backup copies, but every once in a while you hear about some student losing such a thing.) Using a file system, I cannot tell which type of data is contained within either of the two files.

The simple fact is that these two types of data should be treated differently. I want to make regular backups of my private data and take extra security measures to ensure that unauthorized access is prevented. If I lose some software I downloaded (public data), I can always replace it by just downloading it again. If I have a cloud backup solution, I don't want to use up all my bandwidth and storage space by pushing copies of a bunch of HD movies I downloaded instead of my important documents.

With Didgets, I can instantly see which data I have created and what I have copied from others. I can set policies dealing with replication, security, and backups based on those types. For example, I could have a policy that tells the Didget Manager to create two separate replicas of every private document I create.
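As a rough illustration of such a policy, the sketch below models a policy as a pair of a matching rule and a replica count. Everything here is hypothetical: the `Didget` record, its `kind` and `private` fields, and the `replicas_wanted` function are invented for the example and say nothing about how Policy Didgets are actually stored or evaluated.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Didget:
    name: str
    kind: str      # e.g. "document" or "movie" (assumed field)
    private: bool  # True if created within my own Domain (assumed field)

def replicas_wanted(didget, policies):
    """Return the replica count of the first policy whose rule matches."""
    for matches, replica_count in policies:
        if matches(didget):
            return replica_count
    return 0  # no policy applies, so make no replicas

# "Create two separate replicas of every private document I create."
policies = [
    (lambda d: d.private and d.kind == "document", 2),
]

thesis = Didget("thesis", "document", private=True)
movie = Didget("downloaded-movie", "movie", private=False)
```

Here `replicas_wanted(thesis, policies)` returns 2, while the downloaded movie matches no policy and gets no replicas.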

Public Didgets are Immutable by default. This read-only attribute prevents any changes to them, which keeps a virus from altering them and guarantees their integrity. If I want my own private copy of a Public Didget that I can alter, I need to copy its contents to a Private Didget. I can alter the Private Didget while keeping the original Public Didget intact.
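The copy-to-alter rule can be mimicked with Python's frozen dataclasses. This is only an analogy with hypothetical class names; the real read-only attribute is enforced by the Didget Manager, not by the application.

```python
from dataclasses import dataclass, FrozenInstanceError

@dataclass(frozen=True)
class PublicDidget:
    content: bytes  # cannot be reassigned once the object exists

@dataclass
class PrivateDidget:
    content: bytes  # freely editable

def copy_as_private(public):
    """Make an editable Private copy; the Public original is untouched."""
    return PrivateDidget(content=public.content)

song = PublicDidget(b"original recording")
try:
    song.content = b"tampered"
    tampered = True
except FrozenInstanceError:
    tampered = False  # the attempted change was rejected

remix = copy_as_private(song)
remix.content = b"my remix"
```

After this runs, `song.content` still holds the original bytes and only `remix` has changed.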

Tuesday, June 5, 2012

Tags, Tags, and More Tags

In the Didget Realm, every single Didget can have lots of tags attached to it. Tags are similar to the extended attributes that have been added to some file systems. They are extra metadata that exists outside of an object's regular metadata and separate from its data stream. While tagging data is nothing new, the approach we take to implement tags with Didgets is very different from previous solutions like extended attributes or database tags.

Extended Attributes

File system extended attributes are simple Key:Value pairs. The key is a simple string without any specific context involved. Just as with file names, a file system will not attempt to interpret the meaning of a given key; it is just a simple lookup with no relationship between any two given keys. Likewise, a file system will not attempt to impose any restrictions on the value assigned to any given key other than making sure its length does not exceed any imposed limit.

File systems were not designed to allow fast, efficient searches for files based on the existence of extended attributes or on any particular value assigned. For example, if an application wanted to find all the documents within a given file system volume that had the extended attribute "Author=John" attached to them, it would need to do a brute force search by finding every file with a document extension and examining each one individually to see if it had that particular extended attribute key and value. For a volume with a million or more files in it, such a search can be painfully slow.

Since many file systems do not support extended attributes and using them can be difficult, they are rarely used by applications. If a file with extended attributes is moved or copied to another file system, it is likely that the extended attributes will either be lost or altered in some way.

Database Tags

Some applications allow the user to tag data by storing information inside of a database managed exclusively by that application. Popular data management software like iTunes and Picasa use this technique to tag music and photos. These databases are not meant to be shared openly between applications and if a photo or music file is copied from one volume to another, the tags don't come with it. A user is only able to search based on the tags if the specific application supports it.

Didget Tags

Unlike these other approaches, our tags are designed to be widely used, shared, and searchable. Any application can use our simple API to get a list of tag definitions and attach tag values based on those definitions to any Didget. Any application can then find Didgets based on tags or add their own tags to make a Didget easier to find or manage.

Every tag within a Didget Chamber is defined using a simple schema. Once a tag is defined, any application can use it to attach a value to a Didget. If an application wants to use a tag that is not currently defined, it can quickly define a new tag, which adds its definition to the schema. Applications can search for Didgets that have a certain defined tag attached to them or, more specifically, have a certain value assigned.

For example, an application can define a new tag ".person.Nickname" and then attach that tag with the value of "Bubba" to a photograph Didget. Another application can later query the Didget Manager for a list of all Photograph or Document Didgets that have ".person.Nickname = Bubba" attached. The Didget Manager would be able to process that query in just a few seconds even if there were 2 million Photograph Didgets and 3 million Document Didgets mixed in with 5 million other kinds of Didgets and all of the Didgets had some tags attached to them.

Likewise, applications could search for all Didgets that had any tag of category ".person". They could find a list of all Music Didgets where ".person.musician=Billy Joel" and ".date.year=1980". The Didget Manager is able to perform these lightning fast queries without needing a separate database or implementing a complicated query language.
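One way to picture how such queries can avoid a brute force scan is an inverted index that maps each tag and value to the set of Didgets carrying it. The sketch below is an assumption made for illustration, not the Didget Manager's real data structure or API; `TagIndex`, `define`, `attach`, `find_all`, and `find_category` are hypothetical names.

```python
from collections import defaultdict

class TagIndex:
    """Toy tag store: a schema of defined tags plus two inverted indexes."""

    def __init__(self):
        self.schema = set()               # defined tag names
        self.by_value = defaultdict(set)  # (tag, value) -> didget ids
        self.by_tag = defaultdict(set)    # tag -> didget ids

    def define(self, tag):
        self.schema.add(tag)

    def attach(self, didget_id, tag, value):
        if tag not in self.schema:
            raise KeyError(f"tag {tag!r} is not defined")
        self.by_value[(tag, value)].add(didget_id)
        self.by_tag[tag].add(didget_id)

    def find_all(self, conditions):
        """Didgets matching every tag=value pair in the conditions dict."""
        sets = [self.by_value.get(item, set()) for item in conditions.items()]
        return set.intersection(*sets) if sets else set()

    def find_category(self, category):
        """Didgets carrying any tag in a category such as ".person"."""
        prefix = category + "."
        found = set()
        for tag in self.schema:
            if tag.startswith(prefix):
                found |= self.by_tag.get(tag, set())
        return found

index = TagIndex()
for tag in (".person.musician", ".person.Nickname", ".date.year"):
    index.define(tag)
index.attach("song1", ".person.musician", "Billy Joel")
index.attach("song1", ".date.year", 1980)
index.attach("song2", ".person.musician", "Billy Joel")
index.attach("song2", ".date.year", 1983)
index.attach("photo1", ".person.Nickname", "Bubba")
```

With this data, `index.find_all({".person.musician": "Billy Joel", ".date.year": 1980})` yields only `song1`, and `index.find_category(".person")` yields all three Didgets. Each query touches only the sets for the tags it names, so its cost does not depend on how many unrelated Didgets exist.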

Unlike file extended attributes, tags are not lost when a Didget is copied or moved to another Chamber. This is because all Chambers support the tags and because applications do not perform the actual copy operation. An application will initiate the copy operation by telling the Didget Manager to copy a Didget, but it is the Didget Manager itself that makes sure nothing is lost during the copy. You never have to worry that your tags will be lost because the application forgot to copy them.

Tags are powerful tools that let users and applications add meaningful metadata to any or all of the Didgets within a Chamber. That metadata enables fast searches based on specific values, and the results can be used to build lists or menus.