Welcome to the Realm, The World of Didgets

Saturday, November 3, 2012

How Does it Scale

The Didget Manager is designed to perform a variety of data management functions against a set of storage containers that may be attached to a single system or spread across several separate systems.

These functions include:

1) Backup
2) Synchronization
3) Replication
4) Inventory
5) Search
6) Classification
7) Grouping
8) Activation (licensing)
9) Protection
10) Archiving
11) Configuration
12) Versioning
13) Ordering
14) Data Retention

In order to properly perform each of these functions, a system is needed that can operate against all kinds of data sets consisting of structured and/or unstructured data, from very small sets to extremely large sets (i.e. "Big Data"). A legitimate question for any system is "How does it scale?"

When it comes to the term "Scale", I define it in three dimensions: "Scale In", "Scale Out", and "Scale Up".

"Scale In" refers to the ability of the system's algorithms to properly handle large amounts of data within a single storage container given a fixed amount of hardware resources on a single system. File systems have a limited ability to scale in this manner. For example, the NTFS file system was designed to hold just over 4 billion files in a single volume. However, each file requires a File Record Segment (FRS) that is 1024 bytes long. This means that if you have 1 billion files in a volume, you must read approximately 1 TB of data from that volume just to access all the file metadata. If you want to keep all that metadata in system memory in order to perform multiple searches through it at a faster rate, you would need to have a TB of RAM. Regular file searches through that metadata can also be painfully slow even if all the metadata is in RAM due to the antiquated algorithms of file system design.

The Didget system was designed to handle billions of Didgets and perform fast searches through the metadata even when limited RAM is available. If the same 1 billion files had been converted from files to Didgets, the system would only need to read 64 GB of metadata off the disk and have 64 GB of RAM to keep it in system memory. This is only 1/16 of the requirements needed for NTFS. Searches through that metadata would be hundreds of times faster than with file systems.
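The arithmetic behind that comparison can be checked in a few lines of Python. The 1024-byte FRS size comes from the paragraph above. The 64 bytes per Didget is not stated directly; it is inferred here from the 64 GB figure, so treat it as an assumption.

```python
# Back-of-the-envelope metadata sizes for 1 billion objects.
FILE_COUNT = 1_000_000_000
NTFS_FRS_BYTES = 1024       # File Record Segment size cited above
DIDGET_RECORD_BYTES = 64    # assumption: inferred from 64 GB / 1 billion


def metadata_bytes(count: int, bytes_per_record: int) -> int:
    """Total bytes that must be read to scan every metadata record."""
    return count * bytes_per_record


ntfs_total = metadata_bytes(FILE_COUNT, NTFS_FRS_BYTES)
didget_total = metadata_bytes(FILE_COUNT, DIDGET_RECORD_BYTES)

print(f"NTFS:   {ntfs_total / 10**12:.3f} TB")
print(f"Didget: {didget_total / 10**9:.0f} GB")
print(f"Ratio:  1/{ntfs_total // didget_total}")
```

The same amount of RAM is needed in each case if the whole metadata set is to be held in memory for repeated searches.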

"Scale Out" refers to the ability of the system to improve performance by adding additional resources and performing operations in parallel. This can be accomplished in two ways. Multiple computing systems can operate against a single container, or a single container can be split into multiple pieces and distributed out to those systems. Hadoop is a popular open-source distributed file system that spreads file data across many separate systems in order to service data requests in parallel. It has a serious limitation in that file metadata is stored on a single "NameNode". This has both availability and performance ramifications. It was designed more for smaller sets of extremely large files rather than for extremely large sets of smaller files. Most of the other traditional file systems were never designed to either operate in parallel or to be split up.

The Didget system was designed for both kinds of parallel processing. Multiple systems can operate largely in parallel against a single container since all the metadata structures were designed for locking at the block level. When a system needs to update a piece of metadata, it does not need to establish a "global lock" on the container. It only needs to lock a small portion of the metadata where the update is applicable. This means that thousands of systems can be creating, deleting, and updating Didgets within a single container at the same time. Each container was also designed to be split up and distributed across multiple systems. Both the data streams and the Didget metadata can be split up and distributed. Map-Reduce algorithms are used to query against many of these container pieces in parallel.
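The difference between a global lock and block-level locking can be sketched as follows. This is only an illustration using threads inside one Python process, not the Didget Manager's actual locking code; the class name, the block size of 64 records, and the counter-style "metadata" are all made up for the example.

```python
import threading


class BlockLockedMetadata:
    """Hypothetical metadata table guarded by one lock per block of records."""

    def __init__(self, record_count: int, records_per_block: int = 64):
        self.records = [0] * record_count
        self.records_per_block = records_per_block
        block_count = (record_count + records_per_block - 1) // records_per_block
        self.block_locks = [threading.Lock() for _ in range(block_count)]

    def update(self, index: int, delta: int) -> None:
        # Lock only the block that holds this record, never the whole table.
        with self.block_locks[index // self.records_per_block]:
            self.records[index] += delta


def worker(table: BlockLockedMetadata, start: int, stop: int, repeats: int) -> None:
    for _ in range(repeats):
        for index in range(start, stop):
            table.update(index, 1)


table = BlockLockedMetadata(record_count=256)

# Two writers share the first half of the table and two share the second half.
# Writers in different halves never wait on each other.
threads = [
    threading.Thread(target=worker, args=(table, 0, 128, 50)),
    threading.Thread(target=worker, args=(table, 0, 128, 50)),
    threading.Thread(target=worker, args=(table, 128, 256, 50)),
    threading.Thread(target=worker, args=(table, 128, 256, 50)),
]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()

print(min(table.records), max(table.records))
```

With a single global lock every update would queue behind every other update. With one lock per block, only writers touching the same block contend.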

"Scale Up" refers to the ability of a single management system to manage data from small sets on simple devices to extremely large data sets on very complex hardware systems. Most data management systems today don't scale up very well. For example, backup programs that work well on pushing data from a single PC to the cloud do not generally work well as enterprise solutions. Users typically need separate data management systems for their home environment and for their work environment. As a business grows from a small business to a medium sized business to a large enterprise, it often must abandon old systems and adopt new systems as its data set grows.

The Didget system was designed to work essentially the same whether it is managing a set of a few hundred Didgets on a mobile phone or it is managing billions of Didgets spread across thousands of different servers. Additional modules may be required and enhanced policies would need to be implemented for the larger environment to function effectively, but the two systems would function nearly identically from the user's (or application's) point of view. Applications that use the Didget system to store their data would not need to know which of the two environments was in play.

Saturday, September 1, 2012

Configuration Didgets

Remember the good ol' days when configuration on a Windows PC (or DOS in those days) meant that you had a simple text file in the same directory as your application that controlled the behavior of that application? The file was given an extension of .INI and was easy to read and to edit. When you uninstalled your application (using del *.*), the configuration file was cleaned up along with all your other application files.

Unfortunately, this approach also had a number of drawbacks. If you had 1000 applications, you might also have 1000 little configuration files spread all over your folder hierarchy. They were difficult to find and edit when you wanted to manage a whole bunch of applications at once.

Microsoft's answer to this problem was to create a central database called the Registry where all the configuration settings for the system and user applications could be stored. Unfortunately, this approach also had a number of drawbacks. If this single Registry was deleted or corrupted then everything was a mess; if an application was uninstalled, it didn't always clean up after itself in the Registry; it wasn't always obvious where all the keys for a particular application were stored within the new hierarchy of this database; and there was no way for an application to try and protect unauthorized changes to its configuration settings.

While several steps have been taken to keep the Registry from being corrupted and to be able to recover to a consistent state in the event something goes wrong, the Registry continues to be a bit of a headache when it comes to managing software. Special programs written to help clean up problems with the Registry have become more popular in recent times.

With Didgets, we take a different approach. Just like the old .INI files, each application can have its own set of configuration settings stored in one or more Configuration Didgets. Just like all the other Didgets in our system, you can get a list of all these Configuration Didgets in just a second or two even if there are millions of them.

Each Configuration Didget has some special fields designating what type of software they are used to configure so you can narrow your query to only look for the ones that configure word processors, for example.

Just like the .INI files, if one of these Didgets becomes corrupt, none of the others are affected. Each Configuration Didget can be protected with the Read-Only attribute or with security keys.

Just like the editing tool for the Registry (regedit), a configuration viewer/editor can be built to give the user a unified view to a whole host of Configuration Didgets. It can do this by consolidating all the data from each individual Didget into a single virtual view. Any changes made in the editor would not be made to a central database, but rather made directly into the Configuration Didget where the change was made.

This is like having a word processor display a document where every page in the document was stored in a separate file. Any changes to one of the pages would be made to just the file that held that page. To the user, it looks like a single file, but it's really just a unified view of a whole bunch of separate files.
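A minimal sketch of such a unified view, assuming each Configuration Didget can be modeled as a small settings dictionary plus a "software type" field. The class names, application names, and settings below are hypothetical and only illustrate the write-through idea.

```python
class ConfigDidget:
    """Hypothetical stand-in for one Configuration Didget."""

    def __init__(self, app_name, software_type, settings):
        self.app_name = app_name
        self.software_type = software_type
        self.settings = dict(settings)


class UnifiedConfigView:
    """Presents many Configuration Didgets as one virtual settings list."""

    def __init__(self, didgets):
        self._by_app = {d.app_name: d for d in didgets}

    def items(self, software_type=None):
        """Yield (app, key, value) rows, optionally filtered by software type."""
        for didget in self._by_app.values():
            if software_type is None or didget.software_type == software_type:
                for key, value in sorted(didget.settings.items()):
                    yield didget.app_name, key, value

    def set(self, app_name, key, value):
        # The edit lands in the owning Didget; there is no central store.
        self._by_app[app_name].settings[key] = value


writer = ConfigDidget("WriterApp", "word processor",
                      {"font": "Courier", "autosave": "on"})
player = ConfigDidget("PlayerApp", "media player", {"volume": "7"})

view = UnifiedConfigView([writer, player])
view.set("WriterApp", "font", "Times")

for row in view.items(software_type="word processor"):
    print(row)
```

Because each edit touches only one Didget, damage to one application's settings cannot spread to the others.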

Sunday, July 8, 2012

Digital Rights Management - Part 1 (Design)

Let me start off by saying that Digital Rights Management (DRM) implementations are generally despised by many users, myself included. If you don't believe me, just Google "DRM" and "Stinks", "Sucks", or another appropriately negative word and you will get plenty of hits. The technical press is full of stories about Draconian measures, discontinued services, and software implementations that more closely resemble malware than anything else. In short, many implementations do little to stop piracy but, in the attempt, tend to aggravate legitimate customers.

Although I don't like it, I understand the reason for it. Content owners who deliver popular movies, music, software, and books often lose lots of money when their stuff is widely pirated. (Although I don't buy their argument that every pirated copy is a lost sale.) I have worked for software companies where we estimated that there were in excess of 10 illegal copies of our stuff for every one we sold. When such conditions exist, it is perfectly understandable that measures are often taken to try and prevent it.

The main problem is that everyone seems to take a different approach, and most of the implementations are bad. Legitimate customers of digital content are often faced with several dozen techniques to activate their operating systems, application software, and the various forms of digital media content. License restrictions are often hidden deep within some "End User License Agreement" that was written by lawyers for lawyers. Some activations require dongles, constant Internet access, credit cards, or subscription services. The user may need a dozen different UserName/Password combinations to keep track of all their stuff.

Even the user who is willing and able to jump through all the hoops necessary to get legitimate copies of everything on their system will find it difficult to remain legal or discover what is legal after the fact. Just try and browse through all the files on a large hard drive and figure out what is legal and what is not. If the computer breaks, can you legally transfer your stuff to a replacement computer? If you buy a second computer, how much of the stuff you purchased for the first one can be shared with the second one without an additional license purchase? If you upgrade hardware or operating systems, or change services, is the stuff you previously purchased still legal? Can you make backup copies without violating the terms of the contract?

The average user often gets completely lost in the maze and ends up with either illegal stuff or simply never purchases in the first place because the terms were never clear. Staying legal is a huge headache for businesses and individuals.

Users are often left out in the cold when their subscription service goes out of business or the content owner disables a necessary Internet server that enables legally purchased content to continue to be accessed. Some license agreements and software implementations are way too restrictive and you often have to purchase something before you can even figure out what you are buying.

I could go on all day and cite examples of DRM implementations that aggravated me personally or someone I knew, but let me just say that I have yet to see a version that I have liked.

When I designed the Didget Management System, content protection and activation were built into the core architecture. They are purely optional features. The average user can set up a personal Didget Domain with several Chambers and use millions of Didgets without ever wanting to activate any restricted content, but if they choose to, the features are there to support it.

When designing the features, I had to take into consideration a number of factors. I decided that if the features were to gain acceptance and be widely used they had to meet the following design goals.

1) The implementation has to work. Content owners will not release their stuff using this system if it doesn't protect the data from unauthorized access in the vast majority of cases. No implementation is perfect and given enough resources, some people will try to figure a way around its protections, but it has to be effective in 95%+ of the cases.

2) The system must make it extremely easy for the end user to figure out what has already been activated, what is available for activation, and what are the exact terms for each individual activation.

3) It has to provide a single activation process that allows for multiple payment methods. The end user must be able to activate software or a book using the same technique he used to activate his music or a movie. He should be able to pay for each activation using cash, a credit card, or some kind of account.

4) The system must provide flexible terms for activation so that content owners can provide a variety of ways to access their wares. One time use, unlimited use, limited term (e.g. 24 hours or one month), or a set number of accesses (e.g. 100 uses) are all examples of ways a merchant and their customers may want to conduct business for digital content.

5) The system must provide ways for content owners to allow existing customers to upgrade for a reduced price. It must be able to verify that the customer has a legitimate version that qualifies for the upgrade.

6) The system must provide ways for the customer to purchase content without ever revealing their identity to the merchant. The customer needs the option of an anonymous purchase using cash or an account where the account manager will see that funds are given to the merchant without purchaser information.

7) Any activations must result in the content being accessible for the full term of the contract without any further actions by the merchant. An Internet server cannot be required. Internet access cannot be required. A subscription service does not need to be current.

8) All activations must be valid for a set number of devices. When a user buys a song or a movie, it must play on all his devices without further activations. A simple synchronization is all that should be necessary to share or transfer access rights from one device to another. This mechanism must not work if the device is not one of the user's, however.

9) There are two ways most users are able to get access to restricted content - pay for it directly or get someone else to pay on your behalf (e.g. advertisers). Our system must enable both methods for activation.
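To make goal 4 a little more concrete, here is a rough sketch of how the four example terms (one-time use, unlimited use, limited term, and a set number of accesses) could be expressed with a single record. This is not the implementation described in the next post; the class and field names are invented for illustration. The check uses only local data, in the spirit of goal 7, and it ignores the question of whether the device clock can be trusted.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Activation:
    """Hypothetical record of one activation; field names are illustrative."""
    activated_at: datetime
    max_uses: Optional[int] = None     # None means unlimited accesses
    term: Optional[timedelta] = None   # None means the activation never expires
    uses: int = 0

    def may_access(self, now: datetime) -> bool:
        if self.term is not None and now > self.activated_at + self.term:
            return False
        if self.max_uses is not None and self.uses >= self.max_uses:
            return False
        return True

    def access(self, now: datetime) -> bool:
        """Record one access if the terms still allow it."""
        if not self.may_access(now):
            return False
        self.uses += 1
        return True


start = datetime(2012, 7, 8, 12, 0)
one_time = Activation(start, max_uses=1)
unlimited = Activation(start)
rental = Activation(start, term=timedelta(hours=24))
hundred_uses = Activation(start, max_uses=100)

print(one_time.access(start), one_time.access(start))
print(rental.access(start + timedelta(hours=25)))
```

One record type with two optional limits covers all four cases, and a merchant could combine them (for example, 10 uses within one month) without a new mechanism.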

My next post will describe our implementation and how it meets the requirements listed above.

Sunday, July 1, 2012

Didget Attributes

In most file systems each file or directory can be assigned a few attributes by applications either during file creation or at a later time. Directories are given the "Directory" attribute. Hidden files are given the "Hidden" attribute and static files are given the "Read-Only" attribute.

It is important to note that each of these attributes is just a mechanism to hint to any application how the file should be treated. Applications can ignore these attributes or change them at any time so they may not accurately reflect the user's wishes for the file or provide any meaningful security for the file stream data or file metadata.

In the Didget world, Didgets may also be assigned a number of special attributes that can be used to identify, search, or perform operations against any Didget. Some of them are like file attributes in that they are merely hints to applications and can be changed at will. Others provide meaningful protection and additional capabilities since an operating system or application cannot change them directly.

Didgets have 32 separate attributes. Some of them provide features that I have not seen anywhere else before. I will enumerate and explain each of them.

1) Prepended. Didgets have the unique ability to add additional data to the byte stream before the first data byte. Data must be prepended in 4096 byte chunks (the block size). Bytes in these prepended blocks can only be accessed using negative offsets. Byte 0 remains the traditional start of the file so that prepending data will not affect legacy applications. This allows extra metadata to be added to any given byte stream without worrying about breaking compatibility with an application that is not designed to handle it.

2) Versioned. The Didget Manager has been designed to handle versioning of individual data streams. Unlike traditional Copy On Write (COW) file systems that are designed to version everything, the versioning capability in our system can be restricted to a small subset of Didgets. Didgets can have this attribute added or deleted at any time (with proper access rights) so you can turn versioning on or off for a single Didget or a whole group of Didgets. Snapshots can be taken any time the versioning is enabled.

3) Metered. This attribute is a critical piece of our "Digital Rights Management" capabilities. As a side note: I think DRM is generally a dirty word since it has been implemented so poorly (technically and administratively) in so many cases. Any Didget can be classified as "Metered" when it is published by the content owner to become a Public Didget. The terms for activation are clearly spelled out in the activation contract that is prepended to the data stream. Anyone who agrees to the terms can activate any Didget using the exact same set of activation procedures. This means that the process to activate music, movies, software, and books is exactly the same. I will address our whole new activation system in a later post.

4) Point Generator. Metered Didgets are activated using "Media Points". These points can be either bought or earned. Users are able to earn points by accessing Didgets with this attribute. Advertisers can produce digital content (i.e. advertisements) that a user can view or interact with to earn points that can in turn be spent towards any kind of other media.

5) Deleted. When a Didget is deleted, it is assigned this attribute (similar to moving a file to the trash bin). Deleted Didgets can be recovered until they are purged from the system. Purging requires special user rights so an application can delete Didgets but not destroy them.

6) Encrypted. This is just a hint to any application accessing the data that it has been encrypted. The application must be able to decrypt the data in order to use it.

7) Compressed. Just like the Encrypted attribute, only for compression.

8) Sparse. Data streams can contain holes. Any Didget with a sparse data stream will have this attribute set.

9) Immutable. Data streams can be set with this "Read-Only" attribute to protect them from alteration. Public Didgets have this attribute set by default. Once this attribute is set, it cannot be cleared. Once immutable, always immutable. If you need a copy that is alterable, you can clone it into another Private Didget and change the copy all you want, but the original remains intact. Since Didgets are accessed through their Didget IDs, you can't fool an application into reading your altered copy like you can with files by simply replacing a read-only file with an altered file with the same name.

10) Appendable. Immutable Didgets cannot have their existing data streams altered. However, with this attribute, additional data can be appended to the end of the data stream. Used in combination, it will be popular for logs that want new data added without the ability to change data previously written.

11) Self-Destruct. Any Didget with this attribute will be automatically deleted and purged from the system by the Didget Manager when the conditions for destruction have been met. This can be a specified period of time or a number of accesses. This will allow users to activate (e.g. rent) content for a specified period of time. When the period for activation has passed, the Activation Didget will automatically be destroyed and the permission to access its Metered Didget with it.

12) Multiple Tags. This is a system attribute maintained by the Didget Manager. It is set when a Didget has two or more tags with the same key attached. For example, a photograph of three people may have three ".person.First Name" tags attached, each with a value corresponding to the first names of each person in the photograph.

13) Single Copy. Didgets with this attribute are deleted and purged from the system when they are copied. This creates a software "Dongle" mechanism that enforces a single copy of any given Didget within the system.

14) Disposable. This attribute is somewhat similar to temporary files. Didgets with this attribute can have the space occupied by their data stream confiscated by the system when disk space runs out. An application does not need to come clean them up when disk space is low. This allows the user to fill up their disk with lots of HD video that they may never view without worrying that it will result in an "Out of Disk Space" error. As long as the space is not needed, the video is accessible. Backup policies can completely ignore disposable data.

15) Activated. Metered Didgets that have been activated by the user will have this attribute set. It is not a security mechanism since other measures are checked to ensure that the activation is valid, but it is a quick way to see what has been activated and what has not.

16) Quarantined. Didgets that have yet to be scanned for viruses or other malware can have this attribute set. It may result in a warning to the user when it is accessed. (This can also be controlled through policies.)
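The negative-offset addressing described under the Prepended attribute (item 1) is easy to picture with a small in-memory model. The 4096-byte block size is from the text. Everything else is an assumption made for the sketch: the class name, zero-padding a short payload up to a whole block, and placing newer blocks in front of older ones.

```python
BLOCK_SIZE = 4096  # prepend granularity stated in item 1


class PrependableStream:
    """Hypothetical in-memory byte stream with blocks prepended before byte 0."""

    def __init__(self, data: bytes):
        self._prefix = b""   # prepended blocks, reached through negative offsets
        self._data = data    # original stream, byte 0 onward

    def prepend(self, payload: bytes) -> None:
        # Assumption: pad with zero bytes so the prefix stays a whole number of blocks.
        padded_len = -(-len(payload) // BLOCK_SIZE) * BLOCK_SIZE
        self._prefix = payload.ljust(padded_len, b"\x00") + self._prefix

    def read(self, offset: int, length: int) -> bytes:
        """Read `length` bytes at `offset`; negative offsets reach the prefix."""
        start = offset + len(self._prefix)
        if start < 0:
            raise ValueError("offset is before the first prepended block")
        return (self._prefix + self._data)[start:start + length]


stream = PrependableStream(b"legacy payload")
stream.prepend(b"activation contract")

# A legacy reader still finds its data at offset 0.
print(stream.read(0, 6))
# A reader that knows about the prefix uses a negative offset.
print(stream.read(-BLOCK_SIZE, 19))
```

Because offset 0 never moves, an application that knows nothing about prepended blocks reads exactly the bytes it always did.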

Saturday, June 16, 2012

Synchronization

Didget Management is much more than just managing lots of Didgets within a given Chamber. It is about managing all the Didgets within a given user's Domain. Each Chamber within the global Didget Realm is a member of one and only one Domain. Since each Chamber within a user's Domain is probably located on a completely separate storage device there is a need to be able to manage the data across those devices.

Unlike file systems, the Didget Manager can perform operations against a set of Didgets without explicit commands from a running application. Policy Didgets created by the user can direct the Didget Manager to perform those operations automatically when certain events occur or when a specified amount of time has passed. Tasks like backup, replication, and synchronization can all be controlled using Policy Didgets.

One of the biggest challenges for existing applications that must try to synchronize data between two separate file system volumes today is in determining exactly which files are the same and which are different. If each volume has a large number of files, this task can also take a very long time. Even if two files have the same name, extra metadata and even the full contents of the data stream must be checked to make sure there are no differences between them. The challenge is even harder if most of the files are the same, but located in different folders on each system.

For example, suppose an application wanted to make sure two separate volumes both had the exact same copies of all photographs stored within them. It would need to first find every photograph in each volume and then compare it with each photograph in the other volume. If Volume A had some photographs that Volume B did not (or vice versa), then it would need to copy them. What should it do if all the pictures on Volume A were located under a /photos file folder hierarchy and all the pictures on Volume B were located under a /pictures folder? Should it synchronize by trying to replicate the folder structures or instead try to copy files to existing folders?

Synchronization between any two Chambers in the Didget Realm is almost trivial. The Didget Managers can quickly compare the two Chambers and find all the differences between them. The event counters and Marker Didgets discussed in an earlier post are tools the Didget Manager uses to figure out what has changed and in what order things have happened. Didgets can be copied between two Chambers without needing to worry about folder structures.

For example, two Chambers that each have a million Didgets in them can be compared in just a few seconds and a complete list of all new or modified Didgets since the last synchronization event can be generated. Following the synchronization policy (or policies), the Didget Manager can copy any changes between the two Chambers so that they are completely in sync with each other.
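One way to picture change detection with an event counter is sketched below. The details of event counters and Marker Didgets are in an earlier post and are not reproduced here, so this structure is a guess: each Chamber keeps a counter that increases on every change, each Didget remembers the counter value of its last change, and a marker remembers the counter value at the last synchronization. The sketch is one-way and does not deal with conflicting edits.

```python
class Chamber:
    """Hypothetical Chamber: maps Didget ID -> (event number, content)."""

    def __init__(self, name):
        self.name = name
        self.event_counter = 0
        self.didgets = {}

    def put(self, didget_id, content):
        # Every create or modify advances the Chamber's event counter.
        self.event_counter += 1
        self.didgets[didget_id] = (self.event_counter, content)

    def changed_since(self, marker):
        """IDs of Didgets created or modified after the given event number."""
        return sorted(i for i, (event, _) in self.didgets.items() if event > marker)


def sync_one_way(source, target, marker):
    """Copy changes made in `source` after `marker`; return the new marker."""
    new_marker = source.event_counter
    for didget_id in source.changed_since(marker):
        target.put(didget_id, source.didgets[didget_id][1])
    return new_marker


a, b = Chamber("A"), Chamber("B")
a.put(1, "photo-1")
a.put(2, "photo-2")
marker = sync_one_way(a, b, marker=0)

a.put(2, "photo-2 edited")
a.put(3, "photo-3")
pending = a.changed_since(marker)
print(pending)
marker = sync_one_way(a, b, marker)
```

Nothing here depends on names or folder paths: a Didget is identified by its ID, so there is no question of which folder the copy belongs in.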

Sunday, June 10, 2012

Public vs Private Data

Within the storage systems of any individual, small business, or large enterprise, there are two kinds of data: data that was created by the user(s) of that system and data that was created somewhere else and copied into that system.

In the Didget Realm, Didgets can be classified as either Public or Private. Public data is that which was "published" by its creators for public consumption. Examples of public data are songs, movies, books, and software. Often their creators want the consumers of such data to pay for the privilege. Private data, on the other hand, was created within the data domain of the creator for their own private consumption.

File systems have no way to distinguish between the two types of data. File1.doc may be a popular document that I downloaded off the Internet and I have one of a million copies. File2.doc may be my own personal document that I spent 50 hours working on and I have the only copy. (Of course, it would not be wise for me to work so many hours on a document without making backup copies, but every once in a while you hear about some student losing such a thing.) Using a file system, I cannot tell which type of data is contained within either of the two files.

The simple fact is that these two types of data should be treated differently. I want to make regular backups of my private data and take extra security measures to ensure that unauthorized access is prevented. If I lose some software I downloaded (public data), I can always replace it by just downloading it again. If I have a cloud backup solution, I don't want to use up all my bandwidth and storage space by pushing copies of a bunch of HD movies I downloaded instead of my important documents.

With Didgets, I can instantly see which data I have created and what I have copied from others. I can set policies dealing with replication, security, and backups based on those types. For example, I could have a policy that tells the Didget Manager to create two separate replicas of every private document I create.

Public Didgets are by default Immutable. This "Read-only" attribute prevents any changes to them thus preventing a virus from altering them and otherwise guarantees their integrity. If I want my own private copy of a Public Didget that I can alter, I need to copy its contents to a Private Didget. I can alter the Private Didget while keeping the original Public Didget intact.
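The rule that an immutable Public Didget can be cloned but never altered can be sketched as follows. The class and method names are hypothetical and the Didget is reduced to an ID, a flag, and a byte buffer. The point is only that the flag is one-way and that the clone gets its own ID.

```python
class ImmutableError(Exception):
    """Raised when code tries to alter an immutable Didget."""


class Didget:
    """Hypothetical Didget with only the fields needed for this sketch."""
    _next_id = 1

    def __init__(self, data: bytes, public: bool = False):
        self.didget_id = Didget._next_id
        Didget._next_id += 1
        self.public = public
        self._immutable = public   # Public Didgets are immutable by default
        self._data = bytearray(data)

    @property
    def immutable(self) -> bool:
        return self._immutable

    def make_immutable(self) -> None:
        # One-way switch: there is deliberately no method that clears it.
        self._immutable = True

    def write(self, offset: int, payload: bytes) -> None:
        if self._immutable:
            raise ImmutableError(f"Didget {self.didget_id} is immutable")
        self._data[offset:offset + len(payload)] = payload

    def read(self) -> bytes:
        return bytes(self._data)

    def clone_private(self) -> "Didget":
        """Copy the contents into a new, alterable Private Didget."""
        return Didget(self.read(), public=False)


song = Didget(b"original", public=True)
copy = song.clone_private()
copy.write(0, b"ORIGINAL")

print(song.read(), copy.read())
print(song.didget_id != copy.didget_id)
```

An application holding the original's ID keeps reading the untouched original, which is what stops an altered copy from being swapped in the way a same-named file can be.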

Tuesday, June 5, 2012

Tags, Tags, and More Tags

In the Didget Realm, every single Didget can have lots of tags attached to it. Tags are similar to extended attributes that have been added to some file systems. They are extra metadata that exists outside of an object's regular metadata and separate from its data stream. While tagging data is nothing new, the approach we take to implement tags with Didgets is very different from previous solutions like extended attributes or database tags.

Extended Attributes

File system extended attributes are simple Key:Value pairs. The key is a simple string without any specific context involved. Just as with file names, a file system will not attempt to interpret the meaning of a given key; it is just a simple lookup with no relationship between any two given keys. Likewise, a file system will not attempt to impose any restrictions on the value assigned to any given key other than making sure its length does not exceed any imposed limit.

File systems were not designed to allow fast, efficient searches for files based on the existence of extended attributes or based on any particular value assigned. For example, if an application wanted to find all the documents within a given file system volume that had the extended attribute "Author=John" attached to it, it would need to do a brute force search by finding every file with a document extension and examining each one individually to see if it had that particular extended attribute key and value. For a volume with a million or more files in it, such a search can be painfully slow.

Since many file systems do not support extended attributes and using them can be difficult, they are rarely used by applications. If a file with extended attributes is moved or copied to another file system, it is likely that the extended attributes will either be lost or altered in some way.

Database Tags

Some applications allow the user to tag data by storing information inside of a database managed exclusively by that application. Popular data management software like iTunes and Picasa use this technique to tag music and photos. These databases are not meant to be shared openly between applications and if a photo or music file is copied from one volume to another, the tags don't come with it. A user is only able to search based on the tags if the specific application supports it.

Didget Tags

Unlike these other approaches, our tags are designed to be widely used, shared, and searchable. Any application can use our simple API to get a list of tag definitions and attach tag values based on those definitions to any Didget. Any application can then find Didgets based on tags or add their own tags to make a Didget easier to find or manage.

Every tag within a Didget Chamber is defined using a simple schema. Once a tag is defined, any application can use any defined tag to attach a value to a Didget. If an application wants to use a tag that is not currently defined, it can quickly define a new tag which adds its definition to the schema. Applications can search for Didgets that have a certain defined tag attached to them or, more specifically, have a certain value assigned.

For example, an application can define a new tag ".person.Nickname" and then attach that tag with the value of "Bubba" to a photograph Didget. Another application can later query the Didget Manager for a list of all Photograph or Document Didgets that have ".person.Nickname = Bubba" attached. The Didget Manager would be able to process that query in just a few seconds even if there were 2 million Photograph Didgets and 3 million Document Didgets mixed in with 5 million other kinds of Didgets and all of the Didgets had some tags attached to them.

Likewise, applications could search for all Didgets that had any tag of category ".person". They could find a list of all Music Didgets where ".person.musician=Billy Joel" and ".date.year=1980". The Didget Manager is able to perform these lightning fast queries without needing a separate database or implementing a complicated query language.
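To show why such queries do not need a brute-force scan, here is a small sketch built on an inverted index. It is not the Didget Manager's API or its internal structure, which this post does not describe. The class and method names and the Didget IDs are made up, and filtering by Didget type (Music, Photograph) is left out to keep it short.

```python
from collections import defaultdict


class TagIndex:
    """Hypothetical inverted index: (tag key, value) -> set of Didget IDs."""

    def __init__(self):
        self._schema = set()
        self._by_pair = defaultdict(set)
        self._by_key = defaultdict(set)

    def define(self, key):
        """Add a tag definition to the schema."""
        self._schema.add(key)

    def attach(self, didget_id, key, value):
        if key not in self._schema:
            raise KeyError(f"tag {key!r} is not defined in the schema")
        self._by_pair[(key, value)].add(didget_id)
        self._by_key[key].add(didget_id)

    def find(self, criteria):
        """IDs of Didgets matching every key=value pair in `criteria`."""
        sets = [self._by_pair.get(pair, set()) for pair in criteria.items()]
        return set.intersection(*sets) if sets else set()

    def find_category(self, category):
        """IDs of Didgets carrying any tag whose key falls under `category`."""
        prefix = category + "."
        result = set()
        for key, ids in self._by_key.items():
            if key.startswith(prefix):
                result |= ids
        return result


index = TagIndex()
for key in (".person.musician", ".person.Nickname", ".date.year"):
    index.define(key)

index.attach(101, ".person.musician", "Billy Joel")
index.attach(101, ".date.year", "1980")
index.attach(102, ".person.musician", "Billy Joel")
index.attach(102, ".date.year", "1983")
index.attach(203, ".person.Nickname", "Bubba")

print(index.find({".person.musician": "Billy Joel", ".date.year": "1980"}))
print(sorted(index.find_category(".person")))
```

Each lookup touches only the Didgets that carry the requested tag, so the cost depends on the number of matches rather than on the total number of Didgets in the Chamber.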

Unlike file extended attributes, tags are not lost when a Didget is copied or moved to another Chamber. This is because all Chambers support the tags and because applications do not perform the actual copy operation. An application will initiate the copy operation by telling the Didget Manager to copy a Didget, but it is the Didget Manager itself that makes sure nothing is lost during the copy. You never have to worry that your tags will be lost because the application forgot to copy them.

Tags are powerful tools to help users and applications to add meaningful metadata to any or all of the Didgets within a Chamber to enable fast searches based on specific values and build lists or menus from the results.