Saturday, December 15, 2012

Cloud Based Solutions

When I bought my first computer back in 1986, I splurged for the 10 MB hard drive option. It cost nearly $800 and was incredibly slow by today's standards, but compared to the rest of my data storage (a handful of 5 1/4 inch floppy disks) it was a huge leap forward. That hard drive and my floppies together totaled less than 20 MB and comprised my entire data storage capacity.

As time went on, I replaced each of my storage devices with larger capacity and faster units. Sometimes when I bought a new device, it became a completely separate storage system instead of just replacing an existing one. Today, I have over 20 different storage devices (hard drives, flash drives and cards, NAS boxes, SSDs, and cloud buckets), each with a set of files stored on it. Total capacity is somewhere around 12 TB and I have a lot of data stored on them.

Having lots of separate storage devices is both good and bad. I have storage directly attached to many of the devices I am working on so I can access information even when the Internet is not accessible - good. I try to spread my data around and keep redundant copies or backups of important data in case any individual storage device fails or is lost or stolen - better. If I have the right procedures in place, I ultimately control all the data that I have stored - best of all.

But it can be difficult to figure out which of my many devices has a piece of data that I am looking for - bad. I have to remember to back up or replicate data that might be unique to any given device - worse. I might be on a trip and remember that the data I need is on a flash drive in a drawer at home - also worse. I might have multiple copies of a given piece of data and if I update one copy, I need to remember to update all the copies, otherwise I have multiple working sets of data that are not synchronized - worst of all.

Recent offerings by cloud storage providers such as Dropbox, Google Drive, SugarSync, or Amazon S3 have attempted to solve some of these problems and a few others. Unfortunately, they also introduce a number of problems or challenges of their own.

Keeping your data in the "Cloud" can be beneficial in many instances. Redundant copies or backups are handled automatically by the storage provider. The data can be accessed by nearly any device with an Internet connection. Storage capacity can be very flexible and grow to meet your storage needs without having to purchase new units and migrate your data. It is easy to share your data with others. All these features offer compelling reasons to put data in the cloud.

But cloud storage is currently much more expensive than just buying a new hard drive. If you have many terabytes of data, it can be incredibly expensive to store all that data in the cloud. Data transfer speeds can also be very slow when compared to local storage. Sometimes users experience extremely slow speeds when performing a backup or restore operation. Slow performance and costs make it critical to be able to eliminate large quantities of unimportant data from cloud backup or synchronization functions. Finding stuff stored in the cloud can also be a slow and difficult process. If you have a few million pieces of data stored in one of those cloud buckets, it might take quite a while to find one of them if you have forgotten its unique key name. Likewise, finding all pieces of data that meet some kind of specific criteria can also take a very long time.

The most troubling part of cloud storage seems to be a lack of control over your own data. If your only copy of a valuable piece of data is out in the cloud, you are completely dependent upon the cloud provider to make sure you have unimpeded access; that the data is free from corruption; and that it is secure from unauthorized access. Recently, even Steve Wozniak expressed great concern about the trend for individuals and businesses to store large amounts of their important data on a system controlled by someone else.

Personally, I think all the current cloud offerings represent a half-way solution. Universal access, flexible storage capacity, and automatic redundancy are great features. But I think the real, full solution is to have just a copy of important data (and only important data) stored in the cloud that is easily synchronized with other copies of that same data on local systems where the user has complete control.

This is one of the compelling features of the Didget Management System.

Thursday, December 6, 2012

Extreme Performance Demonstration

I created a third demonstration video of the Didget Management System in Action. This one shows how fast we can find things even when the number of Didgets gets very high.

See it at www.screenr.com/5Zx7

In this video I create nearly 10 million Didgets in a Chamber and automatically attach a set of tags to each one. Each tag has a value associated with it. I then perform queries against that Chamber for all Didgets of a certain type. Next, I perform an additional query for the Didgets that have a certain tag attached to them regardless of its value. Finally, I perform a couple of queries looking for Didgets that have that tag attached and also have a value that starts with an input string.
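
To make the three kinds of queries concrete, here is a minimal sketch in Python. It is only an illustration: the Didget class, the function names, and the sample tag values are made up for this example and are not the actual Didget Management System API, and a plain list scan says nothing about how the real system reaches its speed.

```python
from dataclasses import dataclass, field


@dataclass
class Didget:
    # Hypothetical stand-in for a Didget metadata record.
    didget_type: str
    tags: dict = field(default_factory=dict)  # tag name -> tag value


def by_type(chamber, didget_type):
    """Query 1: all Didgets of a certain type."""
    return [d for d in chamber if d.didget_type == didget_type]


def with_tag(chamber, tag):
    """Query 2: all Didgets with a tag attached, regardless of its value."""
    return [d for d in chamber if tag in d.tags]


def with_tag_prefix(chamber, tag, prefix):
    """Query 3: Didgets with the tag attached whose value starts with prefix."""
    return [
        d for d in chamber
        if tag in d.tags and str(d.tags[tag]).startswith(prefix)
    ]


# A tiny sample "Chamber" (assumed data, for illustration only).
chamber = [
    Didget("photo", {".event.Vacation": "Hawaii"}),
    Didget("photo", {".event.Vacation": "Havana"}),
    Didget("document", {".date.Year": 2012}),
    Didget("music"),
]

print(len(by_type(chamber, "photo")))
print(len(with_tag(chamber, ".event.Vacation")))
print(len(with_tag_prefix(chamber, ".event.Vacation", "Haw")))
```

Each query is a single pass over the metadata records, which matches the description of checking every Didget rather than consulting a separate index.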


Again, I was running this demonstration on the same low-end PC as in the previous two videos. If I were to attempt to find all the video files on my NTFS file system and if there were 10 million files on it, that query would take nearly an hour using a regular program calling the file API. With the Didget Management System, the slowest query took about 3 seconds.

Monday, December 3, 2012

Demo Video Part 2

I added another short video of a demonstration of tags used in the Didget Management System.

View at: www.screenr.com/fXd7

This video emphasizes the creation of tags and attaching them to a set of Didgets so that we can query based on them or create lists (e.g. Albums) from the query results.

Each Didget can have up to 255 different tags attached to it. There can be tens of thousands of different tags to choose from and each tag value can be a string, a number, a boolean, or other value type. We have a set of predefined tags such as .event.Holiday, .date.Year, and .activity.Sport but the user is free to define any additional tags and immediately begin attaching them to any Didget.
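
Here is a rough sketch of how typed tag definitions and the 255-tag limit could be modeled. The value types assigned to the predefined tags, the one-value-per-tag-name rule, and every function name below are assumptions made for this illustration; the real system may work differently.

```python
MAX_TAGS_PER_DIDGET = 255  # limit stated in the post

# Hypothetical tag definitions: tag name -> required Python type.
# The types chosen for the predefined tags are assumptions.
TAG_SCHEMA = {
    ".event.Holiday": str,
    ".date.Year": int,
    ".activity.Sport": str,
}


def define_tag(name, value_type):
    """Let the user add a new tag definition at any time."""
    TAG_SCHEMA[name] = value_type


def attach_tag(tags, name, value):
    """Attach a tag value to a Didget's tag dict after checking the schema."""
    if name not in TAG_SCHEMA:
        raise KeyError(f"tag {name!r} has not been defined")
    expected = TAG_SCHEMA[name]
    # Exact type check: bool is a subclass of int in Python, so
    # isinstance() would let True slip through as a number.
    if type(value) is not expected:
        raise TypeError(f"{name} expects {expected.__name__}")
    if name not in tags and len(tags) >= MAX_TAGS_PER_DIDGET:
        raise ValueError("a Didget can hold at most 255 tags")
    tags[name] = value


photo_tags = {}
attach_tag(photo_tags, ".date.Year", 2012)
define_tag(".place.City", str)  # a user-defined tag, usable immediately
attach_tag(photo_tags, ".place.City", "Honolulu")
```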

Attaching tags to Didgets and performing queries based on them work exactly the same way for photos, documents, music, videos, or any other type of Didget.

Sunday, December 2, 2012

Video Demonstration of our Browser

After much trial and error, I was finally able to capture a video of our Didget Browser in action. The video was limited to only 5 minutes, so I had to move fast and could only show a few features, but it gives a good demonstration of the speed at which we can query any given Chamber populated with lots of Didgets.

You can watch the video at: www.screenr.com/XV17

The Didget Browser was running on a Windows 7 PC and was created using the open-source, cross-platform GUI library called Qt. It can easily be ported to the Linux and Mac OS X operating systems. It sits on top of our Didget Management System using its API to perform much of its work.

The PC I used was a 3-year-old Gateway machine I bought at Costco for $500. It has an Intel Core 2 processor, 4 GB of DDR2 RAM, and a 750 GB HDD. This was not a high-end box even when I bought it, let alone now. If you are impressed with the speed at which we are able to perform queries and to display large lists of tag values, please keep in mind it is NOT due to speedy hardware.

Whenever we perform a query, we look at the metadata records for each Didget within the Chamber. This would be analogous to checking each inode in an Ext3 file system when querying files. The same is true whenever we refresh the contents of the Status Tab. We look at each and every Didget metadata record and tally up a total of all the different categories displayed.
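
The Status Tab tally can be pictured as one pass over the metadata records with a counter per category. This is only a sketch: the record layout (a dict with a "type" key) is an assumption for illustration, not the real on-disk format.

```python
from collections import Counter


def tally_categories(metadata_records):
    """Visit every metadata record once and count Didgets per category."""
    counts = Counter()
    for record in metadata_records:
        counts[record["type"]] += 1
    return counts


# Assumed sample records, for illustration only.
records = [
    {"type": "photo"},
    {"type": "photo"},
    {"type": "document"},
    {"type": "music"},
]
print(tally_categories(records))
```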

It is important to know that we do not have a separate database that we are querying like indexing services such as Apple's Spotlight or Microsoft's Windows Search do. Such databases can take hours to create and can easily become out of sync with the file metadata that they index.

Some of the query operations that we perform could be accomplished on a regular file system using command line utilities. For example, I can get a list of all .JPG files on my file system by entering the command:

 C:\>dir *.jpg /s

The main difference is that on that same machine with the 500,000 files, this command takes nearly 3 minutes to complete. If my NTFS volume had 3 million files on it, the same command would take approximately 20 minutes to complete. Using the Didget Browser, we are able to accomplish the same task in under ONE second. In fact, we can get a list of all the JPG Photo Didgets in under one second even if there are 25 million of them.
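
For comparison, here is roughly what "a regular program calling the file API" has to do for the same task. This is a generic Python sketch using the standard os.walk call; the function name is my own, and it is not the code that was timed above.

```python
import os


def find_by_extension(root, extension):
    """Walk every directory under root and collect files with the extension."""
    extension = extension.lower()
    matches = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            # Case-insensitive match, like "dir *.jpg /s" on Windows.
            if name.lower().endswith(extension):
                matches.append(os.path.join(dirpath, name))
    return matches


if __name__ == "__main__":
    print(len(find_by_extension(".", ".jpg")))
```

The walk has to open every directory in the tree to read its entries, so the cost grows with the total number of files and folders, not just with the number of matches.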

The difference in speed between our system and conventional file systems is even more pronounced when we must do more complicated queries. Try to find all .JPG photos in a file system that have two extended attributes attached with the key:values of Place=Hawaii and Event=Vacation. We can find all the Didgets with those two tags attached in just a couple of seconds. File systems (the ones that even support extended attributes) will require a very long time.
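
A combined query like this boils down to a type check plus a test that every required tag value is present. The sketch below assumes each metadata record is a dict with "type" and "tags" keys; those names and the sample data are illustrative only.

```python
def find_didgets(records, didget_type, required_tags):
    """Return records of the given type that carry ALL required tag values."""
    return [
        r for r in records
        if r["type"] == didget_type
        and all(r["tags"].get(name) == value
                for name, value in required_tags.items())
    ]


# Assumed sample data.
records = [
    {"type": "photo", "tags": {"Place": "Hawaii", "Event": "Vacation"}},
    {"type": "photo", "tags": {"Place": "Hawaii", "Event": "Wedding"}},
    {"type": "photo", "tags": {"Place": "Utah"}},
    {"type": "document", "tags": {"Place": "Hawaii", "Event": "Vacation"}},
]

hits = find_didgets(records, "photo", {"Place": "Hawaii", "Event": "Vacation"})
print(len(hits))
```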

Sunday, November 18, 2012

The Big Picture

So far, I have posted several blogs that explain certain pieces of the Didget Management System and how each feature adds specific benefits over conventional file system or database architectures. I thought I would devote this 20th blog to explaining the entire system once all the pieces are put together to give the reader an idea of how it will look once completed.

The Didget Realm represents a world-wide collection of individual Didget containers called Chambers. Each Chamber is managed by its own instance of the Didget Manager and together they represent a single node in this global data storage network. Each node can communicate with every other node to exchange Didget information. With the use of Policy Didgets, this information can be exchanged automatically without direct commands from a running application. Nodes can be grouped into domains or federations so that they can exchange even more information between them than can two nodes that are not in the same domain.

Each Chamber can store several billion individual Didgets. The system is designed to effectively manage huge numbers of Didgets without sacrificing speed. Simple queries to a Chamber with over 10 million Didgets in it are designed to execute in under one second. Even the most complex queries are designed to execute in under ten seconds when the Didget Manager is running on a single desktop system. For Chambers with hundreds of millions or with billions of Didgets, the Chamber can be split into many individual pieces and managed by lots of separate systems in a distributed environment to perform lightning fast queries using map-reduce algorithms.
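
The map-reduce idea for a split Chamber can be sketched in a few lines. Here each "piece" is just a Python list and the workers are threads in one process; in a real deployment the pieces would live on separate systems. All names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def query_piece(piece, predicate):
    """Map step: run the query against one piece of the Chamber."""
    return [didget for didget in piece if predicate(didget)]


def query_chamber(pieces, predicate):
    """Query every piece in parallel, then merge the partial results."""
    with ThreadPoolExecutor() as pool:
        partials = pool.map(lambda piece: query_piece(piece, predicate), pieces)
        # Reduce step: concatenate the per-piece results in piece order.
        results = []
        for partial in partials:
            results.extend(partial)
    return results


# Assumed sample data: three pieces of one Chamber.
pieces = [
    [{"type": "photo"}, {"type": "music"}],
    [{"type": "photo"}],
    [{"type": "document"}, {"type": "photo"}],
]
photos = query_chamber(pieces, lambda d: d["type"] == "photo")
print(len(photos))
```

Because the caller only sees query_chamber, it does not need to know how many pieces exist, which is the same transparency the post describes for applications.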

A Chamber that has been converted to a distributed system looks exactly the same to an application or to another node in the global network, as does a Chamber that has not been split into several pieces and distributed. In other words, applications do not need to know if they are communicating with a single piece Chamber running on a laptop computer or if they are communicating with a Chamber that has been split into 100 different pieces and managed by 1000 different servers. The only difference will be the speed at which a query or other command may execute when the number of Didgets in the Chamber is extraordinarily large.

Using Policy Didgets and Security Didgets, operations against all the Didgets within a Chamber can be tightly controlled. Sensitive information can be protected and a whole host of data management functions can happen automatically when either a certain amount of time has expired or when certain events happen.

Individual Didgets can be classified, tagged, and grouped together in ways files or database rows never could. Copying or moving a Didget from one Chamber to another does not cause it to lose any of its metadata or to become any less secure than the original. Special attributes can be assigned to each Didget that enable it to be managed by the Didget Manager in very specific ways. Several of these attributes represent unique features that I have not seen on any other system.

Applications can query for a set of Didgets based on any of these metadata fields and perform operations against the whole set (if permissions allow).

Didgets can represent either structured or unstructured data. All the management functions work the same, regardless of the data type. Didgets can be accessed using file-like APIs or database-like queries.

Inventory, search, backup, recovery, synchronization, organization, version control, and licensing are just a few of the management functions that are provided by the system. In every case, the functions will perform faster and with simpler mechanisms than with conventional systems.

In summary, I think this system offers a data management environment far superior to conventional file systems or NoSQL database environments. Once data is created as Didgets (or converted from legacy systems) it will be far easier to manage and provide significantly greater value to the end user than it would be as files or as database rows.

The Didget Management System will revolutionize the way the whole world looks at data going forward. (You heard it here first!)

Saturday, November 17, 2012

Structured vs Unstructured Data

Persistent data seems to fall into one of two categories. 1) Structured Data (like cells in a spreadsheet or a row/column intersection in a database table) that must adhere to some fairly strict rules regarding type, size, or valid ranges; or 2) Unstructured Data like photos, documents, or software where the data can be much more free-form.

Databases are well equipped to handle structured data but generally do a poor job of managing large amounts of unstructured data (or blobs in database speak). File systems, on the other hand, were designed for large numbers of unstructured data wrapped in a metadata package called a file, but generally do a poor job of trying to handle structured data (although technically, databases themselves are almost always stored as a set of files in a file system volume).

When I first designed the Didget Management System, I concentrated solely on improving the handling of unstructured data. It was designed to be a replacement for file systems. Databases could be stored in a set of Didgets just as easily as in a set of files, but I planned to largely ignore structured data the way file systems do.

But with the introduction of the Didget Tags, I had to figure out how to handle large amounts of structured data as part of Didget metadata since each tag is defined with a schema and each tag value must adhere to this definition. I had to be able to assign each Didget a bunch of tags and then make it so I could query against the whole set of Didgets based on specific tag values. For example, "Find all Photo Didgets where .event.Vacation = Hawaii" would need to return a list of all photos that had been assigned this tag value. This feature is strikingly similar to executing an SQL query against a relational database.

I still didn't make the connection of how this feature could add a whole new dimension to the Didget Management System until one of the programmers helping me with this project pointed out how similar a Didget is to a row in a NoSQL database table. In fact, the entire Didget Chamber could be thought of as a huge table of columns and rows where every column is a tag and every row is a Didget. In our system there can be tens of thousands of different tags defined (columns) and billions of Didgets (rows). Each Didget can have up to 255 different tag/value assignments.

Since each Didget can also have a data stream assigned to it, this data stream could be thought of as just another column in the table (although it is a very special column in that its contents are not defined in a schema and its value can be unstructured and up to 16 TB in length). The fields of the Didget metadata record, likewise, could be thought of as special columns in this huge table. We can query based on Didget type, stream length, event stamps, attributes, and the like.

What this means is that every Didget could be treated kind of like a file or kind of like a row in a database. Applications can perform operations against a set of Didgets using an API that is very file oriented or by using one more familiar to database operators.

Since the Didget Management System was designed to scale out by breaking a single chamber into multiple pieces and distributing them across a set of servers (local or remote), it could compete directly against large distributed NoSQL systems like CouchDB, MongoDB, Cassandra, or BigTable just as easily as it could against Hadoop in the distributed file system arena.

Companies or individuals that work with large amounts of "Big Data" would no longer need two separate systems, one to handle their unstructured data and another to handle their structured data. With the Didget Management System, all their data (structured and unstructured) could be handled in a single distributed system and managed with the same set of tools and policies.

Monday, November 12, 2012

Policies

In the conventional file system world, file systems treat all files like black boxes and almost never perform any direct manipulation of files. If any file is created, modified, moved, or deleted it is done as a direct command from either the operating system or an application. All file management functions such as organization, backup, synchronization, or cleanup are performed by something other than the file system itself.

In the Didget system, many of these management tasks can also be performed by the Didget Manager independent of another running program. Programs can schedule specific tasks to execute at specific times or when certain events occur with the use of Policy Didgets. These Didgets are somewhat similar to database triggers. They can cause the Didget Manager to manipulate data even while the application that scheduled the task is no longer available to the system.

Just like all the other Didgets in the system, Policy Didgets can be created, protected, queried, synchronized, and deleted. They can have tags attached to them to help in finding or organizing different policies. They can have a data stream that contains specific instructions or program extensions or that logs results as the policy executes. Just about any conceivable data management function could be implemented or at least facilitated using these special Didgets.

For example, an application could create a policy that automatically adds any new photos with a .event.Vacation tag to a List Didget called "Vacation Photo Album". At the same time it could search for another list Didget with a name matching the tag value (e.g. if .event.Vacation = "Hawaii" then it would look for a list where .didget.Name = "Hawaii Photo Album") and either add it to the existing list or create a new list if it did not exist and then add it.
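
The album example can be sketched as a callback that the manager runs whenever a new Didget is created. The Chamber class, the policy signature, and the dict-based records below are all invented for this illustration and are not how Policy Didgets are actually stored or executed.

```python
class Chamber:
    """Hypothetical in-memory stand-in for a Chamber and its manager."""

    def __init__(self):
        self.lists = {}     # list name -> names of member Didgets
        self.policies = []  # callables run for every new Didget

    def add_policy(self, policy):
        self.policies.append(policy)

    def create_didget(self, name, didget_type, tags):
        didget = {"name": name, "type": didget_type, "tags": tags}
        # The manager, not the application, runs the policies.
        for policy in self.policies:
            policy(self, didget)
        return didget


def vacation_album_policy(chamber, didget):
    """Add new vacation photos to a general album and a per-place album."""
    place = didget["tags"].get(".event.Vacation")
    if didget["type"] != "photo" or place is None:
        return
    chamber.lists.setdefault("Vacation Photo Album", []).append(didget["name"])
    # Reuse the per-place list if it exists, otherwise create it.
    chamber.lists.setdefault(f"{place} Photo Album", []).append(didget["name"])


chamber = Chamber()
chamber.add_policy(vacation_album_policy)
chamber.create_didget("IMG_001", "photo", {".event.Vacation": "Hawaii"})
chamber.create_didget("notes", "document", {".event.Vacation": "Hawaii"})
print(chamber.lists)
```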

In another example, an application could create a policy that would automatically back up all new or modified Private Didgets to a chamber located in the cloud every Monday morning. This would create an incremental backup of everything the user created on that system during the week.

In yet another example, an application could create a policy that automatically synchronized all new photos and documents with a chamber located on a phone every time the phone was connected to the desktop.

Policy Didgets could be built and maintained to enforce company policies governing data protection, retention, and validation. Entire workflow systems could be driven by carefully crafted Policy Didgets by having data created, tagged, and organized as each step in the workflow progresses.

Saturday, November 3, 2012

How Does it Scale

The Didget Manager is designed to perform a variety of data management functions against a set of storage containers that may be attached to a single system or spread across several separate systems.

These functions include:

1) Backup
2) Synchronization
3) Replication
4) Inventory
5) Search
6) Classification
7) Grouping
8) Activation (licensing)
9) Protection
10) Archiving
11) Configuration
12) Versioning
13) Ordering
14) Data Retention

In order to properly perform each of these functions, a system is needed that can operate against all kinds of data sets consisting of structured and/or unstructured data, from very small sets to extremely large sets (i.e. "Big Data"). A legitimate question for any system is "How does it scale?"

When it comes to the term "Scale", I define it in three dimensions - "Scale In", "Scale Out", and "Scale Up".

"Scale In" refers to the ability of the system's algorithms to properly handle large amounts of data within a single storage container given a fixed amount of hardware resources on a single system. File Systems have a limited ability to scale in this manner. For example, the NTFS File System was designed to hold just over 4 billion files in a single volume. However, each file requires a File Record Segment (FRS) that is 1024 bytes long. This means that if you have 1 billion files in a volume, you must read approximately 1 TB of data from that volume just to access all the file metadata. If you want to keep all that metadata in system memory in order to perform multiple searches through it at a faster rate, you would need to have a TB of RAM. Regular file searches through that metadata can also be painfully slow even if all the metadata is in RAM due to the antiquated algorithms of file system design.

The Didget system was designed to handle billions of Didgets and perform fast searches through the metadata even when limited RAM is available. If the same 1 billion files had been converted from files to Didgets, the system would only need to read 64 GB of metadata off the disk and have 64 GB of RAM to keep it in system memory. This is only 1/16 of the requirements needed for NTFS. Searches through that metadata would be hundreds of times faster than with file systems.
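
The arithmetic behind these numbers is easy to check. The 1024-byte File Record Segment is stated above; the 64 bytes per Didget is not stated directly and is inferred here from 64 GB divided by 1 billion Didgets.

```python
FILES = 1_000_000_000        # 1 billion files or Didgets
NTFS_RECORD_BYTES = 1024     # File Record Segment size stated in the post
DIDGET_RECORD_BYTES = 64     # inferred: 64 GB / 1 billion Didgets

ntfs_total = FILES * NTFS_RECORD_BYTES
didget_total = FILES * DIDGET_RECORD_BYTES

print(ntfs_total / 10**12)         # metadata size in TB -> 1.024
print(didget_total / 10**9)        # metadata size in GB -> 64.0
print(ntfs_total // didget_total)  # NTFS needs this many times more -> 16
```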

"Scale Out" refers to the ability of the system to improve performance by adding additional resources and performing operations in parallel. This can be accomplished in two ways. Multiple computing systems can operate against a single container, or a single container can be split into multiple pieces and distributed out to those systems. Hadoop is a popular open-source distributed file system that spreads file data across many separate systems in order to service data requests in parallel. It has a serious limitation in that file metadata is stored on a single "NameNode". This has both availability and performance ramifications. It was designed more for smaller sets of extremely large files rather than for extremely large sets of smaller files. Most of the other traditional file systems were never designed to either operate in parallel or to be split up.

The Didget system was designed for both kinds of parallel processing. Multiple systems can operate largely in parallel against a single container since all the metadata structures were designed for locking at the block level. When a system needs to update a piece of metadata, it does not need to establish a "global lock" on the container. It only needs to lock a small portion of the metadata where the update is applicable. This means that thousands of systems can be creating, deleting, and updating Didgets within a single container at the same time. Each container was also designed to be split up and distributed across multiple systems. Both the data streams and the Didget metadata can be split up and distributed. Map-Reduce algorithms are used to query against many of these container pieces in parallel.
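
Block-level locking can be illustrated with one lock per metadata block instead of a single lock for the whole container. This is a single-process analogy using threads; the real design coordinates separate systems, and the class and its layout are assumptions made for the sketch.

```python
import threading


class MetadataStore:
    """Hypothetical container whose metadata is locked one block at a time."""

    def __init__(self, num_blocks, records_per_block):
        self.records_per_block = records_per_block
        self.blocks = [dict() for _ in range(num_blocks)]
        # One lock per block: there is no container-wide "global lock".
        self.locks = [threading.Lock() for _ in range(num_blocks)]

    def _block_index(self, record_id):
        return (record_id // self.records_per_block) % len(self.blocks)

    def update(self, record_id, value):
        index = self._block_index(record_id)
        # Writers touching different blocks never wait on each other.
        with self.locks[index]:
            self.blocks[index][record_id] = value

    def get(self, record_id):
        index = self._block_index(record_id)
        with self.locks[index]:
            return self.blocks[index].get(record_id)


def writer(store, start, count):
    for record_id in range(start, start + count):
        store.update(record_id, record_id * 2)


store = MetadataStore(num_blocks=8, records_per_block=16)
threads = [
    threading.Thread(target=writer, args=(store, start, 100))
    for start in range(0, 400, 100)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(sum(len(block) for block in store.blocks))
```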

"Scale Up" refers to the ability of a single management system to manage data from small sets on simple devices to extremely large data sets on very complex hardware systems. Most data management systems today don't scale up very well. For example, backup programs that work well on pushing data from a single PC to the cloud do not generally work well as enterprise solutions. Users typically need separate data management systems for their home environment and for their work environment. As a business grows from a small business to a medium sized business to a large enterprise, it often must abandon old systems and adopt new systems as its data set grows.

The Didget system was designed to work essentially the same whether it is managing a set of a few hundred Didgets on a mobile phone or it is managing billions of Didgets spread across thousands of different servers. Additional modules may be required and enhanced policies would need to be implemented for the larger environment to function effectively, but the two systems would function nearly identically from the user's (or application's) point of view. Applications that use the Didget system to store their data would not need to know which of the two environments was in play.

Saturday, September 1, 2012

Configuration Didgets

Remember the good ol' days when configuration on a Windows PC (or DOS in those days) meant that you had a simple text file in the same directory as your application that controlled the behavior of that application? The file was given an extension of .INI and was easy to read and to edit. When you uninstalled your application (using del *.*), the configuration file was cleaned up along with all your other application files.

Unfortunately, this approach also had a number of drawbacks. If you had 1000 applications, you might also have 1000 little configuration files spread all over your folder hierarchy. They were difficult to find and edit when you wanted to manage a whole bunch of applications at once.

Microsoft's answer to this problem was to create a central database called the Registry where all the configuration settings for the system and user applications could be stored. Unfortunately, this approach also had a number of drawbacks. If this single Registry was deleted or corrupted then everything was a mess; if an application was uninstalled, it didn't always clean up after itself in the Registry; it wasn't always obvious where all the keys for a particular application were stored within the new hierarchy of this database; and there was no way for an application to protect its configuration settings against unauthorized changes.

While several steps have been taken to keep the Registry from being corrupted and to be able to recover to a consistent state in the event something goes wrong, the Registry continues to be a bit of a headache when it comes to managing software. Special programs written to help clean up problems with the Registry have become more popular in recent times.

With Didgets, we take a different approach. Just like the old .INI files, each application can have its own set of configuration settings stored in one or more Configuration Didgets. Just like all the other Didgets in our system, you can get a list of all these Configuration Didgets in just a second or two even if there are millions of them.

Each Configuration Didget has some special fields designating what type of software they are used to configure so you can narrow your query to only look for the ones that configure word processors, for example.

Just like the .INI files, if one of these Didgets becomes corrupt, none of the others are affected. Each Configuration Didget can be protected with the Read-Only attribute or with security keys.

Just like the editing tool for the Registry (regedit), a configuration viewer/editor can be built to give the user a unified view to a whole host of Configuration Didgets. It can do this by consolidating all the data from each individual Didget into a single virtual view. Any changes made in the editor would not be made to a central database, but rather made directly into the Configuration Didget where the change was made.

This is like having a word processor display a document where every page in the document was stored in a separate file. Any changes to one of the pages would be made to just the file that held that page. To the user, it looks like a single file but it's really just a unified view of a whole bunch of separate files.
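
A unified editor of this kind could be sketched as a thin view over many separate settings stores. In the sketch below each Configuration Didget is modeled as a plain dict keyed by application name, and the "app/key" path format is my own invention for the example.

```python
class ConfigView:
    """Hypothetical unified view over many Configuration Didgets."""

    def __init__(self, config_didgets):
        # config_didgets: application name -> that application's settings.
        # Each inner dict stands in for one Configuration Didget.
        self.config_didgets = config_didgets

    def items(self):
        """Yield every setting from every Didget as one flat listing."""
        for app, settings in sorted(self.config_didgets.items()):
            for key, value in sorted(settings.items()):
                yield f"{app}/{key}", value

    def set(self, path, value):
        """Write a change straight into the one Didget that owns the key."""
        app, key = path.split("/", 1)
        self.config_didgets[app][key] = value


view = ConfigView({
    "editor": {"font": "Courier"},
    "player": {"volume": 5},
})
view.set("player/volume", 7)
for path, value in view.items():
    print(path, value)
```

There is no central store behind the view, so damage to one application's settings cannot spread to another's.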

Sunday, July 8, 2012

Digital Rights Management - Part 1 (Design)

Let me start off by saying that Digital Rights Management (DRM) implementations are generally despised by many users, myself included. If you don't believe me just Google "DRM" and "Stinks", "Sucks", or other appropriate negative word and you will get plenty of hits. The technical press is full of stories about Draconian measures, discontinued services, and software implementations that more closely resemble malware than anything else. In short, many implementations do little to stop piracy but in the attempt, tend to aggravate legitimate customers.

Although I don't like it, I understand the reason for it. Content owners who deliver popular movies, music, software, and books often lose lots of money when their stuff is widely pirated. (Although I don't buy their argument that every pirated copy is a lost sale.) I have worked for software companies where we estimated that there were in excess of 10 illegal copies of our stuff for every one we sold. When such conditions exist, it is perfectly understandable that measures are often taken to try and prevent it.

The main problem is that everyone seems to take a different approach, and most of the implementations are bad. Legitimate customers of digital content are often faced with several dozen techniques to activate their operating systems, application software, and the various forms of digital media content. License restrictions are often hidden deep within some "End User License Agreement" that was written by lawyers for lawyers. Some activations require dongles, constant Internet access, credit cards, or subscription services. The user may need a dozen different UserName/Password combinations to keep track of all their stuff.

Even the user who is willing and able to jump through all the hoops necessary to get legitimate copies of everything on their system will find it difficult to remain legal or discover what is legal after the fact. Just try and browse through all the files on a large hard drive and figure out what is legal and what is not. If the computer breaks, can you legally transfer your stuff to a replacement computer? If you buy a second computer, how much of the stuff you purchased for the first one can be shared with the second one without an additional license purchase? If you upgrade hardware, operating systems, or change services is the stuff you previously purchased still legal? Can you make backup copies without violating the terms of the contract?

The average user often gets completely lost in the maze and ends up with either illegal stuff or simply never purchases in the first place because the terms were never clear. Staying legal is a huge headache for businesses and individuals.

Users are often left out in the cold when their subscription service goes out of business or the content owner disables a necessary Internet server that enables legally purchased content to continue to be accessed. Some license agreements and software implementations are way too restrictive and you often have to purchase something before you can even figure out what you are buying.

I could go on all day and cite examples of DRM implementations that aggravated me personally or someone I knew, but let me just say that I have yet to see a version that I have liked.

When I designed the Didget Management System, content protection and activation were built into the core architecture. They are purely optional features. The average user can set up a personal Didget Domain with several Chambers and use millions of Didgets without ever wanting to activate any restricted content, but if they choose to, the features are there to support it.

When designing the features, I had to take into consideration a number of factors. I decided that if the features were to gain acceptance and be widely used they had to meet the following design goals.

1) The implementation has to work. Content owners will not release their stuff using this system if it doesn't protect the data from unauthorized access in the vast majority of cases. No implementation is perfect and given enough resources, some people will try to figure a way around its protections, but it has to be effective in 95%+ of the cases.

2) The system must make it extremely easy for the end user to figure out what has already been activated, what is available for activation, and what are the exact terms for each individual activation.

3) It has to provide a single activation process that allows for multiple payment methods. The end user must be able to activate software or a book using the same technique he used to activate his music or a movie. He should be able to pay for each activation using cash, a credit card, or some kind of account.

4) The system must provide flexible terms for activation so that content owners can provide a variety of ways to access their wares. One time use, unlimited use, limited term (e.g. 24 hours or one month), or a set number of accesses (e.g. 100 uses) are all examples of ways a merchant and their customers may want to conduct business for digital content.

5) The system must provide ways for content owners to allow existing customers to upgrade for a reduced price. It must be able to verify that the customer has a legitimate version that qualifies for the upgrade.

6) The system must provide ways for the customer to purchase content without ever revealing their identity to the merchant. The customer needs the option of an anonymous purchase using cash or an account where the account manager will see that funds are given to the merchant without purchaser information.

7) Any activations must result in the content being accessible for the full term of the contract without any further actions by the merchant. An Internet server cannot be required. Internet access cannot be required. A subscription service does not need to be current.

8) All activations must be valid for a set number of devices. When a user buys a song or a movie, it must play on all his devices without further activations. A simple synchronization is all that should be necessary to share or transfer access rights from one device to another. This mechanism must not work if the device is not one of the user's, however.

9) There are two ways most users are able to get access to restricted content - pay for it directly or get someone else to pay on your behalf (e.g. advertisers). Our system must enable both methods for activation.
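To make goal 4 a little more concrete, here is a minimal Python sketch of how flexible activation terms might be represented and checked. It is only an illustration under my own assumptions: the names `ActivationTerms`, `Activation`, `max_uses`, and `duration_seconds` are hypothetical and do not come from the Didget Manager itself.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ActivationTerms:
    """Hypothetical terms record; None means 'no limit' on that dimension."""
    max_uses: Optional[int] = None          # e.g. 1 for one-time use, 100 for a use pack
    duration_seconds: Optional[int] = None  # e.g. 24 * 3600 for a one-day rental


@dataclass
class Activation:
    terms: ActivationTerms
    activated_at: float  # seconds, on whatever clock the caller trusts
    uses: int = 0

    def may_access(self, now: float) -> bool:
        """Return True if the terms still allow access at time `now`."""
        t = self.terms
        if t.duration_seconds is not None and now - self.activated_at >= t.duration_seconds:
            return False  # limited term has expired
        if t.max_uses is not None and self.uses >= t.max_uses:
            return False  # all permitted accesses are used up
        return True

    def access(self, now: float) -> bool:
        """Record one access if the terms allow it."""
        if not self.may_access(now):
            return False
        self.uses += 1
        return True
```

With both limits left as None the activation is unlimited; setting one or both gives the one-time, limited-term, and set-number-of-accesses cases from the list.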

My next post will describe our implementation and how it meets the requirements listed above.

Sunday, July 1, 2012

Didget Attributes

In most file systems each file or directory can be assigned a few attributes by applications either during file creation or at a later time. Directories are given the "Directory" attribute. Hidden files are given the "Hidden" attribute and static files are given the "Read-Only" attribute.

It is important to note that each of these attributes is just a mechanism to hint to any application how the file should be treated. Applications can ignore these attributes or change them at any time so they may not accurately reflect the user's wishes for the file or provide any meaningful security for the file stream data or file metadata.

In the Didget world, Didgets may also be assigned a number of special attributes that can be used to identify, search, or perform operations against any Didget. Some of them are like file attributes in that they are merely hints to applications and can be changed at will. Others provide meaningful protection and additional capabilities since an operating system or application cannot change them directly.

Didgets have 32 separate attributes. Some of them provide features that I have not seen anywhere else before. I will enumerate and explain each of them.

1) Prepended. Didgets have the unique ability to add additional data to the byte stream before the first data byte. Data must be prepended in 4096 byte chunks (the block size). Bytes in these prepended blocks can only be accessed using negative offsets. Byte 0 remains the traditional start of the file so that prepending data will not affect legacy applications. This allows extra metadata to be added to any given byte stream without worrying about breaking compatibility with an application that is not designed to handle it.

2) Versioned. The Didget Manager has been designed to handle versioning of individual data streams. Unlike traditional Copy On Write (COW) file systems that are designed to version everything, the versioning capability in our system can be restricted to a small subset of Didgets. Didgets can have this attribute added or deleted at any time (with proper access rights) so you can turn versioning on or off for a single Didget or a whole group of Didgets. Snapshots can be taken any time the versioning is enabled.

3) Metered. This attribute is a critical piece of our "Digital Rights Management" capabilities. As a side note: I think DRM is generally a dirty word since it has been implemented so poorly (technically and administratively) in so many cases. Any Didget can be classified as "Metered" when it is published by the content owner to become a Public Didget. The terms for activation are clearly spelled out in the activation contract that is prepended to the data stream. Anyone who agrees to the terms can activate any Didget using the exact same set of activation procedures. This means that the process to activate music, movies, software, and books is exactly the same. I will address our whole new activation system in a later post.

4) Point Generator. Metered Didgets are activated using "Media Points". These points can be either bought or earned. Users are able to earn points by accessing Didgets with this attribute. Advertisers can produce digital content (i.e. advertisements) that a user can view or interact with to earn points that can in turn be spent towards any kind of other media.

5) Deleted. When a Didget is deleted, it is assigned this attribute (similar to moving a file to the trash bin). Deleted Didgets can be recovered until they are purged from the system. Purging requires special user rights so an application can delete Didgets but not destroy them.

6) Encrypted. This is just a hint to any application accessing the data that it has been encrypted. The application must be able to decrypt the data in order to use it.

7) Compressed. Just like the Encrypted attribute only for compression.

8) Sparse. Data streams can contain holes. Any Didget with a sparse data stream will have this attribute set.

9) Immutable. Data streams can be set with this "Read-Only" attribute to protect them from alteration. Public Didgets have this attribute set by default. Once this attribute is set, it cannot be cleared. Once immutable, always immutable. If you need a copy that is alterable, you can clone it into another Private Didget and change the copy all you want, but the original remains intact. Since Didgets are accessed through their Didget IDs, you can't fool an application into reading your altered copy like you can with files by simply replacing a read-only file with an altered file with the same name.

10) Appendable. Immutable Didgets cannot have their existing data streams altered. However, with this attribute, additional data can be appended to the end of the data stream. Used in combination, it will be popular for logs that want new data added without the ability to change data previously written.

11) Self-Destruct. Any Didget with this attribute will be automatically deleted and purged from the system by the Didget Manager when the conditions for destruction have been met. This can be a specified period of time or a number of accesses. This will allow users to activate (e.g. rent) content for a specified period of time. When the period for activation has passed, the Activation Didget will be automatically destroyed and the permission to access its Metered Didget with it.

12) Multiple Tags. This is a system attribute maintained by the Didget Manager. It is set when a Didget has two or more tags with the same key attached. For example, a photograph of three people may have three ".person.First Name" tags attached, each with a value corresponding to the first names of each person in the photograph.

13) Single Copy. Didgets with this attribute are deleted and purged from the system when they are copied. This creates a software "Dongle" mechanism that enforces a single copy of any given Didget within the system.

14) Disposable. This attribute is somewhat similar to temporary files. Didgets with this attribute can have the space occupied by their data stream confiscated by the system when disk space runs out. An application does not need to come clean them up when disk space is low. This allows the user to fill up their disk with lots of HD video that they may never view without worrying that it will result in an "Out of Disk Space" error. As long as the space is not needed, the video is accessible. Backup policies can completely ignore disposable data.

15) Activated. Metered Didgets that have been activated by the user will have this attribute set. It is not a security mechanism since other measures are checked to ensure that the activation is valid, but it is a quick way to see what has been activated and what has not.

16) Quarantined. Didgets that have yet to be scanned for viruses or other malware can have this attribute set. It may result in a warning to the user when it is accessed. (This can also be controlled through policies.)
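The interplay between the Immutable and Appendable attributes described above can be illustrated with a short Python sketch. This is not the Didget Manager's code: the `Attr` flag class, the bit positions, and the toy `Didget` class are all assumptions made purely for illustration.

```python
from enum import IntFlag


class Attr(IntFlag):
    """Hypothetical bit assignments; the real attribute layout is not given here."""
    IMMUTABLE = 1 << 8
    APPENDABLE = 1 << 9


class Didget:
    """Toy in-memory stand-in for a Didget's data stream and attribute word."""

    def __init__(self, data: bytes = b""):
        self._data = bytearray(data)
        self._attrs = Attr(0)

    @property
    def attrs(self) -> Attr:
        return self._attrs

    def set_attr(self, attr: Attr) -> None:
        self._attrs |= attr

    def clear_attr(self, attr: Attr) -> None:
        # Once immutable, always immutable: refuse to clear that bit.
        if attr & Attr.IMMUTABLE and self._attrs & Attr.IMMUTABLE:
            raise PermissionError("Immutable cannot be cleared once set")
        self._attrs &= ~attr

    def write(self, offset: int, payload: bytes) -> None:
        # Overwriting existing bytes is never allowed on an immutable stream.
        if self._attrs & Attr.IMMUTABLE:
            raise PermissionError("data stream is immutable")
        self._data[offset:offset + len(payload)] = payload

    def append(self, payload: bytes) -> None:
        # Appending to an immutable stream requires the Appendable attribute.
        if self._attrs & Attr.IMMUTABLE and not self._attrs & Attr.APPENDABLE:
            raise PermissionError("immutable stream is not appendable")
        self._data += payload

    def read(self) -> bytes:
        return bytes(self._data)
```

A log Didget would set both attributes: new entries can be appended, but nothing already written can be changed.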

Saturday, June 16, 2012

Synchronization

Didget Management is much more than just managing lots of Didgets within a given Chamber. It is about managing all the Didgets within a given user's Domain. Each Chamber within the global Didget Realm is a member of one and only one Domain. Since each Chamber within a user's Domain is probably located on a completely separate storage device there is a need to be able to manage the data across those devices.

Unlike file systems, the Didget Manager can perform operations against a set of Didgets without explicit commands from a running application. Policy Didgets created by the user can direct the Didget Manager to perform those operations automatically when certain events occur or when a specified amount of time has passed. Tasks like backup, replication, and synchronization can all be controlled using Policy Didgets.

One of the biggest challenges for existing applications that must try to synchronize data between two separate file system volumes today is in determining exactly which files are the same and which are different. If each volume has a large number of files, this task can also take a very long time. Even if two files have the same name, extra metadata and even the full contents of the data stream must be checked to make sure there are no differences between them. The challenge is even harder if most of the files are the same, but located in different folders on each system.

For example, suppose an application wanted to make sure two separate volumes both had the exact same copies of all photographs stored within them. It would need to first find every photograph in each volume and then compare it with each photograph in the other volume. If Volume A had some photographs that Volume B did not (or vice versa), then it would need to copy them. What should it do if all the pictures on Volume A were located under a /photos file folder hierarchy and all the pictures on Volume B were located under a /pictures folder? Should it synchronize by trying to replicate the folder structures or instead try to copy files to existing folders?

Synchronization between any two Chambers in the Didget Realm is almost trivial. The Didget Managers can quickly compare the two Chambers and find all the differences between them. The event counters and Marker Didgets discussed in an earlier post are tools the Didget Manager uses to figure out what has changed and what order things have happened. Didgets can be copied between two Chambers without needing to worry about folder structures.

For example, two Chambers that each have a million Didgets in them can be compared in just a few seconds and a complete list of all new or modified Didgets since the last synchronization event can be generated. Following the synchronization policy (or policies), the Didget Manager can copy any changes between the two Chambers so that they are completely in sync with each other.
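The idea can be sketched in a few lines of Python. This is a simplified model, not the real Didget Manager: the `Chamber` class, the `synchronize` function, and the rule of handing conflicts back to the caller are all my assumptions, and a plain dictionary scan stands in for whatever indexed lookup the real system uses.

```python
from dataclasses import dataclass, field


@dataclass
class Chamber:
    """Toy Chamber: maps Didget ID -> (last-modified event number, data)."""
    event_counter: int = 0
    didgets: dict = field(default_factory=dict)

    def put(self, didget_id: int, data: bytes) -> None:
        # Every create or modify bumps this Chamber's own event counter.
        self.event_counter += 1
        self.didgets[didget_id] = (self.event_counter, data)

    def changed_since(self, event: int) -> set:
        """IDs of Didgets created or modified after the given event number."""
        return {d for d, (ev, _) in self.didgets.items() if ev > event}


def synchronize(a: Chamber, b: Chamber, a_mark: int, b_mark: int):
    """Copy changes made since the last sync in both directions.

    a_mark and b_mark are each Chamber's event counter at the previous sync.
    Didgets changed on both sides are not copied; they are returned as
    conflicts for a policy (or the user) to resolve.
    """
    from_a = a.changed_since(a_mark)
    from_b = b.changed_since(b_mark)
    conflicts = from_a & from_b
    for d in from_a - conflicts:
        b.put(d, a.didgets[d][1])
    for d in from_b - conflicts:
        a.put(d, b.didgets[d][1])
    # The counters after copying become the marks for the next sync.
    return a.event_counter, b.event_counter, conflicts
```

Note that no folder structure appears anywhere: the comparison works purely on Didget IDs and event numbers.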

Sunday, June 10, 2012

Public vs Private Data

Within the storage systems of any individual, small business, or large enterprise, there are two kinds of data. Data that was created by the user(s) of that system and data that was created somewhere else and copied into that system.

In the Didget Realm, Didgets can be classified as either Public or Private. Public data is that which was "published" by its creators for public consumption. Examples of public data are songs, movies, books, and software. Often their creators want the consumers of such data to pay for the privilege. Private data, on the other hand, was created within the data domain of the creator for their own private consumption.

File systems have no way to distinguish between the two types of data. File1.doc may be a popular document that I downloaded off the Internet and I have one of a million copies. File2.doc may be my own personal document that I spent 50 hours working on and I have the only copy. (Of course, it would not be wise for me to work so many hours on a document without making backup copies, but every once in a while you hear about some student losing such a thing.) Using a file system, I cannot tell which type of data is contained within either of the two files.

The simple fact is that these two types of data should be treated differently. I want to make regular backups of my private data and take extra security measures to ensure that unauthorized access is prevented. If I lose some software I downloaded (public data), I can always replace it by just downloading it again. If I have a cloud backup solution, I don't want to use up all my bandwidth and storage space by pushing copies of a bunch of HD movies I downloaded instead of my important documents.

With Didgets, I can instantly see which data I have created and what I have copied from others. I can set policies dealing with replication, security, and backups based on those types. For example, I could have a policy that tells the Didget Manager to create two separate replicas of every private document I create.

Public Didgets are by default Immutable. This "Read-only" attribute prevents any changes to them thus preventing a virus from altering them and otherwise guarantees their integrity. If I want my own private copy of a Public Didget that I can alter, I need to copy its contents to a Private Didget. I can alter the Private Didget while keeping the original Public Didget intact.

Tuesday, June 5, 2012

Tags, Tags, and More Tags

In the Didget Realm, every single Didget can have lots of tags attached to it. Tags are similar to the extended attributes that have been added to some file systems. They are extra metadata that exists outside of an object's regular metadata and separate from its data stream. While tagging data is nothing new, the approach we take to implement tags with Didgets is very different from previous solutions like extended attributes or database tags.

Extended Attributes

File system extended attributes are simple Key:Value pairs. The key is a simple string without any specific context involved. As with file names, a file system will not attempt to interpret the meaning of a given key; it is just a simple lookup with no relationship between any two given keys. Likewise, a file system will not attempt to impose any restrictions on the value assigned to any given key other than making sure its length does not exceed any imposed limit.

File systems were not designed to allow fast, efficient searches for files based on the existence of extended attributes or based on any particular value assigned. For example, if an application wanted to find all the documents within a given file system volume that had the extended attribute "Author=John" attached to it, it would need to do a brute force search by finding every file with a document extension and examining each one individually to see if it had that particular extended attribute key and value. For a volume with a million or more files in it, such a search can be painfully slow.

Since many file systems do not support extended attributes and using them can be difficult, they are rarely used by applications. If a file with extended attributes is moved or copied to another file system, it is likely that the extended attributes will either be lost or altered in some way.

Database Tags

Some applications allow the user to tag data by storing information inside of a database managed exclusively by that application. Popular data management software like iTunes and Picasa use this technique to tag music and photos. These databases are not meant to be shared openly between applications and if a photo or music file is copied from one volume to another, the tags don't come with it. A user is only able to search based on the tags if the specific application supports it.

Didget Tags

Unlike these other approaches, our tags are designed to be widely used, shared, and searchable. Any application can use our simple API to get a list of tag definitions and attach tag values based on those definitions to any Didget. Any application can then find Didgets based on tags or add their own tags to make a Didget easier to find or manage.

Every tag within a Didget Chamber is defined using a simple schema. Once a tag is defined, any application can use it to attach a value to a Didget. If an application wants to use a tag that is not currently defined, it can quickly define a new tag, which adds its definition to the schema. Applications can search for Didgets that have a certain defined tag attached to them or, more specifically, have a certain value assigned.

For example, an application can define a new tag ".person.Nickname" and then attach that tag with the value of "Bubba" to a photograph Didget. Another application can later query the Didget Manager for a list of all Photograph or Document Didgets that have ".person.Nickname = Bubba" attached. The Didget Manager would be able to process that query in just a few seconds even if there were 2 million Photograph Didgets and 3 million Document Didgets mixed in with 5 million other kinds of Didgets and all of the Didgets had some tags attached to them.

Likewise, applications could search for all Didgets that had any tag of category ".person". They could find a list of all Music Didgets where ".person.musician=Billy Joel" and ".date.year=1980". The Didget Manager is able to perform these lightning fast queries without needing a separate database or implementing a complicated query language.
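To show the kind of lookup involved, here is a small Python sketch of a tag schema with an inverted index. It is a generic technique I am substituting for illustration, not a description of how the Didget Manager stores tags; the `TagIndex` class and its method names are hypothetical.

```python
from collections import defaultdict


class TagIndex:
    """Toy tag store: a schema of defined keys plus two inverted indexes."""

    def __init__(self):
        self._schema = set()                # defined tag keys
        self._by_value = defaultdict(set)   # (key, value) -> Didget IDs
        self._by_key = defaultdict(set)     # key -> Didget IDs

    def define(self, key: str) -> None:
        """Add a tag definition such as '.person.Nickname' to the schema."""
        self._schema.add(key)

    def attach(self, didget_id: int, key: str, value) -> None:
        """Attach key=value to a Didget; the key must already be defined."""
        if key not in self._schema:
            raise KeyError(f"tag {key!r} is not defined")
        self._by_value[(key, value)].add(didget_id)
        self._by_key[key].add(didget_id)

    def find(self, *pairs) -> set:
        """IDs of Didgets that carry every given (key, value) pair."""
        sets = [self._by_value.get(p, set()) for p in pairs]
        return set.intersection(*sets) if sets else set()

    def find_category(self, category: str) -> set:
        """IDs of Didgets with any tag in a category such as '.person'."""
        prefix = category + "."
        result = set()
        for key, ids in self._by_key.items():
            if key.startswith(prefix):
                result |= ids
        return result
```

For example, after attaching ".person.musician = Billy Joel" and ".date.year = 1980" to a Music Didget, `find((".person.musician", "Billy Joel"), (".date.year", 1980))` returns that Didget's ID, and `find_category(".person")` returns every Didget carrying any ".person" tag.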

Unlike file extended attributes, tags are not lost when a Didget is copied or moved to another Chamber. This is because all Chambers support the tags and because applications do not perform the actual copy operation. An application will initiate the copy operation by telling the Didget Manager to copy a Didget, but it is the Didget Manager itself that makes sure nothing is lost during the copy. You never have to worry that your tags will be lost because the application forgot to copy them.

Tags are powerful tools to help users and applications to add meaningful metadata to any or all of the Didgets within a Chamber to enable fast searches based on specific values and build lists or menus from the results.

Wednesday, May 30, 2012

Tracking Changes With Event Counters and Marker Didgets

File systems use date and time stamps to keep track of when files are created, last modified and last accessed. As stated earlier, these values are stored within each file metadata record and reflect the value of the system clock running on the host system. Any application can later change these values to anything they want using the file API.

As computers and storage devices became faster and faster, many different files could be created and modified within a single second. This can make it difficult to determine in what order various operations occurred. The solution for file system designers was to increase the granularity of the values stored in the date and time stamps from milliseconds to microseconds and finally to nanoseconds. If the system clock was always guaranteed to be accurate and no applications could set these values to any arbitrary number then this approach would always allow for an accurate accounting of the order of events within a file system. Unfortunately, that is not the case.

The Didget Management System takes an entirely different approach. Each chamber has its own "Event Counter". This event counter is a 48 bit number that is maintained exclusively by the Didget Manager, starts with 1, and is incremented each time some event within the chamber occurs. If the current event counter has a value of 100 and ten new Didgets are created, then the first Didget created gets the value of 101 in its "Create" field. The second Didget created gets the value of 102 in its field. This continues until the tenth Didget created gets the value of 110. There is no API that allows any application or operating system to directly change the value of any of the event counter fields stored in the Didget records or mess with the event counter itself.

Using this technique, it is always possible to list every Didget in the chamber in the exact order in which they were created. Likewise, you can find the last 100 Didgets that were accessed or the last 10,000 that were modified. You can tell if Didget X was created before or after Didget Y was last modified.

Just like every other field in the Didget table record, these event counter fields can be specified when performing a general search. For example, in a chamber with millions of Didgets, I can get a list of all the Didgets that were created between event counter 100,000 and event counter 300,000 in under a second. A backup program that knows it performed its last backup at event number 1,000,000 can quickly get a list of all Didgets that were either created or modified after that point.
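Here is a minimal Python sketch of the event counter idea. The `Chamber` class, its `start` parameter, and the sorted-list-plus-binary-search lookup are my own assumptions for illustration; the real Didget table is certainly organized differently.

```python
import bisect


class Chamber:
    """Toy Chamber that stamps each new Didget with the next event number."""

    def __init__(self, start: int = 0):
        # The real counter is described as 48 bits; Python ints are unbounded,
        # so no wrap-around handling is modeled here.
        self._counter = start
        self._created = []  # (create_event, didget_id), always in increasing order

    def create(self):
        """Create a Didget; return (didget_id, create_event)."""
        self._counter += 1
        didget_id = len(self._created) + 1
        self._created.append((self._counter, didget_id))
        return didget_id, self._counter

    def created_between(self, lo: int, hi: int) -> list:
        """IDs of Didgets whose create event is in [lo, hi], in creation order."""
        i = bisect.bisect_left(self._created, (lo,))
        j = bisect.bisect_right(self._created, (hi, float("inf")))
        return [d for _, d in self._created[i:j]]
```

Because events are only ever appended in increasing order, the range query is two binary searches rather than a scan of every record. Starting from a counter value of 100, ten calls to `create()` stamp the new Didgets with events 101 through 110.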

While the event counter allows you to know that event X happened exactly one event before event Y, it doesn't tell you when either of those events actually occurred or how much time passed between them. That functionality is left to special Didgets called Marker Didgets.

A Marker Didget is used to match either a specific date and time (as recorded by the system clock) or a specific type of event with an event counter value. These Marker Didgets are created either by the Didget Manager itself or by applications. Each Marker Didget has a type associated with it. Some types that have been defined so far are "Chamber Created", "Chamber Mounted", "Chamber Dismounted", "Backup Started", "Backup Ended", "Virus Scan Started", "Virus Scan Ended", and "Time Stamp". The Didget Manager code automatically creates one of these markers when the chamber is created, mounted, or dismounted. Applications like backup programs and virus scanners can create them when they start and when they finish. Any application can create a Time Stamp Marker at any time.

When a Marker Didget is created, the current value of the chamber's event counter is stored in that Marker's creation event counter field (just like it is whenever any other kind of Didget is created). In addition, the Didget Manager queries the system clock and stores the current date and time value in the Marker Didget's metadata record as well. This allows us to know that event 1000 occurred at 10:00 am on June 4, 2011 and event 2000 occurred two days later at 1:00 (or at least that is what the system clock said the times were). Even if the system clock values are not accurate, the order of events always stays intact.

The Didget Manager will not only create these Marker Didgets whenever an application specifically commands it, it can also be set to do it automatically by using Policy Didgets (another type of special "Managed Didget" that I will detail in a later post). A user or application could create a policy that commands the Didget Manager to automatically create a "Time Stamp" Marker Didget every 15 minutes. Likewise, a policy could direct it to create a "New Month" Marker every time the Didget Manager detects that a new month has begun.

Marker Didgets can be used in queries to find Didgets that have been created, modified, or accessed either before, after or between Markers. For example, I could query for a list of all Photo Didgets that were created before the chamber was last mounted. I could ask for a list of all Document Didgets that have been accessed since the last backup or last virus scan. I could ask for a list of all Software Didgets that were created after the New Year Marker in 2010 but before the New Year Marker in 2011.
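A query such as "everything accessed since the last backup" could be modeled roughly like this in Python. The `Marker` record and the helper functions are hypothetical names of my own, and the dictionary of access events stands in for the real Didget table.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class Marker:
    kind: str        # e.g. "Backup Ended", "Time Stamp"
    event: int       # Chamber event counter value when the Marker was created
    clock: datetime  # system clock reading at that moment (may be inaccurate)


def last_marker(markers, kind: str) -> Optional[Marker]:
    """Most recent Marker of the given kind, ordered by event number, not by clock."""
    matches = [m for m in markers if m.kind == kind]
    return max(matches, key=lambda m: m.event) if matches else None


def touched_since(access_events: dict, markers, kind: str) -> list:
    """IDs whose last-access event is later than the last Marker of this kind.

    access_events maps Didget ID -> last-access event number. If no such
    Marker exists, every Didget counts as touched.
    """
    m = last_marker(markers, kind)
    floor = m.event if m else 0
    return sorted(d for d, ev in access_events.items() if ev > floor)
```

The comparison uses only event numbers, so the answer stays correct even if the clock values stored in the Markers are wrong.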

Sunday, May 27, 2012

Lists, Menus, and Collections

In addition to File Didgets (things like photos, documents, music, or video) there are also a number of "Managed Didgets" where the Didget Manager controls the contents of their data streams. I would like to talk about three of them that are used to organize various other groups of Didgets - List Didgets, Menu Didgets, and Collection Didgets.

List Didgets

A List Didget is a Didget that has, as its contents, a simple list of other Didgets to form a logical group. If a Didget has been added to one of these List Didgets, it could be said that it is "a member of that group". Unlike files that must reside within a single folder or directory (the exception being file systems that support hard links), Didgets are generally expected to be members of several different groups. A single music Didget could be a member of the "Music", "Rock and Roll Music", "80's Music", and "My Favorite Songs" groups.

Just like every other Didget in the system, List Didgets can have attributes and tags assigned to them. In addition, List Didgets can be assigned "rules" about what kinds of other Didgets can be added as members. So I could create a List Didget that has a name tag of "My Hawaii Vacation Photos" attached, allows only Photo Didgets with the format of JPEG as members, and contains a current list of 100 Photo Didget IDs. This Didget would effectively represent a "Photo Album" with 100 photos in it of my vacation.

List Didgets are especially suitable for lists where you don't care about each member having some kind of label associated with it. In the example given, it is not important that each of the 100 photos is given a label like "Day at the Beach", "Snorkeling" or "Whale Watching". They could be identified only as "Photo 1", "Photo 2", "Photo 3", ... , "Photo 100", or they could have no names at all.
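A List Didget with a membership rule might be modeled like this in Python. The class and field names (`DidgetInfo`, `ListDidget`, `allowed_kind`, `allowed_fmt`) are assumptions for the sake of the example, and the rule is reduced to a simple type-and-format check.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class DidgetInfo:
    """Just enough about a Didget to evaluate a membership rule."""
    didget_id: int
    kind: str  # e.g. "Photo", "Document"
    fmt: str   # e.g. "JPEG", "PDF"


@dataclass
class ListDidget:
    name: str
    allowed_kind: str
    allowed_fmt: str
    members: list = field(default_factory=list)  # Didget IDs only, no labels

    def add(self, d: DidgetInfo) -> None:
        """Add a member if it satisfies the list's rule; duplicates are ignored."""
        if d.kind != self.allowed_kind or d.fmt != self.allowed_fmt:
            raise ValueError(
                f"{self.name!r} only accepts {self.allowed_fmt} {self.allowed_kind} Didgets"
            )
        if d.didget_id not in self.members:
            self.members.append(d.didget_id)
```

The photo album example would be `ListDidget("My Hawaii Vacation Photos", "Photo", "JPEG")`; adding a Document Didget to it raises an error, while each JPEG Photo Didget is stored as nothing more than its ID.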

Menu Didgets

Just like List Didgets, Menu Didgets contain references to other Didgets. Their uniqueness lies in the requirement that every member Didget also be given a short label within the menu. I could create a Menu Didget called "Clint Eastwood Movies" and have 10 members that are all Video Didgets. Each member would have a label in the Menu like "Kelly's Heroes", "Magnum Force", or "Space Cowboys".

Collection Didgets

A Collection Didget is also very similar to a List Didget in that it can have lots of members that do not need any kind of label associated with each of them. With Collection Didgets, members are divided into two distinct categories - Mandatory Members and Optional Members. Collection Didgets are primarily used to track data set completeness. I could have a Collection Didget called "Microsoft Office Software" that contained a list of all the Software Didgets that are required to run that software package as well as another list of Didgets that are nice to have (spell checkers, thesaurus, help files, etc.) but are not needed to run it. A simple utility could be built that checks to see if a given Didget Chamber has everything it needs to run without having to start up the program. It will be possible to build a sophisticated "Package Management System" using these Collection Didgets.
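The completeness check described above amounts to simple set arithmetic. Here is a Python sketch under my own naming assumptions (`CollectionDidget`, `missing_mandatory`, and so on are hypothetical, as are the Didget IDs in the usage note).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CollectionDidget:
    name: str
    mandatory: frozenset              # IDs that must be present
    optional: frozenset = frozenset() # IDs that are nice to have

    def missing_mandatory(self, present) -> set:
        """Mandatory member IDs that are not in the Chamber."""
        return set(self.mandatory) - set(present)

    def missing_optional(self, present) -> set:
        """Optional member IDs that are not in the Chamber."""
        return set(self.optional) - set(present)

    def is_complete(self, present) -> bool:
        """True when every mandatory member is present."""
        return not self.missing_mandatory(present)
```

Given a collection with mandatory members {10, 11, 12} and optional members {20, 21}, a Chamber holding {10, 11, 20} is incomplete and `missing_mandatory` reports {12}, all without ever starting the program the collection describes.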

Hierarchy

All these kinds of Didgets can be nested within each other. I can build a hierarchy of menus that has menus inside of other menus. I could mimic a traditional file system folder hierarchy using this technique. A "List of Lists" could likewise be made by nesting List Didgets within other List Didgets. The same goes for a "Collection of Collections". I could mix them up a bit by creating a "List of Collections" or a "Menu of Lists".

Each of these "Managed Didgets" represents a powerful tool for organizing groups of data and giving applications an elegant way of visually representing them to the end user.

Didget Organization

Each Chamber within the Didget Realm is capable of storing billions of Didgets. Unlike files in a file system, a Didget does not have to be "located" within a folder. Each Didget can be assigned certain types, attributes, and tags that can be used to distinguish it from all the other Didgets in the system. Simple queries can quickly sort out all the Didgets that match a given search criteria.

Each Didget can be a member of one or more "data sets" but there is no requirement to do so. This means that I could have a Chamber with 10 million Didgets in it and have none of them categorized into a specific data set. This would be like the early days of file systems where there were no folders and all files were in the "root directory".

Even with folders, it is possible to have lots of files in any given folder. For example, some folders like "Windows", "bin", or "My Documents" can sometimes get populated with several thousand files. In file systems, such a situation can cause some real problems. Name conflicts are more likely to arise as the number of files within a single folder rises, since you can't have two files with the same name in the same folder. Performance is also slow when you try to load the contents of the directory into a file manager or dump the list to a terminal screen when there are so many files.

There have been tools available for a long time to help users pick a subset of files out of a long list of available files within such a "crowded folder". To pick out a smaller subset of all those files a user may issue a command like "Dir *.exe" or "ls *.cpp". Just those files with that extension will be listed, giving the user a much more manageable list to navigate.

In the Didget Realm, things are substantially different. Since each Didget has a unique number as its identifier, there is no problem with having 10 million Didgets all in the same "root directory". The interface is designed to perform lightning-fast searches based on criteria much more powerful than just file names or extensions. In this respect it has a lot more in common with a database than a file system.

So if I want to get a quick list of all the Didgets that are photos in JPEG format of my vacation in Hawaii last year, I can just do a simple query against all 10 million Didgets and get a list of all 100 photos in less than a second. I don't have to navigate down through a directory hierarchy to try and find the C:\Photos\Vacations\Hawaii\2011 directory that has all 100 photo files in it. If I want a Didget to be a part of several different queries, I can just attach additional tags like ".people.Group = Family" and ".activity.Sport = Surfing" and it will appear when I search for those things. I don't need to create separate folders like "C:\Photos\Family" or "C:\Photos\Surfing" and either put copies of the photos in them or create hard or soft links to the original photos.

In addition to the quick query capability of our system, Didgets can be organized into different groups or sets that are more persistent than ad hoc queries. There are three special kinds of Didgets that help organize them - List Didgets, Menu Didgets, and Collection Didgets. I will discuss them further in my next post.

Sunday, May 20, 2012

Past Solutions

The Didget Manager was created to solve a number of data management problems. I will attempt to state the biggest problem as best I can and then illustrate a number of different solutions that have either been attempted in the past, or are currently in use.

Problem: How do you properly manage many millions of pieces of structured and unstructured data especially when they are spread across several storage devices?

It is very easy for an individual or business to buy sufficient storage capacity to hold tens of millions of pieces of data. A 3 TB hard disk drive can be purchased for about $150. Small RAID storage devices that can hold 12 TB of data can be built for less than $1000. Flash based USB drives or Solid State Drives cost about $1 per GB.

In my home, I have counted over 20 different storage devices that have a capacity of at least 8 GB. Cameras, computers, PVRs, phones, iPads, and video cameras all come with built-in storage. In addition I have several external storage devices like thumb drives, backup drives, and a NAS.

They are all filling up with data. Photos, home video, documents, downloaded software, music, and other stuff seem to slowly fill any available storage. Portable devices like cameras and phones are often synchronized with my laptop or desktop computer. It can be difficult to tell sometimes if I have only one copy of a given photo or if I have dozens of copies spread around all my storage devices.

So what falls under the definition of "Manage" when it comes to data?

1) Backup. Naturally, we want to ensure that we have proper backups of all important data. The backup can be located locally in case a storage device just fails or it can be pushed to a remote site to help ensure a successful disaster recovery procedure.

2) Replication. Backup is a form of replication, but it also includes the placement of data on several different devices to enable convenient access. We always want to be able to access our important data no matter which device we have with us at the time.

3) Search. Even if we have a storage device with us, it doesn't help us if we can't find the document we are looking for among several million others. We want to be able to search for it based on its name, some attributes it may have, or by a set of keywords.

4) Protection. We want to be able to prevent important information from being altered or destroyed by accident or by a malicious program.

5) Data Sets. We want to be able to organize data by placing various pieces of it into different sets. A set can be an album full of pictures, a playlist with dozens of songs, or a software package containing a hundred different programs, libraries, or configuration settings. It would be very helpful if every piece of data could be a member of more than just one data set.

6) Synchronization. If we have more than one copy of something, it would be nice to be able to have changes made to one of the copies be synchronized to all the copies. If a new element is added to a replicated set, it would be helpful to also add it to all the copies of that set.

7) Security. We want to make sure a piece of data can be only accessed by those with permission. We want to make sure that security is not compromised just because the data is moved or copied to another location.

8) Inventory. We want to get an accurate accounting of all our pieces of data. We want to know if any of our data sets are incomplete. We want to know how many documents, photos, or videos we have. We want to know if there are any security holes. We want to know what has changed and what devices have not yet been synchronized, backed up, or replicated.

9) Completeness. We want to make sure when a piece of data is copied from one place to another that it is a complete copy. The data stream as well as all metadata including things like extended attributes need to be copied to assure that the clone is an exact duplicate of the original.

What attempts have been made so far to accomplish some of these data management tasks and what are their limitations?

1) A well-organized file directory tree. This is where applications and users must adhere to a clearly defined plan for grouping all files into appropriate folders or directories. All the operating system files go in C:\Windows; all the user utility programs go in /bin or /usr/bin; or all the user documents go into the C:\users or /home areas. This approach can work fairly well when there are only a few thousand files to deal with. Unfortunately, it requires a lot of work to keep all the files in their proper directory. It also makes it difficult to decide where to put that photo you just downloaded - in the C:\Photos directory or in the C:\Downloads directory.

2) Lots of databases. Since most databases were not built to manage lots of unstructured data like photos, video, and documents (Blobs in database speak), databases are generally used to track and manage files instead. Windows Search, OS X Spotlight, Google's Picasa, iTunes, and iMovie are examples of programs that store file information within a database. When a file is created, its full path is stored in the database along with additional metadata. This allows the user to keep track of millions of files and do very fast queries based on things like keywords or tags. Incremental backups, replication and synchronization functions, and data sets can be tracked using these databases as well. Unfortunately, the databases are completely separate from the file system. It is possible for users or applications to create new files and delete or modify existing files without the database being updated as well. Even if there is a background monitoring tool that has a file system filter driver informing it of every change, it is possible to make changes while that tool is not running. Even if every file system change is accompanied by an accurate update to one or more databases, it can be difficult to manage lots of files when there are dozens of separate databases, each keeping track of just a subset of the whole file system. A separate application is probably managing each database independently and those applications seldom talk to each other.

3) Embedded Data. Some file formats like JPEG allow metadata to be embedded within a file's data stream without disrupting its normal processing. Things like the camera info, the date and time the picture was taken, and the GPS coordinates of the location where the picture was taken can be stored within the Exif data portion of .JPG files. Unfortunately, this data is not always accessible or searchable by all applications and some applications can alter the metadata unintentionally. Few data formats allow this behavior so it has limited application.

4) Extended Attributes. This external metadata can be attached to any file within a file system that supports them. Unfortunately, they are generally not searchable and are not universally available. Only some file systems support them and not all supporting file systems have the same rules for implementation. When an application copies a file from one file system to another, the extended attributes can be lost, altered, or just stripped off because the application forgot to copy them.

Tuesday, May 15, 2012

The Didget Record

As stated earlier, every Didget has a 64 byte metadata record used to track it. The Didget Manager is software that manages all the Didgets in the system. Unlike a file system, the Didget Manager is able to distinguish between different kinds of Didgets.

A file has only one mechanism (outside of the actual bits stored in its data stream) used for classification. That mechanism is the file extension (typically a three or four character string appended to the end of the file name). The file extension may symbolize the format of the data stream but the file system does not try to interpret its meaning. It is completely up to applications to interpret which file extensions belong to a particular category.

For example, there are dozens of different data stream formats that are used to represent a still image (e.g. a photograph). JPG, PNG, GIF, TIF, BMP, and ICO are all examples of file extensions used to represent images. If a user wanted to know how many total image files were on a system, they would have to run an application that was programmed to find every type of file extension applicable to images. Since there is no way to ask a file system for a list or a count of "all image files", the application would need to perform a separate search for every file extension. If a volume contained millions of files, this simple search could take an hour or more to complete. If a new file extension was created to represent a new image format, the application would need to be updated so that it would look for files with that new extension.
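A minimal sketch of what such an application has to do is shown below. The extension list, the function names, and the choice to tally per extension are all assumptions made for illustration; any image format missing from the list is silently skipped, which is exactly the maintenance problem described above.

```python
import os
from collections import Counter

# Assumed list of image extensions; a new format means editing this set.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".gif", ".tif", ".tiff", ".bmp", ".ico"}

def is_image_name(file_name):
    """Guess whether a file is an image purely from its extension."""
    extension = os.path.splitext(file_name)[1].lower()
    return extension in IMAGE_EXTENSIONS

def count_image_files(root):
    """Walk every directory under root and tally image files by extension."""
    tally = Counter()
    for _dirpath, _dirnames, file_names in os.walk(root):
        for file_name in file_names:
            if is_image_name(file_name):
                tally[os.path.splitext(file_name)[1].lower()] += 1
    return tally
```

Calling count_image_files on the root of a volume visits every directory entry on it, so the running time grows with the total number of files rather than with the number of images.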

Didgets on the other hand, have several mechanisms that are used to classify data. Every Didget has a Didget type and a Didget subtype. If the type is File Didget, then it also has a File Didget format assigned. The Didget type and subtype fields are bit fields of 16 bits each. Since each of the 16 Didget types can have 16 different subtypes there are 256 possible kinds of Didgets in the system.

One Didget type is "File". When files are converted into Didgets, they are assigned to be File Didgets. The other 15 Didget types have special purposes that apply only within the Didget Realm and I will discuss them in further posts.

Of the 16 File Didget subtypes, only 8 have been defined so far. They are Audio, Document, Image, Script, Software, Structured Data, Text, and Video. Each File Didget subtype can be further categorized into its various formats. Unlike the other two-byte fields, this two-byte field is not a bit field in which only a single bit can be set. Instead it is an unsigned short int and can hold up to 65,534 different format types (zero is reserved).

Audio File Didgets include every format where the data stream is interpreted as sound. Formats for music, audio books, speeches, instruments, voice mail, and other noises all have the "Audio" bit set in the File Didget subtype field.

Software File Didgets include every kind of compiled computer code. Executable files, shared libraries, device drivers, and every other kind of software, regardless of targeted CPU or operating system, all have the "Software" bit set in the File Didget subtype field. Other kinds of code that must be interpreted like Python, Ruby, Perl, system commands, or shell scripts are categorized as "Script".

The other types of File Didgets are used to categorize the various document formats, still images formats, video formats, database formats, and plain text data formats.
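The three classification fields can be pictured with a short Python sketch. The names of the eight subtypes come from the text above, but the bit positions, the byte order, the FORMAT_JPEG value, and the function names are assumptions chosen for illustration, not the actual layout of a Didget Record.

```python
import struct
from enum import IntFlag

class DidgetType(IntFlag):
    # Only the File type is named so far; its bit position is assumed.
    FILE = 1 << 0

class FileSubtype(IntFlag):
    # The eight defined subtypes; bit positions are assumed.
    AUDIO = 1 << 0
    DOCUMENT = 1 << 1
    IMAGE = 1 << 2
    SCRIPT = 1 << 3
    SOFTWARE = 1 << 4
    STRUCTURED_DATA = 1 << 5
    TEXT = 1 << 6
    VIDEO = 1 << 7

# Hypothetical format number; real format values are not listed in the text.
FORMAT_JPEG = 1

def pack_classification(didget_type, subtype, file_format):
    """Pack the three two-byte fields into six bytes (little-endian, assumed)."""
    if file_format == 0:
        raise ValueError("format zero is reserved")
    return struct.pack("<HHH", int(didget_type), int(subtype), file_format)

def unpack_classification(raw):
    """Reverse pack_classification and restore the flag types."""
    didget_type, subtype, file_format = struct.unpack("<HHH", raw)
    return DidgetType(didget_type), FileSubtype(subtype), file_format

raw = pack_classification(DidgetType.FILE, FileSubtype.IMAGE, FORMAT_JPEG)
print(len(raw), "bytes used for classification")
```

Six bytes are enough to say "this is a file, it is an image, and it is in this particular format", which leaves most of a 64 byte record free for the other fields.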

Unlike file systems, the Didget Management System provides simple APIs used to search for all the Didgets that match a given set of search criteria. What this means is that an application can make a single call to the Didget Manager for a list of all the Video File Didgets and get a complete and accurate list very quickly no matter how many different kinds of video formats may be present.

Because the Didget Manager is able to quickly check bits in the bit fields described for every Didget Record in the system, it is able to sort out all the matching Didgets for any particular query in record time. On a system with a Quad core processor and 4 GB of RAM, I am able to sort through about 25 million Didget Records per second. This means I can find 9 million photographs mixed in with 16 million other kinds of Didgets in one second or less.
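The kind of bit test described above can be sketched as follows. This is a toy model with assumed bit positions and a made-up record layout of (ID, subtype) pairs; plain Python will come nowhere near the speeds quoted, so it only shows the logic of the check.

```python
IMAGE = 1 << 2   # assumed bit positions, for illustration only
VIDEO = 1 << 7
AUDIO = 1 << 0

def find_matching(records, subtype_mask):
    """Return the IDs of all records whose subtype shares a bit with the mask."""
    return [didget_id for didget_id, subtype in records if subtype & subtype_mask]

# Made-up records: (Didget ID, subtype bit field).
records = [(101, IMAGE), (102, VIDEO), (103, AUDIO), (104, IMAGE)]

print(find_matching(records, IMAGE))
print(find_matching(records, IMAGE | VIDEO))
```

Because each record is tested with a single AND operation, asking for several kinds at once (here images or videos) costs no more than asking for one.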

Thursday, May 10, 2012

What is a Didget?

This new data management architecture uses individual objects called Didgets. A Didget (short for Data Widget) has some properties of a conventional file, some properties of items stored in an object store, and a bunch of properties for which I can find no equivalent in any other system.

A Didget has a variable-length data stream just like a file. Any kind of serialized data can be written to this data stream. It can contain a photo, some software, a video stream, or any other structured or unstructured data that can be saved to a file. The number of bytes for this stream can range from zero bytes to just over 18 trillion bytes.

A Didget also has a small set of required metadata that is stored as a fixed-size record within a table. Just as file systems use file records such as iNodes (Ext2, Ext3, Ext4) and FRS structures (NTFS) to track files, the Didget Manager keeps track of all the Didgets within a Chamber using these Didget table entries.

The size of each entry is intentionally small. It is only 64 bytes. This allows extremely large numbers of Didgets to be managed using a minimal number of disk reads and a minimal amount of RAM for caching. By contrast, the default iNode size is 256 bytes and NTFS's records are 1024 bytes (4096 bytes on all the new advanced format hard disk drives). Some other file systems have even larger metadata records. In order for the NTFS file system to read in the entire MFT and store it in memory for quick searches, it would need to read in 10 GB from disk and have 10 GB of RAM if the volume contained 10 million files. For 100 million files, it would need ten times that much memory.

The Didget Manager, on the other hand, could read and store the entire Didget table in just 640 MB when 10 million Didgets are present. Even for 100 million Didgets, it is a very manageable 6.4 GB.
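The arithmetic behind these figures is simply record count times record size. The short sketch below reproduces it using the record sizes quoted above; the helper name table_bytes is made up, and sizes are reported in decimal gigabytes.

```python
def table_bytes(record_count, record_size):
    """Bytes needed to hold every metadata record in memory."""
    return record_count * record_size

record_sizes = [("Didget record", 64), ("default iNode", 256), ("NTFS record", 1024)]

for label, record_size in record_sizes:
    for count in (10_000_000, 100_000_000):
        size_gb = table_bytes(count, record_size) / 1e9
        print(f"{label:14} x {count:>11,}: {size_gb:6.2f} GB")
```

For 10 million entries this gives 640,000,000 bytes (640 MB) for the Didget table against 10,240,000,000 bytes (about 10 GB) for 1024 byte NTFS records, a factor of 16.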

With only 64 bytes to work with, every single bit is important. Painstaking care was taken to ensure that every byte was necessary and yet every field was of sufficient size to provide good limits. Fields within this structure include the Didget's ID, its type information, its attributes, its security keys, a tag count, and three separate event counter values (analogous to date and time stamps).

Absent from this structure is the name of the Didget. It is stored in another structure if the Didget even has one. Unlike files, a Didget does not need to have a name. Its unique identifier is a number. This 64 bit number (the Didget ID) is assigned during Didget creation; it never changes over the life of the Didget; and the number is never recycled if the Didget is deleted and purged from the chamber. This means that if the ID is stored within some other data stream for use by a program, it will always point to the right Didget (unless of course that Didget has been deleted).
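The behavior of such an identifier can be modeled with a toy example. The class ToyChamber and its methods are hypothetical and say nothing about how the Didget Manager actually allocates IDs; the sketch only shows the two properties described above, namely that a rename does not disturb the ID and that a deleted ID is never handed out again.

```python
class ToyChamber:
    """Toy model: Didgets keyed by a numeric ID that is never reused."""

    def __init__(self):
        self._next_id = 1   # only ever increases, even after deletes
        self._names = {}    # didget_id -> optional name

    def create(self, name=None):
        didget_id = self._next_id
        self._next_id += 1
        self._names[didget_id] = name
        return didget_id

    def rename(self, didget_id, new_name):
        self._names[didget_id] = new_name

    def delete(self, didget_id):
        del self._names[didget_id]

    def exists(self, didget_id):
        return didget_id in self._names

chamber = ToyChamber()
photo = chamber.create("beach.jpg")
chamber.rename(photo, "hawaii-beach.jpg")   # the stored ID still resolves
still_there = chamber.exists(photo)
chamber.delete(photo)
replacement = chamber.create("other.jpg")   # gets a fresh ID, not the old one
print(photo, replacement, chamber.exists(photo))
```

A program that saved the first ID will either find the original Didget or find nothing. It can never be pointed at some unrelated Didget that happened to inherit the number.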

A Didget can have a name. In fact it can have lots of them. The name is simply a tag that has been attached to a Didget. A tag is a simple Key:Value pair that is stored in such a way as to enable database-like speeds when searching for Didgets that have certain tags. Each Didget can have up to 255 tags attached to it.

Each tag within the system is defined in a schema. The user or an application can create a new tag definition using a simple API. Once defined, tags can be created and attached to any Didget(s) using that definition.

For example: a Didget containing a photograph taken of Bob in New York City in 2011 may have 3 tags attached to it (.person.FirstName = Bob, .place.City = "New York City", .date.Year = 2011). Any application can issue a query to the Didget Manager for a list of all photographs with these three tags and this Didget will be in the list. If the user wanted to later attach a new tag (e.g. .device.Camera) to this photograph, he could define the tag and then attach the value (.device.Camera = "Canon EOS 7D") to it.
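One common way to make this kind of lookup fast is an inverted index from each Key:Value pair to the set of IDs that carry it. The sketch below uses that technique as a stand-in; it is not the Didget Manager's API or its storage format, and the class ToyTagIndex, its methods, and the second Didget tagged "Boston" are invented for the example.

```python
from collections import defaultdict

class ToyTagIndex:
    """Toy inverted index: (key, value) -> set of Didget IDs."""

    def __init__(self):
        self._ids_by_tag = defaultdict(set)

    def attach(self, didget_id, key, value):
        self._ids_by_tag[(key, value)].add(didget_id)

    def query(self, criteria):
        """Return sorted IDs of Didgets carrying every key/value in criteria."""
        if not criteria:
            return []
        id_sets = [self._ids_by_tag.get(pair, set()) for pair in criteria.items()]
        return sorted(set.intersection(*id_sets))

index = ToyTagIndex()
index.attach(1, ".person.FirstName", "Bob")
index.attach(1, ".place.City", "New York City")
index.attach(1, ".date.Year", 2011)
index.attach(2, ".person.FirstName", "Bob")
index.attach(2, ".place.City", "Boston")
index.attach(2, ".date.Year", 2011)

bob_in_nyc_2011 = index.query({
    ".person.FirstName": "Bob",
    ".place.City": "New York City",
    ".date.Year": 2011,
})
print(bob_in_nyc_2011)

# Attaching a tag later makes the Didget show up in new queries as well.
index.attach(1, ".device.Camera", "Canon EOS 7D")
print(index.query({".device.Camera": "Canon EOS 7D"}))
```

Each query is answered by intersecting a few sets, so its cost depends on how many Didgets carry the tags in question rather than on walking a directory tree.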

The Didget Manager is designed to be lightning quick at finding all Didgets that match a given query. For example: if a chamber contained 20 million Didgets and 5 million of them were photographs, it could return a list of the Didget IDs of all 5 million photographs in less than 1 second. If the query required matching several tag values, it would still take less than 10 seconds to return the complete list even if every photograph had dozens of tags attached to it.

The Didget Manager is able to accomplish this task without needing a separate database that can become out of sync with the Didget metadata. The entire system has been built around a "10 second rule". That is to say that the algorithms and structures of the system have been designed such that with the right hardware setup, no query should ever require more than 10 seconds to complete even if the chamber contains billions of Didgets.

Wednesday, May 9, 2012

Prerequisites

As stated earlier, I have invented a new system (Didget Management) that I think can eventually replace conventional file systems. Before going into any details about the various features of my new system, I thought I would first discuss the requirements of any data management system that hopes to have any chance of unseating the reigning champion - the traditional file system.

Of course, any departure from over 50 years of computing tradition will be met with a certain amount of pain. No matter how many processes are put in place to ease the migration of data from one system to another, existing users and programs must adapt to a whole new way of managing data. Some features of conventional systems can be emulated to provide some level of backward compatibility, but nevertheless there will be a learning curve and the new system will break some old ways of doing things.

Obviously, the new system must offer some very compelling features in order to make the pain worth it. The new system must not only solve a number of existing problems, but it must also open up lots of new opportunities. As the Internet has proven, if you provide enough value, widespread adoption is possible in spite of many hurdles. The Internet went from a curiosity to an integral part of the computing landscape in a relatively short time once users and developers realized the power of these inter-connected servers and the opportunities they opened up.

If it ain't broke, don't fix it...

Although in a previous post I enumerated quite a few problems with conventional file systems, they also have a number of features that I think work very well. My new system must be able to provide these features with equivalent speeds and ease of use.

1) Block Based Storage.

File systems rely heavily on the block storage nature of the physical storage devices they control. Hard drives, flash drives, and optical disks are all block based storage mediums. Like file systems, the Didget Manager makes heavy use of a block based architecture.

2) Variable-Length Data Streams.

Each file in a file system has a data stream that consists of a set of bytes arranged serially that represents information stored in a digital format. The data stream may contain structured or unstructured data. Numerous formats have been invented over the years that programs rely heavily upon to work properly. Just like a file, a Didget has a variable length byte stream that can contain any kind of data. Any existing file can be converted to a Didget without modifying its data stream.

3) Robust, Yet Simple API.

Applications must be able to create, delete, access, modify, and perform queries against any number of data elements (e.g. files). Like file systems, the Didget Management System will release a robust set of APIs that make it very easy for applications to create new Didgets and query or otherwise manipulate existing ones.

4) Support for Massive Numbers.

Modern file systems like NTFS can handle billions of files within a single volume if the underlying storage is large enough to hold them. Like volumes do with files, each Chamber can handle billions of individual Didgets.

5) Fast Access.

File systems have been finely tuned over the years to provide quick response to commands from applications to create, open, read, write, and close files. The Didget Manager is able to perform similar operations with just as much speed as conventional file systems. For batch operations where thousands of new Didgets are created at once, we can even do it faster.

Who am I?

The most obvious questions anyone who is taking a serious look at this technology would ask are: Who is this guy? and Does he know what he is talking about?

My name is Andy Lawrence. I have over 20 years of experience designing and implementing file system drivers, custom file systems, disk utilities, and cloud storage solutions.

Hopefully, my posts on this blog will speak for themselves as to whether my ideas have merit. I will leave it up to the readers to judge for themselves if the problems I discuss are real and whether or not my solutions are valid.

As far as my qualifications for delivering storage solutions go, here are my credentials. After graduating from college in the late 80s with a BS in computer science, I joined Novell where among other things I worked on device drivers for the DOS, Windows, and OS/2 operating systems. Here I learned in great detail about how disk drives worked and how file systems handle data streams.

In 1995, I joined a small startup called PowerQuest where just a few of us engineers worked on a disk partitioning product called Partition Magic. During my nearly 8 years at the company, I led the development of a couple of other products, Drive Copy and Drive Image. These were among the first disk imaging solutions to enter the market.

I have written custom file systems, worked on cloud based backup solutions, and designed my own general-purpose data management solution. My current "Day Job" is at Move Networks. I am a Principal Engineer at this company which was acquired by Echostar early last year.

I have recruited a small team of former colleagues to assist me in implementing the various features of this new architecture. They also have regular jobs, so we are working on this project in our spare time.

Tuesday, May 8, 2012

The Problem(s) with File Systems

File systems have been the backbone of data storage systems since the early days of computing. I use the plural term because as everyone knows there isn't just one file system, but there are lots of them. FAT, FAT32, NTFS, Ext2, Ext3, Ext4, HFS+, ZFS, etc., are all examples of such file systems and every few years, one of the operating system vendors or someone in academia comes up with a new one.

Over the years more than 100 different file systems have joined the ranks. Some fill very niche applications, others have gained moderate market acceptance, while yet others are running on computers numbering over 100 million. Each new file system offers at least a few unique features (e.g. long file names, access control lists, extended attributes, journaling, or hard links) that set it apart from the others in the field, but all file systems are constrained by the general file system architecture.

Backward compatibility issues and the desire of application designers to write to a single, unified, file API make it extremely difficult for new file systems to introduce compelling, original features without breaking the mold. Over the years, numerous problems dealing with data storage have surfaced. Some problems have been solved or at least mitigated by the introduction of newer file systems. Other problems continue to plague data managers and require a radical new approach to solve.

In spite of numerous problems, file systems work reasonably well and their endurance is a testament to their designers. However, I believe the time has come to replace file systems with something better. By that, I don't mean we need to just build another file system that does a few things differently than the others. I mean that we need a radically new general-purpose data management system that is not limited by the conventional file system architecture.

So, what's wrong with today's file systems? Let me count the ways...

The biggest problem is that file systems don't actually "manage" files. Sure, they enable hundreds or thousands of different applications to create lots of files, but they don't actually help manage them. File systems only do what applications tell them to do and nothing more. A file system won't create, copy, move, or delete a file without an explicit command to do so from the operating system or an application.

Any application with access can create one or more files within the file system hierarchy and fill them with structured or unstructured data. With today's cheap, high-capacity storage devices, file system volumes can be created which will hold many millions of files. Some file systems are capable of storing several billion individual files.

While a file system will make sure that every file's data stream is properly stored and kept intact and that every file's metadata fields maintain the last values set by applications, the file system itself knows almost nothing about the files it stores. Every file system treats each of its files like a "black box". A file system can't tell a photo from a document or a database from a video. It doesn't make sure that the file's unique identifier (its full path including the file name) is in any way related to the data it contains. It doesn't care a whit if an application creates a file containing a résumé and names it C:\photos\vacations\MyGame.exe. It also will let a user store music files in their /bin directory or put critical operating system files in a folder called /downloads/tmp.

What this means is that if the user, for example, wants to find all photo files that were created in 2011, the file system will do little to help find them. An application must examine each and every file within the system and compare its data type with known photo formats and then check its date and time stamps to see if it was created in that year. Unlike databases that have sophisticated query languages for lightning fast searches, file systems have things like findFirst and findNext.
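The brute-force search described above looks roughly like the following sketch. The extension list and function names are assumptions, and the modification time stands in for the creation time because a true creation time is not available through the same call on every platform.

```python
import os
from datetime import datetime, timezone

PHOTO_EXTENSIONS = {".jpg", ".jpeg", ".png", ".tif", ".tiff"}  # assumed list

def is_photo_from_year(file_name, timestamp, year):
    """Check one directory entry: photo-like extension and timestamp in year."""
    extension = os.path.splitext(file_name)[1].lower()
    if extension not in PHOTO_EXTENSIONS:
        return False
    return datetime.fromtimestamp(timestamp, tz=timezone.utc).year == year

def find_photos_from_year(root, year):
    """Visit every file under root; the file API offers no shortcut."""
    matches = []
    for dirpath, _dirnames, file_names in os.walk(root):
        for file_name in file_names:
            full_path = os.path.join(dirpath, file_name)
            try:
                timestamp = os.stat(full_path).st_mtime
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if is_photo_from_year(file_name, timestamp, year):
                matches.append(full_path)
    return matches
```

Every file on the volume is touched at least once, whether or not it is a photo, and the answer is only as good as the extension list and the time stamps.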

When the number of files within a file system grows beyond several thousand, it becomes increasingly difficult for the average user to try and manage them using a file browser and a well defined folder structure. Once the number exceeds a million, the user is generally completely lost without special file management applications to help organize all the files. Basic searches for either a single file or for groups of files can take a very long time since directory tree traversals using string comparison functions are inherently slow. As the number of files grows, the queries take longer and longer.

To combat this problem, users are turning to special purpose data management applications to help them manage a certain subset of all their files. To manage their music files, they get iTunes. To manage their photos they try Picasa or Photoshop. To manage their video streams they install iMovie. Each of these applications offers ways to organize and keep track of their respective data sets. They often allow the user to tag or otherwise add special metadata to every file they manage to help the user classify files or put them into playlists, favorites, or albums. This extra metadata is often stored in a proprietary format or in a special database managed exclusively by the application.

This solution to managing data generally results in a collection of separate "silos of information" that do not interoperate with each other very well. Other applications are not able to easily take advantage of the extra metadata generated by the various data management applications. Many files within a given volume are not part of one of these silos and must be managed independently. Movement of data from one system to another often requires special import and export functions that don't always work. Finally, the management applications often just maintain references to the files they manage. If another application moves, renames, or deletes the underlying files, the management application often runs into problems as it tries to resolve the inconsistencies.

Operating systems like Windows 7 and Mac OS X include special file indexing solutions (Windows Search and Spotlight) to help the user find files. The indexer will comb through some or all of the files in a volume and "Index" the metadata and/or file content it can identify. It will store all the index information within special database files so that the user or applications can quickly find files based on keywords. Unfortunately, these indexers are not tied directly to the file systems they index. It is often the case (especially with portable storage devices) that changes are made to files while the indexer is not currently controlling or monitoring changes. This can happen if the user boots another operating system or plugs the portable drive into another computer. Once the indexer resumes operation, it must go through an extensive operation to try and figure out what changed. In some cases, it just deletes its index and starts over. For volumes with millions of files, it can take many hours to re-index.

Some file systems allow extended attributes to be created by applications and maintained by that file system. Unfortunately, extended attributes are not universally supported and each file system's implementation is different. Copying or moving files with extended attributes between file systems can result in the loss of information or its unexpected alteration. Even those file systems that allow extended attributes do not provide any fast way for applications to search for files based on them. Other than through the indexing services mentioned earlier, it is nearly impossible to find a set of files based on a common extended attribute value.

Another persistent problem is that every file within the system is subject to changes initiated by any application with access. The file API is very open and allows almost every piece of metadata or byte stream to be modified at will. Malicious or inadvertent changes can wreak havoc on a system. A virus that manages to run under the logged on user is able to modify any file that user has rights to. Such malicious programs could, for example, make random alterations to filenames and/or folder names and thus invalidate any stored path names. A program can change date and time stamps, file attributes, access permissions, and file locations. The file attribute "Read-only" is just a suggestion for applications to leave the data stream alone. Any program with rights can simply change the attribute to "Read-Write", modify the file contents at will, and even change the attribute back once it is finished. What this means is that no file metadata or data stream can be trusted to be either accurate or even reflect its original state. Only a bit-by-bit comparison with another set of original data can assure that any file has not been altered.
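The "Read-only is just a suggestion" point is easy to demonstrate. The sketch below works on a throwaway temporary file and uses only standard library calls; the function names are made up for the example, and it assumes the program runs as the owner of the file.

```python
import os
import stat
import tempfile

def modify_read_only_file(path, new_bytes):
    """Clear the read-only state, rewrite the file, then put the state back."""
    original_mode = stat.S_IMODE(os.stat(path).st_mode)
    os.chmod(path, original_mode | stat.S_IWUSR)
    try:
        with open(path, "wb") as handle:
            handle.write(new_bytes)
    finally:
        os.chmod(path, original_mode)

def demo_read_only():
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        with open(path, "wb") as handle:
            handle.write(b"original")
        os.chmod(path, stat.S_IRUSR)              # mark the file read-only
        modify_read_only_file(path, b"tampered")
        still_read_only = not (os.stat(path).st_mode & stat.S_IWUSR)
        with open(path, "rb") as handle:
            content = handle.read()
        return content, still_read_only
    finally:
        os.chmod(path, stat.S_IRUSR | stat.S_IWUSR)  # allow cleanup
        os.remove(path)

print(demo_read_only())
```

After the call, the file is marked read-only exactly as before, yet its contents are different. Nothing in the attribute itself records that it was ever switched off.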

For many operations such as file backup or synchronization, a knowledge of the order of operations against a particular data set is crucial. File systems use date and time stamps to keep track of when files are created, accessed, or modified. As was previously pointed out, because each of these time stamps can be altered at will, they may not be accurate. Even if no application alters them, the values they contain may not reflect the proper order of operations. The file system simply queries the value of the system clock controlled by the running operating system when it records date and time stamps. The clock may be off by a few minutes, hours, days, or even longer. The clock can be reset by the user or by synchronization with another computer. A portable drive that is plugged into two different computers during the course of a day, each with a clock that is different, may not record the proper sequence of events with regard to file operations.
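How easily a time stamp can be rewritten is shown in the short sketch below. It creates a temporary file and sets its access and modification times to an arbitrary value with the standard os.utime call; the function names are made up for the example.

```python
import os
import tempfile

def backdate(path, seconds_since_epoch):
    """Set both the access and modification times to an arbitrary value."""
    os.utime(path, (seconds_since_epoch, seconds_since_epoch))
    return os.stat(path).st_mtime

def demo_backdate():
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        # 500,000,000 seconds after the Unix epoch falls in late 1985.
        return backdate(path, 500_000_000)
    finally:
        os.remove(path)

print(demo_backdate())
```

A backup or synchronization tool that orders operations by modification time would treat this brand-new file as decades old.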

Lastly, one of the biggest weaknesses of file systems is the unique identifier that is used for files. The file name and the folder names in its hierarchy make up each file's unique identifier. Every file must have one and only one full path name and it must be unique. Some file names are human readable, others are generated by software and may look like "RcXz12p20.rxz". The human readable names are generally in the language of the creator and cannot be translated without altering the file's unique identifier. Various file organizers and any other applications that want to keep track of one or more files often store the full path to the files either within a database or within another file's data stream. If the original file name is altered or any folder in its path is either renamed or the file is moved to a new folder, the stored path becomes invalid. "File Not Found" is among the most common error conditions encountered by users or applications.
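The stale path problem takes only a few lines to reproduce. In this sketch a path is remembered the way a catalog or database would remember it, the file is then renamed, and the remembered path is checked again. The file names and the function name are made up for the example.

```python
import os
import tempfile

def demo_stale_path():
    with tempfile.TemporaryDirectory() as folder:
        original = os.path.join(folder, "report.txt")
        with open(original, "w") as handle:
            handle.write("quarterly numbers")

        stored_path = original   # what a catalog or database would save

        # Some other program renames the file behind the catalog's back.
        os.rename(original, os.path.join(folder, "report-final.txt"))

        # The data still exists, but the stored identifier no longer finds it.
        return os.path.exists(stored_path)

print(demo_stale_path())
```

The function returns False: the data is untouched, yet any application holding the old path gets "File Not Found".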

Computers are much faster at crunching numbers than they are at string comparisons. It will always be much faster for a file system to find a million files if it is given their iNode numbers than it would be to find them based on a million different full path names.

As block storage devices like hard disk drives and flash memory drives continue to expand in capacity, the number of files within any given file system volume will continue to increase dramatically. As the average number of files a user or business has grows, the issues identified here will become even more problematic.