If you look up at the sky on a moonless night, far away from any city lights, you will see many thousands of individual stars. An asterism is a group of those stars that can be connected together in our minds to form a stick figure. Constellations are ancient asterisms that gained popular names like Virgo or Ursa Major. Other asterisms that make up just a portion of a constellation have also been given popular names, like "The Big Dipper" or "Orion's Belt". People who stargaze and either find some of these popular asterisms or form their own are looking for "patterns" among the thousands of stars.
Searching for patterns is also common when we deal with all the data that exists as individual files or database records on our hard drives, flash memory cards, or DVDs. Sometimes these patterns are already established for us. A popular software package may consist of a dozen separate executable files along with their configuration files and documentation. They are often copied into one or two folders or directories during an installation process to keep them together. Sometimes installation programs copy them into common folders like /usr/bin, where they get all mixed in with other programs and it is not so easy to sort out which files belong to which programs.
But even files that seem to be completely independent of other kinds of data (e.g. a photo or a song) can often be grouped together with other files to form ad hoc groups (e.g. a photo or a music album). We are constantly trying to make connections between different data points to form new and interesting patterns. Facebook and other social media sites provide mechanisms to form some of these patterns. A user posts messages, pictures, documents, videos, and other personal information in order to tell a story about their life, their interests, and their friends. It is the connections between lots of individual pieces of data that can lead to new interactions and help us make decisions.
The current trend in "Big Data" and various forms of analytics is all about finding patterns in large amounts of data to drive business decisions. Analyze a million customer orders to look for patterns of shopping behavior when it is cold outside in order to figure out what items to put on sale when the next big storm hits. Analyze emails sent by everyone over 65 years old in Florida to figure out what political messages will most likely sway the most voters.
The trick to establishing meaningful patterns among millions or billions of individual data points lies in the ability to quickly analyze each point and determine whether it has a significant connection to another point. The system used to store the information is a critical component in being able to quickly check lots of data points for a certain condition in order to sift the wheat from the chaff. The system must not only be able to match things like strings or numbers, but it must also provide some kind of context in order to make more meaningful connections.
For example, if someone wanted to analyze a group of messages to gain intelligence about military hardware, the word "Tank" would be a meaningful keyword to search for. However, such a "brute force" search might turn up every message that deals with water tanks, gas tanks, and R&B music. It is much more meaningful if the search were conducted using "Vehicle=Tank" instead.
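The difference between the two searches can be sketched in a few lines. This is purely illustrative (the message texts and tag names are made up, and this is not the Didget API): a substring search matches every occurrence of the keyword, while a tag search with context matches only the intended meaning.

```python
# Contrast a brute-force keyword search with a context-aware tag search
# over a small, made-up set of messages.
messages = [
    {"text": "The water tank needs refilling", "tags": {"Topic": "Plumbing"}},
    {"text": "New album by Tank is out",       "tags": {"Genre": "R&B"}},
    {"text": "Enemy armor spotted near ridge", "tags": {"Vehicle": "Tank"}},
]

# Brute force: any message whose text contains the keyword anywhere.
brute = [m["text"] for m in messages if "tank" in m["text"].lower()]

# Contextual: only messages explicitly tagged Vehicle=Tank.
contextual = [m["text"] for m in messages
              if m["tags"].get("Vehicle") == "Tank"]

print(brute)       # picks up the irrelevant water-tank and R&B messages
print(contextual)  # only the military-hardware message
```

The brute-force list contains two false positives; the contextual query returns exactly the one message that is actually about a military vehicle.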
The Didget Management System was designed to not only manage large numbers of data points, but to also aid in making connections between points in order to find new patterns. By attaching many searchable tags to any given piece of data and by providing context for every single tag, the system makes it easy to find all the data that share a common attribute. It can also rank various connections between any two points based on the number of attributes they share in order to give hints about more relevant connections.
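One simple way to rank a connection between two data points, as described above, is to count how many attribute=value pairs they share. The sketch below assumes tags are flat key/value pairs; the tag names here are invented for illustration and are not part of any Didget specification.

```python
def shared_tag_count(tags_a: dict, tags_b: dict) -> int:
    """Count the attribute=value pairs present on both items."""
    return sum(1 for k, v in tags_a.items() if tags_b.get(k) == v)

# Hypothetical tag sets for three pieces of data.
photo = {"person": "Bob", "place": "Hawaii", "year": "2012"}
email = {"person": "Bob", "place": "Hawaii"}
song  = {"genre": "R&B",  "year": "2012"}

print(shared_tag_count(photo, email))  # 2 -> stronger connection
print(shared_tag_count(photo, song))   # 1 -> weaker connection
```

A system could sort candidate connections by this count to surface the most relevant ones first.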
Big Data Analytics is all about finding hidden patterns and unknown correlations in large amounts of data. This means that specialized queries must be conducted against all that data to try to find meaningful patterns. When the data is created and stored, the nature of such queries is largely unknown. In other words, the data must be stored in such a way as to support as wide a variety of potential queries as possible.
The speed at which a query can execute is a major factor in finding that "needle in a haystack". If a big data set consists of 10 billion data points and every query takes several hours to complete, then it becomes very hard to conduct lots of different types of queries, looking for a pattern. If, on the other hand, such a query can execute in a minute or less, then it becomes practical to conduct a wide variety of queries hoping that a meaningful pattern just "pops out in front of you".
Several other "big data" projects like Hadoop, MapReduce, HBase, Cassandra, and MongoDB have been structured to be spread across a cluster of nodes so that the processing of data can occur in parallel. This can greatly reduce the time necessary to perform a query. Such systems can be very complex to set up and administer, however. Our system has been designed to greatly simplify such configurations.
But finding patterns should not be exclusive to large companies with big data sets. Individual users could greatly benefit from finding meaningful patterns among a few million pieces of information. If I got a message from Mary about her vacation in Hawaii, it would be helpful if there were an "about" button next to her name that, when pushed, would bring up a list of every message, photo, and document that she had sent me or that was about her. Likewise, it would be helpful if the message itself had hyperlinks in it that, when clicked, would bring up my own photos of Hawaii or information about scuba diving or whale watching. These links could be generated automatically by the system based on tags already present on other Didgets.
Sunday, February 24, 2013
Saturday, January 19, 2013
Silos of Information
As I stated earlier, the Didget Management System was designed to offer an alternative to conventional data management systems that tend to manage just a subset of all data and to build walls around any extra metadata they may generate. With such systems, a given set of a few million pieces of data (files and/or database rows) will often be fragmented into several of these "silos of information".
To illustrate this using a real world example, consider the following:
A user has a 2TB hard drive nearly full of data. Since the average size of each piece of data on the drive is about 1 million bytes, this represents nearly 2 million different files. Out of all that data, there exists three important pieces of information that are from the user's friend Bob. Bob has sent the user an email; the user has taken a picture of Bob and transferred that picture from his camera to the hard drive; and Bob has also authored a document that the user downloaded from Bob's web site.
The email was transferred from the email server to the user's computer running Windows 7 by Microsoft Outlook and stored in a .pst file somewhere in the file system hierarchy. The picture was imported into Google's Picasa photo manager and was "tagged" with "Bob" using their facial recognition feature. This tag was embedded within the .jpg file using the EXIF space reserved for such tags. The document was also stored somewhere on the file system and the user set an extended attribute of "Author=Bob" on the document file using a special document manager program.
Now the user wants to do a general search for everything on his drive that has to do with his friend Bob and hopes to come up with all three pieces of information. He wants a program that will comb through all his data and find those pieces.
1) The program will prompt the user for what to base the search on. The user just types "Bob" since there is no standard schema that helps identify "Bob" as a person's first name.
2) The program must now be able to do a complete folder hierarchy traversal, looking for any instances of the string "Bob". It might find a bunch of files that contain "Bob" in their file name, like "bobset.exe" or "stuff.bob". It would need to show those to the user since it doesn't know if they might be relevant.
3) For every file the program searches, it would need to peek at the file's extended attributes to see if any contained the word "Bob". As with the file-name matches, it would need to display a file called "Photo1.jpg" that had an extended attribute "Activity=Bobbing for apples". For every .jpg file, it would need to know how to open and search the EXIF data portions, also looking for any tags that might have "Bob" in them.
4) It would need to be able to parse through any .pst files by following the Microsoft specification, looking for any emails that might come from, or be about Bob.
Each of these things represents a different "silo" of information that would need to be accessed and understood by the program doing the search. The .pst database file, the file extended attributes, and the .jpg EXIF metadata are examples of these silos. There are many other silos like .db files, HTML or XML files, registry files, INI files, and .doc files. Accessing each of them requires knowledge of their format and the rules for parsing them.
If, instead of using those systems to store data, Microsoft's Outlook, Google's Picasa, and the document manager were all built on top of the Didget Management System, then things could be much simpler. The email could be stored in a "Message Didget", the picture could be stored in a "Photo Didget", and the document could be stored in a "Document Didget". Each of these Didgets would have the tag ".person.FirstName = Bob" attached to it.
Now any application could look for stuff about Bob without missing anything or getting all kinds of unintentional results. It would also find all three items in less than 1 second instead of the painfully slow search in the previous example.
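The shape of such a query can be sketched as an exact tag match over a single container. The Didget type names and the ".person.FirstName" tag come from this post; the in-memory list standing in for a Chamber is purely illustrative.

```python
# A tiny stand-in for a Didget container holding the three items from
# the example, plus one item that would fool a substring search.
didgets = [
    {"type": "Message",  "tags": {".person.FirstName": "Bob"}},
    {"type": "Photo",    "tags": {".person.FirstName": "Bob"}},
    {"type": "Document", "tags": {".person.FirstName": "Bob"}},
    {"type": "Photo",    "tags": {".activity": "Bobbing for apples"}},
]

def query(container, tag, value):
    """Return every Didget whose tag exactly matches the given value."""
    return [d for d in container if d["tags"].get(tag) == value]

hits = query(didgets, ".person.FirstName", "Bob")
print([d["type"] for d in hits])  # ['Message', 'Photo', 'Document']
```

Because the match is against a specific tag rather than any string containing "Bob", the "Bobbing for apples" photo is never returned, and no .pst, EXIF, or extended-attribute parsing is needed.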
Sunday, January 6, 2013
A Non-Hierarchical Data Management System
Those of you who have been following this blog may be wondering why I have never called the Didget Management System an "Object Store". This has been intentional, since I believe that Didgets offer many features that other kinds of persistent objects simply do not, and I wanted to avoid confusion. Once someone hears the word "Object" they tend to get all kinds of notions in their head about what a Didget is and what it should do.
But the reality is that a Didget has more in common with an object than with a traditional file. One of the things that really sets our system apart from other kinds of object stores like Amazon S3, is that we are designing it to be a replacement for local file systems as well as a cloud storage system. If we are successful, in another ten years all the data stored on your laptop, desktop, mobile device, and in various cloud storage containers will be in the form of Didgets.
Although it will be very easy to overlay our system with a traditional hierarchical namespace to provide backward compatibility with legacy systems, our native storage design is anything but hierarchical. This is somewhat like having a file system volume with just a single folder (i.e. root) where every file in the volume is stored. Since files use simple names as their unique identifier, such a system is impractical (if not impossible) for a traditional file system with millions of files stored on it. With Didgets it is not only possible, but very practical to store a hundred million of them within a single container without a hierarchical naming model.
Every Didget within a container has its own unique 64-bit ID that is used to access it. For systems that interact with data without needing its identifier to be in human-readable form, it is easier and faster to store IDs as numbers rather than as names like "C:\Windows\System\Drivers\adpahci.sys". With our system, it is also very easy to find groups of Didgets that match a given criterion or to narrow down a simple search to find the single Didget you are looking for.
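As a rough illustration (the real container format is not described in this post), addressing a record by a fixed-size 64-bit integer instead of a path string can be sketched with a dictionary keyed by integer IDs:

```python
# A toy "container": records keyed by 64-bit integer IDs rather than by
# long, human-readable path strings. The ID and payload are invented.
didgets = {
    0x0000_0000_0000_002A: b"...driver bytes...",
}

def open_by_id(container, didget_id: int) -> bytes:
    # Comparing one 8-byte integer is cheaper than parsing and comparing
    # a path like "C:\\Windows\\System\\Drivers\\adpahci.sys".
    return container[didget_id]

data = open_by_id(didgets, 0x2A)
```

A reference to the record survives any renaming, because nothing about the identifier encodes a human-readable name.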
But we have hierarchical namespaces for a reason, so it is worth reviewing how we got here.
With file systems, a file's identifier is its name. You create, open, move, copy, and delete a file by passing its name into a file API function. Since its name is its identifier, you can't have two different files with the exact same name. This means you have to come up with a unique name for every file - a task that gets increasingly harder as the number of files increases. Early file systems like FAT, which were not case sensitive and restricted names to the 8.3 format, made this task even harder. Even with long file name support that allows very specific and descriptive names, devices like cameras tend to create your pictures with names like "Photo_001.jpg", "Photo_002.jpg", "Photo_003.jpg", etc.
To get around naming conflicts and to add a simple categorization facility, file system designers came up with a hierarchical directory (or folder) model. A file's name only needed to be unique within a given folder, and its full path name became its unique identifier. The file name and folder name could be easily human readable and provide clues for navigation that were intuitive for many users. The folder system also made it easy to copy, move, or delete whole folders or entire folder trees using simple commands.
But file names and folder hierarchies have a number of problems associated with them. Changing the name of any file or any folder in its path will change its unique identifier and thus invalidate any stored references to it. The human readable names cannot be translated from one language to another without causing the same problem. An unprotected file might have its contents overwritten by a completely unrelated file that just happens to have the same name. If I want to store photos I have downloaded, I might have both a "/home/photos/download" and a "/home/download/photos" folder and have files in both - causing confusion.
Didgets operate in a completely different manner. Each Didget can have a name (or multiple names) attached to it as a name tag. When a file is converted to a Didget, each folder name may be attached to it as a separate folder tag. Unlike file paths, the ordering of tags doesn't make any difference. So if we were to overlay a hierarchical namespace on the Didget system, a command like "ls /home/andy/documents/projects/projectX/*" would give the same results as "ls /documents/andy/projectX/projects/home/*".
You could leave folder tags out of a search and simply get more results. For example, "ls /andy/*.jpg" would return all the JPG photos that were stored in any path that had "andy" as one of its folders. New folder tags could be added, or existing tags deleted, at any time without having "moved" the Didget or changing its unique identifier in any way. Existing tags can also be modified or translated into another language with the same lack of consequences.
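The semantics described above amount to treating folder tags as an unordered set: a "path" query matches when its folder names are a subset of the Didget's folder tags, so order is irrelevant and omitting folders simply widens the search. The sketch below is illustrative, not the real API.

```python
import fnmatch  # shell-style wildcard matching for the name pattern

# A Didget with one name tag and a set of folder tags (order-free).
didget = {
    "name": "beach.jpg",
    "folders": {"home", "andy", "documents", "projects", "projectX"},
}

def matches(d, folders, name_pattern):
    """True if every queried folder is among the Didget's folder tags
    and its name matches the wildcard pattern."""
    return (set(folders) <= d["folders"]
            and fnmatch.fnmatch(d["name"], name_pattern))

# Both "paths" match because tag ordering does not matter.
print(matches(didget, ["home", "andy", "projects"], "*.jpg"))  # True
print(matches(didget, ["projects", "andy", "home"], "*.jpg"))  # True
# Leaving folders out just broadens the match.
print(matches(didget, ["andy"], "*.jpg"))                      # True
```

Renaming a "folder" under this model is just editing one tag on the Didgets that carry it; no references based on the Didget's ID are invalidated.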
Such a system provides a much more flexible mechanism for categorizing and finding data. As previous posts have shown, we can find all matching Didgets much faster than conventional file systems can find matching files.
Saturday, December 15, 2012
Cloud Based Solutions
When I bought my first computer back in 1986, I splurged for the 10 MB hard drive option. It cost nearly $800 and was incredibly slow by today's standards, but compared to the rest of my data storage (a handful of 5 1/4 inch floppy disks) it was a huge leap forward. That hard drive and my floppies together totaled less than 20 MB and comprised my entire data storage capacity.
As time went on, I replaced each of my storage devices with larger-capacity and faster units. Sometimes when I bought a new device, it became a completely separate storage system instead of just replacing an existing one. Today, I have over 20 different storage devices (hard drives, flash drives and cards, NAS boxes, SSDs, and cloud buckets), each with a set of files stored on it. Total capacity is somewhere around 12 TB, and I have a lot of data stored on them.
Having lots of separate storage devices is both good and bad. I have storage directly attached to many of the devices I am working on so I can access information even when the Internet is not accessible - good. I try to spread my data around and keep redundant copies or backups of important data in case any individual storage device fails or is lost or stolen - better. If I have the right procedures in place, I ultimately control all the data that I have stored - best of all.
But it can be difficult to figure out which of my many devices has a piece of data that I am looking for - bad. I have to remember to backup or replicate data that might be unique to any given device - worse. I might be on a trip and remember that the data I need is on a flash drive in a drawer at home - also worse. I might have multiple copies of a given piece of data and if I update one copy, I need to remember to update all the copies, otherwise I have multiple working sets of data that are not synchronized - worst of all.
Recent offerings by cloud storage providers such as Dropbox, Google Drive, SugarSync, or Amazon S3 have attempted to solve some of these problems and a few others. Unfortunately, they also introduce a number of new problems and challenges.
Keeping your data in the "Cloud" can be beneficial in many instances. Redundant copies or backups are handled automatically by the storage provider. The data can be accessed by nearly any device with an Internet connection. Storage capacity can be very flexible and grow to meet your storage needs without having to purchase new units and migrate your data. It is easy to share your data with others. All these features offer compelling reasons to put data in the cloud.
But cloud storage is currently much more expensive than just buying a new hard drive. If you have many terabytes of data, it can be incredibly expensive to store all that data in the cloud. Data transfer speeds can also be very slow when compared to local storage. Sometimes users experience extremely slow speeds when performing a backup or restore operation. Slow performance and costs make it critical to be able to exclude large quantities of unimportant data from cloud backup or synchronization functions. Finding stuff stored in the cloud can also be a slow and difficult process. If you have a few million pieces of data stored in one of those cloud buckets, it might take quite a while to find one of them if you have forgotten its unique key name. Likewise, finding all pieces of data that meet some specific criteria can also take a very long time.
The most troubling part of cloud storage seems to be the lack of control over your own data. If your only copy of a valuable piece of data is out in the cloud, you are completely dependent upon the cloud provider to make sure you have unimpeded access, that the data is free from corruption, and that it is secure from unauthorized access. Even Steve Wozniak has expressed great concern about the recent trend for individuals and businesses to store large amounts of their important data on systems controlled by someone else.
Personally, I think all the current cloud offerings represent a half-way solution. Universal access, flexible storage capacity, and automatic redundancy are great features. But I think the real, full solution is to have just a copy of important data (and only important data) stored in the cloud that is easily synchronized with other copies of that same data on local systems where the user has complete control.
This is one of the compelling features of the Didget Management System.
Thursday, December 6, 2012
Extreme Performance Demonstration
I created a third demonstration video of the Didget Management System in action. This one shows how fast we can find things even when the number of Didgets gets very high.
See it at www.screenr.com/5Zx7
In this video I create nearly 10 million Didgets in a Chamber and automatically attach a set of tags to each one. Each tag has a value associated with it. I then perform queries against that Chamber for all Didgets of a certain type, followed by an additional query for all Didgets that have a certain tag attached to them, regardless of its value. Finally, I perform a couple of queries that look for Didgets with that tag attached whose value also starts with an input string.
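The real Didget API is not public, but the three kinds of queries shown in the video can be sketched in a few lines. Every name below (`Chamber`, `query_by_type`, and so on) is a hypothetical stand-in for illustration, not the actual interface:

```python
# Hypothetical, simplified model of the queries performed in the video.
# None of these names come from the real Didget API.

class Chamber:
    def __init__(self):
        # Each Didget is modeled as a type plus a dict of tag:value pairs.
        self.didgets = []

    def add(self, dtype, tags=None):
        self.didgets.append({"type": dtype, "tags": tags or {}})

    def query_by_type(self, dtype):
        # Query 1: all Didgets of a certain type.
        return [d for d in self.didgets if d["type"] == dtype]

    def query_by_tag(self, tag):
        # Query 2: tag attached, regardless of its value.
        return [d for d in self.didgets if tag in d["tags"]]

    def query_by_tag_prefix(self, tag, prefix):
        # Query 3: tag attached AND its value starts with the input string.
        return [d for d in self.didgets
                if str(d["tags"].get(tag, "")).startswith(prefix)]

chamber = Chamber()
chamber.add("photo", {".date.Year": "2012"})
chamber.add("photo", {".date.Year": "2011"})
chamber.add("video", {".date.Year": "2012"})

assert len(chamber.query_by_type("photo")) == 2
assert len(chamber.query_by_tag(".date.Year")) == 3
assert len(chamber.query_by_tag_prefix(".date.Year", "201")) == 3
```

The real system gets its speed from scanning compact metadata records rather than Python objects, but the query semantics are the same.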
Again, I was running this demonstration on the same low-end PC as in the previous two videos. If I were to attempt to find all the video files on my NTFS file system and if there were 10 million files on it, that query would take nearly an hour using a regular program calling the file API. With the Didget Management System, the slowest query took about 3 seconds.
Monday, December 3, 2012
Demo Video Part 2
I added another short video of a demonstration of tags used in the Didget Management System.
View at: www.screenr.com/fXd7
This video emphasizes the creation of tags and attaching them to a set of Didgets so that we can query based on them or create lists (e.g. Albums) from the query results.
Each Didget can have up to 255 different tags attached to it. There can be tens of thousands of different tags to choose from, and each tag value can be a string, a number, a boolean, or another value type. We provide a set of predefined tags such as .event.Holiday, .date.Year, and .activity.Sport, but the user is free to define any additional tags and immediately begin attaching them to any Didget.
Attaching tags to Didgets and performing queries based on them works exactly the same way for photos, documents, music, videos, or any other type of Didget.
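A rough sketch of what attaching a typed tag might look like, with the 255-tag limit from above enforced. The function name, the dict representation, and the value types assigned to the predefined tags are all assumptions for illustration; only the limit and the tag names come from the post:

```python
# Hypothetical sketch of typed tag attachment. MAX_TAGS_PER_DIDGET and the
# predefined tag names come from the post; everything else is assumed.

MAX_TAGS_PER_DIDGET = 255

# Assumed value types for a few of the predefined tags.
PREDEFINED_TAGS = {
    ".event.Holiday": str,
    ".date.Year": int,
    ".activity.Sport": str,
}

def attach_tag(didget_tags, name, value):
    """Attach a tag, enforcing the per-Didget limit and the declared type."""
    if name not in didget_tags and len(didget_tags) >= MAX_TAGS_PER_DIDGET:
        raise ValueError("a Didget can carry at most 255 tags")
    expected = PREDEFINED_TAGS.get(name)
    if expected is not None and not isinstance(value, expected):
        raise TypeError(f"{name} expects a {expected.__name__} value")
    didget_tags[name] = value

tags = {}
attach_tag(tags, ".date.Year", 2012)
attach_tag(tags, ".event.Holiday", "Christmas")
```

User-defined tags (anything not in the predefined set) are accepted with any value type, matching the "define and immediately attach" behavior described above.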
Sunday, December 2, 2012
Video Demonstration of our Browser
After much trial and error, I was finally able to capture a video of our Didget Browser in action. The video was limited to only 5 minutes, so I had to move fast and could only show a few features, but it gives a good demonstration of the speed at which we can query any given Chamber populated with lots of Didgets.
You can watch the video at: www.screenr.com/XV17
The Didget Browser was running on a Windows 7 PC and was built using Qt, the open-source, cross-platform GUI library, so it can easily be ported to Linux and Mac OS X. It sits on top of our Didget Management System, using its API to perform most of its work.
The PC I used was a three-year-old Gateway machine I bought at Costco for $500. It has an Intel Core 2 processor, 4 GB of DDR2 RAM, and a 750 GB HDD. This was not a high-end box even when I bought it, let alone now. If you are impressed with the speed at which we are able to perform queries and to display large lists of tag values, please keep in mind it is NOT due to speedy hardware.
Whenever we perform a query, we look at the metadata records for each Didget within the Chamber. This would be analogous to checking every inode in an Ext3 file system when querying files. The same is true whenever we refresh the contents of the Status Tab: we look at each and every Didget metadata record and tally up a total for all the different categories displayed.
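The scan-and-tally approach described above amounts to a single pass over all the metadata records. A rough illustration (the real system works on compact fixed-size binary records, not Python dicts, and the record fields here are assumed):

```python
# Rough illustration of the "scan every metadata record and tally"
# approach used for the Status Tab -- not the actual implementation.

from collections import Counter

def tally_categories(metadata_records):
    """One pass over all records, counting Didgets per category."""
    totals = Counter()
    for record in metadata_records:
        totals[record["type"]] += 1
    return totals

records = [{"type": "photo"}, {"type": "photo"}, {"type": "video"}]
tally_categories(records)  # Counter({'photo': 2, 'video': 1})
```

Because the whole pass touches only small metadata records, it stays fast even with millions of Didgets; there is no secondary index to build or keep in sync.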
It is important to know that we do not have a separate database that we are querying like indexing services such as Apple's Spotlight or Microsoft's Windows Search do. Such databases can take hours to create and can easily become out of sync with the file metadata that they index.
Some of the query operations that we perform could be accomplished on a regular file system using command line utilities. For example, I can get a list of all .JPG files on my file system by entering the command:
C:\> dir *.jpg /s
The main difference is that on that same machine, with 500,000 files on the volume, this command takes nearly 3 minutes to complete. If my NTFS volume had 3 million files on it, the same command would take approximately 20 minutes. Using the Didget Browser, we are able to accomplish the same task in under ONE second. In fact, we can get a list of all the JPG Photo Didgets in under one second even if there are 25 million of them.
The difference in speed between our system and conventional file systems is even more pronounced for more complicated queries. Try to find all .JPG photos in a file system that have two extended attributes attached, with the key:value pairs Place=Hawaii and Event=Vacation. We can find all the Didgets with those two tags attached in just a couple of seconds; file systems (the ones that even support extended attributes) will take a very long time.
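The compound query above (Place=Hawaii AND Event=Vacation) reduces to a simple filter once every Didget's tags are available in one pass. A minimal sketch, assuming a hypothetical dict representation of each Didget's tags:

```python
# Sketch of the compound tag query described above. The list-of-dicts
# representation and the function name are assumptions for illustration.

def query_all_tags(didgets, required):
    """Return Didgets whose tags include every required key:value pair."""
    return [d for d in didgets
            if all(d["tags"].get(k) == v for k, v in required.items())]

didgets = [
    {"name": "img1.jpg", "tags": {"Place": "Hawaii", "Event": "Vacation"}},
    {"name": "img2.jpg", "tags": {"Place": "Hawaii"}},
    {"name": "img3.jpg", "tags": {"Event": "Vacation"}},
]
query_all_tags(didgets, {"Place": "Hawaii", "Event": "Vacation"})
# only img1.jpg matches
```

On a conventional file system the equivalent operation must open each file and read its extended attributes individually, which is what makes it so much slower than a single scan of compact tag records.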