Saturday, December 15, 2012

Cloud Based Solutions

When I bought my first computer back in 1986, I splurged for the 10 MB hard drive option. It cost nearly $800 and was incredibly slow by today's standards, but compared to the rest of my data storage (a handful of 5 1/4 inch floppy disks) it was a huge leap forward. That hard drive and my floppies together totaled less than 20 MB and comprised my entire data storage capacity.

As time went on, I replaced each of my storage devices with larger and faster units. Sometimes when I bought a new device, it became a completely separate storage system instead of just replacing an existing one. Today, I have over 20 different storage devices (hard drives, flash drives and cards, NAS boxes, SSDs, and cloud buckets), each with its own set of files. Total capacity is somewhere around 12 TB, and I have a lot of data stored across them.

Having lots of separate storage devices is both good and bad. I have storage directly attached to many of the devices I am working on so I can access information even when the Internet is not accessible - good. I try to spread my data around and keep redundant copies or backups of important data in case any individual storage device fails or is lost or stolen - better. If I have the right procedures in place, I ultimately control all the data that I have stored - best of all.

But it can be difficult to figure out which of my many devices has a piece of data that I am looking for - bad. I have to remember to back up or replicate data that might be unique to any given device - worse. I might be on a trip and remember that the data I need is on a flash drive in a drawer at home - also worse. I might have multiple copies of a given piece of data, and if I update one copy I need to remember to update all the others; otherwise I have multiple working sets of data that are not synchronized - worst of all.

Recent offerings from cloud storage providers such as Dropbox, Google Drive, SugarSync, and Amazon S3 attempt to solve some of these problems and a few others. Unfortunately, they also introduce a number of problems and challenges of their own.

Keeping your data in the "Cloud" can be beneficial in many instances. Redundant copies or backups are handled automatically by the storage provider. The data can be accessed by nearly any device with an Internet connection. Storage capacity can be very flexible and grow to meet your storage needs without having to purchase new units and migrate your data. It is easy to share your data with others. All these features offer compelling reasons to put data in the cloud.

But cloud storage is currently much more expensive than just buying a new hard drive. If you have many terabytes of data, it can be incredibly expensive to store all of it in the cloud. Data transfer speeds can also be very slow compared to local storage, and users sometimes experience extremely slow speeds when performing a backup or restore operation. Slow performance and high costs make it critical to be able to exclude large quantities of unimportant data from cloud backup or synchronization. Finding stuff stored in the cloud can also be a slow and difficult process. If you have a few million pieces of data stored in one of those cloud buckets, it might take quite a while to find a particular one if you have forgotten its unique key name. Likewise, finding all pieces of data that meet some specific criteria can take a very long time.

The most troubling part of cloud storage seems to be a lack of control over your own data. If your only copy of a valuable piece of data is out in the cloud, you are completely dependent upon the cloud provider to make sure that you have unimpeded access, that the data is free from corruption, and that it is secure from unauthorized access. Recently, even Steve Wozniak expressed great concern about the trend for individuals and businesses to store large amounts of their important data on systems controlled by someone else.

Personally, I think all the current cloud offerings represent a half-way solution. Universal access, flexible storage capacity, and automatic redundancy are great features. But I think the real, full solution is to have just a copy of important data (and only important data) stored in the cloud that is easily synchronized with other copies of that same data on local systems where the user has complete control.

This is one of the compelling features of the Didget Management System.

Thursday, December 6, 2012

Extreme Performance Demonstration

I created a third demonstration video of the Didget Management System in action. This one shows how fast we can find things even when the number of Didgets gets very high.

See it at www.screenr.com/5Zx7

In this video I create nearly 10 million Didgets in a Chamber and automatically attach a set of tags to each one. Each tag has a value associated with it. I then query that Chamber for all Didgets of a certain type. Next, I query for the Didgets that have a certain tag attached to them, regardless of its value. Finally, I run a couple of queries for Didgets that have that tag attached and whose tag value starts with an input string.
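
The three kinds of query described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the Didget Management System API: the `Didget` class, its fields, and the function names are all hypothetical, and a plain list stands in for a Chamber.

```python
from dataclasses import dataclass, field


@dataclass
class Didget:
    # Hypothetical stand-in for a Didget metadata record (not the real layout).
    name: str
    kind: str                                   # e.g. "photo", "document"
    tags: dict = field(default_factory=dict)    # tag name -> tag value


def by_type(chamber, kind):
    """All Didgets of a certain type."""
    return [d for d in chamber if d.kind == kind]


def with_tag(chamber, tag):
    """All Didgets that have the tag attached, regardless of its value."""
    return [d for d in chamber if tag in d.tags]


def with_tag_prefix(chamber, tag, prefix):
    """All Didgets that have the tag attached and whose value starts with prefix."""
    return [d for d in chamber
            if tag in d.tags and str(d.tags[tag]).startswith(prefix)]


# A tiny made-up Chamber for illustration.
chamber = [
    Didget("a.jpg", "photo", {".event.Holiday": "Christmas"}),
    Didget("b.jpg", "photo", {".event.Holiday": "Easter"}),
    Didget("c.doc", "document", {".date.Year": 2012}),
    Didget("d.mp3", "music"),
]

photos = by_type(chamber, "photo")
holiday = with_tag(chamber, ".event.Holiday")
christmas = with_tag_prefix(chamber, ".event.Holiday", "Ch")
```

Each function makes a single pass over every record, which mirrors the "look at every Didget" style of query shown in the video; it says nothing about how the real system achieves its speed.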


Again, I was running this demonstration on the same low-end PC as in the previous two videos. If I were to attempt to find all the video files on my NTFS file system and if there were 10 million files on it, that query would take nearly an hour using a regular program calling the file API. With the Didget Management System, the slowest query took about 3 seconds.

Monday, December 3, 2012

Demo Video Part 2

I added another short video of a demonstration of tags used in the Didget Management System.

View at: www.screenr.com/fXd7

This video emphasizes the creation of tags and attaching them to a set of Didgets so that we can query based on them or create lists (e.g. Albums) from the query results.

Each Didget can have up to 255 different tags attached to it. There can be tens of thousands of different tags to choose from, and each tag value can be a string, a number, a boolean, or another value type. We have a set of predefined tags such as .event.Holiday, .date.Year, and .activity.Sport, but the user is free to define any additional tags and immediately begin attaching them to any Didget.
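
As a rough sketch of the rules just described, here is one way the per-Didget tag limit and typed tag values could be modeled in Python. The class name, the `attach` method, and the `.pet.Name` tag are hypothetical and chosen only for illustration; the set of allowed value types is an assumption covering the string, number, and boolean cases mentioned above.

```python
MAX_TAGS_PER_DIDGET = 255  # limit stated in this post

# Assumed subset of value types: string, number, boolean.
ALLOWED_VALUE_TYPES = (str, int, float, bool)


class TaggedDidget:
    """Hypothetical model of a Didget with attached tags (not the real API)."""

    def __init__(self, name):
        self.name = name
        self.tags = {}  # tag name -> tag value

    def attach(self, tag, value):
        if not isinstance(value, ALLOWED_VALUE_TYPES):
            raise TypeError(f"unsupported tag value type: {type(value).__name__}")
        # Re-attaching an existing tag only updates its value, so it does not
        # count against the limit.
        if tag not in self.tags and len(self.tags) >= MAX_TAGS_PER_DIDGET:
            raise ValueError(f"a Didget can hold at most {MAX_TAGS_PER_DIDGET} tags")
        self.tags[tag] = value


photo = TaggedDidget("beach.jpg")
photo.attach(".event.Holiday", "Christmas")   # predefined tag, string value
photo.attach(".date.Year", 2012)              # predefined tag, numeric value
photo.attach(".pet.Name", "Rex")              # user-defined tag (made up here)
```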

Attaching tags to Didgets and performing queries based on them work exactly the same way for photos, documents, music, videos, or any other type of Didget.

Sunday, December 2, 2012

Video Demonstration of our Browser

After much trial and error, I was finally able to capture a video of our Didget Browser in action. The video was limited to only 5 minutes, so I had to move fast and could only show a few features, but it gives a good demonstration of the speed at which we can query any given Chamber populated with lots of Didgets.

You can watch the video at: www.screenr.com/XV17

The Didget Browser was running on a Windows 7 PC and was created using Qt, the open-source, cross-platform GUI library. It can easily be ported to the Linux and Mac OS X operating systems. It sits on top of our Didget Management System and uses its API to perform much of its work.

The PC I used was a 3-year-old Gateway machine I bought at Costco for $500. It has an Intel Core 2 processor, 4 GB of DDR2 RAM, and a 750 GB HDD. This was not a high-end box even when I bought it, let alone now. If you are impressed with the speed at which we are able to perform queries and display large lists of tag values, please keep in mind it is NOT due to speedy hardware.

Whenever we perform a query, we look at the metadata record for each Didget within the Chamber. This would be analogous to checking each inode in an ext3 file system when querying files. The same is true whenever we refresh the contents of the Status Tab: we look at each and every Didget metadata record and tally up a total for each of the categories displayed.
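
The Status Tab tally described above amounts to a single pass over every record with a running count per category. A minimal Python sketch, assuming each metadata record is a dictionary with a `"category"` field (both the record shape and the function name are hypothetical):

```python
from collections import Counter


def tally_categories(records):
    """Scan every metadata record once and count how many fall in each category."""
    counts = Counter()
    for record in records:
        counts[record["category"]] += 1
    return counts


# Made-up records for illustration only.
records = [
    {"name": "a.jpg", "category": "photo"},
    {"name": "b.jpg", "category": "photo"},
    {"name": "c.doc", "category": "document"},
]

totals = tally_categories(records)
```

There is no separate index to build or keep in sync in this approach; the cost is that every refresh touches every record.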

It is important to know that we do not have a separate database that we are querying like indexing services such as Apple's Spotlight or Microsoft's Windows Search do. Such databases can take hours to create and can easily become out of sync with the file metadata that they index.

Some of the query operations that we perform could be accomplished on a regular file system using command line utilities. For example, I can get a list of all .JPG files on my file system by entering the command:

 C:\> dir *.jpg /s

The main difference is speed. On that same machine, with 500,000 files on the volume, this command takes nearly 3 minutes to complete. If my NTFS volume had 3 million files on it, the same command would take approximately 20 minutes. Using the Didget Browser, we are able to accomplish the same task in under ONE second. In fact, we can get a list of all the JPG Photo Didgets in under one second even if there are 25 million of them.
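
For comparison, here is roughly what a "regular program calling the file API" has to do to answer the same question. This Python sketch walks a directory tree with the standard `os.walk` function and collects every file whose name ends in a given extension; the function name `find_by_extension` is my own, and the matching is only an approximation of what `dir *.jpg /s` does.

```python
import os


def find_by_extension(root, extension):
    """Recursively list files under root whose names end with extension."""
    extension = extension.lower()
    matches = []
    # os.walk visits every directory beneath root, one directory at a time.
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            if filename.lower().endswith(extension):
                matches.append(os.path.join(dirpath, filename))
    return matches
```

Calling `find_by_extension("C:\\", ".jpg")` would visit every directory on the volume, which is why the time grows with the total number of files rather than with the number of matches.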

The difference in speed between our system and conventional file systems is even more pronounced for more complicated queries. Try to find all .JPG photos in a file system that have two extended attributes attached with the key:value pairs Place=Hawaii and Event=Vacation. We can find all the Didgets with those two tags attached in just a couple of seconds. File systems (the ones that even support extended attributes) will require a very long time.