Sunday, May 20, 2012

Past Solutions

The Didget Manager was created to solve a number of data management problems. I will attempt to state the biggest problem as best I can and then illustrate a number of different solutions that have either been attempted in the past, or are currently in use.

Problem: How do you properly manage many millions of pieces of structured and unstructured data especially when they are spread across several storage devices?

It is very easy for an individual or business to buy sufficient storage capacity to hold tens of millions of pieces of data. A 3 TB hard disk drive can be purchased for about $150. Small RAID storage  devices that can hold 12 TB of data can be built for less than $1000. Flash based USB drives or Solid State Drives cost about $1 per GB.

In my home, I have counted over 20 different storage devices that have a capacity of at least 8 GB. Cameras, computers, PVRs, phones, iPads, and video cameras all come with built-in storage. In addition I have several external storage devices like thumb drives, backup drives, and a NAS.

They are all filling up with data. Photos, home video, documents, downloaded software, music, and other stuff seem to slowly fill any available storage. Portable devices like cameras and phones are often synchronized with my laptop or desktop computer. It can be difficult to tell sometimes if I have only one copy of a given photo or if I have dozens of copies spread around all my storage devices.

So what falls under the definition of "Manage" when it comes to data?

1) Backup. Naturally, we want to insure that we have proper backups of all important data. The backup can be located locally in case a storage device just fails or it can be pushed to a remote site to help insure a successful disaster recovery procedure.

2) Replication. Backup is a form of replication, but it also includes the placement of data on several different devices to enable convenient access. We always want to be able to access our important data no matter which device we have with us at the time.

3) Search. Even if we have a storage device with us, it doesn't help us if we can't find the document we are looking for among several million others. We want to be able to search for it based on its name, some attributes it may have, or by a set of keywords.

4) Protection. We want to be able to prevent important information from being altered or destroyed by accident or by a malicious program.

5) Data Sets. We want to be able to organize data by placing various pieces of it into different sets. A set can be an album full of pictures, a play list with dozens of songs, or a software package containing a hundred different programs, libraries, or configuration settings. It would be very helpful if every piece of data could be a member of more than just one data set.

6) Synchronization. If we have more than one copy of something, it would be nice to be able to have changes made to one of the copies be synchronized to all the copies. If a new element is added to a replicated set, it would be helpful to also add it to all the copies of that set.

7) Security. We want to make sure a piece of data can be only accessed by those with permission. We want to make sure that security is not compromised just because the data is moved or copied to another location.

8) Inventory. We want to get an accurate accounting of all our pieces of data. We want to know if any of our data sets are incomplete. We want to know how many documents, photos, or videos we have. We want to know if there are any security holes. We want to know what has changed and what devices have not yet been synchronized, backed up, or replicated.

9) Completeness. We want to make sure when a piece of data is copied from one place to another that it is a complete copy. The data stream as well as all metadata including things like extended attributes need to be copied to assure that the clone is an exact duplicate of the original.

What attempts have been made so far to accomplish some of these data management tasks and what are their limitations?

1) A well-organized file directory tree. This is where applications and users must adhere to a clearly defined plan for grouping all files into appropriate folders or directories. All the operating system files go in C:\Windows; all the user utility programs go in /bin or /usr/bin; or all the user documents go into the C:\users or /home areas. This approach can work fairly well when there are only a few thousand files to deal with. Unfortunately, it requires a lot of work to keep all the files in their proper directory. It also makes it difficult to decide where to put that photo you just downloaded - in the C:\Photos directory or in the C:\Downloads directory.

2) Lots of databases. Since most databases were not built to manage lots of unstructured data like photos, video, and documents (Blobs in database speak), databases are generally used to track and manage files instead. Windows Search, OS X Spotlight, Google's Picasa, iTunes, and iMovie are examples of programs that store file information within a database. When a file is created, its full path along with additional metadata are stored in the database. This allows the user to keep track of millions of files and do very fast queries based on things like keywords or tags. Incremental backups, replication and synchronization functions, and data sets can be tracked using these databases as well. Unfortunately, the databases are completely separate from the file system. It is possible for users or applications to create new files and delete or modify existing files without the database being updated as well. Even if there is a background monitoring tool that has a file system filter driver informing it of every change, it is possible to make changes while that tool is not running. Even if every file system change is accompanied by an accurate update to one or more databases, it can be difficult to manage lots of files when there are dozens of separate databases, each keeping track of just a subset of the whole file system. A separate application is probably managing each database independently and those applications seldom talk to each other.

3) Embedded Data. Some file formats like JPEG allow metadata to be embedded within a file's data stream without disrupting its normal processing. Things like the camera info, data and time the picture was taken, and the GPS coordinates of the location where the picture was taken can be stored within the Exif data portion of .JPG files. Unfortunately, this data is not always accessible or searchable by all applications and some applications can alter the metadata unintentionally. Few data formats allow this behavior so it has limited application.

4) Extended Attributes. This external metadata can be attached to any file within a file system that supports them. Unfortunately, they are generally not searchable and are not universally available. Only some file systems support them and not all supporting file systems have the same rules for implementation. When an application copies a file from one file system to another, the extended attributes can be lost, altered, or just stripped off because the application forgot to copy them.

No comments:

Post a Comment