Saturday, January 19, 2013

Silos of Information

As I stated earlier, the Didget Management System was designed to offer an alternative to conventional data management systems, which tend to manage only a subset of all data and to build walls around any extra metadata they generate. With such systems, a given set of a few million pieces of data (files and/or database rows) will often end up fragmented across several "silos of information".

To illustrate this with a real-world example, consider the following:

A user has a 2TB hard drive nearly full of data. Since the average size of each piece of data on the drive is about 1 million bytes, this represents nearly 2 million different files. Out of all that data, there are three important pieces of information that come from the user's friend Bob. Bob has sent the user an email; the user has taken a picture of Bob and transferred that picture from his camera to the hard drive; and Bob has also authored a document that the user downloaded from Bob's web site.

Microsoft Outlook transferred the email from the email server to the user's computer, which runs Windows 7, and stored it in a .pst file somewhere in the file system hierarchy. The picture was imported into Google's Picasa photo manager and was "tagged" with "Bob" using its facial recognition feature. This tag was embedded within the .jpg file using the EXIF space reserved for such tags. The document was also stored somewhere on the file system, and the user set an extended attribute of "Author=Bob" on the document file using a special document manager program.

Now the user wants to do a general search for everything on his drive that has to do with his friend Bob and hopes to come up with all three pieces of information. He wants a program that will comb through all his data and find those pieces.

1) The program will prompt the user for what to base the search on. The user just types "Bob" since there is no standard schema that helps identify "Bob" as a person's first name.

2) The program must now be able to do a complete folder hierarchy traversal, looking for any instances of the string "Bob". It might find a bunch of files that contain "Bob" in their file name, like "bobset.exe" or "stuff.bob". It would need to show those to the user since it doesn't know if they might be relevant.

3) For every file the program searches, it would need to peek at its extended attributes to see if any contained the word "Bob". Just as with matching file names, it would need to display a file called "Photo1.jpg" that had an extended attribute "Activity=Bobbing for apples". For every .jpg file, it would need to know how to open and search the EXIF data portions, also looking for any tags that might have "Bob" in them.

4) It would need to be able to parse through any .pst files by following the Microsoft specification, looking for any emails that might come from, or be about, Bob.

Each of these things represents a different "silo" of information that would need to be accessed and understood by the program doing the search. The .pst database file, the file extended attributes, and the EXIF information inside the .jpg file format are examples of these silos. There are many other silos, like .db files, HTML or XML files, registry files, INI files, and .doc files. Accessing each of them requires knowledge about their format and the rules for parsing them.
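To make the cost of this approach concrete, here is a rough sketch in Python of what such a search program has to do. The names `SILO_READERS`, `read_text`, and `brute_force_search` are made up for this illustration, and only plain-text formats are handled; a real tool would need a separate parser for each of the other silos (.pst, EXIF, extended attributes, and so on).

```python
import os

def read_text(path):
    """Return the file's text as a list of strings; stands in for a real format parser."""
    with open(path, "r", encoding="utf-8", errors="ignore") as f:
        return [f.read()]

# Hypothetical registry: one reader per silo, keyed by file extension.
# Supporting .pst, .jpg (EXIF), .doc, etc. would mean adding one
# format-specific parser here for every such silo.
SILO_READERS = {
    ".txt": read_text,
    ".ini": read_text,
    ".html": read_text,
}

def brute_force_search(root, term):
    """Walk every folder under root and return (path, where) pairs that mention term."""
    term = term.lower()
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Silo 1: the file name itself. A substring match also
            # catches irrelevant names such as "bobset.exe".
            if term in name.lower():
                hits.append((path, "file name"))
            # Other silos: file content, but only for formats we can read.
            reader = SILO_READERS.get(os.path.splitext(name)[1].lower())
            if reader is None:
                continue
            try:
                strings = reader(path)
            except OSError:
                continue  # unreadable file: skip it
            if any(term in s.lower() for s in strings):
                hits.append((path, "content"))
    return sorted(hits)
```

Calling `brute_force_search("C:\\Users", "Bob")` would touch every file under that folder, and anything stored in a format missing from the registry is simply invisible to the search.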

If, instead of using those systems to store data, Microsoft's Outlook, Google's Picasa, and the document manager were all built on top of the Didget Management System, then things could be much simpler. The email could be stored in a "Message Didget", the picture could be stored in a "Photo Didget", and the document could be stored in a "Document Didget". Each of these Didgets would have a tag ".person.FirstName = Bob" attached to it.

Now any application could look for stuff about Bob without missing anything or getting all kinds of unintended results. It would also find all three items in less than 1 second, instead of needing the painfully slow search in the previous example.
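The idea can be sketched with a toy in-memory store in Python. This is only an illustration: the `DidgetStore` class and its `add` and `find` methods are hypothetical, and the dictionary index is an assumption made for the example, not a description of how the Didget Management System actually stores or indexes tags.

```python
from collections import defaultdict

class DidgetStore:
    """Toy stand-in for a tag-indexed object store (hypothetical API)."""

    def __init__(self):
        self._didgets = {}              # didget id -> (kind, name)
        self._index = defaultdict(set)  # (tag, value) -> set of didget ids

    def add(self, kind, name, tags):
        """Store a didget and index each of its tags."""
        didget_id = len(self._didgets) + 1
        self._didgets[didget_id] = (kind, name)
        for tag, value in tags.items():
            self._index[(tag, value)].add(didget_id)
        return didget_id

    def find(self, tag, value):
        """Exact-match lookup on one tag: no folder traversal, no format parsing."""
        ids = self._index.get((tag, value), set())
        return sorted(self._didgets[i] for i in ids)

store = DidgetStore()
store.add("Message", "Email from Bob", {".person.FirstName": "Bob"})
store.add("Photo", "Photo1.jpg", {".person.FirstName": "Bob"})
store.add("Document", "Bob's document", {".person.FirstName": "Bob"})
store.add("Photo", "Photo2.jpg", {".activity": "Bobbing for apples"})

results = store.find(".person.FirstName", "Bob")
for kind, name in results:
    print(kind, name)
```

Because the query matches a specific tag and value rather than a substring, the "Bobbing for apples" photo is never returned, and the three items about Bob come back no matter which application created them.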
