Monday, April 20, 2015

Object Storage

It is no secret that the amount of digital data being collected and stored has exploded over the past decade. With high-speed networks, more portable devices, and the coming wave of the "Internet of Things" (IoT), that growth is unlikely to slow down anytime soon.

Fortunately, there are lots of cheap, high-capacity options for storing all that data. Hard drives, flash memory devices, and even optical and tape media offer capacities unheard of in the past. There are numerous ways to bundle all these devices using hardware and software to create huge virtual containers.

Unfortunately, capacity and speed are not increasing along the same curve. It is simply easier and cheaper to double the capacity of any given device than it is to double its speed. The same is true for the newest flash devices like SSDs: a 1 TB flash drive is not twice as fast as a 500 GB flash drive.

This means it will take longer to read all the data from a device shipped this year than it did from a device shipped last year, and longer still on next year's devices. That makes it more important than ever to improve, through software, the way data is actually stored and managed on those devices.

Compression techniques, data de-duping capabilities, and distributed storage solutions can make it easier to handle large amounts of data, but more needs to be done. An effective object manager is needed to handle huge numbers of objects using minimal resources.

To give you an example of the problem, consider the default file system on Windows machines: NTFS. This file system stores a 4096 byte file record for every file in the volume. That might not seem like a lot of space until you get large numbers of files. If you have 10 million files, you must read in and cache 40 GB worth of file metadata. If you have 100 million files, the amount is 400 GB. For a billion files, it is a whopping 4 TB. Bear in mind that this figure covers only the file record table; all the file names are stored separately in a directory structure.

When large data sets are used, it is very common for any given data operation to only affect a small portion of the overall data set. If you do daily or weekly backups, it is common for only 1% or less of the data to change between those backups. The same is true for synchronizing data sets between devices. Queries typically only need to examine a small portion of the data as well.

Current systems are very inefficient at determining the small subset of data needed for an operation, so too much data must be read and processed. For example, suppose two separate storage devices each hold a copy of the same 100 TB data set and are synchronized once a day. The data on each device changes independently between synchronization operations, but it is rare for more than a few GB to change each day. With current systems, it might take a few hours and several TB of metadata reads just to determine the small amount of data that must be transferred to bring the two devices back into perfect synchronization.

What is needed is a system that can quickly read a small amount of metadata from the device and find all the needles in the haystack in record time. This is what the Didget Management System is designed to do. It can store 100 million Didgets in a single container and read in the entire metadata set in under 21 seconds from a cold boot. It only needs to read in and cache 6.4 GB of data to do that, and a consumer-grade SSD has all the speed needed to achieve that query time.
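As a sanity check on those numbers, here is a quick back-of-the-envelope calculation. The record sizes (4096 bytes per NTFS file record, 64 bytes of metadata per Didget implied by 6.4 GB for 100 million) come from the discussion above; the 320 MB/s throughput is just an illustrative figure for a consumer-grade SSD, not a measurement:

```python
# Compare total metadata footprint and sequential scan time for
# 100 million objects at two different per-object record sizes.

def metadata_cost(num_objects, record_bytes, read_mb_per_s=320):
    """Return (total metadata bytes, seconds to scan it sequentially)."""
    total_bytes = num_objects * record_bytes
    seconds = total_bytes / (read_mb_per_s * 1_000_000)
    return total_bytes, seconds

objects = 100_000_000
for label, rec in [("4096-byte file records", 4096),
                   ("64-byte Didget metadata", 64)]:
    total, secs = metadata_cost(objects, rec)
    print(f"{label}: {total / 1e9:.1f} GB, ~{secs:.0f} s to scan")
```

At these illustrative rates, the 64-byte case lands right around the 20-second mark described above, while 4096-byte records would take over twenty minutes just to scan.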


Friday, April 3, 2015

Update

I have been busy (whenever I get a spare minute) updating the Didget management software. It now has a number of new features and has also received some significant speed improvements.

1) I have converted all the internal container structures (bitmaps, tables, fragment lists) to be Didgets. This means I can leverage all the Didget management code for these structures. I can also expose these kinds of Didgets for external use. For example, an application can now create a "Bitmap Didget", store a few billion bits in it, and use its API to set and clear ranges of bits. I use them internally to keep track of all the free and used blocks within the container, but others could use them for a number of other purposes.

2) I have built the library and the browser interface using the latest tools. It is now a 64 bit application that will run on Windows 10. I built it using Visual Studio 2013 and the Qt 5.4 libraries. Some speed improvements have come from better code in these products. Now that it is 64 bit, I can allocate more than 4 GB of RAM and can do some benchmarks using extremely large data sets.

3) I have multi-threaded the underlying Didget manager code. This allows operations on large data sets to be split into several pieces and distributed across multiple processors on most modern CPUs. I have seen significant performance boosts here as well.
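To make item 1 a bit more concrete, here is a rough sketch of what a range-oriented bitmap looks like. The class and method names are my own invention for illustration, not the actual Bitmap Didget API, and a real implementation would set whole words at a time rather than looping bit by bit:

```python
class BitmapSketch:
    """Illustrative bitmap with range operations, similar in spirit
    to the Bitmap Didget described above (names are hypothetical)."""

    def __init__(self, num_bits):
        self.num_bits = num_bits
        self.words = bytearray((num_bits + 7) // 8)

    def set_range(self, start, count):
        for bit in range(start, start + count):
            self.words[bit >> 3] |= 1 << (bit & 7)

    def clear_range(self, start, count):
        for bit in range(start, start + count):
            self.words[bit >> 3] &= ~(1 << (bit & 7))

    def test(self, bit):
        return bool(self.words[bit >> 3] & (1 << (bit & 7)))

# Track free/used blocks: mark blocks 100-199 used, then free 150-159.
bm = BitmapSketch(1024)
bm.set_range(100, 100)
bm.clear_range(150, 10)
print(bm.test(120), bm.test(155))  # True False
```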

To give you some idea of the speed improvements, I will now post some results of tests I have been running. I hope to post a video soon showing these tests in action.

These tests were run on my newer (1.5-year-old) desktop machine. It has an Intel i7-3770 processor (fairly quick, but by no means the fastest out there), 16 GB of RAM, and a 64 GB SSD.

I can now create over 100 million Didgets in a single container. Queries that do not inspect tags (e.g. find all JPEG photos) complete in under 1 second even if the query matches 20 million Didgets. Queries that inspect tags (e.g. find all photos where place.State = "California") are also much faster but may take a few seconds to sort them all out when there are 20 million of them.
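The divide-and-scan pattern behind those query times can be sketched like this. It is only an illustration of the partitioning idea from the multi-threading work described above; the real manager is native code where threads scan in true parallel (in pure Python, the GIL limits how much an actual speedup you would see):

```python
from concurrent.futures import ThreadPoolExecutor
import os

def count_matches(chunk, predicate):
    # Each worker scans its own slice of the data independently.
    return sum(1 for item in chunk if predicate(item))

def parallel_query(items, predicate, workers=None):
    """Split items into per-worker chunks and combine partial counts."""
    if not items:
        return 0
    workers = workers or os.cpu_count() or 4
    size = (len(items) + workers - 1) // workers
    chunks = [items[i:i + size] for i in range(0, len(items), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(lambda c: count_matches(c, predicate), chunks))

records = list(range(1_000_000))
print(parallel_query(records, lambda r: r % 7 == 0))  # multiples of 7
```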

I can now import a 5 million row, 10 column table from a .CSV file in 4 minutes, 45 seconds. This includes opening the file and reading every row, parsing the ten values on each row, and inserting each value into its corresponding key-value store (a Tag Didget) while de-duping any duplicate values. That time also includes flushing the table to disk.

I can then perform queries against that table in less than 1 second as well. "SELECT * FROM table WHERE address LIKE '1%'" (i.e. find every person whose address starts with the number 1) finds all of them (about 500,000 rows) in under a second.
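The import-and-query flow can be modeled in a few lines. This is a deliberately simplified sketch with invented names; the real Tag Didget stores are far more compact and are flushed to disk:

```python
import csv, io

def import_csv(text):
    """Build one de-duplicated key/value store per column:
    value -> set of row numbers (a stand-in for a Tag Didget)."""
    reader = csv.DictReader(io.StringIO(text))
    columns = {name: {} for name in reader.fieldnames}
    for row_num, row in enumerate(reader):
        for name, value in row.items():
            columns[name].setdefault(value, set()).add(row_num)
    return columns

def query_prefix(columns, column, prefix):
    """Rows where `column` starts with `prefix` (LIKE 'prefix%')."""
    rows = set()
    for value, refs in columns[column].items():
        if value.startswith(prefix):
            rows |= refs
    return sorted(rows)

data = "name,address\nFred,123 Oak St\nMary,77 Elm St\nFred,19 Pine Rd\n"
cols = import_csv(data)
print(query_prefix(cols, "address", "1"))  # rows 0 and 2
```

Note that the LIKE query never touches the rows themselves; it scans the distinct values in one column store and collects their row references, which is why matching half a million rows can still be fast.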

My original goal was to be able to store 100 million Didgets in a single container and be able to find anything (and everything) that matched any criteria in less than 10 seconds. With the latest code, I have been able to exceed that goal by quite a large margin.

Sunday, July 6, 2014

Quick links to all the videos so far...

Database 1
Database 2

Didgets 1
Didgets 2
Didgets 3

Relational Database Tables Using Didgets

It has been nearly a year since I last posted, but I haven't been lazy. I have been busy in my spare time improving the performance of the database operations and adding lots of new features. The relational database table operations are significantly faster now (as are the tag look-ups for Didgets).

A couple of posts ago, in "Another Piece to the Puzzle", I reported that a query against a 1 million row table with 6 columns took 25 seconds. Now I am able to query a 10 column table (also with a million rows) in under a single second. The time it takes to import a .CSV file has also been greatly reduced.

I can now import a 5 million row, 10 column table in 1 minute and 6 seconds. Most queries against that table now take at most about 2 seconds.

See a video demonstration of queries against the 1 million row table at http://screenr.com/xKmN

Since every column in each table is stored within a pair of Tag Didgets, each column becomes a separate key/value store. All values are de-duped, so if the column "First Name" has 10,000 rows with the value "Fred", the actual string is only stored once with 10,000 references to it. The user can select the Tag Didget containing all the values and view them along with the reference count for each.

Each row in a database table is just the set of values from the separate key/value stores that all map to the same key (i.e. the row number).
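One way to picture this arrangement: each column store maps a de-duplicated value to the set of rows that reference it, so the reference counts fall out for free, and a row is reassembled by probing each store with the row number. The layout below is illustrative, not the actual on-disk format:

```python
# Each column is a value -> row-set mapping; "Fred" is stored once
# no matter how many rows reference it (illustrative, invented layout).
first_name = {"Fred": {0, 2, 3}, "Mary": {1}}
city = {"Boise": {0, 1}, "Provo": {2, 3}}

def ref_counts(column):
    """Reference count per de-duplicated value."""
    return {value: len(rows) for value, rows in column.items()}

def get_row(row_num, columns):
    """Reassemble a row: the value in each column mapping to row_num."""
    return tuple(next(v for v, rows in col.items() if row_num in rows)
                 for col in columns)

print(ref_counts(first_name))          # {'Fred': 3, 'Mary': 1}
print(get_row(2, [first_name, city]))  # ('Fred', 'Provo')
```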

Another video demonstration of ways to view all the values in each key/value store is at http://screenr.com/AKmN

We are working on several additional features, but this should give you a taste of how fast our database inserts and queries are, so you can compare them against existing database managers.

Remember, we are not running on top of MySQL, SQLite, PostgreSQL, or any other commercial or open source database manager. This is all running just on the Didget Management System. You get all this functionality without needing to install a separate RDBMS.

Saturday, July 13, 2013

Data Managers

Over the years, a number of systems have been created to help users manage their data. I call these systems "Data Managers". There are two types: primary data managers and secondary data managers.

Primary data managers are very general-purpose in nature and are widely adopted in the computing world. File systems, databases, and web servers fall into this category. More recent members of this category include distributed file systems like Hadoop and cloud offerings like Amazon S3. These newer systems are gaining greater acceptance as "Big Data" becomes more pervasive and as users demand more mobile access to all their data.

Secondary data managers are generally more specialized in the types of data they manage. They almost always utilize the services of a primary data manager to store their underlying data. Examples include Apple's iTunes for managing music or Google's Picasa for managing photos. They typically keep most of their unstructured data as files in a file system and create a proprietary database for storing extra metadata. These data managers may also integrate with cloud services to give the user a virtual view of their data even when it is spread across several systems. Unfortunately, secondary data managers are nearly always in danger of interference from other programs and must rely upon the security measures offered by the primary data manager. If another application deletes, moves, or renames one or more of the files it manages, the secondary data manager can have trouble reconciling those changes. If another program deletes one or more of its core metadata files (i.e. its database), the secondary data manager can fail completely.

The Didget Management System is a primary data manager. It not only provides functionality that previous data managers lack; it was also designed to supplant them. This is very different from other primary data managers. Databases, for example, were designed to manage structured data in ways that file systems never could, but were never designed to handle unstructured data well enough to make file systems unnecessary. A consequence of that strategy is that as each new primary data manager entered the market, we ended up with yet another "silo" into which a portion of our data is put.

That is why I designed the Didget Management System to manage both structured and unstructured data well. It is designed to manage that data in both simple configurations and in distributed cluster environments. When the amount of data grows from a few thousand pieces of information to billions of pieces utilizing petabytes of storage, there will not be a costly transition point where all the existing data must be migrated to an entirely new system. If we are successful, new data will not only be created as Didgets instead of as files or traditional database tables, but all the old data will be converted to Didgets as well. Our goal is to replace those other primary data managers completely.

In order to realize that goal, the Didget Management System has to perform all the critical data management functions of the system it is replacing in addition to its new feature set. It cannot be just 5%, 10%, or even 50% better, either. It has to be at least TWICE as good as the old system. When I designed it, that was my minimum threshold. If I couldn't make it dramatically better, it would not gain widespread adoption and would likely end up a very narrow niche product, not worth the effort.

Fortunately, the design has proven to work so well that I think we have not only met that 2x threshold but greatly exceeded it. I would not be surprised if, once all the features are fully implemented, we have a system that is 10x better than those other systems. That does not mean we will do every single feature 10x better (for example, we will not be able to read a Didget from disk ten times faster than a file system can read a file), but rather that the system will be that much better overall when performance, feature set, ease of use, security, and flexibility are all considered.

Tuesday, April 30, 2013

Another Piece to the Puzzle

Didgets provide new and innovative ways to store, search, and organize unstructured data that would normally be stored in files. They have also proven useful for storing structured data that is well suited for entry into a NoSQL database. A missing piece was to use them to store structured data that has traditionally been stored within tables in a regular Relational Database Management System (RDBMS) and accessed via the Structured Query Language (SQL).

Since our tags had been effective in implementing NoSQL columns in a sparse table, we decided to use them to try to implement a regular relational table. While I had little hope that it could match the performance of a finely tuned RDBMS like MySQL, I at least wanted something acceptable that might provide a few unique features or an easier way of managing the data.

To my surprise, it not only matched the performance of MySQL in preliminary tests; it was 17% faster on many queries. I created a table with six columns and inserted 1 million rows of random data to test the performance of each system. Using the MyISAM storage engine under MySQL, a "Select *" query took 30 seconds to execute on my old test machine. The same query using the Didget Management System took only 25 seconds to complete.

If I switched the storage engine under MySQL to InnoDB, the same query took 1 minute and 20 seconds. I was surprised that the InnoDB engine, with its transactional support, was so much slower than MyISAM for this simple query. I have yet to implement transactions using Didgets, so I could not run a comparison test, but I am confident that our transaction overhead will not be as dramatic as it was under MySQL.

I am also confident that the Didget Management System will provide a very easy mechanism to create, query, and share database tables. It will also be much easier to administer since we can provide lightning fast queries without having to index columns or do complicated joins across multiple tables.

In essence, the Didget Management System is a radically different architecture from the traditional RDBMS way of storing structured data in multiple tables. Since development of the database features is still in its infancy, there is much work yet to be done, but I am confident this will become a major feature of our new general-purpose data management system.

Stay tuned for further developments....

Sunday, February 24, 2013

Connecting the Dots

If you look up at the sky on a moonless night, far away from any city lights, you will see many thousands of individual stars. An asterism is a group of those stars that can be connected together in our minds to form a stick figure. Constellations are ancient asterisms that gained popular names like Virgo or Ursa Major. Other asterisms that make up just a portion of a constellation have also been given popular names, like "The Big Dipper" or "Orion's Belt". People who star gaze, whether they find some of these popular asterisms or form their own, are looking for "patterns" among the thousands of stars.

Searching for patterns is also common when we deal with all the data that exists as individual files or database records on our hard drives, flash memory cards, or DVDs. Sometimes these patterns are already established for us. A popular software package may consist of a dozen separate executable files along with their configuration files and documentation. They are often copied into one or two folders or directories during an installation process to keep them together. Sometimes installation programs copy them into common folders like /usr/bin, where they get mixed in with other programs and it is not so easy to sort out which files belong to which programs.

But even files that seem to be completely independent of other kinds of data (e.g. a photo or a song) can often be grouped together with other files to form ad hoc groups (e.g. a photo album or a music album). We are constantly trying to make connections between different data points to form new and interesting patterns. Facebook and other social media sites provide mechanisms to form some of these patterns. A user posts messages, pictures, documents, videos, and other personal information in order to tell a story about their life, their interests, and their friends. It is the connections between lots of individual pieces of data that can lead to new interactions and help us make decisions.

The current trend in "Big Data" and various forms of analytics is all about finding patterns in large amounts of data to drive business decisions. Analyze a million customer orders to look for patterns of shopping behaviour when it is cold outside in order to figure out what items to put on sale when the next big storm hits. Analyze emails sent by everyone over 65 years old in Florida to figure out what political messages will most likely sway the most voters.

The trick to establishing meaningful patterns among millions or billions of individual data points lies in the ability to quickly analyze each point and determine if it has a significant connection to another point. The system that is used to store the information is a critical component to being able to quickly check lots of data points for a certain condition in order to sift the wheat from the chaff. The system must not only be able to match things like strings or numbers, but it must provide some kind of context in order to make more meaningful connections.

For example, if someone wanted to analyze a group of messages to gain intelligence about military hardware, the word "Tank" would be a meaningful keyword to search for. However, such a "brute force" search might turn up every message that deals with water tanks, gas tanks, and R&B music. It is much more meaningful if the search was conducted using "Vehicle=Tank" instead.

The Didget Management System was designed to not only manage large numbers of data points, but to also aid in making connections between points in order to find new patterns. By attaching many searchable tags to any given piece of data and by providing context for every single tag, the system makes it easy to find all the data that share a common attribute. It can also rank various connections between any two points based on the number of attributes they share in order to give hints about more relevant connections.
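Here is a toy illustration of the difference between a brute-force keyword search and a search that uses tags with context. The tag names and messages are invented for the example:

```python
messages = [
    {"text": "The water tank needs refilling", "tags": {"Equipment": "Tank"}},
    {"text": "Armored column spotted: one tank", "tags": {"Vehicle": "Tank"}},
    {"text": "New Tank album drops Friday", "tags": {"Artist": "Tank"}},
]

def keyword_search(msgs, word):
    # Brute force: match the raw word anywhere in the text.
    return [m["text"] for m in msgs if word.lower() in m["text"].lower()]

def tag_search(msgs, key, value):
    # Context-aware: match only where the tag's key AND value agree.
    return [m["text"] for m in msgs if m["tags"].get(key) == value]

print(len(keyword_search(messages, "tank")))   # 3 hits, mostly noise
print(tag_search(messages, "Vehicle", "Tank")) # only the military one
```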

Big Data Analytics is all about finding hidden patterns and unknown correlations in large amounts of data. This means that specialized queries must be conducted against all that data to try to find meaningful patterns. When the data is created and stored, the nature of such queries is largely unknown. In other words, the data must be stored in such a way as to support as wide a variety of potential queries as possible.

The speed at which a query can execute is a major factor in finding that "needle in a haystack". If a big data set consists of 10 billion data points and every query takes several hours to complete, then it becomes very hard to conduct lots of different types of queries, looking for a pattern. If, on the other hand, such a query can execute in a minute or less, then it becomes practical to conduct a wide variety of queries hoping that a meaningful pattern just "pops out in front of you".

Several other "big data" projects like Hadoop, MapReduce, HBase, Cassandra, and MongoDB have been structured to be spread across a cluster of nodes so that the processing of data can occur in parallel. This can greatly reduce the time necessary to perform a query. Such systems can be very complex to set up and administer, however. Our system has been designed to greatly simplify such configurations.

But finding patterns should not be exclusive to large companies with big data sets. Individual users could greatly benefit from finding meaningful patterns among a few million pieces of information. If I got a message from Mary about her vacation in Hawaii, it would be helpful if there were an "about" button next to her name that, when pushed, would bring up a list of every message, photo, and document that she had sent me or that was about her. Likewise, it would be helpful if the message itself had hyperlinks that, when clicked, would bring up my own photos of Hawaii or information about scuba diving or whale watching. These links could be generated automatically by the system based on tags already present on other Didgets.