Monday, April 20, 2015

Object Storage

It is no secret that the amount of digital data being collected and stored has exploded over the past decade. With high speed networks, more portable devices, and the coming wave of "the Internet of Things (IoT)", it is unlikely to slow down anytime soon.

Fortunately, there are lots of cheap, high-capacity options for storing all that data. Hard drives, flash memory devices, and even optical and tape offer capacities unheard of in the past. There are numerous ways to bundle all these devices using hardware and software to create huge virtual containers.

Unfortunately, capacity increases and speed increases are not the same curve on the graph. It is simply easier and cheaper to double the capacity of any given device than it is to double its speed. The same is true for the newest flash devices like SSDs. A 1 TB flash drive is not twice as fast as a 500 GB flash drive.

This means it will take longer to read all the data from a device shipped this year than it did to read all the data from a device shipped last year and it will take even longer to read it all in on next year's devices. This makes it more important than ever to improve the way the actual data is stored and managed on those devices through software.

Compression techniques, data de-duping capabilities, and distributed storage solutions can make it easier to handle large amounts of data, but more needs to be done. An effective object manager is needed to handle huge numbers of objects using minimal resources.

To give you an example of the problem, let's consider the default file systems on Windows machines - NTFS. This file system stores a 4096 byte file record for every file in the volume. That might not seem like a lot of space until you get large numbers of files. If you have 10 million files, then you must read in and cache 40 GB worth of file metadata. If you have 100 million files, the amount is 400 GB. For a billion files, it is a whopping 4 TB. Bear in mind that this figure is only for the file record table. All the file names are stored separately in a directory structure.

When large data sets are used, it is very common for any given data operation to only affect a small portion of the overall data set. If you do daily or weekly backups, it is common for only 1% or less of the data to change between those backups. The same is true for synchronizing data sets between devices. Queries typically only need to examine a small portion of the data as well.

Current systems are very inefficient in determining the small subset of data that is needed for the operation; thus too much data must be read and processed. For example, take two separate storage devices that each have a copy of the same 100 TB data set and that they are synchronized once a day. The data on each device changes independently between synchronization operations, but it is rare for more than a few GB to change each day. Using current systems, it might take a few hours and several TB of metadata reads to determine the small amount of data that must be transferred in order to bring the two devices back into perfect synchronization.

What is needed is a system that can quickly read a small amount of metadata from the device and find all the needles in the haystack in record time. This is what the Didget Management System is designed to do. It can store 100 million Didgets in a single container and read in the entire metadata set in under 21 seconds from a cold boot. It only needs to read in and cache 6.4 GB of data to do that and a consumer grade SSD has all the speed it needs to accomplish that query time.


Friday, April 3, 2015

Update

I have been busy (whenever I get a spare minute) updating the Didget management software. It now has a number of new features and has also received some significant speed improvements.

1) I have converted all the internal container structures (bitmaps, tables, fragment lists) to be Didgets. This means I can leverage all the Didget management code for these structures. I can also expose these kinds of Didgets for external use. For example, an application can now create a "Bitmap Didget"; store a few billion bits in it; and utilize its API to set and clear ranges of bits. I use them internally to keep track of all the free and used blocks within the container, but others could use it for a number of other purposes.

2) I have built the library and the browser interface using the latest tools. It is now a 64 bit application that will run on Windows 10. I built it using Visual Studio 2013 and the Qt 5.4 libraries. Some speed improvements have come from better code in these products. Now that it is 64 bit, I can allocate more than 4 GB of RAM and can do some benchmarks using extremely large data sets.

3) I have multi-threaded the underlying Didget manager code. This allows operations on large data sets to be split into several pieces and distributed across multiple processors on most modern CPUs. I have seen significant performance boosts here as well.

To give you some idea of the speed improvements, I will now post some results of tests I have been running. I hope to post a video soon showing these tests in action.

These tests were running on my newer (1.5 year old) desktop machine. It has an Intel i7-3770 processor (fairly quick, but by no means the fastest out there) with 16 GB RAM and a 64 GB SSD.

I can now create over 100 million Didgets in a single container. Queries that do not inspect tags (e.g. find all JPEG photos) complete in under 1 second even if the query matches 20 million Didgets. Queries that inspect tags (e.g. find all photos where place.State = "California") are also much faster but may take a few seconds to sort them all out when there are 20 million of them.

I can now import a 5 million row, 10 column table from a .CSV file in 4 minutes, 45 seconds. This includes opening the file and reading in every row; parsing the ten values on each row; and inserting each value into its corresponding key-value store (a Tag Didget) while de-duping any duplicate values. That time also includes flushing the table to disk.

I can then perform queries against that table in less than 1 second as well. "SELECT * FROM table WHERE address LIKE '1%'" (i.e. find every person with an address that starts with the number 1) will find all of them (about 500,000 rows) in less than a second.

My original goal was to be able to store 100 million Didgets in a single container and be able to find anything (and everything) that matched any criteria in less than 10 seconds. With the latest code, I have been able to exceed that goal by quite a large margin.