Tuesday, May 2, 2017

Progress

I recently left my day job to work on the Didget system full time. My small team has made a number of changes to the project since the last blog post, but progress has been slow when so many other things get in the way. I have now been able to accelerate the development and testing greatly.

Here is a list of some new features that have been added over the past two years.

1) Added ability to do JOIN operations on our database tables.

2) Added direct connectors to external databases so we can create DB tables by querying those databases directly. In earlier versions, the user had to export the data into CSV files from those databases and then import those files into our system.

3) Added ability to transform data within database tables. We can now create new columns that are transformations of other columns. For example we can uppercase, lowercase, substitute, split, combine, convert, truncate, trim, etc.

4) Added a 'Folder' container type so that the data stream for every Didget is stored in a separate file. This helps with testing and lets us do more 'apples to apples' comparisons with file system operations.

5) Added more complex SQL query operations. We can now combine lots of AND and OR operations like "SELECT * FROM myTable WHERE Name LIKE '%son' AND Address ILIKE '%123%' OR ZipCode < 10000;" We made it very easy to create and save these queries.

6) Tuned a bunch of operations to be faster. SQL queries now execute even faster and often require less data to be read from disk. Many of our database queries are now about twice as fast as MySQL or PostgreSQL when performed on the same data set on the same machine. Again, we don't need any indexes to often outperform a fully indexed table in those other database systems.

Tuesday, September 22, 2015

2015 Demo Videos on YouTube

Links to latest demo videos.

10 minute video that shows file stuff as well as database operations:
https://www.youtube.com/watch?v=2uUvGMUyFhY

Shorter, 5 minute video that just shows the database operations, Latest and fastest code in that area is on display here:
https://youtu.be/0X02xpy8ygc

Monday, April 20, 2015

Object Storage

It is no secret that the amount of digital data being collected and stored has exploded over the past decade. With high speed networks, more portable devices, and the coming wave of "the Internet of Things (IoT)", it is unlikely to slow down anytime soon.

Fortunately, there are lots of cheap, high-capacity options for storing all that data. Hard drives, flash memory devices, and even optical and tape offer capacities unheard of in the past. There are numerous ways to bundle all these devices using hardware and software to create huge virtual containers.

Unfortunately, capacity increases and speed increases are not the same curve on the graph. It is simply easier and cheaper to double the capacity of any given device than it is to double its speed. The same is true for the newest flash devices like SSDs. A 1 TB flash drive is not twice as fast as a 500 GB flash drive.

This means it will take longer to read all the data from a device shipped this year than it did to read all the data from a device shipped last year and it will take even longer to read it all in on next year's devices. This makes it more important than ever to improve the way the actual data is stored and managed on those devices through software.

Compression techniques, data de-duping capabilities, and distributed storage solutions can make it easier to handle large amounts of data, but more needs to be done. An effective object manager is needed to handle huge numbers of objects using minimal resources.

To give you an example of the problem, let's consider the default file systems on Windows machines - NTFS. This file system stores a 4096 byte file record for every file in the volume. That might not seem like a lot of space until you get large numbers of files. If you have 10 million files, then you must read in and cache 40 GB worth of file metadata. If you have 100 million files, the amount is 400 GB. For a billion files, it is a whopping 4 TB. Bear in mind that this figure is only for the file record table. All the file names are stored separately in a directory structure.

When large data sets are used, it is very common for any given data operation to only affect a small portion of the overall data set. If you do daily or weekly backups, it is common for only 1% or less of the data to change between those backups. The same is true for synchronizing data sets between devices. Queries typically only need to examine a small portion of the data as well.

Current systems are very inefficient in determining the small subset of data that is needed for the operation; thus too much data must be read and processed. For example, take two separate storage devices that each have a copy of the same 100 TB data set and that they are synchronized once a day. The data on each device changes independently between synchronization operations, but it is rare for more than a few GB to change each day. Using current systems, it might take a few hours and several TB of metadata reads to determine the small amount of data that must be transferred in order to bring the two devices back into perfect synchronization.

What is needed is a system that can quickly read a small amount of metadata from the device and find all the needles in the haystack in record time. This is what the Didget Management System is designed to do. It can store 100 million Didgets in a single container and read in the entire metadata set in under 21 seconds from a cold boot. It only needs to read in and cache 6.4 GB of data to do that and a consumer grade SSD has all the speed it needs to accomplish that query time.


Friday, April 3, 2015

Update

I have been busy (whenever I get a spare minute) updating the Didget management software. It now has a number of new features and has also received some significant speed improvements.

1) I have converted all the internal container structures (bitmaps, tables, fragment lists) to be Didgets. This means I can leverage all the Didget management code for these structures. I can also expose these kinds of Didgets for external use. For example, an application can now create a "Bitmap Didget"; store a few billion bits in it; and utilize its API to set and clear ranges of bits. I use them internally to keep track of all the free and used blocks within the container, but others could use it for a number of other purposes.

2) I have built the library and the browser interface using the latest tools. It is now a 64 bit application that will run on Windows 10. I built it using Visual Studio 2013 and the Qt 5.4 libraries. Some speed improvements have come from better code in these products. Now that it is 64 bit, I can allocate more than 4 GB of RAM and can do some benchmarks using extremely large data sets.

3) I have multi-threaded the underlying Didget manager code. This allows operations on large data sets to be split into several pieces and distributed across multiple processors on most modern CPUs. I have seen significant performance boosts here as well.

To give you some idea of the speed improvements, I will now post some results of tests I have been running. I hope to post a video soon showing these tests in action.

These tests were running on my newer (1.5 year old) desktop machine. It has an Intel i7-3770 processor (fairly quick, but by no means the fastest out there) with 16 GB RAM and a 64 GB SSD.

I can now create over 100 million Didgets in a single container. Queries that do not inspect tags (e.g. find all JPEG photos) complete in under 1 second even if the query matches 20 million Didgets. Queries that inspect tags (e.g. find all photos where place.State = "California") are also much faster but may take a few seconds to sort them all out when there are 20 million of them.

I can now import a 5 million row, 10 column table from a .CSV file in 4 minutes, 45 seconds. This includes opening the file and reading in every row; parsing the ten values on each row; and inserting each value into its corresponding key-value store (a Tag Didget) while de-duping any duplicate values. That time also includes flushing the table to disk.

I can then perform queries against that table in less than 1 second as well. "SELECT * FROM table WHERE address LIKE '1%'" (i.e. find every person with an address that starts with the number 1) will find all of them (about 500,000 rows) in less than a second.

My original goal was to be able to store 100 million Didgets in a single container and be able to find anything (and everything) that matched any criteria in less than 10 seconds. With the latest code, I have been able to exceed that goal by quite a large margin.

Sunday, July 6, 2014

Quick links to all the videos so far...

Database 1
Database 2

Didgets 1
Didgets 2
Didgets 3

Relational Database Tables Using Didgets

It has been nearly a year since I last posted, but I haven't been lazy. I have been busy in my spare time improving the performance of the database operations and adding lots of new features. The relational database table operations are significantly faster now (as are the tag look-ups for Didgets).

A couple of posts ago entitled "Another Piece to the Puzzle" gave the time it used to take to query a 1 million row table with 6 columns at 25 seconds. Now I am able to query a 10 column table (also with a million rows) in under a single second. The time it takes to import a .CSV file has also been greatly reduced.

I can now import a 5 million row, 10 column table in 1 minute and 6 seconds. Most queries against that table now take at most about 2 seconds.

See a video demonstration of queries against the 1 million row table at http://screenr.com/xKmN

Since every column in each table is stored within a pair of Tag Didgets, they each become a separate key/value store. All values are de-duped, so if the column "First Name" has 10,000 rows with the value of "Fred", the actual string value is only stored once with 10,000 references to it. The user can select the Tag Didget containing all the values and view them along with the reference count for each.

Each row in a database table is just a set of values from the separate key/value stores that all map to the same key (e.g. the row number).

Another video demonstration of ways to view all the values in each key/value store is at http://screenr.com/AKmN

We have several additional features that we are working on, but this will give you a taste of how fast our database inserts and queries are so you can compare this against existing database managers.

Remember, we are not running on top of MySQL, SQLite, PostgreSQL, or any other commercial or open source database manager. This is all running just on the Didget Management System. You get all this functionality without needing to install a separate RDBMS.

Saturday, July 13, 2013

Data Managers

Over the years, a number of systems have been created to help users manage their data. I call these systems "Data Managers". There are two types - primary data managers and secondary data managers.

Primary data managers are very general-purpose in nature and are widely adopted in the computing world. File systems, databases, and web servers fall into this category. More recent members of this category include distributed file systems like Hadoop and cloud offerings like Amazon S3. These newer systems are gaining greater acceptance as "Big Data" becomes more pervasive and as users demand more mobile access to all their data.

Secondary data managers are generally more specialized in the types of data they manage. They almost always utilize the services of a primary data manager to store their underlying data. Examples of these kinds of data managers include Apple's iTunes for managing music or Google's Picassa for managing photos. They typically keep most of their unstructured data as files in a file system and create a proprietary database for storing extra metadata. These data managers may also integrate with cloud services to give the user a virtual view of their data even when it may be spread across several systems. Unfortunately, these secondary data managers are nearly always in danger of interference from other programs and must rely upon the security measures offered by the primary data manager. If another application deletes, moves, or renames one or more of the files it manages, the secondary data manager can often have trouble reconciling those changes. If another program deletes one or more of its core metadata files (i.e. its database) then the secondary data manager can fail completely.

The Didget Management System is a primary data manager. It not only provides new functionality that previous data managers lack, but it has also been designed to supplant them. This is very different from other primary data managers like databases for example, which were designed to manage structured data in ways that file systems never could, but were never designed to handle unstructured data well enough to make file systems unnecessary. A consequence of that strategy is that as each of the other primary data managers entered the market, we ended up with yet another "silo" into which a portion of our data can be put.

That is why I designed the Didget Management System to manage both structured and unstructured data well. It is designed to manage that data in both simple configurations and in distributed cluster environments. When the amount of data grows from a few thousand pieces of information to billions of pieces utilizing petabytes of storage, there will not be a costly transition point where all the existing data must be migrated to an entirely new system. If we are successful, new data will not only be created as Didgets instead of as files or traditional database tables, but all the old data will be converted to Didgets as well. Our goal is to replace those other primary data managers completely.

In order to realize that goal, the Didget Management System has to do all the critical data management functions of the system it is replacing in addition to its new feature set. It cannot just be 5%, 10%, or even 50% better either. It has to be at least TWICE as good as the old system. When I designed it, that was my minimum threshold. If I couldn't make it dramatically better, it would not gain widespread adoption and would likely fall into a very narrow niche product and not be worth the effort.

Fortunately, the design has proven to work so well that I not only think we have met that 2x threshold, I think it has greatly exceeded it. I would not be surprised if once all the features are fully implemented, that we will have a system that is 10x better than those other systems. That does not mean that we will do everything 10x better than every feature found in those other systems (for example we will not be able to read a Didget ten times faster from disk than a file system can read a file), but rather that overall it will be that much better when all the factors of performance, feature set, ease of use, security, and flexibility are considered.