Tuesday, September 22, 2015

2015 Demo Videos on YouTube

Links to latest demo videos.

10 minute video that shows file operations as well as database operations:
https://www.youtube.com/watch?v=2uUvGMUyFhY

Shorter, 5 minute video that shows just the database operations. The latest and fastest code in that area is on display here:
https://youtu.be/0X02xpy8ygc

Monday, April 20, 2015

Object Storage

It is no secret that the amount of digital data being collected and stored has exploded over the past decade. With high speed networks, more portable devices, and the coming wave of "the Internet of Things (IoT)", it is unlikely to slow down anytime soon.

Fortunately, there are lots of cheap, high-capacity options for storing all that data. Hard drives, flash memory devices, and even optical and tape offer capacities unheard of in the past. There are numerous ways to bundle all these devices using hardware and software to create huge virtual containers.

Unfortunately, capacity and speed do not increase along the same curve. It is simply easier and cheaper to double the capacity of any given device than it is to double its speed. The same is true for the newest flash devices like SSDs. A 1 TB flash drive is not twice as fast as a 500 GB flash drive.

This means it will take longer to read all the data from a device shipped this year than it did to read all the data from a device shipped last year, and it will take even longer on next year's devices. This makes it more important than ever to improve the way the actual data is stored and managed on those devices through software.

Compression techniques, data de-duping capabilities, and distributed storage solutions can make it easier to handle large amounts of data, but more needs to be done. An effective object manager is needed to handle huge numbers of objects using minimal resources.

To give you an example of the problem, let's consider the default file system on Windows machines - NTFS. This file system stores a 4096 byte file record for every file in the volume. That might not seem like a lot of space until you get large numbers of files. If you have 10 million files, then you must read in and cache 40 GB worth of file metadata. If you have 100 million files, the amount is 400 GB. For a billion files, it is a whopping 4 TB. Bear in mind that this figure is only for the file record table. All the file names are stored separately in a directory structure.

When large data sets are used, it is very common for any given data operation to only affect a small portion of the overall data set. If you do daily or weekly backups, it is common for only 1% or less of the data to change between those backups. The same is true for synchronizing data sets between devices. Queries typically only need to examine a small portion of the data as well.

Current systems are very inefficient at determining the small subset of data that is needed for the operation; thus too much data must be read and processed. For example, take two separate storage devices that each hold a copy of the same 100 TB data set and are synchronized once a day. The data on each device changes independently between synchronization operations, but it is rare for more than a few GB to change each day. Using current systems, it might take a few hours and several TB of metadata reads to determine the small amount of data that must be transferred in order to bring the two devices back into perfect synchronization.

What is needed is a system that can quickly read a small amount of metadata from the device and find all the needles in the haystack in record time. This is what the Didget Management System is designed to do. It can store 100 million Didgets in a single container and read in the entire metadata set in under 21 seconds from a cold boot. It only needs to read in and cache 6.4 GB of data to do that and a consumer grade SSD has all the speed it needs to accomplish that query time.
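To put those numbers side by side, here is a small back-of-the-envelope sketch. The 4096-byte figure is the NTFS file record size described above; the 64-byte per-Didget figure is simply 6.4 GB divided by 100 million and is only an approximation, not a statement about the actual on-disk layout:

    #include <cstdint>
    #include <cstdio>
    #include <initializer_list>

    int main() {
        // Assumed per-object metadata sizes: a 4096 byte NTFS file record versus
        // roughly 64 bytes per Didget (6.4 GB / 100 million, as described above).
        const uint64_t ntfsRecordBytes   = 4096;
        const uint64_t didgetRecordBytes = 64;

        // Prints roughly 41 / 410 / 4096 GB for NTFS (the ~40 GB, 400 GB, and 4 TB
        // figures quoted above) versus 0.6 / 6.4 / 64 GB for Didget metadata.
        for (uint64_t objects : {10000000ULL, 100000000ULL, 1000000000ULL}) {
            double ntfsGB   = objects * ntfsRecordBytes   / 1e9;
            double didgetGB = objects * didgetRecordBytes / 1e9;
            std::printf("%11llu objects: ~%6.0f GB of NTFS records vs ~%5.1f GB of Didget metadata\n",
                        (unsigned long long)objects, ntfsGB, didgetGB);
        }
        return 0;
    }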


Friday, April 3, 2015

Update

I have been busy (whenever I get a spare minute) updating the Didget management software. It now has a number of new features and has also received some significant speed improvements.

1) I have converted all the internal container structures (bitmaps, tables, fragment lists) to be Didgets. This means I can leverage all the Didget management code for these structures. I can also expose these kinds of Didgets for external use. For example, an application can now create a "Bitmap Didget", store a few billion bits in it, and utilize its API to set and clear ranges of bits (see the first sketch after this list). I use them internally to keep track of all the free and used blocks within the container, but others could use them for a number of other purposes.

2) I have built the library and the browser interface using the latest tools. It is now a 64 bit application that will run on Windows 10. I built it using Visual Studio 2013 and the Qt 5.4 libraries. Some speed improvements have come from better code in these products. Now that it is 64 bit, I can allocate more than 4 GB of RAM and can do some benchmarks using extremely large data sets.

3) I have multi-threaded the underlying Didget manager code. This allows operations on large data sets to be split into several pieces and distributed across the multiple cores found in most modern CPUs (see the second sketch after this list). I have seen significant performance boosts here as well.
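
As a rough illustration of how a Bitmap Didget might be used once it is exposed for external use, here is a minimal, runnable stand-in. The class and method names (setRange, clearRange, findFirstClear) are assumptions made for the sake of the sketch, not the actual Didget API, and a real Bitmap Didget would persist its bits inside a container rather than in memory:

    #include <cstdint>
    #include <cstdio>
    #include <vector>

    // Minimal in-memory stand-in for the kind of interface a Bitmap Didget could expose.
    class BitmapSketch {
        std::vector<bool> bits_;
    public:
        explicit BitmapSketch(uint64_t count) : bits_(count, false) {}
        void setRange(uint64_t first, uint64_t count)   { for (uint64_t i = 0; i < count; ++i) bits_[first + i] = true;  }
        void clearRange(uint64_t first, uint64_t count) { for (uint64_t i = 0; i < count; ++i) bits_[first + i] = false; }
        uint64_t findFirstClear(uint64_t start) const {
            for (uint64_t i = start; i < bits_.size(); ++i)
                if (!bits_[i]) return i;
            return bits_.size();                        // no free bit found
        }
    };

    int main() {
        BitmapSketch freeMap(1000000);                  // one bit per block
        freeMap.setRange(0, 128);                       // first 128 blocks already used
        uint64_t firstFree = freeMap.findFirstClear(0);
        freeMap.setRange(firstFree, 16);                // allocate a 16-block extent
        std::printf("allocated extent starting at block %llu\n",
                    (unsigned long long)firstFree);
        freeMap.clearRange(firstFree, 16);              // and free it again
        return 0;
    }

The multi-threading point can be illustrated in the same spirit: a large in-memory scan is split into one slice per hardware thread and the partial results are merged at the end. This is generic C++11 threading shown only to convey the splitting idea; it is not the Didget manager's actual code:

    #include <algorithm>
    #include <cstdint>
    #include <cstdio>
    #include <numeric>
    #include <thread>
    #include <vector>

    int main() {
        std::vector<uint32_t> records(10000000, 1);     // stand-in for a large metadata table
        unsigned workers = std::max(1u, std::thread::hardware_concurrency());
        std::vector<uint64_t> partial(workers, 0);
        std::vector<std::thread> pool;

        size_t chunk = records.size() / workers;
        for (unsigned w = 0; w < workers; ++w) {
            size_t begin = w * chunk;
            size_t end = (w + 1 == workers) ? records.size() : begin + chunk;
            // Each worker scans only its own slice of the data set.
            pool.emplace_back([&partial, &records, w, begin, end] {
                partial[w] = std::accumulate(records.begin() + begin,
                                             records.begin() + end, uint64_t(0));
            });
        }
        for (auto& t : pool) t.join();

        uint64_t total = std::accumulate(partial.begin(), partial.end(), uint64_t(0));
        std::printf("scanned %zu records on %u threads, total = %llu\n",
                    records.size(), workers, (unsigned long long)total);
        return 0;
    }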

To give you some idea of the speed improvements, I will now post some results of tests I have been running. I hope to post a video soon showing these tests in action.

These tests were run on my newer (1.5 year old) desktop machine. It has an Intel i7-3770 processor (fairly quick, but by no means the fastest out there) with 16 GB RAM and a 64 GB SSD.

I can now create over 100 million Didgets in a single container. Queries that do not inspect tags (e.g. find all JPEG photos) complete in under 1 second even if the query matches 20 million Didgets. Queries that inspect tags (e.g. find all photos where place.State = "California") are also much faster, but may take a few seconds when there are 20 million matching Didgets to sort out.
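
To make the two query shapes concrete, here is a toy, in-memory illustration. The struct and field names are assumptions for illustration only; the real system answers these queries from container metadata rather than by scanning a std::vector:

    #include <cstdio>
    #include <map>
    #include <string>
    #include <vector>

    struct DidgetMeta {
        std::string type;                          // e.g. "photo.jpeg"
        std::map<std::string, std::string> tags;   // e.g. {"place.State", "California"}
    };

    int main() {
        std::vector<DidgetMeta> metadata = {
            {"photo.jpeg",    {{"place.State", "California"}}},
            {"photo.jpeg",    {{"place.State", "Utah"}}},
            {"document.text", {}},
        };

        size_t jpegCount = 0, californiaCount = 0;
        for (const auto& m : metadata) {
            if (m.type != "photo.jpeg") continue;   // type-only query: all JPEG photos
            ++jpegCount;
            auto it = m.tags.find("place.State");   // tag query: place.State = "California"
            if (it != m.tags.end() && it->second == "California")
                ++californiaCount;
        }
        std::printf("JPEG photos: %zu, of which tagged California: %zu\n",
                    jpegCount, californiaCount);
        return 0;
    }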

I can now import a 5 million row, 10 column table from a .CSV file in 4 minutes, 45 seconds. This includes opening the file and reading in every row; parsing the ten values on each row; and inserting each value into its corresponding key-value store (a Tag Didget) while de-duping any duplicate values. That time also includes flushing the table to disk.

I can then perform queries against that table in less than 1 second as well. "SELECT * FROM table WHERE address LIKE '1%'" (i.e. find every person with an address that starts with the number 1) will find all of them (about 500,000 rows) in less than a second.
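
As a sketch of that import-then-query flow, here is a toy version in which each column becomes its own value-to-row-numbers store and the LIKE '1%' query is a prefix match over one column's keys. It is an in-memory illustration with made-up names, not the actual storage engine:

    #include <cstdint>
    #include <cstdio>
    #include <map>
    #include <sstream>
    #include <string>
    #include <vector>

    // Each column is modeled as its own small key/value store: value -> row numbers.
    using ColumnStore = std::map<std::string, std::vector<uint64_t>>;

    int main() {
        // Stand-ins for lines read from the .CSV file (only name and address columns).
        std::vector<std::string> csvRows = {
            "Fred,123 Main St", "Ann,99 Oak Ave", "Fred,17 Elm St"};

        std::vector<ColumnStore> columns(2);
        for (uint64_t row = 0; row < csvRows.size(); ++row) {
            std::istringstream line(csvRows[row]);
            std::string value;
            for (size_t col = 0; col < columns.size() && std::getline(line, value, ','); ++col)
                columns[col][value].push_back(row);   // duplicate values share a single key
        }

        // "WHERE address LIKE '1%'": scan the address column for keys starting with '1'.
        uint64_t matches = 0;
        for (const auto& entry : columns[1])
            if (!entry.first.empty() && entry.first[0] == '1')
                matches += entry.second.size();

        std::printf("rows with an address starting with 1: %llu\n",
                    (unsigned long long)matches);
        return 0;
    }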

My original goal was to be able to store 100 million Didgets in a single container and be able to find anything (and everything) that matched any criteria in less than 10 seconds. With the latest code, I have been able to exceed that goal by quite a large margin.

Sunday, July 6, 2014

Quick links to all the videos so far...

Database 1
Database 2

Didgets 1
Didgets 2
Didgets 3

Relational Database Tables Using Didgets

It has been nearly a year since I last posted, but I haven't been lazy. I have been busy in my spare time improving the performance of the database operations and adding lots of new features. The relational database table operations are significantly faster now (as are the tag look-ups for Didgets).

A couple of posts ago, in the entry entitled "Another Piece to the Puzzle", I reported that it took 25 seconds to query a 1 million row table with 6 columns. Now I am able to query a 10 column table (also with a million rows) in under a single second. The time it takes to import a .CSV file has also been greatly reduced.

I can now import a 5 million row, 10 column table in 1 minute and 6 seconds. Most queries against that table now take at most about 2 seconds.

See a video demonstration of queries against the 1 million row table at http://screenr.com/xKmN

Since every column in each table is stored within a pair of Tag Didgets, each column becomes a separate key/value store. All values are de-duped, so if the column "First Name" has 10,000 rows with the value "Fred", the actual string value is only stored once with 10,000 references to it. The user can select the Tag Didget containing all the values and view them along with the reference count for each.

Each row in a database table is just a set of values from the separate key/value stores that all map to the same key (e.g. the row number).
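
Here is a toy sketch of those two ideas: each distinct value is stored once with a reference count, and a row is reassembled by asking every column store for the value mapped to that row number. The class and method names are invented for the illustration; they are not the actual Tag Didget API:

    #include <cstdint>
    #include <cstdio>
    #include <map>
    #include <string>
    #include <vector>

    // Minimal stand-in for one column stored as a de-duped key/value store.
    class ColumnSketch {
        std::map<std::string, uint64_t> refCount_;    // each distinct value stored once
        std::vector<const std::string*> rowToValue_;  // row number -> shared value
    public:
        void insert(uint64_t row, const std::string& value) {
            auto it = refCount_.emplace(value, uint64_t(0)).first;
            ++it->second;                             // bump the reference count
            if (rowToValue_.size() <= row) rowToValue_.resize(row + 1, nullptr);
            rowToValue_[row] = &it->first;            // map keys never move, so this is safe
        }
        const std::string& valueForRow(uint64_t row) const { return *rowToValue_[row]; }
        uint64_t references(const std::string& value) const { return refCount_.at(value); }
    };

    int main() {
        ColumnSketch firstName, city;
        firstName.insert(0, "Fred"); city.insert(0, "Provo");
        firstName.insert(1, "Ann");  city.insert(1, "Orem");
        firstName.insert(2, "Fred"); city.insert(2, "Provo");   // "Fred" is stored only once

        // Row 2 is just the value from each column store that maps to row number 2.
        std::printf("row 2 = (%s, %s); \"Fred\" referenced %llu times\n",
                    firstName.valueForRow(2).c_str(), city.valueForRow(2).c_str(),
                    (unsigned long long)firstName.references("Fred"));
        return 0;
    }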

Another video demonstration of ways to view all the values in each key/value store is at http://screenr.com/AKmN

We have several additional features that we are working on, but this should give you a taste of how fast our database inserts and queries are so you can compare them against existing database managers.

Remember, we are not running on top of MySQL, SQLite, PostgreSQL, or any other commercial or open source database manager. This is all running just on the Didget Management System. You get all this functionality without needing to install a separate RDBMS.

Saturday, July 13, 2013

Data Managers

Over the years, a number of systems have been created to help users manage their data. I call these systems "Data Managers". There are two types - primary data managers and secondary data managers.

Primary data managers are very general-purpose in nature and are widely adopted in the computing world. File systems, databases, and web servers fall into this category. More recent members of this category include distributed file systems like Hadoop and cloud offerings like Amazon S3. These newer systems are gaining greater acceptance as "Big Data" becomes more pervasive and as users demand more mobile access to all their data.

Secondary data managers are generally more specialized in the types of data they manage. They almost always utilize the services of a primary data manager to store their underlying data. Examples of these kinds of data managers include Apple's iTunes for managing music or Google's Picasa for managing photos. They typically keep most of their unstructured data as files in a file system and create a proprietary database for storing extra metadata. These data managers may also integrate with cloud services to give the user a virtual view of their data even when it may be spread across several systems.

Unfortunately, these secondary data managers are nearly always in danger of interference from other programs and must rely upon the security measures offered by the primary data manager. If another application deletes, moves, or renames one or more of the files it manages, the secondary data manager can often have trouble reconciling those changes. If another program deletes one or more of its core metadata files (i.e. its database), then the secondary data manager can fail completely.

The Didget Management System is a primary data manager. It not only provides new functionality that previous data managers lack, but it has also been designed to supplant them. This is very different from other primary data managers. Databases, for example, were designed to manage structured data in ways that file systems never could, but were never designed to handle unstructured data well enough to make file systems unnecessary. A consequence of that strategy is that as each of the other primary data managers entered the market, we ended up with yet another "silo" into which a portion of our data can be put.

That is why I designed the Didget Management System to manage both structured and unstructured data well. It is designed to manage that data in both simple configurations and in distributed cluster environments. When the amount of data grows from a few thousand pieces of information to billions of pieces utilizing petabytes of storage, there will not be a costly transition point where all the existing data must be migrated to an entirely new system. If we are successful, new data will not only be created as Didgets instead of as files or traditional database tables, but all the old data will be converted to Didgets as well. Our goal is to replace those other primary data managers completely.

In order to realize that goal, the Didget Management System has to perform all the critical data management functions of the systems it is replacing in addition to offering its new feature set. It cannot just be 5%, 10%, or even 50% better either. It has to be at least TWICE as good as the old system. When I designed it, that was my minimum threshold. If I couldn't make it dramatically better, it would not gain widespread adoption; it would likely end up as a very narrow niche product and not be worth the effort.

Fortunately, the design has proven to work so well that I think we have not only met that 2x threshold but greatly exceeded it. I would not be surprised if, once all the features are fully implemented, we have a system that is 10x better than those other systems. That does not mean we will do every single feature 10x better than those other systems (for example, we will not be able to read a Didget ten times faster from disk than a file system can read a file), but rather that overall it will be that much better when performance, feature set, ease of use, security, and flexibility are all considered.

Tuesday, April 30, 2013

Another Piece to the Puzzle

Didgets provide new and innovative ways to store, search, and organize unstructured data that would normally be stored in files. They have also proven useful for storing structured data that is well suited for entry into a NoSQL database. A missing piece was to use them to store structured data that has traditionally been stored within tables in a regular Relational Database Management System (RDBMS) and accessed via the Structured Query Language (SQL).

Since our tags had been effective in implementing NoSQL columns in a sparse table, we decided to use them to try to implement a regular relational table. While I had little hope that it could match the performance of a finely tuned RDBMS like MySQL, I at least wanted to implement something that would be acceptable and maybe provide a few unique features or an easier way of managing the data.

To my surprise, it not only matched the performance of MySQL in preliminary tests, it was 17% faster on many queries. I created a table with six columns and inserted 1 million rows of random data to test the performance of each system. Using the MyISAM storage engine under MySQL, a "SELECT *" query took 30 seconds to execute on my old test machine. The same query using the Didget Management System only took 25 seconds to complete.

If I switched the storage engine under MySQL to the InnoDB engine, the same query took 1 minute and 20 seconds. I was surprised that the InnoDB engine with transactional support was so much slower than the MyISAM engine for this simple query. I have yet to implement the transaction feature using Didgets, so I could not do a comparison test, but I am confident that our transaction overhead will not be as dramatic as it was under MySQL.

I am also confident that the Didget Management System will provide a very easy mechanism to create, query, and share database tables. It will also be much easier to administer since we can provide lightning fast queries without having to index columns or do complicated joins across multiple tables.

In essence, the Didget Management System is a radically different architecture from the traditional RDBMS way of storing structured data in multiple tables. Since development of the database features is still in its infancy, there is much work yet to be done, but I am confident that this will become a major feature of our new general-purpose data management system.

Stay tuned for further developments....