Friday, April 3, 2015

Update

I have been busy (whenever I get a spare minute) updating the Didget management software. It now has a number of new features and has also received some significant speed improvements.

1) I have converted all the internal container structures (bitmaps, tables, fragment lists) to be Didgets. This means I can leverage all the Didget management code for these structures, and I can also expose these kinds of Didgets for external use. For example, an application can now create a "Bitmap Didget", store a few billion bits in it, and use its API to set and clear ranges of bits (a rough sketch of that kind of interface follows after this list). I use them internally to keep track of all the free and used blocks within the container, but others could use them for a number of other purposes.

2) I have built the library and the browser interface using the latest tools. It is now a 64-bit application that will run on Windows 10, built with Visual Studio 2013 and the Qt 5.4 libraries. Some of the speed improvement comes simply from better code in these newer tools. Now that it is 64-bit, it can allocate more than 4 GB of RAM, so I can run benchmarks against extremely large data sets.

3) I have multi-threaded the underlying Didget manager code. This allows operations on large data sets to be split into several pieces and distributed across the multiple processor cores found in most modern CPUs (see the second sketch after this list). I have seen significant performance boosts here as well.
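
To give a feel for what the ranged set/clear interface described in item 1 might look like, here is a minimal sketch. It is not the actual Didget API; the class name, method names, and backing storage are stand-ins I made up purely for illustration.

// Hypothetical sketch only -- not the real Bitmap Didget API.
// It illustrates a ranged set/clear bit interface backed by
// a plain std::vector<uint64_t>.
#include <cstdint>
#include <iostream>
#include <vector>

class BitmapSketch {
public:
    explicit BitmapSketch(uint64_t bitCount)
        : bits_((bitCount + 63) / 64, 0), count_(bitCount) {}

    // Set every bit in [first, first + length).
    void setRange(uint64_t first, uint64_t length) {
        for (uint64_t i = first; i < first + length && i < count_; ++i)
            bits_[i / 64] |= (1ULL << (i % 64));
    }

    // Clear every bit in [first, first + length).
    void clearRange(uint64_t first, uint64_t length) {
        for (uint64_t i = first; i < first + length && i < count_; ++i)
            bits_[i / 64] &= ~(1ULL << (i % 64));
    }

    bool test(uint64_t i) const {
        return (bits_[i / 64] >> (i % 64)) & 1ULL;
    }

private:
    std::vector<uint64_t> bits_;
    uint64_t count_;
};

int main() {
    BitmapSketch allocation(1000000);   // e.g. one bit per block in a container
    allocation.setRange(0, 128);        // mark the first 128 blocks as used
    allocation.clearRange(64, 32);      // free blocks 64..95 again
    std::cout << allocation.test(63) << " " << allocation.test(64) << "\n"; // prints "1 0"
}
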
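And for item 3, here is a rough sketch of the general chunk-per-thread pattern, using plain std::thread. It only illustrates the idea of splitting a large scan across cores; the real Didget manager threading code is not shown here, and the record type and the matching test are placeholders.

// A minimal sketch (not the actual Didget manager code): split a large
// scan into chunks and hand each chunk to its own thread.
#include <algorithm>
#include <atomic>
#include <cstddef>
#include <iostream>
#include <thread>
#include <vector>

int main() {
    std::vector<int> records(20000000, 1);    // stand-in for a large data set
    unsigned workers = std::max(1u, std::thread::hardware_concurrency());
    std::atomic<long long> matches{0};

    std::vector<std::thread> pool;
    size_t chunk = records.size() / workers;
    for (unsigned w = 0; w < workers; ++w) {
        size_t begin = w * chunk;
        size_t end = (w + 1 == workers) ? records.size() : begin + chunk;
        pool.emplace_back([&, begin, end] {
            long long local = 0;
            for (size_t i = begin; i < end; ++i)
                if (records[i] == 1) ++local;  // the per-record work goes here
            matches += local;
        });
    }
    for (auto& t : pool) t.join();
    std::cout << "matched " << matches << " records across "
              << workers << " threads\n";
}
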

To give you some idea of the speed improvements, I will now post some results of tests I have been running. I hope to post a video soon showing these tests in action.

These tests were run on my newer (1.5-year-old) desktop machine. It has an Intel i7-3770 processor (fairly quick, but by no means the fastest out there), 16 GB of RAM, and a 64 GB SSD.

I can now create over 100 million Didgets in a single container. Queries that do not inspect tags (e.g. find all JPEG photos) complete in under 1 second, even when the query matches 20 million Didgets. Queries that do inspect tags (e.g. find all photos where place.State = "California") are also much faster than before, but may take a few seconds to sort through the results when 20 million Didgets match.

I can now import a 5 million row, 10 column table from a .CSV file in 4 minutes, 45 seconds. This includes opening the file and reading every row, parsing the ten values on each row, and inserting each value into its corresponding key-value store (a Tag Didget) while de-duplicating any repeated values. That time also includes flushing the table to disk.
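
As a rough illustration of that import flow (not the actual importer), the following sketch reads a CSV file, splits each row into its ten values, and inserts each value into a per-column de-duplicating store. The file name, the column count, and the use of std::unordered_map as a stand-in for a Tag Didget are all assumptions made for the example.

// Hypothetical sketch, not the Didget importer itself: read CSV rows,
// split each row into column values, and insert each value into a
// per-column de-duplicating store.
#include <cstdint>
#include <fstream>
#include <iostream>
#include <sstream>
#include <string>
#include <unordered_map>
#include <vector>

int main() {
    std::ifstream in("people.csv");                       // assumed input file
    if (!in) { std::cerr << "could not open people.csv\n"; return 1; }

    const size_t numColumns = 10;
    // One value->id map per column, standing in for one "Tag Didget" per column.
    std::vector<std::unordered_map<std::string, uint32_t>> columns(numColumns);

    std::string line;
    size_t rows = 0;
    while (std::getline(in, line)) {
        std::stringstream ss(line);
        std::string value;
        for (size_t col = 0; col < numColumns && std::getline(ss, value, ','); ++col) {
            auto& store = columns[col];
            // emplace only inserts if the value has not been seen in this column.
            store.emplace(value, static_cast<uint32_t>(store.size()));
        }
        ++rows;
    }
    std::cout << "imported " << rows << " rows; column 0 holds "
              << columns[0].size() << " distinct values\n";
}
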

I can then run queries against that table in under a second as well. For example, "SELECT * FROM table WHERE address LIKE '1%'" (i.e. find every person whose address starts with the number 1) returns all of the matching rows (about 500,000 of them) in less than a second.

My original goal was to store 100 million Didgets in a single container and to be able to find anything (and everything) matching any criteria in less than 10 seconds. With the latest code, I have exceeded that goal by quite a large margin.
