Monday, September 18, 2017

Latest Videos

I have made a ton of changes since I last recorded some videos showing what our Didget Browser can do, so I decided to make some new ones. I will add new links and descriptions as I record them. Here is what I have so far:


Introduction: (4 minutes) http://youtu.be/NgPTYsb4LRQ?hd=1

This video shows you how to create new containers that hold our Didget objects. It shows how to wipe out a container and start again from scratch. It also shows you how to configure the browser to only show certain features and how to pre-populate a new container with various Didgets.


Creating Database Tables: (5 minutes) http://youtu.be/rM1KEVe7TVc?hd=1

This video shows you how to create relational tables using Didgets. It shows how to create a table definition from scratch. It also shows how to create one from a JSON or CSV file. Once a definition is created, it can be used to create relational tables. Tables can also be constructed by extracting data from a local or remote database using a connector.


Querying our Database Tables: (7 minutes) http://youtu.be/T_Y2R4DA9UI?hd=1

This video shows how to query the tables, how to JOIN tables, and how to save any results out to a completely separate, persistent table that can later be queried just like any other table.

Monday, August 28, 2017

How Good is Good Enough?

I can be a perfectionist in some areas. I am passionate about speed when it comes to computer algorithms. Even if I have spent a few days getting some critical function to be 10x faster than it was before, I will often still stay up late if I know I can squeeze a few more percentage points out of it.

But I am also keen on the 'Lean Startup' idea of an MVP (Minimum Viable Product), where you build something that is just good enough and introduce it to the market without waiting until it is perfect. So I struggle with how good a particular feature has to be before I say it is good enough and move on to the next task.

The Didget Management System has a lot of very innovative features that set it apart from other kinds of general-purpose data managers. But speed is its greatest 'Wow!' factor. It can do things thousands of times faster than conventional file systems. It can do many database operations 2x, 3x, or 5x faster than the major relational database management systems. It takes full advantage of multi-core CPUs to make even single operations much faster by breaking them up and running pieces in parallel.

Yet I have found that my biggest challenge is getting people to commit resources (time, money, effort) to something that has considerable promise whenever any risk is involved. I will show them that something that takes 10 minutes on their PostgreSQL database can be done on my system in only 2 minutes, and I still have trouble getting them to commit. This is not something trivial that is outside the core function of their business; it is something critical, and still they hesitate. I think this is mainly because it requires change: a step into the unknown.

Everyone knows that risk is the biggest enemy of innovation. All but the most trivial innovations required someone to take a chance and put something on the line to 'make it happen'. Business managers almost always remember some initiative that failed because they tried something out of the mainstream. But they rarely, if ever, know how a passed-up opportunity would have played out to their advantage.

I am keenly aware that in order to convince companies to switch to my system, it has to be a great improvement over their existing solution. When I started this project, I set the bar at twice as good. If I didn't think I could build something that was at least twice as fast, twice as convenient, or twice as powerful, I would never have gone far into its development. It has far exceeded my expectations, to the point that I think it will be at least 10x better than anything else.

So I plod ahead with the hope that eventually this platform will attract the attention of those innovators who will take a good look at the risk/reward ratio and decide the reward is just too great to pass up. We have some companies that are taking a look at it, but most have yet to do more than dip their toe in the water.

As I look over my list of tasks that are yet to be completed, many are 'refinement' tasks that will make features that already work significantly better. Others are features that do not yet work at all and need to be implemented. I have taken the approach of doing a mixture of things from both groups. Every time I get one or two new things working, I go back and enhance one thing that worked before. At the end of the month, I can then say that the product does things it never did before but also does a handful of things better than ever.

Tuesday, May 2, 2017

Design Principles

As we have designed and implemented the Didget Management System, we have done a good job so far at adhering to these basic principles:

1) No dependencies. We purposely stayed away from using anything that may cause dependency issues down the road. We don't use .NET. We don't use third-party libraries other than the standard C++ libraries. Our browser application uses Qt but the manager itself does not.

2) Take full advantage of CPU cores/threads. We want our code to run much faster on a CPU with more cores. Large individual operations are often broken up into multiple pieces and run in parallel using separate threads. We are not just running multiple queries simultaneously like many database servers (we do that too), but we can run a single query faster when more cores are available.

3) Use thread-safe code. Because we do so many things at the same time, we want to allow multiple operations on the same set of data to safely run in parallel. Data integrity is very important whether running many separate queries simultaneously or when many threads are running different parts of the same query.

4) Be operating system independent. We want this code to run equally well on Linux, Windows, or OS X. All operating system calls are confined to a single 'Kernel' module which is easily ported to other operating systems.

5) Be faster than anything else. Use things like maps, hash tables, and very fast algorithms written in efficient C++ code to do everything. No interpreted code.

6) Re-use code whenever possible. Often the same function can be used to manipulate a dozen or so different kinds of Didgets. When we fine tune something, it often improves performance in multiple areas.
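
As a rough illustration of principle 2, here is a minimal C++ sketch of running one scan faster by splitting it across threads. It assumes a plain in-memory column of integers rather than real Didget structures, and the name parallel_count_if and the chunking scheme are hypothetical, not part of the Didget code. Each thread writes only to its own result slot, which also touches on principle 3.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical helper, not the Didget API: counts matching values by
// splitting ONE scan into chunks and running each chunk on its own thread.
std::size_t parallel_count_if(const std::vector<int>& column,
                              const std::function<bool(int)>& pred,
                              unsigned num_threads) {
    assert(num_threads > 0);
    std::vector<std::size_t> partial(num_threads, 0);  // one result slot per thread
    std::vector<std::thread> workers;
    // Ceiling division so every element lands in exactly one chunk.
    const std::size_t chunk = (column.size() + num_threads - 1) / num_threads;

    for (unsigned t = 0; t < num_threads; ++t) {
        const std::size_t begin = std::min(column.size(), t * chunk);
        const std::size_t end = std::min(column.size(), begin + chunk);
        workers.emplace_back([&column, &pred, &partial, t, begin, end] {
            std::size_t n = 0;
            for (std::size_t i = begin; i < end; ++i) {
                if (pred(column[i])) ++n;
            }
            partial[t] = n;  // each thread writes only its own slot, so no lock is needed
        });
    }
    for (auto& w : workers) w.join();

    // Combine the per-thread counts once all threads have finished.
    std::size_t total = 0;
    for (std::size_t n : partial) total += n;
    return total;
}
```

A caller would typically pass std::thread::hardware_concurrency() (falling back to 1 if it returns 0) as num_threads, so the same query uses more threads on a CPU with more cores.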

Progress

I recently left my day job to work on the Didget system full time. My small team has made a number of changes to the project since the last blog post, but progress was slow while so many other things got in the way. I have now been able to accelerate the development and testing greatly.

Here is a list of some new features that have been added over the past two years.

1) Added ability to do JOIN operations on our database tables.

2) Added direct connectors to external databases so we can create DB tables by querying those databases directly. In earlier versions, the user had to export the data into CSV files from those databases and then import those files into our system.

3) Added ability to transform data within database tables. We can now create new columns that are transformations of other columns. For example we can uppercase, lowercase, substitute, split, combine, convert, truncate, trim, etc.

4) Added a 'Folder' container type so that the data stream for every Didget is stored in a separate file. This helps with testing and lets us do more 'apples to apples' comparisons with file system operations.

5) Added more complex SQL query operations. We can now combine lots of AND and OR operations like "SELECT * FROM myTable WHERE Name LIKE '%son' AND Address ILIKE '%123%' OR ZipCode < 10000;" We made it very easy to create and save these queries.

6) Tuned a bunch of operations to be faster. SQL queries now execute even faster and often require less data to be read from disk. Many of our database queries are now about twice as fast as MySQL or PostgreSQL when performed on the same data set on the same machine. Again, we often outperform a fully indexed table in those other database systems without needing any indexes ourselves.
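
To make the example query in item 5 concrete: in standard SQL, AND binds more tightly than OR, so that WHERE clause reads as (Name LIKE '%son' AND Address ILIKE '%123%') OR ZipCode < 10000. The C++ sketch below spells out that grouping for a single row. It assumes the Didget query engine follows the standard precedence, and the Row struct and helper names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <string>

// Hypothetical row layout for the example query; not a Didget structure.
struct Row {
    std::string name;
    std::string address;
    int zip_code;
};

// LIKE '%son': the value must end with the given suffix (case-sensitive).
bool ends_with(const std::string& s, const std::string& suffix) {
    return s.size() >= suffix.size() &&
           s.compare(s.size() - suffix.size(), suffix.size(), suffix) == 0;
}

// ILIKE '%123%': the value must contain the text, ignoring case.
bool icontains(std::string s, std::string needle) {
    auto lower = [](unsigned char c) { return static_cast<char>(std::tolower(c)); };
    std::transform(s.begin(), s.end(), s.begin(), lower);
    std::transform(needle.begin(), needle.end(), needle.begin(), lower);
    return s.find(needle) != std::string::npos;
}

// AND is evaluated before OR, so the parentheses below mirror how the
// WHERE clause is grouped under standard SQL precedence.
bool matches(const Row& r) {
    return (ends_with(r.name, "son") && icontains(r.address, "123")) ||
           r.zip_code < 10000;
}
```

Note that a row with a ZipCode below 10000 matches regardless of its Name or Address. If the intent were Name LIKE '%son' AND (Address ILIKE '%123%' OR ZipCode < 10000), the query would need explicit parentheses.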

Tuesday, September 22, 2015

2015 Demo Videos on YouTube

Links to latest demo videos.

10 minute video that shows file stuff as well as database operations:
https://www.youtube.com/watch?v=2uUvGMUyFhY

Shorter, 5 minute video that just shows the database operations. The latest and fastest code in that area is on display here:
https://youtu.be/0X02xpy8ygc

Monday, April 20, 2015

Object Storage

It is no secret that the amount of digital data being collected and stored has exploded over the past decade. With high speed networks, more portable devices, and the coming wave of "the Internet of Things (IoT)", it is unlikely to slow down anytime soon.

Fortunately, there are lots of cheap, high-capacity options for storing all that data. Hard drives, flash memory devices, and even optical discs and tape offer capacities unheard of in the past. There are numerous ways to bundle all these devices using hardware and software to create huge virtual containers.

Unfortunately, capacity increases and speed increases are not the same curve on the graph. It is simply easier and cheaper to double the capacity of any given device than it is to double its speed. The same is true for the newest flash devices like SSDs. A 1 TB flash drive is not twice as fast as a 500 GB flash drive.

This means it will take longer to read all the data from a device shipped this year than it did to read all the data from a device shipped last year and it will take even longer to read it all in on next year's devices. This makes it more important than ever to improve the way the actual data is stored and managed on those devices through software.

Compression techniques, data de-duping capabilities, and distributed storage solutions can make it easier to handle large amounts of data, but more needs to be done. An effective object manager is needed to handle huge numbers of objects using minimal resources.

To give you an example of the problem, let's consider the default file system on Windows machines, NTFS. This file system stores a 4096 byte file record for every file in the volume. That might not seem like a lot of space until you get large numbers of files. If you have 10 million files, then you must read in and cache about 40 GB worth of file metadata. If you have 100 million files, the amount is about 400 GB. For a billion files, it is a whopping 4 TB. Bear in mind that this figure is only for the file record table. All the file names are stored separately in a directory structure.
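
The arithmetic behind those figures is simple enough to check in a few lines of C++. This sketch takes the 4096 byte record size from the paragraph above as given and uses decimal units (1 GB = 10^9 bytes); the function names are made up for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Record size as stated in the text above; treated here as a given.
constexpr std::uint64_t kFileRecordBytes = 4096;

// Total bytes of file records for a volume holding file_count files.
constexpr std::uint64_t file_record_bytes(std::uint64_t file_count) {
    return file_count * kFileRecordBytes;
}

// Decimal gigabytes (1 GB = 10^9 bytes).
constexpr double to_gigabytes(std::uint64_t bytes) {
    return static_cast<double>(bytes) / 1e9;
}
```

For 10 million files this gives 40.96 GB, for 100 million files 409.6 GB, and for a billion files 4096 GB, which the text rounds to 40 GB, 400 GB, and 4 TB.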

When large data sets are used, it is very common for any given data operation to only affect a small portion of the overall data set. If you do daily or weekly backups, it is common for only 1% or less of the data to change between those backups. The same is true for synchronizing data sets between devices. Queries typically only need to examine a small portion of the data as well.

Current systems are very inefficient at determining the small subset of data that is needed for the operation; thus too much data must be read and processed. For example, take two separate storage devices that each hold a copy of the same 100 TB data set and are synchronized once a day. The data on each device changes independently between synchronization operations, but it is rare for more than a few GB to change each day. Using current systems, it might take a few hours and several TB of metadata reads to determine the small amount of data that must be transferred in order to bring the two devices back into perfect synchronization.

What is needed is a system that can quickly read a small amount of metadata from the device and find all the needles in the haystack in record time. This is what the Didget Management System is designed to do. It can store 100 million Didgets in a single container and read in the entire metadata set in under 21 seconds from a cold boot. It only needs to read in and cache 6.4 GB of data to do that and a consumer grade SSD has all the speed it needs to accomplish that query time.
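
Two numbers follow directly from the figures in that paragraph: the average metadata footprint per Didget and the sustained read rate needed to load it all in 21 seconds. The C++ sketch below just does the division, using decimal units (1 GB = 10^9 bytes); the function names are hypothetical, and nothing here says how the metadata is actually laid out on disk.

```cpp
#include <cassert>

// Average metadata bytes per object, given the total size and object count.
constexpr double bytes_per_object(double total_bytes, double object_count) {
    return total_bytes / object_count;
}

// Sustained read rate in bytes per second needed to load total_bytes
// within the given number of seconds.
constexpr double required_read_rate(double total_bytes, double seconds) {
    return total_bytes / seconds;
}
```

With 6.4 GB for 100 million Didgets, bytes_per_object(6.4e9, 100e6) comes out to 64 bytes each, and required_read_rate(6.4e9, 21.0) is roughly 305 MB per second.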


Friday, April 3, 2015

Update

I have been busy (whenever I get a spare minute) updating the Didget management software. It now has a number of new features and has also received some significant speed improvements.

1) I have converted all the internal container structures (bitmaps, tables, fragment lists) to be Didgets. This means I can leverage all the Didget management code for these structures. I can also expose these kinds of Didgets for external use. For example, an application can now create a "Bitmap Didget", store a few billion bits in it, and use its API to set and clear ranges of bits. I use them internally to keep track of all the free and used blocks within the container, but others could use them for a number of other purposes.

2) I have built the library and the browser interface using the latest tools. It is now a 64 bit application that will run on Windows 10. I built it using Visual Studio 2013 and the Qt 5.4 libraries. Some speed improvements have come from better code in these products. Now that it is 64 bit, I can allocate more than 4 GB of RAM and can do some benchmarks using extremely large data sets.

3) I have multi-threaded the underlying Didget manager code. This allows operations on large data sets to be split into several pieces and distributed across multiple processors on most modern CPUs. I have seen significant performance boosts here as well.
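
To give a feel for the kind of interface item 1 describes, here is a minimal C++ sketch of a bitmap that sets and clears ranges of bits, used the way a free/used block map might be. The class name RangeBitmap and its methods are hypothetical; this is not the Bitmap Didget API, and it works bit by bit for clarity where a tuned version would fill whole 64-bit words at a time.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical in-memory bitmap; a sketch of range set/clear, not the Didget API.
class RangeBitmap {
public:
    explicit RangeBitmap(std::size_t bit_count)
        : bit_count_(bit_count), words_((bit_count + 63) / 64, 0) {}

    // Mark `count` bits starting at `first` as set (e.g. blocks in use).
    void set_range(std::size_t first, std::size_t count) { fill(first, count, true); }

    // Mark `count` bits starting at `first` as clear (e.g. blocks freed).
    void clear_range(std::size_t first, std::size_t count) { fill(first, count, false); }

    bool test(std::size_t bit) const {
        assert(bit < bit_count_);
        return ((words_[bit / 64] >> (bit % 64)) & 1u) != 0;
    }

    // Number of set bits; a simple loop over every bit, kept short for clarity.
    std::size_t count_set() const {
        std::size_t n = 0;
        for (std::size_t i = 0; i < bit_count_; ++i) n += test(i) ? 1 : 0;
        return n;
    }

private:
    void fill(std::size_t first, std::size_t count, bool value) {
        // The whole range must lie inside the bitmap.
        assert(first <= bit_count_ && count <= bit_count_ - first);
        for (std::size_t bit = first; bit < first + count; ++bit) {
            const std::uint64_t mask = std::uint64_t{1} << (bit % 64);
            if (value) words_[bit / 64] |= mask;
            else       words_[bit / 64] &= ~mask;
        }
    }

    std::size_t bit_count_;
    std::vector<std::uint64_t> words_;  // 64 bits per word
};
```

For example, after RangeBitmap blocks(200); blocks.set_range(10, 100); blocks.clear_range(50, 20); bits 10 through 49 and 70 through 109 are set and everything else is clear.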

To give you some idea of the speed improvements, I will now post some results of tests I have been running. I hope to post a video soon showing these tests in action.

These tests were running on my newer (1.5 year old) desktop machine. It has an Intel i7-3770 processor (fairly quick, but by no means the fastest out there) with 16 GB RAM and a 64 GB SSD.

I can now create over 100 million Didgets in a single container. Queries that do not inspect tags (e.g. find all JPEG photos) complete in under 1 second even if the query matches 20 million Didgets. Queries that inspect tags (e.g. find all photos where place.State = "California") are also much faster but may take a few seconds to sort them all out when there are 20 million of them.

I can now import a 5 million row, 10 column table from a .CSV file in 4 minutes, 45 seconds. This includes opening the file and reading in every row, parsing the ten values on each row, and inserting each value into its corresponding key-value store (a Tag Didget) while de-duping any duplicate values. That time also includes flushing the table to disk.
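
The import steps described above (read a row, parse its values, insert each value into a per-column store while de-duping) can be sketched in C++ as follows. This is only an illustration under simplifying assumptions: the DedupColumn struct and import_csv function are hypothetical stand-ins rather than the Tag Didget implementation, the parser splits on commas only and ignores quoting, and everything stays in memory with no flush to disk.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <sstream>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for one column's key-value store: each distinct value
// is stored once, and every row holds only a small integer id.
struct DedupColumn {
    std::vector<std::string> values;                     // id -> distinct value
    std::unordered_map<std::string, std::uint32_t> ids;  // distinct value -> id
    std::vector<std::uint32_t> rows;                     // row number -> id

    void append(const std::string& v) {
        auto it = ids.find(v);
        if (it == ids.end()) {
            // First time this value is seen: give it the next id.
            it = ids.emplace(v, static_cast<std::uint32_t>(values.size())).first;
            values.push_back(v);
        }
        rows.push_back(it->second);
    }

    const std::string& at(std::size_t row) const { return values[rows[row]]; }
};

// Reads comma-separated lines and appends each field to its column.
// Quoted fields and escapes are deliberately not handled in this sketch.
std::vector<DedupColumn> import_csv(std::istream& in, std::size_t column_count) {
    std::vector<DedupColumn> table(column_count);
    std::string line;
    while (std::getline(in, line)) {
        std::istringstream fields(line);
        std::string field;
        for (std::size_t c = 0; c < column_count; ++c) {
            if (!std::getline(fields, field, ',')) field.clear();  // missing value -> empty
            table[c].append(field);
        }
    }
    return table;
}
```

With this layout, a column that has many repeated values (a state or city column, for instance) stores each distinct string once, and the per-row cost is a 4-byte id.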

I can then perform queries against that table in less than 1 second as well. "SELECT * FROM table WHERE address LIKE '1%'" (i.e. find every person with an address that starts with the number 1) will find all of them (about 500,000 rows) in less than a second.

My original goal was to be able to store 100 million Didgets in a single container and be able to find anything (and everything) that matched any criteria in less than 10 seconds. With the latest code, I have been able to exceed that goal by quite a large margin.