Friday, December 21, 2018

Updates

2018 has been a busy year. It has been some time since I last posted, so I didn't want the year to end without an update.

We formed a new company (Didgets.io) and started a simple web page for it. I have added a ton of features and enhanced many of the previously implemented ones. We successfully found our first two paying customers so we had some modest income this year. We are currently working on adding team members and looking for some working capital to speed up development. We have a number of potential customers that we are currently working with to get them on board.

As every startup founder does, I have to wear multiple hats. The 'Documentation Hat' has been one I have obviously neglected as other tasks have consumed my time.

To catch up, here is a brief list of a few of the most important changes over the past year:

1) Added a lot of Json support to try and capture some of the NoSQL market. Json files can be used to import tables and we can export values and results into a Json file. Since Json supports arrays for every single value, we have been able to test out our 'three dimensional table' features. Each row/column intersection can have multiple values that can be treated separately.

2) Ported everything to Linux and working on the MacOS version too. Updated the build tools to VS2017 and use Qt 5.11.2 for the browser tool. Can now build in both Visual Studio and Qt Creator.

3) Create indexes of sets of text files or of table columns. These are not your typical RDBMS indexes that are used to speed up queries. They are analytical tools to find and analyze patterns in text.

4) Create catalogs of other systems. We can now create Didgets (with associated tags) without importing the data streams based on files in other systems.

5) Enabled 'drill down' analytics from relational table results. If the user has a result set (from either a 'SELECT *' or a more specific query), they can see all the values represented for each column using the 'show values' option. For example: a query against a customers table might display all customers living in California (e.g. 10,000 rows). Right click on the 'city' column header and select the 'show values' option and it will give you a list of all the cities for those 10,000 customers and how many customers are found in each city. If you double click on one cell withing the result set (e.g. 'San Francisco' for row number 3 in the city column) it will pop up a new result set with every customer in that city (e.g. 2500 rows showing each customer in San Francisco). Continuing this process lets the user drill down to more and more specific criteria.

6) Added a set of formulas so our table results can each work much more like a spreadsheet.

7) Added a bunch of transformations to database tables. The user can now modify the data on a column by column basis. We can do things like uppercase, strip punctuation, trim spaces, truncate, replace or split values. The resulting transformations can be placed in entirely new columns within the table.

Monday, September 18, 2017

Latest Videos

I have made a ton of changes since I last recorded some videos showing what our Didget Browser can do, so I decided to make some new ones. I will add new links and descriptions as I record them. Here is what I have so far:


Introduction: (4 minutes) http://youtu.be/NgPTYsb4LRQ?hd=1

This video shows you how to create new containers that hold our Didget objects. It shows how to wipe out a container and start again from scratch. It also shows you how to configure the browser to only show certain features and how to pre-populate a new container with various Didgets.


Creating Database Tables: (5 minutes) http://youtu.be/rM1KEVe7TVc?hd=1

This video shows you how to create relational tables using Didgets. It shows how to create a table definition from scratch. It also shows how to create them using Json or CSV files. Once a definition is created, it can be used to create relational tables. Tables can also be constructed by extracting data from a local or remote database using a connector.


Querying our Database Tables: (7 minutes) http://youtu.be/T_Y2R4DA9UI?hd=1

This video shows how to query the tables; how to JOIN tables; and how to save any results out to a completely separate, persistent table that can be later queried just like any other table.

Monday, August 28, 2017

How Good is Good Enough?

I can be a perfectionist in some areas. I am passionate about speed when it comes to computer algorithms. Even if I have spent a few days getting some critical function to be 10x faster than it was before, I will often still stay up late if I know I can squeeze a few more percentage points out of it.

But I am also keen to the 'Lean Startup' idea of an MVP (Minimal Viable Product) where you build something that is just good enough without waiting until it is perfect before introducing it to the market. So I struggle with how good a particular feature has to be before I say it is good enough and move on to the next task.

The Didget Management System has a lot of very innovative features that set it apart from other kinds of general-purpose data managers. But speed is its greatest 'Wow!' factor. It can do things thousands of times faster than conventional file systems. It can do many database operations 2x, 3x, or 5x faster than the major relational database management systems. It takes full advantage of multi-core CPUs to make even single operations much faster by breaking them up and running pieces in parallel.

Yet, I have found my biggest challenge has been to get people to commit resources (time, money, effort) toward something that has considerable promise if there is any risk involved. I will show them something that is taking them 10 minutes to do using their PostgreSQL Database and can be done on my system in only 2 minutes and yet have trouble getting them to commit. This is not something trivial that is outside the core function of their business...it is something critical, and still they hesitate. I think this is mainly because it requires change - a step into the unknown.

Everyone knows that risk is the biggest enemy of innovation. All but the most trivial innovations required someone to take a chance and put something on the line to 'make it happen'. Business managers almost always remember some initiative that failed because they tried something out of the mainstream. But they rarely, if ever, know how a passed-up opportunity would have played out to their advantage.

I am keenly aware that in order to convince companies to switch to my system, it has to be a great improvement over their existing solution. When I started this project, I set the bar at twice as good. If I didn't think I could build something that was at least twice as fast, twice as convenient, or had twice the power; I would never have gone far into its development. It has far exceeded my expectations to the point that I think it will be at least 10x better than anything else.

So I plod ahead with the hope that eventually this platform will attract the attention of those innovators who will take a good look at the risk/reward ratio and decide the reward is just too great to pass up. We have some companies that are taking a look at it, but most have yet to do more than dip their toe in the water.

As I look over my list of tasks that are yet to be completed, there are many that are 'refinement' tasks or things that will make features that already work, significantly better. There are others that are features that do not yet work at all and need to be implemented. I have taken the approach to do a mixture of things from both groups. Every time I get one or two new things working, I will go back and enhance one thing that worked before. At the end of the month, I can then say that the product does things it never did before but also does a handful of things better than ever.

Tuesday, May 2, 2017

Design Principles

As we have designed and implemented the Didget Management System, we have done a good job so far at adhering to these basic principles:

1) No dependencies. We purposely stayed away from using any thing that may cause dependency issues down the road. We don't use .NET. We don't use third party libraries other than the standard C++ libraries. Our browser application uses Qt but the manager itself does not.

2) Take full advantage of CPU cores/threads. We want our code to run much faster on a CPU with more cores. Large individual operations are often broken up into multiple pieces and run in parallel using separate threads. We are not just running multiple queries simultaneously like many database servers (we do that too), but we can run a single query faster when more cores are available.

3) Use thread-safe code. Because we do so many things at the same time, we want to allow multiple operations on the same set of data to safely run in parallel. Data integrity is very important whether running many separate queries simultaneously or when many threads are running different parts of the same query.

4) Be operating system independent. We want this code to run equally well on Linux, Windows, or OSX. All operating system calls are confined to a single 'Kernel' module which is easily ported to other operating systems.

5) Be faster than anything else. Use things like maps, hash tables, and very fast algorithms written in efficient C++ code to do everything. No interpreted code.

6) Re-use code whenever possible. Often the same function can be used to manipulate a dozen or so different kinds of Didgets. When we fine tune something, it often improves performance in multiple areas.

Progress

I recently left my day job to work on the Didget system full time. My small team has made a number of changes to the project since the last blog post, but progress has been slow when so many other things get in the way. I have now been able to accelerate the development and testing greatly.

Here is a list of some new features that have been added over the past two years.

1) Added ability to do JOIN operations on our database tables.

2) Added direct connectors to external databases so we can create DB tables by querying those databases directly. In earlier versions, the user had to export the data into CSV files from those databases and then import those files into our system.

3) Added ability to transform data within database tables. We can now create new columns that are transformations of other columns. For example we can uppercase, lowercase, substitute, split, combine, convert, truncate, trim, etc.

4) Added a 'Folder' container type so that the data stream for every Didget is stored in a separate file. This helps with testing and lets us do more 'apples to apples' comparisons with file system operations.

5) Added more complex SQL query operations. We can now combine lots of AND and OR operations like "SELECT * FROM myTable WHERE Name LIKE '%son' AND Address ILIKE '%123%' OR ZipCode < 10000;" We made it very easy to create and save these queries.

6) Tuned a bunch of operations to be faster. SQL queries now execute even faster and often require less data to be read from disk. Many of our database queries are now about twice as fast as MySQL or PostgreSQL when performed on the same data set on the same machine. Again, we don't need any indexes to often outperform a fully indexed table in those other database systems.

Tuesday, September 22, 2015

2015 Demo Videos on YouTube

Links to latest demo videos.

10 minute video that shows file stuff as well as database operations:
https://www.youtube.com/watch?v=2uUvGMUyFhY

Shorter, 5 minute video that just shows the database operations, Latest and fastest code in that area is on display here:
https://youtu.be/0X02xpy8ygc

Monday, April 20, 2015

Object Storage

It is no secret that the amount of digital data being collected and stored has exploded over the past decade. With high speed networks, more portable devices, and the coming wave of "the Internet of Things (IoT)", it is unlikely to slow down anytime soon.

Fortunately, there are lots of cheap, high-capacity options for storing all that data. Hard drives, flash memory devices, and even optical and tape offer capacities unheard of in the past. There are numerous ways to bundle all these devices using hardware and software to create huge virtual containers.

Unfortunately, capacity increases and speed increases are not the same curve on the graph. It is simply easier and cheaper to double the capacity of any given device than it is to double its speed. The same is true for the newest flash devices like SSDs. A 1 TB flash drive is not twice as fast as a 500 GB flash drive.

This means it will take longer to read all the data from a device shipped this year than it did to read all the data from a device shipped last year and it will take even longer to read it all in on next year's devices. This makes it more important than ever to improve the way the actual data is stored and managed on those devices through software.

Compression techniques, data de-duping capabilities, and distributed storage solutions can make it easier to handle large amounts of data, but more needs to be done. An effective object manager is needed to handle huge numbers of objects using minimal resources.

To give you an example of the problem, let's consider the default file systems on Windows machines - NTFS. This file system stores a 4096 byte file record for every file in the volume. That might not seem like a lot of space until you get large numbers of files. If you have 10 million files, then you must read in and cache 40 GB worth of file metadata. If you have 100 million files, the amount is 400 GB. For a billion files, it is a whopping 4 TB. Bear in mind that this figure is only for the file record table. All the file names are stored separately in a directory structure.

When large data sets are used, it is very common for any given data operation to only affect a small portion of the overall data set. If you do daily or weekly backups, it is common for only 1% or less of the data to change between those backups. The same is true for synchronizing data sets between devices. Queries typically only need to examine a small portion of the data as well.

Current systems are very inefficient in determining the small subset of data that is needed for the operation; thus too much data must be read and processed. For example, take two separate storage devices that each have a copy of the same 100 TB data set and that they are synchronized once a day. The data on each device changes independently between synchronization operations, but it is rare for more than a few GB to change each day. Using current systems, it might take a few hours and several TB of metadata reads to determine the small amount of data that must be transferred in order to bring the two devices back into perfect synchronization.

What is needed is a system that can quickly read a small amount of metadata from the device and find all the needles in the haystack in record time. This is what the Didget Management System is designed to do. It can store 100 million Didgets in a single container and read in the entire metadata set in under 21 seconds from a cold boot. It only needs to read in and cache 6.4 GB of data to do that and a consumer grade SSD has all the speed it needs to accomplish that query time.