Welcome to the Realm, The World of Didgets<br />
A blog by DidgetMaster<br />
<br />
<br />
YouTube Channel (July 29, 2019)<br />
<br />
<a href="https://www.youtube.com/channel/UC-L1oTcH0ocMXShifCt4JQQ">https://www.youtube.com/channel/UC-L1oTcH0ocMXShifCt4JQQ</a><br />
<br />
I created a couple of new videos that demonstrate the speed and flexibility of the Didget System's database functions. I loaded the Chicago crime data that is available in CSV or JSON format on the city's open data portal. Anyone can download this decent-sized table (nearly 7 million rows and 22 columns).<br />
<br />
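As a rough illustration of this kind of benchmark, here is a Python sketch that times a full-scan query over an in-memory table. The tiny inline sample stands in for the real crime data set, and the column names are assumptions based on the city's portal, not the actual schema:<br />

```python
import csv
import io
import time

def load_rows(csv_text):
    """Parse CSV text into a list of row dicts (a stand-in for the crime table)."""
    return list(csv.DictReader(io.StringIO(csv_text)))

def timed_query(rows, predicate):
    """Run a full-scan filter over the rows and report elapsed wall time."""
    start = time.perf_counter()
    hits = [r for r in rows if predicate(r)]
    elapsed = time.perf_counter() - start
    return hits, elapsed

# A tiny synthetic table; the real data set has ~7 million rows and 22 columns.
sample = "id,primary_type,year\n1,THEFT,2018\n2,BATTERY,2019\n3,THEFT,2019\n"
rows = load_rows(sample)
hits, elapsed = timed_query(rows, lambda r: r["primary_type"] == "THEFT")
print(len(hits))  # 2 matching rows
```

The same harness, pointed at each system's query API, is how the relative timings above can be reproduced.<br />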
I loaded this data into Didgets as well as Postgres and ran some benchmarks. On average, the Didget System could query the table in about a quarter of the time Postgres took. I ran dozens of queries and saw results anywhere from twice as fast up to ten times as fast. I saw similar results against SQL Server and MySQL.<br />
<br />
<br />
Updates (December 21, 2018)<br />
<br />
2018 has been a busy year. It has been some time since I last posted, so I didn't want the year to end without an update.<br />
<br />
We formed a new company (Didgets.io) and started a simple web page for it. I have added a ton of features and enhanced many of the previously implemented ones. We successfully found our first two paying customers, so we had some modest income this year. We are currently adding team members and looking for working capital to speed up development, and we are also working with a number of potential customers to get them on board.<br />
<br />
Like every startup founder, I have to wear multiple hats. The 'Documentation Hat' is one I have obviously neglected as other tasks have consumed my time.<br />
<br />
To catch up, here is a brief list of a few of the most important changes over the past year:<br />
<br />
1) Added a lot of JSON support to try and capture some of the NoSQL market. JSON files can be used to import tables, and we can export values and results into a JSON file. Since JSON allows an array for any value, we have been able to test our 'three-dimensional table' features. Each row/column intersection can hold multiple values that can be treated separately.<br />
<br />
2) Ported everything to Linux, with a macOS version in progress. Updated the build tools to Visual Studio 2017 and Qt 5.11.2 for the browser tool. Everything now builds in both Visual Studio and Qt Creator.<br />
<br />
3) Added the ability to create indexes over sets of text files or table columns. These are not typical RDBMS indexes used to speed up queries; they are analytical tools for finding and analyzing patterns in text.<br />
<br />
4) Added the ability to catalog other systems. We can now create Didgets (with associated tags) for files that live in other systems without importing their data streams.<br />
<br />
5) Enabled 'drill down' analytics on relational table results. Given a result set (from either a 'SELECT *' or a more specific query), the user can see every value represented in each column using the 'show values' option. For example, a query against a customers table might return all customers living in California (say 10,000 rows). Right-click the 'city' column header, choose 'show values', and you get a list of every city among those 10,000 customers along with the number of customers in each. Double-click a single cell within the result set (e.g. 'San Francisco' in row 3 of the city column) and a new result set pops up with every customer in that city (e.g. 2,500 rows showing each customer in San Francisco). Repeating this process lets the user drill down to ever more specific criteria.<br />
<br />
6) Added a set of formulas so each table result can work much more like a spreadsheet.<br />
<br />
7) Added a set of transformations for database tables. The user can now modify data on a column-by-column basis: uppercase, strip punctuation, trim spaces, truncate, replace, or split values. The resulting transformations can be placed in entirely new columns within the table.<br />
<br />
<br />
Latest Videos (September 18, 2017)<br />
<br />
I have made a ton of changes since I last recorded videos showing what our Didget Browser can do, so I decided to make some new ones. I will add new links and descriptions as I record them. Here is what I have so far:<br />
<br />
<br />
Introduction: (4 minutes) <a href="http://youtu.be/NgPTYsb4LRQ?hd=1">http://youtu.be/NgPTYsb4LRQ?hd=1</a><br />
<br />
This video shows how to create new containers that hold our Didget objects, how to wipe out a container and start again from scratch, how to configure the browser to show only certain features, and how to pre-populate a new container with various Didgets.<br />
<br />
<br />
Creating Database Tables: (5 minutes) <a href="http://youtu.be/rM1KEVe7TVc?hd=1">http://youtu.be/rM1KEVe7TVc?hd=1</a><br />
<br />
This video shows how to create relational tables using Didgets. It shows how to build a table definition from scratch, and how to create one from JSON or CSV files. Once a definition exists, it can be used to create relational tables. Tables can also be constructed by extracting data from a local or remote database through a connector.<br />
<br />
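One way a table definition might be derived from a CSV file can be sketched in Python. This illustrates the general idea of sampling rows to guess column types; it is not the browser's actual logic:<br />

```python
import csv
import io

def infer_definition(csv_text, sample_size=100):
    """Guess a column name/type definition from a CSV sample, the way a
    table definition might be derived before creating the table itself."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    # Start by assuming every column is an integer, then demote to text
    # the first time a value fails to parse.
    columns = {name: "INTEGER" for name in header}
    for i, row in enumerate(reader):
        if i >= sample_size:
            break
        for name, value in zip(header, row):
            if columns[name] == "INTEGER":
                try:
                    int(value)
                except ValueError:
                    columns[name] = "TEXT"
    return columns

definition = infer_definition("name,zip\nFred,10001\nSue,94107\n")
print(definition)  # {'name': 'TEXT', 'zip': 'INTEGER'}
```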
<br />
Querying our Database Tables: (7 minutes) <a href="http://youtu.be/T_Y2R4DA9UI?hd=1">http://youtu.be/T_Y2R4DA9UI?hd=1</a><br />
<br />
This video shows how to query the tables, how to JOIN them, and how to save any result out to a completely separate, persistent table that can later be queried just like any other table.<br />
<br />
<br />
How Good is Good Enough? (August 28, 2017)<br />
<br />
I can be a perfectionist in some areas. I am passionate about speed when it comes to computer algorithms. Even after spending a few days making some critical function 10x faster than it was before, I will often still stay up late if I know I can squeeze a few more percentage points out of it.<br />
<br />
But I am also keen on the 'Lean Startup' idea of an MVP (Minimum Viable Product), where you build something that is just good enough and introduce it to the market rather than waiting until it is perfect. So I struggle with how good a particular feature has to be before I call it good enough and move on to the next task.<br />
<br />
The Didget Management System has a lot of very innovative features that set it apart from other kinds of general-purpose data managers. But speed is its greatest 'Wow!' factor. It can do things thousands of times faster than conventional file systems. It can do many database operations 2x, 3x, or 5x faster than the major relational database management systems. It takes full advantage of multi-core CPUs to make even single operations much faster by breaking them up and running pieces in parallel.<br />
<br />
Yet my biggest challenge has been getting people to commit resources (time, money, effort) toward something with considerable promise whenever any risk is involved. I can show them something that takes 10 minutes in their PostgreSQL database and only 2 minutes on my system, and still have trouble getting them to commit. This is not something trivial outside the core function of their business; it is something critical, and still they hesitate. I think this is mainly because it requires change - a step into the unknown.<br />
<br />
Everyone knows that risk is the biggest enemy of innovation. All but the most trivial innovations have required someone to take a chance and put something on the line to 'make it happen'. Business managers almost always remember some initiative that failed because they tried something out of the mainstream, but they rarely, if ever, learn how a passed-up opportunity would have played out to their advantage.<br />
<br />
I am keenly aware that to convince companies to switch to my system, it has to be a great improvement over their existing solution. When I started this project, I set the bar at twice as good: if I didn't think I could build something at least twice as fast, twice as convenient, or twice as powerful, I would never have gone far into its development. It has far exceeded my expectations, to the point that I think it will be at least 10x better than anything else.<br />
<br />
So I plod ahead with the hope that eventually this platform will attract the attention of those innovators who will take a good look at the risk/reward ratio and decide the reward is just too great to pass up. We have some companies that are taking a look at it, but most have yet to do more than dip their toe in the water.<br />
<br />
As I look over my list of remaining tasks, many are 'refinement' tasks that will make already-working features significantly better. Others are features that do not work at all yet and need to be implemented. My approach is to do a mixture from both groups. Every time I get one or two new things working, I go back and enhance one thing that worked before. At the end of the month, I can then say the product does things it never did before, but also does a handful of things better than ever.<br />
<br />
<br />
Design Principles (May 2, 2017)<br />
<br />
As we have designed and implemented the Didget Management System, we have so far done a good job of adhering to these basic principles:<br />
<br />
1) No dependencies. We purposely stayed away from anything that might cause dependency issues down the road. We don't use .NET. We don't use third-party libraries other than the standard C++ library. Our browser application uses Qt, but the manager itself does not.<br />
<br />
2) Take full advantage of CPU cores/threads. We want our code to run much faster on a CPU with more cores. Large individual operations are often broken into multiple pieces and run in parallel on separate threads. We are not just running multiple queries simultaneously like many database servers (we do that too); we can run a single query faster when more cores are available.<br />
<br />
3) Use thread-safe code. Because we do so many things at the same time, we want to allow multiple operations on the same set of data to safely run in parallel. Data integrity is very important whether running many separate queries simultaneously or when many threads are running different parts of the same query.<br />
<br />
4) Be operating-system independent. We want this code to run equally well on Linux, Windows, or macOS. All operating-system calls are confined to a single 'Kernel' module that is easily ported to other operating systems.<br />
<br />
5) Be faster than anything else. Use things like maps, hash tables, and very fast algorithms written in efficient C++ code to do everything. No interpreted code.<br />
<br />
6) Re-use code whenever possible. Often the same function can manipulate a dozen or so different kinds of Didgets, so when we fine-tune something, it often improves performance in multiple areas.<br />
<br />
<br />
Progress (May 2, 2017)<br />
<br />
I recently left my day job to work on the Didget system full time. My small team has made a number of changes to the project since the last blog post, but progress was slow while so many other things got in the way. I have now been able to accelerate development and testing greatly.<br />
<br />
Here is a list of some new features that have been added over the past two years.<br />
<br />
1) Added ability to do JOIN operations on our database tables.<br />
<br />
2) Added direct connectors to external databases so we can create DB tables by querying those databases directly. In earlier versions, the user had to export the data into CSV files from those databases and then import those files into our system.<br />
<br />
3) Added the ability to transform data within database tables. We can now create new columns that are transformations of other columns. For example, we can uppercase, lowercase, substitute, split, combine, convert, truncate, or trim values.<br />
<br />
4) Added a 'Folder' container type so that the data stream for every Didget is stored in a separate file. This helps with testing and lets us do more 'apples to apples' comparisons with file system operations.<br />
<br />
5) Added more complex SQL query operations. We can now combine many AND and OR operations, like "SELECT * FROM myTable WHERE Name LIKE '%son' AND Address ILIKE '%123%' OR ZipCode < 10000;". We made it very easy to create and save these queries.<br />
<br />
6) Tuned a bunch of operations to be faster. SQL queries now execute even faster and often require less data to be read from disk. Many of our database queries are now about twice as fast as MySQL or PostgreSQL on the same data set on the same machine. Again, we often don't need any indexes to outperform a fully indexed table in those other database systems.<br />
<br />
<br />
2015 Demo Videos on YouTube (September 22, 2015)<br />
<br />
Links to the latest demo videos.<br />
<br />
10 minute video that shows file stuff as well as database operations:<br />
<a href="https://www.youtube.com/watch?v=2uUvGMUyFhY" rel="nofollow" style="background-color: white; color: #022db7; font-family: verdana, arial, 'times new roman'; font-size: 12px; line-height: 17px; margin-top: 0px; padding: 0px 0px 5px;">https://www.youtube.com/watch?v=2uUvGMUyFhY</a><br />
<br />
A shorter, 5-minute video that shows just the database operations. The latest and fastest code in that area is on display here:<br />
<a href="https://youtu.be/0X02xpy8ygc" rel="nofollow" style="background-color: white; color: #022db7; font-family: verdana, arial, 'times new roman'; font-size: 12px; line-height: 17px; margin-top: 0px; padding: 0px 0px 5px;">https://youtu.be/0X02xpy8ygc</a>DidgetMasterhttp://www.blogger.com/profile/17264703189806219219noreply@blogger.com0tag:blogger.com,1999:blog-266319390042654325.post-16467025323302487122015-04-20T09:44:00.000-07:002015-04-20T09:44:08.748-07:00Object StorageIt is no secret that the amount of digital data being collected and stored has exploded over the past decade. With high speed networks, more portable devices, and the coming wave of "the Internet of Things (IoT)", it is unlikely to slow down anytime soon.<br />
<br />
Fortunately, there are lots of cheap, high-capacity options for storing all that data. Hard drives, flash memory devices, and even optical and tape media offer capacities unheard of in the past. There are numerous ways to bundle these devices together, using hardware and software, to create huge virtual containers.<br />
<br />
Unfortunately, capacity increases and speed increases do not follow the same curve. It is simply easier and cheaper to double the capacity of a given device than to double its speed. The same holds for the newest flash devices: a 1 TB SSD is not twice as fast as a 500 GB one.<br />
<br />
This means it takes longer to read all the data from a device shipped this year than it did from last year's device, and it will take longer still on next year's. That makes it more important than ever to improve, through software, the way data is stored and managed on those devices.<br />
<br />
Compression techniques, data de-duping capabilities, and distributed storage solutions can make it easier to handle large amounts of data, but more needs to be done. An effective object manager is needed to handle huge numbers of objects using minimal resources.<br />
<br />
To give you an example of the problem, consider the default file system on Windows machines: NTFS. It stores a 4096-byte file record for every file in the volume. That might not seem like a lot of space until you get to large numbers of files. With 10 million files, you must read in and cache 40 GB worth of file metadata. With 100 million files, it is 400 GB. For a billion files, it is a whopping 4 TB. Bear in mind that this figure covers only the file record table; all the file names are stored separately in a directory structure.<br />
<br />
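The arithmetic behind those figures is easy to check:<br />

```python
record_size = 4096  # bytes per file record, as in the example above

def metadata_bytes(file_count):
    """Total size of the file-record table for a volume with this many files."""
    return file_count * record_size

gb = 1000 ** 3  # the post uses decimal GB/TB
print(metadata_bytes(10_000_000) / gb)           # 40.96 -> the ~40 GB figure
print(metadata_bytes(100_000_000) / gb)          # 409.6 -> the ~400 GB figure
print(metadata_bytes(1_000_000_000) / (1000 ** 4))  # ~4.1 TB for a billion files
```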
When large data sets are used, it is very common for any given data operation to only affect a small portion of the overall data set. If you do daily or weekly backups, it is common for only 1% or less of the data to change between those backups. The same is true for synchronizing data sets between devices. Queries typically only need to examine a small portion of the data as well.<br />
<br />
Current systems are very inefficient at determining the small subset of data needed for an operation, so too much data must be read and processed. For example, take two separate storage devices that each hold a copy of the same 100 TB data set and are synchronized once a day. The data on each device changes independently between synchronization operations, but it is rare for more than a few GB to change each day. Using current systems, it might take a few hours, and several TB of metadata reads, to determine the small amount of data that must be transferred to bring the two devices back into perfect synchronization.<br />
<br />
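The core of a faster synchronization is comparing compact per-object metadata rather than the data itself. A minimal Python sketch, using a hypothetical id-to-checksum snapshot for each device:<br />

```python
def changed_objects(local_meta, remote_meta):
    """Compare two compact metadata snapshots (object id -> content hash)
    and return the ids whose data must be transferred to the remote side.
    Running the comparison in the other direction finds remote-only changes."""
    changed = []
    for obj_id, stamp in local_meta.items():
        if remote_meta.get(obj_id) != stamp:
            changed.append(obj_id)
    return changed

# Toy snapshots: only the few-byte hashes are read, not the 100 TB of data.
local = {"a": "h1", "b": "h2", "c": "h3"}
remote = {"a": "h1", "b": "OLD", "d": "h4"}
print(changed_objects(local, remote))  # ['b', 'c']
```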
What is needed is a system that can quickly read a small amount of metadata from the device and find all the needles in the haystack in record time. This is what the Didget Management System is designed to do. It can store 100 million Didgets in a single container and read in the entire metadata set in under 21 seconds from a cold boot. It only needs to read and cache 6.4 GB of data to do so, and a consumer-grade SSD has all the speed needed to hit that time.<br />
<br />
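Those numbers imply a very compact metadata record; a quick check of the arithmetic:<br />

```python
didget_count = 100_000_000
metadata_total = 6.4e9   # bytes cached to cover the whole container
load_seconds = 21

bytes_per_didget = metadata_total / didget_count
throughput_mb = metadata_total / load_seconds / 1e6
print(bytes_per_didget)      # 64.0 bytes of metadata per Didget
print(round(throughput_mb))  # ~305 MB/s, within consumer-SSD range
```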
<br />
Update (April 3, 2015)<br />
<br />
I have been busy (whenever I get a spare minute) updating the Didget management software. It now has a number of new features and has received some significant speed improvements.<br />
<br />
1) I have converted all the internal container structures (bitmaps, tables, fragment lists) to be Didgets. This means I can leverage all the Didget management code for these structures, and I can also expose these kinds of Didgets for external use. For example, an application can now create a "Bitmap Didget", store a few billion bits in it, and use its API to set and clear ranges of bits. I use them internally to keep track of all the free and used blocks within the container, but others could use them for a number of other purposes.<br />
<br />
2) I have built the library and the browser interface using the latest tools. It is now a 64 bit application that will run on Windows 10. I built it using Visual Studio 2013 and the Qt 5.4 libraries. Some speed improvements have come from better code in these products. Now that it is 64 bit, I can allocate more than 4 GB of RAM and can do some benchmarks using extremely large data sets.<br />
<br />
3) I have multi-threaded the underlying Didget manager code. This allows operations on large data sets to be split into several pieces and distributed across the multiple cores found in most modern CPUs. I have seen significant performance boosts here as well.<br />
<br />
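The chunk-and-scan approach in item 3 can be sketched as follows. This is an illustration in Python of the structure only; the real implementation is multi-threaded C++, and CPython's GIL means Python threads would not actually speed up a CPU-bound scan like this:<br />

```python
from concurrent.futures import ThreadPoolExecutor

def scan_chunk(rows, predicate):
    """Scan one slice of the data set; each worker handles one slice."""
    return [r for r in rows if predicate(r)]

def parallel_query(rows, predicate, workers=4):
    """Split a single full scan into chunks and run them on a worker pool."""
    size = max(1, len(rows) // workers)
    chunks = [rows[i:i + size] for i in range(0, len(rows), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(lambda c: scan_chunk(c, predicate), chunks)
    # map() preserves chunk order, so the merged result keeps row order.
    return [r for chunk in results for r in chunk]

rows = list(range(1000))
hits = parallel_query(rows, lambda r: r % 7 == 0)
print(len(hits))  # 143 multiples of 7 below 1000
```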
To give you some idea of the speed improvements, I will now post some results of tests I have been running. I hope to post a video soon showing these tests in action.<br />
<br />
These tests were run on my newer (1.5-year-old) desktop machine. It has an Intel i7-3770 processor (fairly quick, but by no means the fastest out there), 16 GB of RAM, and a 64 GB SSD.<br />
<br />
I can now create over 100 million Didgets in a single container. Queries that do not inspect tags (e.g. find all JPEG photos) complete in under 1 second even if the query matches 20 million Didgets. Queries that inspect tags (e.g. find all photos where place.State = "California") are also much faster but may take a few seconds to sort them all out when there are 20 million of them.<br />
<br />
I can now import a 5 million row, 10 column table from a .CSV file in 4 minutes, 45 seconds. This includes opening the file and reading every row; parsing the ten values on each row; and inserting each value into its corresponding key-value store (a Tag Didget) while de-duplicating repeated values. That time also includes flushing the table to disk.<br />
<br />
I can then perform queries against that table in less than 1 second as well. "SELECT * FROM table WHERE address LIKE '1%'" (i.e. find every person with an address that starts with the number 1) will find all of them (about 500,000 rows) in less than a second.<br />
<br />
My original goal was to store 100 million Didgets in a single container and be able to find anything (and everything) matching any criteria in less than 10 seconds. With the latest code, I have exceeded that goal by quite a large margin.<br />
<br />
<br />
Quick links to all the videos so far... (July 6, 2014)<br />
<br />
<a href="http://screenr.com/xKmN">Database 1</a><br />
<a href="http://screenr.com/AKmN">Database 2</a><br />
<br />
<a href="http://www.screenr.com/XV17">Didgets 1</a><br />
<a href="http://www.screenr.com/fXd7">Didgets 2</a><br />
<a href="http://www.screenr.com/5Zx7">Didgets 3</a>DidgetMasterhttp://www.blogger.com/profile/17264703189806219219noreply@blogger.com0tag:blogger.com,1999:blog-266319390042654325.post-36818111809374501022014-07-06T23:28:00.000-07:002014-07-06T23:28:49.348-07:00Relational Database Tables Using DidgetsIt has been nearly a year since I last posted, but I haven't been lazy. I have been busy in my spare time improving the performance of the database operations and adding lots of new features. The relational database table operations are significantly faster now (as are the tag look-ups for Didgets).<br />
<br />
A couple of posts back, "Another Piece to the Puzzle" reported that querying a 1 million row table with 6 columns took 25 seconds. Now I can query a 10 column table (also with a million rows) in under a second. The time it takes to import a .CSV file has also been greatly reduced.<br />
<br />
I can now import a 5 million row, 10 column table in 1 minute and 6 seconds. Most queries against that table now take at most about 2 seconds.<br />
<br />
See a video demonstration of queries against the 1 million row table at <a href="http://screenr.com/xKmN">http://screenr.com/xKmN</a><br />
<br />
Since every column in each table is stored within a pair of Tag Didgets, they each become a separate key/value store. All values are de-duped, so if the column "First Name" has 10,000 rows with the value of "Fred", the actual string value is only stored once with 10,000 references to it. The user can select the Tag Didget containing all the values and view them along with the reference count for each.<br />
<br />
Each row in a database table is just a set of values from the separate key/value stores that all map to the same key (e.g. the row number).<br />
<br />
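A toy Python analog of such a de-duplicated column store (not the actual Tag Didget implementation) might look like this:<br />

```python
class ColumnStore:
    """Toy analog of a column held in a Tag Didget: each distinct value is
    stored once, with a reference count and a row-number -> value mapping."""

    def __init__(self):
        self.values = {}   # distinct value -> reference count
        self.rows = {}     # row number -> distinct value

    def insert(self, row, value):
        self.values[value] = self.values.get(value, 0) + 1
        self.rows[row] = value

first_name = ColumnStore()
for row, name in enumerate(["Fred", "Sue", "Fred", "Fred"]):
    first_name.insert(row, name)

print(first_name.values["Fred"])  # 3 references to one stored string
print(first_name.rows[2])         # row 2 reads back as 'Fred'
```

A full row is then assembled by looking up the same row number in each column's store.<br />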
Another video demonstration of ways to view all the values in each key/value store is at <a href="http://screenr.com/AKmN">http://screenr.com/AKmN</a><br />
<br />
We have several additional features that we are working on, but this will give you a taste of how fast our database inserts and queries are so you can compare this against existing database managers.<br />
<br />
Remember, we are not running on top of MySQL, SQLite, PostgreSQL, or any other commercial or open-source database manager. This all runs on the Didget Management System alone. You get all this functionality without needing to install a separate RDBMS.<br />
<br />
<br />
Data Managers (July 13, 2013)<br />
<br />
Over the years, a number of systems have been created to help users manage their data. I call these systems "data managers". They come in two types: primary data managers and secondary data managers.<br />
<br />
Primary data managers are very general-purpose in nature and are widely adopted in the computing world. File systems, databases, and web servers fall into this category. More recent members of this category include distributed file systems like Hadoop and cloud offerings like Amazon S3. These newer systems are gaining greater acceptance as "Big Data" becomes more pervasive and as users demand more mobile access to all their data.<br />
<br />
Secondary data managers are generally more specialized in the types of data they manage, and they almost always rely on a primary data manager to store their underlying data. Examples include Apple's iTunes for managing music and Google's Picasa for managing photos. They typically keep most of their unstructured data as files in a file system and create a proprietary database for the extra metadata. They may also integrate with cloud services to give the user a virtual view of data spread across several systems. Unfortunately, secondary data managers are nearly always in danger of interference from other programs and must rely on the security measures of the primary data manager. If another application deletes, moves, or renames files they manage, they can have trouble reconciling those changes; if another program deletes one of their core metadata files (i.e. the database), they can fail completely.<br />
<br />
The Didget Management System is a primary data manager. It not only provides functionality that previous data managers lack; it was designed to supplant them. That is very different from other primary data managers such as databases, which were designed to manage structured data in ways file systems never could, but were never meant to handle unstructured data well enough to make file systems unnecessary. One consequence of that strategy is that as each new primary data manager entered the market, we ended up with yet another "silo" for a portion of our data.<br />
<br />
That is why I designed the Didget Management System to manage both structured and unstructured data well. It is designed to manage that data in both simple configurations and in distributed cluster environments. When the amount of data grows from a few thousand pieces of information to billions of pieces utilizing petabytes of storage, there will not be a costly transition point where all the existing data must be migrated to an entirely new system. If we are successful, new data will not only be created as Didgets instead of as files or traditional database tables, but all the old data will be converted to Didgets as well. Our goal is to replace those other primary data managers completely.<br />
<br />
In order to realize that goal, the Didget Management System has to perform all the critical data management functions of the systems it replaces, in addition to offering its new feature set. And it cannot be just 5%, 10%, or even 50% better; it has to be at least TWICE as good as the old system. When I designed it, that was my minimum threshold: if I couldn't make it dramatically better, it would not gain widespread adoption, would likely end up a narrow niche product, and would not be worth the effort.<br />
<br />
Fortunately, the design has proven to work so well that I think we have not merely met that 2x threshold but greatly exceeded it. I would not be surprised if, once all the features are fully implemented, we have a system that is 10x better than those other systems. That does not mean we will do everything 10x better than every feature in those systems (for example, we will not read a Didget from disk ten times faster than a file system reads a file), but rather that the system will be that much better overall when performance, feature set, ease of use, security, and flexibility are all considered.<br />
<br />
<br />
Another Piece to the Puzzle (April 30, 2013)<br />
<br />
Didgets provide new and innovative ways to store, search, and organize unstructured data that would normally live in files. They have also proven useful for storing structured data well suited to a NoSQL database. The missing piece was using them for structured data traditionally stored in tables in a relational database management system (RDBMS) and accessed via the Structured Query Language (SQL).<br />
<br />
Since our tags had been effective at implementing NoSQL columns in a sparse table, we decided to use them to implement a regular relational table. While I had little hope of matching the performance of a finely tuned RDBMS like MySQL, I at least wanted something acceptable that might offer a few unique features or an easier way to manage the data.<br />
<br />
To my surprise, it has not only matched the performance of MySQL in preliminary tests; it was 17% faster on many queries. I created a table with six columns and inserted 1 million rows of random data to test each system. Using the MyISAM storage engine under MySQL, a "SELECT *" query took 30 seconds on my old test machine. The same query using the Didget Management System took only 25 seconds.<br />
<br />
If I switched the storage engine under MySQL to InnoDB, the same query took 1 minute and 20 seconds. I was surprised that the InnoDB engine with transactional support was so much slower than MyISAM for this simple query. I have yet to implement the transaction feature using Didgets, so I could not do a comparison test, but I am confident our transaction overhead will not be as dramatic as it was under MySQL.<br />
<br />
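For reference, the arithmetic behind those comparisons. The quoted 17% figure reads as the reduction in query time relative to MyISAM:<br />

```python
myisam, didget, innodb = 30.0, 25.0, 80.0  # seconds for the same query

time_saved = (myisam - didget) / myisam
print(round(time_saved * 100, 1))  # 16.7 -> the ~17% figure quoted above

print(myisam / didget)  # 1.2x speedup over MyISAM
print(innodb / didget)  # 3.2x speedup over InnoDB
```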
I am also confident that the Didget Management System will provide a very easy mechanism to create, query, and share database tables. It will also be much easier to administer since we can provide lightning fast queries without having to index columns or do complicated joins across multiple tables.<br />
<br />
In essence, the Didget Management System is a radically different architecture from the traditional RDBMS way of storing structured data in multiple tables. Since development of the database features is still in its infancy, there is much work yet to be done, but I am confident that this will become a major feature of our new general-purpose data management system.<br />
<br />
Stay tuned for further developments....DidgetMasterhttp://www.blogger.com/profile/17264703189806219219noreply@blogger.com0tag:blogger.com,1999:blog-266319390042654325.post-46004383975862155652013-02-24T20:55:00.000-08:002013-02-24T20:55:11.636-08:00Connecting the DotsIf you look up at the sky on a moonless night, far away from any city lights, you will see many thousands of individual stars. An asterism is a group of those stars that can be connected together in our minds to form a stick figure. Constellations are ancient asterisms that gained popular names like Virgo or Ursa Major. Other asterisms that just make up a portion of a constellation have also been given popular names like "The Big Dipper" or "Orion's Belt". People who stargaze and either find some of these popular asterisms or form their own are looking for "patterns" among the thousands of stars.<br />
<br />
Searching for patterns is also common when we deal with all the data that exists as individual files or database records on our hard drives, flash memory cards, or DVDs. Sometimes these patterns are already established for us. A popular software package may consist of a dozen separate executable files along with their configuration files and documentation. They are often copied into one or two folders or directories during installation to keep them together. Sometimes installation programs copy them into common folders like /usr/bin, where they get mixed in with other programs, making it hard to sort out which files belong to which program.<br />
<br />
But even files that seem to be completely independent of other kinds of data (e.g. a photo or a song) can often be grouped together with other files to form ad hoc groups (e.g. a photo album or a music album). We are constantly trying to make connections between different data points to form new and interesting patterns. Facebook and other social media sites provide mechanisms to form some of these patterns. A user posts messages, pictures, documents, videos, and other personal information in order to tell a story about their life, their interests, and their friends. It is the connections between lots of individual pieces of data that can lead to new interactions and help us make decisions.<br />
<br />
The current trend in "Big Data" and various forms of analytics is all about finding patterns in large amounts of data to drive business decisions. Analyze a million customer orders to look for patterns of shopping behaviour when it is cold outside in order to figure out what items to put on sale when the next big storm hits. Analyze emails sent by everyone over 65 years old in Florida to figure out what political messages will most likely sway the most voters.<br />
<br />
The trick to establishing meaningful patterns among millions or billions of individual data points lies in the ability to quickly analyze each point and determine if it has a significant connection to another point. The system that is used to store the information is a critical component to being able to quickly check lots of data points for a certain condition in order to sift the wheat from the chaff. The system must not only be able to match things like strings or numbers, but it must provide some kind of context in order to make more meaningful connections.<br />
<br />
For example, if someone wanted to analyze a group of messages to gain intelligence about military hardware, the word "Tank" would be a meaningful keyword to search for. However, such a "brute force" search might turn up every message that deals with water tanks, gas tanks, and R&B music. It is much more meaningful if the search was conducted using "Vehicle=Tank" instead.<br />
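The difference between the two searches can be sketched in a few lines of Python (the messages and tags below are invented for illustration; the Didget System's actual query API is not shown here):

```python
# Hypothetical messages: raw text plus a context tag of the kind the
# post describes ("Vehicle=Tank"). All data here is made up.
messages = [
    {"text": "The water tank on the roof is leaking", "tags": {}},
    {"text": "Fill the gas tank before the trip",      "tags": {}},
    {"text": "Armored column spotted near the border", "tags": {"Vehicle": "Tank"}},
]

# Brute-force keyword search: matches every use of the word "tank".
keyword_hits = [m for m in messages if "tank" in m["text"].lower()]

# Context-aware search: matches only data tagged Vehicle=Tank.
tag_hits = [m for m in messages if m["tags"].get("Vehicle") == "Tank"]

print(len(keyword_hits))  # 2 -- water tank and gas tank, both irrelevant
print(len(tag_hits))      # 1 -- the actual military vehicle
```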
<br />
The Didget Management System was designed to not only manage large numbers of data points, but to also aid in making connections between points in order to find new patterns. By attaching many searchable tags to any given piece of data and by providing context for every single tag, the system makes it easy to find all the data that share a common attribute. It can also rank various connections between any two points based on the number of attributes they share in order to give hints about more relevant connections.<br />
<br />
Big Data Analytics is all about finding hidden patterns and unknown correlations in large amounts of data. This means that specialized queries must be conducted against all that data to try to find meaningful patterns. When the data is created and stored, the nature of such queries is largely unknown. In other words, the data must be stored in a way that keeps the widest possible variety of potential queries practical.<br />
<br />
The speed at which a query can execute is a major factor in finding that "needle in a haystack". If a big data set consists of 10 billion data points and every query takes several hours to complete, then it becomes very hard to conduct lots of different types of queries, looking for a pattern. If, on the other hand, such a query can execute in a minute or less, then it becomes practical to conduct a wide variety of queries hoping that a meaningful pattern just "pops out in front of you".<br />
<br />
Several other "big data" technologies like Hadoop MapReduce, HBase, Cassandra, and MongoDB are designed to be spread across a cluster of nodes so that the processing of data can occur in parallel. This can greatly reduce the time necessary to perform a query. Such systems can be very complex to set up and administer, however. Our system has been designed to greatly simplify such configurations.<br />
<br />
But finding patterns should not be exclusive to large companies with big data sets. Individual users could greatly benefit from finding meaningful patterns among a few million pieces of information. If I got a message from Mary about her vacation in Hawaii, it would be helpful if there were an "about" button next to her name that, when pushed, would bring up a list of every message, photo, and document that she had sent me or that was about her. Likewise, it would be helpful if the message itself had hyperlinks that, when clicked, would bring up my own photos of Hawaii or information about scuba diving or whale watching. These links could be generated automatically by the system based on tags already present on other Didgets.DidgetMasterhttp://www.blogger.com/profile/17264703189806219219noreply@blogger.com0tag:blogger.com,1999:blog-266319390042654325.post-51250592723239883552013-01-19T21:52:00.000-08:002014-10-24T15:10:50.308-07:00Silos of InformationAs I stated earlier, the Didget Management System was designed to offer an alternative to conventional data management systems that tend to manage just a subset of all data and to build walls around any extra metadata they may generate. With such systems, a given set of a few million pieces of data (files and/or database rows) will often be fragmented into several of these "silos of information".<br />
<br />
To illustrate this using a real world example, consider the following:<br />
<br />
A user has a 2TB hard drive nearly full of data. Since the average size of each piece of data on the drive is about 1 million bytes, this represents nearly 2 million different files. Out of all that data, there exists three important pieces of information that are from the user's friend Bob. Bob has sent the user an email; the user has taken a picture of Bob and transferred that picture from his camera to the hard drive; and Bob has also authored a document that the user downloaded from Bob's web site.<br />
<br />
The email was transferred from the email server to the user's computer running Windows 7 by Microsoft Outlook and stored in a .pst file somewhere in the file system hierarchy. The picture was imported into Google's Picasa photo manager and was "tagged" with "Bob" using their facial recognition feature. This tag was embedded within the .jpg file using the EXIF space reserved for such tags. The document was also stored somewhere on the file system and the user set an extended attribute of "Author=Bob" on the document file using a special document manager program.<br />
<br />
Now the user wants to do a general search for everything on his drive that has to do with his friend Bob and hopes to come up with all three pieces of information. He wants a program that will comb through all his data and find those pieces.<br />
<br />
1) The program will prompt the user for what to base the search on. The user just types "Bob" since there is no standard schema that helps identify "Bob" as a person's first name.<br />
<br />
2) The program must now be able to do a complete folder hierarchy traversal, looking for any instances of the string "Bob". It might find a bunch of files that contain "Bob" in their file name, like "bobset.exe" or "stuff.bob". It would need to show those to the user since it doesn't know if they might be relevant.<br />
<br />
3) For every file the program searches, it would need to peek at its extended attributes to see if any contained the word "Bob". As with matching file names, it would need to display a file called "Photo1.jpg" that had an extended attribute "Activity=Bobbing for apples". For every .jpg file, it would need to know how to open and search the EXIF data portions, also looking for any tags that might have "Bob" in them.<br />
<br />
4) It would need to be able to parse through any .pst files by following the Microsoft specification, looking for any emails that might come from, or be about Bob.<br />
<br />
Each of these things represents a different "silo" of information that would need to be accessed and understood by the program doing the search. The .pst database file; the file extended attributes; and the .jpg - EXIF file format information are examples of these silos. There are many other silos like .db files, html or xml files, registry files, INI files, and .doc files. Accessing each of them requires knowledge about their format and the rules for parsing them.<br />
<br />
If, instead of using those systems to store data, Microsoft's Outlook, Google's Picasa, and the document manager were all built on top of the Didget Management System, then things could be much simpler. The email could be stored in a "Message Didget", the picture could be stored in a "Photo Didget", and the document would be stored in a "Document Didget". Each of these Didgets would have a tag ".person.FirstName = Bob" attached to it.<br />
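A minimal sketch of that unified query, with an in-memory list standing in for the Chamber (the `query` helper and the sample Didgets are hypothetical; only the Didget type names and the ".person.FirstName" tag come from the example above):

```python
# Three different kinds of Didgets, all carrying the same tag, plus one
# near-miss that a substring search would wrongly match.
didgets = [
    {"type": "Message",  "tags": {".person.FirstName": "Bob"}},
    {"type": "Photo",    "tags": {".person.FirstName": "Bob"}},
    {"type": "Document", "tags": {".person.FirstName": "Bob"}},
    {"type": "Photo",    "tags": {".activity.Name": "Bobbing for apples"}},
]

def query(didgets, tag, value):
    """Return every Didget whose given tag exactly equals value."""
    return [d for d in didgets if d["tags"].get(tag) == value]

about_bob = query(didgets, ".person.FirstName", "Bob")
print(len(about_bob))  # 3 -- and no false positive from "Bobbing for apples"
```

One exact-match pass over a single tag namespace replaces the four silo-specific search steps listed above.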
<br />
Now any application could look for stuff about Bob without missing anything or getting all kinds of unintentional results. It would also find all three items in less than 1 second instead of the painfully slow search in the previous example.DidgetMasterhttp://www.blogger.com/profile/17264703189806219219noreply@blogger.com0tag:blogger.com,1999:blog-266319390042654325.post-27625047416637963162013-01-06T22:26:00.000-08:002013-01-06T22:26:57.509-08:00A Non-Hierarchical Data Management SystemThose of you who have been following this blog may be wondering why I have never called the Didget Management System an "Object Store". This has been intentional, since I believe that Didgets offer many features that other kinds of persistent objects simply do not, and I wanted to avoid confusion. Once someone hears the word "Object" they tend to get all kinds of notions in their head about what a Didget is and what it should do.<br />
<br />
But the reality is that a Didget has more in common with an object than with a traditional file. One of the things that really sets our system apart from other kinds of object stores like Amazon S3, is that we are designing it to be a replacement for local file systems as well as a cloud storage system. If we are successful, in another ten years all the data stored on your laptop, desktop, mobile device, and in various cloud storage containers will be in the form of Didgets.<br />
<br />
Although it will be very easy to overlay our system with a traditional hierarchical namespace to provide backward compatibility with legacy systems, our native storage design is anything but hierarchical. This is somewhat like having a file system volume with just a single folder (i.e. root) where every file in the volume is stored. Since files use simple names as their unique identifier, such a system is impractical (if not impossible) for a traditional file system with millions of files stored on it. With Didgets it is not only possible, but very practical to store a hundred million of them within a single container without a hierarchical naming model.<br />
<br />
Every Didget within a container has its own unique 64-bit ID that is used to access it. For systems that interact with data without needing its identifier to be in human readable form, it is easier and faster to store IDs as numbers rather than names like "C:\Windows\System\Drivers\adpahci.sys". With our system, it is also very easy to find groups of Didgets that match given criteria or to narrow down a simple search to find the single Didget you are looking for.<br />
<br />
But we have hierarchical namespaces for a reason, so it is worth reviewing how we got here.<br />
<br />
With file systems, a file's identifier is its name. You create, open, move, copy, and delete a file by passing its name into a file API function. Since its name is its identifier, you can't have two different files with the exact same name. This means you have to come up with a unique name for every file - a task that gets increasingly harder as the number of files increases. Early file systems like FAT that were not case sensitive and restricted names to 8.3 format made this task even harder. Even with long file name support that allows very specific and descriptive names, devices like cameras tend to create your pictures with names like "Photo_001.jpg", "Photo_002.jpg", "Photo_003.jpg", etc.<br />
<br />
To get around naming conflicts and to add a simple categorization facility, file system designers came up with a hierarchical directory (or folder) model. A file's name only needed to be unique within a given folder, and its full path name became its unique identifier. The file name and folder name could be easily human readable and provide clues for navigation that were intuitive for many users. The folder system also made it easy to copy, move, or delete whole folders or entire folder trees using simple commands.<br />
<br />
But file names and folder hierarchies have a number of problems associated with them. Changing the name of any file or any folder in its path will change its unique identifier and thus invalidate any stored references to it. The human readable names cannot be translated from one language to another without causing the same problem. An unprotected file might have its contents overwritten by a completely unrelated file that just happens to have the same name. If I want to store photos I have downloaded, I might have both a "/home/photos/download" and a "/home/download/photos" folder and have files in both - causing confusion.<br />
<br />
Didgets operate in a completely different manner. Each Didget can have a name (or multiple names) attached to it as a name tag. When a file is converted to a Didget, each folder name may be attached to it as a separate folder tag. Unlike file paths, the ordering of tags doesn't make any difference. So if we were to overlay a hierarchical namespace on the Didget system, a command like "ls /home/andy/documents/projects/projectX/*" would give the same results as "ls /documents/andy/projectX/projects/home/*".<br />
<br />
You could leave out folder tags in a search and just get more results. For example: "ls /andy/*.jpg" would return all the JPG photos that were stored in any path that had "andy" as one of its folders. New folder tags could be added or existing tags deleted at any time without having "moved" the Didget or changing its unique identifier in any way. Existing tags can also be modified or translated to another language with the same lack of consequences.<br />
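The order-independent matching described above can be sketched by treating folder tags as a set (the `ls` helper and the sample Didgets below are invented for illustration):

```python
# Each Didget carries its folder names as an unordered set of tags.
didgets = [
    {"name": "plan.doc",  "folders": {"home", "andy", "documents", "projects", "projectX"}},
    {"name": "beach.jpg", "folders": {"home", "andy", "photos"}},
    {"name": "logo.jpg",  "folders": {"work", "andy", "photos"}},
]

def ls(didgets, path):
    """Match every Didget carrying all folder tags in the path, in any order."""
    wanted = {p for p in path.strip("/").split("/") if p}
    return sorted(d["name"] for d in didgets if wanted <= d["folders"])

# The ordering of the folder components makes no difference:
assert ls(didgets, "/home/andy/documents/projects/projectX") == \
       ls(didgets, "/documents/andy/projectX/projects/home")

# Leaving folder tags out simply widens the result set:
print(ls(didgets, "/andy"))          # ['beach.jpg', 'logo.jpg', 'plan.doc']
print(ls(didgets, "/andy/photos"))   # ['beach.jpg', 'logo.jpg']
```

Because matching is a subset test rather than a string comparison, renaming or translating a tag changes nothing about how the remaining tags resolve.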
<br />
Such a system provides a much more flexible mechanism for categorizing and finding data. As previous posts have shown, we can find all matching Didgets much faster than conventional file systems can find matching files.DidgetMasterhttp://www.blogger.com/profile/17264703189806219219noreply@blogger.com0tag:blogger.com,1999:blog-266319390042654325.post-57707945692136480042012-12-15T22:55:00.001-08:002012-12-15T22:55:42.089-08:00Cloud Based SolutionsWhen I bought my first computer back in 1986, I splurged for the 10 MB hard drive option. It cost nearly $800 and was incredibly slow by today's standards, but compared to the rest of my data storage (a handful of 5 1/4 inch floppy disks) it was a huge leap forward. That hard drive and my floppies together totaled less than 20 MB and comprised my entire data storage capacity.<br />
<br />
As time went on, I replaced each of my storage devices with larger capacity and faster units. Sometimes when I bought a new device, it became a completely separate storage system instead of just replacing an existing one. Today, I have over 20 different storage devices (hard drives, flash drives and cards, NAS boxes, SSDs, and cloud buckets), each with a set of files stored on it. Total capacity is somewhere around 12 TB and I have a lot of data stored on them.<br />
<br />
Having lots of separate storage devices is both good and bad. I have storage directly attached to many of the devices I am working on so I can access information even when the Internet is not accessible - good. I try to spread my data around and keep redundant copies or backups of important data in case any individual storage device fails or is lost or stolen - better. If I have the right procedures in place, I ultimately control all the data that I have stored - best of all.<br />
<br />
But it can be difficult to figure out which of my many devices has a piece of data that I am looking for - bad. I have to remember to backup or replicate data that might be unique to any given device - worse. I might be on a trip and remember that the data I need is on a flash drive in a drawer at home - also worse. I might have multiple copies of a given piece of data and if I update one copy, I need to remember to update all the copies, otherwise I have multiple working sets of data that are not synchronized - worst of all.<br />
<br />
Recent offerings by cloud storage providers such as Dropbox, Google Drive, SugarSync, and Amazon S3 have attempted to solve some of these problems and a few others. Unfortunately, they also introduce a number of problems and challenges of their own.<br />
<br />
Keeping your data in the "Cloud" can be beneficial in many instances. Redundant copies or backups are handled automatically by the storage provider. The data can be accessed by nearly any device with an Internet connection. Storage capacity can be very flexible and grow to meet your storage needs without having to purchase new units and migrate your data. It is easy to share your data with others. All these features offer compelling reasons to put data in the cloud.<br />
<br />
But cloud storage is currently much more expensive than just buying a new hard drive. If you have many terabytes of data, it can be incredibly expensive to store all that data in the cloud. Data transfer speeds can also be very slow when compared to local storage. Sometimes users experience extremely slow speeds when performing a backup or restore operation. Slow performance and costs make it critical to be able to eliminate large quantities of unimportant data from cloud backup or synchronization functions. Finding stuff stored in the cloud can also be a slow and difficult process. If you have a few million pieces of data stored in one of those cloud buckets, it might take quite a while to find one if you have forgotten its unique key name. Likewise, finding all pieces of data that meet some kind of specific criteria can also take a very long time.<br />
<br />
The most troubling part of cloud storage seems to be the lack of control over your own data. If your only copy of a valuable piece of data is out in the cloud, you are completely dependent upon the cloud provider to make sure you have unimpeded access; that the data is free from corruption; and that it is secure from unauthorized access. Even Steve Wozniak recently expressed great concern about the trend for individuals and businesses to store large amounts of their important data on systems controlled by someone else.<br />
<br />
Personally, I think all the current cloud offerings represent a half-way solution. Universal access, flexible storage capacity, and automatic redundancy are great features. But I think the real, full solution is to have just a copy of important data (and only important data) stored in the cloud that is easily synchronized with other copies of that same data on local systems where the user has complete control.<br />
<br />
This is one of the compelling features of the Didget Management System.DidgetMasterhttp://www.blogger.com/profile/17264703189806219219noreply@blogger.com0tag:blogger.com,1999:blog-266319390042654325.post-26188828629038170072012-12-06T22:13:00.000-08:002012-12-06T22:13:11.671-08:00Extreme Performance DemonstrationI created a third demonstration video of the Didget Management System in Action. This one shows how fast we can find things even when the number of Didgets gets very high.<br />
<br />
See it at www.screenr.com/5Zx7<br />
<br />
In this video I create nearly 10 million Didgets in a Chamber and automatically attach a set of tags to each one. Each tag has a value associated with it. I then perform queries against that Chamber for all Didgets of a certain type, followed by an additional query for the Didgets that have a certain tag attached, regardless of its value. Finally, I perform a couple of queries looking for Didgets that have that tag attached with a value that starts with an input string.<br />
<br />
Again, I was running this demonstration on the same low-end PC as in the previous two videos. If I were to attempt to find all the video files on my NTFS file system and if there were 10 million files on it, that query would take nearly an hour using a regular program calling the file API. With the Didget Management System, the slowest query took about 3 seconds.DidgetMasterhttp://www.blogger.com/profile/17264703189806219219noreply@blogger.com0tag:blogger.com,1999:blog-266319390042654325.post-70637029991304281102012-12-03T22:35:00.000-08:002012-12-03T22:35:27.516-08:00Demo Video Part 2I added another short video of a demonstration of tags used in the Didget Management System.<br />
<br />
View at: www.screenr.com/fXd7<br />
<br />
This video emphasizes the creation of tags and attaching them to a set of Didgets so that we can query based on them or create lists (e.g. Albums) from the query results.<br />
<br />
Each Didget can have up to 255 different tags attached to it. There can be tens of thousands of different tags to choose from, and each tag value can be a string, a number, a boolean, or another value type. We have a set of predefined tags such as .event.Holiday, .date.Year, and .activity.Sport, but the user is free to define additional tags and immediately begin attaching them to any Didget.<br />
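A minimal sketch of how typed tag definitions might be enforced, with a plain dictionary standing in for a Didget's tag set (the `attach_tag` helper and its validation rules are hypothetical; only the tag names, value types, and the 255-tag limit come from the description above):

```python
# Each tag is declared with a schema -- here simply a Python type --
# and every attached value is checked against it.
tag_schemas = {
    ".event.Holiday":  str,
    ".date.Year":      int,
    ".activity.Sport": str,
    ".flag.Favorite":  bool,
}

def attach_tag(didget_tags, name, value):
    """Attach a tag, enforcing its declared value type (max 255 per Didget)."""
    expected = tag_schemas[name]
    # bool is a subclass of int in Python, so compare the type exactly.
    if type(value) is not expected:
        raise TypeError(f"{name} expects a {expected.__name__}")
    if len(didget_tags) >= 255:
        raise ValueError("a Didget can carry at most 255 tags")
    didget_tags[name] = value

tags = {}
attach_tag(tags, ".date.Year", 2012)
attach_tag(tags, ".activity.Sport", "Surfing")
print(tags)  # {'.date.Year': 2012, '.activity.Sport': 'Surfing'}

try:
    attach_tag(tags, ".date.Year", "2012")   # wrong type -> rejected
except TypeError as e:
    print("rejected:", e)
```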
<br />
Attaching tags to Didgets and performing queries based on them works exactly the same way for photos, documents, music, videos, or any other type of Didget.DidgetMasterhttp://www.blogger.com/profile/17264703189806219219noreply@blogger.com0tag:blogger.com,1999:blog-266319390042654325.post-11220252908801383902012-12-02T17:53:00.000-08:002012-12-02T17:53:57.022-08:00Video Demonstration of our BrowserAfter much trial and error, I was finally able to capture a video of our Didget Browser in action. The video was limited to only 5 minutes, so I had to move fast and could only show a few features, but it gives a good demonstration of the speed at which we can query any given Chamber populated with lots of Didgets.<br />
<br />
You can watch the video at: www.screenr.com/XV17<br />
<br />
The Didget Browser was running on a Windows 7 PC and was created using the open-source, cross-platform GUI library called Qt. It can easily be ported to the Linux and Mac OSX operating systems. It sits on top of our Didget Management System using its API to perform much of its work.<br />
<br />
The PC I used was a 3 year old Gateway machine I bought at Costco for $500. It has an Intel Core 2 processor, 4 GB of DDR2 RAM, and a 750 GB HDD. This was not a high-end box even when I bought it, let alone now. If you are impressed with the speed at which we are able to perform queries and to display large lists of tag values, please keep in mind it is NOT due to speedy hardware.<br />
<br />
Whenever we perform a query, we look at the metadata records for each Didget within the Chamber. This would be analogous to checking each inode in an ext3 file system when querying files. The same is true whenever we refresh the contents of the Status Tab. We look at each and every Didget metadata record and tally up a total of all the different categories displayed.<br />
<br />
It is important to know that we do not have a separate database that we are querying like indexing services such as Apple's Spotlight or Microsoft's Windows Search do. Such databases can take hours to create and can easily become out of sync with the file metadata that they index.<br />
<br />
Some of the query operations that we perform could be accomplished on a regular file system using command line utilities. For example, I can get a list of all .JPG files on my file system by entering the command:<br />
<br />
C:\>dir *.jpg /s<br />
<br />
The main difference is that on that same machine with 500,000 files, this command takes nearly 3 minutes to complete. If my NTFS volume had 3 million files on it, the same command would take approximately 20 minutes to complete. Using the Didget Browser, we are able to accomplish the same task in under ONE second. In fact, we can get a list of all the JPG Photo Didgets in under one second even if there are 25 million of them.<br />
<br />
The differences in speed between our system and conventional file systems are even more pronounced when we do more complicated queries. Try to find all .JPG photos in a file system that have two extended attributes attached with the key:values of Place=Hawaii and Event=Vacation. We can find all the Didgets with those two tags attached in just a couple of seconds. File systems (the ones that even support extended attributes) will require a very long time.DidgetMasterhttp://www.blogger.com/profile/17264703189806219219noreply@blogger.com0tag:blogger.com,1999:blog-266319390042654325.post-18653467913978011802012-11-18T22:07:00.001-08:002012-11-18T22:07:48.938-08:00The Big PictureSo far, I have posted several blogs that explain certain pieces of the Didget Management System and how each feature adds specific benefits over conventional file system or database architectures. I thought I would devote this 20th post to explaining the entire system as a whole, giving the reader an idea of how it will look once all the pieces are put together.<br />
<br />
The Didget Realm represents a world-wide collection of individual Didget containers called Chambers. Each Chamber is managed by its own instance of the Didget Manager, and together they represent a single node in this global data storage network. Each node can communicate with every other node to exchange Didget information. With the use of Policy Didgets, this information can be exchanged automatically without direct commands from a running application. Nodes can be grouped into domains or federations so that they can exchange even more information with each other than two nodes that are not in the same domain can.<br />
<br />
Each Chamber can store several billion individual Didgets. The system is designed to effectively manage huge numbers of Didgets without sacrificing speed. Simple queries to a Chamber with over 10 million Didgets in it are designed to execute in under one second. Even the most complex queries are designed to execute in under ten seconds when the Didget Manager is running on a single desktop system. For Chambers with hundreds of millions or with billions of Didgets, the Chamber can be split into many individual pieces and managed by lots of separate systems in a distributed environment to perform lightning fast queries using map-reduce algorithms.<br />
<br />
A Chamber that has been converted to a distributed system looks exactly the same to an application or to another node in the global network as a Chamber that has not been split into several pieces and distributed. In other words, applications do not need to know whether they are communicating with a single-piece Chamber running on a laptop computer or with a Chamber that has been split into 100 different pieces and managed by 1000 different servers. The only difference will be the speed at which a query or other command may execute when the number of Didgets in the Chamber is extraordinarily large.<br />
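This single-piece vs. distributed transparency can be sketched as a tiny map-reduce (the sharding scheme and helper functions below are invented for illustration; the real Didget Manager's internals are not shown):

```python
# A toy Chamber: 10,000 Didgets, every third one a Photo.
didgets = [{"id": i, "type": "Photo" if i % 3 == 0 else "Document"}
           for i in range(1, 10_001)]

def map_count(shard, wanted_type):
    """Map step: each piece counts its own matching Didgets."""
    return sum(1 for d in shard if d["type"] == wanted_type)

def query_type_count(pieces, wanted_type):
    """Reduce step: merge the per-piece partial counts."""
    return sum(map_count(p, wanted_type) for p in pieces)

# Split the Chamber into 100 pieces by ID, as if spread over many nodes.
pieces = [didgets[i::100] for i in range(100)]

# The caller cannot tell the difference: one piece or one hundred.
assert query_type_count([didgets], "Photo") == query_type_count(pieces, "Photo")
print(query_type_count(pieces, "Photo"))  # 3333
```

Only the elapsed time changes when the pieces are processed on separate machines; the query interface and its answer do not.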
<br />
Using Policy Didgets and Security Didgets, operations against all the Didgets within a Chamber can be tightly controlled. Sensitive information can be protected, and a whole host of data management functions can happen automatically either when a certain amount of time has expired or when certain events happen.<br />
<br />
Individual Didgets can be classified, tagged, and grouped together in ways files or database rows never could. Copying or moving a Didget from one Chamber to another does not cause it to lose any of its metadata or to become any less secure than the original. Special attributes can be assigned to each Didget that enable it to be managed by the Didget Manager in very specific ways. Several of these attributes represent unique features that I have not seen on any other system.<br />
<br />
Applications can query for a set of Didgets based on any of these metadata fields and perform operations against the whole set (if permissions allow).<br />
<br />
Didgets can represent either structured or unstructured data. All the management functions work the same, regardless of the data type. Didgets can be accessed using file-like APIs or database-like queries.<br />
<br />
Inventory, search, backup, recovery, synchronization, organization, version control, and licensing are just a few of the management functions that are provided by the system. In every case, the functions will perform faster and with simpler mechanisms than with conventional systems.<br />
<br />
In summary, I think this system offers a far superior data management environment than do conventional file systems or NoSQL database environments. Once data is created as Didgets (or converted from legacy systems) it will be far easier to manage and provide significantly greater value to the end user than it would be as files or as database rows.<br />
<br />
The Didget Management System will revolutionize the way the whole world looks at data going forward. (You heard it here first!)DidgetMasterhttp://www.blogger.com/profile/17264703189806219219noreply@blogger.com0tag:blogger.com,1999:blog-266319390042654325.post-87562214110423613412012-11-17T16:17:00.000-08:002012-11-17T16:17:51.740-08:00Structured vs Unstructured DataPersistent data seems to fall into one of two categories: 1) structured data (like cells in a spreadsheet or a row/column intersection in a database table) that must adhere to some fairly strict rules regarding type, size, or valid ranges; or 2) unstructured data (like photos, documents, or software) where the data can be much more free-form.<br />
<br />
Databases are well equipped to handle structured data but generally do a poor job of managing large amounts of unstructured data (BLOBs, in database speak). File systems, on the other hand, were designed to hold large numbers of unstructured data objects, each wrapped in a metadata package called a file, but they generally do a poor job of handling structured data (although, technically, databases themselves are almost always stored as a set of files in a file system volume).<br />
<br />
When I first designed the Didget Management System, I concentrated solely on improving the handling of unstructured data. It was designed to be a replacement for file systems. Databases could be stored in a set of Didgets just as easily as in a set of files, but I planned to largely ignore structured data the way file systems do.<br />
<br />
But with the introduction of the Didget Tags, I had to figure out how to handle large amounts of structured data as part of Didget metadata since each tag is defined with a schema and each tag value must adhere to this definition. I had to be able to assign each Didget a bunch of tags and then make it so I could query against the whole set of Didgets based on specific tag values. For example, "Find all Photo Didgets where .event.Vacation = Hawaii" would need to return a list of all photos that had been assigned this tag value. This feature is strikingly similar to executing an SQL query against a relational database.<br />
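A tag query like the one above can be sketched in a few lines. The structures and names here are illustrative only, since the actual Didget API is not shown in this post; plain Python dicts stand in for Didget metadata records.<br />

```python
# Illustrative sketch: querying "Didgets" by tag value, modeled with
# plain dicts. The real Didget Manager stores tags in its own metadata
# format; these records and the query_by_tag helper are hypothetical.

didgets = [
    {"type": "Photo", "tags": {".event.Vacation": "Hawaii"}},
    {"type": "Photo", "tags": {".event.Vacation": "Paris"}},
    {"type": "Document", "tags": {".event.Vacation": "Hawaii"}},
]

def query_by_tag(records, didget_type, tag, value):
    """Return every record of the given type whose tag matches value."""
    return [r for r in records
            if r["type"] == didget_type and r["tags"].get(tag) == value]

# "Find all Photo Didgets where .event.Vacation = Hawaii"
hawaii_photos = query_by_tag(didgets, "Photo", ".event.Vacation", "Hawaii")
print(len(hawaii_photos))  # 1
```

The equivalent SQL against a relational table would be something like `SELECT * FROM photos WHERE event_vacation = 'Hawaii'`, which is why the feature feels so database-like.<br />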
<br />
I still didn't make the connection of how this feature could add a whole new dimension to the Didget Management System until one of the programmers helping me with this project pointed out how similar a Didget is to a row in a NoSQL database table. In fact, the entire Didget Chamber could be thought of as a huge table of columns and rows where every column is a tag and every row is a Didget. In our system there can be tens of thousands of different tags defined (columns) and billions of Didgets (rows). Each Didget can have up to 255 different tag/value assignments.<br />
<br />
Since each Didget can also have a data stream assigned to it, this data stream could be thought of as just another column in the table (although it is a very special column in that its contents are not defined in a schema and its value can be unstructured and up to 16 TB in length). The Didget metadata record, likewise, could be thought of as a set of special columns in this huge table. We can query based on Didget type, stream length, event stamps, attributes, and the like.<br />
<br />
What this means is that every Didget can be treated kind of like a file or kind of like a row in a database. Applications can perform operations against a set of Didgets using an API that is very file-oriented, or by using one more familiar to database operators.<br />
<br />
Since the Didget Management System was designed to scale out by breaking a single chamber into multiple pieces and distributing them across a set of servers (local or remote), it could compete directly against large distributed NoSQL systems like CouchDB, MongoDB, Cassandra, or BigTable just as easily as it could against Hadoop in the distributed file system arena.<br />
<br />
Companies or individuals that work with large amounts of "Big Data" would no longer need two separate systems, one to handle their unstructured data and another to handle their structured data. With the Didget Management System, all their data (structured and unstructured) could be handled in a single distributed system and managed with the same set of tools and policies.<br />
<br />
Policies (2012-11-12)<br />
<br />
Conventional file systems treat all files like black boxes and almost never perform any direct manipulation of them. If a file is created, modified, moved, or deleted, it is done as a direct command from either the operating system or an application. All file management functions such as organization, backup, synchronization, or cleanup are performed by something other than the file system itself.<br />
<br />
In the Didget system, many of these management tasks can also be performed by the Didget Manager itself, independent of any running program. Programs can schedule specific tasks to execute at specific times, or when certain events occur, with the use of Policy Didgets. These Didgets are somewhat similar to database triggers. They can cause the Didget Manager to manipulate data even when the application that scheduled the task is no longer running on the system.<br />
<br />
Just like all the other Didgets in the system, Policy Didgets can be created, protected, queried, synchronized, and deleted. They can have tags attached to them to help in finding or organizing different policies. They can have a data stream that contains specific instructions or program extensions or that logs results as the policy executes. Just about any conceivable data management function could be implemented or at least facilitated using these special Didgets.<br />
<br />
For example, an application could create a policy that automatically adds any new photo with a .event.Vacation tag to a List Didget called "Vacation Photo Album". At the same time, it could search for another List Didget with a name matching the tag value (e.g. if .event.Vacation = "Hawaii", it would look for a list where .didget.Name = "Hawaii Photo Album") and add the photo to that list, creating the list first if it did not already exist.<br />
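The album policy described above could be sketched like this, with plain Python structures standing in for List Didgets and the policy logic. All names here are hypothetical; the actual Policy Didget format is not documented in this post.<br />

```python
# Hypothetical sketch of the photo-album policy: a tagged photo goes into
# a general "Vacation Photo Album" list, and into a per-destination list
# that is created on demand if it does not yet exist.

lists = {"Vacation Photo Album": []}   # stand-ins for List Didgets

def apply_vacation_policy(photo, tags):
    """Run the policy against one newly created photo and its tags."""
    destination = tags.get(".event.Vacation")
    if destination is None:
        return                                   # policy does not apply
    lists["Vacation Photo Album"].append(photo)  # always the general album
    album = f"{destination} Photo Album"         # e.g. "Hawaii Photo Album"
    lists.setdefault(album, []).append(photo)    # create the list on demand

apply_vacation_policy("IMG_0001.jpg", {".event.Vacation": "Hawaii"})
apply_vacation_policy("IMG_0002.jpg", {".event.Vacation": "Hawaii"})
```

The difference from this toy version is that the real Didget Manager would run such a policy itself, as data arrives, without the creating application needing to be present.<br />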
<br />
In another example, an application could create a policy that automatically backs up all new or modified Private Didgets to a chamber located in the cloud every Monday morning. This would create an incremental backup of everything the user created on that system during the week.<br />
<br />
In yet another example, an application could create a policy that would automatically synchronize all new photos and documents with a chamber located on a phone every time the phone was connected to the desktop.<br />
<br />
Policy Didgets could be built and maintained to enforce company policies governing data protection, retention, and validation. Entire workflow systems could be driven by carefully crafted Policy Didgets that create, tag, and organize data as each step in the workflow progresses.<br />
<br />
How Does it Scale? (2012-11-03)<br />
<br />
The Didget Manager is designed to perform a variety of data management functions against a set of storage containers that may be attached to a single system or spread across several separate systems.<br />
<br />
These functions include:<br />
<br />
1) Backup<br />
2) Synchronization<br />
3) Replication<br />
4) Inventory<br />
5) Search<br />
6) Classification<br />
7) Grouping<br />
8) Activation (licensing)<br />
9) Protection<br />
10) Archiving<br />
11) Configuration<br />
12) Versioning<br />
13) Ordering<br />
14) Data Retention <br />
<br />
In order to properly perform each of these functions, a system is needed that can operate against all kinds of data sets consisting of structured and/or unstructured data, from very small sets to extremely large sets (i.e. "Big Data"). A legitimate question for any system is "How does it scale?"<br />
<br />
When it comes to the term "Scale", I define it in three dimensions: "Scale In", "Scale Out", and "Scale Up".<br />
<br />
"Scale In" refers to the ability of the system's algorithms to properly handle large amounts of data within a single storage container given a fixed amount of hardware resources on a single system. File Systems have a limited ability to scale in this manner. For example: the NTFS File System was designed to hold just over 4 billion files in a single volume. However; each file requires a File Record Segment (FRS) that is 1024 bytes long. This means that if you have 1 billion files in a volume, you must read approximately 1 TB of data from that volume just to access all the file metadata. If you want to keep all that metadata in system memory in order to perform multiple searches through it at a faster rate, you would need to have a TB of RAM. Regular file searches through that metadata can also be painfully slow even if all the metadata is in RAM due to the antiquated algorithms of file system design.<br />
<br />
The Didget system was designed to handle billions of Didgets and perform fast searches through the metadata even when limited RAM is available. If the same 1 billion files were converted to Didgets, the system would only need to read 64 GB of metadata off the disk, and only 64 GB of RAM would hold it all in system memory. That is 1/16 of what NTFS requires. Searches through that metadata would also be hundreds of times faster than with file systems.<br />
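The arithmetic in the two paragraphs above can be checked directly. The 64-byte per-Didget metadata record size is implied by the 64 GB figure for a billion entries; it is not stated explicitly in the post.<br />

```python
# Metadata-size arithmetic from the text above. NTFS uses a 1024-byte
# File Record Segment (FRS) per file; the 64 GB figure for a billion
# Didgets implies a 64-byte metadata record per Didget.
entries = 1_000_000_000

ntfs_bytes = entries * 1024    # ~1 TB of FRS metadata to read / cache
didget_bytes = entries * 64    # ~64 GB of Didget metadata to read / cache

ratio = ntfs_bytes // didget_bytes
print(ratio)  # 16 -- NTFS needs 16x the metadata I/O and RAM
```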
<br />
"Scale Out" refers to the ability of the system to improve performance by adding additional resources and performing operations in parallel. This can be accomplished in two ways. Multiple computing systems can operate against a single container, or a single container can be split into multiple pieces and distributed out to those systems. Hadoop is a popular open-source distributed file system that spreads file data across many separate systems in order to service data requests in parallel. It has a serious limitation in that file metadata is stored on a single "NameNode". This has both availability and performance ramifications. It was designed more for smaller sets of extremely large files rather than for extremely large sets of smaller files. Most of the other traditional file systems were never designed to either operate in parallel or to be split up.<br />
<br />
The Didget system was designed for both kinds of parallel processing. Multiple systems can operate largely in parallel against a single container since all the metadata structures were designed for locking at the block level. When a system needs to update a piece of metadata, it does not need to establish a "global lock" on the container. It only needs to lock a small portion of the metadata where the update is applicable. This means that thousands of systems can be creating, deleting, and updating Didgets within a single container at the same time. Each container was also designed to be split up and distributed across multiple systems. Both the data streams and the Didget metadata can be split up and distributed. Map-Reduce algorithms are used to query against many of these container pieces in parallel.<br />
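The block-level locking idea can be illustrated with a small sketch. The fixed 4 KiB block size and the helper names here are assumptions for illustration; the real container layout is not described in this post.<br />

```python
# Illustrative sketch: instead of one global lock on a container, guard
# each fixed-size metadata block with its own lock, so updates to
# different blocks proceed in parallel. Block size is assumed.
import threading

BLOCK_SIZE = 4096
block_locks: dict[int, threading.Lock] = {}
registry_lock = threading.Lock()   # only guards the lock table itself

def lock_for(offset: int) -> threading.Lock:
    """Return the lock guarding the metadata block containing offset."""
    block = offset // BLOCK_SIZE
    with registry_lock:            # brief hold just to find/create the lock
        return block_locks.setdefault(block, threading.Lock())

def update_metadata(offset: int, apply_change) -> None:
    with lock_for(offset):         # lock one block, not the whole container
        apply_change(offset)
```

Two writers touching different blocks acquire different locks and never serialize against each other; only the tiny lock-table lookup is shared.<br />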
<br />
"Scale Up" refers to the ability of a single management system to manage data from small sets on simple devices to extremely large data sets on very complex hardware systems. Most data management systems today don't scale up very well. For example, backup programs that work well on pushing data from a single PC to the cloud do not generally work well as enterprise solutions. Users typically need separate data management systems for their home environment and for their work environment. As a business grows from a small business to a medium sized business to a large enterprise, it often must abandon old systems and adopt new systems as its data set grows.<br />
<br />
The Didget system was designed to work essentially the same whether it is managing a few hundred Didgets on a mobile phone or billions of Didgets spread across thousands of different servers. Additional modules may be required, and enhanced policies would need to be implemented for the larger environment to function effectively, but the two systems would behave nearly identically from the user's (or application's) point of view. Applications that use the Didget system to store their data would not need to know which of the two environments was in play.<br />
<br />
Configuration Didgets (2012-09-01)<br />
<br />
Remember the good ol' days, when configuration on a Windows PC (or DOS back then) meant a simple text file in the same directory as your application that controlled its behavior? The file was given an extension of .INI and was easy to read and edit. When you uninstalled your application (using del *.*), the configuration file was cleaned up along with all your other application files.<br />
<br />
Unfortunately, this approach also had a number of drawbacks. If you had 1000 applications, you might also have 1000 little configuration files spread all over your folder hierarchy. They were difficult to find and edit when you wanted to manage a whole bunch of applications at once.<br />
<br />
Microsoft's answer to this problem was to create a central database called the Registry, where all the configuration settings for the system and user applications could be stored. Unfortunately, this approach also had a number of drawbacks. If this single Registry was deleted or corrupted, everything was a mess; if an application was uninstalled, it didn't always clean up after itself in the Registry; it wasn't always obvious where all the keys for a particular application were stored within the hierarchy of this database; and there was no way for an application to protect its configuration settings against unauthorized changes.<br />
<br />
While several steps have been taken to keep the Registry from being corrupted and to allow recovery to a consistent state when something goes wrong, the Registry continues to be a bit of a headache when it comes to managing software. Special programs written to clean up problems with the Registry have become popular in recent times.<br />
<br />
With Didgets, we take a different approach. Just like the old .INI files, each application can have its own set of configuration settings stored in one or more Configuration Didgets. Just like all the other Didgets in our system, you can get a list of all these Configuration Didgets in just a second or two even if there are millions of them.<br />
<br />
Each Configuration Didget has some special fields designating what type of software it configures, so you can narrow a query to only the Didgets that configure word processors, for example.<br />
<br />
Just like the .INI files, if one of these Didgets becomes corrupt, none of the others are affected. Each Configuration Didget can be protected with the Read-Only attribute or with security keys.<br />
<br />
Just like the editing tool for the Registry (regedit), a configuration viewer/editor can be built to give the user a unified view of a whole host of Configuration Didgets. It can do this by consolidating all the data from the individual Didgets into a single virtual view. Any change made in the editor would not be written to a central database, but rather directly into the Configuration Didget that holds the setting.<br />
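A minimal sketch of such a viewer/editor, using plain Python dicts to stand in for Configuration Didgets; all names here are illustrative, not part of any actual Didget API.<br />

```python
# Illustrative sketch: merge many small per-application config stores
# into one read view, and route each write back to the store that owns
# the setting (no central database involved).

config_didgets = {                       # stand-ins for Configuration Didgets
    "editor": {"font": "mono", "tabsize": 4},
    "browser": {"homepage": "example.org"},
}

def unified_view():
    """Flatten all Configuration Didgets into one (app, key) -> value map."""
    return {(app, key): value
            for app, settings in config_didgets.items()
            for key, value in settings.items()}

def set_value(app, key, value):
    """A change made in the editor lands in the owning store only."""
    config_didgets[app][key] = value

set_value("editor", "tabsize", 8)        # touches only the editor's store
```

To the user it looks like one settings database, but each write lands only in the store that owns the key, so corruption in one store never affects another.<br />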
<br />
This is like having a word processor display a document where every page is stored in a separate file. Any changes to one of the pages would be made to just the file that holds that page. To the user it looks like a single document, but it's really just a unified view of a whole bunch of separate files.