Saturday, November 3, 2012

How Does it Scale

The Didget Manager is designed to perform a variety of data management functions against a set of storage containers that may be attached to a single system or spread across several separate systems.

These functions include:

1) Backup
2) Synchronization
3) Replication
4) Inventory
5) Search
6) Classification
7) Grouping
8) Activation (licensing)
9) Protection
10) Archiving
11) Configuration
12) Versioning
13) Ordering
14) Data Retention

In order to properly perform each of these functions, a system is needed that can operate against all kinds of data sets consisting of structured and/or unstructured data, from very small sets to extremely large sets (i.e. "Big Data"). A legitimate question for any system is "How does it scale?"

When it comes to the term "Scale", I define it in three dimensions -"Scale In", "Scale Out", and "Scale Up".

"Scale In" refers to the ability of the system's algorithms to properly handle large amounts of data within a single storage container given a fixed amount of hardware resources on a single system. File Systems have a limited ability to scale in this manner. For example: the NTFS File System was designed to hold just over 4 billion files in a single volume. However; each file requires a File Record Segment (FRS) that is 1024 bytes long. This means that if you have 1 billion files in a volume, you must read approximately 1 TB of data from that volume just to access all the file metadata. If you want to keep all that metadata in system memory in order to perform multiple searches through it at a faster rate, you would need to have a TB of RAM. Regular file searches through that metadata can also be painfully slow even if all the metadata is in RAM due to the antiquated algorithms of file system design.

The Didget system was designed to handle billions of Didgets and perform fast searches through the metadata even when limited RAM is available. If the same 1 billion files had been converted from files to Didgets, the system would only need to read 64 GB of metadata off the disk and have 64 GB of RAM to keep it in system memory. This is only 1/16 of the requirements needed for NTFS. Searches through that metadata would be hundreds of times faster than with file systems.

"Scale Out" refers to the ability of the system to improve performance by adding additional resources and performing operations in parallel. This can be accomplished in two ways. Multiple computing systems can operate against a single container, or a single container can be split into multiple pieces and distributed out to those systems. Hadoop is a popular open-source distributed file system that spreads file data across many separate systems in order to service data requests in parallel. It has a serious limitation in that file metadata is stored on a single "NameNode". This has both availability and performance ramifications. It was designed more for smaller sets of extremely large files rather than for extremely large sets of smaller files. Most of the other traditional file systems were never designed to either operate in parallel or to be split up.

The Didget system was designed for both kinds of parallel processing. Multiple systems can operate largely in parallel against a single container since all the metadata structures were designed for locking at the block level. When a system needs to update a piece of metadata, it does not need to establish a "global lock" on the container. It only needs to lock a small portion of the metadata where the update is applicable. This means that thousands of systems can be creating, deleting, and updating Didgets within a single container at the same time. Each container was also designed to be split up and distributed across multiple systems. Both the data streams and the Didget metadata can be split up and distributed. Map-Reduce algorithms are used to query against many of these container pieces in parallel.

"Scale Up" refers to the ability of a single management system to manage data from small sets on simple devices to extremely large data sets on very complex hardware systems. Most data management systems today don't scale up very well. For example, backup programs that work well on pushing data from a single PC to the cloud do not generally work well as enterprise solutions. Users typically need separate data management systems for their home environment and for their work environment. As a business grows from a small business to a medium sized business to a large enterprise, it often must abandon old systems and adopt new systems as its data set grows.

The Didget system was designed to work essentially the same whether it is managing a set of a few hundred Didgets on a mobile phone or it is managing billions of Didgets spread across thousands of different servers. Additional modules may be required and enhanced policies would need to be implemented for the larger environment to function effectively, but the two systems would function nearly identically from the user's (or application's) point of view. Applications that use the Didget system to store their data would not need to know which of the two environments was in play.

1 comment:

  1. Andy: very thoughtful summary. I've read a few things on your blog in the past, but this article in particular tells me that you're getting your secret sauce right, if you're an order of magnitude smaller in terms of resource requirements, while still being several orders of magnitude faster.

    Hopefully, the industry will recognize the value of Didgets and make you fabulously rich and famous.