Version Control is a concept that revolutionized software development and is making a huge impact on other unrelated industries in the digital world. For a primer on version control, please refer to the excellent Pro Git book here (About Version Control).
The basic idea behind nano version control is a combination of two separate but important ideas that, when put together, give us something interesting and new.
- Version Control In Memory – This is the same idea of version control as you know it from Git, databases, documents etc…, except that it lives purely in memory, as a data structure.
- Nano Scale Repositories – At first this might feel unnatural, but it hinges on the idea that we create an entire version control repository for each individual concept that you are modelling.
Both of these ideas on their own might seem underwhelming, an almost arbitrary step forward from the version control that you know already. However, what I hope to show you over time is that BOTH of these ideas at the SAME TIME are a game changer.
1. Version Control in Memory
If you are a developer, think of a Git repository. If you come from a NoSQL database background, think of revisions of a document. If you are neither, just think of multiple revisions of a Word document, each with a timestamp (or, more realistically, a title like:
“Final Draft of Final Document V2 – Copy.docx”).
A typical version control naming convention for documents.
For way too long, the whole world has relied on the storage layer to provide us with version control. This is an obvious choice: things go through so many revisions and changes that it makes sense to offload them to storage, since they probably wouldn’t fit in memory. Hence version control being a function of the database. If I want to get the history for something, I have to go to a central location that ultimately becomes a system-throughput bottleneck. In the modern world of cloud and streaming… that seemed odd.
Additionally, as a software developer in university, I studied all sorts of interesting data structures (lists, sets, maps, trees, graphs), yet none of them gave me version control. That seemed odd.
2. Nano Scale Repositories
Most software developers are first properly introduced to the world of version control in the form of Git (or Mercurial, Subversion, CVS etc…). This is a highly effective way to work on a large code base with a team of people: lots of source code files spread across many directories, which many people work on. If you structure your files neatly, most people will not step on each other’s toes. If your tools are good, they will help you when you do step on each other’s toes. On the odd occasion, you have to deal with horrible merge conflicts to figure out how to keep two people’s changes to the same thing.
Anyone who has tried to merge two Word documents, Excel spreadsheets or PowerPoint Presentations will know how difficult merge conflicts can get without the help of good tools.
The first time I noticed something strange here was in 2014, when I wanted to put 10 Million financial account holders’ data (probably about 50 Million rows of data) into one git repository. The truth is that having one CSV file for each record type and committing the whole lot actually worked (Kudos to Linus!). However, people looking at the solution immediately felt that it seemed odd.
What makes Git stand out from its predecessors is that Linus decided to store snapshots of the content as opposed to storing the deltas (again, the simple elegance of Linus’ choice ended up making a huge difference later on). This meant that even a 1 byte change to the CSV file required committing the entire 1GB file again. That seemed odd.
Secondly, the time it would take to retrieve the history of one client in that repository grows rapidly as the number of revisions increases. Even if the chance of a change to ONE account holder is small, the chance of a change to ANY account holder is almost certain. Distilling the changes to a SPECIFIC account holder would be computationally expensive and storage hungry. That seemed odd.
So what about storing a JSON file for each account holder? Now that seems like a good idea. The NoSQL community realised this was useful 10 to 20 years ago, depending on which history you read (NoSQL Wiki Article). Yes, create a file for each account holder so that we have a complete picture of all data related to the concept (or model) in one document. Now we are starting to shrink down to the nano scale. This seems better because changes to one account holder are isolated from all other account holders, which is very intuitive for financial data: my data has nothing to do with your data (what I call natural data islands). But it leaves us with a git repository of 10 Million JSON files. Although git will manage, your Operating System’s file system won’t thank you for it. In fact, in order to handle a really huge repository like Windows, you might need to invent an entirely new file system (see “The largest Git repo on the planet”). That seemed odd.
Lastly, when regulations like GDPR came around, it became necessary to be able to purge the history of one financial account holder upon request. With traditional version control methods like Git, a NoSQL database or similar, it would be almost impossible to delete the history of one thin slice of the monolithic repository, because the hashes of one account holder’s commits are intrinsically tied to the hashes of all subsequent commits. (This same idea is put to great use in the blockchain.) Purging one thin slice of a concept in a monolithic repository is technically possible but practically and financially infeasible. That seemed odd.
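To see why that purge is so hard, recall that each commit’s identity is a hash that includes its parent’s hash, so rewriting any historical commit changes every identity after it. Here is a rough Python sketch of that chaining; it is deliberately simplified (real Git also hashes trees, authors and timestamps) and is only meant to illustrate the reasoning:

```python
import hashlib

def commit_id(parent_id: str, content: str) -> str:
    # A commit's id is a hash over its parent's id plus its own content,
    # so commit ids form a chain (simplified: real Git also hashes trees,
    # authors and timestamps).
    return hashlib.sha1(f"{parent_id}:{content}".encode()).hexdigest()

c1 = commit_id("", "add account holder A")
c2 = commit_id(c1, "add account holder B")
c3 = commit_id(c2, "update account holder B")

# Purging account holder A means rewriting the first commit, which changes c1
# and therefore c2, c3 and every commit after them -- the whole chain.
c1_rewritten = commit_id("", "redacted")
print(commit_id(c1_rewritten, "add account holder B") == c2)  # False
```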
3. What would happen if we had both?
Imagine for a moment if we had both ideas at the same time: Version Control in Memory AND Nano Scale Repositories.
- Imagine if we shrunk the model that we are version controlling down to such a small size that it no longer feels strange to have it all in memory.
- Imagine if we combined every existing data structure that we know (list, set, map, tree, graph) into a higher-level data structure that gives us version control: the repo (or in-memory repository). Imagine a Map (key-value pairs) but with commit, branch, checkout semantics on top of it. Imagine if the keys created a logical tree (or in-memory filesystem) to organize the values. Think of a git repo, but in memory. Yes. No files. Just bytes stored at a logical path. And then snapshotted. What you are imagining is an API for this new data structure called a repo (there is a rough sketch of the idea after this list).
- Imagine if you could ask your storage layer for one document with ALL the history for one concept (like a financial account holder) and then pass that repository down, as a value, through a parallel pipeline (a cloud stream maybe). If you like the changes at the end of the pipeline, you commit and merge into the master branch for that concept and store the result. If you don’t like the results, you could just discard the repo and try again (or merge the conflicts that you don’t like). This eliminates the hard dependency on your storage layer to understand history or, more interestingly, to create new history (which I call alternate futures).
- Imagine if each concept you were modelling could have its own independent repository for all its history, which you could load and store separately from the history of all other concepts around it. This would mean that purging the entire history of one financial account holder suddenly becomes trivial under GDPR.
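To make the bullets above concrete, here is a minimal Python sketch of what such a repo data structure could look like. Every name here (Repo, put, commit, branch, checkout) is a hypothetical illustration of the API described above rather than the actual Nano Version Control implementation, and merging and conflict handling are deliberately left out:

```python
import hashlib
import json


class Repo:
    """A key-value map (logical path -> bytes) with commit/branch/checkout semantics."""

    def __init__(self):
        self._commits = {}                   # commit id -> {"parent", "tree", "msg"}
        self._branches = {"master": None}    # branch name -> latest commit id
        self._head = "master"                # the branch we are currently on
        self._working = {}                   # uncommitted working tree (path -> bytes)

    def put(self, path: str, value: bytes) -> None:
        self._working[path] = value          # stage bytes at a logical path -- no files involved

    def get(self, path: str) -> bytes:
        return self._working[path]

    def commit(self, message: str) -> str:
        tree = dict(self._working)           # snapshot the whole tree, not a delta
        parent = self._branches[self._head]
        payload = json.dumps({
            "msg": message,
            "parent": parent,
            "tree": {p: hashlib.sha1(v).hexdigest() for p, v in sorted(tree.items())},
        }).encode()
        commit_id = hashlib.sha1(payload).hexdigest()
        self._commits[commit_id] = {"parent": parent, "tree": tree, "msg": message}
        self._branches[self._head] = commit_id
        return commit_id

    def branch(self, name: str) -> None:
        self._branches[name] = self._branches[self._head]   # new branch points at the current commit

    def checkout(self, name: str) -> None:
        self._head = name
        commit_id = self._branches[name]
        self._working = dict(self._commits[commit_id]["tree"]) if commit_id else {}

    def history(self):
        commit_id = self._branches[self._head]               # walk the parent chain of this branch
        while commit_id:
            yield commit_id, self._commits[commit_id]["msg"]
            commit_id = self._commits[commit_id]["parent"]


# One tiny repo per concept, e.g. per account holder.
repo = Repo()
repo.put("account/profile.json", b'{"name": "Alice"}')
repo.commit("initial snapshot")

repo.branch("proposed-change")               # explore an alternate future on a branch
repo.checkout("proposed-change")
repo.put("account/profile.json", b'{"name": "Alice", "tier": "gold"}')
repo.commit("upgrade tier")

for cid, msg in repo.history():
    print(cid[:8], msg)
# If the pipeline likes the result, keep (or merge) the branch and store the repo;
# if not, checkout master and the proposal is simply discarded.
# Under GDPR, purging this account holder's entire history is just deleting this one repo.
```

Note that commit() stores a full snapshot of the tree rather than a delta, mirroring the Git design choice mentioned earlier, and the whole repository is just an in-memory value that can be serialised, passed through a pipeline, or purged as a single unit.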
What you are imagining is Nano Version Control. When you combine both ideas at the same time, many very difficult problems suddenly become…
…Less Odd!
Nano Version Control makes traditionally hard problems less odd to deal with.
There are more exciting applications of nano version control that I will be sharing with you in future posts. For now, I believe that these are the core ideas that got me going, and I hope they spark some interesting ideas in your own imagination. If they do, please leave comments below, and thank you for making it to the end of this post.
Until the next commit…