The Guide to Data Versioning

Already familiar with versioning code with git? A look at how it works to version data using the same abstractions.

Jack Nicholson and Diane Keaton discuss data versioning in Something’s Gotta Give

A version of something is defined as “a particular form in which some details are different from earlier or later forms.” In the digital world, versioning is a luxury we are fortunate to indulge — maintaining multiple versions of pretty much anything, from small objects to whole systems.

Many things we interact with are versioned automatically — word documents, codebases, the software that runs our precious phones. We rarely think twice about it.

The reason so many things are versioned is that it produces an invaluable record of incremental changes made and when they occurred. The ability to inspect this log is super helpful when understanding why the value of a certain datapoint is what it is.

What’s more: the ability to navigate between different versions is a superpower that acts as a virtual form of time travel. It’s no surprise that every software bug report starts with the same question: which version are you running?

Data is one area where versioning is still in its relative infancy. Why is this?

Well, data can be quite large — arguably larger than anything else in the digital world — and it is non-trivial to maintain multiple versions of something so heavy.

Non-trivial… but not impossible.

As we’ll explore, a versioning system for any size of data can exist with the right data structures and abstractions in place to efficiently map data objects to the versions they are a part of.

In business-critical data environments, versioning is increasingly seen as a vital component and not simply a nice-to-have premium. Before we dive into why this is, let’s take a step back and define what versioning is in the domain of data.

Fundamentally, to version data means to create a unique reference for a collection of data. This reference can take the form of a query, an ID, or, commonly, a datetime identifier.

This general definition qualifies a range of approaches as being “data versioning.” It includes something like saving an entire copy of the data under a new name or filepath every time you want to create a version of it.

It also includes more advanced versioning solutions that optimize storage usage between versions and expose special operations to manage them.

We’ll discuss how these work in more detail in the “How Data Versioning Is Implemented” section below.

Data versioning is important because it allows for quicker development of data products while reducing errors.

Accidentally deleted a petabyte of production data? Restoring a previous version of a dataset is a lot easier than re-running a backfill job that can take a whole day to complete.

As these examples show, minimizing the cost of mistakes and exposing how data has changed over time are two ways to increase the development speed of a data team. Data versioning is the catalyst making this possible.

Versioning data has always had value. But it has particular significance in modern data environments that are being asked to do more than feed internal reporting. In an increasing number of organizations, data supports a myriad of mission-critical business processes, which brings increased responsibility and complexity.

The software engineering practices referred to? Things like unit tests, integration tests, CI/CD deployment, and of course, versioning. We see this theme continuing to play out and data versioning gaining adoption across the data ecosystem in the coming years.

Next, let’s look at a few ways we can implement data versioning, starting with basic forms and working our way up to more advanced solutions.

There are several ways to implement data versioning. We’ll cover three instructive approaches below:

Have a dataset and want to see how it changes over time? One option is to save a full copy of it to a new location each time you want a version. This works best for smaller datasets with something like a daily versioning frequency.

Versioning via saving a full copy of an example users dataset daily.

While this approach does create versioned data, it does so in the least space-efficient way. In the illustration above, any block that stays green is an example of a data object that hasn’t changed but is now duplicated across each version.

Furthermore, code and queries that read this versioned data are error-prone, because the correct date value has to be hardcoded by hand in several places.

Although not the most elegant solution, it is an easy way to get started versioning data.
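As a rough illustration of the full-copy approach, here is a minimal Python sketch. The function names, the `users.csv` file, the directory layout, and the dates are all hypothetical choices made for this example; it assumes the dataset lives in a local directory rather than an object store.

```python
import shutil
import tempfile
from datetime import date
from pathlib import Path


def snapshot_dataset(source: Path, versions_root: Path, day: date) -> Path:
    """Copy the whole dataset directory to versions_root/<YYYY-MM-DD>/."""
    target = versions_root / day.isoformat()
    if target.exists():
        raise FileExistsError(f"version {target.name} already exists")
    shutil.copytree(source, target)  # full copy: unchanged files are duplicated too
    return target


def read_version(versions_root: Path, day: date, filename: str) -> str:
    """Resolve the dated path in one place instead of hardcoding it per query."""
    return (versions_root / day.isoformat() / filename).read_text()


# Hypothetical example data: a tiny "users" dataset versioned on two days.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    users = root / "users"
    users.mkdir()

    (users / "users.csv").write_text("id,name\n1,Ada\n")
    snapshot_dataset(users, root / "versions", date(2021, 10, 16))

    (users / "users.csv").write_text("id,name\n1,Ada\n2,Grace\n")
    snapshot_dataset(users, root / "versions", date(2021, 10, 17))

    # Read the dataset as it was on the first day.
    print(read_version(root / "versions", date(2021, 10, 16), "users.csv"))
```

Routing every read through a single helper like `read_version` is one way to soften the hardcoded-date problem, though it does nothing about the duplicated storage.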

Using query filters to get the state of the Orders table on Oct. 17.

This approach works quite well for “time traveling” throughout a single collection of tabular data. However, it provides only one method of interacting with the versions — which is to add filters to queries on the metadata fields as shown above.

Yes, we can materialize the table as it was at various points in time, but this is also all we can do.
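To make the query-filter idea concrete, here is a small sketch using Python's built-in `sqlite3` module. It assumes the metadata fields are a pair of validity dates; the column names `valid_from` and `valid_to`, the sample rows, and the year are all invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE orders (
        order_id   INTEGER,
        status     TEXT,
        valid_from TEXT NOT NULL,  -- date this row version became current
        valid_to   TEXT            -- NULL means the row is still current
    )
    """
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?)",
    [
        (1, "placed", "2021-10-15", "2021-10-18"),
        (1, "shipped", "2021-10-18", None),
        (2, "placed", "2021-10-16", None),
        (3, "placed", "2021-10-19", None),
    ],
)


def orders_as_of(conn, as_of):
    """Return the rows that were current on the given ISO date."""
    return conn.execute(
        """
        SELECT order_id, status
        FROM orders
        WHERE valid_from <= ?
          AND (valid_to IS NULL OR valid_to > ?)
        ORDER BY order_id
        """,
        (as_of, as_of),
    ).fetchall()


print(orders_as_of(conn, "2021-10-17"))  # [(1, 'placed'), (2, 'placed')]
```

Because ISO-formatted dates sort lexicographically, plain string comparison is enough here; a real warehouse would more likely use native date or timestamp columns.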

If the first two approaches can be summarized as “Let me add a bit of versioning to the data I already have”, now it’s time to change mindsets. Instead, we should think of versioning as a first-class citizen of our data environment. An inherent property of any data we introduce into the system.

To make this possible — as made clear by the limitations of the above approaches — we need to solve a few challenges.

Now, these are not simple problems. And you probably won’t hack your way to a solution in an afternoon or even a weekend.

Borrowing useful abstractions from git, lakeFS lets you create versions of data via commits, which in turn belong to branches. In essence, “creating a commit” is synonymous with “creating a version” (Challenge #2).

The above diagram shows the full relationships between the actual datafiles being versioned in lakeFS (bottom row), all the way up to the commit and branch abstractions exposed (top).

The important concept is that duplication of datafiles is minimized between commits (Challenge #1). This is depicted by the arrows going from one Metarange to multiple Ranges and the arrows from one Range to multiple objects.

An even more detailed look at these relationships can be seen below:

A detailed look at the contents of a Range in lakeFS.
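The deduplication idea can be sketched with a toy content-addressed store. This is not lakeFS's actual storage format or API: the `ToyVersionStore` class is hypothetical, and it collapses the Range and Metarange layers into a single per-commit manifest. It only shows how commits can share unchanged objects instead of copying them.

```python
import hashlib


class ToyVersionStore:
    """Toy model: commits point at content hashes, so unchanged files are stored once."""

    def __init__(self):
        self.objects = {}   # content hash -> bytes
        self.commits = {}   # commit id -> {path: content hash}
        self.branches = {}  # branch name -> latest commit id

    def commit(self, branch, files):
        manifest = {}
        for path, data in files.items():
            digest = hashlib.sha256(data).hexdigest()
            self.objects.setdefault(digest, data)  # skip bytes we already hold
            manifest[path] = digest
        parent = self.branches.get(branch, "")
        payload = parent + repr(sorted(manifest.items()))
        commit_id = hashlib.sha256(payload.encode()).hexdigest()[:12]
        self.commits[commit_id] = manifest
        self.branches[branch] = commit_id
        return commit_id

    def read(self, commit_id, path):
        return self.objects[self.commits[commit_id][path]]


store = ToyVersionStore()
v1 = store.commit("main", {"users/part-0": b"alice,bob", "users/part-1": b"carol"})
v2 = store.commit("main", {"users/part-0": b"alice,bob", "users/part-1": b"carol,dave"})

print(len(store.objects))              # 3 blobs stored for 4 file references
print(store.read(v1, "users/part-1"))  # b'carol'
```

Both commits remain readable, yet `users/part-0` is stored only once because its content hash did not change between them.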

Let’s give a clearer picture of how data versioning is useful in different contexts by walking through some examples.

Say you work for a company that uses machine learning algorithms to enhance grainy video footage and identify objects. The users of the product use the enhanced footage for a variety of commercial uses.

The data scientists develop a new algorithm that improves the classification accuracy of the outputs. However, it is not possible to roll out the new algorithm to all users at once. Instead, for a period of time, we need to let users switch between the classifications of both algorithms.

We could save the outputs of both versions of the algorithm to different paths in the object store. For one algorithm with just two versions, you could get away with this “hardcoding filepaths” type of approach. When the numbers of both algorithms and versions increase, though, developers will start to make more mistakes, such as forgetting to update the path of a version or losing track of what parameters were used for a particular version.

Incorporating data versioning provides useful abstractions (e.g. commit messages, branch names) to manage the outputs in a more sane manner.

The foundational process of analytics is creating metrics that contain the logic to evaluate a business and user behavior. The logic for important concepts like “Active users”, “Sessions”, and “Churn Rate” is often defined in SQL and calculated on a regular cadence.

A common problem in analytics is that a query that ran fine a day ago might start causing errors because of a change in the data. The most effective way to figure out what is causing the issue is to run the same query over the data as it was when the error first occurred.

When implementing data versioning, here are some tips we’ve found helpful.

It may be the case that there’s no need to retain versions older than 30 days or a year, for example. But no one wants to play housekeeper and be responsible for deleting older versions of data. And so they start to accumulate.
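One way to take the housekeeping off people's plates is to encode the retention rule once and run it on a schedule. The sketch below is a hypothetical helper, not part of any particular tool: it takes a mapping of version IDs to creation times and returns the ones past a cutoff, always sparing the newest.

```python
from datetime import datetime, timedelta


def versions_to_delete(created_at, now, max_age=timedelta(days=30), keep_latest=1):
    """Return ids of versions older than max_age, never touching the newest keep_latest."""
    newest_first = sorted(created_at, key=created_at.get, reverse=True)
    protected = set(newest_first[:keep_latest])
    return [
        version_id
        for version_id in newest_first
        if version_id not in protected and now - created_at[version_id] > max_age
    ]


# Hypothetical version history.
versions = {
    "v1": datetime(2021, 9, 1),
    "v2": datetime(2021, 10, 17),
    "v3": datetime(2021, 11, 15),
}

print(versions_to_delete(versions, now=datetime(2021, 11, 20)))  # ['v2', 'v1']
```

The function only decides what is expired; the actual deletion is left to whatever storage or versioning system holds the data.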

What makes data versions even more meaningful is tying versioning to the start and/or completion of data pipeline tasks. An ETL script finished running? Create a version. About to send an email to your “highly engaged” users? Create a version first.

This lets you include more meaningful metadata around your versions beyond the time they were created and lets you figure out what happened much faster if something goes wrong.
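As a sketch of what tying versions to pipeline tasks could look like, the decorator below records a version each time a task completes. Everything here is a stand-in: `create_version` simply appends to an in-memory list, where a real setup would call whatever commit or snapshot operation your versioning system exposes, and `load_orders` is an invented task.

```python
import functools
from datetime import datetime, timezone

version_log = []  # in-memory stand-in for a real versioning system


def create_version(message, metadata):
    """Hypothetical hook: record a version along with descriptive metadata."""
    entry = {
        "id": len(version_log) + 1,
        "message": message,
        "metadata": metadata,
        "created_at": datetime.now(timezone.utc),
    }
    version_log.append(entry)
    return entry


def versioned_task(func):
    """Create a version after the wrapped pipeline task finishes successfully."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        result = func(*args, **kwargs)
        create_version(
            message=f"{func.__name__} finished",
            metadata={"task": func.__name__, "result": result},
        )
        return result

    return wrapper


@versioned_task
def load_orders(rows):
    # Pretend to load rows into a table and report how many were written.
    return len(rows)


load_orders([{"order_id": 1}, {"order_id": 2}])
print(version_log[-1]["message"])  # load_orders finished
```

If the task raises an exception, no version is recorded, so every entry in the log corresponds to a completed run.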

One of the challenges in data environments is to not step on the toes of your teammates. Often data assets are treated as a sort of shared folder that anyone can read, write to, or make modifications to.

One way to avoid these problems is to create personal versions of the data when developing. This prevents a change you make from inadvertently affecting a teammate.

There’s a noticeable movement in the data space to adopt the mindset of treating “Data as a Product”. We believe this is a positive trend for data orgs and it requires a “leveling up” of the way data teams operate.

Development best practices like CI/CD, testing, and version control are features you need to be thinking about if you want your data team to confidently take on these types of projects.
