An open source implementation of WAP using Apache Iceberg, Lambdas, and Project Nessie, all running entirely in Python
In this blog post we provide a no-nonsense, reference implementation for Write-Audit-Publish (WAP) patterns on a data lake, using Apache Iceberg as an open table format and Project Nessie as a data catalog supporting git-like semantics.
We chose Nessie because its branching capabilities provide a good abstraction for implementing a WAP design. Most importantly, we chose to build on PyIceberg to eliminate the need for the JVM as far as the developer experience is concerned. In fact, to run the entire project, including the integrated applications, we will only need Python and AWS.
While Nessie is technically built in Java, the data catalog runs as a container on AWS Lightsail in this project, and we are going to interact with it only through its endpoint. Consequently, we can express the entire WAP logic, including the queries downstream, in Python only!
Because PyIceberg is fairly new, quite a few things are actually not supported out of the box. In particular, writing is still in its early days, and branching Iceberg tables is still not supported. So what you will find here is the result of some original work we did ourselves to make branching Iceberg tables in Nessie possible directly from Python.
So all this happened, more or less.
Back in 2017, Michelle Winters from Netflix talked about a design pattern called Write-Audit-Publish (WAP) in data. Essentially, WAP is a functional design aimed at making data quality checks easy to implement before the data becomes available to downstream consumers.
For instance, a typical use case is data quality at ingestion. The flow would look like creating a staging environment and running quality checks on freshly ingested data, before making that data available to any downstream application.
As the name betrays, there are essentially three phases (a minimal code sketch follows the list):
- Write. Put the data in a location that is not accessible to downstream consumers (e.g. a staging environment or a branch).
- Audit. Transform and test the data to make sure it meets the specifications (e.g. check whether the schema suddenly changed, or whether there are unexpected values, such as NULLs).
- Publish. Put the data in the place where consumers can read it from (e.g. the production data lake).
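To make the three phases concrete, here is a minimal Python sketch of the pattern. The helper functions are hypothetical placeholders for whatever storage and validation layer you use; they are not part of the project or of any library:

```python
# Minimal sketch of the Write-Audit-Publish flow.
# The helpers below are hypothetical placeholders, not part of any library.

def write_to_staging(batch, staging_location: str) -> None:
    """Write the incoming batch somewhere downstream consumers cannot see it."""
    ...

def run_audit(staging_location: str) -> bool:
    """Run data quality checks (schema, NULLs, ranges) against the staged data."""
    ...

def publish(staging_location: str, production_location: str) -> None:
    """Promote the staged data to the location consumers read from."""
    ...

def write_audit_publish(batch, staging_location: str, production_location: str) -> None:
    write_to_staging(batch, staging_location)            # Write
    if run_audit(staging_location):                      # Audit
        publish(staging_location, production_location)   # Publish
    else:
        raise ValueError("Audit failed: data was not published")
```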
This is just one example of the possible applications of WAP patterns. It is easy to see how it can be applied at different stages of the data life-cycle, from ETL and data ingestion, to complex data pipelines supporting analytics and ML applications.
Despite being so useful, WAP is still not very widespread, and only recently have companies started thinking about it more systematically. The rise of open table formats and projects like Nessie and LakeFS is accelerating the process, but it is still a bit avant garde.
In any case, it is a very good way of thinking about data and it is extremely useful in taming some of the most widespread problems keeping engineers up at night. So let's see how we can implement it.
We are not going to have a theoretical discussion about WAP, nor will we provide an exhaustive survey of the different ways to implement it (Alex Merced from Dremio and Einat Orr from LakeFS are already doing an outstanding job at that). Instead, we will provide a reference implementation for WAP on a data lake.
👉 So buckle up, clone the repo, and give it a spin!
📌 For more details, please refer to the README of the project.
The idea here is to simulate an ingestion workflow and implement a WAP pattern by branching the data lake and running a data quality test before deciding whether to put the data into the final table in the data lake.
We use Nessie branching capabilities to get our sandboxed environment where data cannot be read by downstream consumers, and AWS Lambda to run the WAP logic.
Essentially, every time a new parquet file is uploaded, a Lambda will run, create a branch in the data catalog and append the data to an Iceberg table. Then, a simple data quality test is performed with PyIceberg to check whether a certain column in the table contains some NULL values.
If the answer is yes, the data quality test fails. The new branch will not be merged into the main branch of the data catalog, making the data impossible to read in the main branch of the data lake. Instead, an alert message is sent to Slack.
If the answer is no, and the data does not contain any NULLs, the data quality test passes. The new branch will thus be merged into the main branch of the data catalog and the data will be appended to the Iceberg table in the data lake for other processes to read.
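To make the flow concrete, here is a hedged sketch of what such a Lambda handler could look like. The branch, merge, and alerting helpers, the local path, and the audited column name are placeholder assumptions for illustration, not the project's actual code; the NULL check is done with PyArrow:

```python
"""Hedged sketch of the ingestion Lambda (not the project's actual code)."""
import pyarrow.parquet as pq

# The audited column is an assumption for illustration.
COLUMN_TO_CHECK = "user_id"


def create_branch(name: str, from_ref: str = "main") -> None:
    """Placeholder: in the real flow this talks to Nessie's REST API."""
    ...


def append_to_branch(batch, branch: str) -> None:
    """Placeholder: append the PyArrow table to the Iceberg table on `branch`."""
    ...


def merge_branch(name: str, into: str = "main") -> None:
    """Placeholder: merge the branch back into main via Nessie."""
    ...


def send_slack_alert(message: str) -> None:
    """Placeholder: post `message` to a Slack webhook."""
    ...


def lambda_handler(event, context):
    # Write: put the freshly uploaded parquet data on a throwaway branch.
    local_path = "/tmp/new_batch.parquet"  # assume the S3 object was downloaded here
    branch = f"ingest-{context.aws_request_id}"
    batch = pq.read_table(local_path)
    create_branch(branch)
    append_to_branch(batch, branch)

    # Audit: the quality test fails if the audited column contains NULLs.
    nulls = batch.column(COLUMN_TO_CHECK).null_count

    # Publish or alert.
    if nulls == 0:
        merge_branch(branch)
        return {"status": "published", "branch": branch}
    send_slack_alert(f"Quality check failed: {nulls} NULLs in '{COLUMN_TO_CHECK}'")
    return {"status": "rejected", "branch": branch}
```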
All data is fully synthetic and is generated automatically by simply running the project. Of course, we provide the possibility of choosing whether to generate data that complies with the data quality specifications or to generate data that includes some NULL values.
To implement the entire end-to-end flow, we are going to use the following components:
This project is pretty self-contained and comes with scripts to set up the entire infrastructure, so it requires only introductory-level familiarity with AWS and Python.
It is also not intended to be a production-ready solution, but rather a reference implementation, a starting point for more complex scenarios: the code is verbose and heavily commented, making it easy to modify and extend the basic concepts to better suit anyone's use cases.
To visualize the results of the data quality test, we provide a very simple Streamlit app that can be used to see what happens when some new data is uploaded to the first location on S3 — the one that is not available to downstream consumers.
We can use the app to check how many rows are in the table across the different branches, and, for the branches other than main, to see in which column the data quality test failed and in how many rows.
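As a rough idea of what such an app might look like, here is a hedged Streamlit sketch (not the project's app). The branch names are placeholders, and `load_table_on_branch` stands in for the project's own PyIceberg/Nessie integration, returning dummy data so the snippet runs on its own:

```python
"""Hedged Streamlit sketch of a branch-inspection app (not the project's app)."""
import pyarrow as pa
import streamlit as st


def load_table_on_branch(branch: str) -> pa.Table:
    """Placeholder: the real project scans the Iceberg table on the given
    Nessie branch via PyIceberg. Here we just return dummy data."""
    values = [10, 20, 30] if branch == "main" else [10, None, 30]
    return pa.table({"id": [1, 2, 3], "value": values})


st.title("WAP branches at a glance")

branch = st.selectbox("Branch", ["main", "ingest-branch-example"])  # placeholder names
table = load_table_on_branch(branch)

st.metric("Rows on this branch", table.num_rows)

# For non-main branches, show how many NULLs each column contains.
if branch != "main":
    null_counts = {name: table.column(name).null_count for name in table.column_names}
    st.write({k: v for k, v in null_counts.items() if v > 0} or "No NULLs found")
```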
Once we have a WAP flow based on Iceberg, we can leverage it to implement a composable design for our downstream consumers. In our repo we provide instructions for a Snowflake integration as a way to explore this architectural possibility.
This is one of the main tenets of the Lakehouse architecture, conceived to be more flexible than modern data warehouses and more usable than traditional data lakes.
On the one hand, the Lakehouse hinges on leveraging object store to eliminate data redundancy and at the same time lower storage cost. On the other, it is supposed to provide more flexibility in choosing different compute engines for different purposes.
All this sounds very interesting in theory, but it also sounds very complicated to engineer at scale. Even a simple integration between Snowflake and an S3 bucket as an external volume is frankly quite tedious.
And in fact, we cannot stress this enough, moving to a full Lakehouse architecture is a lot of work. Like, a lot!
Having said that, even a journey of a thousand miles begins with a single step, so why don't we start by reaching for the lowest hanging fruit, with simple but very tangible practical consequences?
The example in the repo showcases one of these simple use cases: WAP and data quality checks. The WAP pattern here is a chance to move the computation required for data quality checks (and possibly some ingestion ETL) outside the data warehouse, while still keeping the possibility of taking advantage of Snowflake for higher-value analytics workloads on certified artifacts. We hope that this post will help developers build their own proofs of concept and use the reference implementation presented here as a starting point.
The reference implementation proposed here has several advantages:
Tables are better than files
Data lakes are historically hard to develop against, since the data abstractions are very different from those typically adopted in good old databases. Big Data frameworks like Spark first provided the capabilities to process large amounts of raw data stored as files in different formats (e.g. parquet, csv, etc.), but people often don't think in terms of files: they think in terms of tables.
We use an open table format for this reason. Iceberg turns the main data lake abstraction into tables rather than files, which makes things considerably more intuitive. We can now use SQL query engines natively to explore the data and we can count on Iceberg to take care of providing correct schema evolution.
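For instance, once the data is registered as an Iceberg table, exploring it from Python is a table-level operation rather than a file-level one. A minimal sketch with PyIceberg, where the catalog URI and table name are placeholder assumptions to adapt to your setup:

```python
from pyiceberg.catalog import load_catalog

# Placeholder catalog configuration and table name: adjust to your setup.
catalog = load_catalog("default", **{"type": "rest", "uri": "http://localhost:8181"})
table = catalog.load_table("raw.events")

# Scan the table (not individual parquet files) and materialize it as PyArrow.
df = table.scan().to_arrow()
print(df.schema)
```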
Interoperability is good for ya
Iceberg also allows for greater interoperability from an architectural point of view. One of the main benefits of using open table formats is that data can be kept in object store while high-performance SQL engines (Spark, Trino, Dremio) and warehouses (Snowflake, Redshift) can be used to query it. The fact that Iceberg is supported by the majority of computational engines out there has profound consequences for the way we can architect our data platform.
As described above, our suggested integration with Snowflake is meant to show that one can deliberately move the computation needed for the ingestion ETL and the data quality checks outside of the warehouse, and keep the latter for large scale analytics jobs and last mile querying that require high performance. At scale, this idea can translate into significantly lower costs.
Branches are useful abstractions
The WAP pattern requires a way to write data in a location where consumers cannot accidentally read it. Branching semantics naturally provides a way to implement this, which is why we use Nessie to leverage branching semantics at the data catalog level. Nessie builds on Iceberg and on its time travel and table branching functionalities. A lot of the work done in our repo is to make Nessie work directly with Python. The result is that one can interact with the Nessie catalog and write Iceberg tables into different branches of the data catalog without a JVM-based process doing the writing.
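To give a flavor of what talking to the catalog from plain Python looks like, here is a hedged sketch of branch creation and merge over Nessie's REST API with `requests`. The endpoint paths and payloads follow our reading of Nessie's v1 API and are assumptions to be checked against the Nessie documentation for your version; the host and branch names are placeholders:

```python
"""Hedged sketch of driving Nessie over HTTP from Python (no JVM involved).

The endpoint paths below are assumptions based on Nessie's v1 REST API:
verify them against the Nessie docs for the version you run.
"""
import requests

NESSIE_URI = "http://localhost:19120/api/v1"  # placeholder endpoint


def get_branch_hash(name: str) -> str:
    # List references and pick the current hash of the branch we care about.
    refs = requests.get(f"{NESSIE_URI}/trees").json()["references"]
    return next(r["hash"] for r in refs if r["name"] == name)


def create_branch(name: str, source: str = "main") -> None:
    # Create a new branch pointing at the current hash of `source`.
    payload = {"type": "BRANCH", "name": name, "hash": get_branch_hash(source)}
    requests.post(
        f"{NESSIE_URI}/trees/tree",
        params={"sourceRefName": source},
        json=payload,
    ).raise_for_status()


def merge_branch(name: str, into: str = "main") -> None:
    # Merge the audited branch back into main.
    payload = {"fromRefName": name, "fromHash": get_branch_hash(name)}
    requests.post(
        f"{NESSIE_URI}/trees/branch/{into}/merge",
        params={"expectedHash": get_branch_hash(into)},
        json=payload,
    ).raise_for_status()
```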
A simpler developer experience
Finally, making the end-to-end experience completely Python-based remarkably simplifies setting up the system and interacting with it. Any other system we are aware of would require a JVM or an additional hosted service to write back into Iceberg tables on different branches, while in this implementation the entire WAP logic can run inside one single Lambda function.
There is nothing inherently wrong with the JVM. It is a fundamental component of many Big Data frameworks, providing a common API to work with platform-specific resources, while ensuring security and correctness. However, the JVM takes a toll from a developer experience perspective. Anybody who has worked with Spark knows that JVM-based systems tend to be finicky and fail with mysterious errors. For many people who work in data and consider Python their lingua franca, the advantage of the JVM is paid in the coin of usability.
We hope more people are excited about composable designs like we are, we hope open standards like Iceberg and Arrow will become the norm, but most of all we hope this is useful.
So it goes.