Autonomous Solution for Storing & Querying Complex Data from Variable Schemas


At Illumina – the world leader in DNA sequencing systems – the data systems to support the manufacturing of highly complex DNA sequencers are as varied as the products we make.  In particular, massive amounts of quality control (QC) data – with almost unlimited schemas - are produced by a vast range of measuring instruments, vendors, and subcontractors...and those schemas are constantly changing.  Storing that data using traditional, tightly-coupled relational tables and fragile ETL would have been chaos and unsustainable, and prior design attempts produced exactly that - data chaos. 

The challenge was to model and design a system that could store the data with full integrity, in a scalable and easy-to-query and highly performant fashion, and do it autonomously – no code or structural changes would be needed to support new schemas.  As you’ll see in the presentation, it was the “easy-to-query” part that was the hardest to implement, and also the reason for this presentation.

A novel, Postgres-based object-relational model using the JSONB data type was created and prototyped, and after a successful trial, became the de facto destination for QC data. It has been running 24x7 for 2 years.

But what about querying that JSON data?  It’s not easy, with  special operators and syntax or using R and taking a huge performance hit.  Instead, using Postgres’s plpgsql and stored functions,  I built code that auto-generates a view layer that includes each schema stored in the json and also generates new views on the fly as new schemas appear.  It exposes that json data as “tables” (Postgres views), which, in our situation, connects to a data virtualization layer for consumption by analysts and data scientists.  This solution will accept “any” data and store it in 3 proper relational tables…with a twist.  I will detail the insanely simple, scalable, and highly performant solution, end to end including the source code, and document why it may be an appropriate solution for teams looking to design an autonomous solution for variable data.

Room 105
Thursday, March 5, 2020 - 10:30 to 11:15