Smashing Bias with Dynamic Data Versioning
Today, AI algorithms are everywhere. They are used in finance, biomedicine, the automotive industry, and beyond. They help make decisions about our health, whether we get a loan, and whether a resume makes it past the HR screen to the hiring manager. AI can predict your chances of surviving cancer, decide whether you go to jail or get bail, and predict whether you'll be a good employee.

But these predictions are far from flawless. How do we even know whether an employee is good? Maybe our assumption that a good employee is someone who always follows the rules perfectly sounds sensible, yet in reality it is the exact opposite of the traits we see in our actual top performers.

How can we protect ourselves from AI's mistakes? Increasingly, the answer to these challenges with machine learning models is data versioning. A machine learning model is only as good as the data you feed it, and data versioning helps ensure data integrity and a bias-free decision-making process. What if you discover later that your training data was biased? How can you trace everything that happened from training to production to find answers and improve your results? Data scientists need a reliable way to go back in time and retrace their steps.
Versioning large data sets requires a dedicated tool that fits modern infrastructure. While your code can live in a Git-based system, you also need a reliable way to track changes to data and reproduce the results of your computations. Preserving the connections between versions of code, data, and model is an even harder task. In this talk, we will discuss the problems that arise when data changes are not tracked and the drastic consequences that can follow. We will also look at possible solutions that provide version control for training data sets and run on modern infrastructure such as Kubernetes and Kubeflow.
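The core idea behind most data versioning tools (including Git-based ones and systems like DVC) is content addressing: a dataset's version id is the hash of its bytes, and each training run records which code, data, and model versions were linked together. Here is a minimal sketch of that idea in Python; the `snapshot` and `record_run` helpers are hypothetical illustrations, not the API of any real tool:

```python
import hashlib
import json
from pathlib import Path


def snapshot(data: bytes, store: Path) -> str:
    """Content-address a dataset: the SHA-256 digest becomes its version id."""
    digest = hashlib.sha256(data).hexdigest()
    store.mkdir(parents=True, exist_ok=True)
    # Identical data always hashes to the same id, so snapshots deduplicate.
    (store / digest).write_bytes(data)
    return digest


def record_run(manifest: Path, code_rev: str, data_version: str, model_id: str) -> None:
    """Append one training run, preserving the code/data/model linkage."""
    runs = json.loads(manifest.read_text()) if manifest.exists() else []
    runs.append({"code": code_rev, "data": data_version, "model": model_id})
    manifest.write_text(json.dumps(runs, indent=2))
```

With helpers like these, reproducing a past result reduces to looking up a run in the manifest, checking out the recorded code revision, and fetching the dataset object by its hash:

```python
data_v1 = snapshot(b"age,label\n34,1\n", Path("store"))
record_run(Path("runs.json"), "git:abc123", data_v1, "model-001")
```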