How We Migrated RCSB.org at the San Diego Supercomputer Center to Kubernetes
This presentation is divided into the following sections:
- Motivation (10 min)
- Challenges (15 min)
- Design decisions (15 min)
- Conclusions and future plans (5 min)
Motivation covers introductory information: who we are and what our mission is. In short, we are a team within the San Diego Supercomputer Center responsible for the rcsb.org website, which is a gateway to the protein structure databank known as the RCSB PDB. Using our website, as well as headless public APIs, users across the globe can access and, most importantly, perform sophisticated searches on experimental protein structures, i.e. those determined by conventional experimental methods such as X-ray crystallography, as well as the newly added computed structure models, i.e. those predicted by dedicated AI algorithms such as AlphaFold by DeepMind (Google UK). In total we maintain ~1.2M structures (as of 14 Nov 2022) and serve ~750K external requests daily. Our software is architected around microservices, refactored from a monolith, and hence carries some legacy and historical burden.
Our main goals in switching to K8s are: retire OpenStack; avoid in-house solutions; limit the human factor; leverage Kubernetes features (development sandboxes, canary releases, etc.); and end up with a more integrated solution, i.e. GitHub + Actions + K8s. In other words, we want to get rid of the technical debt we accumulated over the years working with OpenStack, including the many in-house software components built for performing daily routines, and switch to a much more flexible and robust system for both DevOps and developers.
The next section is dedicated to the challenges we faced during the platform design stage and the actual migration. We chose a smooth transition in which services are gradually migrated from OpenStack to K8s. This gives us greater control if something goes wrong; however, such a hybrid setup implies some difficulties along the way. The main concern for us was to preserve our internal analytics engine and pipelines, which are based on custom Python scripts and Elastic index ingests. Infrastructure-related issues arise because we are coastally distributed: one data center is here in San Diego, while the other is at Rutgers University, NJ. Also, the team, especially the developers, had very limited prior experience even with Docker, let alone K8s. We therefore chose a set of developer tools, specifically Telepresence, IntelliJ, Helm, and Skaffold. Switching to a new CI/CD pipeline was also a challenge. Finally, we were planning for the upcoming integration of many more computed structure models (at the time of writing, Meta announced a release of 600M newly computed model structures, and there are other databases with millions of structures).
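As an illustration of why Telepresence made the tool list: it lets a developer run one service locally while the rest of the stack stays in the cluster. The session below is a sketch, not our actual setup; the service name "search-service" and port are hypothetical, and it assumes Telepresence v2 with a kubeconfig pointing at a development cluster.

```shell
# Connect the workstation to the dev cluster's network (assumed kubeconfig).
telepresence connect

# Route cluster traffic for the hypothetical "search-service" to a locally
# running instance listening on port 8080, so it can be debugged in the IDE.
telepresence intercept search-service --port 8080

# Tear down when finished.
telepresence leave search-service
telepresence quit
```

While the intercept is active, the local process receives real requests from the cluster, which is what makes in-IDE debugging against production-like traffic possible.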
As closing words for the section on challenges, some troubleshooting tactics we discovered along the way will be presented. For example, an applied Deployment does not start any pods: in fact, the ReplicaSet, not the Deployment, is responsible for creating pods, so one has to look for issues with the ReplicaSet, which may be counter-intuitive at first.
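A minimal diagnostic sequence for that situation might look like the following; the deployment name "myapp" and the "app=myapp" label are hypothetical, and the commands assume kubectl access to the affected namespace.

```shell
# Deployment shows 0 ready replicas, but the Deployment object itself
# only manages ReplicaSets -- it never creates pods directly.
kubectl get deployment myapp

# The ReplicaSet is what actually creates pods, so scheduling errors,
# quota violations, or image-pull failures surface in its events.
kubectl describe replicaset -l app=myapp

# Recent namespace events, oldest first, often point at the root cause.
kubectl get events --sort-by=.metadata.creationTimestamp
```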
To meet all of these challenges we had to design and implement a number of decisions, which this section of the presentation covers in detail. Namely: cluster federation vs. multiple physical nodes in one cluster vs. multiple clusters, with regard to the coastal distribution of our system; how we minimized changes in the logging/analytics pipeline; how switching to self-hosted GitHub Actions opens up interesting opportunities, such as customized Docker images for job runners; and how we ended up using Helm + Skaffold for automated deployment and other routines.
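To give a flavor of the Helm + Skaffold combination, here is an illustrative config fragment: Skaffold builds the image and hands it to a Helm release. The image name, registry, and chart path are assumptions for the sketch, not our actual configuration.

```yaml
# Illustrative skaffold.yaml -- names and paths are hypothetical.
apiVersion: skaffold/v2beta29
kind: Config
build:
  artifacts:
    - image: registry.example.org/rcsb/search-service
deploy:
  helm:
    releases:
      - name: search-service
        chartPath: charts/search-service
        artifactOverrides:
          # Inject the freshly built image tag into the chart's values.
          image: registry.example.org/rcsb/search-service
```

With a file like this, `skaffold run` performs a one-off build-and-deploy, while `skaffold dev` watches the source tree and redeploys on every change, which is what makes per-developer sandboxes practical.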
Finally, this presentation will summarize the design of the K8s platform we ended up with, the current migration progress, and which KPIs have already improved significantly, such as setting up a development environment and performing user-experience monitoring. At the time of writing this abstract, some challenges have not yet been addressed; the presentation will bullet-point them.
Very briefly, future plans: address the remaining challenges, and test and migrate more services.