Little Big Data
What do you do when your data starts from nothing, grows slowly, and never stops? What better place to learn to scale than SCaLE? Big Data seems to spring fully formed from the Hadoop cluster that holds it, but the details reveal a confusing array of approaches. Starting out seems daunting with so many exciting, highly hyped technologies, from myriad individual programs to entire frameworks that provide containers, distributed and object storage, parallel computation, and infrastructure and configuration management. Deconstructing these options into their simpler counterparts provides a way to progress incrementally.
Architecture both depends on and determines budget, project, and time constraints. When do you plan for the next level of hardware and software: memory/data structures/databases, processors/algorithms, storage/filesystems, racks/cloud? The low-level pieces (bash, C, Linux) stay with you the whole way, while the top-level pieces (data stores, frameworks, distributions) change often. Data size and technical debt grow in parallel, so agile-influenced late binding proves prudent.
Differences in data across analysis domains (clinical, financial, research, etc.) will influence decisions about data schema, algorithms (data- vs. hypothesis-driven), interfaces (publicly accessible or internal), and performance needs (HPC, bare-metal, cloud/grid). Despite these disparities, Linux provides the kernel of every good solution. Each year SCaLE presents the best programs and practices to carry your project forward. When a distributed filesystem requires more bandwidth than the budget allows, that Ethernet bonding talk from SCaLE 9x finally finds its use.
Customizing and combining a decade of SCaLE talks and applying them to the study of -omics (connectome, genome, microbiome) led to the Pain and Interoceptive Neuroimaging Repository, a scalable system for open collaboration by researchers around the world. Discussion of the decisions involved (Ceph-GlusterFS-pNFS, SGE-VM-Docker, RDBMS-NoSQL) will provide an overview of all aspects of this system as it scales from a single server to petabytes of data and analyses. Timelines of data growth, technology maturation and obsolescence, and development pace will show how to succeed over the long term. In the era of "failing fast" there are still old technologies (bash 1989, xfs 1996) integral to modern systems, and some whose arrival we still await (btrfs, real-time Linux).