HPC, Big Data, and Data Science

HPC on OpenStack

the good, the bad and the ugly

February 2, 2020
3:30 PM – 3:55 PM

UB5.132

Ümit Seren

HPC systems have traditionally been operated as monolithic installations on bare-metal hardware, used primarily by users with a computational background to submit classic batch jobs. However, the commoditization of compute resources and the introduction of new scientific fields such as the life sciences to high performance computing have caused a shift in this paradigm. Today, an increasing amount of biological software is made accessible through web portals. This improved ease of use has led to a democratization of access to computational resources. Users in those fields don’t have the same computational knowledge as traditional HPC users from physics or chemistry, and they require different kinds of workloads and applications that don’t fit traditional non-interactive batch scheduling resource management systems. Additionally, cloud computing is becoming more and more relevant, and various efforts to lift HPC into the cloud have been started.
We manage the HPC infrastructure for 3 life science and 2 particle physics institutions at the Vienna Bio Center (VBC). For the new HPC system that was procured at the end of 2018, we decided to go with an on-prem cloud framework based on OpenStack to accommodate the various emerging workflows and programs. OpenStack is not a finished product and requires a considerable amount of engineering. It took us around 2 years of testing and engineering to feel confident in deploying the new HPC infrastructure on top of OpenStack. Since summer 2019, our 200-node production SLURM cluster has been running on top of VMs in OpenStack.
In this talk we want to share our experiences from our endeavor into HPC on OpenStack. We want to briefly discuss the reasoning behind HPC in the cloud, and specifically OpenStack. Oftentimes these kinds of projects either completely fade away in case of failure or get published in a high-level white paper that is only useful as marketing material.
We want to share our honest experience from both the implementer and the operator perspective. We discuss how we use 3 environments to test updates and configuration changes. We will also explain our approach to automation and infrastructure as code, all the way from the underlying infrastructure to the SLURM payload, and how we keep our sanity using development procedures around pull requests and code reviews. We will also share some stories from the trenches, such as why you still learn new things about OpenStack after 1000 deploys, or how you can discover that a simple config change can destroy performance. This talk will contain information that you won’t find in success stories or white papers but that is hopefully very helpful for anyone who considers deploying HPC on OpenStack.

Additional information

Type	devroom

More sessions

2/2/20	Introducing HPC with a Raspberry Pi cluster HPC, Big Data, and Data Science Colin Sauze UB5.132 This talk will discuss the development of a RaspberryPi cluster for teaching an introduction to HPC. The motivation for this was to overcome four key problems faced by new HPC users: The availability of a real HPC system and the effect running training courses can have on the real system, conversely the availability of spare resources on the real system can cause problems for the training course. A fear of using a large and expensive HPC system for the first time and worries that doing something ...
2/2/20	Building an open source data lake at scale in the cloud HPC, Big Data, and Data Science Adrian Woodhead UB5.132 This presentation will give an overview of the various tools, software, patterns and approaches that Expedia Group uses to operate a number of large scale data lakes in the cloud and on premise. The data journey undertaken by the Expedia Group is probably similar to many others who have been operating in this space over the past two decades - scaling out from relational databases to on premise Hadoop clusters to a much wider ecosystem in the cloud. This talk will give an overview of that journey ...
2/2/20	Magic Castle: Terraforming the Cloud for HPC HPC, Big Data, and Data Science Félix-Antoine Fortin UB5.132 Compute Canada provides HPC infrastructures and support to every academic research institution in Canada. In recent years, Compute Canada has started distributing research software to its HPC clusters using the CERN software distribution service, CVMFS. This opened the possibility of accessing the software from almost any location and therefore allows the replication of the Compute Canada experience outside of its physical infrastructure. From these new possibilities emerged an open-source ...
2/2/20	Maggy: Asynchronous distributed hyperparameter optimization based on Apache Spark HPC, Big Data, and Data Science Moritz Meister UB5.132 Maggy is an open-source framework built on Apache Spark, for asynchronous parallel execution of trials for machine learning experiments. In this talk, we will present our work to tackle search as a general purpose method efficiently with Maggy, focusing on hyperparameter optimization. We show that an asynchronous system enables state-of-the-art optimization algorithms and allows extensive early stopping in order to increase the number of trials that can be performed in a given period of time on ...
2/2/20	Snorkel Beambell - Real-time Weak Supervision on Apache Flink HPC, Big Data, and Data Science Suneel Marthi UB5.132 The advent of Deep Learning models has led to a massive growth of real-world machine learning. Deep Learning allows Machine Learning Practitioners to get the state-of-the-art score on benchmarks without any hand-engineered features. These Deep Learning models rely on massive hand-labeled training datasets which is a bottleneck in developing and modifying machine learning models. Most large scale Machine Learning systems today like Google’s DryBell use some form of Weak Supervision to construct ...
2/2/20	Efficient Model Selection for Deep Neural Networks on Massively Parallel Processing Databases HPC, Big Data, and Data Science Frank McQuillan UB5.132 In this session we will present an efficient way to train many deep learning model configurations at the same time with Greenplum, a free and open source massively parallel database based on PostgreSQL. The implementation involves distributing data to the workers that have GPUs available and hopping model state between those workers, without sacrificing reproducibility or accuracy. Then we apply optimization algorithms to generate and prune the set of model configurations to try.
2/2/20	Predictive Maintenance HPC, Big Data, and Data Science UB5.132 Predictive maintenance and condition monitoring for remote heavy machinery are compelling endeavors to reduce maintenance cost and increase availability. Beneficial factors for such endeavors include the degree of interconnectedness, availability of low cost sensors, and advances in predictive analytics. This work presents a condition monitoring platform built entirely from open-source software. A real world industry example for an escalator use case from Deutsche Bahn underlines the advantages ...

FOSDEM 2020

2/1/20 – 2/2/20
