HPC, Big Data, and Data Science

Distributed Join Algorithms in CrateDB

How We Made Distributed Joins 23 Thousand Times Faster
D.hpc
Marija Selakovic
The join operator is one of the standard operations in relational databases. In a large-scale distributed scenario, implementing joins efficiently poses unique challenges because the data is usually spread across a cluster of machines rather than stored on a single machine. The goal of this talk is to present our approach to implementing distributed joins in the CrateDB database, which yields significant performance improvements over the existing algorithms. In the first part of the talk, we will cover the limitations of the nested loop and block nested loop join algorithms. The second part will show how the hash join algorithm can work in a distributed setting by addressing some of its memory limitations. Finally, we will introduce the distributed block hash join algorithm and show how it enables CrateDB to analyze massive amounts of data 23 thousand times faster than the initial nested loop implementation.
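To make the contrast concrete, here is a minimal single-node sketch in Python (illustrative only, not CrateDB's actual Java implementation; the example tables and the join key are made up). It compares a nested loop equi-join, which examines every pair of rows, with a basic hash join, which builds a hash table on one input and probes it with the other:

from collections import defaultdict

def nested_loop_join(left, right, key):
    # O(len(left) * len(right)): every row of one table is compared
    # against every row of the other.
    out = []
    for l in left:
        for r in right:
            if l[key] == r[key]:
                out.append({**l, **r})
    return out

def hash_join(left, right, key):
    # O(len(left) + len(right)): build a hash table on one side
    # (ideally the smaller one), then probe it with the other side.
    index = defaultdict(list)
    for l in left:
        index[l[key]].append(l)
    out = []
    for r in right:
        for l in index.get(r[key], []):
            out.append({**l, **r})
    return out

users = [{"id": 1, "name": "ada"}, {"id": 2, "name": "bob"}]
orders = [{"id": 1, "total": 10}, {"id": 1, "total": 7}, {"id": 3, "total": 5}]
assert nested_loop_join(users, orders, "id") == hash_join(users, orders, "id")

Both functions return the same rows; the difference is that the hash join avoids the quadratic pairwise comparison, which is exactly what makes it attractive for large tables.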
In relational databases, join operators are usually implemented with nested loop or block nested loop algorithms. In a large-scale distributed scenario, however, efficiently querying massive amounts of data becomes challenging, and traditional join implementations are no longer enough to achieve high performance for complex joins in distributed data processing systems. For instance, the nested loop algorithm is relatively simple to implement and can easily be adapted to execute distributed joins. Unfortunately, it comes with a high performance cost: quadratic time complexity with respect to the number of rows in the two joined tables.

This talk will show our approach to implementing a distributed equi-join operator in CrateDB that exhibits significant performance improvements compared to the original nested loop algorithm. CrateDB is an open-source, distributed SQL database that runs queries on millions of data records daily. It scales up to hundreds of nodes and petabytes of indexed data, which makes the performance of join operators highly important: efficient algorithms that scale with the input size are required.

More specifically, we explore the implementation of the distributed block hash join algorithm. First, we address the memory limitations of the basic hash join algorithm by switching to block-based processing. Block-based processing refers to dividing a large dataset into smaller blocks that can be worked on separately. As those blocks can be distributed across the CrateDB cluster, the join can be executed in parallel on multiple nodes for increased performance and load distribution. Second, we illustrate the changes to the single-node block hash join algorithm that enable its distributed execution.

To evaluate the performance of the distributed block hash join algorithm, we ran CrateDB benchmarks against two algorithms: the original nested loop algorithm and the single-node block hash join algorithm. The benchmark consists of queries with join operators and runs on tables of various sizes, up to 50 million rows. The final result shows that the distributed block hash join algorithm enables CrateDB to analyze massive amounts of data 23 thousand times faster than the initial nested loop implementation.
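As a rough sketch of the idea (simplified, again in Python rather than CrateDB's Java code; the node count, block size, and helper names are made up for illustration), both inputs can be partitioned by a hash of the join key so that matching rows always land on the same node, and each node can then build its hash tables one bounded block at a time:

from collections import defaultdict
from itertools import islice

NUM_NODES = 4      # hypothetical cluster size
BLOCK_SIZE = 1000  # rows per block; bounds the memory of each build phase

def partition_by_key(rows, key, num_nodes):
    # Route each row to a node based on a hash of its join key, so rows
    # with equal keys always end up in the same partition.
    parts = defaultdict(list)
    for row in rows:
        parts[hash(row[key]) % num_nodes].append(row)
    return parts

def blocks(rows, size):
    # Yield fixed-size blocks of rows.
    it = iter(rows)
    while True:
        block = list(islice(it, size))
        if not block:
            return
        yield block

def block_hash_join(left, right, key):
    # Join one partition: build a hash table for each block of 'left'
    # (memory stays bounded by the block size) and probe it with 'right'.
    out = []
    for block in blocks(left, BLOCK_SIZE):
        table = defaultdict(list)
        for l in block:
            table[l[key]].append(l)
        for r in right:
            for l in table.get(r[key], []):
                out.append({**l, **r})
    return out

def distributed_block_hash_join(left, right, key, num_nodes=NUM_NODES):
    # Partition both inputs and join each partition; in a real cluster
    # the per-partition joins would run in parallel on separate nodes.
    left_parts = partition_by_key(left, key, num_nodes)
    right_parts = partition_by_key(right, key, num_nodes)
    result = []
    for node in range(num_nodes):
        result.extend(block_hash_join(left_parts.get(node, []),
                                      right_parts.get(node, []),
                                      key))
    return result

Partitioning on the join key is what lets every node compute its share of the result without seeing the other partitions, while the per-block build phase keeps each hash table within a fixed memory budget.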

Additional information

Type: devroom

More sessions

2/5/22
HPC, Big Data, and Data Science
Olena Kutsenko
D.hpc
Working with Big Data means that we need tools to organise and understand the data. And you don’t have to be a developer to search, aggregate and visualise your data. Whether you need an affordable business analytics tool or you want to analyse log data in near real time, OpenSearch can help you. And all of it through the visual interface of OpenSearch Dashboards. After listening to this talk you’ll understand the basics of working with an OpenSearch cluster and different use cases ...
2/5/22
HPC, Big Data, and Data Science
Max Meldrum
D.hpc
In this talk, I will present Arcon, a Rust-native streaming runtime that integrates seamlessly with the Apache Arrow ecosystem. The Arcon philosophy is streaming first, similarly to systems such as Apache Flink and Timely Dataflow. However, unlike all existing systems, Arcon features great flexibility when it comes to its application state. Arcon's TSS query language allows extracting and operating on state snapshots consistently based on application-time constraints and interfacing with ...
2/5/22
HPC, Big Data, and Data Science
D.hpc
Any conversation about Big Data would be incomplete without talking about Apache Kafka and Apache Flink: the winning open source combination for high-volume streaming data pipelines. In this talk we'll explore how moving from long running batches to streaming data changes the game completely. We'll show how to build a streaming data pipeline, starting with Apache Kafka for storing and transmitting high throughput and low latency messages. Then we'll add Apache Flink, a distributed ...
2/5/22
HPC, Big Data, and Data Science
John Garbutt
D.hpc
Why build #4 on the Green500 using OpenStack? It makes it easier to manage. Cambridge University started using OpenStack in 2015. Since mid-2020, all new hardware is controlled using OpenStack. Compute nodes, GPU nodes, Lustre nodes, Ceph nodes, almost everything. OpenStack allows large baremetal slurm clusters and dedicated TRE (trusted research environments) to share the same images. Is this a cloud native supercomputer?
2/5/22
HPC, Big Data, and Data Science
Christian Kniep
D.hpc
This short talk will dissect the container ecosystem for HPC into four segments and discuss what to look out for, what is already settled, and how to navigate containers in 2022.
2/5/22
HPC, Big Data, and Data Science
D.hpc
Optimizing CPU management improves cluster performance and security, but is daunting to almost everyone. CPU management may seem complex, but it can be explained in such a way that even your inner toddler will comprehend. With this talk, we will give a path to success. You may have a multi-socket node cluster where your AI/ML workloads care about the proximity of your CPUs to GPUs. You may be running scientific workloads where you want to pin cores within containers instead of just ...
2/5/22
HPC, Big Data, and Data Science
Trevor Grant
D.hpc
Working with big data matrices is challenging. Kubernetes allows users to scale elastically, but a pod can only be as large as a node, which may not be large enough to fit the matrix in memory. While Kubernetes allows other paradigms on top of it that let pods coordinate on individual jobs, setting them up and making them play nicely with ML platforms is not straightforward. Using Apache Spark and Apache Mahout we can work with matrices of any dimension and distribute them across ...