Lightning Talks

Apache DataSketches

A Production Quality Sketching Library for the Analysis of Big Data
H.2215 (Ferrer)
Claude Warren
In the analysis of big data there are often problem queries that don’t scale because they require huge compute resources to generate exact results, or don’t parallelize well. Examples include count-distinct, quantiles, most frequent items, joins, matrix computations, and graph analysis. Algorithms that can produce accuracy guaranteed approximate answers for these problem queries are a required toolkit for modern analysis systems that need to process massive amounts of data quickly. For interactive queries there may not be other viable alternatives, and in the case of real-time streams, these specialized algorithms, called stochastic, streaming, sublinear algorithms, or 'sketches', are the only known solution. This technology has helped Yahoo successfully reduce data processing times from days to hours or minutes on a number of its internal platforms and has enabled subsecond queries on real-time platforms that would have been infeasible without sketches. This talk provides a short introduction to sketching and to Apache DataSketches, an open source library of these algorithms designed for large production analysis systems.
Fast: Sketches are fast. The sketch algorithms in this library process data in a single pass and are suitable for both real-time and batch processing. Sketches enable streaming computation of set expression cardinalities, quantiles, frequency estimation and more. This simplifies system architecture and allows fast queries for computational tasks that were previously difficult.
Big Data Platforms: This library has been specifically designed for big data platforms. Included are adaptors for Hadoop Pig, Hive, Spark, Druid, and PostgreSQL, which can also be used as examples for other systems, along with many other capabilities typically required in big data analysis systems, for example a Memory package for managing large off-heap memory data structures. Our sketch library is implemented in Java, C++ and Python and provides binary compatibility across languages and platforms. Some of our sketches provide off-Java-heap capability, which dramatically improves performance in large systems. Our APIs provide a rich set of options for fine-tuning the performance parameters that are particularly important for large systems.
Analysis: Built-in Theta Sketch set operators (Union, Intersection, Difference) produce sketches as a result (and not just a number), enabling full set expressions of cardinality, such as ((A ∪ B) ∩ (C ∪ D)) \ (E ∪ F). This capability, along with predictable and superior accuracy (compared with Include/Exclude approaches), enables unprecedented analysis capabilities for fast queries.
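The idea of set operators that return sketches rather than numbers can be illustrated with a toy example. The following is a minimal sketch in pure Python of a "k minimum values" style estimator, written only to show the principle; it is not the Apache DataSketches API, and the names ToyThetaSketch, unit_hash, sketch_of, union, intersection and a_not_b are hypothetical, chosen for this illustration. It assumes SHA-256 output behaves like a uniform random value in [0, 1).

```python
import hashlib


def unit_hash(item):
    """Map an item to a pseudo-random float in [0, 1) using SHA-256 (assumption)."""
    digest = hashlib.sha256(repr(item).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


class ToyThetaSketch:
    """Toy illustration only: keeps at most k hash values below a threshold theta."""

    def __init__(self, k=1024, theta=1.0, hashes=()):
        self.k = k
        self.theta = theta
        self.hashes = set(hashes)
        self._trim()

    def _trim(self):
        # Keep the k smallest hashes; the (k+1)-th smallest becomes the new theta.
        if len(self.hashes) > self.k:
            ordered = sorted(self.hashes)
            self.theta = ordered[self.k]
            self.hashes = set(ordered[: self.k])

    def update(self, item):
        # Single pass: each item is hashed once and kept only if below theta.
        h = unit_hash(item)
        if h < self.theta:
            self.hashes.add(h)
            self._trim()

    def estimate(self):
        # Retained hashes are a uniform sample of the distinct items at rate theta.
        return len(self.hashes) / self.theta


def sketch_of(items, k=1024):
    sketch = ToyThetaSketch(k)
    for item in items:
        sketch.update(item)
    return sketch


# Each operator returns a new sketch, so results can feed further operators.
def union(a, b):
    theta = min(a.theta, b.theta)
    kept = {h for h in a.hashes | b.hashes if h < theta}
    return ToyThetaSketch(min(a.k, b.k), theta, kept)


def intersection(a, b):
    theta = min(a.theta, b.theta)
    kept = {h for h in a.hashes & b.hashes if h < theta}
    return ToyThetaSketch(min(a.k, b.k), theta, kept)


def a_not_b(a, b):
    theta = min(a.theta, b.theta)
    kept = {h for h in a.hashes - b.hashes if h < theta}
    return ToyThetaSketch(min(a.k, b.k), theta, kept)


if __name__ == "__main__":
    a = sketch_of(range(0, 20000))
    b = sketch_of(range(10000, 30000))
    c = sketch_of(range(25000, 40000))
    # (A ∪ B) ∩ C, composed from operators that return sketches
    print(round(intersection(union(a, b), c).estimate()))
    print(round(a_not_b(a, b).estimate()))
```

Because every operator returns a sketch with its own theta and retained hashes, expressions can be nested to any depth. In this toy version the estimates are approximate, and the error grows as the result set becomes small relative to the inputs; a production library provides error bounds and far more efficient data structures.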

Additional information

Type lightningtalk

More sessions

2/1/20
Lightning Talks
Matthias Kirschner
H.2215 (Ferrer)
More and more traditional processes in our society now incorporate, and are influenced by, software.
2/1/20
Lightning Talks
Mikel Cordovilla
H.2215 (Ferrer)
OpenOlitor is a SaaS open-source tool facilitating the organization and management of CSA (Community Supported Agriculture) communities. This tool covers a large spectrum of functionalities needed for CSAs such as member management, emailing, invoicing, share planning and delivery, absence scheduling, etc. This software is organized and monitored by an international community that promotes the tool, helps operate it and supports the interested communities. In order to promote the sustainability ...
2/1/20
Lightning Talks
Pierre Slamich
H.2215 (Ferrer)
Open Food Facts is a collaborative and crowdsourced database of food products from the whole planet, licensed under the Open Database License (ODbL). It was launched in 2012, and today it is powered by 27000 contributors who have collected data and images for over 1 million products in 178 countries (and growing strong…). This is the opportunity to learn more about Open Food Facts and the latest developments of the project.
2/1/20
Lightning Talks
Bruno Škvorc
H.2215 (Ferrer)
For as long as human society has existed, humans have been unable to trust each other. For millennia, we relied on middlemen to establish business or legal relationships. With the advent of Web2.0, we also relayed the establishment of personal connections, and the system has turned against us. The middlemen abuse our needs and their power and we find ourselves chained to convenience at the expense of our own thoughts, our own privacy. Web3 is a radical new frontier ready to turn the status quo ...
2/1/20
Lightning Talks
Atlas Engineer
H.2215 (Ferrer)
While current browsers expose their internals through an API and limit access to the host system, Next doesn't, allowing for infinite extensibility and inviting users to program their web browser. On top of that, it doesn't tie itself to a particular platform (we currently provide bindings to WebKit and WebEngine) and allows for live code reloads, thanks to the Common Lisp language, about which we'll share our experience too.
2/1/20
Lightning Talks
Michal Čihař
H.2215 (Ferrer)
Please note that this talk will now be given by Michal Čihař instead of Václav Zbránek. You will learn how to localize your project easily and with little effort, the open-source way. No more repetitive work, no more manual work with translation files. Weblate is unique for its tight integration with VCS. Set it up once and start engaging the community of translators. More translated languages means more happy users of your software. Be like openSUSE, Fedora, and many more, and speak your users' ...
2/1/20
Lightning Talks
Roberto Abdelkader Martínez Pérez
H.2215 (Ferrer)
This talk is about "Kapow!", an open source web framework for the shell developed by BBVA Innovation Labs. We will talk about the current development of the project, including an overview of Kapow!'s technology stack and the recent release of the first stable version.