Session
FOSDEM 2021 Schedule
Monitoring and Observability

Production Machine Learning Monitoring: Outliers, Drift, Explainers & Statistical Performance

D.monitoring
Alejandro Saucedo
The lifecycle of a machine learning model only begins once it's in production. In this talk we provide a practical deep dive into the best practices, principles, patterns and techniques around production monitoring of machine learning models. We will cover standard microservice monitoring techniques applied to deployed machine learning models, as well as more advanced paradigms for monitoring machine learning models through concept drift detection, outlier detection and explainability. We'll dive into a hands-on example, where we will train an image classification machine learning model from scratch, deploy it as a microservice in Kubernetes, and introduce advanced monitoring components as architectural patterns. These monitoring techniques will include AI Explainers, Outlier Detectors, Concept Drift Detectors and Adversarial Detectors. We will also look at high-level architectural patterns that abstract these complex and advanced monitoring techniques into infrastructural components that scale, introducing the standardised interfaces required to enable monitoring across hundreds or thousands of heterogeneous machine learning models.
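The abstract does not say which algorithms the talk uses, so the following is only an illustrative sketch of two of the ideas it names, not the speaker's implementation. It assumes a single numeric feature (or model score), flags outliers with a z-score against reference data, and flags distribution drift with a two-sample Kolmogorov-Smirnov statistic compared against the usual asymptotic critical value. This detects drift in an observed distribution, which is only a proxy for concept drift in the strict sense. The function names `zscore_outliers`, `ks_statistic` and `drift_detected`, and the default thresholds, are hypothetical.

```python
import math
from statistics import mean, pstdev


def zscore_outliers(reference, batch, threshold=3.0):
    """Flag values in `batch` that lie more than `threshold` standard
    deviations from the mean of the `reference` data."""
    mu = mean(reference)
    sigma = pstdev(reference)
    if sigma == 0:
        # Degenerate reference data: anything different from it is an outlier.
        return [x != mu for x in batch]
    return [abs(x - mu) / sigma > threshold for x in batch]


def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap between
    the empirical CDFs of samples `a` and `b`."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        # Advance both pointers past every value <= x, then compare the CDFs.
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d


def drift_detected(reference, live, alpha=0.05):
    """Report drift when the KS statistic exceeds the asymptotic critical
    value for significance level `alpha` (an approximation that is rough
    for small samples)."""
    n, m = len(reference), len(live)
    critical = math.sqrt(-math.log(alpha / 2) / 2) * math.sqrt((n + m) / (n * m))
    return ks_statistic(reference, live) > critical


if __name__ == "__main__":
    reference = [i / 100 for i in range(100)]   # stand-in for training-time values
    shifted = [1 + i / 100 for i in range(100)]  # stand-in for drifted live values
    print(zscore_outliers(reference, [0.5, 5.0]))
    print(drift_detected(reference, reference))
    print(drift_detected(reference, shifted))
```

In a deployment like the one the abstract describes, checks of this kind would typically run as separate components alongside the model service, so that the same interface can be reused across many models.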

Additional information

Type: devroom

More sessions

2/7/21
Monitoring and Observability
Richard Hartmann
D.monitoring
Our customary welcome.
2/7/21
Monitoring and Observability
Atibhi Agrawal
D.monitoring
Observability is not a new idea; it originated in control theory. In control theory, observability is defined as "a measure of how well internal states of a system can be inferred from knowledge of its external outputs". We software folks borrowed the term and now define it as the property of any system that allows us to understand what is going on with it, monitor what it is doing and get the information we need to operate and troubleshoot it. In this talk, I am going to give an ...
2/7/21
Monitoring and Observability
D.monitoring
Recently Google published a paper on their monitoring system Monarch, which happens to share design choices with an existing CNCF incubating project: Thanos! During this talk, two Thanos maintainers will explain why Thanos could be claimed as an unintentional open source evolution of Google monitoring systems like Monarch.
2/7/21
Monitoring and Observability
Joe Elliott
D.monitoring
Grafana Tempo is a new high-volume distributed tracing backend whose only dependency is object storage. Unlike other tracing backends, Tempo can hit massive scale without a massive and difficult-to-manage Elasticsearch or Cassandra cluster. The current trade-off for using object storage is that Tempo supports search by trace ID only. However, we will see how this trade-off can be overcome using the other pillars of observability. In this session we will use an OpenTelemetry instrumented ...
2/7/21
Monitoring and Observability
D.monitoring
How do you monitor Postgres? What information can you get out of it, and to what degree does this information help to troubleshoot operational issues? What if you want or need to log all the queries? That may bring heavily trafficked databases down. At OnGres we’re obsessed with improving PostgreSQL’s observability. So we worked together with the Tetrate folks on an Envoy network filter extension for PostgreSQL, to provide and extend observability of the traffic in and out of a cluster infrastructure. ...
2/7/21
Monitoring and Observability
Jason Yee
D.monitoring
Good monitoring allows us to quickly troubleshoot problems and ensure that they remain minor blips rather than escalate into hours or days of downtime. But what is “good”? Just like good code, good monitoring should include tests and documentation to ensure that it’s always valid and easily used by everyone. In this lightning talk, I’ll share best practices for validating and documenting your monitoring.
2/7/21
Monitoring and Observability
Valerii Kravchuk
D.monitoring
Bpftrace is a relatively new open source tracer for modern Linux (kernels 5.x.y) for analyzing production performance problems and troubleshooting software. Basic usage of the tool is presented, along with bpftrace-based one-liners and small scripts useful for MariaDB DBAs (and even developers). Problems with dynamic tracing of MariaDB Server using bpftrace are discussed.