AI Plumbers

Supercharging LLM serving with Dynamo

UD2.120 (Chavanne)
Piotr Tarasiewicz
The explosive growth of Large Language Models (LLMs) demands highly efficient and scalable inference systems. This talk will share the key innovations NVIDIA Dynamo (https://github.com/ai-dynamo/dynamo) adds to enable system-level optimizations while building on the performance of inference engines such as vLLM, SGLang, and TRT-LLM:

- Smart scheduling that routes requests based on KV cache hit rate and load, autoscales intelligently, and disaggregates the prefill and decode phases.
- Hierarchical memory management that spans HBM, host memory, local disk, and remote storage.
- Low-latency transfer of the KV cache across nodes and the memory hierarchy.

The talk will also introduce production-grade LLM serving features of Dynamo that enable users to:

- Find the best configuration for disaggregated serving offline.
- Tune performance automatically based on real-time traffic.
- Dynamically scale prefill and decode workers via topology-aware gang scheduling.
- Leverage LLM-specific fault tolerance.
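To make the KV-cache-aware routing idea concrete, here is a minimal sketch of a router that prefers the worker already holding the longest cached prefix of the incoming prompt while penalising current load. This is an illustration only, not Dynamo's actual API: the block size, the Worker record, and the route function are hypothetical names invented for the example.

```python
# Minimal sketch of KV-cache-aware routing (illustration only; not Dynamo's API).
# Each worker advertises the block hashes it already holds in its KV cache plus
# its current load; the router favours the worker that can reuse the most
# cached prefix blocks without being overloaded.
from dataclasses import dataclass, field
from hashlib import sha256

BLOCK = 16  # tokens per cache block (hypothetical granularity)

def block_hashes(tokens: list[int]) -> list[str]:
    """Hash each prefix-aligned block of the prompt; hashes are chained so
    each one identifies the whole prefix up to that block."""
    hashes, running = [], sha256()
    for i in range(0, len(tokens) - len(tokens) % BLOCK, BLOCK):
        running.update(bytes(str(tokens[i:i + BLOCK]), "utf-8"))
        hashes.append(running.hexdigest())
    return hashes

@dataclass
class Worker:
    name: str
    cached: set[str] = field(default_factory=set)  # block hashes in KV cache
    active_requests: int = 0

def route(workers: list[Worker], prompt_tokens: list[int],
          load_weight: float = 2.0) -> Worker:
    """Pick the worker with the best (cache hits - load penalty) score."""
    prefix = block_hashes(prompt_tokens)
    def score(w: Worker) -> float:
        hits = 0
        for h in prefix:          # prefix blocks must match in order
            if h in w.cached:
                hits += 1
            else:
                break
        return hits - load_weight * w.active_requests
    return max(workers, key=score)
```

The load penalty keeps a single worker with a hot cache from absorbing all traffic; in the real system this decision also interacts with prefill/decode disaggregation and autoscaling, which the sketch deliberately ignores.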

Additional information

Live Stream https://live.fosdem.org/watch/ud2120
Type devroom
Language English

More sessions

1/31/26
AI Plumbers
UD2.120 (Chavanne)
Welcome talk covering some organizational questions
1/31/26
AI Plumbers
Xuan-Son Nguyen
UD2.120 (Chavanne)
llama.cpp has become a key tool for running LLMs efficiently on any hardware. This talk explores how multimodal features have grown in the project, focusing on libmtmd, a library added in April 2025 to make multimodal support easier to use and to maintain in llama.cpp.

We will first cover the main achievements. These include combining the separate CLI tools for different models into a single tool called llama-mtmd-cli. Next, we will discuss how libmtmd works with llama-server and show real ...
1/31/26
AI Plumbers
José Castillo Lema
UD2.120 (Chavanne)
Running modern Large Language Model (LLM) workloads on macOS presents a unique challenge: reconciling powerful local hardware with a mature, Linux-first AI tooling and container ecosystem.

The Problem: Bridging the OS and Acceleration Gap

While containerization offers macOS developers access to Linux-centric tools like Ramalama and the Podman Desktop AI Lab, introducing a virtualization layer immediately compromises GPU acceleration. Direct device passthrough is ...
1/31/26
AI Plumbers
UD2.120 (Chavanne)
Deploying neural networks in production environments presents unique challenges: models must run efficiently across diverse hardware, from powerful servers to resource-constrained embedded devices, while maintaining predictable performance without heavy runtime dependencies.

This talk introduces tract (https://github.com/sonos/tract), Sonos's (https://sonos.com/) open-source neural network inference toolkit, started in 2018 and written in Rust. ...
1/31/26
AI Plumbers
UD2.120 (Chavanne)
Can an ESP32-based MCU run (tiny)ML models accurately and efficiently? This talk showcases how a tiny microcontroller can transparently leverage neighboring nodes to run inference on full, unquantized torchvision models in less than 100 ms! We build on vAccel, an open abstraction layer for interoperable hardware acceleration, and enable devices like the ESP32 to transparently offload ML inference and signal-processing tasks to nearby edge or cloud nodes. Through a lightweight agent and ...
1/31/26
AI Plumbers
UD2.120 (Chavanne)
During 2025 we ported ExecuTorch, the extension to PyTorch for embedded systems, to a bare-metal multi-core RISC-V microcontroller based on the CORE-V CV32E40Pv2 processor.

In this talk we'll explain the steps we had to take to achieve this:
- removing dependencies on an underlying operating system
- how to handle memory management between slow main memory and fast local memory
- how to handle tiling and operators on bare-metal multi-core systems
- how to take advantage of custom ...
1/31/26
AI Plumbers
UD2.120 (Chavanne)
As AI workloads move to the browser, the lack of a unified low-level acceleration layer on Linux (equivalent to DirectML or CoreML) creates major bottlenecks. In this talk, we explore how WebNN and next-generation WebLLM can unlock efficient on-device inference on RISC-V, using Tenstorrent hardware and the emerging RVV 1.0 variable-length vector ISA. We cover the challenges of WebNN integration on Linux, the importance of WASM support for RVV, and demonstrate progress on running modern LLMs ...