AI Plumbers

Supercharging LLM serving with Dynamo

UD2.120 (Chavanne)
Piotr Tarasiewicz
The explosive growth of Large Language Models (LLMs) demands highly efficient and scalable inference systems. This talk will share the key innovations NVIDIA Dynamo (https://github.com/ai-dynamo/dynamo) adds to enable system-level optimizations while building on the performance of inference engines such as vLLM, SGLang, and TRT-LLM:

- Smart scheduling that routes requests based on KV cache hit rate and load, autoscales intelligently, and disaggregates the prefill and decode phases.
- Hierarchical memory management that spans HBM, host memory, local disk, and remote storage.
- Low-latency transfer of the KV cache across nodes and the memory hierarchy.

The talk will also introduce production-grade LLM serving features of Dynamo that enable users to:

- Find the best configuration for disaggregated serving offline.
- Tune performance automatically based on real-time traffic.
- Dynamically scale prefill and decode workers via topology-aware gang scheduling.
- Leverage LLM-specific fault tolerance.
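To make the KV-cache-aware routing idea concrete, here is a minimal sketch of a router that prefers the worker already holding the longest cached prefix of the incoming prompt while penalising current load. This is an illustration only, not Dynamo's actual API: the block size, the Worker record, and the route function are hypothetical names invented for the example.

```python
# Minimal sketch of KV-cache-aware routing (illustration only; not Dynamo's API).
# Each worker advertises the block hashes it already holds in its KV cache plus
# its current load; the router favours the worker that can reuse the most
# cached prefix blocks without being overloaded.
from dataclasses import dataclass, field
from hashlib import sha256

BLOCK = 16  # tokens per cache block (hypothetical granularity)

def block_hashes(tokens: list[int]) -> list[str]:
    """Hash each prefix-aligned block of the prompt; hashes are chained so
    each one identifies the whole prefix up to that block."""
    hashes, running = [], sha256()
    for i in range(0, len(tokens) - len(tokens) % BLOCK, BLOCK):
        running.update(bytes(str(tokens[i:i + BLOCK]), "utf-8"))
        hashes.append(running.hexdigest())
    return hashes

@dataclass
class Worker:
    name: str
    cached: set[str] = field(default_factory=set)  # block hashes in KV cache
    active_requests: int = 0

def route(workers: list[Worker], prompt_tokens: list[int],
          load_weight: float = 2.0) -> Worker:
    """Pick the worker with the best (cache hits - load penalty) score."""
    prefix = block_hashes(prompt_tokens)
    def score(w: Worker) -> float:
        hits = 0
        for h in prefix:          # prefix blocks must match in order
            if h in w.cached:
                hits += 1
            else:
                break
        return hits - load_weight * w.active_requests
    return max(workers, key=score)
```

The load penalty keeps a single worker with a hot cache from absorbing all traffic; in the real system this decision also interacts with prefill/decode disaggregation and autoscaling, which the sketch deliberately ignores.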

Additional information

Live Stream https://live.fosdem.org/watch/ud2120
Type devroom
Language English

More sessions

1/31/26
AI Plumbers
UD2.120 (Chavanne)
Welcome talk covering some organizational questions
1/31/26
AI Plumbers
Xuan-Son Nguyen
UD2.120 (Chavanne)
llama.cpp has become a key tool for running LLMs efficiently on any hardware. This talk explores how multimodal features have grown in the project, focusing on libmtmd, a library added in April 2025 to make multimodal support easier to use and to maintain in llama.cpp.

We will first cover the main achievements. These include combining the separate CLI tools for different models into a single tool called llama-mtmd-cli. Next, we will discuss how libmtmd works with llama-server and show real ...
1/31/26
AI Plumbers
José Castillo Lema
UD2.120 (Chavanne)
Running modern Large Language Model (LLM) workloads on macOS presents a unique challenge: reconciling powerful local hardware with a mature, Linux-first AI tooling and container ecosystem.

The Problem: Bridging the OS and Acceleration Gap

While containerization offers macOS developers access to Linux-centric tools like Ramalama and the Podman Desktop AI Lab, introducing a virtualization layer immediately compromises GPU acceleration. Direct device passthrough is ...
1/31/26
AI Plumbers
UD2.120 (Chavanne)
Deploying neural networks in production environments presents unique challenges: models must run efficiently across diverse hardware, from powerful servers to resource-constrained embedded devices, while maintaining predictable performance without heavy runtime dependencies.

This talk introduces tract (https://github.com/sonos/tract), Sonos's (https://sonos.com/) open-source neural network inference toolkit, started in 2018 and written in Rust. ...
1/31/26
AI Plumbers
UD2.120 (Chavanne)
Can an ESP32-based MCU run (tiny)ML models accurately and efficiently? This talk showcases how a tiny microcontroller can transparently leverage neighboring nodes to run inference on full, unquantized torchvision models in less than 100 ms! We build on vAccel, an open abstraction layer for interoperable hardware acceleration, and enable devices like the ESP32 to transparently offload ML inference and signal-processing tasks to nearby edge or cloud nodes. Through a lightweight agent and ...
1/31/26
AI Plumbers
UD2.120 (Chavanne)
During 2025 we ported ExecuTorch, the extension to PyTorch for embedded systems, to a bare-metal multi-core RISC-V microcontroller based on the CORE-V CV32E40Pv2 processor.

In this talk we'll explain the steps we had to take to achieve this:
- removing dependencies on an underlying operating system
- how to handle memory management between slow main memory and fast local memory
- how to handle tiling and operators on bare-metal multi-core systems
- how to take advantage of custom ...
1/31/26
AI Plumbers
UD2.120 (Chavanne)
As AI workloads move to the browser, the lack of a unified low-level acceleration layer on Linux (equivalent to DirectML or CoreML) creates major bottlenecks. In this talk, we explore how WebNN and next-generation WebLLM can unlock efficient on-device inference on RISC-V, using Tenstorrent hardware and the emerging RVV 1.0 variable-length vector ISA. We cover the challenges of WebNN integration on Linux, the importance of WASM support for RVV, and demonstrate progress on running modern LLMs ...