| Live Stream | https://live.fosdem.org/watch/ud2120 |
|---|---|
| Type | devroom |
| Language | English |

**1/31/26**

Welcome talk covering some organizational questions.
**1/31/26**

llama.cpp has become a key tool for running LLMs efficiently on any hardware. This talk explores how multimodal features have grown in the project. It focuses on libmtmd, a library added in April 2025 to make multimodal support in llama.cpp easier to use and maintain.

We will first cover the main achievements. These include combining the separate per-model CLI tools into a single tool, llama-mtmd-cli. Next, we will discuss how libmtmd works with llama-server and show real ...
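As a rough illustration of the unified CLI (a minimal sketch: the exact flag set depends on your llama.cpp build, and all file paths below are placeholders), a program can drive llama-mtmd-cli like this:

```rust
use std::process::Command;

fn main() -> std::io::Result<()> {
    // Invoke the unified multimodal CLI; one binary now covers models that
    // previously each needed their own tool. Paths are placeholders.
    let output = Command::new("llama-mtmd-cli")
        .args([
            "-m", "model.gguf",           // text model weights
            "--mmproj", "mmproj.gguf",    // multimodal projector weights
            "--image", "photo.jpg",       // image input
            "-p", "Describe this image.", // text prompt
        ])
        .output()?;
    println!("{}", String::from_utf8_lossy(&output.stdout));
    Ok(())
}
```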
**1/31/26**

Running modern Large Language Model (LLM) workloads on macOS presents a unique challenge: reconciling powerful local hardware with a mature, Linux-first AI tooling and container ecosystem.

**The Problem: Bridging the OS and Acceleration Gap**

While containerization offers macOS developers access to Linux-centric tools like Ramalama and the Podman Desktop AI Lab, introducing a virtualization layer immediately compromises GPU acceleration. Direct device passthrough is ...
**1/31/26**

Deploying neural networks in production environments presents unique challenges: models must run efficiently across diverse hardware, from powerful servers to resource-constrained embedded devices, while maintaining predictable performance without heavy runtime dependencies.

This talk introduces [**tract**](https://github.com/sonos/tract), [Sonos](https://sonos.com/)'s open-source neural network inference toolkit, started in 2018 and written in Rust. ...
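For a sense of tract's API, here is a minimal sketch of loading and running an ONNX model, following the pattern in tract's README; the model path and input shape are placeholders, and the exact `run` signature varies slightly between tract releases:

```rust
use tract_onnx::prelude::*;

fn main() -> TractResult<()> {
    // Load an ONNX model, pin its input shape, optimize, and make it runnable.
    let model = tract_onnx::onnx()
        .model_for_path("model.onnx")? // placeholder path
        .with_input_fact(0, f32::fact([1, 3, 224, 224]).into())?
        .into_optimized()?
        .into_runnable()?;

    // Dummy all-zeros input matching the declared shape.
    let input: Tensor = tract_ndarray::Array4::<f32>::zeros((1, 3, 224, 224)).into();

    // Run inference; the result is a vector of output tensors.
    let result = model.run(tvec!(input.into()))?;
    println!("output shape: {:?}", result[0].shape());
    Ok(())
}
```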
**1/31/26**

During 2025 we ported ExecuTorch, the extension to PyTorch for embedded systems, to a bare-metal multi-core RISC-V microcontroller based on the CORE-V CV32E40Pv2 processor.

In this talk we'll explain the steps we had to take to achieve this:

- removing dependencies on an underlying operating system
- how to handle memory management between slow main memory and fast local memory
- how to handle tiling and operators on bare-metal multi-core systems (see the sketch after this list)
- how to take advantage of custom ...
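Purely as an illustration of the general tiling idea (not ExecuTorch's actual memory-planning API, which the talk will cover), the pattern is to stage slices of a large buffer in slow memory through a small scratch buffer in fast local memory:

```rust
// Illustrative only: a generic tiling pattern, not ExecuTorch code.
// "Slow" memory stands in for external DRAM, "fast" for on-chip local RAM.

const TILE: usize = 256; // sized to fit the fast local memory

/// Scale `src` by `k` into `dst`, processing one fast-memory tile at a time.
fn scale_tiled(src: &[f32], dst: &mut [f32], k: f32) {
    let mut scratch = [0.0f32; TILE]; // stand-in for a fast local-memory buffer

    for (src_tile, dst_tile) in src.chunks(TILE).zip(dst.chunks_mut(TILE)) {
        let n = src_tile.len();
        // 1. DMA-style copy: slow -> fast
        scratch[..n].copy_from_slice(src_tile);
        // 2. Compute entirely out of fast memory (a real port would also
        //    split this work across cores and use custom instructions)
        for v in &mut scratch[..n] {
            *v *= k;
        }
        // 3. Copy the finished tile back: fast -> slow
        dst_tile.copy_from_slice(&scratch[..n]);
    }
}

fn main() {
    let src = vec![1.0f32; 1000];
    let mut dst = vec![0.0f32; 1000];
    scale_tiled(&src, &mut dst, 2.0);
    assert!(dst.iter().all(|&v| v == 2.0));
}
```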
**1/31/26**

As AI workloads move to the browser, the lack of a unified low-level acceleration layer on Linux (equivalent to DirectML or CoreML) creates major bottlenecks. In this talk, we explore how WebNN and next-generation WebLLM can unlock efficient on-device inference on RISC-V, using Tenstorrent hardware and the emerging RVV 1.0 variable-length vector ISA. We cover the challenges of WebNN integration on Linux, the importance of WASM support for RVV, and demonstrate progress on running modern LLMs ...
**1/31/26**

Leveraging Rust and Khronos' emerging Slang initiative, we introduce our efforts toward a cross-platform GPU LLM inference ecosystem. With a single-source approach, we aim to minimize backend-specific code and foster community participation by writing inference kernels once and running them everywhere.
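The names below are hypothetical and do not come from the project; the sketch only illustrates the single-source shape of such a design: one kernel source, compiled per backend behind a common trait.

```rust
// Hypothetical sketch of the single-source idea; none of these names are
// the project's actual API. One kernel is written once (e.g. in Slang) and
// each GPU backend only implements compilation and dispatch for its target.

/// The kernel source, written once. Slang can compile a source like this
/// to SPIR-V (Vulkan), Metal, HLSL, and other targets.
const MATVEC_KERNEL: &str = "/* Slang source for a matrix-vector kernel */";

/// What every backend must provide; inference logic stays backend-agnostic.
trait GpuBackend {
    fn name(&self) -> &'static str;
    /// Compile the shared kernel source to this backend's binary format.
    fn compile(&self, slang_src: &str) -> Vec<u8>;
    /// Launch the compiled kernel with a given workgroup count.
    fn dispatch(&self, binary: &[u8], groups: [u32; 3]);
}

struct VulkanBackend;

impl GpuBackend for VulkanBackend {
    fn name(&self) -> &'static str { "vulkan" }
    fn compile(&self, _slang_src: &str) -> Vec<u8> {
        Vec::new() // a real backend would call the Slang compiler here
    }
    fn dispatch(&self, _binary: &[u8], _groups: [u32; 3]) {
        // a real backend would record and submit a compute dispatch here
    }
}

/// Backend-specific code is confined to the trait impls; the kernel and
/// the driver logic are written exactly once.
fn run_everywhere(backends: &[Box<dyn GpuBackend>]) {
    for b in backends {
        let binary = b.compile(MATVEC_KERNEL);
        b.dispatch(&binary, [64, 1, 1]);
        println!("dispatched matvec on {}", b.name());
    }
}

fn main() {
    run_everywhere(&[Box::new(VulkanBackend)]);
}
```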