AI Plumbers

tract - an efficient Rust neural network inference engine

UD2.120 (Chavanne)
Deploying neural networks in production environments presents unique challenges: models must run efficiently across diverse hardware, from powerful servers to resource-constrained embedded devices, while maintaining predictable performance without heavy runtime dependencies.

This talk introduces tract (https://github.com/sonos/tract), Sonos's (https://sonos.com/) open-source neural network inference toolkit, started in 2018 and written in Rust. We'll explore how tract bridges the gap between training frameworks and production deployment by offering a no-nonsense, self-contained inference solution used today to deploy deep learning on millions of devices at Sonos.

The toolkit has some unique strengths thanks to embedded graph optimization, automated streaming management, and symbolic abstraction for dynamic dimensions, plus support for multiple open exchange standards, including ONNX (https://onnx.ai/), NNEF (https://www.khronos.org/nnef), and TensorFlow Lite.

tract also has a companion project, torch-to-nnef (https://github.com/sonos/torch-to-nnef), which strives to export PyTorch models to NNEF optimized for tract with maximum compatibility. It enables some unique features such as quantization, better Fourier transform support, and easier extensibility; these will also be discussed briefly during the presentation.
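For a taste of what this looks like in practice, here is a minimal sketch of loading and running an ONNX model with the tract-onnx crate, following the pattern shown in the project README; the model path and input shape are placeholders, not part of the talk:

    use tract_onnx::prelude::*;

    fn main() -> TractResult<()> {
        // Load the model, declare the input shape, run tract's graph
        // optimizations, and freeze the result into an executable plan.
        let model = tract_onnx::onnx()
            .model_for_path("model.onnx")?                            // placeholder path
            .with_input_fact(0, f32::fact([1, 3, 224, 224]).into())? // placeholder shape
            .into_optimized()?
            .into_runnable()?;

        // Run inference on a dummy all-zero tensor of the declared shape.
        let input: Tensor = tract_ndarray::Array4::<f32>::zeros((1, 3, 224, 224)).into();
        let outputs = model.run(tvec!(input.into()))?;
        println!("{:?}", outputs[0]);
        Ok(())
    }

The same load-optimize-run pattern applies to the other supported formats, e.g. via the tract-nnef crate for NNEF models.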

Additional information

Live Stream: https://live.fosdem.org/watch/ud2120
Type: devroom
Language: English

More sessions

1/31/26
AI Plumbers
UD2.120 (Chavanne)
Welcome talk covering some organizational questions.
1/31/26
AI Plumbers
Xuan-Son Nguyen
UD2.120 (Chavanne)
llama.cpp has become a key tool for running LLMs efficiently on any hardware. This talk explores how multimodal features have grown in the project, focusing on libmtmd, a library added in April 2025 to make multimodal support easier to use and maintain in llama.cpp. We will first cover the main achievements, including combining separate CLI tools for different models into a single tool called llama-mtmd-cli. Next, we will discuss how libmtmd works with llama-server and show real ...
1/31/26
AI Plumbers
José Castillo Lema
UD2.120 (Chavanne)
Running modern Large Language Model (LLM) workloads on macOS presents a unique challenge: reconciling powerful local hardware with a mature, Linux-first AI tooling and container ecosystem.

The Problem: Bridging the OS and Acceleration Gap

While containerization offers macOS developers access to Linux-centric tools like Ramalama and the Podman Desktop AI Lab, introducing a virtualization layer immediately compromises GPU acceleration. Direct device passthrough is ...
1/31/26
AI Plumbers
UD2.120 (Chavanne)
Can an ESP32-based MCU run (tiny)ML models accurately and efficiently? This talk showcases how a tiny microcontroller can transparently leverage neighboring nodes to run inference on full, unquantized torchvision models in less than 100ms! We build on vAccel, an open abstraction layer that allows interoperable hardware acceleration and enables devices like the ESP32 to transparently offload ML inference and signal-processing tasks to nearby edge or cloud nodes. Through a lightweight agent and ...
1/31/26
AI Plumbers
UD2.120 (Chavanne)
During 2025 we ported ExecuTorch, the extension to PyTorch for embedded systems, to a bare-metal multi-core RISC-V microcontroller based on the CORE-V CV32E40Pv2 processor. In this talk we'll explain the steps we had to take to achieve this:
- removing dependencies on an underlying operating system
- how to handle memory management between slow main memory and fast local memory
- how to handle tiling and operators on bare-metal multi-core systems
- how to take advantage of custom ...
1/31/26
AI Plumbers
UD2.120 (Chavanne)
As AI workloads move to the browser, the lack of a unified low-level acceleration layer on Linux (equivalent to DirectML or CoreML) creates major bottlenecks. In this talk, we explore how WebNN and next-generation WebLLM can unlock efficient on-device inference on RISC-V, using Tenstorrent hardware and the emerging RVV 1.0 variable-length vector ISA. We cover the challenges of WebNN integration on Linux, the importance of WASM support for RVV, and demonstrate progress on running modern LLMs ...
1/31/26
AI Plumbers
Sébastien Crozet
UD2.120 (Chavanne)
Leveraging Rust and Khronos' emerging Slang initiative, we introduce our efforts toward a cross-platform GPU LLM inference ecosystem. With a single-source approach, we aim to minimize backend-specific code and foster community participation by writing inference kernels once and running them everywhere.