| Live Stream | https://live.fosdem.org/watch/ud2120 |
|---|---|
| Type | devroom |
| Language | English |

**1/31/26**

Welcome talk covering some organizational questions.
**1/31/26**

llama.cpp has become a key tool for running LLMs efficiently on any hardware. This talk explores how multimodal features have grown in the project. It focuses on libmtmd, a library added in April 2025 to make multimodal support in llama.cpp easier to use and maintain.

We will first cover the main achievements. These include combining the separate per-model CLI tools into a single tool, llama-mtmd-cli. Next, we will discuss how libmtmd works with llama-server and show real ...
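As a rough illustration of the unified CLI (a minimal sketch: the exact flag set depends on your llama.cpp build, and all file paths below are placeholders), a program can drive llama-mtmd-cli like this:

```rust
use std::process::Command;

fn main() -> std::io::Result<()> {
    // Invoke the unified multimodal CLI; one binary now covers models that
    // previously each needed their own tool. Paths are placeholders.
    let output = Command::new("llama-mtmd-cli")
        .args([
            "-m", "model.gguf",           // text model weights
            "--mmproj", "mmproj.gguf",    // multimodal projector weights
            "--image", "photo.jpg",       // image input
            "-p", "Describe this image.", // text prompt
        ])
        .output()?;
    println!("{}", String::from_utf8_lossy(&output.stdout));
    Ok(())
}
```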
**1/31/26**

Running modern Large Language Model (LLM) workloads on macOS presents a unique challenge: reconciling powerful local hardware with a mature, Linux-first AI tooling and container ecosystem.

**The Problem: Bridging the OS and Acceleration Gap**

While containerization offers macOS developers access to Linux-centric tools like Ramalama and the Podman Desktop AI Lab, introducing a virtualization layer immediately compromises GPU acceleration. Direct device passthrough is ...
**1/31/26**

Deploying neural networks in production environments presents unique challenges: models must run efficiently across diverse hardware, from powerful servers to resource-constrained embedded devices, while maintaining predictable performance without heavy runtime dependencies.

This talk introduces [**tract**](https://github.com/sonos/tract), [Sonos](https://sonos.com/)'s open-source neural network inference toolkit, started in 2018 and written in Rust. ...
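For a sense of tract's API, here is a minimal sketch of loading and running an ONNX model, following the pattern in tract's README; the model path and input shape are placeholders, and the exact `run` signature varies slightly between tract releases:

```rust
use tract_onnx::prelude::*;

fn main() -> TractResult<()> {
    // Load an ONNX model, pin its input shape, optimize, and make it runnable.
    let model = tract_onnx::onnx()
        .model_for_path("model.onnx")? // placeholder path
        .with_input_fact(0, f32::fact([1, 3, 224, 224]).into())?
        .into_optimized()?
        .into_runnable()?;

    // Dummy all-zeros input matching the declared shape.
    let input: Tensor = tract_ndarray::Array4::<f32>::zeros((1, 3, 224, 224)).into();

    // Run inference; the result is a vector of output tensors.
    let result = model.run(tvec!(input.into()))?;
    println!("output shape: {:?}", result[0].shape());
    Ok(())
}
```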
**1/31/26**

During 2025 we ported ExecuTorch, the extension to PyTorch for embedded systems, to a bare-metal multi-core RISC-V microcontroller based on the CORE-V CV32E40Pv2 processor.

In this talk we'll explain the steps we had to take to achieve this:

- removing dependencies on an underlying operating system
- how to handle memory management between slow main memory and fast local memory
- how to handle tiling and operators on bare-metal multi-core systems (see the sketch after this list)
- how to take advantage of custom ...
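Purely as an illustration of the general tiling idea (not ExecuTorch's actual memory-planning API, which the talk will cover), the pattern is to stage slices of a large buffer in slow memory through a small scratch buffer in fast local memory:

```rust
// Illustrative only: a generic tiling pattern, not ExecuTorch code.
// "Slow" memory stands in for external DRAM, "fast" for on-chip local RAM.

const TILE: usize = 256; // sized to fit the fast local memory

/// Scale `src` by `k` into `dst`, processing one fast-memory tile at a time.
fn scale_tiled(src: &[f32], dst: &mut [f32], k: f32) {
    let mut scratch = [0.0f32; TILE]; // stand-in for a fast local-memory buffer

    for (src_tile, dst_tile) in src.chunks(TILE).zip(dst.chunks_mut(TILE)) {
        let n = src_tile.len();
        // 1. DMA-style copy: slow -> fast
        scratch[..n].copy_from_slice(src_tile);
        // 2. Compute entirely out of fast memory (a real port would also
        //    split this work across cores and use custom instructions)
        for v in &mut scratch[..n] {
            *v *= k;
        }
        // 3. Copy the finished tile back: fast -> slow
        dst_tile.copy_from_slice(&scratch[..n]);
    }
}

fn main() {
    let src = vec![1.0f32; 1000];
    let mut dst = vec![0.0f32; 1000];
    scale_tiled(&src, &mut dst, 2.0);
    assert!(dst.iter().all(|&v| v == 2.0));
}
```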
**1/31/26**

As AI workloads move to the browser, the lack of a unified low-level acceleration layer on Linux (equivalent to DirectML or CoreML) creates major bottlenecks. In this talk, we explore how WebNN and next-generation WebLLM can unlock efficient on-device inference on RISC-V, using Tenstorrent hardware and the emerging RVV 1.0 variable-length vector ISA. We cover the challenges of WebNN integration on Linux, the importance of WASM support for RVV, and demonstrate progress on running modern LLMs ...
**1/31/26**

Leveraging Rust and Khronos' emerging Slang initiative, we introduce our efforts toward a cross-platform GPU LLM inference ecosystem. With a single-source approach, we aim to minimize backend-specific code and foster community participation by writing inference kernels once and running them everywhere.
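The names below are hypothetical and do not come from the project; the sketch only illustrates the single-source shape of such a design: one kernel source, compiled per backend behind a common trait.

```rust
// Hypothetical sketch of the single-source idea; none of these names are
// the project's actual API. One kernel is written once (e.g. in Slang) and
// each GPU backend only implements compilation and dispatch for its target.

/// The kernel source, written once. Slang can compile a source like this
/// to SPIR-V (Vulkan), Metal, HLSL, and other targets.
const MATVEC_KERNEL: &str = "/* Slang source for a matrix-vector kernel */";

/// What every backend must provide; inference logic stays backend-agnostic.
trait GpuBackend {
    fn name(&self) -> &'static str;
    /// Compile the shared kernel source to this backend's binary format.
    fn compile(&self, slang_src: &str) -> Vec<u8>;
    /// Launch the compiled kernel with a given workgroup count.
    fn dispatch(&self, binary: &[u8], groups: [u32; 3]);
}

struct VulkanBackend;

impl GpuBackend for VulkanBackend {
    fn name(&self) -> &'static str { "vulkan" }
    fn compile(&self, _slang_src: &str) -> Vec<u8> {
        Vec::new() // a real backend would call the Slang compiler here
    }
    fn dispatch(&self, _binary: &[u8], _groups: [u32; 3]) {
        // a real backend would record and submit a compute dispatch here
    }
}

/// Backend-specific code is confined to the trait impls; the kernel and
/// the driver logic are written exactly once.
fn run_everywhere(backends: &[Box<dyn GpuBackend>]) {
    for b in backends {
        let binary = b.compile(MATVEC_KERNEL);
        b.dispatch(&binary, [64, 1, 1]);
        println!("dispatched matvec on {}", b.name());
    }
}

fn main() {
    run_everywhere(&[Box::new(VulkanBackend)]);
}
```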