LLM Inference Engineer

Bagel Labs · Remote, North America

We are Bagel, a frontier research collective engineering the backbone of a decentralized, open-source AI economy.


Role Overview

You will architect and optimize distributed inference systems for large language models. Your focus is on building scalable, fault-tolerant infrastructure that can serve open-source models such as Llama and DeepSeek across multiple nodes and regions, with efficient LoRA adaptation support.


Key Responsibilities

  • Design and implement distributed inference systems using vLLM across multiple nodes and regions.
  • Architect high-availability clusters with automatic failover and load balancing.
  • Build monitoring and observability systems for distributed inference (latency, throughput, GPU utilization).
  • Integrate open-source models (e.g., DeepSeek) with serving frameworks such as Text Generation Inference in a distributed setting.
  • Design and optimize LoRA adaptation pipelines for efficient model fine-tuning and serving (see the sketch after this list).
  • Document designs, review code, and post clear write-ups on blog.bagel.net.
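For a flavor of the kind of work involved, here is a minimal sketch of multi-LoRA serving with vLLM. The model name, adapter name, and adapter path are illustrative placeholders, not a description of Bagel's actual stack:

```python
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

# Load a base model with LoRA support enabled, sharded across 2 GPUs.
# Model and parallelism settings here are assumptions for illustration.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    enable_lora=True,
    max_loras=4,             # adapters that can be active in a batch
    tensor_parallel_size=2,  # shard weights across 2 GPUs
)

params = SamplingParams(temperature=0.7, max_tokens=256)

# Route a request through a specific adapter; the name, integer ID,
# and local path below are hypothetical.
outputs = llm.generate(
    ["Summarize the benefits of paged attention."],
    params,
    lora_request=LoRARequest("example-adapter", 1, "/path/to/adapter"),
)
print(outputs[0].outputs[0].text)
```

In production this single-process sketch would sit behind the distributed, load-balanced serving layer described above, with adapters swapped per request rather than baked into the deployment.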

Who You Might Be

You have a deep understanding of distributed systems and transformer inference. You enjoy architecting scalable infrastructure and optimizing every layer of the serving stack. You're excited about making open-source models production-ready at scale and love diving into the internals of distributed model serving frameworks and efficient adaptation techniques.


Required Skills

  • At least 5 years of experience with distributed systems and production model serving.
  • Hands-on experience with distributed vLLM, Text Generation Inference, or similar frameworks.
  • Deep understanding of distributed systems concepts (consistency, availability, partitioning).
  • Experience with container orchestration (Kubernetes) and service mesh technologies.
  • Proven record of optimizing distributed inference latency and throughput.
  • Experience with GPU profiling and optimization in a distributed setting (see the sketch after this list).
  • Strong understanding of LoRA and efficient fine-tuning techniques.
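As an example of the observability side of the role, here is a minimal sketch of exporting per-GPU utilization and memory metrics for Prometheus to scrape, using the NVML bindings. Metric names, labels, and the port are illustrative assumptions:

```python
import time

import pynvml  # NVIDIA management library bindings (pip install nvidia-ml-py)
from prometheus_client import Gauge, start_http_server

# Hypothetical metric names; a real deployment would follow its own naming scheme.
GPU_UTIL = Gauge("gpu_utilization_percent", "GPU compute utilization", ["gpu"])
GPU_MEM = Gauge("gpu_memory_used_bytes", "GPU memory in use", ["gpu"])


def main() -> None:
    pynvml.nvmlInit()
    start_http_server(8000)  # expose /metrics for Prometheus to scrape
    count = pynvml.nvmlDeviceGetCount()
    try:
        while True:
            for i in range(count):
                handle = pynvml.nvmlDeviceGetHandleByIndex(i)
                util = pynvml.nvmlDeviceGetUtilizationRates(handle)
                mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
                GPU_UTIL.labels(gpu=str(i)).set(util.gpu)
                GPU_MEM.labels(gpu=str(i)).set(mem.used)
            time.sleep(5)  # scrape-friendly sampling interval
    finally:
        pynvml.nvmlShutdown()


if __name__ == "__main__":
    main()
```

A production system would extend this with request-level latency and throughput histograms at the serving layer; this sketch covers only the GPU-level signals named above.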

Bonus Skills

  • Contributions to open-source distributed model serving frameworks.
  • Experience with multi-region deployment and global load balancing.
  • Knowledge of distributed model quantization and sharding techniques.
  • Experience with dynamic LoRA switching and multi-adapter serving.
  • Talks or posts that explain distributed inference optimization in plain language.

What We Offer

  • A deeply technical culture where bold, frontier ideas are debated, stress-tested, and built.
  • Full remote flexibility within North American time zones.
  • Ownership of work that can set the direction for decentralized AI.
  • Paid travel opportunities to the top ML conferences around the world.