LLM Inference Engineer

Bagel Labs · Remote, North America

We are Bagel, a frontier research collective engineering the backbone of a decentralized, open-source AI economy.


Role Overview

You will architect and optimize distributed inference systems for large language models. Your focus is on building scalable, fault-tolerant infrastructure that can serve open-source models such as Llama and DeepSeek across multiple nodes and regions, with efficient LoRA adaptation support.


Key Responsibilities

  • Design and implement distributed inference systems using vLLM across multiple nodes and regions.
  • Architect high-availability clusters with automatic failover and load balancing.
  • Build monitoring and observability systems for distributed inference (latency, throughput, GPU utilization).
  • Integrate open-source models (e.g., DeepSeek) with serving frameworks such as Text Generation Inference in a distributed setting.
  • Design and optimize LoRA adaptation pipelines for efficient model fine-tuning and serving (see the sketch after this list).
  • Document designs, review code, and post clear write-ups on blog.bagel.net.
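For a flavor of the kind of work involved, here is a minimal sketch of multi-LoRA serving with vLLM. The model name, adapter name, and adapter path are illustrative placeholders, not a description of Bagel's actual stack:

```python
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

# Load a base model with LoRA support enabled, sharded across 2 GPUs.
# Model and parallelism settings here are assumptions for illustration.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    enable_lora=True,
    max_loras=4,             # adapters that can be active in a batch
    tensor_parallel_size=2,  # shard weights across 2 GPUs
)

params = SamplingParams(temperature=0.7, max_tokens=256)

# Route a request through a specific adapter; the name, integer ID,
# and local path below are hypothetical.
outputs = llm.generate(
    ["Summarize the benefits of paged attention."],
    params,
    lora_request=LoRARequest("example-adapter", 1, "/path/to/adapter"),
)
print(outputs[0].outputs[0].text)
```

In production this single-process sketch would sit behind the distributed, load-balanced serving layer described above, with adapters swapped per request rather than baked into the deployment.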

Who You Might Be

You have a deep understanding of distributed systems and transformer inference. You enjoy architecting scalable infrastructure and optimizing every layer of the serving stack. You're excited about making open-source models production-ready at scale and love diving into the internals of distributed model serving frameworks and efficient adaptation techniques.


Required Skills

  • At least 5 years of experience with distributed systems and production model serving.
  • Hands-on experience with distributed vLLM, Text Generation Inference, or similar frameworks.
  • Deep understanding of distributed systems concepts (consistency, availability, partitioning).
  • Experience with container orchestration (Kubernetes) and service mesh technologies.
  • Proven record of optimizing distributed inference latency and throughput.
  • Experience with GPU profiling and optimization in a distributed setting (see the sketch after this list).
  • Strong understanding of LoRA and efficient fine-tuning techniques.
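As an example of the observability side of the role, here is a minimal sketch of exporting per-GPU utilization and memory metrics for Prometheus to scrape, using the NVML bindings. Metric names, labels, and the port are illustrative assumptions:

```python
import time

import pynvml  # NVIDIA management library bindings (pip install nvidia-ml-py)
from prometheus_client import Gauge, start_http_server

# Hypothetical metric names; a real deployment would follow its own naming scheme.
GPU_UTIL = Gauge("gpu_utilization_percent", "GPU compute utilization", ["gpu"])
GPU_MEM = Gauge("gpu_memory_used_bytes", "GPU memory in use", ["gpu"])


def main() -> None:
    pynvml.nvmlInit()
    start_http_server(8000)  # expose /metrics for Prometheus to scrape
    count = pynvml.nvmlDeviceGetCount()
    try:
        while True:
            for i in range(count):
                handle = pynvml.nvmlDeviceGetHandleByIndex(i)
                util = pynvml.nvmlDeviceGetUtilizationRates(handle)
                mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
                GPU_UTIL.labels(gpu=str(i)).set(util.gpu)
                GPU_MEM.labels(gpu=str(i)).set(mem.used)
            time.sleep(5)  # scrape-friendly sampling interval
    finally:
        pynvml.nvmlShutdown()


if __name__ == "__main__":
    main()
```

A production system would extend this with request-level latency and throughput histograms at the serving layer; this sketch covers only the GPU-level signals named above.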

Bonus Skills

  • Contributions to open-source distributed model serving frameworks.
  • Experience with multi-region deployment and global load balancing.
  • Knowledge of distributed model quantization and sharding techniques.
  • Experience with dynamic LoRA switching and multi-adapter serving.
  • Talks or posts that explain distributed inference optimization in plain language.

What We Offer

  • A deeply technical culture where bold, frontier ideas are debated, stress-tested, and built.
  • Full remote flexibility within North American time zones.
  • Ownership of work that can set the direction for decentralized AI.
  • Paid travel opportunities to the top ML conferences around the world.