We are Bagel, a frontier research collective engineering the backbone of a decentralized, open-source AI economy.
Role Overview
You will architect and optimize distributed inference systems for large language models. Your focus is on building scalable, fault-tolerant infrastructure that can serve open-source models such as Llama and DeepSeek across multiple nodes and regions, with efficient LoRA adaptation support.
Key Responsibilities
- Design and implement distributed inference systems using vLLM across multiple nodes and regions (a minimal sketch follows this list).
- Architect high-availability clusters with automatic failover and load balancing.
- Build monitoring and observability systems for distributed inference (latency, throughput, GPU utilization).
- Integrate open-source models (e.g., DeepSeek) with serving frameworks such as Text Generation Inference in a distributed setting.
- Design and optimize LoRA adaptation pipelines for efficient model fine-tuning and serving.
- Document designs, review code, and post clear write-ups on blog.bagel.net.
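To give a concrete flavor of the distributed-serving and LoRA responsibilities above, here is a minimal sketch of tensor-parallel vLLM inference with a LoRA adapter attached per request. The base model, adapter name, adapter path, and parallelism degree are illustrative assumptions, not a prescribed stack:

```python
# Minimal sketch: tensor-parallel vLLM inference with a per-request LoRA adapter.
# The base model, adapter name/path, and tensor_parallel_size are assumptions;
# adjust them to your hardware and checkpoints.
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

# Shard the base model across 2 GPUs on one node and enable LoRA at load time.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # assumed base model
    enable_lora=True,
    tensor_parallel_size=2,
)

sampling = SamplingParams(temperature=0.7, max_tokens=128)

# Attach a fine-tuned adapter to this request; the base weights stay loaded.
outputs = llm.generate(
    ["Summarize the benefits of decentralized inference."],
    sampling,
    lora_request=LoRARequest("summarizer", 1, "/adapters/summarizer"),  # hypothetical adapter
)
print(outputs[0].outputs[0].text)
```

Scaling this same pattern across nodes and regions, with failover and load balancing in front, is the core of the role.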
Who You Might Be
You have a deep understanding of distributed systems and transformer inference. You enjoy architecting scalable infrastructure and optimizing every layer of the serving stack. You're excited about making open-source models production-ready at scale and love diving into the internals of distributed model serving frameworks and efficient adaptation techniques.
Required Skills
- At least 5 years of experience with distributed systems and production model serving.
- Hands-on experience with distributed vLLM, Text Generation Inference, or similar frameworks.
- Deep understanding of distributed systems concepts (consistency, availability, partitioning).
- Experience with container orchestration (Kubernetes) and service mesh technologies.
- Proven track record of optimizing distributed inference latency and throughput.
- Experience with GPU profiling and optimization in a distributed setting.
- Strong understanding of LoRA and efficient fine-tuning techniques (see the sketch after this list).
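For calibration on the LoRA requirement: LoRA freezes the pretrained weight W and learns a low-rank update ΔW = B·A, so only r·(d+k) parameters train instead of d·k. A minimal NumPy illustration, with all shapes and values chosen arbitrarily:

```python
# Minimal sketch of the LoRA forward pass: y = x @ (W + (alpha/r) * B @ A).
# Shapes and values are arbitrary illustrations, not tuned settings.
import numpy as np

d, k, r, alpha = 4096, 4096, 16, 32      # weight dims, adapter rank, scaling

rng = np.random.default_rng(0)
W = rng.standard_normal((d, k))          # frozen pretrained weight (d x k)
A = rng.standard_normal((r, k)) * 0.01   # trainable low-rank factor (r x k)
B = np.zeros((d, r))                     # trainable, zero-init so W' = W at step 0

x = rng.standard_normal((1, d))          # one input activation

# Base path plus low-rank path; the dense d x k update is never materialized.
y = x @ W + (alpha / r) * (x @ B) @ A

print(f"trainable LoRA params: {r * (d + k):,} vs. full fine-tuning: {d * k:,}")
```

At these example shapes the adapter trains roughly 0.8% of the parameters a full fine-tune would touch, which is what makes per-tenant adapters cheap to store and swap.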
Bonus Skills
- Contributions to open-source distributed model serving frameworks.
- Experience with multi-region deployment and global load balancing.
- Knowledge of distributed model quantization and sharding techniques.
- Experience with dynamic LoRA switching and multi-adapter serving (sketched after this list).
- Talks or posts that explain distributed inference optimization in plain language.
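On the multi-adapter bonus point: frameworks like vLLM can keep one copy of the base model in GPU memory and route each request to a different LoRA adapter, batching across adapters with no model reload. A sketch, with adapter names, IDs, and paths as placeholder assumptions:

```python
# Minimal sketch of multi-adapter serving: one base model, per-request adapters.
# Adapter names, IDs, and paths are hypothetical placeholders.
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # assumed base model
    enable_lora=True,
    max_loras=4,        # adapters that may be active in one batch
    max_lora_rank=16,
)

sampling = SamplingParams(max_tokens=64)

# Each request names its own adapter; switching is per request, not per reload.
requests = [
    ("Translate to French: good morning", LoRARequest("translator", 1, "/adapters/fr")),
    ("Top five users by spend, as SQL:", LoRARequest("sql", 2, "/adapters/sql")),
]
for prompt, adapter in requests:
    out = llm.generate([prompt], sampling, lora_request=adapter)
    print(adapter.lora_name, "->", out[0].outputs[0].text.strip()[:80])
```

Dynamic switching like this is what keeps a multi-tenant cluster dense: many adapters, one set of base weights.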
What We Offer
- A deeply technical culture where bold, frontier ideas are debated, stress-tested, and built.
- Full remote flexibility within North American time zones.
- Ownership of work that can set the direction for decentralized AI.
- Paid travel opportunities to the top ML conferences around the world.