rLLM

Train your AI agents with RL. Any framework. Minimal code changes.

rLLM is an open-source framework for training AI agents with reinforcement learning. Swap in a tracked client, define a reward function, and let RL handle the rest — no matter what agent framework you use.

Core Features

  • Works with any agent framework — LangGraph, SmolAgent, Strands, OpenAI Agents SDK, Google ADK, or plain openai.OpenAI. Just swap the client. 🔌
  • Near-zero code changes — Add @rllm.rollout to wrap your agent code, and rLLM traces every LLM call automatically. 🪄
  • CLI-first workflow — Eval and train from the command line with 50+ built-in benchmarks. rllm eval gsm8k just works. ⚡
  • Battle-tested results — rLLM-trained agents beat models 50x their size (4B → outperforms 235B on finance, 1.5B → surpasses O1-Preview on math). 📈
  • Multiple RL algorithms — GRPO, REINFORCE, RLOO, rejection sampling, and more. 🧠
  • Two training backends — verl for distributed multi-GPU training, tinker for single-machine / CPU setups. Same API either way. 🔧

Read more on our documentation site.

Installation

rLLM requires Python >= 3.10 (3.11 if using tinker). You can install it directly via pip or build it from source.

uv pip install "rllm @ git+https://github.com/rllm-org/rllm.git"

This installs the dependencies for the rllm CLI, which uses Tinker as the training backend.

To use verl as the training backend (GPU machine required), install via

# For distributed GPU training (verl + vLLM/SGLang)
uv pip install "rllm[verl] @ git+https://github.com/rllm-org/rllm.git"

For building from source or Docker, see the installation guide.

Quickstart

Option A: CLI (no code needed)

# 1. Configure your model provider
rllm model setup

# 2. Evaluate on a benchmark
rllm eval gsm8k

# 3. Train with RL
rllm train gsm8k

Option B: Python API

Define a rollout (your agent) and an evaluator (your reward function), then hand them to the trainer:

# my_flow.py
from openai import OpenAI
import rllm
from rllm.experimental.eval.types import AgentConfig, Task
from rllm.types import Episode, Trajectory

@rllm.rollout
def solve(task: Task, config: AgentConfig) -> Episode:
    client = OpenAI(base_url=config.base_url, api_key="EMPTY")
    response = client.chat.completions.create(
        model=config.model,
        messages=[{"role": "user", "content": task.data["question"]}],
    )
    answer = response.choices[0].message.content or ""
    return Episode(
        trajectories=[Trajectory(name="solver", steps=[])],
        artifacts={"answer": answer},
    )
# my_evaluator.py
import rllm
from rllm.experimental.eval.types import EvalOutput, Signal, _extract_agent_answer
from rllm.types import Episode

@rllm.evaluator
def score(task: dict, episode: Episode) -> EvalOutput:
    answer = _extract_agent_answer(episode)
    is_correct = answer.strip() == task["ground_truth"].strip()
    reward = 1.0 if is_correct else 0.0
    return EvalOutput(reward=reward, is_correct=is_correct,
                      signals=[Signal(name="accuracy", value=reward)])
# train.py
from rllm.experimental.unified_trainer import AgentTrainer
from my_flow import solve
from my_evaluator import score

# `config` and `dataset` come from your own setup (see the cookbooks below)
trainer = AgentTrainer(
    backend="tinker",
    agent_flow=solve,
    evaluator=score,
    config=config,
    train_dataset=dataset,
)
trainer.train()

During training, config.base_url points to a gateway that transparently captures token IDs and logprobs — your agent code stays the same for eval and training.

See the cookbooks for complete working examples (single-turn VLM solver, multi-agent solver-judge, and more).

Architecture

rLLM follows a simple pipeline: run your agent → collect traces → compute rewards → update the model.

┌──────────────┐    ┌──────────────┐    ┌──────────────┐    ┌──────────────┐
│  Your Agent  │───▶│    Traces    │───▶│   Rewards    │───▶│  RL Update   │
│  (any code)  │    │ (auto-logged)│    │ (your logic) │    │  (GRPO etc.) │
└──────────────┘    └──────────────┘    └──────────────┘    └──────────────┘

Your agent runs as-is — rLLM's SDK intercepts LLM calls and structures them into Episodes (one task) containing Trajectories (one agent run) made of Steps (one LLM call). A reward function scores the result, and the RL algorithm updates the model weights. The same agent code works for both eval and training.
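The Episode → Trajectory → Step nesting can be pictured with plain dataclasses. These are simplified stand-ins for illustration only — the real types live in rllm.types and carry more fields than shown here (e.g. the token IDs and logprobs captured during training):

```python
from dataclasses import dataclass, field

# Simplified stand-ins for rllm.types — illustrative only.
@dataclass
class Step:            # one LLM call
    prompt: str
    completion: str

@dataclass
class Trajectory:      # one agent run
    name: str
    steps: list[Step] = field(default_factory=list)

@dataclass
class Episode:         # one task
    trajectories: list[Trajectory] = field(default_factory=list)
    artifacts: dict = field(default_factory=dict)

# One task, one solver run, one LLM call:
episode = Episode(
    trajectories=[Trajectory(name="solver", steps=[Step("2+2?", "4")])],
    artifacts={"answer": "4"},
)
```

A multi-agent flow would simply contribute several Trajectories to the same Episode, one per agent run.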

Under the hood:

  • Workflow Engine runs N parallel agent instances to collect rollouts
  • LiteLLM Proxy routes requests and captures token IDs + logprobs
  • Transform Pipeline groups trajectories for advantage computation
  • Training Backend (verl or tinker) handles the policy update
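The advantage computation in the transform pipeline can be sketched in plain Python. This is an illustrative GRPO-style calculation (mean-centered, std-normalized rewards within a group of rollouts for the same task), not rLLM's internal code:

```python
import statistics

def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """GRPO-style advantages: normalize each reward against its group.

    All rewards come from rollouts of the same task, so the group mean
    serves as a baseline (no learned value function needed) and the
    standard deviation rescales the learning signal.
    """
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Four rollouts of one task: two correct (reward 1.0), two incorrect (0.0).
advantages = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
# Correct rollouts get positive advantages, incorrect ones negative.
```

Grouping rollouts by task is why the transform pipeline sits between reward computation and the policy update: advantages only make sense relative to other attempts at the same task.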

Community Projects

Articles & Blog Posts

Acknowledgements

Our work is done as part of the Berkeley Sky Computing Lab. The rLLM team is generously supported by grants from Laude Institute, AWS, Hyperbolic, Fireworks AI, and Modal. Special thanks to Together AI for the research partnership and compute support.

Citation

@misc{rllm2025,
  title={rLLM: A Framework for Post-Training Language Agents},
  author={Sijun Tan and Michael Luo and Colin Cai and Tarun Venkat and Kyle Montgomery and Aaron Hao and Tianhao Wu and Arnav Balyan and Manan Roongta and Chenguang Wang and Li Erran Li and Raluca Ada Popa and Ion Stoica},
  year={2025},
  howpublished={\url{https://pretty-radio-b75.notion.site/rLLM-A-Framework-for-Post-Training-Language-Agents-21b81902c146819db63cd98a54ba5f31}},
  note={Notion Blog},
}

You may also cite our prior work DeepScaleR, DeepCoder, and DeepSWE.