Reinforcement Learning

Reinforcement learning (RL) means learning from trial and error. In supervised learning, we're given input-output pairs; in RL, we're given inputs (prompts) and a reward function that scores candidate outputs. The RL algorithm has to discover for itself what good outputs look like.

Here are the RL training modes supported in the Logits Cookbook:

RL with Verifiable Rewards (RLVR): the reward function checks model outputs using a program, for example by checking a math answer or running unit tests on generated code. This mode is especially suitable for teaching reasoning and multi-step tool use.
RL from Human Feedback (RLHF): a preference model trained on human judgements scores candidate outputs, and RL optimizes against those scores.
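
To make the idea of a programmatic reward concrete, here is a minimal sketch of a verifiable reward for math answers. It is not the cookbook's own implementation: the function names `extract_final_answer` and `math_reward`, the `\boxed{...}` answer convention, and the exact-match comparison are all assumptions chosen for illustration.

```python
import re
from typing import Optional


def extract_final_answer(completion: str) -> Optional[str]:
    """Return the content of the last \\boxed{...} in the completion, if any.

    Assumption: the prompt asks the model to wrap its final answer in \\boxed{}.
    Nested braces are not handled in this sketch.
    """
    matches = re.findall(r"\\boxed\{([^{}]*)\}", completion)
    return matches[-1].strip() if matches else None


def math_reward(completion: str, reference: str) -> float:
    """Hypothetical reward: 1.0 for an exact match with the reference, else 0.0."""
    answer = extract_final_answer(completion)
    if answer is None:
        # No parseable answer counts as a failure.
        return 0.0
    return 1.0 if answer == reference.strip() else 0.0


# Score several candidate outputs sampled for the same prompt.
candidates = [
    "Adding the two parts gives \\boxed{42}",
    "I think the answer is 41",
    "\\boxed{41}",
]
rewards = [math_reward(c, "42") for c in candidates]
print(rewards)  # [1.0, 0.0, 0.0]
```

A real reward would likely normalize answers before comparing them (for example, treating `0.5` and `1/2` as equal). In the RLHF mode, the body of `math_reward` would instead call a learned preference model.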

We anticipate a few common use cases:

Specialist model: start with a post-trained model and do RL on a custom environment.
Research on RL algorithms: use our minimal training loop as a starting point.
Research on post-training pipelines: chain SL and RL runs with different data mixes and reward functions.

Every training iteration writes human-readable HTML reports and machine-readable JSON files to log_path. See Training outputs for the full file reference.
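
As a rough sketch of how the machine-readable outputs might be consumed, the snippet below collects every JSON file under a log directory. The helper name `load_json_outputs` is hypothetical, and the sketch assumes each file ends in `.json` and holds a single JSON document; the actual file names and layout are listed in Training outputs and may differ.

```python
import json
from pathlib import Path
from typing import Any, Dict


def load_json_outputs(log_path: str) -> Dict[str, Any]:
    """Map each *.json file under log_path (by relative path) to its parsed content.

    Assumption: one JSON document per file. Files in other formats are ignored.
    """
    root = Path(log_path)
    outputs: Dict[str, Any] = {}
    for path in sorted(root.rglob("*.json")):
        with path.open(encoding="utf-8") as f:
            outputs[path.relative_to(root).as_posix()] = json.load(f)
    return outputs
```

Calling `load_json_outputs(log_path)` after a run returns a dictionary keyed by relative file path, which can then be inspected or fed into a plotting script.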