Reinforcement Learning
Reinforcement learning (RL) means learning from trial and error. In supervised learning, we're given input-output pairs; in RL, we're given inputs (prompts) and a reward function (i.e., a function for scoring candidate outputs). The RL algorithm has to discover for itself what good outputs look like.
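To make "trial and error" concrete, here is a toy sketch in plain Python. It is not the Cookbook's training loop: the names `reward_fn` and `train`, the three fixed candidate outputs, and the hyperparameters are all assumptions made for illustration. The policy is a softmax over the candidates, updated with a basic REINFORCE-style rule, and it only ever sees the reward for the candidate it sampled.

```python
import math
import random


def softmax(logits):
    """Convert logits to probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reward_fn(prompt, output):
    """Hypothetical reward: 1.0 if `output` is the sum of the two prompt numbers."""
    a, b = prompt
    return 1.0 if output == a + b else 0.0


def train(prompt, candidates, steps=200, lr=0.5, seed=0):
    """Toy policy-gradient loop over a fixed set of candidate outputs."""
    rng = random.Random(seed)
    logits = [0.0] * len(candidates)
    for _ in range(steps):
        probs = softmax(logits)
        # Trial: sample one candidate from the current policy.
        i = rng.choices(range(len(candidates)), weights=probs)[0]
        # Feedback: the only learning signal is the scalar reward.
        r = reward_fn(prompt, candidates[i])
        # REINFORCE update: d log pi(i) / d logit_j = 1[j == i] - probs[j].
        for j in range(len(logits)):
            grad = (1.0 if j == i else 0.0) - probs[j]
            logits[j] += lr * r * grad
    return softmax(logits)


candidates = [4, 5, 6]
probs = train(prompt=(2, 3), candidates=candidates)
best = max(range(len(candidates)), key=lambda j: probs[j])
print(candidates[best], [round(p, 3) for p in probs])
```

No input-output pair is ever shown to the policy; probability mass shifts toward the rewarded candidate only because sampling it produced a nonzero reward.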
Here are the RL training modes supported in the Logits Cookbook:
- RL with Verifiable Rewards (RLVR): the reward function checks model outputs using a program, for example by checking a math answer or running unit tests on generated code. This mode is especially suitable for teaching reasoning and multi-step tool use.
- RL from Human Feedback (RLHF): a preference model trained on human judgements scores candidate outputs, and RL optimizes against those scores.
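As a rough illustration of a verifiable reward, the sketch below scores a completion by comparing the last number it contains against a reference answer. The function name `math_reward` and the "last number is the final answer" convention are assumptions made for this example, not the Cookbook's API; a real environment would likely use a stricter answer format.

```python
import re
from fractions import Fraction

# Matches integers and simple decimals, optionally negative.
NUMBER_PATTERN = re.compile(r"-?\d+(?:\.\d+)?")


def math_reward(completion: str, reference: str) -> float:
    """Hypothetical verifiable reward: 1.0 if the last number in the
    completion equals the reference answer, else 0.0."""
    numbers = NUMBER_PATTERN.findall(completion)
    if not numbers:
        return 0.0  # No answer found, so nothing to verify.
    # Fraction compares values exactly, so "3.50" equals "3.5".
    return 1.0 if Fraction(numbers[-1]) == Fraction(reference) else 0.0


print(math_reward("12 * 4 = 48, so the answer is 48.", "48"))
print(math_reward("I think it is 47", "48"))
```

Because the check is a program rather than a learned model, the same completion always gets the same score, which is what makes this kind of reward "verifiable".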
We anticipate a few common use cases:
- Specialist model: start with a post-trained model and do RL on a custom environment.
- Research on RL algorithms: use our minimal training loop as a starting point.
- Research on post-training pipelines: chain supervised learning (SL) and RL runs with different data mixes and reward functions.
Every training iteration writes human-readable HTML reports and machine-readable JSON files to log_path. See Training outputs for the full file reference.