Reinforcement Learning

Reinforcement learning (RL) means learning from trial and error. In supervised learning, we're given input-output pairs; in RL, we're given inputs (prompts) and a reward function that scores candidate outputs. The RL algorithm has to discover for itself what good outputs look like.
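To make the contrast concrete, here is a minimal sketch of the two kinds of training data. The class names `SupervisedExample` and `RLExample` and the helper `best_of` are hypothetical and for illustration only; they are not part of the Cookbook's API. `best_of` is not an RL algorithm, just a picture of "try several outputs and see which scores well".

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class SupervisedExample:
    """Supervised learning: the desired output is given directly."""
    prompt: str
    target: str


@dataclass
class RLExample:
    """RL: only a scoring function is given, not the answer itself."""
    prompt: str
    reward_fn: Callable[[str], float]


def best_of(candidates: List[str], reward_fn: Callable[[str], float]) -> str:
    """Score every candidate and return the highest-scoring one."""
    return max(candidates, key=reward_fn)


PRIMARY_COLORS = {"red", "yellow", "blue"}

sl_example = SupervisedExample(prompt="Name a primary color.", target="red")
rl_example = RLExample(
    prompt="Name a primary color.",
    # Any primary color earns full reward, so several outputs count as good.
    reward_fn=lambda out: 1.0 if out.strip().lower() in PRIMARY_COLORS else 0.0,
)

winner = best_of(["green", "Blue", "purple"], rl_example.reward_fn)
```

Note that the supervised example commits to a single target, while the reward function accepts any output that passes its check.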

Here are the RL training modes supported in the Logits Cookbook:

  • RL with Verifiable Rewards (RLVR): the reward function checks model outputs using a program, for example by checking a math answer or running unit tests on generated code. This mode is especially suitable for teaching reasoning and multi-step tool use.
  • RL on Human Feedback (RLHF): a preference model trained on human judgements scores candidate outputs, and RL optimizes against those scores.
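As an illustration of a verifiable reward, the sketch below scores a math response by comparing the last number in the output against a known answer. The function names `extract_final_answer` and `math_reward` are assumptions made for this example, not Cookbook functions, and a real checker would likely need more careful answer parsing.

```python
import re
from typing import Optional


def extract_final_answer(text: str) -> Optional[str]:
    """Return the last integer or decimal number in the text, or None."""
    matches = re.findall(r"-?\d+(?:\.\d+)?", text)
    return matches[-1] if matches else None


def math_reward(output: str, expected: str) -> float:
    """Return 1.0 if the final number in the output equals the expected answer."""
    answer = extract_final_answer(output)
    if answer is None:
        return 0.0  # no number found, so the output cannot be verified
    return 1.0 if float(answer) == float(expected) else 0.0


correct = math_reward("3x = 21, so x = 7.", "7")
wrong = math_reward("I think the answer is 8", "7")
missing = math_reward("I am not sure", "7")
```

Because the check is a plain program, the reward is cheap to compute and needs no learned preference model, which is the property that distinguishes RLVR from RLHF in the list above.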

We anticipate a few common use cases:

  • Specialist model: start with a post-trained model and do RL on a custom environment.
  • Research on RL algorithms: use our minimal training loop as a starting point.
  • Research on post-training pipelines: chain SL and RL runs with different data mixes and reward functions.

Every training iteration writes human-readable HTML reports and machine-readable JSON files to log_path. See Training outputs for the full file reference.