Saving, Loading and Exporting Weights

During training, you'll need to save checkpoints for two main purposes: sampling (to test your model) and resuming training (to continue from where you left off). The TrainingClient provides three methods to handle these cases:

  1. save_weights_for_sampler(): saves a copy of the model weights that can be used for sampling.
  2. save_state(): saves the weights and the optimizer state. You can fully resume training from this checkpoint.
  3. load_state(): loads the weights and the optimizer state from a checkpoint created by save_state(), so training can continue from that point.

Note that (1) is faster and requires less storage space than (2), because it does not include the optimizer state.

Both save_* functions require a name parameter: a string that identifies the checkpoint within the current training run. For example, you can name your checkpoints "0000", "0001", "step_1000", etc.
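
If you name checkpoints by step number, zero-padding keeps alphabetical order and numeric order in agreement. The helper below is a minimal sketch of that idea; checkpoint_name and its width parameter are hypothetical and not part of the library.

```python
def checkpoint_name(step: int, width: int = 6) -> str:
    """Return a zero-padded name such as 'step_001000' for a training step.

    Hypothetical helper: the library only requires that the name is a string.
    """
    if step < 0:
        raise ValueError("step must be non-negative")
    return f"step_{step:0{width}d}"


# Zero-padded names sort in the same order as the steps they describe.
names = [checkpoint_name(s) for s in (1000, 20, 500)]
print(sorted(names))
```

The result can be passed directly as the name argument of either save_* function.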

The return value contains a path field, which is a fully-qualified checkpoint URI. Older deployments often return values like tinker://<model_id>/<name>, while newer deployments may return logits://<model_id>/<name>. Either form can be loaded later by a new ServiceClient or TrainingClient.
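
Because the path is a plain string, you can split it into its parts for logging or bookkeeping. The sketch below assumes the <scheme>://<model_id>/<name> layout described above; parse_checkpoint_uri and CheckpointURI are hypothetical names, not library functions.

```python
from typing import NamedTuple


class CheckpointURI(NamedTuple):
    scheme: str    # "tinker" or "logits"
    model_id: str  # first segment after the scheme
    name: str      # everything after the model id


def parse_checkpoint_uri(uri: str) -> CheckpointURI:
    """Split a checkpoint URI into scheme, model id, and name.

    Hypothetical helper; it assumes the layout shown in this section.
    """
    scheme, sep, rest = uri.partition("://")
    if not sep or scheme not in ("tinker", "logits"):
        raise ValueError(f"not a checkpoint URI: {uri!r}")
    model_id, sep, name = rest.partition("/")
    if not model_id or not name:
        raise ValueError(f"missing model id or name: {uri!r}")
    return CheckpointURI(scheme, model_id, name)


print(parse_checkpoint_uri("logits://abc123/sampler_weights/final"))
```

The helper only inspects the string. Whether a given deployment accepts a given scheme is decided by the service.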

Saving for sampling

import logits

service_client = logits.ServiceClient()
training_client = service_client.create_lora_training_client(
    base_model="Qwen/Qwen3.5-4B", rank=32
)

# Save a checkpoint that you can use for sampling
sampling_path = training_client.save_weights_for_sampler(name="0000").result().path

# Create a sampling client with that checkpoint
sampling_client = service_client.create_sampling_client(model_path=sampling_path)

Shortcut: Combine these steps with:

sampling_client = training_client.save_weights_and_get_sampling_client(name="0000")

Saving to resume training

Use save_state() and load_state() when you need to pause and continue training with full optimizer state preserved:

# Save a checkpoint that you can resume from
resume_path = training_client.save_state(name="0010").result().path

# Load that checkpoint
training_client.load_state(resume_path)

Use save_state() when you need to:

  • Run multi-step pipelines (e.g. supervised learning followed by reinforcement learning)
  • Adjust hyperparameters or data mid-run
  • Recover from interruptions
  • Preserve exact optimizer state (momentum, learning rate schedules, etc.)
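
To recover from interruptions, one option is to record the most recent resume path in a local file and load it at startup. The sketch below uses only the save_state() and load_state() calls shown above; the helper names and the JSON record file are assumptions for illustration.

```python
import json
from pathlib import Path


def save_resume_point(training_client, step: int, record_file: Path) -> str:
    """Save a resumable checkpoint and record its path locally.

    Hypothetical helper: record_file is a local JSON file, not a library feature.
    """
    path = training_client.save_state(name=f"{step:06d}").result().path
    record_file.write_text(json.dumps({"step": step, "path": path}))
    return path


def resume_if_possible(training_client, record_file: Path) -> int:
    """Load the last recorded checkpoint, if any.

    Returns the step stored with that checkpoint, or 0 when no record exists.
    """
    if not record_file.exists():
        return 0
    record = json.loads(record_file.read_text())
    training_client.load_state(record["path"])
    return record["step"]
```

Call resume_if_possible() once before the training loop, then call save_resume_point() every N steps inside it.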

Downloading weights

To download a checkpoint archive as a file:

import urllib.request
import logits

service_client = logits.ServiceClient()
rest_client = service_client.create_rest_client()

checkpoint_path = "logits://<unique_id>/sampler_weights/final"  # tinker://... also works
archive = rest_client.get_checkpoint_archive_url_from_tinker_path(checkpoint_path).result()

# archive.url is a signed download URL valid until archive.expires
urllib.request.urlretrieve(archive.url, "archive.tar")
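
After the download finishes, you can unpack the file with Python's standard tarfile module. This is a minimal sketch that assumes the archive is an ordinary tar file, as the archive.tar file name suggests; extract_archive is a hypothetical helper, and the contents of the archive are not specified here.

```python
import tarfile
from pathlib import Path


def extract_archive(archive_path: str, dest_dir: str) -> list[str]:
    """Extract a tar archive into dest_dir and return its member names."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path) as tar:
        names = tar.getnames()
        if hasattr(tarfile, "data_filter"):
            # The "data" filter rejects members that would escape dest_dir.
            tar.extractall(dest, filter="data")
        else:
            # Older Python versions without extraction filters: no such check.
            tar.extractall(dest)
    return names
```

For example, extract_archive("archive.tar", "checkpoint") unpacks the download into a checkpoint directory.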

Publishing and sharing checkpoints

To share a checkpoint with other users on the same deployment:

import logits

service_client = logits.ServiceClient()
rest_client = service_client.create_rest_client()

checkpoint_path = "logits://<run_id>/weights/<checkpoint_id>"  # tinker://... also works

# Publish (make public)
rest_client.publish_checkpoint_from_tinker_path(checkpoint_path).result()

# Unpublish (make private again)
rest_client.unpublish_checkpoint_from_tinker_path(checkpoint_path).result()

Published checkpoints load the same way as private ones:

training_client = service_client.create_training_client_from_state(checkpoint_path)

For the full method list, see RestClient.