Saving, Loading and Exporting Weights
During training, you'll need to save checkpoints for two main purposes: sampling (to test your model) and resuming training (to continue from where you left off). The TrainingClient provides three methods to handle these cases:
1. save_weights_for_sampler(): saves a copy of the model weights that can be used for sampling.
2. save_state(): saves the weights and the optimizer state. You can fully resume training from this checkpoint.
3. load_state(): loads the weights and the optimizer state from a checkpoint written by save_state(), so training continues from that point.
Note that save_weights_for_sampler() is faster and requires less storage space than save_state().
Both save_* functions require a name parameter: a string you choose to identify the checkpoint within the current training run. For example, you can name your checkpoints "0000", "0001", "step_1000", etc.
The return value contains a path field, which is a fully-qualified checkpoint URI. Older deployments often return values like tinker://<model_id>/<name>, while newer deployments may return logits://<model_id>/<name>. Either form can be loaded later by a new ServiceClient or TrainingClient.
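Because the URI scheme can differ between deployments, code that inspects a checkpoint path should avoid hard-coding a single prefix. The following is a minimal sketch that assumes only the `<scheme>://<model_id>/<name>` layout described above. `split_checkpoint_uri` and `CheckpointURI` are hypothetical helpers written for illustration; they are not part of the client library.

```python
from typing import NamedTuple

KNOWN_SCHEMES = ("tinker", "logits")  # the two schemes mentioned in the text


class CheckpointURI(NamedTuple):
    scheme: str
    model_id: str
    name: str


def split_checkpoint_uri(uri: str) -> CheckpointURI:
    """Split '<scheme>://<model_id>/<name>' into its three parts (hypothetical helper)."""
    scheme, sep, rest = uri.partition("://")
    if not sep or scheme not in KNOWN_SCHEMES:
        raise ValueError(f"not a checkpoint URI: {uri!r}")
    # Everything after the first '/' is kept as the name, so names that
    # themselves contain slashes are returned intact.
    model_id, sep, name = rest.partition("/")
    if not sep or not model_id or not name:
        raise ValueError(f"missing model id or name: {uri!r}")
    return CheckpointURI(scheme, model_id, name)
```

In practice you rarely need to parse the URI at all, since the path can be passed back to the client unchanged. A helper like this is mainly useful for logging or for grouping checkpoints by run.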
Saving for sampling
import logits
service_client = logits.ServiceClient()
training_client = service_client.create_lora_training_client(
    base_model="Qwen/Qwen3.5-4B", rank=32
)
# Save a checkpoint that you can use for sampling
sampling_path = training_client.save_weights_for_sampler(name="0000").result().path
# Create a sampling client with that checkpoint
sampling_client = service_client.create_sampling_client(model_path=sampling_path)
Shortcut: Combine these steps with:
sampling_client = training_client.save_weights_and_get_sampling_client(name="0000")
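In a longer run you will usually want to save sampler weights at regular intervals instead of once. The sketch below shows one way to do that. It relies only on the `save_weights_for_sampler(name=...).result().path` call shown above; `train_step`, `checkpoint_name`, and `train_with_sampler_checkpoints` are hypothetical names, and the training client is passed in so the helper makes no assumptions about how it was created.

```python
from typing import Any, Callable, List


def checkpoint_name(step: int, width: int = 6) -> str:
    """Zero-pad the step so checkpoint names sort in training order."""
    return f"{step:0{width}d}"


def train_with_sampler_checkpoints(
    training_client: Any,
    train_step: Callable[[int], None],
    num_steps: int,
    save_every: int,
) -> List[str]:
    """Run train_step for num_steps, saving sampler weights every save_every steps."""
    paths: List[str] = []
    for step in range(1, num_steps + 1):
        train_step(step)  # hypothetical: your own forward/backward and optimizer update
        if step % save_every == 0:
            future = training_client.save_weights_for_sampler(name=checkpoint_name(step))
            paths.append(future.result().path)  # block until the save has finished
    return paths
```

Zero-padded names are a design choice, not a requirement: any string is accepted, but fixed-width names keep a listing of checkpoints in chronological order.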
Saving to resume training
Use save_state() and load_state() when you need to pause and continue training with full optimizer state preserved:
# Save a checkpoint that you can resume from
resume_path = training_client.save_state(name="0010").result().path
# Load that checkpoint
training_client.load_state(resume_path)
Use save_state() when you need to:
- Run multi-step pipelines (e.g. supervised learning followed by reinforcement learning)
- Adjust hyperparameters or data mid-run
- Recover from interruptions
- Preserve exact optimizer state (momentum, learning rate schedules, etc.)
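To recover from interruptions, the latest checkpoint path has to survive the process that created it. One possible approach, sketched below, records the path in a small local JSON file after each `save_state()` call and reads it back on startup. Only `save_state(name=...).result().path` and `load_state(path)` come from the text above; `save_resumable`, `resume_if_possible`, and the marker file are hypothetical and stand in for whatever bookkeeping your training script already has.

```python
import json
from pathlib import Path
from typing import Any


def save_resumable(training_client: Any, step: int, marker: Path) -> str:
    """Save weights and optimizer state, then record the checkpoint path locally."""
    path = training_client.save_state(name=f"step_{step}").result().path
    # Write to a temporary file and rename it, so an interruption during the
    # write cannot leave a half-written marker behind.
    tmp = marker.with_suffix(".tmp")
    tmp.write_text(json.dumps({"step": step, "path": path}))
    tmp.replace(marker)
    return path


def resume_if_possible(training_client: Any, marker: Path) -> int:
    """Load the last recorded checkpoint; return the step to continue from (0 if none)."""
    if not marker.exists():
        return 0
    record = json.loads(marker.read_text())
    # Assumption: load_state is called exactly as in the example above. If it
    # returns a future in your client version, wait on it before training.
    training_client.load_state(record["path"])
    return record["step"]
```

A training script would call `resume_if_possible` once at startup and start its loop from the returned step, calling `save_resumable` at whatever interval it can afford.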
Downloading weights
To download a checkpoint archive as a file:
import urllib.request
import logits
service_client = logits.ServiceClient()
rest_client = service_client.create_rest_client()
checkpoint_path = "logits://<unique_id>/sampler_weights/final" # tinker://... also works
archive = rest_client.get_checkpoint_archive_url_from_tinker_path(checkpoint_path).result()
# archive.url is a signed download URL valid until archive.expires
urllib.request.urlretrieve(archive.url, "archive.tar")
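Once the archive is on disk, it can be unpacked with the standard library. This sketch assumes the download is a tar archive, as the `archive.tar` filename above suggests; the layout of the files inside is not specified here, so the helper simply returns the member names. `extract_checkpoint` is a hypothetical helper.

```python
import tarfile
from pathlib import Path
from typing import List


def extract_checkpoint(archive_path: str, dest_dir: str) -> List[str]:
    """Extract a downloaded checkpoint archive and return its member names."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path) as tar:  # default mode detects compression
        names = tar.getnames()
        # The "data" filter rejects absolute paths and ".." traversal. It is
        # available on Python 3.12+ and on recent patch releases of older
        # versions; without it, only extract archives you trust.
        kwargs = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}
        tar.extractall(dest, **kwargs)
    return names
```

For example, `extract_checkpoint("archive.tar", "checkpoint/")` unpacks the file downloaded above into a local `checkpoint/` directory.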
Publishing and sharing checkpoints
To share a checkpoint with other users on the same deployment:
import logits
service_client = logits.ServiceClient()
rest_client = service_client.create_rest_client()
checkpoint_path = "logits://<run_id>/weights/<checkpoint_id>" # tinker://... also works
# Publish (make public)
rest_client.publish_checkpoint_from_tinker_path(checkpoint_path).result()
# Unpublish (make private again)
rest_client.unpublish_checkpoint_from_tinker_path(checkpoint_path).result()
Published checkpoints load the same way as private ones:
training_client = service_client.create_training_client_from_state(checkpoint_path)
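If a checkpoint should be public only for a limited time, for instance while a collaborator copies it, the publish and unpublish calls can be paired so the second always runs. The following is a sketch of that pattern using a context manager. It relies only on the two RestClient methods shown above; `temporarily_published` is a hypothetical helper, and the rest client is passed in as an argument.

```python
from contextlib import contextmanager
from typing import Any, Iterator


@contextmanager
def temporarily_published(rest_client: Any, checkpoint_path: str) -> Iterator[str]:
    """Publish a checkpoint for the duration of a with-block, then make it private again."""
    rest_client.publish_checkpoint_from_tinker_path(checkpoint_path).result()
    try:
        yield checkpoint_path
    finally:
        # Runs even if the body of the with-block raises.
        rest_client.unpublish_checkpoint_from_tinker_path(checkpoint_path).result()
```

Usage would look like `with temporarily_published(rest_client, checkpoint_path): ...`, with the sharing step placed inside the block.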
For the full method list, see RestClient.