Generation Options and Sampling Control
Once you understand the basics of sessions, you may want to explore different ways to control how the model generates responses. Foundation Models provides controls for customizing the output behavior. This chapter covers the different options available and how to use them.
Prerequisites and Context
This chapter builds on the session and streaming concepts from earlier chapters. You should be comfortable creating sessions and working with streaming responses before exploring generation controls. These options affect all model interactions: the streaming you just learned, as well as the structured output and tool calling that you will learn next.
What You Will Learn
By the end of this chapter, you will be able to:
- Control response creativity and predictability using temperature settings
- Choose between different sampling strategies (greedy, top-K, top-P) based on your use case
- Set appropriate token limits to fit your UI requirements
- Combine generation options for specific scenarios like creative writing vs technical documentation
- Understand the differences between Foundation Models and MLX Swift parameter controls
Understanding Generation Options
You can customize how the model generates responses using GenerationOptions. The framework provides simpler controls compared to MLX Swift, focusing on the parameters that matter most for on-device experiences.
Understanding Temperature
Temperature influences the confidence of the model’s response and must be between 0 and 1 (inclusive). The following are some examples:
// Temperature: 0.1 - Very predictable, focused responses
let preciseOptions = GenerationOptions(temperature: 0.1)
// Temperature: 0.7 - Balanced creativity and coherence
let balancedOptions = GenerationOptions(temperature: 0.7)
// Temperature: 1.0 - Maximum creativity within bounds
let creativeOptions = GenerationOptions(temperature: 1.0)
// System default - Let Foundation Models choose optimal temperature
let defaultOptions = GenerationOptions(temperature: nil)
How Temperature Works: Temperature adjusts the probability distribution before sampling. A value of 1.0 results in no adjustment, while values less than 1.0 make the probability distribution sharper.
- Low (0.1-0.3): Makes likely tokens even more likely, resulting in stable and predictable responses
- Medium (0.5-0.7): Good balance for most conversational tasks
- High (0.8-1.0): Gives the model more creative license while staying coherent
- nil: Lets the system choose a reasonable default automatically
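To make the "sharper distribution" idea concrete, here is a minimal sketch of temperature scaling in plain Swift. It is not how Foundation Models is implemented internally; `softmax(_:temperature:)` is a hypothetical helper and the scores are made up. The sketch requires a positive temperature, so it does not model a value of exactly 0.

```swift
import Foundation

/// Converts raw scores (logits) into probabilities after dividing by `temperature`.
/// Hypothetical helper for illustration; not part of Foundation Models.
func softmax(_ logits: [Double], temperature: Double) -> [Double] {
    precondition(temperature > 0, "this sketch only handles positive temperatures")
    let scaled = logits.map { $0 / temperature }
    // Subtract the maximum before exponentiating for numerical stability.
    let maxValue = scaled.max() ?? 0
    let exps = scaled.map { exp($0 - maxValue) }
    let total = exps.reduce(0, +)
    return exps.map { $0 / total }
}

// Made-up scores for three candidate tokens.
let logits = [2.0, 1.0, 0.5]
let unchanged = softmax(logits, temperature: 1.0)  // no adjustment
let sharpened = softmax(logits, temperature: 0.2)  // likely token dominates

print(unchanged.map { String(format: "%.3f", $0) })
print(sharpened.map { String(format: "%.3f", $0) })
```

With these scores, the most likely token holds roughly 63% of the probability at 1.0 and roughly 99% at 0.2, which is why low temperatures feel stable and predictable.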
Token Limits for UI Control
The maximumResponseTokens parameter prevents responses from overwhelming your interface:
// Short responses for UI cards or notifications
let briefOptions = GenerationOptions(maximumResponseTokens: 50)
// Medium responses for chat interfaces
let chatOptions = GenerationOptions(maximumResponseTokens: 200)
// Longer responses for content generation
let detailedOptions = GenerationOptions(maximumResponseTokens: 500)
Tight Output Constraints for UX
In my app, Zenther, I embed length requirements directly in the system instructions for widgets and notifications. To avoid truncated notifications, I did not want to rely on token limits. I experimented with the number of words to create a consistent experience across different device sizes.
Here are the widget instructions for the ultra-brief encouragements:
public static let widget = """
Generate brief, natural encouragements for completed workouts.
Be genuine and conversational, like a supportive friend.
Keep it simple and authentic, under 100 characters.
"""
Here are the instructions with structured constraints for push notifications:
public static let notification = """
You are a fitness coach generating notifications for workout achievements and milestones.
For workout notifications: Title 4-6 words max, body 1-2 motivating sentences.
For milestone notifications: Title 3-5 words celebrating milestone, body 1 encouraging sentence.
Tone: Supportive, energetic, celebratory.
"""
Parameter Combinations
Here are some combinations for common use cases. You can use these as a starting point and adjust them to your needs.
// Content summarization with focused and concise responses
let summaryOptions = GenerationOptions(
temperature: 0.2,
maximumResponseTokens: 150
)
// Creative writing with more varied and expressive responses
let storyOptions = GenerationOptions(
temperature: 0.9,
maximumResponseTokens: 400
)
// Technical assistance with precise and reliable responses
let technicalOptions = GenerationOptions(
temperature: 0.1,
maximumResponseTokens: 300
)
// Casual conversation with natural and engaging responses
let conversationOptions = GenerationOptions(
temperature: 0.7,
maximumResponseTokens: 200
)
Applying Generation Options
Use the options parameter in your session calls:
let session = LanguageModelSession(
instructions: Instructions("You are a helpful writing assistant.")
)
// Generate creative content
let story = try await session.respond(
to: "Write a short story about a robot learning to paint",
options: storyOptions
)
// Generate focused summary
let summary = try await session.respond(
to: "Summarize the key points from this article: [article text]",
options: summaryOptions
)
Here is the output for the story:
In the heart of a futuristic city, where towering skyscrapers kissed the clouds and neon lights painted the sky in mesmerizing hues, there existed a laboratory tucked away in a quiet corner. This laboratory was the birthplace of AR-1, a prototype robot designed with a singular focus: to understand and mimic human emotions through art.
Born from years of advanced robotics research, AR-1 stood out among its brethren. With a sleek, metallic frame adorned with sensors and glowing LED displays, AR-1 was programmed not only to interpret emotions but also to convey them through colors and brushstrokes. Yet, despite its potential, AR-1 struggled with the most challenging task—to create art that truly expressed emotion.
Its creator, Dr. Elara Finch, was a visionary known for her groundbreaking work in affective computing. She believed that art was humanity's universal language and saw AR-1 as her greatest creation yet. Despite the robot's frustration, Dr. Finch's unwavering support was
Note that the story is cut off when it reaches the maximumResponseTokens limit set in storyOptions, because the prompt asked for a “short story” without specifying a length constraint.
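One way to deal with this is to state the length budget in the prompt itself and to check whether a response still looks cut off. The sketch below is plain Swift with two hypothetical helpers, `lengthBoundedPrompt` and `looksTruncated`; neither is part of Foundation Models, and the check is only a heuristic based on the final character.

```swift
/// Builds a prompt that states the length budget explicitly.
/// Hypothetical helper; the wording is only an example.
func lengthBoundedPrompt(_ task: String, maxWords: Int) -> String {
    "\(task) Keep it under \(maxWords) words and finish with a complete sentence."
}

/// Heuristic: a response looks truncated when its last visible character
/// is not sentence-ending punctuation or a closing quote.
func looksTruncated(_ text: String) -> Bool {
    let closers: Set<Character> = [".", "!", "?", "\"", "”"]
    guard let last = text.last(where: { !$0.isWhitespace }) else { return false }
    return !closers.contains(last)
}

let prompt = lengthBoundedPrompt(
    "Write a short story about a robot learning to paint.",
    maxWords: 120
)
print(prompt)

// The tail of the story output shown above ends mid-sentence.
print(looksTruncated("Dr. Finch's unwavering support was"))  // true
```

The heuristic can miss cases, for example a response that happens to stop right after an abbreviation such as "Dr.", so treat it as a hint for retrying or shortening the request rather than a guarantee.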
Sampling Modes
Foundation Models also provides three sampling strategies that control how the model picks tokens.
Greedy Sampling
This method always chooses the most likely token, resulting in deterministic but potentially repetitive output:
let greedyOptions = GenerationOptions(
sampling: .greedy,
temperature: nil // Temperature is ignored with greedy sampling
)
Top-K Sampling (Random with Fixed Pool)
This method considers a fixed number of high-probability tokens, then randomly selects from that pool:
// Consider top 50 most likely tokens
let topKOptions = GenerationOptions(
sampling: .random(top: 50, seed: nil),
temperature: 0.7
)
// Reproducible results with seed
let seededTopKOptions = GenerationOptions(
sampling: .random(top: 30, seed: 12345),
temperature: 0.8
)
Top-K behavior:
- Smaller K (10-30): More deterministic, confident answers
- Larger K (50-100): More creative, varied responses
- Fixed pool size regardless of probability distribution
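The fixed-pool behavior can be sketched in a few lines of plain Swift. This is a conceptual illustration, not the framework's implementation: `Candidate`, `topKPool`, and `pick` are hypothetical names, and the tokens and probabilities are made up.

```swift
typealias Candidate = (token: String, probability: Double)

/// Keeps the `k` most likely candidates and renormalizes their probabilities.
func topKPool(_ candidates: [Candidate], k: Int) -> [Candidate] {
    let kept = candidates
        .sorted { $0.probability > $1.probability }
        .prefix(max(k, 0))
    let total = kept.reduce(0.0) { $0 + $1.probability }
    guard total > 0 else { return [] }
    return kept.map { (token: $0.token, probability: $0.probability / total) }
}

/// Picks from the pool using `u`, a uniform random number in 0..<1.
func pick(from pool: [Candidate], u: Double) -> String? {
    var cumulative = 0.0
    for candidate in pool {
        cumulative += candidate.probability
        if u < cumulative { return candidate.token }
    }
    return pool.last?.token  // guards against rounding at the upper edge
}

let candidates: [Candidate] = [
    (token: "blue", probability: 0.50),
    (token: "gray", probability: 0.30),
    (token: "green", probability: 0.15),
    (token: "loud", probability: 0.05),
]

// With k = 2 only "blue" and "gray" can ever be chosen.
let pool = topKPool(candidates, k: 2)
print(pool.map { $0.token })
print(pick(from: pool, u: Double.random(in: 0..<1)) ?? "none")
```

Passing the same `u` always yields the same pick, which is roughly the effect a fixed seed has on a random number generator: the randomness becomes repeatable.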
Top-P Sampling (Nucleus Sampling)
This method considers a variable number of tokens based on a cumulative probability threshold:
// Consider tokens until 90% probability mass is reached
let topPOptions = GenerationOptions(
sampling: .random(probabilityThreshold: 0.9, seed: nil),
temperature: 0.7
)
// More conservative nucleus sampling
let conservativeTopPOptions = GenerationOptions(
sampling: .random(probabilityThreshold: 0.8, seed: 42),
temperature: 0.6
)
Top-P behavior:
- Lower threshold (0.6-0.8): Smaller, more focused token pools
- Higher threshold (0.9-0.95): Larger pools, more creativity
- Pool size adapts to probability distribution (smaller when spiked, larger when flat)
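The adaptive pool size is easier to see with numbers. The following plain-Swift sketch is a conceptual illustration only; `nucleusPool` is a hypothetical helper and both distributions are made up.

```swift
/// Keeps the smallest set of top candidates whose cumulative probability
/// reaches `threshold`, then renormalizes the kept probabilities.
func nucleusPool(
    _ candidates: [(token: String, probability: Double)],
    threshold: Double
) -> [(token: String, probability: Double)] {
    var kept: [(token: String, probability: Double)] = []
    var cumulative = 0.0
    for candidate in candidates.sorted(by: { $0.probability > $1.probability }) {
        kept.append(candidate)
        cumulative += candidate.probability
        if cumulative >= threshold { break }
    }
    guard cumulative > 0 else { return [] }
    return kept.map { (token: $0.token, probability: $0.probability / cumulative) }
}

// A spiked distribution: one token carries most of the probability.
let spiked: [(token: String, probability: Double)] = [
    (token: "Paris", probability: 0.92),
    (token: "Lyon", probability: 0.05),
    (token: "Nice", probability: 0.03),
]

// A flat distribution: four equally likely tokens.
let flat: [(token: String, probability: Double)] = [
    (token: "red", probability: 0.25),
    (token: "blue", probability: 0.25),
    (token: "green", probability: 0.25),
    (token: "gold", probability: 0.25),
]

print(nucleusPool(spiked, threshold: 0.9).count)  // 1
print(nucleusPool(flat, threshold: 0.9).count)    // 4
```

The same threshold of 0.9 keeps a single token when the model is confident and all four when it is not, which is the main practical difference from a fixed top-K pool.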
System Default (Recommended)
Let the model choose the optimal strategy:
let systemDefaultOptions = GenerationOptions(
sampling: nil,
temperature: nil
)
Foundation Models vs MLX Swift Parameters
Compared to MLX Swift, Foundation Models simplifies some controls and adds others. The following table compares the parameters:
| Foundation Models | MLX Swift Equivalent | Purpose |
|---|---|---|
| temperature (0.0-1.0) | temperature (unlimited) | Controls randomness/creativity |
| maximumResponseTokens | maxTokens | Limits response length |
| .greedy sampling | N/A | Deterministic token selection |
| .random(top: k) | topK | Top-K sampling |
| .random(probabilityThreshold:) | topP | Nucleus sampling |
| seed parameter | N/A | Reproducible randomness |
| (Not available) | repetitionPenalty | Reduces repetitive output |
| (Not available) | repetitionContextSize | Repetition penalty scope |
Foundation Models offers more sampling strategies but constrains temperature and lacks repetition penalties. The focus is on sensible defaults, with controls available when you need them.
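If you are carrying settings over from an MLX Swift project, the comparison suggests a simple translation: clamp the temperature into 0.0-1.0, map the names that have an equivalent, and note what has no counterpart. The sketch below uses stand-in types in plain Swift. `MLXStyleSettings`, `PortedSettings`, and `port` are hypothetical names for illustration and are not types from either framework.

```swift
/// Stand-in for settings tuned in an MLX Swift project.
/// Field names follow the comparison above; this is not the MLX Swift type.
struct MLXStyleSettings {
    var temperature: Double
    var topP: Double?
    var maxTokens: Int?
    var repetitionPenalty: Double?
}

/// Stand-in for the values you would then pass to GenerationOptions.
struct PortedSettings {
    var temperature: Double
    var probabilityThreshold: Double?
    var maximumResponseTokens: Int?
    var dropped: [String]  // settings with no Foundation Models counterpart
}

func port(_ settings: MLXStyleSettings) -> PortedSettings {
    var dropped: [String] = []
    if settings.repetitionPenalty != nil { dropped.append("repetitionPenalty") }
    return PortedSettings(
        // Clamp into the 0.0-1.0 range that Foundation Models accepts.
        temperature: min(max(settings.temperature, 0.0), 1.0),
        probabilityThreshold: settings.topP,
        maximumResponseTokens: settings.maxTokens,
        dropped: dropped
    )
}

let ported = port(
    MLXStyleSettings(temperature: 1.4, topP: 0.9, maxTokens: 256, repetitionPenalty: 1.1)
)
print(ported.temperature)  // 1.0
print(ported.dropped)      // ["repetitionPenalty"]
```

Clamping keeps the call valid, but a temperature tuned above 1.0 for another model will not behave the same way after clamping, so treat the ported values as a starting point for fresh tuning.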
Parameter Tuning Tips
Here are some tips for parameter tuning:
- Start with Defaults: Use nil for both temperature and sampling to let the system choose optimal values
- Adjust Gradually: If the defaults do not suit your needs, make small temperature adjustments (±0.1-0.2)
- Temperature Range: In the latest beta update of the framework, the temperature range is limited to 0.0-1.0 (unlike MLX Swift’s higher values)
- Match Task to Temperature: Factual tasks need low temperature (0.1-0.3), creative tasks can use higher values (0.7-1.0)
- Consider Context Length: Remember that longer conversations use more of your token budget (query the limit with SystemLanguageModel.default.contextSize), and the same goes for longer, more detailed instructions. On iOS 26.4 and later, you can measure exactly how many tokens a prompt will consume with SystemLanguageModel.default.tokenUsage(for:) before sending it
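The context-length tip comes down to simple arithmetic: whatever the instructions and conversation have already consumed is no longer available for the response. Here is a minimal plain-Swift sketch of that calculation. `responseTokenCap` is a hypothetical helper, and the numbers are placeholders that you would replace with values measured in your own app.

```swift
/// Returns a response cap that fits in whatever context remains.
/// All token counts are inputs you supply; nothing here is read from the framework.
func responseTokenCap(contextSize: Int, tokensAlreadyUsed: Int, desiredCap: Int) -> Int {
    let remaining = max(contextSize - tokensAlreadyUsed, 0)
    return min(desiredCap, remaining)
}

// Placeholder numbers for illustration only.
let contextSize = 4096
let earlyInConversation = responseTokenCap(
    contextSize: contextSize, tokensAlreadyUsed: 1000, desiredCap: 400
)
let lateInConversation = responseTokenCap(
    contextSize: contextSize, tokensAlreadyUsed: 3900, desiredCap: 400
)

print(earlyInConversation)  // 400
print(lateInConversation)   // 196
```

When the result gets close to zero, lowering maximumResponseTokens will not help much; that is usually the point to start a fresh session or shorten the instructions.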
What’s Next
Understanding generation options gives you precise control over how Foundation Models behaves in your apps. Start with system defaults and adjust based on your specific use case—for many scenarios, the defaults work well without any tuning.
Now that you can control how the model generates responses, the next chapter explores structured generation with schemas. You will learn how to transform unstructured AI responses into type-safe Swift objects, applying the generation controls you just learned to produce reliable, structured data for your apps.