Foundation Models Guide / Chapter 4

Generation Options and Sampling Control

Once you understand the basics of sessions, you can start controlling how the model generates responses. Foundation Models exposes these controls through GenerationOptions: temperature, sampling mode, and a maximum response length. This chapter covers each option and shows how to combine them.

Prerequisites and Context

This chapter builds on the session and streaming concepts from earlier chapters. You should be comfortable creating sessions and working with streaming responses before exploring generation controls. These options affect all model interactions: the streaming you just learned about, as well as the structured output and tool calling that you will learn next.

What You Will Learn

By the end of this chapter, you will be able to:

Control response creativity and predictability using temperature settings
Choose between different sampling strategies (greedy, top-K, top-P) based on your use case
Set appropriate token limits to fit your UI requirements
Combine generation options for specific scenarios like creative writing vs technical documentation
Understand the differences between Foundation Models and MLX Swift parameter controls

Understanding Generation Options

You can customize how the model generates responses using GenerationOptions. The framework provides simpler controls compared to MLX Swift, focusing on the parameters that matter most for on-device experiences.

Understanding Temperature

Temperature influences the confidence of the model’s response and must be between 0 and 1 (inclusive). The following are some examples:

// Temperature: 0.1 - Very predictable, focused responses
let preciseOptions = GenerationOptions(temperature: 0.1)

// Temperature: 0.7 - Balanced creativity and coherence  
let balancedOptions = GenerationOptions(temperature: 0.7)

// Temperature: 1.0 - Maximum creativity within bounds
let creativeOptions = GenerationOptions(temperature: 1.0)

// System default - Let Foundation Models choose optimal temperature
let defaultOptions = GenerationOptions(temperature: nil)

How Temperature Works: Temperature adjusts the probability distribution before sampling. A value of 1.0 results in no adjustment, while values less than 1.0 make the probability distribution sharper.

Low (0.1-0.3): Makes likely tokens even more likely, resulting in stable and predictable responses
Medium (0.5-0.7): Good balance for most conversational tasks
High (0.8-1.0): Gives the model more creative license while staying coherent
nil: Lets the system choose a reasonable default automatically
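
To make the sharpening effect concrete, here is a minimal sketch of temperature scaling in plain Swift. It uses the textbook softmax formulation and made-up scores for three candidate tokens. The function name and the numbers are illustrative assumptions, not the framework's internals.

```swift
import Foundation

/// Converts raw scores (logits) into probabilities after temperature scaling.
/// Illustrative only: a textbook formulation, not Foundation Models internals.
func temperatureScaledProbabilities(logits: [Double], temperature: Double) -> [Double] {
    precondition(!logits.isEmpty, "logits must not be empty")
    precondition(temperature > 0, "this sketch requires a temperature above 0")

    // Dividing by a temperature below 1.0 widens the gaps between scores.
    let scaled = logits.map { $0 / temperature }

    // Subtract the maximum before exponentiating for numerical stability.
    let maxScore = scaled.max()!
    let exponentials = scaled.map { exp($0 - maxScore) }
    let total = exponentials.reduce(0, +)
    return exponentials.map { $0 / total }
}

// Hypothetical scores for three candidate tokens.
let logits = [2.0, 1.0, 0.5]
let unadjusted = temperatureScaledProbabilities(logits: logits, temperature: 1.0)
let sharpened = temperatureScaledProbabilities(logits: logits, temperature: 0.2)

print(unadjusted) // the top token holds roughly 63% of the probability
print(sharpened)  // the top token holds more than 99% of the probability
```

At 1.0 the distribution is left as is. At 0.2 nearly all of the probability moves to the most likely token, which is why low temperatures feel predictable.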

Token Limits for UI Control

The maximumResponseTokens parameter prevents responses from overwhelming your interface:

// Short responses for UI cards or notifications
let briefOptions = GenerationOptions(maximumResponseTokens: 50)

// Medium responses for chat interfaces
let chatOptions = GenerationOptions(maximumResponseTokens: 200)

// Longer responses for content generation
let detailedOptions = GenerationOptions(maximumResponseTokens: 500)

Tight Output Constraints for UX

In my app, Zenther, I embed length requirements directly in the system instructions for widgets and notifications. I did not want to rely on token limits, because a response that hits the limit is truncated, and a truncated notification looks broken. I experimented with the number of words to create a consistent experience across different device sizes.

Here are the widget instructions for the ultra-brief encouragements:

public static let widget = """
Generate brief, natural encouragements for completed workouts.
Be genuine and conversational, like a supportive friend.
Keep it simple and authentic, under 100 characters.
"""

Here are the instructions with structured constraints for push notifications:

public static let notification = """
You are a fitness coach generating notifications for workout achievements and milestones.

For workout notifications: Title 4-6 words max, body 1-2 motivating sentences.
For milestone notifications: Title 3-5 words celebrating milestone, body 1 encouraging sentence.

Tone: Supportive, energetic, celebratory.
"""
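
Instructions like these state the length requirement, but a model may occasionally overshoot it, so it can help to check the text before displaying it. The following is a minimal sketch in plain Swift. The helper names, the fallback string, and the sample inputs are hypothetical and not taken from Zenther.

```swift
import Foundation

/// Returns the generated text when it fits the widget, otherwise a fixed fallback.
/// `limit` mirrors the "under 100 characters" rule in the widget instructions.
func widgetText(from generated: String,
                limit: Int = 100,
                fallback: String = "Nice work today!") -> String {
    let trimmed = generated.trimmingCharacters(in: .whitespacesAndNewlines)
    // Reject empty output and anything at or above the character limit.
    guard !trimmed.isEmpty, trimmed.count < limit else { return fallback }
    return trimmed
}

/// Checks a notification title against a maximum word count,
/// such as the 6-word ceiling for workout notifications.
func isTitleWithinLimit(_ title: String, maxWords: Int) -> Bool {
    let words = title.split(whereSeparator: { $0.isWhitespace })
    return !words.isEmpty && words.count <= maxWords
}

print(widgetText(from: "  Great run today, keep it up!  "))
print(isTitleWithinLimit("You crushed your morning run", maxWords: 6))
```

A fixed fallback keeps the widget intact in the rare case where the model ignores the constraint. Regenerating the text is another option if latency allows.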

Parameter Combinations

Here are some combinations for common use cases. You can use these as a starting point and adjust them to your needs.

// Content summarization with focused and concise responses
let summaryOptions = GenerationOptions(
    temperature: 0.2,
    maximumResponseTokens: 150
)

// Creative writing with more varied and expressive responses
let storyOptions = GenerationOptions(
    temperature: 0.9,
    maximumResponseTokens: 400
)

// Technical assistance with precise and reliable responses
let technicalOptions = GenerationOptions(
    temperature: 0.1,
    maximumResponseTokens: 300
)

// Casual conversation with natural and engaging responses
let conversationOptions = GenerationOptions(
    temperature: 0.7,
    maximumResponseTokens: 200
)

Applying Generation Options

Use the options parameter in your session calls:

let session = LanguageModelSession(
    instructions: Instructions("You are a helpful writing assistant.")
)

// Generate creative content
let story = try await session.respond(
    to: "Write a short story about a robot learning to paint",
    options: storyOptions
)

// Generate focused summary
let summary = try await session.respond(
    to: "Summarize the key points from this article: [article text]",
    options: summaryOptions
)

Here is the output for the story:

In the heart of a futuristic city, where towering skyscrapers kissed the clouds and neon lights painted the sky in mesmerizing hues, there existed a laboratory tucked away in a quiet corner. This laboratory was the birthplace of AR-1, a prototype robot designed with a singular focus: to understand and mimic human emotions through art.

Born from years of advanced robotics research, AR-1 stood out among its brethren. With a sleek, metallic frame adorned with sensors and glowing LED displays, AR-1 was programmed not only to interpret emotions but also to convey them through colors and brushstrokes. Yet, despite its potential, AR-1 struggled with the most challenging task—to create art that truly expressed emotion.

Its creator, Dr. Elara Finch, was a visionary known for her groundbreaking work in affective computing. She believed that art was humanity's universal language and saw AR-1 as her greatest creation yet. Despite the robot's frustration, Dr. Finch's unwavering support was

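Note that the story is cut off mid-sentence. The run that produced this output used a maximum of 200 tokens, and the prompt asked for a “short story” without specifying a length constraint, so the model reached the limit before finishing. Reaching maximumResponseTokens ends the response early without throwing an error; it does not make the model wrap up sooner.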

Sampling Modes

Foundation Models also provides three sampling strategies that control how the model picks tokens.

Greedy Sampling

This method always chooses the most likely token, resulting in deterministic but potentially repetitive output:

let greedyOptions = GenerationOptions(
    sampling: .greedy,
    temperature: nil // Temperature is ignored with greedy sampling
)
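
Conceptually, greedy selection is an argmax over the candidate tokens. Here is a minimal sketch in plain Swift, using made-up probabilities. The function name is illustrative, and the framework's actual implementation is not shown in this chapter.

```swift
/// Returns the index of the most likely token, or nil for an empty list.
/// Illustrative only: a plain argmax over hypothetical probabilities.
func greedyPick(_ probabilities: [Double]) -> Int? {
    probabilities.indices.max(by: { probabilities[$0] < probabilities[$1] })
}

// Hypothetical probabilities for three candidate tokens.
let candidates = [0.1, 0.6, 0.3]

// No randomness is involved, so repeated calls always agree.
let firstPick = greedyPick(candidates)
let secondPick = greedyPick(candidates)
print(firstPick == secondPick) // true
```

Because the same input always selects the same token, greedy output is repeatable, which is also why it can fall into repetitive phrasing.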

Top-K Sampling (Random with Fixed Pool)

This method considers a fixed number of high-probability tokens, then randomly selects from that pool:

// Consider top 50 most likely tokens
let topKOptions = GenerationOptions(
    sampling: .random(top: 50, seed: nil),
    temperature: 0.7
)

// Reproducible results with seed
let seededTopKOptions = GenerationOptions(
    sampling: .random(top: 30, seed: 12345),
    temperature: 0.8
)

Top-K behavior:

Smaller K (10-30): More deterministic, confident answers
Larger K (50-100): More creative, varied responses
Fixed pool size regardless of probability distribution
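
The sketch below illustrates the idea behind a fixed pool and a seed in plain Swift. It keeps the k most likely tokens, renormalizes them, and draws from the pool with a small seedable generator (SplitMix64). All names and probabilities are assumptions made for illustration, and the framework may implement seeding differently.

```swift
/// Small deterministic generator (SplitMix64) so the example can be seeded.
struct SeededGenerator: RandomNumberGenerator {
    private var state: UInt64
    init(seed: UInt64) { state = seed }

    mutating func next() -> UInt64 {
        state &+= 0x9E3779B97F4A7C15
        var z = state
        z = (z ^ (z >> 30)) &* 0xBF58476D1CE4E5B9
        z = (z ^ (z >> 27)) &* 0x94D049BB133111EB
        return z ^ (z >> 31)
    }
}

/// Keeps the k most likely tokens and renormalizes their probabilities.
func topKPool(_ probabilities: [Double], k: Int) -> [(index: Int, probability: Double)] {
    let ranked = probabilities.enumerated()
        .sorted { $0.element > $1.element }
        .prefix(max(k, 0))
    let total = ranked.reduce(0.0) { $0 + $1.element }
    return ranked.map { (index: $0.offset, probability: $0.element / total) }
}

/// Draws one token index from the pool using the supplied generator.
func sample<G: RandomNumberGenerator>(
    from pool: [(index: Int, probability: Double)],
    using generator: inout G
) -> Int? {
    guard let last = pool.last else { return nil }
    var remaining = Double.random(in: 0..<1, using: &generator)
    for candidate in pool {
        remaining -= candidate.probability
        if remaining < 0 { return candidate.index }
    }
    return last.index // covers floating-point rounding at the upper edge
}

// Hypothetical probabilities for five candidate tokens.
let tokenProbabilities = [0.05, 0.40, 0.10, 0.30, 0.15]
let pool = topKPool(tokenProbabilities, k: 3)
print(pool.map { $0.index }) // [1, 3, 4]

// Two generators with the same seed produce the same sequence of draws.
var firstGenerator = SeededGenerator(seed: 12345)
var secondGenerator = SeededGenerator(seed: 12345)
let firstDraws = (0..<5).map { _ in sample(from: pool, using: &firstGenerator) }
let secondDraws = (0..<5).map { _ in sample(from: pool, using: &secondGenerator) }
print(firstDraws == secondDraws) // true
```

The pool always holds three tokens here, no matter how the probability is spread across them. That fixed size is the defining property of top-K.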

Top-P Sampling (Nucleus Sampling)

This method considers a variable number of tokens based on a cumulative probability threshold:

// Consider tokens until 90% probability mass is reached
let topPOptions = GenerationOptions(
    sampling: .random(probabilityThreshold: 0.9, seed: nil),
    temperature: 0.7
)

// More conservative nucleus sampling
let conservativeTopPOptions = GenerationOptions(
    sampling: .random(probabilityThreshold: 0.8, seed: 42),
    temperature: 0.6
)

Top-P behavior:

Lower threshold (0.6-0.8): Smaller, more focused token pools
Higher threshold (0.9-0.95): Larger pools, more creativity
Pool size adapts to probability distribution (smaller when spiked, larger when flat)
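
To show how the pool adapts, here is a minimal sketch of nucleus selection in plain Swift. It ranks tokens by probability and keeps adding them until the cumulative probability reaches the threshold. The function name and both distributions are made up for illustration.

```swift
/// Returns the indices of the smallest set of top-ranked tokens whose
/// cumulative probability reaches the threshold.
/// Illustrative only: hypothetical inputs, not Foundation Models internals.
func nucleusPool(_ probabilities: [Double], threshold: Double) -> [Int] {
    let ranked = probabilities.enumerated().sorted { $0.element > $1.element }
    var cumulative = 0.0
    var pool: [Int] = []
    for (index, probability) in ranked {
        pool.append(index)
        cumulative += probability
        if cumulative >= threshold { break }
    }
    return pool
}

// A spiked distribution: one token dominates.
let spiked = [0.92, 0.04, 0.03, 0.01]
// A flatter distribution: probability is spread across the tokens.
let flat = [0.28, 0.26, 0.24, 0.22]

print(nucleusPool(spiked, threshold: 0.9).count) // 1
print(nucleusPool(flat, threshold: 0.9).count)   // 4
```

With the same 0.9 threshold, the spiked distribution yields a pool of one token while the flatter one needs all four. Top-K cannot make that adjustment.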

System Default (Recommended)

Let the model choose the optimal strategy:

let systemDefaultOptions = GenerationOptions(
    sampling: nil,
    temperature: nil
)

Foundation Models vs MLX Swift Parameters

Compared to MLX Swift, Foundation Models simplifies some controls and adds others. The following table compares the parameters:

Foundation Models	MLX Swift Equivalent	Purpose
`temperature` (0.0-1.0)	`temperature` (unlimited)	Controls randomness/creativity
`maximumResponseTokens`	`maxTokens`	Limits response length
`.greedy` sampling	N/A	Deterministic token selection
`.random(top: k)`	`topK`	Top-K sampling
`.random(probabilityThreshold:)`	`topP`	Nucleus sampling
`seed` parameter	N/A	Reproducible randomness
(Not available)	`repetitionPenalty`	Reduces repetitive output
(Not available)	`repetitionContextSize`	Repetition penalty scope

Foundation Models offers more sampling strategies but constrains temperature and lacks repetition penalties. The emphasis is on sensible defaults, with controls available when you need them.

Parameter Tuning Tips

Here are some tips for parameter tuning:

Start with Defaults: Use nil for both temperature and sampling to let the system choose optimal values
Adjust Gradually: If the defaults do not suit your use case, make small temperature adjustments (±0.1-0.2)
Temperature Range: In the latest beta update of the framework, the temperature range is limited to 0.0-1.0 (unlike MLX Swift, which accepts higher values)
Match Task to Temperature: Factual tasks need low temperature (0.1-0.3), creative tasks can use higher values (0.7-1.0)
Consider Context Length: Longer conversations use more of your token budget (query the limit with SystemLanguageModel.default.contextSize), and so do long, detailed instructions. On iOS 26.4 and later, you can measure exactly how many tokens a prompt will consume with SystemLanguageModel.default.tokenUsage(for:) before sending it
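
The context-length tip comes down to simple arithmetic, sketched here in plain Swift. The function name and every number are hypothetical. In a real app the context size and token counts would come from the APIs mentioned in the last tip.

```swift
/// Estimates how many tokens remain for the response once instructions,
/// earlier conversation turns, and the new prompt are accounted for.
/// All inputs are assumed to be measured elsewhere; this only does the math.
func remainingResponseBudget(contextSize: Int,
                             instructionTokens: Int,
                             transcriptTokens: Int,
                             promptTokens: Int) -> Int {
    // Clamp at zero so an overfull context never reports a negative budget.
    max(0, contextSize - instructionTokens - transcriptTokens - promptTokens)
}

// Hypothetical values for a conversation that is a few turns in.
let budget = remainingResponseBudget(contextSize: 4096,
                                     instructionTokens: 300,
                                     transcriptTokens: 1200,
                                     promptTokens: 150)
print(budget) // 2446
```

If the remaining budget is smaller than the maximumResponseTokens you planned to use, that is a signal to shorten the instructions or start a fresh session.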

What’s Next

Understanding generation options gives you precise control over how Foundation Models behaves in your apps. Start with system defaults and adjust based on your specific use case—for many scenarios, the defaults work well without any tuning.

Now that you can control how the model generates responses, the next chapter explores structured generation with schemas. You will learn how to transform unstructured AI responses into type-safe Swift objects, applying the generation controls you just learned to produce reliable, structured data for your apps.