Foundation Models Guide / Chapter 10

Content Tagging and Classification

Foundation Models provides a specialized mode for classifying and tagging content without conversational overhead. While the general-purpose model excels at dialogue, the content tagging model focuses on extracting structured metadata from text in a single pass. You define your classification schema using @Generable types, and the model returns type-safe tags that you can immediately use in your app.

This approach differs from traditional keyword matching or rule-based systems. The model understands context and meaning, so it can classify “I am so done with this” as frustrated even though the word “frustrated” never appears.

I work for a stealth startup that is betting on Foundation Models and Apple’s ecosystem. We are building companion apps, and we experimented extensively with the content tagging model for that work. This chapter builds on that experience.

Prerequisites and Context

This chapter builds directly on the patterns from the structured generation chapter. You should be comfortable with @Generable structs and enums, @Guide descriptions, and how constrained decoding produces type-safe results. The safety concepts here connect to the safety chapter, particularly around handling edge cases where the model might be too conservative or miss important signals.

What You Will Learn

By the end of this chapter, you will be able to:

Use SystemLanguageModel with the .contentTagging use case for classification tasks
Design multi-dimensional classification schemas that extract several tag types simultaneously
Compare content tagging with the general model for your use case
Handle ambiguous and sensitive content with counterexample-aware prompting
Apply content tagging patterns to production scenarios like support tickets, content moderation, and personalization
Tune sampling and temperature to balance stability and variation

I keep a single support-ticket example running through the chapter so you can see how each decision changes real output.

The Content Tagging Model

When you create a SystemLanguageModel with the .contentTagging use case, you get a model optimized for classification rather than conversation:

import FoundationModels

let model = SystemLanguageModel(useCase: .contentTagging)
let session = LanguageModelSession(model: model)

The content tagging model differs from the general-purpose model in three practical ways that shape how you build tagging features. It returns tags instead of conversational text, so you can treat the output as data. It works best with short to medium-length inputs, so longer documents need to be split and classified in parts. It is tuned for extraction and labeling tasks rather than open-ended generation.

Those differences guide the rest of this chapter, which focuses on schemas, guide descriptions, and sampling choices instead of chat flows.
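Because longer documents need to be split before tagging, it can help to see what a splitter might look like. The sketch below is an assumption, not a framework API: chunk(_:maxWords:) is a hypothetical helper, and the word budget is a placeholder you would tune against your own inputs.

```swift
// Hypothetical helper: splits text into chunks of at most `maxWords` words
// so that each chunk can be classified on its own.
func chunk(_ text: String, maxWords: Int) -> [String] {
    precondition(maxWords > 0, "maxWords must be positive")
    // Split on any whitespace, including newlines; empty pieces are dropped.
    let words = text.split(whereSeparator: { $0.isWhitespace })
    var chunks: [String] = []
    var start = 0
    while start < words.count {
        let end = min(start + maxWords, words.count)
        chunks.append(words[start..<end].joined(separator: " "))
        start = end
    }
    return chunks
}

let parts = chunk("one two three four five", maxWords: 2)
// parts == ["one two", "three four", "five"]
```

A fixed word budget can cut a sentence in half. If your tags depend on sentence-level context, splitting on sentence or paragraph boundaries is likely a better fit.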

Choosing Between Content Tagging and General

Use content tagging when you need short, consistent labels for actions, objects, emotions, or topics. The specialized model keeps tags compact and reduces prompt overhead because you can describe constraints with @Guide instead of long instructions.

Use the general model when you need to invent labels, create hashtags, or apply constraints that do not fit simple tag lists. If you use tool calling and want tags, run the general model to interpret tool output, then pass the results into the content tagging model for normalization.

I recommend starting with content tagging when the output is a label, and moving to the general model only when you need free-form text.

When you are unsure, compare both models with the same schema and low-variance options. Measure how often each model produces the same tags across multiple runs and prompts.

You can also specify guardrails when creating the model. .permissiveContentTransformations only relaxes guardrails for String responses, and structured generation behaves the same as .default. Use this mode only when you are transforming text to text:

let model = SystemLanguageModel(
    useCase: .contentTagging,
    guardrails: .permissiveContentTransformations
)

For tagging with @Generable, expect the same guardrail behavior as .default.

If you need permissive transformations with structured output, the safety chapter includes a manual decoding pattern that generates JSON as a String and parses it while still using @Generable for schema guidance.

Classification benefits from stable output. I recommend lowering temperature for tagging tasks, because you want consistency instead of creativity. Leaving temperature as nil lets the system choose a default, so set it explicitly when you need repeatable behavior:

let options = GenerationOptions(temperature: 0.3)

Single-Dimension Classification

The simplest tagging pattern extracts one category from text. Start here to establish a baseline before you add complexity. Apple’s documentation shows a basic example with topics and emotions. Here is a version for a note-taking app:

@Generable
enum NoteCategory: String, CaseIterable {
    case personal
    case work
    case health
    case finance
    case travel
    case learning
    case ideas
    case tasks
}

@Generable
struct NoteCategorization {
    @Guide(description: "The primary category that best describes this note's content")
    let category: NoteCategory
}

Classification uses the same respond(to:generating:) pattern from structured generation:

let model = SystemLanguageModel(useCase: .contentTagging)
let session = LanguageModelSession(model: model)
let options = GenerationOptions(temperature: 0.3)

let result = try await session.respond(
    to: "Remember to book flights to Tokyo for the conference next month",
    generating: NoteCategorization.self,
    options: options
)

print(result.content.category) // .travel or .work depending on model interpretation

Notice that the example input could reasonably be either travel or work. Single-dimension classification forces a choice, which may not capture the full picture. Once you see that limitation, the next step is to capture multiple dimensions in a single pass.

Multi-Dimensional Classification

Real content rarely fits into a single category. A support message might be about billing, express frustration, and require urgent attention, all at once. Multi-dimensional classification captures these orthogonal aspects in a single pass. This is where I saw the biggest product impact in our companion app work.

Consider a customer support system. You want to know what the ticket is about, how the customer feels, and how urgently you should respond:

@Generable
enum TicketTopic: String, CaseIterable {
    case billing
    case technicalIssue
    case featureRequest
    case accountAccess
    case cancellation
    case general
}

@Generable
enum EmotionalTone: String, CaseIterable {
    case neutral
    case frustrated
    case appreciative
    case confused
    case urgent
}

@Generable
enum UrgencyLevel: String, CaseIterable {
    case low
    case medium
    case high
    case critical
}

@Generable
struct SupportTicketClassification {
    @Guide(description: "The primary topic this support request addresses")
    let topic: TicketTopic
    
    @Guide(description: "The emotional tone expressed by the customer")
    let tone: EmotionalTone
    
    @Guide(description: "How urgently this ticket should be addressed based on the customer's language and situation")
    let urgency: UrgencyLevel
    
    @Guide(description: "Whether the customer mentioned previous failed attempts to resolve this issue")
    let hasPriorAttempts: Bool
}

Now a single classification call extracts all four dimensions:

let ticket = """
This is the THIRD time I am reaching out about this. My account has been locked 
for two days now and I have a presentation tomorrow that requires access to my 
files. Your previous support person said it would be fixed within 24 hours. 
That was 48 hours ago.
"""

let options = GenerationOptions(temperature: 0.3)
let classification = try await session.respond(
    to: ticket,
    generating: SupportTicketClassification.self,
    options: options
).content

// classification.topic: .accountAccess
// classification.tone: .frustrated
// classification.urgency: .critical
// classification.hasPriorAttempts: true

The model extracts multiple signals from a single piece of text. You can use these classifications to route tickets, prioritize queues, or trigger escalation workflows. The quality of those outputs depends on how you describe each field.
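To make the routing idea concrete, here is a minimal sketch of how the extracted fields could drive a queue decision. Everything in it is an assumption for illustration: the Queue cases, the route(_:) function, and the escalation rule are hypothetical, and the enums are plain Swift stand-ins for the chapter's @Generable types so the snippet compiles on its own.

```swift
// Plain Swift stand-ins for the chapter's @Generable types.
enum TicketTopic { case billing, technicalIssue, featureRequest, accountAccess, cancellation, general }

enum UrgencyLevel: Int, Comparable {
    case low, medium, high, critical
    static func < (lhs: UrgencyLevel, rhs: UrgencyLevel) -> Bool {
        lhs.rawValue < rhs.rawValue
    }
}

struct Ticket {
    let topic: TicketTopic
    let urgency: UrgencyLevel
    let hasPriorAttempts: Bool
}

// Hypothetical destination queues.
enum Queue { case escalation, billingTeam, accountTeam, standard }

// Illustrative policy: escalate critical tickets and high-urgency repeat
// contacts, then fall back to topic-based routing.
func route(_ ticket: Ticket) -> Queue {
    if ticket.urgency == .critical || (ticket.urgency >= .high && ticket.hasPriorAttempts) {
        return .escalation
    }
    switch ticket.topic {
    case .billing, .cancellation: return .billingTeam
    case .accountAccess: return .accountTeam
    default: return .standard
    }
}

let queue = route(Ticket(topic: .accountAccess, urgency: .critical, hasPriorAttempts: true))
// queue == .escalation
```

Keeping the policy in ordinary code rather than in the prompt means you can change routing rules without re-evaluating the classifier.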

Guiding Classification with Descriptions

Guides are where you earn reliability. I spend more time here than on prompts because the description stays attached to a single field instead of competing with the rest of the prompt.

The @Guide description is your primary tool for influencing how the model interprets each field. Vague descriptions produce inconsistent results. Specific descriptions with examples and boundaries produce reliable classifications. Short guides beat long prompts because they stay scoped to the field.

Compare these two approaches:

// Vague — model has to guess what you mean
@Guide(description: "The urgency level")
let urgency: UrgencyLevel

// Specific — model understands your criteria
@Guide(description: "Urgency based on time pressure and impact: critical if customer mentions deadline within 24 hours or complete inability to work, high if frustrated with repeated issues, medium if normal request with some time pressure, low if general inquiry with no time constraint")
let urgency: UrgencyLevel

The second version gives the model concrete criteria. When the customer mentions “presentation tomorrow,” the model knows this signals critical urgency because you defined what critical means.

For boolean fields, describe both the true and false cases:

@Guide(description: "True if the customer explicitly mentions previous support interactions, tickets, or failed resolution attempts. False if this appears to be their first contact about this issue.")
let hasPriorAttempts: Bool

Handling Ambiguous Content

Some content genuinely belongs to multiple categories or sits on the boundary between classifications. You have several strategies for handling ambiguity.

I use confidence when I need one label but want to signal uncertainty. I use multiple labels when the ambiguity is the point.

Confidence Scoring

Add a confidence field to capture certainty:

@Generable
enum ConfidenceLevel: String, CaseIterable {
    case high
    case medium
    case low
}

@Generable
struct ClassificationWithConfidence {
    @Guide(description: "The most likely topic category")
    let topic: TicketTopic
    
    @Guide(description: "Confidence in the topic classification: high if clear and unambiguous, medium if reasonable but other categories could apply, low if genuinely unclear")
    let confidence: ConfidenceLevel
}

Low confidence results can trigger human review or request clarification from the user.
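One way to act on the confidence field is a small decision function. This is a sketch under stated assumptions: NextStep and nextStep(confidence:isInteractive:) are hypothetical names, ConfidenceLevel is a plain Swift stand-in for the @Generable enum above, and treating medium confidence as good enough to route automatically is a policy choice you should check against your own data.

```swift
// Plain Swift stand-in for the chapter's @Generable ConfidenceLevel.
enum ConfidenceLevel { case high, medium, low }

// Hypothetical follow-up actions.
enum NextStep { case autoRoute, askForClarification, humanReview }

// Illustrative policy: only low confidence interrupts the automatic flow.
// In an interactive context the app can ask the user; otherwise a person reviews it.
func nextStep(confidence: ConfidenceLevel, isInteractive: Bool) -> NextStep {
    switch confidence {
    case .high, .medium:
        return .autoRoute
    case .low:
        return isInteractive ? .askForClarification : .humanReview
    }
}
```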

Multiple Labels

When content legitimately spans categories, allow multiple selections:

@Generable
struct MultiLabelClassification {
    @Guide(description: "All relevant topic categories, ordered by relevance. Include 1-3 categories.", .count(1...3))
    let topics: [TicketTopic]
    
    @Guide(description: "The single most relevant category from the topics list")
    let primaryTopic: TicketTopic
}

If you still need a single routing label, keep a primary category and capture the rest as context.
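Constrained decoding guarantees the types of each field, but it does not guarantee that two fields agree with each other, so primaryTopic may not appear in topics and the array may contain duplicates. The sketch below shows one possible cleanup step. It is an assumption, not part of the framework: MultiLabel is a plain Swift stand-in for MultiLabelClassification, and normalized(_:) is a hypothetical helper.

```swift
// Plain Swift stand-ins for the chapter's @Generable types.
enum TicketTopic: String {
    case billing, technicalIssue, featureRequest, accountAccess, cancellation, general
}

struct MultiLabel {
    let topics: [TicketTopic]
    let primaryTopic: TicketTopic
}

// Hypothetical cleanup: put the primary topic first, drop duplicates while
// keeping the model's order, and cap the list at three entries.
func normalized(_ classification: MultiLabel) -> MultiLabel {
    var seen = Set<TicketTopic>()
    let ordered = ([classification.primaryTopic] + classification.topics)
        .filter { seen.insert($0).inserted }
    return MultiLabel(
        topics: Array(ordered.prefix(3)),
        primaryTopic: classification.primaryTopic
    )
}

let cleaned = normalized(
    MultiLabel(topics: [.billing, .billing, .accountAccess], primaryTopic: .cancellation)
)
// cleaned.topics == [.cancellation, .billing, .accountAccess]
```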

Secondary Classification

For complex content, capture primary and secondary classifications:

@Generable
struct LayeredClassification {
    @Guide(description: "The main topic being discussed")
    let primaryTopic: TicketTopic
    
    @Guide(description: "A secondary topic if the content addresses multiple areas, otherwise the same as primaryTopic")
    let secondaryTopic: TicketTopic
    
    @Guide(description: "Whether the content genuinely spans multiple distinct topics")
    let isMultiTopic: Bool
}

Counterexamples and Disambiguation

When categories are close, counterexamples are more reliable than extra adjectives. The model might misinterpret “My account is dead to me” as a technical issue when the customer means they want to cancel. Counterexamples in your @Guide descriptions help the model distinguish between similar-sounding but different categories.

Consider crisis detection in a mental health app. You need to distinguish between someone expressing suicidal ideation and someone discussing a loved one’s suicide:

@Generable
enum CrisisIndicator: String, CaseIterable {
    case none
    case ambiguous
    case crisis
}

@Generable
struct SafetyClassification {
    @Guide(description: """
        Crisis level based on self-harm indicators:
        - crisis: User expresses intent to harm themselves ("I want to end it", "I am going to kill myself")
        - ambiguous: User expresses hopelessness that might indicate crisis ("what is the point", "I cannot go on")
        - none: No self-harm indicators, including discussions ABOUT suicide that are not personal 
          ("my friend died by suicide", "what does the church teach about suicide")
        
        IMPORTANT: Discussions about others' suicides or academic questions about suicide 
        should be classified as 'none', not 'crisis'.
        """)
    let crisisLevel: CrisisIndicator
}

The description explicitly lists what each category means AND what it does not mean. The counterexample about discussing others’ suicides prevents false positives that could inappropriately alarm users seeking grief support.

Audience-Aware Classification

Once you can label intent and tone, you can adapt the result to who is reading it. A children’s education app needs different classification than a professional productivity tool. You can add audience awareness to your schemas:

@Generable
enum AudienceLevel: String, CaseIterable {
    case beginner
    case intermediate
    case expert
}

@Generable
enum ContentComplexity: String, CaseIterable {
    case simple
    case moderate
    case technical
}

@Generable
struct AudienceAwareClassification {
    @Guide(description: "The expertise level this content assumes: beginner if it explains basic concepts, intermediate if it assumes foundational knowledge, expert if it uses specialized terminology without explanation")
    let audienceLevel: AudienceLevel
    
    @Guide(description: "The complexity of the content itself: simple if straightforward, moderate if requires some thought, technical if involves specialized processes or concepts")
    let complexity: ContentComplexity
    
    @Guide(description: "Whether the content contains jargon or terminology that might confuse newcomers")
    let containsJargon: Bool
}

This classification helps you personalize content delivery. If a user’s reading history suggests beginner level but they submit a query classified as expert-level, you might offer to explain in simpler terms.

Evaluating Classification Accuracy

Experimentation has to become evidence. I recommend starting with 30 to 50 labeled examples and expanding as you find edge cases.

Before deploying content tagging to production, you need to know how well it performs. Create a ground truth dataset with manually labeled examples:

struct LabeledExample {
    let id: String
    let text: String
    let expectedClassification: SupportTicketClassification
}

let groundTruth: [LabeledExample] = [
    LabeledExample(
        id: "ticket-001",
        text: "My payment failed but I was still charged. Need this fixed today.",
        expectedClassification: SupportTicketClassification(
            topic: .billing,
            tone: .frustrated,
            urgency: .high,
            hasPriorAttempts: false
        )
    ),
    LabeledExample(
        id: "ticket-002",
        text: "Hi! Just wondering if you have any plans to add dark mode?",
        expectedClassification: SupportTicketClassification(
            topic: .featureRequest,
            tone: .neutral,
            urgency: .low,
            hasPriorAttempts: false
        )
    ),
    // Add the rest of your labeled set (30 to 50 examples to start), covering edge cases
]

Run batch evaluation and compute per-field accuracy:

struct EvaluationResult {
    let totalExamples: Int
    let topicAccuracy: Double
    let toneAccuracy: Double
    let urgencyAccuracy: Double
    let priorAttemptsAccuracy: Double
    let overallAccuracy: Double  // All fields correct
}

func evaluate(
    examples: [LabeledExample],
    using model: SystemLanguageModel
) async throws -> EvaluationResult {
    let options = GenerationOptions(sampling: .greedy, temperature: 0.1)
    guard !examples.isEmpty else {
        return EvaluationResult(
            totalExamples: 0,
            topicAccuracy: 0,
            toneAccuracy: 0,
            urgencyAccuracy: 0,
            priorAttemptsAccuracy: 0,
            overallAccuracy: 0
        )
    }
    var topicCorrect = 0
    var toneCorrect = 0
    var urgencyCorrect = 0
    var priorAttemptsCorrect = 0
    var allCorrect = 0
    
    for example in examples {
        // Use a fresh session per example so earlier tickets do not pile up
        // in the transcript or influence later predictions.
        let session = LanguageModelSession(model: model)
        let predicted = try await session.respond(
            to: example.text,
            generating: SupportTicketClassification.self,
            options: options
        ).content
        
        let expected = example.expectedClassification
        
        let topicMatch = predicted.topic == expected.topic
        let toneMatch = predicted.tone == expected.tone
        let urgencyMatch = predicted.urgency == expected.urgency
        let priorMatch = predicted.hasPriorAttempts == expected.hasPriorAttempts
        
        if topicMatch { topicCorrect += 1 }
        if toneMatch { toneCorrect += 1 }
        if urgencyMatch { urgencyCorrect += 1 }
        if priorMatch { priorAttemptsCorrect += 1 }
        if topicMatch && toneMatch && urgencyMatch && priorMatch { allCorrect += 1 }
    }
    
    let total = Double(examples.count)
    return EvaluationResult(
        totalExamples: examples.count,
        topicAccuracy: Double(topicCorrect) / total,
        toneAccuracy: Double(toneCorrect) / total,
        urgencyAccuracy: Double(urgencyCorrect) / total,
        priorAttemptsAccuracy: Double(priorAttemptsCorrect) / total,
        overallAccuracy: Double(allCorrect) / total
    )
}

Track accuracy over time as you refine your @Guide descriptions and category definitions. I recommend reaching at least 70 to 80% overall accuracy (all fields correct) before deploying to production, with higher thresholds for safety-critical classifications.
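Accuracy numbers tell you that a field is weak but not why. Counting which expected and predicted labels get confused points at the guide descriptions that need sharper boundaries. The sketch below is a minimal version of that idea; confusionCounts(_:) is a hypothetical helper, and it works on label strings, which you could obtain from the rawValue of the chapter's String-backed enums. The sample pairs are made-up placeholders, not measured results.

```swift
// Hypothetical helper: counts each (expected, predicted) pair that disagrees.
func confusionCounts(_ pairs: [(expected: String, predicted: String)]) -> [String: Int] {
    var counts: [String: Int] = [:]
    for pair in pairs where pair.expected != pair.predicted {
        counts["\(pair.expected) -> \(pair.predicted)", default: 0] += 1
    }
    return counts
}

// Placeholder data for illustration only.
let urgencyPairs: [(expected: String, predicted: String)] = [
    (expected: "high", predicted: "critical"),
    (expected: "high", predicted: "critical"),
    (expected: "low", predicted: "low"),
    (expected: "medium", predicted: "high"),
]
let confusions = confusionCounts(urgencyPairs)
// confusions["high -> critical"] == 2
```

A pair that dominates the counts usually means two categories overlap in the guide text, which is where a counterexample tends to pay off.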

Production Considerations

Caching Classification Results

With .greedy sampling, the same input, schema, and instructions yield the same output for a given version of the on-device model. You can cache classifications to avoid redundant model calls:

actor ClassificationCache {
    private var cache: [String: SupportTicketClassification] = [:]
    
    func classification(for text: String) -> SupportTicketClassification? {
        cache[text]
    }
    
    func store(_ classification: SupportTicketClassification, for text: String) {
        cache[text] = classification
    }
}

For longer-term caching, hash the input text and store results in a local database. Invalidate the cache when you update your classification schema or guide descriptions, and after OS updates, which can ship a new model version.
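A cache key that includes a schema version makes invalidation automatic, because bumping the version stops old entries from matching. The sketch below illustrates that under stated assumptions: cacheKey(for:schemaVersion:) and the version strings are hypothetical, and FNV-1a is used only as a small, dependency-free stand-in. On Apple platforms, CryptoKit's SHA256 is the more collision-resistant choice.

```swift
// FNV-1a, 64-bit: a simple non-cryptographic hash, used here as a stand-in.
func fnv1a64(_ string: String) -> UInt64 {
    var hash: UInt64 = 0xcbf29ce484222325  // FNV offset basis
    for byte in string.utf8 {
        hash ^= UInt64(byte)
        hash = hash &* 0x100000001b3       // FNV prime, with wrapping multiply
    }
    return hash
}

// Hypothetical key builder: combines the schema version and the input text.
func cacheKey(for text: String, schemaVersion: String) -> String {
    // The unit separator keeps ("ab", "c") distinct from ("a", "bc").
    let hash = fnv1a64(schemaVersion + "\u{1F}" + text)
    return String(hash, radix: 16)
}

let key = cacheKey(for: "My account is locked", schemaVersion: "ticket-schema-v1")
```

With a 64-bit non-cryptographic hash, collisions are unlikely but possible, so storing the original text next to the cached result and comparing it on lookup is a cheap safeguard.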

Token Budget Management

Keep instructions short. The content tagging model works best with concise prompts. Move detailed criteria into @Guide descriptions rather than session instructions. On iOS 26.4 and later, you can verify how many tokens your instructions actually consume by calling SystemLanguageModel.default.tokenUsage(for:) before creating the session:

// Prefer short session instructions
let session = LanguageModelSession(
    model: model,
    instructions: "Classify support tickets accurately."
)

// Put detailed criteria in @Guide descriptions
@Guide(description: "Urgency based on: critical = deadline within 24h, high = repeated issues, medium = normal priority, low = general inquiry")
let urgency: UrgencyLevel

Sampling and Temperature Tuning

After the schema is stable, tune sampling to control variation without changing the structure of your outputs.

Tagging works best with stable output. Lower temperature reduces variation, and .greedy sampling always chooses the most likely token for each step. The API does not document a fixed default temperature, so leaving temperature as nil lets the system choose a default. Set it explicitly when you need repeatable tags.

let stableOptions = GenerationOptions(
    sampling: .greedy,
    temperature: 0.1
)

If you want a small amount of variation for ambiguous inputs, keep temperature low and use random sampling with a seed:

let topKOptions = GenerationOptions(
    sampling: .random(top: 20, seed: 42),
    temperature: 0.3
)

let topPOptions = GenerationOptions(
    sampling: .random(probabilityThreshold: 0.9, seed: 42),
    temperature: 0.3
)

A seed improves repeatability but does not guarantee identical output.

For three-axis tuning, adjust sampling mode, temperature, and response length together:

let options = GenerationOptions(
    sampling: .random(probabilityThreshold: 0.9, seed: 42),
    temperature: 0.3,
    maximumResponseTokens: 60
)

let sampleText = "My account is locked and I was charged twice."
let result = try await session.respond(
    to: sampleText,
    generating: SupportTicketClassification.self,
    options: options
).content

Comparison Experiment: General vs Content Tagging

To compare model stability, run the same prompt through both models with low-variance options and count unique outputs:

let taggingModel = SystemLanguageModel(useCase: .contentTagging)
let generalModel = SystemLanguageModel.default
let options = GenerationOptions(sampling: .greedy, temperature: 0.1)

func classify(_ text: String, model: SystemLanguageModel) async throws -> SupportTicketClassification {
    let session = LanguageModelSession(model: model)
    return try await session.respond(
        to: text,
        generating: SupportTicketClassification.self,
        options: options
    ).content
}

Run three to five iterations per model and compare the number of unique classifications. If the general model shows more variation or produces less compact tags, the content tagging model is the better default for production tagging.
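Counting unique classifications needs a value you can put in a Set. One possible approach, sketched below, is to reduce each result to a Hashable fingerprint. Fingerprint and uniqueCount(_:) are hypothetical names, and the sample values are placeholders; in a real experiment you would build each fingerprint from the result of classify(_:model:), for example using the rawValue of each enum field.

```swift
// Hypothetical fingerprint of one classification result.
struct Fingerprint: Hashable {
    let topic: String
    let tone: String
    let urgency: String
    let hasPriorAttempts: Bool
}

// The number of distinct classifications across repeated runs.
func uniqueCount(_ runs: [Fingerprint]) -> Int {
    Set(runs).count
}

// Placeholder values for illustration only.
let runs = [
    Fingerprint(topic: "billing", tone: "frustrated", urgency: "critical", hasPriorAttempts: false),
    Fingerprint(topic: "billing", tone: "frustrated", urgency: "critical", hasPriorAttempts: false),
    Fingerprint(topic: "billing", tone: "neutral", urgency: "medium", hasPriorAttempts: false),
]
let distinct = uniqueCount(runs)
// distinct == 2
```

A count of 1 per prompt means the model was stable across runs; it says nothing about whether that single answer is correct, which is what the ground truth evaluation is for.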

In one run with two support prompts and greedy sampling, both models produced one unique classification per prompt across three runs. The content tagging model labeled the refund prompt as frustrated and critical, while the general model labeled the same prompt as neutral and medium urgency. That is a reminder to validate tone and urgency assumptions against your own data before you ship.

Batch Classification

When classifying multiple items, you can process them in parallel by creating a new session per task:

func classifyBatch(
    tickets: [String],
    model: SystemLanguageModel,
    instructions: String
) async throws -> [SupportTicketClassification] {
    let options = GenerationOptions(temperature: 0.3)
    return try await withThrowingTaskGroup(of: (Int, SupportTicketClassification).self) { group in
        for (index, ticket) in tickets.enumerated() {
            group.addTask {
                let session = LanguageModelSession(
                    model: model,
                    instructions: instructions
                )
                let result = try await session.respond(
                    to: ticket,
                    generating: SupportTicketClassification.self,
                    options: options
                ).content
                return (index, result)
            }
        }
        
        var results = Array(repeating: SupportTicketClassification?.none, count: tickets.count)
        for try await (index, classification) in group {
            results[index] = classification
        }
        
        return results.compactMap { $0 }
    }
}

Be mindful of device resources when running many classifications simultaneously. Even on an iPhone 16 Pro, I limit concurrency in production; on less powerful devices, process sequentially.
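One way to limit concurrency is a sliding window over a task group: start a fixed number of tasks, then start one more each time a task finishes. The sketch below is generic so it does not depend on the framework. mapWithLimit(_:maxConcurrent:transform:) is a hypothetical helper, and a suitable limit is something to measure on your target devices rather than a value I can prescribe.

```swift
// Hypothetical helper: runs `transform` over `inputs` with at most
// `maxConcurrent` tasks in flight, returning results in input order.
func mapWithLimit<Input: Sendable, Output: Sendable>(
    _ inputs: [Input],
    maxConcurrent: Int,
    transform: @escaping @Sendable (Input) async throws -> Output
) async throws -> [Output] {
    precondition(maxConcurrent > 0, "maxConcurrent must be positive")
    return try await withThrowingTaskGroup(of: (Int, Output).self) { group in
        var results = [Output?](repeating: nil, count: inputs.count)
        var nextIndex = 0

        // Fill the initial window.
        while nextIndex < min(maxConcurrent, inputs.count) {
            let index = nextIndex
            let input = inputs[index]
            group.addTask {
                let output = try await transform(input)
                return (index, output)
            }
            nextIndex += 1
        }

        // Each time a task finishes, record its result and start one more.
        while let (index, output) = try await group.next() {
            results[index] = output
            if nextIndex < inputs.count {
                let pendingIndex = nextIndex
                let input = inputs[pendingIndex]
                group.addTask {
                    let output = try await transform(input)
                    return (pendingIndex, output)
                }
                nextIndex += 1
            }
        }
        return results.compactMap { $0 }
    }
}
```

In the batch scenario above, the transform closure would create a session and classify one ticket, and a maxConcurrent of 1 gives the sequential behavior suggested for less powerful devices.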

Error Handling

Classification can fail for various reasons. Handle errors gracefully and provide fallback behavior:

func classifyWithFallback(
    text: String,
    session: LanguageModelSession
) async -> SupportTicketClassification {
    let options = GenerationOptions(temperature: 0.3)
    do {
        let result = try await session.respond(
            to: text,
            generating: SupportTicketClassification.self,
            options: options
        ).content
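        // ClassificationValidator is an app-defined helper (not a framework API)
        // that sanity-checks the result before it is used.
        return ClassificationValidator.validated(result, text: text)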
    } catch LanguageModelSession.GenerationError.guardrailViolation {
        // Content triggered safety guardrails
        // Return a safe default that routes to human review
        return SupportTicketClassification(
            topic: .general,
            tone: .neutral,
            urgency: .high,  // Escalate for human review
            hasPriorAttempts: false
        )
    } catch {
        // Model unavailable or other error
        // Return default that does not make assumptions
        return SupportTicketClassification(
            topic: .general,
            tone: .neutral,
            urgency: .medium,
            hasPriorAttempts: false
        )
    }
}
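The listing above calls ClassificationValidator.validated(_:text:), which is not a Foundation Models API and is not defined in this chapter. The sketch below shows one shape such a helper could take. It is purely an assumption: both rules are illustrative, and the types are plain Swift stand-ins (with var properties) so the snippet compiles on its own.

```swift
// Plain Swift stand-ins for the chapter's @Generable types.
enum TicketTopic { case billing, technicalIssue, featureRequest, accountAccess, cancellation, general }
enum EmotionalTone { case neutral, frustrated, appreciative, confused, urgent }
enum UrgencyLevel { case low, medium, high, critical }

struct SupportTicketClassification: Equatable {
    var topic: TicketTopic
    var tone: EmotionalTone
    var urgency: UrgencyLevel
    var hasPriorAttempts: Bool
}

// Hypothetical validator that applies cross-field sanity rules.
enum ClassificationValidator {
    static func validated(
        _ classification: SupportTicketClassification,
        text: String
    ) -> SupportTicketClassification {
        // Rule 1 (illustrative): blank input carries no signal, so return a neutral default.
        if text.allSatisfy({ $0.isWhitespace }) {
            return SupportTicketClassification(
                topic: .general, tone: .neutral, urgency: .low, hasPriorAttempts: false
            )
        }
        var result = classification
        // Rule 2 (illustrative): a repeat contact should not sit in the lowest urgency tier.
        if result.hasPriorAttempts && result.urgency == .low {
            result.urgency = .medium
        }
        return result
    }
}
```

Rules like these belong in code rather than in the prompt when they encode business policy, because they stay testable and do not consume the model's context.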

Examples

Here are some examples of how you can use content tagging and classification in your apps.

Email Triage

Email is a good first use case because tags map directly to inbox behaviors like prioritization and batching. Automatically categorizing incoming emails helps users focus on what matters:

@Generable
enum EmailPriority: String, CaseIterable {
    case urgent
    case important
    case normal
    case low
}

@Generable
enum EmailCategory: String, CaseIterable {
    case actionRequired
    case informational
    case promotional
    case social
    case automated
}

@Generable
struct EmailClassification {
    @Guide(description: "Priority based on sender importance, deadline mentions, and action requirements")
    let priority: EmailPriority
    
    @Guide(description: "The type of email based on its purpose and expected response")
    let category: EmailCategory
    
    @Guide(description: "Whether a response is expected from the recipient")
    let requiresResponse: Bool
    
    @Guide(description: "Whether the email contains a deadline or time-sensitive request")
    let hasDeadline: Bool
}

Content Moderation

Moderation is where tag precision matters most, so keep outputs conservative and build clear escalation paths. Flag content that may violate community guidelines, such as spam, harassment, misinformation, inappropriate content, or other concerns:

@Generable
enum ModerationFlag: String, CaseIterable {
    case none
    case review
    case remove
}

@Generable
struct ModerationClassification {
    @Guide(description: "Whether the content should be flagged: none if acceptable, review if borderline, remove if the content violates guidelines")
    let flag: ModerationFlag
    
    @Guide(description: "The primary concern if flagged, otherwise 'none'")
    let concernType: ConcernType
    
    @Guide(description: "Confidence in the moderation decision")
    let confidence: ConfidenceLevel
}

@Generable
enum ConcernType: String, CaseIterable {
    case none
    case spam
    case harassment
    case misinformation
    case inappropriateContent
    case other
}

If your moderation input is sensitive and you need permissive transformations, use the manual decoding pattern from the safety chapter, because structured output does not benefit from permissive guardrails.

Learning Platform Personalization

Personalization works best when tags connect directly to the help a learner expects, so classify user questions to adapt teaching style:

@Generable
struct LearnerClassification {
    @Guide(description: "The knowledge level the question suggests: exploring if unfamiliar with basics, learning if knows fundamentals but has gaps, practicing if applying knowledge, mastering if refining understanding")
    let knowledgeLevel: KnowledgeLevel
    
    @Guide(description: "The type of help needed: explanation if asking what/why, guidance if asking how, feedback if sharing work for review, encouragement if expressing frustration")
    let helpType: HelpType
    
    @Guide(description: "Whether the learner expressed confusion or uncertainty")
    let isConfused: Bool
}

@Generable
enum KnowledgeLevel: String, CaseIterable {
    case exploring
    case learning
    case practicing
    case mastering
}

@Generable
enum HelpType: String, CaseIterable {
    case explanation
    case guidance
    case feedback
    case encouragement
}

What’s Next

Content tagging transforms unstructured text into metadata. The classification patterns you learned here, combining multiple tag types in a single schema, handling ambiguity, and tuning sampling, apply to any domain where you need to understand user intent or content characteristics.

Content tagging is also about experimentation and iteration. Start with a simple schema, build a ground truth dataset, measure accuracy, and refine your @Guide descriptions based on where the model struggles. In my startup work, this loop exposed which categories needed clearer definitions and which prompts needed trimming.

With content tagging and classification patterns established, the next chapter covers supported languages and internationalization. You will learn how Foundation Models handles different languages and how to build multilingual AI experiences that work for users worldwide.