Overview
In long-running voice AI conversations, context grows with every exchange. This increases token usage, raises costs, and can eventually hit context window limits. Pipecat includes built-in context summarization that automatically compresses older conversation history while preserving recent messages and important context.

How It Works
Context summarization automatically triggers when either condition is met:

- **Token limit reached**: Context size exceeds `max_context_tokens` (estimated using ~4 characters per token)
- **Message count reached**: Number of new messages exceeds `max_unsummarized_messages`
When either threshold is crossed, summarization proceeds as follows:

- The aggregator sends an `LLMContextSummaryRequestFrame` to the LLM service
- The LLM generates a concise summary of older messages
- The context is reconstructed as `[system_message] + [summary] + [recent_messages]` (see the sketch below)
- Incomplete function call sequences and recent messages are preserved
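For illustration, the reconstruction looks roughly like this. This is a schematic sketch of the message list, not Pipecat's actual internal representation, and the role used for the injected summary message is an assumption:

```python
# Before summarization: system message plus a long conversation history.
context_before = [
    {"role": "system", "content": "You are a helpful voice assistant."},
    # ... many older user/assistant exchanges eligible for summarization ...
    {"role": "user", "content": "So what time does the store open tomorrow?"},
    {"role": "assistant", "content": "It opens at 9 AM."},
]

# After summarization: [system_message] + [summary] + [recent_messages].
# The summary is injected using summary_message_template (default shown).
context_after = [
    {"role": "system", "content": "You are a helpful voice assistant."},
    {"role": "user", "content": "Conversation summary: The caller asked about ..."},  # role is assumed
    {"role": "user", "content": "So what time does the store open tomorrow?"},
    {"role": "assistant", "content": "It opens at 9 AM."},
]
```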
Context summarization is asynchronous and happens in the background without
blocking the pipeline. The system uses request IDs to match summary requests
with results and handles interruptions gracefully.
Enabling Context Summarization
Enable summarization by setting `enable_context_summarization=True` in `LLMAssistantAggregatorParams`:
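A minimal sketch, assuming `llm` is your pipeline's LLM service and `context` is its LLM context; the import path and keyword arguments may differ across Pipecat versions:

```python
from pipecat.processors.aggregators.llm_response import LLMAssistantAggregatorParams

# Enable summarization on the assistant-side aggregator, which owns the
# conversation history that gets compressed.
context_aggregator = llm.create_context_aggregator(
    context,
    assistant_params=LLMAssistantAggregatorParams(
        enable_context_summarization=True,
    ),
)
```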
Customizing Behavior
Use `LLMContextSummarizationConfig` to control when and how summarization occurs:
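For example, a sketch that tightens the trigger thresholds. The import paths and the field used to attach the config to `LLMAssistantAggregatorParams` (shown here as `context_summarization_config`) are assumptions, so check your Pipecat version:

```python
# Import paths are assumptions — adjust for your Pipecat version.
from pipecat.processors.aggregators.llm_response import (
    LLMAssistantAggregatorParams,
    LLMContextSummarizationConfig,
)

summarization_config = LLMContextSummarizationConfig(
    max_context_tokens=6000,        # summarize once the estimated context exceeds 6000 tokens
    target_context_tokens=4000,     # target token count for the generated summary
    max_unsummarized_messages=15,   # or after 15 new messages, whichever comes first
    min_messages_after_summary=6,   # always keep the 6 most recent messages uncompressed
)

context_aggregator = llm.create_context_aggregator(
    context,
    assistant_params=LLMAssistantAggregatorParams(
        enable_context_summarization=True,
        # Hypothetical field name for attaching the config — verify in your version.
        context_summarization_config=summarization_config,
    ),
)
```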
| Parameter | Default | Description |
|---|---|---|
| `max_context_tokens` | 8000 | Maximum context size (in estimated tokens) before triggering summarization |
| `target_context_tokens` | 6000 | Target token count for the generated summary |
| `max_unsummarized_messages` | 20 | Maximum new messages before triggering summarization |
| `min_messages_after_summary` | 4 | Number of recent messages to preserve uncompressed |
| `summarization_prompt` | None | Custom prompt for summary generation (uses built-in default if None) |
| `summary_message_template` | "Conversation summary: {summary}" | Template for formatting the summary when injected into context |
| `llm` | None | Optional separate LLM service for generating summaries (uses pipeline LLM if None) |
| `summarization_timeout` | 120.0 | Maximum time in seconds to wait for summary generation |
What Gets Preserved
Context summarization intelligently preserves:

- **System messages**: The first system message (defining assistant behavior) is always kept
- **Recent messages**: The last N messages stay uncompressed (configured by `min_messages_after_summary`)
- **Function call sequences**: Incomplete function call/result pairs are not split during summarization
Custom Summarization Prompts
You can override the default summarization prompt to control how the LLM generates summaries:
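For example (a sketch; the prompt text is illustrative, and the resulting config is attached to the aggregator params as shown earlier):

```python
from pipecat.processors.aggregators.llm_response import LLMContextSummarizationConfig  # path is an assumption

summarization_config = LLMContextSummarizationConfig(
    summarization_prompt=(
        "Summarize the conversation so far for a voice assistant. "
        "Preserve key facts the caller has shared (names, dates, preferences, "
        "open requests) and any unresolved questions. Omit small talk. "
        "Keep the summary brief and in the third person."
    ),
)
```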
Using a Dedicated LLM for Summarization

For cost optimization, you can route summarization requests to a separate, cheaper/faster LLM while keeping your primary model for conversation:
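A sketch using a smaller OpenAI model for summaries; the service class, model name, and import paths are examples, and any Pipecat LLM service should work:

```python
import os

from pipecat.processors.aggregators.llm_response import LLMContextSummarizationConfig  # path is an assumption
from pipecat.services.openai.llm import OpenAILLMService  # path may differ by version

# A cheaper/faster model used only for generating summaries.
summary_llm = OpenAILLMService(
    api_key=os.getenv("OPENAI_API_KEY"),
    model="gpt-4o-mini",
)

summarization_config = LLMContextSummarizationConfig(
    llm=summary_llm,
)
```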
Customizing Summary Format

Use `summary_message_template` to control how summaries are formatted when injected into context. This is useful for wrapping summaries in custom delimiters (e.g., XML tags) so system prompts can distinguish them from live conversation:
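For example, wrapping the summary in XML-style tags (a sketch):

```python
from pipecat.processors.aggregators.llm_response import LLMContextSummarizationConfig  # path is an assumption

summarization_config = LLMContextSummarizationConfig(
    summary_message_template=(
        "<conversation_summary>{summary}</conversation_summary>"
    ),
)
```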
Use `{summary}` as a placeholder for the generated summary text.
Monitoring Summarization
Use the `on_summary_applied` event handler to track summarization activity and observe compression metrics:
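A sketch of registering the handler on the assistant-side aggregator; the exact registration point and handler signature are assumptions, so adjust to your Pipecat version:

```python
from loguru import logger


@context_aggregator.assistant().event_handler("on_summary_applied")
async def on_summary_applied(aggregator, event):
    # Log the compression metrics described below.
    logger.info(
        f"Summary applied: {event.original_message_count} -> {event.new_message_count} messages "
        f"({event.summarized_message_count} summarized, {event.preserved_message_count} preserved)"
    )
```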
The event provides these fields:

- `original_message_count`: Total messages before summarization
- `new_message_count`: Total messages after summarization
- `summarized_message_count`: Number of messages compressed into the summary
- `preserved_message_count`: Number of recent messages preserved uncompressed