Solving “Too Many Tokens” in Streamlit-Based MCP Servers: A Deep Dive
When you’re building a Model Context Protocol (MCP) server using Streamlit, you may encounter the notorious error: “Too many tokens”. It occurs when the input passed to a large language model (LLM) exceeds the model’s maximum context window (e.g., OpenAI’s GPT-4 or Anthropic’s Claude). It’s a common bottleneck in systems that continuously build up context over time.
In this comprehensive post, we’ll dive into why this error happens and how to fix it using practical strategies including summarization, context pruning, token counting, message prioritization, and more. We’ll also include code snippets that you can use directly in your Streamlit-based MCP apps.
Understanding the Error
Most LLMs have a token limit (e.g., 8K, 16K, 32K tokens). A token can be as small as a single character or as long as a word. When you feed too much context into the model, the combined tokens of your system instructions, conversation history, new user input, and the space reserved for the model's output can exceed this limit.
In MCP systems, where each user interaction adds to a growing conversation history or set of memory artifacts, this is especially easy to hit.
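To make the arithmetic concrete, here is a minimal back-of-the-envelope sketch; the 8,192-token window and the reserve values are illustrative assumptions, not limits of any particular API:

# Illustrative numbers only; substitute your model's real limits.
MODEL_CONTEXT_LIMIT = 8192    # total window for an assumed 8K model
RESERVED_FOR_OUTPUT = 1024    # room left for the model's reply
SYSTEM_PROMPT_TOKENS = 300    # measured with a tokenizer, not guessed

# Whatever remains is all you can spend on history plus the new user message.
available_for_history = MODEL_CONTEXT_LIMIT - RESERVED_FOR_OUTPUT - SYSTEM_PROMPT_TOKENS
print(available_for_history)  # 6868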
Fix #1: Token Counting and Monitoring
Before fixing the problem, measure it. Use a library like tiktoken to count tokens.
📁 Example
import tiktoken

def count_tokens(text, model="gpt-4"):
    enc = tiktoken.encoding_for_model(model)
    return len(enc.encode(text))
Use this to measure each message or the total context length.
Streamlit Integration
import streamlit as st

if 'messages' not in st.session_state:
    st.session_state['messages'] = []

# Count total tokens
total_tokens = sum(count_tokens(msg['content']) for msg in st.session_state['messages'])
st.write(f"Total tokens: {total_tokens}")
Fix #2: Summarization of Prior Messages
If your chat history grows long, summarize previous messages and store them as a single concise message.
✍️ Manual Summarization Trigger
def summarize_conversation(messages, model="gpt-3.5-turbo"):
    from openai import OpenAI
    client = OpenAI()
    content = "\n".join([f"{msg['role']}: {msg['content']}" for msg in messages])
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "Summarize the following conversation:"},
            {"role": "user", "content": content}
        ]
    )
    return response.choices[0].message.content
Use a button to summarize old messages:
if st.button("Summarize conversation"):
    summary = summarize_conversation(st.session_state['messages'][:-5])  # keep recent messages
    st.session_state['messages'] = [{"role": "system", "content": summary}] + st.session_state['messages'][-5:]
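If you would rather not depend on someone clicking a button, a variant is to trigger the same summarization automatically once the history crosses a token threshold. The 6,000-token cutoff and the choice to keep the last five messages are assumptions you should tune:

SUMMARIZE_THRESHOLD = 6000  # assumed cutoff; tune for your model's context window

history = st.session_state['messages']
if len(history) > 5 and sum(count_tokens(m['content']) for m in history) > SUMMARIZE_THRESHOLD:
    summary = summarize_conversation(history[:-5])  # condense everything except the last 5 messages
    st.session_state['messages'] = [{"role": "system", "content": summary}] + history[-5:]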
Fix #3: Context Pruning (Last-N Messages)
A fast fix: limit context to the last N interactions.
MAX_MESSAGES = 10
context = st.session_state['messages'][-MAX_MESSAGES:]
If the total tokens still exceed the budget, fall back to more aggressive pruning or summarization.
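One way to express that fallback is to walk backwards from the newest message and stop once a token budget is spent. The 4,000-token budget here is an arbitrary assumption:

def prune_to_budget(messages, budget=4000):
    # Keep messages from newest to oldest until the budget is exhausted.
    kept, used = [], 0
    for msg in reversed(messages):
        cost = count_tokens(msg['content'])
        if used + cost > budget:
            break
        kept.append(msg)
        used += cost
    return list(reversed(kept))  # restore chronological order

context = prune_to_budget(st.session_state['messages'])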
Fix #4: Message Prioritization
Not all messages are equally important. Prioritize based on:
- Recency
- Relevance (e.g., does it include an instruction?)
- Role (e.g., system > assistant > user)
Prioritize Using Metadata
def get_priority_score(msg):
    score = 0
    if msg['role'] == 'system': score += 2
    if 'important' in msg.get('metadata', {}): score += 2
    if msg['role'] == 'user': score += 1
    return score

pruned = sorted(st.session_state['messages'], key=get_priority_score, reverse=True)
context = pruned[:10]  # or as many as tokens allow
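One caveat: sorting by score scrambles the chronological order that chat models expect. A sketch that keeps the highest-scoring messages within a token budget and then restores their original order could look like this (the 4,000-token budget is again an assumption):

def select_by_priority(messages, budget=4000):
    # Rank by score, greedily keep messages until the budget is spent,
    # then put the survivors back into conversational order.
    ranked = sorted(enumerate(messages), key=lambda pair: get_priority_score(pair[1]), reverse=True)
    kept, used = [], 0
    for idx, msg in ranked:
        cost = count_tokens(msg['content'])
        if used + cost <= budget:
            kept.append((idx, msg))
            used += cost
    return [msg for idx, msg in sorted(kept, key=lambda pair: pair[0])]

context = select_by_priority(st.session_state['messages'])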
Fix #5: Use External Memory Store
For long-term memory, move older context to an external memory store like a database or vector store. Retrieve relevant chunks using embeddings.
🔗 Example Using FAISS
from sentence_transformers import SentenceTransformer
import faiss
import numpy as np

model = SentenceTransformer('all-MiniLM-L6-v2')
index = faiss.IndexFlatL2(384)  # 384 = embedding size for this model

# Embed the stored conversation and add it to the index
texts = [msg['content'] for msg in st.session_state['messages']]
embeddings = model.encode(texts)
index.add(np.array(embeddings))

# Later, retrieve the top-k chunks most similar to the current user query (user_input)
query_embedding = model.encode([user_input])[0]
D, I = index.search(np.array([query_embedding]), k=5)
relevant_texts = [texts[i] for i in I[0]]
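The retrieved chunks still have to be folded back into the prompt. A minimal sketch, assuming you want only the retrieved snippets plus the five most recent turns:

# Combine retrieved snippets with just the most recent turns.
retrieved_block = "\n".join(relevant_texts)
context = (
    [{"role": "system", "content": f"Relevant earlier context:\n{retrieved_block}"}]
    + st.session_state['messages'][-5:]
)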
Fix #6: Chunking Long Text Inputs
If the user pastes a large document, chunk it before sending.
📝 Chunk Utility
def chunk_text(text, max_tokens=500):
    words = text.split()
    chunks = []
    current = []
    for word in words:
        current.append(word)
        if count_tokens(" ".join(current)) > max_tokens:
            current.pop()
            chunks.append(" ".join(current))
            current = [word]
    if current:
        chunks.append(" ".join(current))
    return chunks
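A common way to use this is a map-reduce pattern: summarize each chunk separately, then continue with the much shorter summaries. The snippet below reuses summarize_conversation from Fix #2 purely for illustration, and long_document stands in for whatever text the user pasted:

chunks = chunk_text(long_document, max_tokens=500)

# "Map" step: condense each chunk independently.
chunk_summaries = [
    summarize_conversation([{"role": "user", "content": chunk}])
    for chunk in chunks
]

# "Reduce" step: the joined summaries are far smaller than the original document.
condensed = "\n".join(chunk_summaries)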
Fix #7: Use Models With Larger Context Windows
Sometimes, the best fix is architectural. Switch to a model with a much larger context window, such as:
- GPT-4 Turbo (128K tokens)
- Claude 2 (100K tokens) or Claude 3 (200K tokens)
- Gemini 1.5 Pro (up to 1M tokens)
Make sure your Streamlit code can dynamically select the model:
model = st.selectbox("Choose a model", ["gpt-3.5-turbo", "gpt-4", "gpt-4-turbo"])
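You can then tie the token budget to whichever model is selected so your pruning and progress-bar logic adapt automatically. The context sizes below are approximate and worth double-checking against your provider's current documentation:

# Approximate context windows; verify against current provider docs.
CONTEXT_LIMITS = {
    "gpt-3.5-turbo": 16_385,
    "gpt-4": 8_192,
    "gpt-4-turbo": 128_000,
}
TOKEN_LIMIT = CONTEXT_LIMITS.get(model, 8_000)  # conservative fallback for unknown models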
Bonus: Visualization for Debugging
Use Streamlit’s visual tools to understand what’s going into your model.
with st.expander("View Raw Prompt"):
    st.code("\n".join([f"{m['role']}: {m['content']}" for m in context]))
Also display token budget:
TOKEN_LIMIT = 8000  # or reuse the per-model CONTEXT_LIMITS lookup from Fix #7
context_tokens = sum(count_tokens(m['content']) for m in context)
st.progress(min(context_tokens / TOKEN_LIMIT, 1.0))  # st.progress expects a value between 0.0 and 1.0
st.write(f"{context_tokens} / {TOKEN_LIMIT} tokens used")
Wrap Up
Building an MCP server in Streamlit gives you powerful interactivity and rapid prototyping, but handling token overflow is a critical technical challenge. By implementing smart strategies like summarization, pruning, external memory, and better model choices, you can keep your system efficient and responsive.
Always start with visibility: track your tokens, inspect context, and allow debugging. Then apply the right combination of strategies based on your specific app’s complexity, user behavior, and model capabilities.