The AI Agent's Best Tool Is Fewer Tools

The Context
Everyone's shipping chatbots. Most of them are wrappers around a chat API with a system prompt and a knowledge base. You ask a question, they answer from their training data, maybe they do a vector search over your docs. None of it is specific to your app's actual data. The agent can tell you about your product in generic terms; it can't tell you which of your users are currently free on Saturday.
The pitch for RSVPed was supposed to be different. Event discovery is the app. Natural-language search across a live Postgres database. "Tech meetups this weekend." "What's on like TechCon?" "Free things near me tomorrow night." The assistant — called Stir — would query the same tables the rest of the app queries, through typed tools, not prose summaries. When a user says "show me events like TechCon", the agent doesn't hallucinate a list; it calls getSimilarEvents({ eventId }), which runs the same Prisma query the "Similar Events" section on the event page runs.
That part was easy. Wire up the Vercel AI SDK, define a handful of tool functions that call Prisma, give Claude the tool list, stream the response. Done in an afternoon. The problem showed up in the logs.
The Problem
A capable model with ten tools burns through its step budget before it can answer anything.
The first implementation of Stir had ten Prisma-backed tools: searchEvents, searchCommunities, getEventDetails, getCategories, getUserProfile, getUserRsvps, getUserCommunities, getFriendsAttending, getTrending, getSimilarEvents. Great coverage. Awful behavior.
Someone would type "what's trending" and the model would:
- Call searchCommunities (it's looking for communities trending? maybe?)
- Call getCategories (does trending mean a category?)
- Call searchEvents with query="trending" (finally)
- Sometimes give up because stepCountIs(5) tripped
For a single-word query. The issue wasn't the tools — each one did exactly what it said on the tin. The issue was that a generalist model staring at ten possible tools will try too many of them. Extra options cost latency, tokens, and occasionally correctness. The first generated tokens were spent deliberating; the user waited three seconds before the model even committed to a tool.
Meanwhile, the easy queries didn't need the LLM at all. "trending", "popular", "hi", "help" — you can classify those in the JavaScript layer and save the 2-second round trip entirely. An agent architecture that treats a "hi" and a "compare TechCon vs DevConf over wine-and-jazz events next weekend" as the same problem is paying the cost of the second one on every query.
A related problem: anonymous users. Stir is public. Half the traffic doesn't have a session. The getUserProfile and getFriendsAttending tools are meaningless for anon users, but the model doesn't know that — it happily calls them and gets 401 errors back, which it then tries to reason about. Per-tool auth checks fix this but add noise to every tool implementation.
A third problem, more existential: agent latency compounds. Classification is ~300ms, each tool call is ~50-400ms of Prisma plus round-trip, the final synthesis stream is 1-3s. Do one classifier call, three tool calls, and a synthesis stream, and you're at 4-5 seconds for the first visible token. For event discovery, which is an impatient use case, that's the edge of acceptable.
The architecture needed to do three things at once: remove unnecessary LLM calls entirely, cap the cost of the ones that remained, and keep graceful degradation at every layer in case any of the LLM pieces failed.
The Solution
The architecture is built around one idea: don't ask the model to decide things a smaller model or a lookup table can decide.
Three layers of progressive degradation
The request flows through three gates, each cheaper than the last:
- Short-circuit map (no LLM). Single-word queries hit a static map.
- Classifier (Haiku, ~300ms, 3s timeout). Structured-output intent classification.
- Main stream (Sonnet, quality). The classified intent drives tool scoping.
Here's the short-circuit:
```typescript
export const SHORT_CIRCUIT_PATTERNS: Record<string, Intent> = {
  trending: 'search',
  popular: 'search',
  new: 'search',
  latest: 'search',
  upcoming: 'search',
  help: 'general',
  hi: 'general',
  hello: 'general',
  hey: 'general',
  thanks: 'general',
  thank: 'general',
}

// In classifier.ts
const words = trimmed.toLowerCase().split(/\s+/)
if (words.length <= 2) {
  const match = SHORT_CIRCUIT_PATTERNS[words[0]]
  if (match) return { intent: match, reasoning: `Short-circuit: "${words[0]}"` }
}
```
A user who types "trending" gets routed straight to search intent without any LLM call. A user who types "hi" gets routed to general. Zero-latency classification for the top of the distribution.
If the short-circuit misses, the classifier runs. Haiku with a structured-output schema, wrapped in a 3-second race:
```typescript
const result = await Promise.race([
  generateText({
    model: getModel('fast'), // Haiku
    output: Output.object({ schema: intentSchema }),
    system: CLASSIFIER_SYSTEM_PROMPT,
    prompt: truncated,
  }),
  new Promise<never>((_, reject) =>
    setTimeout(() => reject(new Error('Classification timeout')), 3000)
  ),
])
```
If Haiku is having a bad day, we don't block on it forever — we degrade to general intent and keep going. The SDK's Output.object enforces the Zod schema on the response; if Haiku tries to invent an intent outside the schema, it's rejected upstream and we fall back.
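The degrade-on-failure wrapper can be sketched as a small standalone helper. This is a hedged sketch, not the production code — classifyWithFallback is a hypothetical name, and the real logic lives in classifier.ts alongside the short-circuit map:

```typescript
type Intent = 'search' | 'recommend' | 'detail' | 'compare' | 'general'

// Hypothetical helper: race the classifier against a timeout, and on ANY
// failure (timeout, schema rejection, network) degrade to 'general' intent
// instead of failing the request.
async function classifyWithFallback(
  classify: () => Promise<{ intent: Intent }>,
  timeoutMs = 3000
): Promise<{ intent: Intent; degraded: boolean }> {
  try {
    const result = await Promise.race([
      classify(),
      new Promise<never>((_, reject) =>
        setTimeout(() => reject(new Error('Classification timeout')), timeoutMs)
      ),
    ])
    return { intent: result.intent, degraded: false }
  } catch {
    // Degrade rather than block: the main stream still runs.
    return { intent: 'general', degraded: true }
  }
}
```

The caller never sees an error; it only sees an intent, possibly flagged as degraded for logging.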
Intent-scoped tool maps
Each intent exposes only the tools that matter:
```typescript
export const INTENT_TOOL_MAP: Record<Intent, string[]> = {
  search: ['searchEvents', 'searchCommunities', 'getCategories', 'getEventDetails'],
  recommend: ['searchEvents', 'getUserProfile', 'getUserRsvps', 'getUserCommunities', 'getTrending', 'getCategories'],
  detail: ['getEventDetails', 'getFriendsAttending', 'getSimilarEvents', 'searchEvents'],
  compare: ['getEventDetails', 'searchEvents', 'getFriendsAttending', 'getSimilarEvents'],
  general: ['searchEvents', 'searchCommunities', 'getEventDetails', 'getCategories'],
}
```
When a user asks "tell me about TechCon," the main model only sees 4 tools, not 10. It doesn't have to consider whether this is a trending query or a user-profile question. Intent was resolved upstream, cheaply. The model's first-token latency drops noticeably; the "deliberating" phase goes from 3-4 turns of potential-tool-evaluation to 1-2.
The scoping happens right before the streamText call:
```typescript
const activeToolNames = INTENT_TOOL_MAP[classification.intent] ?? Object.keys(ALL_TOOLS)

// Anonymous users don't get user-context tools
const filteredToolNames = userId
  ? activeToolNames
  : activeToolNames.filter((name) => !USER_CONTEXT_TOOLS.includes(name))

const activeTools = Object.fromEntries(
  filteredToolNames
    .filter((name): name is keyof typeof ALL_TOOLS => name in ALL_TOOLS)
    .map((name) => [name, ALL_TOOLS[name]])
)
```
Anonymous filtering happens in the same pass. No per-tool auth check, no "am I allowed to call this?" inside each tool implementation — just strip them from the visible set before the model ever sees them. One filter, ten cleaner tools.
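The filter itself is small enough to show self-contained. The exact contents of USER_CONTEXT_TOOLS aren't spelled out in the code above, so this list is an assumption based on the tools the post names as user-specific:

```typescript
// Assumed contents of USER_CONTEXT_TOOLS — the post names getUserProfile and
// getFriendsAttending as meaningless for anonymous sessions.
const USER_CONTEXT_TOOLS = [
  'getUserProfile',
  'getUserRsvps',
  'getUserCommunities',
  'getFriendsAttending',
]

// One pass: intent scoping already happened; this strips user-context tools
// from the visible set when there is no session.
function scopeTools(intentTools: string[], userId: string | null): string[] {
  return userId
    ? intentTools
    : intentTools.filter((name) => !USER_CONTEXT_TOOLS.includes(name))
}
```

An anonymous "recommend" request thus never exposes getUserRsvps to the model, so there is no 401 for it to reason about.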
Tiered models
Two model tiers, one centralized config, one place to update during migrations:
```typescript
export const MODEL_OPTIONS = {
  fast: { id: 'claude-haiku-4-5-20251001', label: 'Claude Haiku 4.5' },
  quality: { id: 'claude-sonnet-4-20250514', label: 'Claude Sonnet 4' },
} as const

export const getModel = (tier: ModelTier = 'quality') => {
  return anthropic(MODEL_OPTIONS[tier].id)
}
```
Haiku runs the classifier, the follow-up suggestions endpoint (3 short continuations after each response), and any other low-stakes short-output task. Sonnet streams the main conversation. Cost per conversation averages ~$0.003 — most of which is the Sonnet synthesis stream, because tool results are comparatively small.
When Anthropic releases a new model, I change two lines. This is not hypothetical — we migrated from Haiku 3.5 to 4.5 in November 2025 with exactly those two edits, the kind of ergonomics you only notice when you've had to live without it.
Context enrichment before the stream starts
Before streamText is called, buildSystemPrompt does two lookups:
- Page context. If the user is on /events/{slug} or /communities/{slug}, Prisma fetches the entity and splices it into the system prompt. When the user says "this event," the model already knows what "this" means — no round trip to ask, no tool call to resolve it.
- User profile. If logged in, fetch interests, profession, recent RSVPs, preferred categories. Inject as a ## User Profile block. The model sees the user's context before the first token streams.
```typescript
const user = await prisma.user.findUnique({
  where: { id: userId },
  select: {
    name: true,
    interests: true,
    profession: true,
    industry: true,
    location: { select: { name: true } },
    categoryInterests: { select: { category: { select: { name: true } } } },
    rsvps: {
      where: { status: 'CONFIRMED' },
      select: {
        event: {
          select: {
            title: true,
            categories: { select: { category: { select: { name: true } } } },
          },
        },
      },
      orderBy: { createdAt: 'desc' },
      take: 5,
    },
  },
})

const userLines = ['## User Profile']
if (user.location?.name) userLines.push(`Location: ${user.location.name}`)
if (user.interests.length > 0) userLines.push(`Interests: ${user.interests.join(', ')}`)
if (user.rsvps.length > 0) userLines.push(`Recent RSVPs: ${user.rsvps.map(r => r.event.title).join(', ')}`)
userLines.push('Use this profile to personalize. Do NOT ask the user for info you already have.')
```
Both enrichment lookups are wrapped in try/catch with a fallback to the base prompt — if the enrichment fetches fail (Prisma timeout, stale cache, anything), we still respond, just with less personalization. The stream never fails because enrichment failed.
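The failure handling has one shape regardless of which lookup is failing. A minimal sketch, assuming a hypothetical withEnrichment helper (the real buildSystemPrompt does two lookups, but the fallback logic is the same):

```typescript
// Hypothetical helper: try to append an enrichment block to the base prompt;
// on any failure, return the base prompt unchanged so the stream still runs.
async function withEnrichment(
  basePrompt: string,
  fetchBlock: () => Promise<string>
): Promise<string> {
  try {
    const block = await fetchBlock()
    return `${basePrompt}\n\n${block}`
  } catch {
    // Prisma timeout, stale cache, anything: answer anyway, less personalized.
    return basePrompt
  }
}
```

The invariant is that enrichment can only add context, never remove the ability to respond.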
Tools are Prisma, straight through
Each tool is a ~30-line file: inline Zod inputSchema, execute calls Prisma with the session-aware ctx, returns a shaped result. The AI SDK handles the tool-call protocol:
```typescript
// lib/ai/agent/tools/searchEvents.ts
export const searchEvents = tool({
  description: 'Search events by keyword, category, date range, or location. Pass an empty query to get popular events.',
  inputSchema: z.object({
    query: z.string().describe('Keywords to match in title/description, or empty for trending'),
    city: z.string().optional(),
    categoryId: z.string().optional(),
    startDate: z.string().datetime().optional(),
    limit: z.number().int().min(1).max(20).default(10),
  }),
  execute: async ({ query, city, categoryId, startDate, limit }) => {
    return prisma.event.findMany({
      where: {
        isPublished: true,
        deletedAt: null,
        ...(query
          ? {
              OR: [
                { title: { contains: query, mode: 'insensitive' } },
                { description: { contains: query, mode: 'insensitive' } },
              ],
            }
          : {}),
        ...(city ? { location: { slug: city } } : {}),
        ...(categoryId ? { categories: { some: { categoryId } } } : {}),
        ...(startDate ? { startDate: { gte: new Date(startDate) } } : {}),
      },
      include: { location: true, _count: { select: { rsvps: true } } },
      orderBy: query ? undefined : [{ rsvps: { _count: 'desc' } }, { startDate: 'asc' }],
      take: limit,
    })
  },
})
```
No vector DB, no RAG, no embedding pipeline. The model queries the production tables with typed parameters. When the schema changes, the tool changes; when the tool changes, the agent's capability changes. No drift. The inputSchema.describe() strings become part of the tool description the model sees — documentation and schema are the same artifact.
Observability falls out of the SDK
The AI SDK's onStepFinish and onFinish callbacks give you structured access to every tool call, every step, every token count:
```typescript
onStepFinish: ({ usage, toolResults }) => {
  for (const toolResult of toolResults ?? []) {
    logToolCall({
      toolName: toolResult.toolName,
      args: toolResult.input as Record<string, unknown>,
      resultCount: Array.isArray(toolResult.output) ? toolResult.output.length : undefined,
      error: extractError(toolResult.output),
    })
  }
  totalTokens += usage?.totalTokens ?? 0
  logStepComplete({
    stepIndex,
    inputTokens: usage?.inputTokens ?? 0,
    outputTokens: usage?.outputTokens ?? 0,
  })
  stepIndex++
},
onFinish: () => {
  logConversationComplete({
    totalSteps: stepIndex,
    totalTokens,
    durationMs: Date.now() - conversationStart,
    userId,
  })
},
```
Every conversation gets a correlated log line per step plus a summary at finish (total steps, total tokens, duration, userId). No separate observability tool, no wrapping layer — just callbacks. Vercel logs show you the full trace of "user said X → classifier returned Y → model called searchEvents({city: 'sf'}) → got 10 results in 83ms → model called getEventDetails({eventId: '...'}) → streamed response" with timings.
Rate limiting without Redis
20 requests/hour for authenticated users, 5/hour for anon. Kept in a process-local Map<string, number[]> with auto-pruning at 1000 entries (LRU-ish). No Redis, no Upstash, no external dependency.
```typescript
function checkRateLimit(key: string, limit: number): boolean {
  const now = Date.now()
  const windowStart = now - RATE_LIMIT.windowMs
  const timestamps = (rateLimitMap.get(key) ?? []).filter((t) => t > windowStart)
  if (timestamps.length >= limit) return false
  timestamps.push(now)
  rateLimitMap.set(key, timestamps)
  if (rateLimitMap.size > RATE_LIMIT.pruneThreshold) pruneOldEntries()
  return true
}
```
If Stir ever scales past one Vercel function instance, it gets promoted to Redis. For now, local is fine, it's zero infrastructure, and it survives a 10× traffic spike without paging anyone.
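pruneOldEntries isn't shown above. A plausible implementation over the same Map, sketched here as an assumption rather than the production code: walk the map and drop every key whose timestamps all fall outside the current window.

```typescript
// Assumed config and store, matching the shapes used by checkRateLimit.
const RATE_LIMIT = { windowMs: 60 * 60 * 1000, pruneThreshold: 1000 }
const rateLimitMap = new Map<string, number[]>()

// Sketch of the pruning pass: keep only in-window timestamps, and delete
// keys that end up empty. Deleting during Map iteration is safe in JS.
function pruneOldEntries(): void {
  const windowStart = Date.now() - RATE_LIMIT.windowMs
  for (const [key, timestamps] of rateLimitMap) {
    const live = timestamps.filter((t) => t > windowStart)
    if (live.length === 0) rateLimitMap.delete(key)
    else rateLimitMap.set(key, live)
  }
}
```

Because the map is bounded by the prune threshold and each entry holds at most `limit` timestamps, worst-case memory stays small even under abuse.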
What I considered and rejected
- Let the model figure out which tools to use. This is the default agent design and it's the problem I started with. A capable generalist with too many options wastes steps on deliberation.
- One tool per intent. Collapsing searchEvents and searchCommunities into one "search" tool that takes a type parameter. Cleaner in theory, but the two have very different result shapes and different downstream rendering. Separate tools keep the system prompt's rendering instructions simple.
- Embeddings + vector search. Considered for getSimilarEvents. Rejected: Prisma + category overlap + location scoring already gets the same answers with zero new infrastructure, zero embedding drift, and zero cost.
- Streaming the classifier too. The 300ms response is fast enough that streaming adds no user-perceived win. Blocking is simpler.
- Caching classifier results. Keys are user queries, which are high-cardinality. Not enough repeats to justify the cache.
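The rejected-embeddings point is concrete enough to sketch. With a hypothetical EventLite shape, category overlap plus a same-location bonus is a few lines of scoring — the names and weights here are illustrative, not the production ones:

```typescript
// Hypothetical minimal event shape for similarity scoring.
interface EventLite {
  id: string
  categoryIds: string[]
  locationId: string
}

// Illustrative weights: shared categories dominate, same location breaks ties.
function similarityScore(a: EventLite, b: EventLite): number {
  const overlap = a.categoryIds.filter((c) => b.categoryIds.includes(c)).length
  const sameLocation = a.locationId === b.locationId ? 1 : 0
  return overlap * 2 + sameLocation
}

// Rank the pool by score, excluding the target itself and zero-score events.
function getSimilar(target: EventLite, pool: EventLite[], limit = 5): EventLite[] {
  return pool
    .filter((e) => e.id !== target.id)
    .map((e) => ({ e, score: similarityScore(target, e) }))
    .filter(({ score }) => score > 0)
    .sort((x, y) => y.score - x.score)
    .slice(0, limit)
    .map(({ e }) => e)
}
```

No index to build, no embeddings to refresh when an event is edited — the scoring reads whatever the database currently says.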
The Impact
- <100ms for short-circuit queries (zero LLM calls)
- ~300ms of Haiku + 1-3s of Sonnet = 1.3-3.3s for classified queries
- $0.003 average cost per conversation
- 10 typed Prisma tools querying the real database, not a vector store or generic API
- Tool scoping cuts the model's deliberation by 50-80% on specific intents (search no longer considers user-profile tools; detail no longer considers trending)
- Graceful degradation at every layer — classifier timeout, tool failure, missing page context, anonymous session — none of them break the stream
- Zero per-tool auth checks in tool implementations — scoping handles it
- Zero external observability tools — AI SDK callbacks + Vercel logs cover the need
Closing
The lesson that matters most here isn't about AI. It's about decomposition. A chatbot that does everything is a generalist staring at a wall of options. A chatbot with an intent router is a receptionist pointing at the right door.
You don't need a bigger model. You need a smaller one in front of it. And a static map in front of that one. Each gate does less work than the next; each is cheaper and faster; each has a clear failure mode that degrades to the gate below. The model at the end of the chain is capable, which is what you wanted — but by the time it's invoked, most of the decisions have already been made by things cheaper than it.
That's the architecture for agents in production. Fewer tools, better routed. The model's job is synthesis, not triage.





