Faker Gives You Fake Data. That's The Problem.

Technologies
Next.js · TypeScript · Prisma ORM · PostgreSQL · Vercel AI SDK · Claude · Zod · LLMs · Batch API · Performance Optimization

The Context

Every event platform demo looks the same, and every one of them is lying.

You scroll the events page, and it's populated. You open a user profile, and it has friends. You click into a community, and it has members. But spend ten seconds actually looking — who is friends with whom, who RSVPed to what — and the whole thing disintegrates. A payroll engineer in Berlin is friends with a food blogger in Lisbon. A HIGH-spending CTO bought the free tier. The "Urban Gardening Club" has sixty members, three of whom are crypto traders in Dubai. The dates don't line up, the venues don't make sense, and the recommendations page surfaces matches that are funny in a way recommendations pages should not be funny.

This is what Faker gets you. It fills every column, so the page renders. But there's no story linking the rows together. For an event discovery app where the entire pitch is "find things you'd actually care about", a demo that looks plastic is worse than no demo. The whole point of RSVPed is "the recommendations are sharp because they know you"; if a prospective user opens the demo and sees a 58-year-old accountant in Mumbai friends with a DJ in São Paulo, the pitch is dead before it started.

So the seed had to be better than Faker. Not ten percent better — categorically different. And doing it in a way that's reproducible (schema changes a lot in early-stage projects), cheap (I was paying for it out of pocket), and resumable (LLM passes fail, and when they do you don't want to rerun the ones that succeeded).

The Problem

Random generation breaks three ways, all of them ugly:

  • Graph incoherence. Users, events, and communities all exist in isolation. Friendships are random edges. RSVPs are random assignments. There's no reason this user RSVPed to this event other than Math.random() picked them. The social graph is technically connected but semantically nonsense.
  • Economic incoherence. Free events and $300 VIP tickets both have uniform attendance distributions. No signal in the data about who buys what. A user with spendingPower: LOW has VIP tickets; a user with HIGH bought the free tier. The dataset encodes no domain logic.
  • Cultural incoherence. Every city looks the same. Lisbon and Tokyo both have "Tech Meetup #47" and "Urban Gardening Club" with the same descriptions. The browse-by-city feature, which is core to the UX, reveals the illusion immediately.

You can paper over some of this with rules — "pick friendships within the same city" — but the moment you hand-write one rule, you've accepted that the generator needs to encode domain logic. At which point you might as well encode all of it. Which is a lot of rules. What looks like "let's seed some friendships" turns into "let's encode the same compatibility logic the recommendations engine uses, twice, in two places, in two programming styles." That's how schema drift is born.

And there's a second problem nobody talks about: LLM-generated data is expensive if you do it naively. Generate 620 users one at a time with the real-time Claude API, each with a few retries, and you're looking at an hour of wall time and a non-trivial bill. Run that every time the schema changes and you'll stop running it, which means by the time you actually demo the app the data is stale relative to the current tables. The pipeline has to be cheap enough to run weekly.

A third problem, more subtle: LLMs are imprecise. Ask Claude to generate a community in Lisbon and you might get { city: "Lisboa" } or { city: "Lisbon, Portugal" } or { city: "lisbon" }. Downstream code has to fuzzy-match those against your canonical slug list, which fails in creative ways the moment you add a new city. The relational keys need to be constrained at generation time, not sanitized after.

The Solution

The pipeline is three stages: generate, process, seed. Each is independently runnable and resumable. If the seed crashes at stage 7 of 10, yarn seed:run picks up at stage 7. If the generation batch for users succeeds but the one for venues times out, you rerun only venues. No rm -rf restarts.

Stage 1 — Three conditional LLM passes, bundled into the Batch API

The LLM runs three times, and the order matters. Each pass conditions the next:

Pass 1 — Communities. Input: 20 categories × 69 city slugs. Output: ~420 communities, each with 2-4 nested events. Each batch request gets a subset of locations and categories. The LLM generates communities authentic to their city — "Lisbon Underground Wine Collective", not "Tech Hub #47". "Shibuya Late-Night Ramen Society", not "Food Community #12". Each community carries a homeLocation slug, enforced via z.enum() to valid values:

const communitySchema = z.object({
  name: z.string(),
  description: z.string(),
  focusArea: z.string(),
  categories: z.array(z.enum(CATEGORY_SLUGS)).min(1),
  homeLocation: z.enum(LOCATION_SLUGS),
  membershipTiers: z.array(tierSchema),
  events: z.array(eventSchema).min(2).max(4),
})

The z.enum() does more work than it looks like. Claude is told, in-schema, that homeLocation has to be one of a fixed list of slugs. The structured-output machinery rejects invalid values at the generation layer. Downstream code never has to fuzzy-match "Lisbon" against "lisbon, portugal" against "pt-lisbon" — the slug is already correct or the record was rejected.

Pass 2 — Users (with community digest). This is the coherence mechanism. Before generating users, the pipeline builds a compact per-city digest of what was generated in Pass 1:

Lisbon:
  - Lisbon Underground Wine Collective (Wine & Beverage Culture, Food & Drinks)
  - Lisbon Digital Nomad Network (Remote Work, Tech & Startups)
  - Alfama Old-City Walking Guild (History & Culture, Outdoors)

Tokyo:
  - Shibuya Late-Night Ramen Society (Food & Drinks)
  - Harajuku Street Photography Collective (Photography, Arts)
  - Tokyo Kickstarter Hackers (Tech & Startups, Product)

That digest goes into the user-generation prompt. When the LLM generates a user in Lisbon, it's looking at the communities available there. So their interests come out aligned — food, wine, remote work, tech. Not random hobbies.
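The digest construction itself is plain grouping. A minimal sketch of what it might look like — the `DigestCommunity` shape and the `buildCityDigest` name are illustrative, not the pipeline's actual code:

```typescript
// Illustrative shape mirroring the Pass 1 output (names are assumptions)
interface DigestCommunity {
  name: string
  categories: string[]
  homeLocation: string // canonical city slug, already enforced via z.enum()
}

// Collapse Pass 1 output into the compact per-city digest fed to Pass 2
function buildCityDigest(communities: DigestCommunity[]): string {
  const byCity = new Map<string, DigestCommunity[]>()
  for (const c of communities) {
    const list = byCity.get(c.homeLocation) ?? []
    list.push(c)
    byCity.set(c.homeLocation, list)
  }
  const sections: string[] = []
  for (const [city, list] of byCity) {
    const lines = list.map((c) => `  - ${c.name} (${c.categories.join(", ")})`)
    sections.push(`${city}:\n${lines.join("\n")}`)
  }
  return sections.join("\n\n")
}
```

Because the slug was constrained at generation time, the grouping key is an exact match — no normalization pass needed before building the digest.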

Users also carry spendingPower (LOW/MEDIUM/HIGH), networkingStyle (ACTIVE/SELECTIVE/CASUAL), profession, industry, and experience level. The prompt explicitly asks for small clusters per city — 2-3 clusters of 3-5 users who share overlapping interests but different professions, plus "bridge" users who span clusters. This gives the friendship graph something real to latch onto in Stage 3. Without the cluster directive, the LLM tends to generate a flat distribution; you get 620 users who are each interested in 2-3 things, and the pairwise intersection graph is too sparse to form believable friend groups.

Pass 3 — Venues. 10-15 realistic venue names per city. Conference centers, restaurants, rooftops, coworking spaces. Used by Stage 3 to assign locations to events that don't have community-provided venues.

All three passes run through the Anthropic Batch API — requests bundled into a single batch per pass, polled for completion on a 5s interval with a 30-minute timeout. Model: claude-haiku-4-5-20251001. Structured output via tool_use with Zod-to-JSON-Schema conversion. Total cost for the full generation: about $1.58 (at roughly $0.40/M input and $2.00/M output with 50% batch discount). Generation time: ~2 minutes for all three passes.

The Batch API is the unsung hero here. Real-time Claude with rate limits would have cost roughly $3.15 and taken 45+ minutes. Batch gives you the price break and eliminates rate-limit babysitting. The few minutes of batch-processing latency are fine because generation is a background pipeline step, not a user-facing request.
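The polling loop itself is generic and worth keeping separate from the API calls. A sketch of a poll-with-timeout helper in the spirit of the one described above — the `pollUntil` name and the null-while-pending convention are assumptions; `checkDone` would wrap a "retrieve batch status" call:

```typescript
// Poll a check function until it yields a value or the timeout elapses.
// Resolves to the value when finished; `checkDone` returns null while pending.
async function pollUntil<T>(
  checkDone: () => Promise<T | null>,
  intervalMs: number,
  timeoutMs: number,
): Promise<T> {
  const deadline = Date.now() + timeoutMs
  while (Date.now() < deadline) {
    const result = await checkDone()
    if (result !== null) return result
    await new Promise((r) => setTimeout(r, intervalMs)) // wait between polls
  }
  throw new Error(`polling timed out after ${timeoutMs}ms`)
}
```

With a 5 000 ms interval and a 1 800 000 ms timeout this matches the 5s/30-minute behavior described; keeping the helper pure makes the batch code trivially testable.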

Stage 2 — Process and distribute

Load and validate every batch file against its Zod schema. Corrupt or schema-drifted files are discarded gracefully — the pipeline doesn't crash on one bad batch:

for (const file of batchFiles) {
  const parsed = communityArraySchema.safeParse(await loadJSON(file))
  if (!parsed.success) {
    logger.warn(`Skipping corrupt batch ${file}:`, parsed.error.flatten())
    continue
  }
  communities.push(...parsed.data)
}

Events are distributed across cities via slug lookup. No fuzzy matching — if a community says it's in Lisbon, its events go to Lisbon. If a city ends up short of the MIN_EVENTS_PER_CITY threshold (default 15), event templates get cloned with city-specific titles to pad it out. Output: timestamped communities-final-*.json, users-final-*.json, events-distributed-*.json files, ready for Stage 3.
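A hedged sketch of that distribute-and-pad step — `SeedEventTemplate`, `distributeEvents`, and the title-cloning details are illustrative, not the pipeline's real code:

```typescript
interface SeedEventTemplate {
  title: string
  citySlug: string // canonical slug from generation, no fuzzy matching needed
}

const MIN_EVENTS_PER_CITY = 15 // threshold from the pipeline config

// Group events by exact slug lookup, then pad short cities by cloning
// templates with city-specific titles until each hits the minimum.
function distributeEvents(
  events: SeedEventTemplate[],
  citySlugs: string[],
  minPerCity: number = MIN_EVENTS_PER_CITY,
): Map<string, SeedEventTemplate[]> {
  const byCity = new Map<string, SeedEventTemplate[]>(
    citySlugs.map((s) => [s, []] as [string, SeedEventTemplate[]]),
  )
  for (const e of events) {
    byCity.get(e.citySlug)?.push(e) // exact key match; unknown slugs are dropped
  }
  for (const [city, list] of byCity) {
    let i = 0
    while (list.length < minPerCity && events.length > 0) {
      const template = events[i % events.length]
      list.push({ ...template, citySlug: city, title: `${template.title} (${city})` })
      i++
    }
  }
  return byCity
}
```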

Stage 3 — 10-stage checkpoint/resume DB writer

wipe-db (optional) → load-static → create-users → create-communities → create-events →
create-tickets → create-orders → create-friendships → backfill-activities →
analytics → demo-user

PipelineRunner persists state to pipeline-state.json between stages. Stages are idempotent where possible (upserts for categories and locations). Bcrypt password hashes are cached to disk — seeding 620 users with unique hashed passwords twice in a row doesn't re-bcrypt, which alone saves ~15 seconds. Images are fetched from Unsplash once per community/event and cached by URL; rerunning the seed doesn't burn Unsplash API calls.
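A minimal sketch of the checkpoint/resume mechanic, assuming a simple { completed: [...] } layout for pipeline-state.json — the real PipelineRunner likely stores more per stage:

```typescript
import { readFileSync, writeFileSync, existsSync } from "node:fs"

type Stage = { name: string; run: () => Promise<void> }

// Run stages in order, skipping any recorded as completed in the state file.
// If a stage throws, earlier checkpoints survive; the next run resumes there.
async function runPipeline(stages: Stage[], stateFile = "pipeline-state.json") {
  const done: string[] = existsSync(stateFile)
    ? JSON.parse(readFileSync(stateFile, "utf8")).completed ?? []
    : []
  for (const stage of stages) {
    if (done.includes(stage.name)) continue // already completed on a previous run
    await stage.run()
    done.push(stage.name)
    writeFileSync(stateFile, JSON.stringify({ completed: done }))
  }
}
```

Writing the state file after every stage (not once at the end) is what makes "crashed at stage 7, resume at stage 7" work.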

The wipe-db stage is explicit and off by default. A flag toggles it. This has saved me from three different "oh god I just wiped production" near-misses.

The matching algorithms that make it feel real

RSVP generation isn't random. For each event, every user gets a score:

function scoreUserForEvent(user: SeedUser, event: SeedEvent): number {
  const overlap = intersection(user.interests, event.categories).length
  const overlapScore = overlap / Math.max(event.categories.length, 1)  // 0..1

  const styleModifier = {
    ACTIVE: 0.3,
    SELECTIVE: overlapScore > 0.5 ? 0.2 : -0.2,
    CASUAL: 0.1,
  }[user.networkingStyle]

  const profRelevance = user.profession && event.categories.length > 0 ? 0.1 : 0
  return overlapScore + styleModifier + profRelevance
}

Users below 0.1 are filtered out — they simply wouldn't have shown up for this event. Attendance targets 70% of capacity. The top 40% of slots go to the highest-interest users via probability-weighted selection; the remaining slots fill with a 0.7× probability modifier; random filler covers any remaining gap if attendance still hasn't hit 70%.
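The fill order can be sketched roughly like this — `selectAttendees` is illustrative, simplifies the top-40% step to deterministic top picks rather than probability-weighted ones, and takes an injectable `rng` so the probabilistic tail is testable:

```typescript
interface ScoredUser { id: string; score: number }

// Pick attendees for one event: drop low-scorers, reserve the top slots for
// the best matches, fill the rest probabilistically, then pad with filler.
function selectAttendees(
  scored: ScoredUser[],
  capacity: number,
  rng: () => number = Math.random,
): ScoredUser[] {
  const eligible = scored.filter((u) => u.score >= 0.1) // below 0.1: wouldn't show up
  const target = Math.floor(capacity * 0.7)             // aim for 70% of capacity
  const ranked = [...eligible].sort((a, b) => b.score - a.score)

  const topSlots = Math.floor(target * 0.4)             // top 40% of slots: best matches
  const selected = ranked.slice(0, topSlots)

  for (const u of ranked.slice(topSlots)) {             // dampened probability for the rest
    if (selected.length >= target) break
    if (rng() < u.score * 0.7) selected.push(u)
  }
  for (const u of ranked.slice(topSlots)) {             // filler if still short of 70%
    if (selected.length >= target) break
    if (!selected.includes(u)) selected.push(u)
  }
  return selected.slice(0, target)
}
```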

Tier selection keys off spending power — HIGH users skew to the top two tiers, LOW users to the cheapest. Free events always succeed; paid events have a base 85% success rate modified by spending power (HIGH: +10%, LOW: -10%) and ticket price (sub-$50: baseline; $50-100: -5%; $100+: -15%). Failed orders become PENDING or CANCELLED based on a second roll.
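The success-rate arithmetic can be written down directly. A sketch with the modifiers from the text — `orderSuccessProbability` is an illustrative name and the clamping to [0, 1] is an assumption:

```typescript
type SpendingPower = "LOW" | "MEDIUM" | "HIGH"

// Probability that a user's order for a ticket at `price` succeeds.
function orderSuccessProbability(price: number, power: SpendingPower): number {
  if (price === 0) return 1.0          // free events always succeed
  let p = 0.85                         // base success rate for paid events
  if (power === "HIGH") p += 0.10
  if (power === "LOW") p -= 0.10
  if (price >= 100) p -= 0.15          // $100+: -15%
  else if (price >= 50) p -= 0.05      // $50-100: -5%
  return Math.min(1, Math.max(0, p))
}
```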

The net effect: a $300 VIP ticket is bought almost exclusively by HIGH-spending users. Not because we hard-coded that, but because the scoring produces it. When you browse the "who's going?" list on an expensive event in the seeded demo, you see a coherent cluster of executives and senior professionals. When you browse the same list for a free community meetup, you see a mix — exactly what you'd see in real life.

Friendships score every user pair and keep top N:

  • Same location: +3
  • Shared category interest: +1 to +2 per category (higher if both have strong interest)
  • Same industry: +1

Sort descending, pick ~4 friends per user on average, 80% ACCEPTED / 20% PENDING. Friendships are always between users who share at least a city or strong category overlap. The Berlin-engineer / Lisbon-blogger pairing is impossible by construction.
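A sketch of the pair scoring and top-N pick — the shapes are illustrative, and the strong-interest +2 case is collapsed into a comment:

```typescript
interface FriendUser {
  id: string
  locationSlug: string
  industry: string
  interests: string[] // category slugs
}

// Score one candidate pair with the weights from the text.
function scoreFriendPair(a: FriendUser, b: FriendUser): number {
  let score = 0
  if (a.locationSlug === b.locationSlug) score += 3
  for (const cat of a.interests) {
    if (b.interests.includes(cat)) score += 1 // +2 in the real pipeline when both interests are strong
  }
  if (a.industry === b.industry) score += 1
  return score
}

// Sort candidates by pair score, keep the top N with any overlap at all.
function topFriends(user: FriendUser, pool: FriendUser[], n = 4): FriendUser[] {
  return pool
    .filter((p) => p.id !== user.id)
    .map((p) => ({ p, s: scoreFriendPair(user, p) }))
    .filter(({ s }) => s > 0) // never pair users with nothing in common
    .sort((x, y) => y.s - x.s)
    .slice(0, n)
    .map(({ p }) => p)
}
```

The zero-score filter is what makes the Berlin-engineer / Lisbon-blogger pairing impossible by construction.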

Collaborators for each event are found by category relevance (>0.3 threshold), then selected for diversity — you don't end up with four "Senior Engineers" collaborating on the same event. ACTIVE networkers are more likely to get CO_HOST roles; users with "manager" in their profession get MANAGER. ACTIVE users also RSVP to more events, which is visible in the user profile activity feed.
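The diversity step amounts to "at most one pick per profession after the relevance cut." A sketch, with names assumed (the 0.3 threshold is from the text):

```typescript
interface Collaborator { id: string; profession: string; relevance: number }

// Filter by relevance, rank by it, then keep at most one person per profession
// so four "Senior Engineers" can't end up collaborating on the same event.
function pickCollaborators(cands: Collaborator[], max = 4): Collaborator[] {
  const seen = new Set<string>()
  return cands
    .filter((c) => c.relevance > 0.3)
    .sort((a, b) => b.relevance - a.relevance)
    .filter((c) => {
      if (seen.has(c.profession)) return false
      seen.add(c.profession)
      return true
    })
    .slice(0, max)
}
```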

Graceful fallback

No API key? USE_LLM=false skips the generation pass. The seed stage falls back to Faker for names and descriptions. You lose the coherence — Berlin ends up friends with São Paulo again — but the schema populates and every code path runs. Essential for CI, where we don't want to hit the Anthropic API on every test run. The scored matching still applies to the Faker-generated users, so you get at least intra-dataset consistency even without the LLM layer.

What I considered and rejected

  • Random seeds pinned per environment. Makes the demo deterministic, but also makes it stale. Fresh generation each run is better.
  • Hand-writing the community catalog. Briefly tried it. 420 entries. Gave up after 40. LLMs are specifically good at this.
  • A single monolithic LLM pass. One massive prompt generating the whole dataset at once. Token budgets make this infeasible, and single-pass generation loses the "users know about communities" conditioning.
  • Real-time Claude instead of Batch. ~2× the cost, plus rate-limit handling code, plus 30-40 minutes of wall time. Batch is a strict upgrade for anything non-interactive.
  • Retry logic inside the LLM call. The Batch API already retries transient failures internally, and Zod rejection is not usually transient. Added complexity for no real gain.

The Impact

  • ~420 communities, ~620 users, ~1000 events generated and seeded end to end
  • ~$1.58 total generation cost via Batch API (vs ~$3.15 real-time)
  • ~2 minutes of generation, ~30 seconds of DB writes
  • ~200 MB peak memory, ~50 MB final DB size
  • Resumable from any of 10 stages — no "crashed at stage 7, nuke and restart"
  • Zero hand-written rules for domain logic — the scoring is the rule
  • Zero fuzzy-matching code in the seed pipeline — z.enum() constrains at generation time

The more satisfying outcome is subjective: the demo stopped feeling like a demo. A designer in Berlin has friends who are also in Berlin, and they share interest in design and tech. A food blogger in Lisbon has RSVPed to wine tastings and neighborhood food crawls. Clicking "trending" in each city surfaces things that make sense for that city. You can spend ten minutes browsing the app and not find a single thing that breaks the illusion. The recommendations engine — which uses exactly the same signals the seed pipeline used to generate the data — surfaces matches that look eerily good, because the dataset was built to make them look good. This is the opposite of a demo hack; it's the seed pipeline and the recommendation engine encoding the same domain model, by construction.

Closing

The lesson is that coherence isn't something you add on top of random data. It's a property of how the data is generated. Generate in the right order, make each pass condition on the last, constrain outputs to the same vocabulary the app uses (slugs, enum values), and the dataset builds a story instead of filling columns. Make the generator encode the same domain logic the app enforces, and the demo doesn't just look consistent — it is consistent, for the same reason a real dataset would be.

Faker fills columns. The pipeline tells a story. That's the difference between a demo that holds up to clicking around and one that falls apart the moment someone looks twice.