Benchmark pack-deeperbench-v0

Pack DeeperBench: travel agents should survive the inbox before they book the trip.

Name: Pack DeeperBench
Creator: Pack
License: https://www.apache.org/licenses/LICENSE-2.0

Pack DeeperBench measures the complete workflow: extracting a household travel history from realistic email and calendar data, planning from a short human prompt, and selecting flights and hotels from large deterministic inventories.

Full 100-case Pack run pending. The published page is wired for final Pack and external-provider runs, but only verified smoke results are shown until the full 100-case run completes.

Household: 4 people
Inbox: 40,000 emails
Travel items: ~16,000
Search inventory: 1M flights + 1M hotels

Benchmark Phases

Each phase is scored separately and then rolled into an end-to-end case score. Pack runs through its native code path; external systems run against the same tool protocol and result schemas.

40k emails

1. Travel Extractor

Runs Pack's real streaming extractor over Gmail-shaped messages and calendar events, then emits profile JSON for trips, cancellations, stale evidence, loyalty, preferences, and costs.

100 hard prompts

2. Trip Planner

Runs Pack's real planner on human-written requests with extracted family context, obligations, public-event timing, prior travel, and red-herring private context.

1M + 1M inventory

3. Travel Search

Executes deterministic flight and hotel search from Pack planner outlines, then scores seat fit, price, stops, refundability, room capacity, location, and preference match.
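
A minimal sketch of what a deterministic flight ranking could look like. The `Flight` fields, the `rank_flights` helper, and the priority order (refundability, then stops, then price) are all assumptions made for illustration; Pack's actual scoring weights and inventory schema are not published here. The point it shows is that a hard seat filter followed by a total sort order with an ID tie-break returns the same ranking regardless of input order.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Flight:
    # Hypothetical inventory record; field names are illustrative only.
    flight_id: str
    seats_available: int
    price_usd: float
    stops: int
    refundable: bool


def rank_flights(flights, travelers, want_refundable=True):
    """Drop flights without enough seats, then sort by a total order."""
    eligible = [f for f in flights if f.seats_available >= travelers]
    return sorted(
        eligible,
        key=lambda f: (
            f.refundable != want_refundable,  # False sorts first: preference matched
            f.stops,
            f.price_usd,
            f.flight_id,  # final tie-break keeps the ranking deterministic
        ),
    )


inventory = [
    Flight("F3", 4, 900.0, 0, True),
    Flight("F1", 2, 500.0, 0, True),   # too few seats for four travelers
    Flight("F2", 6, 900.0, 0, True),
    Flight("F4", 5, 700.0, 1, False),
]
ranked = [f.flight_id for f in rank_flights(inventory, travelers=4)]
```

Because every field in the sort key is compared in a fixed order and the ID breaks any remaining ties, shuffling `inventory` does not change `ranked`.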

Official Protocol

Extractor receives Gmail-like and Calendar-like APIs over the synthetic household corpus.
Planner receives the extracted private travel context plus the natural user prompt.
Search receives Pack planner outlines and ranks deterministic flight and hotel inventory.
Official scoring gives no grace for failed phases, invalid IDs, missing evidence, or malformed outputs.
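
The no-grace rule and the end-to-end roll-up can be sketched as follows. This is an illustration under stated assumptions, not the official scorer: the `PhaseResult` fields, the `phase_score` and `case_score` helpers, and the unweighted mean are hypothetical, since the page does not specify how phase scores are weighted.

```python
from dataclasses import dataclass
from statistics import mean

PHASES = ("extractor", "planner", "search")


@dataclass
class PhaseResult:
    # Hypothetical result record; the real result schema may differ.
    score: float               # phase score in [0, 1]
    completed: bool = True     # False if the phase failed
    valid_ids: bool = True     # False if any referenced ID is invalid
    has_evidence: bool = True  # False if required evidence is missing
    well_formed: bool = True   # False if the output is malformed


def phase_score(result):
    """Return the phase score, or 0.0 if any no-grace condition is violated."""
    if result is None:
        return 0.0
    ok = (result.completed and result.valid_ids
          and result.has_evidence and result.well_formed)
    return result.score if ok else 0.0


def case_score(results):
    """Assumed roll-up: unweighted mean over all phases; a missing phase is 0."""
    return mean(phase_score(results.get(name)) for name in PHASES)


clean = {name: PhaseResult(score=1.0) for name in PHASES}
broken = dict(clean, search=PhaseResult(score=0.9, valid_ids=False))
```

In this sketch a single invalid ID in the search phase zeroes that phase, so `case_score(broken)` is two thirds of `case_score(clean)` rather than receiving partial credit for the 0.9.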

Scoring

Extraction Accuracy

Correct trips, travelers, cancellations, changes, stale bookings, loyalty, and preference evidence.

Planning Accuracy

Correct dates, travelers, public timing, calendar constraints, active context, and trip outline structure.

User Value

Whether selected flights and hotels actually match family seats, fare flexibility, price, rooms, and location preferences.

Grounding

Evidence quality for inbox, calendar, public-event, planner, and search decisions.

Runtime

Wall-clock time by extractor, planner, search, and full case execution.
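
One way to collect wall-clock time per phase and for the full case is a small timing context manager. The sketch below is an assumption about how such measurement might be wired, not Pack's runner; `timed`, `run_case`, and the placeholder phase functions are hypothetical names.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, sink):
    """Add the elapsed wall-clock seconds for the enclosed block to sink[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        sink[label] = sink.get(label, 0.0) + (time.perf_counter() - start)


def run_case(phases):
    """Run phases in order; return per-phase and full-case wall-clock seconds."""
    timings = {}
    with timed("full_case", timings):
        for name, fn in phases:
            with timed(name, timings):
                fn()
    return timings


# Placeholder phase functions stand in for the real extractor, planner, and search.
report = run_case([
    ("extractor", lambda: time.sleep(0.01)),
    ("planner", lambda: None),
    ("search", lambda: None),
])
```

Using `time.perf_counter` rather than `time.time` keeps the measurement monotonic, and the `finally` clause records the time even when a phase raises.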

Fully Loaded Cost

Model calls, Pack phase costs, and local/AWS runner costs reported separately.

Latest Verified Pack Run

Latest verified local Pack smoke

Book two weeks in Japan for @family in June 2027. Start with Tokyo and keep our usual two-room family split.

Cases: 1
Extractor profiles: 4
Historical trips: 8
Flight scan: 2,000,000
Hotel scan: 1,000,000
Total cost: $0.3464
Planning accuracy: 1.00
User value: 1.00

Run Path

cd PackServer
npm run bench:travel-context:pack-extractor-hartwell -- --email-task-concurrency 2 --out-dir tmp/hartwell-pack-real-extractor-full
npm run bench:travel-context:pack-phased-hartwell -- --extraction-dir tmp/hartwell-pack-real-extractor-full --limit 100 --flight-count 1000000 --hotel-count 1000000

Release repository coming next