Benchmark pack-deeperbench-v0

Pack DeeperBench

Name: Pack DeeperBench
Creator: Pack
License: https://www.apache.org/licenses/LICENSE-2.0

Evaluation results for travel-planning systems on synthetic private email, calendar, flight-search, and hotel-search tasks.

Current Pack hard-100 result: 93/100 passed on the full 100-case suite. The selected ten-case comparison uses the same private context, tools, cutoff, and scoring rubric for every system.

Hard-100 Pack run: 93/100
Hard-100 cost: $4.39
Average case cost: $0.0439
Average runtime: 46.7s processing/case
Model comparison: GPT 1/10; Opus 2/10

Hardest-10 Model Comparison

Model results on the selected hardest cases.

Pack, GPT-5.5 xhigh, and Claude Opus 4.7 were evaluated on the same ten selected hard cases with the same short prompts, private-context tools, travel-search tools, scoring rubric, 45-minute cutoff, and $10 execution cap per case.

Cases solved

Final content passes on the selected hard cases.

Pack: 5/10

GPT-5.5 xhigh: 1/10

Opus 4.7: 2/10

Total spend

Full cost for each run across all ten selected cases.

Pack: $0.77

GPT-5.5 xhigh: $86.60

Opus 4.7: $17.15

Runtime

Observed runtime across all ten selected cases.

Pack: 8m42s

GPT-5.5 xhigh: 48m03s

Opus 4.7: 38m50s

Pack

5/10 pass

Scorable output: 10/10

Readable enough to score before cutoff; tiny weight.

Evidence: 8/10

Right private evidence and red-herring avoidance; small weight.

Trip details: 6/10

Travelers, dates, destination, conflicts, and hidden conditions.

Inventory/outcome: 6/10

Valid inventory or the right no-travel, impossible, or clarification state.

Final pass: 5/10

The final answer was fully correct; 50% of the decimal score.

Pack passed five selected hard cases in the current run and returned scorable output for all ten selected cases.

GPT-5.5 xhigh

1/10 pass

Scorable output: 9/10

Readable enough to score before cutoff; tiny weight.

Evidence: 6/10

Right private evidence and red-herring avoidance; small weight.

Trip details: 5/10

Travelers, dates, destination, conflicts, and hidden conditions.

Inventory/outcome: 5/10

Valid inventory or the right no-travel, impossible, or clarification state.

Final pass: 1/10

The final answer was fully correct; 50% of the decimal score.

One final answer passed. Several failed cases still found useful evidence, but scores for wrong-owner, local-stay, and wrong-destination misses stay low because final correctness carries the largest weight.

Opus 4.7

2/10 pass

Scorable output: 10/10

Readable enough to score before cutoff; tiny weight.

Evidence: 3/10

Right private evidence and red-herring avoidance; small weight.

Trip details: 6/10

Travelers, dates, destination, conflicts, and hidden conditions.

Inventory/outcome: 5/10

Valid inventory or the right no-travel, impossible, or clarification state.

Final pass: 2/10

The final answer was fully correct; 50% of the decimal score.

Two plans passed. The rest missed evidence, constraints, search inventory, or the required no-travel/clarification outcome.

5/10

Pack

$0.77

1x Pack cost; 8m42s across 10 cases; 5/10 final content pass; 10/10 scorable output

Pack passed five selected hard cases in the current run; the remaining misses were date/status, destination normalization, credit evidence, infeasibility handling, and event-context resolution.

Fail

GPT-5.5 xhigh

$86.60

112.1x Pack cost; 48m03s across 10 cases; 1/10 final content pass; 0.29 average score

Includes all model work across completed cases and service-limit handling. Partial scores stay low when the final answer is wrong.

Fail

Claude Opus 4.7 max-thinking

$17.15

22.2x Pack cost; 38m50s across 10 cases; 2/10 final content pass; 0.37 average score

Claude answers were normalized into the benchmark format before scoring. Two final answers passed; the other eight missed final outcome, constraints, inventory, or evidence requirements.

Costs show the full cost for each run. Pack includes extraction, planning, and search. GPT and Claude costs include all model work through completion, timeout, or service limit.

Hardest-10 Case Results

Each row is one selected hard case. A full pass means the final answer is fully correct. Decimal scores use a final-answer-heavy rubric: 50% final outcome, 30% core trip details, 10% inventory or outcome, 7% evidence, and 3% scorable output.

Case	Pack	GPT-5.5 xhigh	Claude Opus 4.7 max-thinking
001. @family Japan for about a week. Requires finding the real school-break/PTO window across private mail and calendar, then avoiding a tempting but wrong earlier Japan window.	0.10 (Cost: $0.149; Runtime: 50s; 0.67 score/$). Did not pass: selected Japan planning evidence but used the wrong 2026 date window and returned clarification instead of a complete itinerary.	0.46 (Cost: $13.22; Runtime: 11m04s; 0.04 score/$). Right destination, travelers, dates, and inventory; missed the required school-break email evidence.	0.46 (Cost: $2.00; Runtime: 4m26s; 0.23 score/$). Right Japan window, travelers, and hotel; used repositioning legs and missed the required school-break email evidence.
002. @bel Paris fashion week. The prompt hides the actual fashion-week dates in private context and still requires valid flight and hotel inventory, not just the event city.	0.35 (Cost: $0.070; Runtime: 2m07s; 5.00 score/$). Did not pass: found the private Paris date and completed the plan, but normalized the destination differently than the benchmark answer.	0.25 (Cost: $1.66; Runtime: 5m13s; 0.15 score/$). Found the private Paris date and Bel-only traveler; did not return valid flight or hotel selections.	0.25 (Cost: $1.43; Runtime: 3m16s; 0.17 score/$). Correctly used the private Sep 28 Paris date and Bel-only traveler; left flight and hotel inventory unselected.
003. @adam to Tokyo, use the airline credit if we still can. The model has to verify credit eligibility, dates, and seat-map evidence; using the credit without the hidden condition is wrong.	0.23 (Cost: $0.029; Runtime: 30s; 7.93 score/$). Did not pass: produced a clarification and missed the required Delta seat-map evidence for the credit condition.	0.46 (Cost: $4.96; Runtime: 3m32s; 0.09 score/$). Got the Tokyo solo trip and inventory; missed the required seat-map evidence.	0.17 (Cost: $2.18; Runtime: 3m10s; 0.08 score/$). Planned Adam to Tokyo, but used the airline credit when the hidden condition made it unsafe.
005. @danny Orlando theme park weekend. Danny's trip depends on a private appointment constraint plus evidence for the right traveler, destination, and bookable inventory.	1.00 (Cost: $0.072; Runtime: 32s; 13.89 score/$). Passed: resolved Danny's Orlando weekend, appointment constraint, traveler scope, and bookable inventory.	0.33 (Cost: $7.53; Runtime: 9m22s; 0.04 score/$). Got Danny, Orlando, and the appointment constraint; missed required evidence and inventory.	0.33 (Cost: $1.94; Runtime: 5m32s; 0.17 score/$). Respected the orthodontist constraint and destination; failed required evidence, seat, flight, and hotel output.
019. Forwarded hotel for the upcoming trip. This is a wrong-owner trap: the only obvious hotel confirmation matches a plausible trip but explicitly belongs to someone outside the household.	1.00 (Cost: $0.050; Runtime: 37s; 20.00 score/$). Passed: rejected the forwarded hotel because it did not belong to the household travelers and asked for valid trip evidence.	0.03 (Cost: $20.46; Runtime: 10m57s; 0.00 score/$). Returned a Tokyo family trip from Japan evidence instead of rejecting the wrong-owner forwarded hotel.	1.00 (Cost: $0.95; Runtime: 2m45s; 1.05 score/$). Correctly abstained, rejected the external friend's hotel, and asked for clarification.
039. @adam NYC meeting trip. The correct answer is no travel because Adam is already local; generic NYC meeting evidence pushes planners toward unnecessary flights and hotels.	1.00 (Cost: $0.050; Runtime: 32s; 20.00 score/$). Passed: recognized Adam was already covered by a temporary New York home and returned no travel needed.	0.03 (Cost: $0.66; Runtime: 2m54s; 0.05 score/$). Answered the wrong NYC no-travel case and missed the Blueground temporary-home window.	0.03 (Cost: $2.95; Runtime: 3m25s; 0.01 score/$). Answered a different Midtown no-travel case instead of the Blueground long-stay case.
047. Miami F1 trip. A promotional family-package honeypot conflicts with sparse real evidence, so the right response is clarification instead of booking a complete trip.	1.00 (Cost: $0.077; Runtime: 43s; 12.99 score/$). Passed: identified unresolved Miami GP traveler ambiguity and asked for the traveler set before planning.	0.00 (Cost: $20.57; Runtime: 9m21s; 0.00 score/$). Hit tool-call budget before returning a final plan.	0.36 (Cost: $2.53; Runtime: 5m37s; 0.14 score/$). Rejected the honeypot and declined to book, but missed the required ambiguity evidence and traveler-set clarification.
052. Barcelona Apr 4-7 for all four of us. It looks like a normal four-person trip, but the requested window is blocked or unbookable; the system must prove infeasibility rather than force inventory.	0.45 (Cost: $0.097; Runtime: 1m29s; 4.64 score/$). Did not pass: detected the Barcelona blocked window but asked for alternate dates instead of returning the expected impossible outcome.	0.25 (Cost: $0.39; Runtime: 57s; 0.64 score/$). Found the blocked evidence and selected no inventory, but did not cleanly return the impossible outcome.	0.03 (Cost: $1.29; Runtime: 4m00s; 0.02 score/$). Returned infeasible, but for no-inventory supply reasons and the wrong year, not the all-travelers-blocked reason.
058. @adam Midtown and Roam week. Another no-travel case: the task is to connect private Roam/Tanooki context with Adam already being in Midtown, then decline travel planning.	1.00 (Cost: $0.115; Runtime: 44s; 8.70 score/$). Passed: recognized the Midtown/Roam request was local and returned no travel needed.	1.00 (Cost: $0.32; Runtime: 1m59s; 3.13 score/$). Passed local Midtown no-travel case.	1.00 (Cost: $0.24; Runtime: 1m07s; 4.17 score/$). Correctly returned no travel required.
067. Met Gala, then Knicks. The terse prompt requires resolving event eligibility, travelers, destination, duration, and Knicks timing without overdeclining or inventing missing evidence.	0.03 (Cost: $0.065; Runtime: 38s; 0.46 score/$). Did not pass: asked for destination clarification instead of resolving the New York Met Gala/Knicks context and no-game note.	0.07 (Cost: $16.83; Runtime: 11m56s; 0.00 score/$). Returned a Japan family trip instead of the Met Gala and Knicks New York task.	0.07 (Cost: $1.65; Runtime: 5m27s; 0.04 score/$). Overdeclined for missing invitation/ticket evidence instead of planning Adam's New York trip and noting no Knicks game.
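
The score/$ figure in each cell appears to be the case's decimal score divided by its cost, and the per-system average score appears to be the mean of the ten decimal scores. The page does not state either formula, so the following is a minimal Python sketch under those assumptions; the function names are illustrative and not part of the benchmark.

```python
def score_per_dollar(score: float, cost_usd: float) -> float:
    """Decimal score divided by case cost, rounded to two places (assumed formula)."""
    if cost_usd <= 0:
        raise ValueError("cost must be positive")
    return round(score / cost_usd, 2)


def average_score(scores: list[float]) -> float:
    """Mean of per-case decimal scores, rounded to two places (assumed formula)."""
    if not scores:
        raise ValueError("need at least one score")
    return round(sum(scores) / len(scores), 2)


# Per-case decimal scores copied from the GPT-5.5 xhigh column of the table.
gpt_scores = [0.46, 0.25, 0.46, 0.33, 0.03, 0.03, 0.00, 0.25, 1.00, 0.07]

pack_case_002 = score_per_dollar(0.35, 0.070)  # Pack, case 002
pack_case_005 = score_per_dollar(1.00, 0.072)  # Pack, case 005
gpt_average = average_score(gpt_scores)
```

Under these assumptions the sketch reproduces the published figures for the sampled cells (5.00 and 13.89 score/$) and the 0.29 average score reported for GPT-5.5 xhigh.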

What The Test Includes

The benchmark combines private inbox context, calendar context, and deterministic flight and hotel inventory.

40k emails

1. Travel Extractor

Runs Pack's real streaming extractor over Gmail-shaped messages and calendar events, then emits profile JSON for trips, cancellations, changes, loyalty, preferences, and costs.

100 hard prompts

2. Trip Planner

Runs Pack's real planner on human-written requests with extracted family context, obligations, public-event timing, prior travel, and noisy private context.

1M + 1M inventory

3. Travel Search

Executes deterministic flight and hotel search from Pack planner outlines, then scores seat fit, price, stops, refundability, room capacity, location, and preference match.

Rules

Every system gets the same private inbox, calendar, and travel-search tools.
The user prompt is short; the missing details have to be found in the private context.
A response must be readable enough to normalize into the same scoring rubric.
A case only passes when the answer is grounded in the right evidence and returns the right travel outcome.
Missing final answers, unscorable content, bad IDs, missing evidence, timeouts, and service limits score 0.

How A Case Passes

Final Answer

50% of the score. The returned outcome has to be fully correct: bookable trip, no travel, impossible, or clarification.

Core Trip Details

30% of the score. The answer must get travelers, dates, destination, duration, conflicts, and hidden constraints right.

Inventory Or Outcome

10% of the score. Complete trips need valid flight and hotel choices; non-trip cases need the correct no-travel, impossible, or clarification result.

Evidence

7% of the score. The answer must rely on the right private evidence and avoid tempting wrong-owner, promo, stale, or unrelated context.

Scorable Output

3% of the score. The response has to be readable enough to normalize into the shared benchmark fields before cutoff.
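
The five weights above can be read as a weighted sum. The following is a minimal Python sketch of that reading, assuming each component receives a credit between 0 and 1 (the page does not say how partial credit within a component is assigned); the component keys and the function name are illustrative.

```python
# Rubric weights as stated on this page; the dictionary keys are illustrative names.
WEIGHTS = {
    "final": 0.50,      # Final Answer
    "details": 0.30,    # Core Trip Details
    "inventory": 0.10,  # Inventory Or Outcome
    "evidence": 0.07,   # Evidence
    "output": 0.03,     # Scorable Output
}


def decimal_score(credits: dict[str, float]) -> float:
    """Weighted sum of per-component credits, each assumed to lie in [0, 1]."""
    total = 0.0
    for name, weight in WEIGHTS.items():
        credit = credits.get(name, 0.0)  # a missing component earns no credit
        if not 0.0 <= credit <= 1.0:
            raise ValueError(f"credit for {name!r} must be between 0 and 1")
        total += weight * credit
    return round(total, 2)


full_pass = decimal_score({name: 1.0 for name in WEIGHTS})
output_only = decimal_score({"output": 1.0})
```

Here a fully correct answer scores 1.00, and an answer that is merely scorable scores 0.03, which illustrates why a wrong final outcome keeps the decimal score low.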

Full Pack Run

Pack hard-100 result

Pack passed 93 of 100 hard travel-planning cases in the current full hard-100 run.

Hard-100 evidence set: 93/100

Hard-100 total cost: $4.39

Average hard-100 case cost: $0.0439

Hard-100 runtime: 20m14s wall clock

Average hard-100 runtime: 46.7s processing/case

Selected case result: 5/10 selected hard cases passed in the ten-case comparison set

Selected set runtime: 8m42s across selected ten

Comparison set: All 100 hard cases run end to end
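
The hard-100 figures can be cross-checked with simple arithmetic. The sketch below is illustrative only: the duration parser is a hypothetical helper, and the concurrency estimate assumes that the gap between per-case processing time and wall-clock time comes from running cases in parallel, which the page does not state.

```python
import re


def parse_duration(text: str) -> float:
    """Convert strings such as '20m14s' or '46.7s' into seconds."""
    match = re.fullmatch(r"(?:(\d+)m)?(?:(\d+(?:\.\d+)?)s)?", text)
    if not text or match is None:
        raise ValueError(f"unrecognized duration: {text!r}")
    minutes = int(match.group(1) or 0)
    seconds = float(match.group(2) or 0)
    return minutes * 60 + seconds


CASES = 100
total_cost_usd = 4.39
wall_clock_s = parse_duration("20m14s")
avg_processing_s = parse_duration("46.7s")

# Average cost per case: total cost divided by the number of cases.
avg_case_cost = round(total_cost_usd / CASES, 4)

# Total processing time divided by wall-clock time gives a rough estimate of
# how many cases were in flight at once (an inference, not a reported figure).
implied_concurrency = round(avg_processing_s * CASES / wall_clock_s, 2)
```

The average case cost works out to $0.0439, matching the reported value. The implied concurrency of roughly 3.85 is only an estimate derived from the two runtime figures.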

Hard-100 Case Browser

The ten-case comparison is drawn from this full set of 100 hard travel requests.

001

@family Japan for about a week.

002

@bel Paris fashion week.

003

@adam to Tokyo, use the airline credit if we still can.

004

@chase to Denver for that show.

005

@danny Orlando theme park weekend.

006

James wedding near Tahoe.

007

Avery and Jamie wedding, then Tulum.

008

Riley's bachelor weekend for @adam.

009

US Open weekend for @adam.

010

@adam Betaworks trip.

011

Rome for @family.

012

@bel London trip.

013

Book the offsite travel.

014

Broncos in Denver, then Vail ski nights for @chase.

015

@danny museum weekend.

016

Miami for @adam and @bel, with @bel staying longer.

017

@family spring break in Japan.

018

Japan trip with the reservations we have.

019

Forwarded hotel for the upcoming trip.

020

Fix the changed flight time.

021

Japan again, but avoid that bad connection from last time.

022

Use expiring points or credits for the next trip.

023

@bel's Paris event weekend.

024

@adam flight using Alaska or Delta if it makes sense.

025

@chase gets a window seat if possible.

026

@danny gets a window seat if possible.

027

@adam and @bel Japan.

028

@family Japan with the friend if that works.

029

@adam and @bel Japan during their shared time off.

030

@bel's conference travel.

031

Airbnb for the upcoming trip.

032

Airline follow-up for the next trip.

033

Use that fare sale if it helps.

034

Conference travel after the city changed.

035

Finish the hotel for the trip.

036

Finish the flights for the trip.

037

Fix the return flight after the event.

038

Move the return flight if there is a better option.

039

@adam NYC meeting trip.

040

One way after the conference.

041

@bel and @adam to Paris around NYFW.

042

Milan design week with @chase and @danny.

043

NYC marathon weekend for @family.

044

Tokyo cherry blossoms.

045

@bel London theatre weekend.

046

@bel to Austin GP.

047

Miami F1 trip.

048

Beach wedding, then investor breakfast.

049

Riley's bachelor weekend in Nashville.

050

San Diego for all four of us.

051

@adam and @bel Maui.

052

Barcelona Apr 4-7 for all four of us.

053

Grandparents meet us in Orlando.

054

Boston appointment travel.

055

@adam to Lisbon after the conference.

056

Customer summit travel.

057

Seattle Apr 10-14.

058

@adam Midtown and Roam week.

059

NYC dinners next week.

060

Ceremony near the Tahoe chapel.

061

Event trip after the city moved.

062

@bel to Pitti.

063

Sundance, then a quiet cabin.

064

Watches and Wonders, then Annecy one night.

065

Osheaga, then somewhere calmer nearby.

066

ACL, then somewhere cold and quiet.

067

Met Gala, then Knicks.

068

Primavera, then Menorca.

069

Tokyo Marathon, then Kyoto.

070

Gion Matsuri, then Nara.

071

Nashville CMA for @bel.

072

Santa Fe Indian Market with @family.

073

@adam Zurich Street Parade.

074

Lake Como during Milan derby.

075

@adam and @bel Charleston weekend.

076

Memorial Day beach in San Diego.

077

@bel Palm Springs long weekend.

078

@family long weekend to Vancouver or San Francisco.

079

Hamilton in NYC this summer.

080

Louis the Child in Chicago, then three quiet days nearby.

081

Vegas two nights, maybe Warriors too.

082

Salt Lake weekend with snow and hot springs.

083

Extend Boston through Tuesday.

084

NYC around Roam and Tanooki.

085

Reno July Fourth.

086

Cancun wedding flight only.

087

Cancun wedding hotel only.

088

Cancun wedding flight and hotel for @adam and @bel.

089

After the wedding, somewhere cold for four nights.

090

After the Cancun wedding, quiet scenic two nights.

091

Lisbon four nights.

092

One way to Paris.

093

Chicago weekend trip after work Friday.

094

After Midtown meetings, see if Knicks works.

095

Porto during Primavera Pro.

096

Taipei Lantern Festival.

097

Vienna opera, then somewhere rainy and bookish nearby.

098

Rosalia in Montreal, then Quebec City.

099

Orlando family trip.

100

Tokyo food week with @family.

Benchmark pack-deeperbench-v0

Pack DeeperBench

Evaluation results for travel-planning systems on synthetic private email, calendar, flight-search, and hotel-search tasks.

Current Pack hard-100 result.93/100 passed on the full 100-case suite. The selected ten-case comparison uses the same private context, tools, cutoff, and scoring rubric for every system.

Hard-100 Pack run: 93/100
Hard-100 cost: $4.39
Average case cost: $0.0439
Average runtime: 46.7s processing/case
Model comparison: GPT 1/10; Opus 2/10

Hardest-10 Model Comparison

Model results on the selected hardest cases.

Cases solved

Final content passes on the selected hard cases.

Pack5/10

GPT-5.5 xhigh1/10

Opus 4.72/10

Total spend

Full cost for each run across all ten selected cases.

Pack$0.77

GPT-5.5 xhigh$86.60

Opus 4.7$17.15

Runtime

Observed runtime across all ten selected cases.

Pack8m42s

GPT-5.5 xhigh48m03s

Opus 4.738m50s

Pack

5/10 pass

Scorable output10/10

Readable enough to score before cutoff; tiny weight.

Evidence8/10

Right private evidence and red-herring avoidance; small weight.

Trip details6/10

Travelers, dates, destination, conflicts, and hidden conditions.

Inventory/outcome6/10

Valid inventory or the right no-travel, impossible, or clarification state.

Final pass5/10

The final answer was fully correct; 50% of the decimal score.

Pack passed five selected hard cases in the current run and returned scorable output for all ten selected cases.

GPT-5.5 xhigh

1/10 pass

Scorable output9/10

Readable enough to score before cutoff; tiny weight.

Evidence6/10

Right private evidence and red-herring avoidance; small weight.

Trip details5/10

Travelers, dates, destination, conflicts, and hidden conditions.

Inventory/outcome5/10

Valid inventory or the right no-travel, impossible, or clarification state.

Final pass1/10

The final answer was fully correct; 50% of the decimal score.

One final answer passed. Several failed cases still found useful evidence, but wrong-owner, local-stay, and wrong-destination misses now stay low because final correctness carries the largest weight.

Opus 4.7

2/10 pass

Scorable output10/10

Readable enough to score before cutoff; tiny weight.

Evidence3/10

Right private evidence and red-herring avoidance; small weight.

Trip details6/10

Travelers, dates, destination, conflicts, and hidden conditions.

Inventory/outcome5/10

Valid inventory or the right no-travel, impossible, or clarification state.

Final pass2/10

The final answer was fully correct; 50% of the decimal score.

Two plans passed. The rest missed evidence, constraints, search inventory, or the required no-travel/clarification outcome.

5/10

Pack

$0.77

1x Pack cost8m42s across 10 cases5/10 final content pass; 10/10 scorable output

Pack passed five selected hard cases in the current run; the remaining misses were date/status, destination normalization, credit evidence, infeasibility handling, and event-context resolution.

Fail

GPT-5.5 xhigh

$86.60

112.1x Pack cost48m03s across 10 cases1/10 final content pass; 0.29 average score

Includes all model work across completed cases and service-limit handling. Partial scores stay low when the final answer is wrong.

Fail

Claude Opus 4.7 max-thinking

$17.15

22.2x Pack cost38m50s across 10 cases2/10 final content pass; 0.37 average score

Claude answers were normalized into the benchmark format before scoring. Two final answers passed; the other eight missed final outcome, constraints, inventory, or evidence requirements.

Costs show the full cost for each run. Pack includes extraction, planning, and search. GPT and Claude costs include all model work through completion, timeout, or service limit.

Hardest-10 Case Results

Case	Pack	GPT-5.5 xhigh	Claude Opus 4.7 max-thinking
001. @family Japan for about a week.Requires finding the real school-break/PTO window across private mail and calendar, then avoiding a tempting but wrong earlier Japan window.	0.10Cost: $0.149Runtime: 50s0.67 score/$ Output 3% Evidence 7% Details 30% Search 10% Final 50% Did not pass: selected Japan planning evidence but used the wrong 2026 date window and returned clarification instead of a complete itinerary.	0.46Cost: $13.22Runtime: 11m04s0.04 score/$ Output 3% Evidence 7% Details 30% Search 10% Final 50% Right destination, travelers, dates, and inventory; missed the required school-break email evidence.	0.46Cost: $2.00Runtime: 4m26s0.23 score/$ Output 3% Evidence 7% Details 30% Search 10% Final 50% Right Japan window, travelers, and hotel; used repositioning legs and missed the required school-break email evidence.
002. @bel Paris fashion week.The prompt hides the actual fashion-week dates in private context and still requires valid flight and hotel inventory, not just the event city.	0.35Cost: $0.070Runtime: 2m07s5.00 score/$ Output 3% Evidence 7% Details 30% Search 10% Final 50% Did not pass: found the private Paris date and completed the plan, but normalized the destination differently than the benchmark answer.	0.25Cost: $1.66Runtime: 5m13s0.15 score/$ Output 3% Evidence 7% Details 30% Search 10% Final 50% Found the private Paris date and Bel-only traveler; did not return valid flight or hotel selections.	0.25Cost: $1.43Runtime: 3m16s0.17 score/$ Output 3% Evidence 7% Details 30% Search 10% Final 50% Correctly used the private Sep 28 Paris date and Bel-only traveler; left flight and hotel inventory unselected.
003. @adam to Tokyo, use the airline credit if we still can.The model has to verify credit eligibility, dates, and seat-map evidence; using the credit without the hidden condition is wrong.	0.23Cost: $0.029Runtime: 30s7.93 score/$ Output 3% Evidence 7% Details 30% Search 10% Final 50% Did not pass: produced a clarification and missed the required Delta seat-map evidence for the credit condition.	0.46Cost: $4.96Runtime: 3m32s0.09 score/$ Output 3% Evidence 7% Details 30% Search 10% Final 50% Got the Tokyo solo trip and inventory; missed the required seat-map evidence.	0.17Cost: $2.18Runtime: 3m10s0.08 score/$ Output 3% Evidence 7% Details 30% Search 10% Final 50% Planned Adam to Tokyo, but used the airline credit when the hidden condition made it unsafe.
005. @danny Orlando theme park weekend.Danny's trip depends on a private appointment constraint plus evidence for the right traveler, destination, and bookable inventory.	1.00Cost: $0.072Runtime: 32s13.89 score/$ Output 3% Evidence 7% Details 30% Search 10% Final 50% Passed: resolved Danny's Orlando weekend, appointment constraint, traveler scope, and bookable inventory.	0.33Cost: $7.53Runtime: 9m22s0.04 score/$ Output 3% Evidence 7% Details 30% Search 10% Final 50% Got Danny, Orlando, and the appointment constraint; missed required evidence and inventory.	0.33Cost: $1.94Runtime: 5m32s0.17 score/$ Output 3% Evidence 7% Details 30% Search 10% Final 50% Respected the orthodontist constraint and destination; failed required evidence, seat, flight, and hotel output.
019. Forwarded hotel for the upcoming trip.This is a wrong-owner trap: the only obvious hotel confirmation matches a plausible trip but explicitly belongs to someone outside the household.	1.00Cost: $0.050Runtime: 37s20.00 score/$ Output 3% Evidence 7% Details 30% Search 10% Final 50% Passed: rejected the forwarded hotel because it did not belong to the household travelers and asked for valid trip evidence.	0.03Cost: $20.46Runtime: 10m57s0.00 score/$ Output 3% Evidence 7% Details 30% Search 10% Final 50% Returned a Tokyo family trip from Japan evidence instead of rejecting the wrong-owner forwarded hotel.	1.00Cost: $0.95Runtime: 2m45s1.05 score/$ Output 3% Evidence 7% Details 30% Search 10% Final 50% Correctly abstained, rejected the external friend's hotel, and asked for clarification.
039. @adam NYC meeting trip.The correct answer is no travel because Adam is already local; generic NYC meeting evidence pushes planners toward unnecessary flights and hotels.	1.00Cost: $0.050Runtime: 32s20.00 score/$ Output 3% Evidence 7% Details 30% Search 10% Final 50% Passed: recognized Adam was already covered by a temporary New York home and returned no travel needed.	0.03Cost: $0.66Runtime: 2m54s0.05 score/$ Output 3% Evidence 7% Details 30% Search 10% Final 50% Answered the wrong NYC no-travel case and missed the Blueground temporary-home window.	0.03Cost: $2.95Runtime: 3m25s0.01 score/$ Output 3% Evidence 7% Details 30% Search 10% Final 50% Answered a different Midtown no-travel case instead of the Blueground long-stay case.
047. Miami F1 trip.A promotional family-package honeypot conflicts with sparse real evidence, so the right response is clarification instead of booking a complete trip.	1.00Cost: $0.077Runtime: 43s12.99 score/$ Output 3% Evidence 7% Details 30% Search 10% Final 50% Passed: identified unresolved Miami GP traveler ambiguity and asked for the traveler set before planning.	0.00Cost: $20.57Runtime: 9m21s0.00 score/$ Output 3% Evidence 7% Details 30% Search 10% Final 50% Hit tool-call budget before returning a final plan	0.36Cost: $2.53Runtime: 5m37s0.14 score/$ Output 3% Evidence 7% Details 30% Search 10% Final 50% Rejected the honeypot and declined to book, but missed the required ambiguity evidence and traveler-set clarification.
052. Barcelona Apr 4-7 for all four of us.It looks like a normal four-person trip, but the requested window is blocked or unbookable; the system must prove infeasibility rather than force inventory.	0.45Cost: $0.097Runtime: 1m29s4.64 score/$ Output 3% Evidence 7% Details 30% Search 10% Final 50% Did not pass: detected the Barcelona blocked window but asked for alternate dates instead of returning the expected impossible outcome.	0.25Cost: $0.39Runtime: 57s0.64 score/$ Output 3% Evidence 7% Details 30% Search 10% Final 50% Found the blocked evidence and selected no inventory, but did not cleanly return the impossible outcome.	0.03Cost: $1.29Runtime: 4m00s0.02 score/$ Output 3% Evidence 7% Details 30% Search 10% Final 50% Returned infeasible, but for no-inventory supply reasons and the wrong year, not the all-travelers-blocked reason.
058. @adam Midtown and Roam week.Another no-travel case: the task is to connect private Roam/Tanooki context with Adam already being in Midtown, then decline travel planning.	1.00Cost: $0.115Runtime: 44s8.70 score/$ Output 3% Evidence 7% Details 30% Search 10% Final 50% Passed: recognized the Midtown/Roam request was local and returned no travel needed.	1.00Cost: $0.32Runtime: 1m59s3.13 score/$ Output 3% Evidence 7% Details 30% Search 10% Final 50% Passed local Midtown no-travel case	1.00Cost: $0.24Runtime: 1m07s4.17 score/$ Output 3% Evidence 7% Details 30% Search 10% Final 50% Correctly returned no travel required
067. Met Gala, then Knicks.The terse prompt requires resolving event eligibility, travelers, destination, duration, and Knicks timing without overdeclining or inventing missing evidence.	0.03Cost: $0.065Runtime: 38s0.46 score/$ Output 3% Evidence 7% Details 30% Search 10% Final 50% Did not pass: asked for destination clarification instead of resolving the New York Met Gala/Knicks context and no-game note.	0.07Cost: $16.83Runtime: 11m56s0.00 score/$ Output 3% Evidence 7% Details 30% Search 10% Final 50% Returned a Japan family trip instead of the Met Gala and Knicks New York task.	0.07Cost: $1.65Runtime: 5m27s0.04 score/$ Output 3% Evidence 7% Details 30% Search 10% Final 50% Overdeclined for missing invitation/ticket evidence instead of planning Adam's New York trip and noting no Knicks game.

What The Test Includes

The benchmark combines private inbox context, calendar context, and deterministic flight and hotel inventory.

40k emails

1. Travel Extractor

Runs Pack's real streaming extractor over Gmail-shaped messages and calendar events, then emits profile JSON for trips, cancellations, changes, loyalty, preferences, and costs.

100 hard prompts

2. Trip Planner

Runs Pack's real planner on human-written requests with extracted family context, obligations, public-event timing, prior travel, and noisy private context.

1M + 1M inventory

3. Travel Search

Executes deterministic flight and hotel search from Pack planner outlines, then scores seat fit, price, stops, refundability, room capacity, location, and preference match.

Rules

Every system gets the same private inbox, calendar, and travel-search tools.
The user prompt is short; the missing details have to be found in the private context.
A response must be readable enough to normalize into the same scoring rubric.
A case only passes when the answer is grounded in the right evidence and returns the right travel outcome.
Missing final answers, unscorable content, bad IDs, missing evidence, timeouts, and service limits score 0.

How A Case Passes

Final Answer

50% of the score. The returned outcome has to be fully correct: bookable trip, no travel, impossible, or clarification.

Core Trip Details

30% of the score. The answer must get travelers, dates, destination, duration, conflicts, and hidden constraints right.

Inventory Or Outcome

10% of the score. Complete trips need valid flight and hotel choices; non-trip cases need the correct no-travel, impossible, or clarification result.

Evidence

7% of the score. The answer must rely on the right private evidence and avoid tempting wrong-owner, promo, stale, or unrelated context.

Scorable Output

3% of the score. The response has to be readable enough to normalize into the shared benchmark fields before cutoff.

Full Pack Run

Pack hard-100 result

Pack passed 93 of 100 hard travel-planning cases in the current full hard-100 run.

Hard-100 evidence set: 93/100

Hard-100 total cost: $4.39

Average hard-100 case cost: $0.0439

Hard-100 runtime: 20m14s wall clock

Average hard-100 runtime: 46.7s processing/case

Selected case result: 5/10 selected hard cases passed in the ten-case comparison set

Selected set runtime: 8m42s across selected ten

Comparison set: All 100 hard cases run end to end
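The averages in this summary follow from the totals, and the gap between wall-clock time and per-case processing time hints at how much work overlapped. The sketch below is a back-of-the-envelope check only. It assumes the 46.7s average covers all 100 cases and that per-case processing times simply add up; the source does not state how cases were scheduled, so the concurrency figure is an inference, not a reported result.

```python
CASES = 100
total_cost_usd = 4.39
avg_processing_s = 46.7
wall_clock_s = 20 * 60 + 14  # 20m14s expressed in seconds

# Average cost per case, matching the reported $0.0439.
avg_cost_usd = round(total_cost_usd / CASES, 4)

# Total processing time if every case took the average.
summed_processing_s = round(avg_processing_s * CASES, 1)

# Ratio of summed processing time to wall-clock time: a rough estimate
# of how many cases were in flight at once (an assumption, see above).
implied_concurrency = round(summed_processing_s / wall_clock_s, 2)

print(avg_cost_usd)         # 0.0439
print(summed_processing_s)  # 4670.0
print(implied_concurrency)  # 3.85
```

Under these assumptions, roughly 78 minutes of summed processing fit into about 20 minutes of wall-clock time, which is consistent with several cases running in parallel.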

Hard-100 Case Browser

The ten-case comparison is drawn from this full set of 100 hard travel requests.

001

@family Japan for about a week.

002

@bel Paris fashion week.

003

@adam to Tokyo, use the airline credit if we still can.

004

@chase to Denver for that show.

005

@danny Orlando theme park weekend.

006

James wedding near Tahoe.

007

Avery and Jamie wedding, then Tulum.

008

Riley's bachelor weekend for @adam.

009

US Open weekend for @adam.

010

@adam Betaworks trip.

011

Rome for @family.

012

@bel London trip.

013

Book the offsite travel.

014

Broncos in Denver, then Vail ski nights for @chase.

015

@danny museum weekend.

016

Miami for @adam and @bel, with @bel staying longer.

017

@family spring break in Japan.

018

Japan trip with the reservations we have.

019

Forwarded hotel for the upcoming trip.

020

Fix the changed flight time.

021

Japan again, but avoid that bad connection from last time.

022

Use expiring points or credits for the next trip.

023

@bel's Paris event weekend.

024

@adam flight using Alaska or Delta if it makes sense.

025

@chase gets a window seat if possible.

026

@danny gets a window seat if possible.

027

@adam and @bel Japan.

028

@family Japan with the friend if that works.

029

@adam and @bel Japan during their shared time off.

030

@bel's conference travel.

031

Airbnb for the upcoming trip.

032

Airline follow-up for the next trip.

033

Use that fare sale if it helps.

034

Conference travel after the city changed.

035

Finish the hotel for the trip.

036

Finish the flights for the trip.

037

Fix the return flight after the event.

038

Move the return flight if there is a better option.

039

@adam NYC meeting trip.

040

One way after the conference.

041

@bel and @adam to Paris around NYFW.

042

Milan design week with @chase and @danny.

043

NYC marathon weekend for @family.

044

Tokyo cherry blossoms.

045

@bel London theatre weekend.

046

@bel to Austin GP.

047

Miami F1 trip.

048

Beach wedding, then investor breakfast.

049

Riley's bachelor weekend in Nashville.

050

San Diego for all four of us.

051

@adam and @bel Maui.

052

Barcelona Apr 4-7 for all four of us.

053

Grandparents meet us in Orlando.

054

Boston appointment travel.

055

@adam to Lisbon after the conference.

056

Customer summit travel.

057

Seattle Apr 10-14.

058

@adam Midtown and Roam week.

059

NYC dinners next week.

060

Ceremony near the Tahoe chapel.

061

Event trip after the city moved.

062

@bel to Pitti.

063

Sundance, then a quiet cabin.

064

Watches and Wonders, then Annecy one night.

065

Osheaga, then somewhere calmer nearby.

066

ACL, then somewhere cold and quiet.

067

Met Gala, then Knicks.

068

Primavera, then Menorca.

069

Tokyo Marathon, then Kyoto.

070

Gion Matsuri, then Nara.

071

Nashville CMA for @bel.

072

Santa Fe Indian Market with @family.

073

@adam Zurich Street Parade.

074

Lake Como during Milan derby.

075

@adam and @bel Charleston weekend.

076

Memorial Day beach in San Diego.

077

@bel Palm Springs long weekend.

078

@family long weekend to Vancouver or San Francisco.

079

Hamilton in NYC this summer.

080

Louis the Child in Chicago, then three quiet days nearby.

081

Vegas two nights, maybe Warriors too.

082

Salt Lake weekend with snow and hot springs.

083

Extend Boston through Tuesday.

084

NYC around Roam and Tanooki.

085

Reno July Fourth.

086

Cancun wedding flight only.

087

Cancun wedding hotel only.

088

Cancun wedding flight and hotel for @adam and @bel.

089

After the wedding, somewhere cold for four nights.

090

After the Cancun wedding, quiet scenic two nights.

091

Lisbon four nights.

092

One way to Paris.

093

Chicago weekend trip after work Friday.

094

After Midtown meetings, see if Knicks works.

095

Porto during Primavera Pro.

096

Taipei Lantern Festival.

097

Vienna opera, then somewhere rainy and bookish nearby.

098

Rosalia in Montreal, then Quebec City.

099

Orlando family trip.

100

Tokyo food week with @family.