Past, present, and the future of agentic travel

Pack
584 Castro St, Suite #4036
San Francisco, CA 94114


© 2026 Pack. All rights reserved.

Benchmark pack-deeperbench-v0

Pack DeeperBench

Evaluation results for travel-planning systems on synthetic private email, calendar, flight-search, and hotel-search tasks.

Current Pack hard-100 result: 93/100 passed on the full 100-case suite. The selected ten-case comparison uses the same private context, tools, cutoff, and scoring rubric for every system.

Hard-100 Pack run: 93/100
Hard-100 cost: $4.39
Average case cost: $0.0439
Average runtime: 46.7s processing/case
Model comparison: GPT 1/10; Opus 2/10

Hardest-10 Model Comparison

Model results on the selected hardest cases.

Pack, GPT-5.5 xhigh, and Claude Opus 4.7 were evaluated on the same ten selected hard cases with the same short prompts, private-context tools, travel-search tools, scoring rubric, 45-minute cutoff, and $10 execution cap per case.

Cases solved

Final content passes on the selected hard cases.

Pack: 5/10
GPT-5.5 xhigh: 1/10
Opus 4.7: 2/10

Total spend

Full cost for each run across all ten selected cases.

Pack: $0.77
GPT-5.5 xhigh: $86.60
Opus 4.7: $17.15

Runtime

Observed runtime across all ten selected cases.

Pack: 8m42s
GPT-5.5 xhigh: 48m03s
Opus 4.7: 38m50s

Pack

5/10 pass
Scorable output: 10/10
Readable enough to score before cutoff; tiny weight.
Evidence: 8/10
Right private evidence and red-herring avoidance; small weight.
Trip details: 6/10
Travelers, dates, destination, conflicts, and hidden conditions.
Inventory/outcome: 6/10
Valid inventory or the right no-travel, impossible, or clarification state.
Final pass: 5/10
The final answer was fully correct; 50% of the decimal score.

Pack passed five selected hard cases in the current run and returned scorable output for all ten selected cases.

GPT-5.5 xhigh

1/10 pass
Scorable output: 9/10
Readable enough to score before cutoff; tiny weight.
Evidence: 6/10
Right private evidence and red-herring avoidance; small weight.
Trip details: 5/10
Travelers, dates, destination, conflicts, and hidden conditions.
Inventory/outcome: 5/10
Valid inventory or the right no-travel, impossible, or clarification state.
Final pass: 1/10
The final answer was fully correct; 50% of the decimal score.

One final answer passed. Several failed cases still found useful evidence, but the wrong-owner, local-stay, and wrong-destination misses score low because final correctness carries the largest weight.

Opus 4.7

2/10 pass
Scorable output: 10/10
Readable enough to score before cutoff; tiny weight.
Evidence: 3/10
Right private evidence and red-herring avoidance; small weight.
Trip details: 6/10
Travelers, dates, destination, conflicts, and hidden conditions.
Inventory/outcome: 5/10
Valid inventory or the right no-travel, impossible, or clarification state.
Final pass: 2/10
The final answer was fully correct; 50% of the decimal score.

Two plans passed. The rest missed evidence, constraints, search inventory, or the required no-travel/clarification outcome.

5/10

Pack

$0.77
1x Pack cost
8m42s across 10 cases
5/10 final content pass; 10/10 scorable output

Pack passed five selected hard cases in the current run; the remaining misses were date/status, destination normalization, credit evidence, infeasibility handling, and event-context resolution.

Fail

GPT-5.5 xhigh

$86.60
112.1x Pack cost
48m03s across 10 cases
1/10 final content pass; 0.29 average score

Includes all model work across completed cases and service-limit handling. Partial scores stay low when the final answer is wrong.

Fail

Claude Opus 4.7 max-thinking

$17.15
22.2x Pack cost
38m50s across 10 cases
2/10 final content pass; 0.37 average score

Claude answers were normalized into the benchmark format before scoring. Two final answers passed; the other eight missed final outcome, constraints, inventory, or evidence requirements.

Costs show the full cost for each run. Pack includes extraction, planning, and search. GPT and Claude costs include all model work through completion, timeout, or service limit.

Hardest-10 Case Results

Each row is one selected hard case. A full pass means the final answer is fully correct. Decimal scores use a final-answer-heavy rubric: 50% final outcome, 30% core trip details, 10% inventory or outcome, 7% evidence, and 3% scorable output.

Case | Pack | GPT-5.5 xhigh | Claude Opus 4.7 max-thinking
001. @family Japan for about a week.
Requires finding the real school-break/PTO window across private mail and calendar, then avoiding a tempting but wrong earlier Japan window.

Pack: score 0.10; cost $0.149; runtime 50s; 0.67 score/$
Did not pass: selected Japan planning evidence but used the wrong 2026 date window and returned clarification instead of a complete itinerary.

GPT-5.5 xhigh: score 0.46; cost $13.22; runtime 11m04s; 0.04 score/$
Right destination, travelers, dates, and inventory; missed the required school-break email evidence.

Claude Opus 4.7 max-thinking: score 0.46; cost $2.00; runtime 4m26s; 0.23 score/$
Right Japan window, travelers, and hotel; used repositioning legs and missed the required school-break email evidence.

002. @bel Paris fashion week.
The prompt hides the actual fashion-week dates in private context and still requires valid flight and hotel inventory, not just the event city.

Pack: score 0.35; cost $0.070; runtime 2m07s; 5.00 score/$
Did not pass: found the private Paris date and completed the plan, but normalized the destination differently than the benchmark answer.

GPT-5.5 xhigh: score 0.25; cost $1.66; runtime 5m13s; 0.15 score/$
Found the private Paris date and Bel-only traveler; did not return valid flight or hotel selections.

Claude Opus 4.7 max-thinking: score 0.25; cost $1.43; runtime 3m16s; 0.17 score/$
Correctly used the private Sep 28 Paris date and Bel-only traveler; left flight and hotel inventory unselected.

003. @adam to Tokyo, use the airline credit if we still can.
The model has to verify credit eligibility, dates, and seat-map evidence; using the credit without the hidden condition is wrong.

Pack: score 0.23; cost $0.029; runtime 30s; 7.93 score/$
Did not pass: produced a clarification and missed the required Delta seat-map evidence for the credit condition.

GPT-5.5 xhigh: score 0.46; cost $4.96; runtime 3m32s; 0.09 score/$
Got the Tokyo solo trip and inventory; missed the required seat-map evidence.

Claude Opus 4.7 max-thinking: score 0.17; cost $2.18; runtime 3m10s; 0.08 score/$
Planned Adam to Tokyo, but used the airline credit when the hidden condition made it unsafe.

005. @danny Orlando theme park weekend.
Danny's trip depends on a private appointment constraint plus evidence for the right traveler, destination, and bookable inventory.

Pack: score 1.00; cost $0.072; runtime 32s; 13.89 score/$
Passed: resolved Danny's Orlando weekend, appointment constraint, traveler scope, and bookable inventory.

GPT-5.5 xhigh: score 0.33; cost $7.53; runtime 9m22s; 0.04 score/$
Got Danny, Orlando, and the appointment constraint; missed required evidence and inventory.

Claude Opus 4.7 max-thinking: score 0.33; cost $1.94; runtime 5m32s; 0.17 score/$
Respected the orthodontist constraint and destination; failed required evidence, seat, flight, and hotel output.

019. Forwarded hotel for the upcoming trip.
This is a wrong-owner trap: the only obvious hotel confirmation matches a plausible trip but explicitly belongs to someone outside the household.

Pack: score 1.00; cost $0.050; runtime 37s; 20.00 score/$
Passed: rejected the forwarded hotel because it did not belong to the household travelers and asked for valid trip evidence.

GPT-5.5 xhigh: score 0.03; cost $20.46; runtime 10m57s; 0.00 score/$
Returned a Tokyo family trip from Japan evidence instead of rejecting the wrong-owner forwarded hotel.

Claude Opus 4.7 max-thinking: score 1.00; cost $0.95; runtime 2m45s; 1.05 score/$
Correctly abstained, rejected the external friend's hotel, and asked for clarification.

039. @adam NYC meeting trip.
The correct answer is no travel because Adam is already local; generic NYC meeting evidence pushes planners toward unnecessary flights and hotels.

Pack: score 1.00; cost $0.050; runtime 32s; 20.00 score/$
Passed: recognized Adam was already covered by a temporary New York home and returned no travel needed.

GPT-5.5 xhigh: score 0.03; cost $0.66; runtime 2m54s; 0.05 score/$
Answered the wrong NYC no-travel case and missed the Blueground temporary-home window.

Claude Opus 4.7 max-thinking: score 0.03; cost $2.95; runtime 3m25s; 0.01 score/$
Answered a different Midtown no-travel case instead of the Blueground long-stay case.

047. Miami F1 trip.
A promotional family-package honeypot conflicts with sparse real evidence, so the right response is clarification instead of booking a complete trip.

Pack: score 1.00; cost $0.077; runtime 43s; 12.99 score/$
Passed: identified unresolved Miami GP traveler ambiguity and asked for the traveler set before planning.

GPT-5.5 xhigh: score 0.00; cost $20.57; runtime 9m21s; 0.00 score/$
Hit tool-call budget before returning a final plan

Claude Opus 4.7 max-thinking: score 0.36; cost $2.53; runtime 5m37s; 0.14 score/$
Rejected the honeypot and declined to book, but missed the required ambiguity evidence and traveler-set clarification.

052. Barcelona Apr 4-7 for all four of us.
It looks like a normal four-person trip, but the requested window is blocked or unbookable; the system must prove infeasibility rather than force inventory.

Pack: score 0.45; cost $0.097; runtime 1m29s; 4.64 score/$
Did not pass: detected the Barcelona blocked window but asked for alternate dates instead of returning the expected impossible outcome.

GPT-5.5 xhigh: score 0.25; cost $0.39; runtime 57s; 0.64 score/$
Found the blocked evidence and selected no inventory, but did not cleanly return the impossible outcome.

Claude Opus 4.7 max-thinking: score 0.03; cost $1.29; runtime 4m00s; 0.02 score/$
Returned infeasible, but for no-inventory supply reasons and the wrong year, not the all-travelers-blocked reason.

058. @adam Midtown and Roam week.
Another no-travel case: the task is to connect private Roam/Tanooki context with Adam already being in Midtown, then decline travel planning.

Pack: score 1.00; cost $0.115; runtime 44s; 8.70 score/$
Passed: recognized the Midtown/Roam request was local and returned no travel needed.

GPT-5.5 xhigh: score 1.00; cost $0.32; runtime 1m59s; 3.13 score/$
Passed local Midtown no-travel case

Claude Opus 4.7 max-thinking: score 1.00; cost $0.24; runtime 1m07s; 4.17 score/$
Correctly returned no travel required

067. Met Gala, then Knicks.
The terse prompt requires resolving event eligibility, travelers, destination, duration, and Knicks timing without overdeclining or inventing missing evidence.

Pack: score 0.03; cost $0.065; runtime 38s; 0.46 score/$
Did not pass: asked for destination clarification instead of resolving the New York Met Gala/Knicks context and no-game note.

GPT-5.5 xhigh: score 0.07; cost $16.83; runtime 11m56s; 0.00 score/$
Returned a Japan family trip instead of the Met Gala and Knicks New York task.

Claude Opus 4.7 max-thinking: score 0.07; cost $1.65; runtime 5m27s; 0.04 score/$
Overdeclined for missing invitation/ticket evidence instead of planning Adam's New York trip and noting no Knicks game.
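
The summary figures can be cross-checked from the per-case rows. The sketch below is illustrative only: the scores are transcribed from the rows above, the helper names (`average_score`, `full_passes`, `score_per_dollar`) are hypothetical, and it assumes that a full pass corresponds to a decimal score of 1.00 and that score/$ is simply the decimal score divided by the case cost.

```python
# Decimal scores transcribed from the case rows above, in case order:
# 001, 002, 003, 005, 019, 039, 047, 052, 058, 067.
SCORES = {
    "Pack": [0.10, 0.35, 0.23, 1.00, 1.00, 1.00, 1.00, 0.45, 1.00, 0.03],
    "GPT-5.5 xhigh": [0.46, 0.25, 0.46, 0.33, 0.03, 0.03, 0.00, 0.25, 1.00, 0.07],
    "Claude Opus 4.7": [0.46, 0.25, 0.17, 0.33, 1.00, 0.03, 0.36, 0.03, 1.00, 0.07],
}


def average_score(values):
    """Mean decimal score across the selected cases."""
    return sum(values) / len(values)


def full_passes(values):
    """Count cases with a full pass (assumed here to mean a score of 1.00)."""
    return sum(1 for value in values if value == 1.00)


def score_per_dollar(score, cost_usd):
    """Assumed definition of the score/$ column: score divided by case cost."""
    return score / cost_usd


if __name__ == "__main__":
    for system, values in SCORES.items():
        print(system, full_passes(values), round(average_score(values), 2))
    # Pack on case 005: score 1.00 at a cost of $0.072.
    print(round(score_per_dollar(1.00, 0.072), 2))
```

Under these assumptions the pass counts come out to 5, 1, and 2, and the GPT-5.5 xhigh and Claude Opus 4.7 averages round to the 0.29 and 0.37 reported in the summary cards.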

What The Test Includes

The benchmark combines private inbox context, calendar context, and deterministic flight and hotel inventory.

40k emails

1. Travel Extractor

Runs Pack's real streaming extractor over Gmail-shaped messages and calendar events, then emits profile JSON for trips, cancellations, changes, loyalty, preferences, and costs.

100 hard prompts

2. Trip Planner

Runs Pack's real planner on human-written requests with extracted family context, obligations, public-event timing, prior travel, and noisy private context.

1M + 1M inventory

3. Travel Search

Executes deterministic flight and hotel search from Pack planner outlines, then scores seat fit, price, stops, refundability, room capacity, location, and preference match.

Rules

  1. Every system gets the same private inbox, calendar, and travel-search tools.
  2. The user prompt is short; the missing details have to be found in the private context.
  3. A response must be readable enough to normalize into the same scoring rubric.
  4. A case only passes when the answer is grounded in the right evidence and returns the right travel outcome.
  5. Missing final answers, unscorable content, bad IDs, missing evidence, timeouts, and service limits score 0.

How A Case Passes

Final Answer

50% of the score. The returned outcome has to be fully correct: bookable trip, no travel, impossible, or clarification.

Core Trip Details

30% of the score. The answer must get travelers, dates, destination, duration, conflicts, and hidden constraints right.

Inventory Or Outcome

10% of the score. Complete trips need valid flight and hotel choices; non-trip cases need the correct no-travel, impossible, or clarification result.

Evidence

7% of the score. The answer must rely on the right private evidence and avoid tempting wrong-owner, promo, stale, or unrelated context.

Scorable Output

3% of the score. The response has to be readable enough to normalize into the shared benchmark fields before cutoff.
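
As a rough illustration of how the five weights combine, here is a minimal sketch. The weights are the ones listed above; everything else is an assumption: the component keys, the `decimal_score` helper, and the idea that each component is graded as a fraction between 0 and 1 (the page does not say how partial credit is assigned within a component).

```python
# Rubric weights in percent, as listed in this section.
WEIGHTS = {
    "final": 50,      # Final Answer
    "details": 30,    # Core Trip Details
    "inventory": 10,  # Inventory Or Outcome
    "evidence": 7,    # Evidence
    "output": 3,      # Scorable Output
}


def decimal_score(components):
    """Weighted sum of component fractions; missing components count as 0.

    `components` maps a rubric key to a fraction in [0, 1]. The fractional
    grading is an assumption made for illustration.
    """
    for name, value in components.items():
        if name not in WEIGHTS:
            raise ValueError(f"unknown rubric component: {name}")
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1, got {value}")
    total = sum(WEIGHTS[name] * components.get(name, 0.0) for name in WEIGHTS)
    return round(total / 100, 2)


if __name__ == "__main__":
    # A readable answer that gets nothing else right earns only the output weight.
    print(decimal_score({"output": 1.0}))
    # Details, inventory, and output right, but a wrong final answer.
    print(decimal_score({"details": 1.0, "inventory": 1.0, "output": 1.0}))
```

Because the final answer carries half the weight, an answer with a wrong outcome cannot exceed 0.50 under this sketch, however much supporting detail it gets right.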

Full Pack Run

Pack hard-100 result

Pack passed 93 of 100 hard travel-planning cases in the current full hard-100 run.

Hard-100 evidence set: 93/100
Hard-100 total cost: $4.39
Average hard-100 case cost: $0.0439
Hard-100 runtime: 20m14s wall clock
Average hard-100 runtime: 46.7s processing/case
Selected case result: 5/10 selected hard cases passed in the ten-case comparison set
Selected set runtime: 8m42s across selected ten
Comparison set: All 100 hard cases run end to end
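
The figures above relate to each other by simple arithmetic, sketched below. The `parse_duration` helper is hypothetical, and the per-case runtimes are transcribed from the Pack entries in the ten-case comparison. The final ratio is only an inference: the page does not say how cases were scheduled, so reading it as overlap between cases is an assumption.

```python
import re

# Matches durations written like "20m14s", "8m42s", or "50s".
_DURATION = re.compile(r"(?:(\d+)m)?(?:(\d+)s)?")


def parse_duration(text):
    """Convert a duration string such as '8m42s' into seconds."""
    match = _DURATION.fullmatch(text)
    if not text or match is None:
        raise ValueError(f"unrecognized duration: {text!r}")
    minutes, seconds = match.groups()
    return int(minutes or 0) * 60 + int(seconds or 0)


# Pack runtimes for the ten selected cases, transcribed from the comparison.
PACK_SELECTED_RUNTIMES = [
    "50s", "2m07s", "30s", "32s", "37s", "32s", "43s", "1m29s", "44s", "38s",
]

CASES = 100
average_cost = 4.39 / CASES                    # dollars per hard-100 case
processing_seconds = 46.7 * CASES              # summed per-case processing time
wall_clock_seconds = parse_duration("20m14s")  # reported wall-clock runtime
implied_overlap = processing_seconds / wall_clock_seconds

if __name__ == "__main__":
    print(round(average_cost, 4))
    print(sum(parse_duration(r) for r in PACK_SELECTED_RUNTIMES))
    print(round(implied_overlap, 1))
```

The ten selected Pack runtimes add up to 522 seconds, which is the 8m42s reported for the selected set. Total processing time is roughly 3.8 times the wall-clock time, which would be consistent with several cases running at once.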

Hard-100 Case Browser

The ten-case comparison is drawn from this full set of 100 hard travel requests.

001. @family Japan for about a week.
002. @bel Paris fashion week.
003. @adam to Tokyo, use the airline credit if we still can.
004. @chase to Denver for that show.
005. @danny Orlando theme park weekend.
006. James wedding near Tahoe.
007. Avery and Jamie wedding, then Tulum.
008. Riley's bachelor weekend for @adam.
009. US Open weekend for @adam.
010. @adam Betaworks trip.
011. Rome for @family.
012. @bel London trip.
013. Book the offsite travel.
014. Broncos in Denver, then Vail ski nights for @chase.
015. @danny museum weekend.
016. Miami for @adam and @bel, with @bel staying longer.
017. @family spring break in Japan.
018. Japan trip with the reservations we have.
019. Forwarded hotel for the upcoming trip.
020. Fix the changed flight time.
021. Japan again, but avoid that bad connection from last time.
022. Use expiring points or credits for the next trip.
023. @bel's Paris event weekend.
024. @adam flight using Alaska or Delta if it makes sense.
025. @chase gets a window seat if possible.
026. @danny gets a window seat if possible.
027. @adam and @bel Japan.
028. @family Japan with the friend if that works.
029. @adam and @bel Japan during their shared time off.
030. @bel's conference travel.
031. Airbnb for the upcoming trip.
032. Airline follow-up for the next trip.
033. Use that fare sale if it helps.
034. Conference travel after the city changed.
035. Finish the hotel for the trip.
036. Finish the flights for the trip.
037. Fix the return flight after the event.
038. Move the return flight if there is a better option.
039. @adam NYC meeting trip.
040. One way after the conference.
041. @bel and @adam to Paris around NYFW.
042. Milan design week with @chase and @danny.
043. NYC marathon weekend for @family.
044. Tokyo cherry blossoms.
045. @bel London theatre weekend.
046. @bel to Austin GP.
047. Miami F1 trip.
048. Beach wedding, then investor breakfast.
049. Riley's bachelor weekend in Nashville.
050. San Diego for all four of us.
051. @adam and @bel Maui.
052. Barcelona Apr 4-7 for all four of us.
053. Grandparents meet us in Orlando.
054. Boston appointment travel.
055. @adam to Lisbon after the conference.
056. Customer summit travel.
057. Seattle Apr 10-14.
058. @adam Midtown and Roam week.
059. NYC dinners next week.
060. Ceremony near the Tahoe chapel.
061. Event trip after the city moved.
062. @bel to Pitti.
063. Sundance, then a quiet cabin.
064. Watches and Wonders, then Annecy one night.
065. Osheaga, then somewhere calmer nearby.
066. ACL, then somewhere cold and quiet.
067. Met Gala, then Knicks.
068. Primavera, then Menorca.
069. Tokyo Marathon, then Kyoto.
070. Gion Matsuri, then Nara.
071. Nashville CMA for @bel.
072. Santa Fe Indian Market with @family.
073. @adam Zurich Street Parade.
074. Lake Como during Milan derby.
075. @adam and @bel Charleston weekend.
076. Memorial Day beach in San Diego.
077. @bel Palm Springs long weekend.
078. @family long weekend to Vancouver or San Francisco.
079. Hamilton in NYC this summer.
080. Louis the Child in Chicago, then three quiet days nearby.
081. Vegas two nights, maybe Warriors too.
082. Salt Lake weekend with snow and hot springs.
083. Extend Boston through Tuesday.
084. NYC around Roam and Tanooki.
085. Reno July Fourth.
086. Cancun wedding flight only.
087. Cancun wedding hotel only.
088. Cancun wedding flight and hotel for @adam and @bel.
089. After the wedding, somewhere cold for four nights.
090. After the Cancun wedding, quiet scenic two nights.
091. Lisbon four nights.
092. One way to Paris.
093. Chicago weekend trip after work Friday.
094. After Midtown meetings, see if Knicks works.
095. Porto during Primavera Pro.
096. Taipei Lantern Festival.
097. Vienna opera, then somewhere rainy and bookish nearby.
098. Rosalia in Montreal, then Quebec City.
099. Orlando family trip.
100. Tokyo food week with @family.