Benchmark pack-deeperbench-v0

Pack DeeperBench

A synthetic benchmark for travel-planning architecture over private context, calendar constraints, public timing, deterministic flight inventory, and deterministic hotel inventory.

Pack hard-100 full run verified May 21, 2026. The reported run covers all 100 hard-corpus cases.

Hard-100 Pack run: 100/100
Hard-100 cost: $5.22
Average case cost: $0.0522
Average runtime: 49.5s processing/case
LLM calls: 628

Full-Corpus Result

Pack hard-100 run

The run includes all 100 hard-corpus cases. Final pass count was 100/100.

Final pass count: 100/100
Hard-100 total cost: $5.22
Average hard-100 case cost: $0.0522
Hard-100 runtime: 24m24s wall clock
Average hard-100 runtime: 49.5s processing/case
LLM calls: 628

Metric Definitions

Final pass count: 100/100

Number of cases where the final answer matched the expected outcome under the rubric.

Runtime: 24m24s wall clock; 49.5s processing/case

Observed execution time for the run, reported as total wall-clock time and average processing time per case.

Cost: $5.22 total; $0.0522 average per case

Measured model and tool execution cost for the run, reported as total cost and average cost per case.

Model-call count: 628 LLM calls

Number of LLM calls made during the run. This records how many model steps were needed to complete the corpus.
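
The per-case averages follow directly from the run totals. The sketch below reproduces that arithmetic using only the figures reported above; the variable names are illustrative, and the overlap ratio is an inference from the two runtime figures, not a number the page states.

```python
# Totals reported for the hard-100 run.
CASES = 100
TOTAL_COST_USD = 5.22
LLM_CALLS = 628
WALL_CLOCK_S = 24 * 60 + 24   # 24m24s wall clock
AVG_PROCESSING_S = 49.5       # average processing time per case

avg_cost = TOTAL_COST_USD / CASES
calls_per_case = LLM_CALLS / CASES

# Summed processing time is larger than wall-clock time, which is
# consistent with cases overlapping in time (an inference, not a
# fact stated by the benchmark page).
total_processing_s = AVG_PROCESSING_S * CASES
overlap = total_processing_s / WALL_CLOCK_S

print(f"${avg_cost:.4f} average cost per case")
print(f"{calls_per_case:.2f} LLM calls per case")
print(f"{overlap:.1f}x processing-to-wall-clock ratio")
```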

Benchmark Scope

  1. The corpus is synthetic and uses generated inbox, calendar, public-event, flight, and hotel data.
  2. Prompts are intentionally short. Required facts may be present only in private context or deterministic inventory.
  3. Pack DeeperBench measures a complete travel-planning architecture, not a standalone foundation model in isolation.
  4. Pack uses structured retrieval, context translation, and planning layers to turn large personal-data surfaces into grounded trip decisions.
  5. Baseline agents operate through direct tool-calling loops over the same synthetic task environment.
  6. A passing final answer must match one of the expected outcome classes: bookable trip, no travel needed, impossible, or clarification required.
  7. When a baseline produced substantively useful work without a fully passing final answer, the table reports partial rubric credit for evidence, constraints, and inventory.
  8. The GPT-5.5 xhigh and Claude Opus 4.7 rows cover a fixed 10-case test set intentionally selected from especially difficult hard-100 cases. They are not full hard-100 results.
  9. The comparison should be read as Pack's domain architecture versus general-purpose frontier agents using direct tool calls.

Architecture Comparison

Ten selected hard-100 cases run across Pack and frontier-agent baselines.

This section reports a fixed test set of ten especially difficult cases chosen from the hard-100 corpus. Pack runs through its travel-planning architecture; GPT-5.5 xhigh and Claude Opus 4.7 max-thinking run as general-purpose agents using direct tool calls.

Cases solved

Final content passes within the selected ten-case hard set.

Pack: 10/10
GPT-5.5 xhigh: 1/10
Opus 4.7: 2/10

Total spend

Full model and tool cost for each ten-case hard-set run.

Pack: $1.11
GPT-5.5 xhigh: $86.60
Opus 4.7: $17.15

Runtime

Observed wait time for each ten-case hard-set evaluation.

Pack: 10m46s
GPT-5.5 xhigh: 67m16s
Opus 4.7: 38m45s

Pack

10/10 hard set
$1.11 (1x Pack cost)
10m46s across the ten hard cases
10/10 final content pass; 10/10 scorable output

Summary: Used Pack's retrieval, context, and planning layers to return a passing final answer for every selected hard case.

Scorable output: 10/10. Readable enough to score before cutoff; 3% of the score.
Evidence: 10/10. Right private evidence and red-herring avoidance; 7% of the score.
Trip details: 10/10. Travelers, dates, destination, conflicts, and hidden conditions; 30% of the score.
Inventory/outcome: 10/10. Valid inventory or the right no-travel, impossible, or clarification state; 10% of the score.
Final pass: 10/10. The final answer was fully correct; 50% of the decimal score.

GPT-5.5 xhigh

1/10 hard set
$86.60 (77.8x Pack cost)
67m16s across the ten hard cases
1/10 final content pass; 0.29 average score

Summary: One final answer passed. Other cases received partial rubric credit where evidence, constraints, or inventory were correct.

Scorable output: 9/10. Readable enough to score before cutoff; 3% of the score.
Evidence: 6/10. Right private evidence and red-herring avoidance; 7% of the score.
Trip details: 5/10. Travelers, dates, destination, conflicts, and hidden conditions; 30% of the score.
Inventory/outcome: 5/10. Valid inventory or the right no-travel, impossible, or clarification state; 10% of the score.
Final pass: 1/10. The final answer was fully correct; 50% of the decimal score.

Claude Opus 4.7 max-thinking

2/10 hard set
$17.15 (15.4x Pack cost)
38m45s across the ten hard cases
2/10 final content pass; 0.37 average score

Summary: Two final answers passed. Other cases received partial rubric credit where evidence, constraints, or inventory were correct.

Scorable output: 10/10. Readable enough to score before cutoff; 3% of the score.
Evidence: 3/10. Right private evidence and red-herring avoidance; 7% of the score.
Trip details: 6/10. Travelers, dates, destination, conflicts, and hidden conditions; 30% of the score.
Inventory/outcome: 5/10. Valid inventory or the right no-travel, impossible, or clarification state; 10% of the score.
Final pass: 2/10. The final answer was fully correct; 50% of the decimal score.

Pack values are the matching cases from the May 21 full hard-100 run. Costs show the uncached cost for each ten-case hard-set run. These rows describe the selected hard cases, not full hard-100 evaluations for GPT-5.5 xhigh or Claude Opus 4.7.
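
The cost multiples above can be checked from the published figures. The sketch below sums Pack's per-case costs from the case table and divides each baseline total by that sum. Because the page shows rounded values, the result lands within rounding of the reported 77.8x and 15.4x and does not match them exactly. The helper `to_seconds` and all variable names are illustrative.

```python
import re


def to_seconds(text: str) -> int:
    """Convert a duration such as '10m46s' or '49s' to whole seconds."""
    match = re.fullmatch(r"(?:(\d+)m)?(\d+)s", text)
    if match is None:
        raise ValueError(f"unrecognized duration: {text!r}")
    minutes, seconds = match.groups()
    return int(minutes or 0) * 60 + int(seconds)


# Pack per-case costs for the ten selected cases, as shown in the case table.
pack_case_costs = [0.103, 0.101, 0.055, 0.095, 0.062,
                   0.078, 0.175, 0.190, 0.153, 0.100]
pack_total = sum(pack_case_costs)  # about 1.112, displayed as $1.11

baseline_totals = {
    "GPT-5.5 xhigh": 86.60,
    "Claude Opus 4.7 max-thinking": 17.15,
}
cost_ratios = {name: total / pack_total
               for name, total in baseline_totals.items()}

runtimes = {
    "Pack": "10m46s",
    "GPT-5.5 xhigh": "67m16s",
    "Claude Opus 4.7 max-thinking": "38m45s",
}
runtime_seconds = {name: to_seconds(text) for name, text in runtimes.items()}

for name, ratio in cost_ratios.items():
    print(f"{name}: {ratio:.1f}x Pack cost, {runtime_seconds[name]}s runtime")
```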

Case Results

Each row is one of ten especially difficult cases selected from the hard-100 corpus as a focused test set. The table reports rubric score, cost, runtime, rubric components, and the scored result note for Pack's architecture and the frontier-agent baselines on those same cases.

Columns, in order: Case, Pack, GPT-5.5 xhigh, Claude Opus 4.7 max-thinking.
Rubric component weights shown with each result: Output 3%, Evidence 7%, Details 30%, Search 10%, Final 50%.

001. @family Japan for about a week.
Requires finding the real school-break/PTO window across private mail and calendar, then avoiding a tempting but wrong earlier Japan window.

Pack: 1.00 | Cost: $0.103 | Runtime: 49s
Passed: used the 2027 family spring-break window, planned Tokyo and Osaka for all four travelers, and kept the trip dates inside the verified availability window.

GPT-5.5 xhigh: 0.46 | Cost: $13.22 | Runtime: 11m04s
Right destination, travelers, dates, and inventory; missed the required school-break email evidence.

Claude Opus 4.7 max-thinking: 0.46 | Cost: $2.00 | Runtime: 4m26s
Right Japan window, travelers, and hotel; used repositioning legs and missed the required school-break email evidence.

002. @bel Paris fashion week.
The prompt hides the actual fashion-week dates in private context and still requires valid flight and hotel inventory, not just the event city.

Pack: 1.00 | Cost: $0.101 | Runtime: 2m22s
Passed: used the Bel-only Paris Fashion Week evidence, preserved the September 28 private date, and completed the Paris plan with valid inventory.

GPT-5.5 xhigh: 0.25 | Cost: $1.66 | Runtime: 5m13s
Found the private Paris date and Bel-only traveler; did not return valid flight or hotel selections.

Claude Opus 4.7 max-thinking: 0.25 | Cost: $1.43 | Runtime: 3m16s
Correctly used the private Sep 28 Paris date and Bel-only traveler; left flight and hotel inventory unselected.

003. @adam to Tokyo, use the airline credit if we still can.
The model has to verify credit eligibility, dates, and seat-map evidence; using the credit without the hidden condition is wrong.

Pack: 1.00 | Cost: $0.055 | Runtime: 45s
Passed: planned Adam's Tokyo trip after respecting the airline-credit condition and the required seat-map evidence.

GPT-5.5 xhigh: 0.46 | Cost: $4.96 | Runtime: 3m32s
Got the Tokyo solo trip and inventory; missed the required seat-map evidence.

Claude Opus 4.7 max-thinking: 0.17 | Cost: $2.18 | Runtime: 3m10s
Planned Adam to Tokyo, but used the airline credit when the hidden condition made it unsafe.

005. @danny Orlando theme park weekend.
Danny's trip depends on a private appointment constraint plus evidence for the right traveler, destination, and bookable inventory.

Pack: 1.00 | Cost: $0.095 | Runtime: 38s
Passed: resolved Danny's Orlando weekend, appointment constraint, traveler scope, and bookable inventory.

GPT-5.5 xhigh: 0.33 | Cost: $7.53 | Runtime: 9m22s
Got Danny, Orlando, and the appointment constraint; missed required evidence and inventory.

Claude Opus 4.7 max-thinking: 0.33 | Cost: $1.94 | Runtime: 5m32s
Respected the orthodontist constraint and destination; failed required evidence, seat, flight, and hotel output.

019. Forwarded hotel for the upcoming trip.
This is a wrong-owner trap: the only obvious hotel confirmation matches a plausible trip but explicitly belongs to someone outside the household.

Pack: 1.00 | Cost: $0.062 | Runtime: 37s
Passed: rejected the external forwarded hotel because it did not belong to the household travelers and asked for valid trip evidence.

GPT-5.5 xhigh: 0.03 | Cost: $20.46 | Runtime: 10m57s
Returned a Tokyo family trip from Japan evidence instead of rejecting the wrong-owner forwarded hotel.

Claude Opus 4.7 max-thinking: 1.00 | Cost: $0.95 | Runtime: 2m45s
Correctly abstained, rejected the external friend's hotel, and asked for clarification.

039. @adam NYC meeting trip.
The correct answer is no travel because Adam is already local; generic NYC meeting evidence pushes planners toward unnecessary flights and hotels.

Pack: 1.00 | Cost: $0.078 | Runtime: 51s
Passed: recognized Adam was already covered by a temporary New York home and returned no travel needed.

GPT-5.5 xhigh: 0.03 | Cost: $0.66 | Runtime: 2m54s
Answered the wrong NYC no-travel case and missed the Blueground temporary-home window.

Claude Opus 4.7 max-thinking: 0.03 | Cost: $2.95 | Runtime: 3m25s
Answered a different Midtown no-travel case instead of the Blueground long-stay case.

047. Miami F1 trip.
A promotional family-package honeypot conflicts with sparse real evidence, so the right response is clarification instead of booking a complete trip.

Pack: 1.00 | Cost: $0.175 | Runtime: 1m13s
Passed: identified unresolved Miami GP traveler ambiguity and asked for the traveler set before planning.

GPT-5.5 xhigh: 0.00 | Cost: $20.57 | Runtime: 9m21s
Hit the tool-call budget before returning a final plan.

Claude Opus 4.7 max-thinking: 0.36 | Cost: $2.53 | Runtime: 5m37s
Rejected the honeypot and declined to book, but missed the required ambiguity evidence and traveler-set clarification.

052. Barcelona Apr 4-7 for all four of us.
It looks like a normal four-person trip, but the requested window is blocked; the system must surface the schedule conflict rather than force inventory.

Pack: 1.00 | Cost: $0.190 | Runtime: 1m44s
Passed: found the all-travelers blocked April 3-8 window and returned a schedule-constraint clarification instead of forcing inventory.

GPT-5.5 xhigh: 0.25 | Cost: $0.39 | Runtime: 57s
Found the blocked evidence and selected no inventory, but did not cleanly return the impossible outcome.

Claude Opus 4.7 max-thinking: 0.03 | Cost: $1.29 | Runtime: 4m00s
Returned infeasible, but for no-inventory supply reasons and the wrong year, not the all-travelers-blocked reason.

058. @adam Midtown and Roam week.
Another no-travel case: the task is to connect private Roam/Tanooki context with Adam already being in Midtown, then decline travel planning.

Pack: 1.00 | Cost: $0.153 | Runtime: 59s
Passed: recognized the Midtown/Roam request was local and returned no travel needed.

GPT-5.5 xhigh: 1.00 | Cost: $0.32 | Runtime: 1m59s
Passed the local Midtown no-travel case.

Claude Opus 4.7 max-thinking: 1.00 | Cost: $0.24 | Runtime: 1m07s
Correctly returned no travel required.

067. Met Gala, then Knicks.
The terse prompt requires resolving event eligibility, travelers, destination, local coverage, and Knicks timing without inventing missing evidence.

Pack: 1.00 | Cost: $0.100 | Runtime: 47s
Passed: resolved the New York context and returned no travel needed because Adam was already covered by the temporary home.

GPT-5.5 xhigh: 0.07 | Cost: $16.83 | Runtime: 11m56s
Returned a Japan family trip instead of the Met Gala and Knicks New York task.

Claude Opus 4.7 max-thinking: 0.07 | Cost: $1.65 | Runtime: 5m27s
Overdeclined for missing invitation/ticket evidence instead of planning Adam's New York trip and noting no Knicks game.
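
The average scores reported in the architecture comparison (0.29 and 0.37) can be recomputed from the per-case scores in this table. The sketch below does that. It treats a score of 1.00 as a final pass, which holds for the rows shown here but is an assumption about the rubric in general; `summarize` is an illustrative helper.

```python
# Per-case rubric scores in table order: 001, 002, 003, 005, 019,
# 039, 047, 052, 058, 067.
scores = {
    "GPT-5.5 xhigh": [0.46, 0.25, 0.46, 0.33, 0.03,
                      0.03, 0.00, 0.25, 1.00, 0.07],
    "Claude Opus 4.7 max-thinking": [0.46, 0.25, 0.17, 0.33, 1.00,
                                     0.03, 0.36, 0.03, 1.00, 0.07],
}


def summarize(case_scores: list[float]) -> tuple[float, int]:
    """Return (average score rounded to 2 places, count of 1.00 scores)."""
    average = round(sum(case_scores) / len(case_scores), 2)
    full_passes = sum(1 for score in case_scores if score == 1.00)
    return average, full_passes


for name, case_scores in scores.items():
    average, full_passes = summarize(case_scores)
    print(f"{name}: {average:.2f} average, "
          f"{full_passes}/{len(case_scores)} final passes")
```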

Benchmark Inputs

The benchmark combines private inbox context, calendar context, and deterministic flight and hotel inventory.

1. Context Graph Build (40k emails)

The benchmark is built around a large synthetic household record: inbox history, calendar history, trips, cancellations, changes, loyalty, preferences, obligations, and costs.

2. Travel Reasoning (100 hard prompts)

Short human requests require decisions grounded in household context, obligations, public-event timing, prior travel, and noisy private evidence.

3. Search Grounding (1M + 1M inventory)

Flight and hotel choices are checked against deterministic inventory for seat fit, price, stops, refundability, room capacity, location, and preference match.
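
As an illustration of what checking a choice against deterministic inventory can look like, the sketch below filters a tiny flight list by seat fit, stops, refundability, and price. The `Flight` record, its fields, the sample rows, and `valid_flights` are all hypothetical; the page does not describe the benchmark's real inventory schema or checker.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Flight:
    """Hypothetical inventory row; the real schema is not published."""
    flight_id: str
    seats_available: int
    stops: int
    refundable: bool
    price_usd: float  # price per traveler


def valid_flights(inventory, travelers, max_stops,
                  require_refundable, max_price_usd):
    """Return flights that satisfy every constraint, cheapest first."""
    matches = [
        flight for flight in inventory
        if flight.seats_available >= travelers
        and flight.stops <= max_stops
        and (flight.refundable or not require_refundable)
        and flight.price_usd <= max_price_usd
    ]
    # Sort on (price, id) so ties resolve the same way on every run.
    return sorted(matches, key=lambda f: (f.price_usd, f.flight_id))


inventory = [
    Flight("F1", seats_available=4, stops=0, refundable=True, price_usd=420.0),
    Flight("F2", seats_available=2, stops=0, refundable=True, price_usd=300.0),
    Flight("F3", seats_available=4, stops=2, refundable=True, price_usd=250.0),
    Flight("F4", seats_available=4, stops=1, refundable=False, price_usd=380.0),
    Flight("F5", seats_available=5, stops=1, refundable=True, price_usd=410.0),
]

options = valid_flights(inventory, travelers=4, max_stops=1,
                        require_refundable=True, max_price_usd=500.0)
print([flight.flight_id for flight in options])
```

Because the filter and the sort key are pure functions of the inventory, the same request yields the same ordered result on every run, which is the property a deterministic inventory check relies on.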

Protocol

  1. Every system is evaluated against the same synthetic household, private context, calendar constraints, public timing, and travel inventory.
  2. The user prompt is short; the system has to recover missing trip details from surrounding context instead of relying on the prompt alone.
  3. Pack runs as a domain planning architecture. Baseline agents use direct tool calls and have to discover, reconcile, and structure the relevant evidence themselves.
  4. A case only passes when the final outcome is grounded in the right evidence and matches the expected travel decision.
  5. Unsupported answers, missing final decisions, timeouts, and service-limit failures do not receive final-pass credit.
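
The pass rule in the protocol can be sketched as a small predicate. The outcome classes come from the benchmark scope. The `CaseResult` record, the evidence identifiers, and the `final_pass` function are hypothetical names used only for illustration, and the sketch leaves out the inventory validation that bookable trips also require.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Outcome(Enum):
    BOOKABLE_TRIP = "bookable trip"
    NO_TRAVEL_NEEDED = "no travel needed"
    IMPOSSIBLE = "impossible"
    CLARIFICATION_REQUIRED = "clarification required"


@dataclass(frozen=True)
class CaseResult:
    """Hypothetical record of one system's answer to one case."""
    # None models a missing final decision: timeout, tool-call budget,
    # or service-limit failure.
    outcome: Optional[Outcome]
    evidence_ids: frozenset


def final_pass(result: CaseResult, expected: Outcome,
               required_evidence: frozenset) -> bool:
    """True only if the outcome matches and rests on the right evidence."""
    if result.outcome is None:
        return False  # no final decision, so no final-pass credit
    if result.outcome is not expected:
        return False  # wrong travel decision
    # Every required evidence item must be cited by the answer.
    return required_evidence <= result.evidence_ids


needed = frozenset({"email:temporary-home"})
grounded = CaseResult(Outcome.NO_TRAVEL_NEEDED, needed)
ungrounded = CaseResult(Outcome.NO_TRAVEL_NEEDED, frozenset())
timed_out = CaseResult(None, frozenset())

print(final_pass(grounded, Outcome.NO_TRAVEL_NEEDED, needed))
print(final_pass(ungrounded, Outcome.NO_TRAVEL_NEEDED, needed))
print(final_pass(timed_out, Outcome.NO_TRAVEL_NEEDED, needed))
```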

Rubric

Core Trip Details

30% of the score. The answer must get travelers, dates, destination, duration, conflicts, and hidden constraints right.

Inventory Or Outcome

10% of the score. Complete trips need valid flight and hotel choices; non-trip cases need the correct no-travel, impossible, or clarification result.

Evidence

7% of the score. The answer must rely on the right private evidence and avoid tempting wrong-owner, promo, stale, or unrelated context.

Scorable Output

3% of the score. The response has to be clear enough to score against the shared rubric before cutoff.

Final Answer

50% of the score. The returned outcome has to be fully correct: bookable trip, no travel, impossible, or clarification.
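
The five weights sum to 100%, so a case score can be read as a weighted sum of component credit. The sketch below assumes each component earns a fraction between 0 and 1 and that components combine linearly. The page does not say how partial credit inside a component is assigned, so `rubric_score` and the component keys are illustrative only.

```python
# Rubric weights as stated above.
WEIGHTS = {
    "details": 0.30,               # Core Trip Details
    "inventory_or_outcome": 0.10,  # Inventory Or Outcome
    "evidence": 0.07,              # Evidence
    "output": 0.03,                # Scorable Output
    "final": 0.50,                 # Final Answer
}


def rubric_score(credit: dict) -> float:
    """Weighted sum of per-component credit, rounded to two decimals.

    `credit` maps component name to the fraction earned (0.0 to 1.0).
    Missing components count as zero. Linear combination is an assumption.
    """
    for name, fraction in credit.items():
        if name not in WEIGHTS:
            raise KeyError(f"unknown rubric component: {name}")
        if not 0.0 <= fraction <= 1.0:
            raise ValueError(f"credit for {name} must be within [0, 1]")
    total = sum(weight * credit.get(name, 0.0)
                for name, weight in WEIGHTS.items())
    return round(total, 2)


full_pass = rubric_score({name: 1.0 for name in WEIGHTS})
output_only = rubric_score({"output": 1.0})
no_output = rubric_score({})

print(full_pass, output_only, no_output)
```

Under these assumptions a fully correct answer scores 1.00, and an answer that is scorable but wrong on every other component scores 0.03, which matches the lowest non-zero scores in the case table.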

Hard-100 Corpus Cases

The official hard-100 result covers every request listed here.

001. @family Japan for about a week.
002. @bel Paris fashion week.
003. @adam to Tokyo, use the airline credit if we still can.
004. @chase to Denver for that show.
005. @danny Orlando theme park weekend.
006. James wedding near Tahoe.
007. Avery and Jamie wedding, then Tulum.
008. Riley's bachelor weekend for @adam.
009. US Open weekend for @adam.
010. @adam Betaworks trip.
011. Rome for @family.
012. @bel London trip.
013. Book the offsite travel.
014. Broncos in Denver, then Vail ski nights for @chase.
015. @danny museum weekend.
016. Miami for @adam and @bel, with @bel staying longer.
017. @family spring break in Japan.
018. Japan trip with the reservations we have.
019. Forwarded hotel for the upcoming trip.
020. Fix the changed flight time.
021. Japan again, but avoid that bad connection from last time.
022. Use expiring points or credits for the next trip.
023. @bel's Paris event weekend.
024. @adam flight using Alaska or Delta if it makes sense.
025. @chase gets a window seat if possible.
026. @danny gets a window seat if possible.
027. @adam and @bel Japan.
028. @family Japan with the friend if that works.
029. @adam and @bel Japan during their shared time off.
030. @bel's conference travel.
031. Airbnb for the upcoming trip.
032. Airline follow-up for the next trip.
033. Use that fare sale if it helps.
034. Conference travel after the city changed.
035. Finish the hotel for the trip.
036. Finish the flights for the trip.
037. Fix the return flight after the event.
038. Move the return flight if there is a better option.
039. @adam NYC meeting trip.
040. One way after the conference.
041. @bel and @adam to Paris around NYFW.
042. Milan design week with @chase and @danny.
043. NYC marathon weekend for @family.
044. Tokyo cherry blossoms.
045. @bel London theatre weekend.
046. @bel to Austin GP.
047. Miami F1 trip.
048. Beach wedding, then investor breakfast.
049. Riley's bachelor weekend in Nashville.
050. San Diego for all four of us.
051. @adam and @bel Maui.
052. Barcelona Apr 4-7 for all four of us.
053. Grandparents meet us in Orlando.
054. Boston appointment travel.
055. @adam to Lisbon after the conference.
056. Customer summit travel.
057. Seattle Apr 10-14.
058. @adam Midtown and Roam week.
059. NYC dinners next week.
060. Ceremony near the Tahoe chapel.
061. Event trip after the city moved.
062. @bel to Pitti.
063. Sundance, then a quiet cabin.
064. Watches and Wonders, then Annecy one night.
065. Osheaga, then somewhere calmer nearby.
066. ACL, then somewhere cold and quiet.
067. Met Gala, then Knicks.
068. Primavera, then Menorca.
069. Tokyo Marathon, then Kyoto.
070. Gion Matsuri, then Nara.
071. Nashville CMA for @bel.
072. Santa Fe Indian Market with @family.
073. @adam Zurich Street Parade.
074. Lake Como during Milan derby.
075. @adam and @bel Charleston weekend.
076. Memorial Day beach in San Diego.
077. @bel Palm Springs long weekend.
078. @family long weekend to Vancouver or San Francisco.
079. Hamilton in NYC this summer.
080. Louis the Child in Chicago, then three quiet days nearby.
081. Vegas two nights, maybe Warriors too.
082. Salt Lake weekend with snow and hot springs.
083. Extend Boston through Tuesday.
084. NYC around Roam and Tanooki.
085. Reno July Fourth.
086. Cancun wedding flight only.
087. Cancun wedding hotel only.
088. Cancun wedding flight and hotel for @adam and @bel.
089. After the wedding, somewhere cold for four nights.
090. After the Cancun wedding, quiet scenic two nights.
091. Lisbon four nights.
092. One way to Paris.
093. Chicago weekend trip after work Friday.
094. After Midtown meetings, see if Knicks works.
095. Porto during Primavera Pro.
096. Taipei Lantern Festival.
097. Vienna opera, then somewhere rainy and bookish nearby.
098. Rosalia in Montreal, then Quebec City.
099. Orlando family trip.
100. Tokyo food week with @family.