Pack hard-100 run
The run includes all 100 hard-corpus cases. Final pass count was 100/100.
Benchmark pack-deeperbench-v0
A synthetic benchmark for travel-planning architecture over private context, calendar constraints, public timing, deterministic flight inventory, and deterministic hotel inventory.
Number of cases where the final answer matched the expected outcome under the rubric.
Observed execution time for the run, reported as total wall-clock time and average processing time per case.
Measured model and tool execution cost for the run, reported as total cost and average cost per case.
Number of LLM calls made during the run. This records how many model steps were needed to complete the corpus.
This section reports a fixed test set of ten especially difficult cases chosen from the hard-100 corpus. Pack runs through its travel-planning architecture; GPT-5.5 xhigh and Claude Opus 4.7 max-thinking run as general-purpose agents using direct tool calls.
Final content passes within the selected ten-case hard set.
Full model and tool cost for each ten-case hard-set run.
Observed wait time for each ten-case hard-set evaluation.
Pack summary: Used Pack's retrieval, context, and planning layers to return a passing final answer for every selected hard case.
GPT-5.5 xhigh summary: One final answer passed. Other cases received partial rubric credit where evidence, constraints, or inventory were correct.
Claude Opus 4.7 max-thinking summary: Two final answers passed. Other cases received partial rubric credit where evidence, constraints, or inventory were correct.
Pack values are the matching cases from the May 21 full hard-100 run. Costs show the uncached cost for each ten-case hard-set run. These rows describe the selected hard cases, not full hard-100 evaluations for GPT-5.5 xhigh or Claude Opus 4.7.
Each row is one of ten especially difficult cases selected from the hard-100 corpus as a focused test set. The table reports rubric score, cost, runtime, rubric components, and the scored result note for Pack's architecture and the frontier-agent baselines on those same cases.
| Case | Pack | GPT-5.5 xhigh | Claude Opus 4.7 max-thinking |
|---|---|---|---|
| 001. @family Japan for about a week. Requires finding the real school-break/PTO window across private mail and calendar, then avoiding a tempting but wrong earlier Japan window. | 1.00 (Cost: $0.103; Runtime: 49s). Passed: used the 2027 family spring-break window, planned Tokyo and Osaka for all four travelers, and kept the trip dates inside the verified availability window. | 0.46 (Cost: $13.22; Runtime: 11m04s). Right destination, travelers, dates, and inventory; missed the required school-break email evidence. | 0.46 (Cost: $2.00; Runtime: 4m26s). Right Japan window, travelers, and hotel; used repositioning legs and missed the required school-break email evidence. |
| 002. @bel Paris fashion week. The prompt hides the actual fashion-week dates in private context and still requires valid flight and hotel inventory, not just the event city. | 1.00 (Cost: $0.101; Runtime: 2m22s). Passed: used the Bel-only Paris Fashion Week evidence, preserved the September 28 private date, and completed the Paris plan with valid inventory. | 0.25 (Cost: $1.66; Runtime: 5m13s). Found the private Paris date and Bel-only traveler; did not return valid flight or hotel selections. | 0.25 (Cost: $1.43; Runtime: 3m16s). Correctly used the private Sep 28 Paris date and Bel-only traveler; left flight and hotel inventory unselected. |
| 003. @adam to Tokyo, use the airline credit if we still can. The model has to verify credit eligibility, dates, and seat-map evidence; using the credit without the hidden condition is wrong. | 1.00 (Cost: $0.055; Runtime: 45s). Passed: planned Adam's Tokyo trip after respecting the airline-credit condition and the required seat-map evidence. | 0.46 (Cost: $4.96; Runtime: 3m32s). Got the Tokyo solo trip and inventory; missed the required seat-map evidence. | 0.17 (Cost: $2.18; Runtime: 3m10s). Planned Adam to Tokyo, but used the airline credit when the hidden condition made it unsafe. |
| 005. @danny Orlando theme park weekend. Danny's trip depends on a private appointment constraint plus evidence for the right traveler, destination, and bookable inventory. | 1.00 (Cost: $0.095; Runtime: 38s). Passed: resolved Danny's Orlando weekend, appointment constraint, traveler scope, and bookable inventory. | 0.33 (Cost: $7.53; Runtime: 9m22s). Got Danny, Orlando, and the appointment constraint; missed required evidence and inventory. | 0.33 (Cost: $1.94; Runtime: 5m32s). Respected the orthodontist constraint and destination; failed required evidence, seat, flight, and hotel output. |
| 019. Forwarded hotel for the upcoming trip. This is a wrong-owner trap: the only obvious hotel confirmation matches a plausible trip but explicitly belongs to someone outside the household. | 1.00 (Cost: $0.062; Runtime: 37s). Passed: rejected the external forwarded hotel because it did not belong to the household travelers and asked for valid trip evidence. | 0.03 (Cost: $20.46; Runtime: 10m57s). Returned a Tokyo family trip from Japan evidence instead of rejecting the wrong-owner forwarded hotel. | 1.00 (Cost: $0.95; Runtime: 2m45s). Correctly abstained, rejected the external friend's hotel, and asked for clarification. |
| 039. @adam NYC meeting trip. The correct answer is no travel because Adam is already local; generic NYC meeting evidence pushes planners toward unnecessary flights and hotels. | 1.00 (Cost: $0.078; Runtime: 51s). Passed: recognized Adam was already covered by a temporary New York home and returned no travel needed. | 0.03 (Cost: $0.66; Runtime: 2m54s). Answered the wrong NYC no-travel case and missed the Blueground temporary-home window. | 0.03 (Cost: $2.95; Runtime: 3m25s). Answered a different Midtown no-travel case instead of the Blueground long-stay case. |
| 047. Miami F1 trip. A promotional family-package honeypot conflicts with sparse real evidence, so the right response is clarification instead of booking a complete trip. | 1.00 (Cost: $0.175; Runtime: 1m13s). Passed: identified unresolved Miami GP traveler ambiguity and asked for the traveler set before planning. | 0.00 (Cost: $20.57; Runtime: 9m21s). Hit tool-call budget before returning a final plan. | 0.36 (Cost: $2.53; Runtime: 5m37s). Rejected the honeypot and declined to book, but missed the required ambiguity evidence and traveler-set clarification. |
| 052. Barcelona Apr 4-7 for all four of us. It looks like a normal four-person trip, but the requested window is blocked; the system must surface the schedule conflict rather than force inventory. | 1.00 (Cost: $0.190; Runtime: 1m44s). Passed: found the all-travelers blocked April 3-8 window and returned a schedule-constraint clarification instead of forcing inventory. | 0.25 (Cost: $0.39; Runtime: 57s). Found the blocked evidence and selected no inventory, but did not cleanly return the impossible outcome. | 0.03 (Cost: $1.29; Runtime: 4m00s). Returned infeasible, but for no-inventory supply reasons and the wrong year, not the all-travelers-blocked reason. |
| 058. @adam Midtown and Roam week. Another no-travel case: the task is to connect private Roam/Tanooki context with Adam already being in Midtown, then decline travel planning. | 1.00 (Cost: $0.153; Runtime: 59s). Passed: recognized the Midtown/Roam request was local and returned no travel needed. | 1.00 (Cost: $0.32; Runtime: 1m59s). Passed local Midtown no-travel case. | 1.00 (Cost: $0.24; Runtime: 1m07s). Correctly returned no travel required. |
| 067. Met Gala, then Knicks. The terse prompt requires resolving event eligibility, travelers, destination, local coverage, and Knicks timing without inventing missing evidence. | 1.00 (Cost: $0.100; Runtime: 47s). Passed: resolved the New York context and returned no travel needed because Adam was already covered by the temporary home. | 0.07 (Cost: $16.83; Runtime: 11m56s). Returned a Japan family trip instead of the Met Gala and Knicks New York task. | 0.07 (Cost: $1.65; Runtime: 5m27s). Overdeclined for missing invitation/ticket evidence instead of planning Adam's New York trip and noting no Knicks game. |
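
As a rough illustration of how the per-case rows roll up into the pass counts and cost figures reported for the ten-case set, the sketch below aggregates the scores and costs copied from the table. It assumes that a "pass" means a rubric score of exactly 1.00, which is consistent with the per-system summaries but is not stated as a formal rule in the report. The names `RESULTS` and `summarize` are hypothetical and are not part of any Pack tooling.

```python
# (score, cost in USD) per case, copied from the ten-case table in row order:
# 001, 002, 003, 005, 019, 039, 047, 052, 058, 067.
RESULTS = {
    "Pack": [(1.00, 0.103), (1.00, 0.101), (1.00, 0.055), (1.00, 0.095),
             (1.00, 0.062), (1.00, 0.078), (1.00, 0.175), (1.00, 0.190),
             (1.00, 0.153), (1.00, 0.100)],
    "GPT-5.5 xhigh": [(0.46, 13.22), (0.25, 1.66), (0.46, 4.96), (0.33, 7.53),
                      (0.03, 20.46), (0.03, 0.66), (0.00, 20.57), (0.25, 0.39),
                      (1.00, 0.32), (0.07, 16.83)],
    "Claude Opus 4.7 max-thinking": [(0.46, 2.00), (0.25, 1.43), (0.17, 2.18),
                                     (0.33, 1.94), (1.00, 0.95), (0.03, 2.95),
                                     (0.36, 2.53), (0.03, 1.29), (1.00, 0.24),
                                     (0.07, 1.65)],
}


def summarize(rows):
    """Return pass count, total cost, and average cost per case."""
    # Assumption: a case passes only when its rubric score is 1.00.
    passes = sum(1 for score, _ in rows if score == 1.00)
    total_cost = sum(cost for _, cost in rows)
    return {
        "passes": passes,
        "total_cost": round(total_cost, 3),
        "avg_cost": round(total_cost / len(rows), 3),
    }


if __name__ == "__main__":
    for system, rows in RESULTS.items():
        print(system, summarize(rows))
```

Runtime could be aggregated the same way after converting the `XmYYs` strings to seconds.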
The benchmark combines private inbox context, calendar context, and deterministic flight and hotel inventory.
The benchmark is built around a large synthetic household record: inbox history, calendar history, trips, cancellations, changes, loyalty, preferences, obligations, and costs.
Short human requests require decisions grounded in household context, obligations, public-event timing, prior travel, and noisy private evidence.
Flight and hotel choices are checked against deterministic inventory for seat fit, price, stops, refundability, room capacity, location, and preference match.
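
To make the idea of a deterministic inventory check concrete, here is a minimal sketch covering a few of the flight criteria named above (seat fit, price, stops, refundability). Every name in it (`FlightOption`, `TripNeeds`, `flight_violations`) and every field is a hypothetical stand-in; the benchmark's actual schema and checker are not described in this report.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FlightOption:
    # Hypothetical inventory record; field names are assumptions.
    seats_available: int
    price_usd: float
    stops: int
    refundable: bool


@dataclass(frozen=True)
class TripNeeds:
    # Hypothetical constraints derived from the request and private context.
    travelers: int
    max_price_usd: float
    max_stops: int
    require_refundable: bool


def flight_violations(option: FlightOption, needs: TripNeeds) -> list[str]:
    """Return the list of failed checks; an empty list means the option fits."""
    violations = []
    if option.seats_available < needs.travelers:
        violations.append("seat fit")
    if option.price_usd > needs.max_price_usd:
        violations.append("price")
    if option.stops > needs.max_stops:
        violations.append("stops")
    if needs.require_refundable and not option.refundable:
        violations.append("refundability")
    return violations


if __name__ == "__main__":
    needs = TripNeeds(travelers=4, max_price_usd=900.0, max_stops=1,
                      require_refundable=True)
    option = FlightOption(seats_available=3, price_usd=850.0, stops=0,
                          refundable=False)
    print(flight_violations(option, needs))
```

Because the inventory is deterministic, a check of this shape gives the same verdict on every run, which is what lets the inventory portion of the rubric be scored mechanically. Hotel checks (room capacity, location, preference match) would follow the same pattern.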
30% of the score. The answer must get travelers, dates, destination, duration, conflicts, and hidden constraints right.
10% of the score. Complete trips need valid flight and hotel choices; non-trip cases need the correct no-travel, impossible, or clarification result.
7% of the score. The answer must rely on the right private evidence and avoid tempting wrong-owner, promo, stale, or unrelated context.
3% of the score. The response has to be clear enough to score against the shared rubric before cutoff.
50% of the score. The returned outcome has to be fully correct: bookable trip, no travel, impossible, or clarification.
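
The five weights above sum to 100%, so a case score can be read as a weighted sum of per-component credit. The sketch below shows that arithmetic under two assumptions: the component keys (`planning`, `inventory`, `evidence`, `clarity`, `final_outcome`) are illustrative labels for the five descriptions rather than the rubric's official names, and each component is assumed to award fractional credit between 0 and 1. The report does not spell out how partial credit is assigned within a component.

```python
# Weights taken from the rubric description; keys are hypothetical labels.
WEIGHTS = {
    "planning": 0.30,       # travelers, dates, destination, duration, constraints
    "inventory": 0.10,      # valid flight/hotel choices or correct non-trip result
    "evidence": 0.07,       # right private evidence, no tempting wrong context
    "clarity": 0.03,        # clear enough to score before cutoff
    "final_outcome": 0.50,  # fully correct returned outcome
}


def rubric_score(credit: dict[str, float]) -> float:
    """Weighted sum of per-component credit, rounded to two decimals.

    Components missing from `credit` count as zero credit.
    """
    for name, value in credit.items():
        if name not in WEIGHTS:
            raise KeyError(f"unknown rubric component: {name}")
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"credit for {name} must be in [0, 1]")
    total = sum(weight * credit.get(name, 0.0) for name, weight in WEIGHTS.items())
    return round(total, 2)


if __name__ == "__main__":
    print(rubric_score({name: 1.0 for name in WEIGHTS}))
    print(rubric_score({"planning": 1.0, "inventory": 1.0}))
```

Under this reading, a response cannot exceed 0.50 without a fully correct final outcome, which matches the pattern in the ten-case table where non-passing answers all score below 0.50.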
The official hard-100 result covers every request listed here.