
Docket Revenue AI Index / Edition March 2026

Decide which AI model to trust for revenue work.

We benchmark 8 frontier models on 16 GTM tasks with fixed prompts and practitioner review, so leaders can choose quickly and inspect proof only where it matters.


16 live GTM tasks · 8 frontier models · 2-3 practitioner reviewers

The winner

Here's who actually delivers for revenue teams.

Not one opinion — eight AI models scored each other blind across 16 real GTM tasks. The race was tight. Here's who came out on top.

1. GPT-5.4 (flagship): 8.56
2. Gemini 3.1 Pro (flagship): 8.56
3. Claude Opus 4.6 (flagship): 8.48
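The Index describes blind cross-scoring (each model's work is graded by the other models), but does not publish the aggregation formula. As a rough illustration only, here is a minimal Python sketch of how peer scores could roll up into a leaderboard; the self-score exclusion, the plain averaging, and all names in the example are assumptions, and the real Index also folds in practitioner review, which is omitted here.

```python
from statistics import mean

def leaderboard(peer_scores):
    """Aggregate blind peer scores into an overall ranking.

    peer_scores maps (judge, candidate, task) -> score on a 1-10 scale.
    Self-scores are dropped so no model can inflate its own result.
    Returns (model, average) pairs sorted best-first.
    """
    collected = {}
    for (judge, candidate, task), score in peer_scores.items():
        if judge == candidate:
            continue  # exclude self-evaluations
        collected.setdefault(candidate, []).append(score)
    return sorted(
        ((model, round(mean(scores), 2)) for model, scores in collected.items()),
        key=lambda pair: pair[1],
        reverse=True,
    )

# Toy example: two judges, two candidates, one task
scores = {
    ("A", "B", "t1"): 8.5,
    ("B", "A", "t1"): 8.0,
    ("A", "A", "t1"): 10.0,  # self-score, ignored
}
print(leaderboard(scores))  # [('B', 8.5), ('A', 8.0)]
```

Averaging per task first and then across tasks would weight every task equally regardless of how many judges scored it; the flat average above is just the simplest choice.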

See the proof

Pick your workflow. See which model wins the tasks you actually run.

Prospecting

Prioritizes signals, commits to an angle, and writes outbound a team would actually send.

Demand gen, SDR leaders, and teams writing outbound at scale.

Recommended: Gemini 3.1 Pro · 8.64

Selected task

The No-Reply Follow-Up Sequence

Multi-angle follow-ups using new signals

Winner: GPT-5.4

Why this task matters

Shows whether a model can rank signals instead of flattening them into generic personalization.

Why the winner won

Strong sequence with distinct angles and natural progression that leverages new signals effectively across follow-ups.

Model comparison

Showing all 8 models ranked by score.

OpenAI · GPT-5.4 (Winner) · Task score: 8.8

Subject: Quick thought after seeing {{company}} at the conference

Hi [First Name] — following up on my note about your attribution analyst hiring.

I also noticed {{company}} sponsored the marketing analytics conference. That usually signals attribution and measurement are getting real attention internally — and often more pressure to make channel performance easier to trust and act on.

That’s exactly where we’ve been helping teams: tightening attribution logic, reducing reporting friction, and giving marketing leaders cleaner answers on what’s actually driving pipeline.

If it’s useful, I can send over a short example of how another team approached this while scaling their analytics function.

Best,
[Your Name]

Google · Gemini 3.1 Pro · Task score: 8.5

Hi {{first_name}},

Saw {{company}} just sponsored the recent marketing analytics conference. Combined with your search for an attribution analyst, it’s clear you’re making a major push to dial in your marketing ROI right now.

Often, when teams scale their analytics function, the biggest bottleneck is getting clean, standardized data across all channels so the new hires can actually hit the ground running. We automate attribution tracking so your analysts can focus on high-level strategy instead of manual data cleanup.

Would you be opposed to me sending over a quick 2-minute video showing how we do this?

Best,
{{sender_name}}

Alibaba · Qwen 3.5 · Task score: 8.4

Subject: Re: Attribution analyst role / {{company}}'s conference sponsorship

Hi {{first_name}},

I saw {{company}} sponsored the Marketing Analytics Conference this week. Congrats on the visibility. Combined with your open Attribution Analyst role, it's clear data maturity is a major focus right now.

Usually, when teams invest this heavily in both talent and events, they're aiming to solve multi-touch attribution faster than standard hiring timelines allow.

I know you're busy, so I won't ask for a call yet. Would you be open to me sending over a 2-page brief on how similar teams accelerated their attribution setup while scaling their analytics function?

Best,
{{my_name}}

Anthropic · Claude Opus 4.6 · Task score: 8.2

Subject: Re: Attribution analyst role + your MarTech sponsorship

Hi {{first_name}},

Saw {{company}} just sponsored {{conference_name}} – smart move considering you're hiring attribution analysts. The talent pool at these conferences is exactly who you'd want evaluating attribution models.

But here's what I noticed: most vendors there are still pushing last-click or basic multi-touch. Meanwhile, your competitors are already using incrementality testing and media mix modeling to get real causal insights.

Since you're investing in both the talent and the conference presence, I pulled together a 2-page comparison of what attribution approaches the top sponsors are actually using behind the scenes. Worth a quick look?

Best,
{{your_name}}

P.S. Happy to share which sessions had the highest attendance from attribution specialists if helpful for next year's planning.

xAI · Grok 4.20 · Task score: 7.9

Subject: Following up on your attribution analyst search

Hi {{first_name}},

Just following up on my note about your open attribution analyst role. I saw that {{company}} sponsored the recent marketing analytics conference — a smart move that signals how seriously you're investing in this space.

That said, I'm guessing the role is still open given the specialized nature of the hire. Would you be open to a quick 15-minute call next week to see if my background in marketing measurement and attribution modeling could be a good fit?

Looking forward to connecting.

Best,
{{my_name}}

Mistral · Mistral Large 3 · Task score: 7.9

**Subject:** Quick follow-up – congrats on sponsoring [Conference Name]

Hi [First Name],

I noticed [Company] just sponsored [Conference Name]—great to see your team doubling down on marketing analytics. It’s a smart move, especially with the hiring push for an attribution analyst (which I’d love to help with).

Since you’re clearly investing in this space, I’d love to hear: What’s the biggest challenge your team is tackling with attribution right now?

Happy to share how we’ve helped similar teams [specific result, e.g., "cut reporting time by 40%"]—no pressure, just a quick chat if it’s useful.

Either way, congrats on the sponsorship—hope the event was a success!

Best,
[Your Name]
[Your Position]
[Your Company]
[Your Contact Info]

*(P.S. If now’s not a good time, just let me know—I’m happy to circle back later.)*

DeepSeek · DeepSeek V3.2 · Task score: 7.8

Subject: Following up + an idea for your team

Hi [Prospect Name],

I saw that [Company] just sponsored the Marketing Analytics Conference — great timing given your recent hiring for an attribution analyst.

It made me think of how our [Your Product/Service] helps analytics teams like yours [insert specific, relevant value, e.g., “unify multi-touch attribution without engineering overhead”].

If you’re exploring how to empower that new hire with better data or tools, I’d be happy to share a short case study of how we’ve helped similar teams. Would a 10-minute call next week make sense?

Best,
[Your Name]

Meta · Llama 4 Maverick · Task score: 7.3

Here's a follow-up email:

Subject: Quick Question Following [Conference Name]

Hi [Prospect's Name],

I saw that [Company] was a sponsor at the recent [Conference Name] on marketing analytics. It was great to see your team supporting the industry's efforts to drive data-driven decision-making.

I recall reaching out to you previously about helping with your attribution analyst hiring needs. As someone who's likely surrounded by marketing data experts, I'd love to get your take on the biggest challenges you're facing in measuring campaign effectiveness.

A 15-minute call next week could be a great way to explore this further. Are you available for a quick call at your convenience?

Best regards,
[Your Name]

Account Intel

Synthesizes messy inputs into ranking, fit, and stakeholder calls without flattening contradictions.

ABM, product marketing, and planning teams using AI for research-heavy work.

Recommended: Gemini 3.1 Pro · 8.62

Selected task

ICP Fit Scoring with Budget Trap

Recency-weighted negative signal handling

Winner: Claude Opus 4.6

Why this task matters

Tests whether the model penalizes negative recency instead of rewarding surface-level ICP fit.

Why the winner won

Exceptional judgment in recognizing the trap and providing a nuanced score with specific, actionable recommendations that demonstrate deep B2B sales understanding.
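The Index doesn't disclose a reference rubric for this task, but the idea it tests — fresh negative signals should outweigh a clean firmographic match — can be sketched in a few lines. Everything below is an illustrative assumption (the half-life decay, the clamp, and all example numbers), not Docket's actual scoring logic.

```python
def icp_fit_score(base_fit, signals, half_life_days=90.0):
    """Blend static ICP fit with recency-weighted behavioral signals.

    base_fit: 0-10 score from firmographics (revenue, industry, team size).
    signals: list of (impact, age_days) pairs; negative impact for red
    flags like layoffs or a budget freeze. A signal's weight halves
    every `half_life_days`, so recent events dominate old ones.
    """
    adjustment = sum(
        impact * 0.5 ** (age_days / half_life_days)
        for impact, age_days in signals
    )
    # Clamp to the 1-10 scale used throughout the task
    return max(1.0, min(10.0, base_fit + adjustment))

# A near-perfect paper fit dragged down by fresh negative signals:
# a 10-day-old budget freeze, 30-day-old layoffs, one stale positive.
score = icp_fit_score(
    base_fit=9.0,
    signals=[(-3.0, 10), (-2.0, 30), (1.0, 300)],
)
```

With these made-up weights the example lands in the 4-5 range, which is roughly where the top-scoring models placed this account: strong surface fit, heavily penalized for recent distress.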

Model comparison

Showing all 8 models ranked by score.

Anthropic · Claude Opus 4.6 (Winner) · Task score: 8.9

**ICP Score: 4/10**

**Justification:** While {{company}} technically meets 4 out of 5 ICP criteria on paper, several red flags significantly impact their actual fit:

**Positive alignment:**
- Revenue falls within range ($120M)
- B2B SaaS industry match
- Marketing team size fits (15 people)
- Uses HubSpot

**Critical concerns:**
- **Financial instability**: 5% YoY revenue decline + recent 8% layoffs
- **Budget constraints**: "Discretionary freeze" from CFO effectively eliminates buying power
- **Team instability**: 20% vacancy rate in marketing (3/15 roles) with recent departures suggests organizational challenges
- **Timing risk**: HubSpot contract expiring in 4 months could mean they're in vendor evaluation mode, but the budget freeze makes any new purchase unlikely

This company appears to be in survival/cost-cutting mode rather than growth/investment mode. Despite meeting the surface-level ICP criteria, they're unlikely to be a viable prospect for any discretionary purchase in the near term.

xAI · Grok 4.20 · Task score: 8.7

**Score: 4/10**

### Justification:

**Positive matches (boosting the score):**
- Revenue ($120M) is comfortably within the $50-500M range.
- Industry (B2B SaaS) is a perfect match.
- Marketing team size (15 people) is within the 10-50 target.

**Significant red flags (heavily penalizing the score):**
- **Declining revenue** (-5% YoY) + **recent 8% layoffs** = the company is in cost-cutting mode, not growth mode.
- **Discretionary budget freeze** explicitly called out by the CFO — this directly conflicts with the “Budget cycle: Q1 or Q3” ICP.
- Marketing team is **understaffed** (3 open roles + 2 recent departures) — they’re likely in a “keep the lights on” state rather than investing in new tools.
- HubSpot contract expires in 4 months — this is the *only* real window of opportunity, but given the budget freeze and layoffs, they’re much more likely to renew HubSpot at a lower seat count or negotiate harder than to buy new software.

**Bottom line:** This account is currently a **poor fit** for a new sales cycle. They match the demographic profile but fail badly on the behavioral/financial health signals that actually drive buying decisions. Best approached as a “nurture + monitor for recovery” rather than an active opportunity.

DeepSeek · DeepSeek V3.2 · Task score: 8.7

**Score: 4/10**

**Justification:** While **{{company}}** aligns on several **static, firmographic criteria** (Revenue, Industry, Marketing team size, Stack), it shows severe **operational and financial instability** that makes it a high-risk, poor-fit ICP target *at this moment*.

* **Alignment (Positives):**
  * **Revenue:** $120M fits the $50M-$500M range.
  * **Industry:** B2B SaaS is a direct match.
  * **Marketing Team Size:** 15 people is within the 10-50 range.
  * **Stack:** Currently uses HubSpot, the stated ICP preference.
* **Critical Misalignment & Risk (Negatives):**
  * **Declining Revenue & Cost-Cutting:** A 5% YoY decline, combined with an 8% layoff and a "discretionary spending freeze" from the CFO, signals severe financial stress. The ICP assumes a healthy, investing company.
  * **Budget Cycle at Risk:** Although historically Q1, the CFO's freeze overrides the cycle. Budget is effectively inaccessible now, regardless of timing.
  * **Marketing Team in Flux:** With 3 open roles and 2 recent departures, the 15-person team is understaffed and likely distracted. Decision-making and onboarding of new tools will be difficult.
  * **Stack Uncertainty:** The HubSpot contract expiring in 4 months during a spending freeze means they are likely to renegotiate for cost savings or may even downgrade/switch, not evaluate new add-ons.

**Conclusion:** This company matches the ICP on paper but fails on the crucial **behavioral and financial health assumptions**. The ICP profile seeks companies in a **growth or stability phase** with budget to invest. {{company}} is in **contraction and cost-cutting mode**, making it a poor prospect for any new vendor seeking a reliable, budgeted partnership. The score is a 4 instead of lower because the foundational fit means they could be a strong candidate **in 12-18 months if they recover**, but not today.

Alibaba · Qwen 3.5 · Task score: 8.7

**Score: 5/10**

**Justification:** This company is a **perfect firmographic match** but a **high-risk viability prospect**.

* **Positive Fit (10/10):** They hit every static ICP criterion perfectly (Revenue, Industry, Team Size, Stack, and Budget Cycle). The HubSpot contract expiring in 4 months creates a natural renewal/switch window.
* **Negative Viability (0/10):** Critical red flags outweigh the profile match. Declining revenue, recent layoffs (8%), hiring freezes, and a CFO-mandated "discretionary freeze" indicate financial distress and a lack of buying authority.

**Verdict:** While they look like the ideal customer on paper, the budget freeze and financial instability make them unqualifiable for immediate sales engagement. Place in a long-term nurture track rather than active outreach.

Google · Gemini 3.1 Pro · Task score: 8.6

**Score: 6/10**

**Justification:** On paper, {{company}} is a perfect 10/10 demographic match—they hit every single baseline ICP criteria (B2B SaaS, $120M revenue, 15 marketers, HubSpot user, Q1 budget). However, their contextual buying signals significantly lower their score:

* **The Bad (Red Flags):** Declining revenue, an 8% company-wide layoff, marketing team instability (recent departures/open roles), and a CFO-mandated discretionary spending freeze. Getting net-new budget approved right now will be incredibly difficult unless your product is a direct, hard-ROI cost-saver.
* **The Opportunity (Silver Lining):** Their HubSpot contract expires in 4 months. Because they are in cost-cutting mode, this is a highly actionable wedge *if* your solution can be pitched as a cheaper alternative or a tool consolidation play.

**Verdict:** They are the exact right *type* of company, but the *timing and financial health* make them a high-friction, defensive buyer. Pitch strictly on cost-savings/efficiency, or put them in a nurture sequence until their financial footing stabilizes.

Mistral · Mistral Large 3 · Task score: 8.6

Here’s my scoring of **{{company}}** against the ICP on a **1-10 scale**, with justifications:

| **ICP Criteria** | **Score (1-10)** | **Justification** |
|---|---|---|
| **Revenue** | **7/10** | $120M falls within the $50M–$500M range, but the **5% YoY decline** and recent layoffs raise concerns about stability and growth potential. Still, the revenue is technically a fit. |
| **Industry** | **10/10** | Perfect match: B2B SaaS. |
| **Marketing Team Size** | **6/10** | 15 people is within the 10–50 range, but **3 open roles and 2 recent departures** suggest instability, potential understaffing, or churn risk. The team may be stretched thin. |
| **Current Stack** | **9/10** | HubSpot is a direct fit, but the **contract expiring in 4 months** means they’re actively evaluating alternatives—this could be an opportunity or a red flag if they’re cutting costs. |
| **Budget Cycle** | **4/10** | Previously approved in Q1 (a fit), but the **CFO memo about a "discretionary freeze"** suggests budget uncertainty. They may delay or cancel non-essential spending, including marketing tech. |
| **Recent Layoffs** | **N/A (Risk)** | Not part of the ICP, but the **8% workforce reduction** signals financial stress, which could impact willingness to invest in new tools or services. This is a **major red flag** for near-term sales potential. |

### **Overall ICP Fit Score: 6.5/10**

**Summary:**
- **Strengths:** Industry (10/10), stack (9/10), and revenue (7/10) are strong fits.
- **Risks:** Declining revenue, layoffs, budget freezes, and team instability (open roles/departures) suggest **high financial and operational risk**. The expiring HubSpot contract is a potential opportunity, but the timing is poor given their cost-cutting signals.
- **Recommendation:**
  - **Short-term (0–3 months):** Low priority—wait for budget clarity or signs of stabilization.
  - **Long-term (3–6 months):** Re-engage if they renew HubSpot (indicating continued investment) or show revenue recovery. Otherwise, deprioritize.
- **Messaging:** If pursuing, focus on **cost efficiency, ROI, and retention** (not growth) to align with their current pain points. Avoid aggressive upsell pitches.

OpenAI · GPT-5.4 · Task score: 8.4

**Score: 6/10**

**Why:**
- **Revenue:** $120M fits the ICP range of **$50M–$500M**.
- **Industry:** Strong match as **B2B SaaS**.
- **Marketing team size:** **15 people** fits the **10–50** target, though recent departures and open roles suggest some instability.
- **Current stack:** **HubSpot** is a direct match, and the contract expiring in 4 months could create an opportunity.
- **Budget cycle:** Historically aligned with **Q1**, but the **CFO’s discretionary freeze** is a major risk.
- **Recent signals:** The **5% YoY decline in revenue** and **8% workforce layoff** indicate financial pressure, which lowers near-term fit and likelihood to buy.

**Summary:** This company matches the core firmographic ICP well, but recent financial and organizational instability makes it a weaker practical fit than the surface-level criteria suggest.

Meta · Llama 4 Maverick · Task score: 7.6

I'd score {{company}} an 8 out of 10. Here's why:

* Revenue: $120M is within the target range, but the 5% YoY decline is a concern (1 point deduction).
* Industry: B2B SaaS is a perfect match.
* Marketing team size: 15 people is within the target range, but the 3 open roles and 2 recent departures suggest some instability (1 point deduction, but not 2 because the team is still relatively intact).
* Current stack: HubSpot is a match, but the contract expiring in 4 months might lead to a re-evaluation of their stack, which could be an opportunity or a risk.
* Budget cycle: Previously approved in Q1 is a good sign, but the CFO memo suggesting a "discretionary freeze" raises concerns about the current budget environment (1 point deduction).

Overall, {{company}} matches the ICP in many ways, but there are some red flags around revenue decline, marketing team stability, and budget uncertainty. These concerns are significant, but not deal-breakers, hence the score of 8.

Campaigns

Repurposes source material with editorial judgment across formats, audiences, and pressure-tested rewrites.

Content, lifecycle, and campaign owners shipping assets every week.

Recommended: Gemini 3.1 Pro · 8.45

Selected task

Dead Campaign Diagnosis & Rewrite

Single-diagnosis commitment under uncertainty

Winner: GPT-5.4

Why this task matters

Separates models that commit to one diagnosis from models that hide behind exhaustive but unusable lists.

Why the winner won

Correctly identifies the core issue in the dead campaign and commits to a single diagnosis with a compelling rewrite.

Model comparison

Showing all 8 models ranked by score.

OpenAI · GPT-5.4 (Winner) · Task score: 8.5

**Single biggest problem:** The email is **far too generic and self-focused**, so it reads like a mass blast instead of something relevant enough for a Director+ marketing leader to care about. At this level, “Are you struggling with marketing attribution?” is both obvious and low-value. Everyone has heard this pitch. The recipient doesn’t see a sharp, specific problem tied to their world, so they don’t open, click, or reply.

## Rewritten email with that fix applied

**Subject:** When paid search gets too much credit

**Email:**

Hi {{firstName}},

One pattern we see in B2B SaaS teams is that paid search and branded traffic often get over-credited for pipeline, while the programs that actually create demand get undervalued.

That usually leads to two expensive mistakes:

1. budget gets shifted toward channels that capture demand, not create it
2. leadership loses confidence in marketing’s contribution to revenue

We help marketing teams see which touches are actually influencing pipeline and closed-won revenue across the full buying journey. For example, [Customer A] used this to reallocate spend and improve marketing ROI by 34%.

Worth a quick look at how your team is measuring influence vs. capture?

— [Your Name]

## Why this fix

It makes the message **specific**, **credible**, and **relevant to a real executive pain point** instead of pitching “marketing attribution” in abstract terms.

Google · Gemini 3.1 Pro · Task score: 8.4

### The Diagnosis

The single biggest problem is that **the email instantly triggers the recipient's "automated sales pitch" filter from the inbox preview.**

Sophisticated marketing leaders at $50M+ SaaS companies get dozens of cold emails daily. When they see a billboard-style subject line ("Transform Your...") paired with a patronizing, rhetorical opening question in the preview text ("Are you struggling with...?"), their brain instantly pattern-matches the message as mass-market spam. They delete it without opening it—which is exactly why your open rate is stuck at a dismal 12% and your downstream metrics are flatlining.

### The Rewrite

*We need to fix the "sales pitch" pattern match by making it look like a 1:1 email from an industry peer, using a quiet subject line and speaking directly to the specific realities of mid-market SaaS.*

**Subject:** pipeline attribution at {{companyName}}

Hi {{firstName}},

Usually, when SaaS marketing teams reach your scale, out-of-the-box Salesforce or HubSpot attribution starts breaking down—especially when trying to prove which specific campaigns are driving closed-won revenue versus just generating cheap, top-of-funnel leads.

We recently helped [Customer A] map their complex buyer journeys to solve exactly this. It allowed them to identify their true revenue-driving channels and reallocate wasted spend, bumping their marketing ROI by 34%.

Are you open to seeing a brief breakdown of how they set this up?

Best,
[Your Name]

DeepSeek · DeepSeek V3.2 · Task score: 8.3

**SINGLE biggest problem: The subject line and email body are focused on the problem (attribution), not the outcome (revenue impact). The audience of senior marketing leaders cares about business results, not attribution as a topic.**

**Rewrite with the fix applied:**

**Subject:** How [Customer A] increased pipeline revenue by 34% in one quarter

**Body:**

Hi {{firstName}},

Marketing leaders at companies like [Customer A] were struggling to prove which channels actually drove revenue, leaving budget decisions in the dark.

After implementing our platform, they identified that 40% of their pipeline revenue was being misattributed—and shifted spend to double the ROI of their top-performing channel. The result: a **34% increase in marketing-driven pipeline revenue** within one quarter.

Would you be open to a brief, 15-minute call next week? I can share the specific playbook they used.

Best,
[Your Name]

Anthropic · Claude Opus 4.6 · Task score: 8.1

**The Single Biggest Problem:** The email is generic and doesn't demonstrate any understanding of the recipient's specific situation or pain points. It reads like a mass blast that could be sent to anyone, which is why executives are ignoring it.

**Rewritten Email:**

Subject: Question about {{companyName}}'s attribution after Series B

Hi {{firstName}},

I noticed {{companyName}} raised $XX million [timeframe] and is likely scaling marketing spend across multiple channels now.

Most companies at your stage discover their Google Analytics and Salesforce data tell completely different stories about what's actually driving pipeline. Ring a bell?

[Customer similar to them] had the same issue - their board was questioning why marketing efficiency was declining despite 3x higher spend. Turned out they were crediting 67% of revenue to the wrong channels.

Worth a quick conversation if you're seeing similar disconnects between your marketing data and actual revenue outcomes?

Best,
[Name]

P.S. - I can share the specific dashboard [Similar Customer] used to identify their $2.3M in misattributed pipeline. Takes 5 minutes to walk through.

xAI · Grok 4.20 · Task score: 8.1

**Single biggest problem:** The email is completely generic, self-focused, and sounds like every other marketing automation email. It leads with a weak, assumptive question about a problem the recipient may not even feel acutely, offers zero personalization or relevance to the specific person/company, and jumps straight to a pitch without earning any attention or trust.

### Rewritten Email:

**Subject:** Quick question about {{Company}}’s attribution model

**Body:**

Hi {{firstName}},

I noticed {{Company}} has been scaling paid acquisition pretty aggressively over the last 12 months. Most companies in your position still can’t confidently answer which campaigns are actually driving pipeline versus just inflating vanity metrics.

We’ve been helping B2B SaaS companies in the $50–500M range fix exactly that. For example, [Customer A] discovered 41% of their “attributed” pipeline was actually cannibalizing organic/brand traffic they were already getting.

Would you be open to a 15-minute call next week to see whether we’d find similar leakage at {{Company}}?

Best,
{{Your Name}}

---

**Key changes made:**
- Subject now references their company and implies relevance
- Opens with observed behavior specific to them (shows you did homework)
- Shifts from “Are you struggling?” to a more confident, insightful statement
- Uses a sharper, more believable result (41% cannibalization > generic “34% ROI increase”)
- Ends with a low-pressure, specific ask that references their company again

This version treats the recipient like a serious professional instead of a name on a list.

Mistral · Mistral Large 3 · Task score: 8.1

**Biggest Problem:** The email lacks **specificity and personalization**—it’s too generic, making it easy for recipients to dismiss as irrelevant. The pain point ("struggling with marketing attribution") is broad, and the value proposition ("comprehensive multi-touch attribution") is vague. Director+ leaders at $50M–$500M SaaS companies need **concrete, high-stakes proof** that this solves *their* unique challenges (e.g., wasted ad spend, misaligned sales/marketing, or board-level pressure to prove ROI).

---

**Rewritten Email (Fix Applied: Hyper-Specificity + Social Proof for Their Context):**

**Subject:** How [Customer A] cut $280K in wasted ad spend (in 30 days)

**Body:**

Hi {{firstName}},

[Customer A]’s CMO was under pressure to prove which channels drove their $12M in pipeline—after wasting $280K on LinkedIn ads that *looked* like they worked (but didn’t convert to revenue).

In 30 days, they:
- **Cut $280K in wasted ad spend** by identifying underperforming campaigns
- **Reallocated budget to 3 high-ROI channels**, increasing pipeline by 22%
- **Automated board reports** to show marketing’s revenue impact (no more manual spreadsheets)

Their attribution model now ties every dollar to closed-won deals—no more guessing.

If you’re dealing with similar pressure (or just tired of arguing with finance about "last-touch" metrics), I’d love to share how they did it.

**15-minute call this week?** [Book time here]

—[Your Name]
[Your Title]
[Company]
[Calendly Link]

---

**Why This Works:**
1. **Specificity:** Names a real customer, quantifies the problem ($280K wasted), and ties the solution to revenue (not just "better data").
2. **Relevance:** Speaks to the recipient’s likely pain points (board pressure, wasted spend, misaligned teams) with concrete outcomes.
3. **Urgency:** "30 days" and "no more manual spreadsheets" imply speed and efficiency—key for busy leaders.
4. **Low-Commitment CTA:** "15-minute call" + direct link reduces friction.

*Bonus:* The subject line is now a **mini-case study**, which outperforms generic "transform your X" messaging.

Meta · Llama 4 Maverick · Task score: 7.1

After analyzing the email campaign's results, I diagnose the SINGLE biggest problem as:

**The email subject line is not compelling enough to grab the attention of the target audience, resulting in a low open rate.**

The open rate is significantly lower than the industry average (12% vs 22%), indicating that the subject line is not resonating with the recipients. A more effective subject line can entice more people to open the email, which can then lead to higher click and reply rates.

Here's a rewritten email with a new subject line and the same email body:

**New Subject:** "Unlock the Secret to Proving Marketing ROI"

**Email Body remains the same:**

"Hi {{firstName}}, Are you struggling with marketing attribution? Many companies face challenges measuring the true impact of their marketing efforts. Our platform provides comprehensive multi-touch attribution that helps you understand which channels drive revenue. We've helped companies like [Customer A] increase marketing ROI by 34%. Want to learn more? Book a 15-minute call."

The new subject line is more provocative and relevant to the target audience, as it speaks directly to a common pain point (proving marketing ROI) that marketing leaders at B2B SaaS companies are likely to face. This change aims to increase the open rate, which can then have a positive impact on the overall performance of the email campaign.

Alibaba · Qwen 3.5 · Task score: 7.0

Deal Strategy

Turns messy stakeholder context into next-best actions, honest risk calls, and usable follow-up.

Leaders buying or selling martech where judgment under uncertainty matters.

Recommended: GPT-5.4 · 8.59

Selected task

Live Objection Handling as Dialogue

Connected natural dialogue responses

Winner: Gemini 3.1 Pro

Why this task matters

Shows whether the model can be honest when deal signals are negative instead of sounding helpfully delusional.

Why the winner won

Exceptional dialogue flow with specific, actionable responses that naturally build on each other, though the CRO response could have been even more tactical about stakeholder buy-in strategies.

Model comparison

Showing all 8 models ranked by score.

Google · Gemini 3.1 Pro (Winner) · Task score: 8.6

Sarah, I completely hear you on that. Six months is a nightmare, and honestly, if our implementation took that long, I wouldn't blame you for hanging up right now. The legacy attribution tools you’ve probably used in the past required tagging every single web page, overhauling your UTM structures, and begging IT for favors—which is an absolute drain on your team's bandwidth.

We built this platform specifically to avoid that trauma. Because we plug directly into your existing platforms via API, we actually pull historical data on day one. Most of our customers are fully up and running in about two weeks, and your team’s only real task is authenticating the logins. We handle the heavy lifting so your demand gen team can stay focused on driving pipeline, not managing an IT project.

Oh, hi David—looks like you just joined the Zoom. Thanks for jumping in. Sarah and I were just talking about the implementation timeline, but to address your question directly—it’s exactly what a CFO should be asking. If HubSpot can do this, you absolutely shouldn't be spending money on another tool.

We love HubSpot, but the reality is that it's fundamentally a marketing system of record. It’s great at tracking email opens and form fills, but it falls down when trying to stitch together a complex B2B buyer journey where multiple stakeholders are interacting with LinkedIn ads, outbound SDR touches, and website visits before a deal closes. We aren't here to replace HubSpot; we're here to protect the investment you’ve already made in it. We take the data inside HubSpot, combine it with your sales data, and translate it into a financial model. This way, when you and the marketing team sit down to review budget, you aren't looking at "leads generated"—you're looking at exactly which campaigns are driving closed-won revenue and where you can cut wasted ad spend.

And Jessica, I saw you nodding along, but to your point about your CRO—that is a very real dynamic, and I'm glad you brought it up. If he already doesn't believe in marketing attribution, forcing him to log into yet another marketing dashboard is never going to work. But honestly, this isn't really a software problem; it's an alignment problem. CROs usually hate attribution because it feels like marketing is just trying to steal credit for sales. They don't care about marketing metrics; they care about pipeline velocity, win rates, and revenue.

So, our strategy is simple: we don't ask him to look at our dashboards at all. We push all of this data directly into Salesforce, right where he already lives, attaching it directly to the opportunities his reps are working. This gives you the data to sit at the executive table and speak his language. Instead of showing him a marketing report, you can show him exactly how your campaigns are accelerating his sales cycles and helping his reps close bigger deals. It takes the friction out of the conversation because you’re finally working from the exact same set of numbers.

Do you think framing it around his pipeline goals would change the tone of those conversations?

deepseek

DeepSeek V3.2

Task score

8.5

**Me:** I completely understand, and that's a really common experience with older systems. Our entire platform was built to avoid that. We use a lightweight script, similar to Google Analytics, and most of our customers are fully implemented and seeing data in under 30 days. The key is we don't require you to rebuild your entire data warehouse first. We connect to what you have.

**CFO (joining):** Sorry I'm late. But to that point, why *do* we need a separate tool? We pay for HubSpot. Doesn't it handle attribution?

**Me:** Hi, thanks for joining. That's an excellent and fair question. HubSpot is great for tracking what happens *within* HubSpot. But our tool is built to solve the specific problem the VP just mentioned—it connects *all* your data sources, like paid ads, web visits, and HubSpot activities, into a single, unified model. It answers the question: "Which channels and campaigns are actually driving pipeline and revenue?" HubSpot can't see the full journey before a lead enters their system. So, we're less a replacement and more the connective layer that makes your investment in HubSpot—and all other channels—more measurable.

**CMO:** And that's the core of our internal challenge. Honestly, even if we get this, our CRO doesn't believe in marketing attribution. He says it's all guesswork. If he won't look at the dashboards, it's a hard sell.

**Me:** That's actually the most important point, and it connects directly to what you both said earlier. This isn't just about getting a tool implemented quickly or adding data sources. It's about building a shared story for revenue. With the CRO, it's not about forcing him to look at *marketing* dashboards. It's about creating a single source of truth for *revenue generation* that both sales and marketing can trust. We often start by focusing on one shared goal, like "What sourced the deals that closed last quarter?" The implementation speed the VP needs and the cross-platform data the CFO asked about are just the prerequisites. The real outcome is getting your revenue team—marketing *and* sales—aligned on what's working, so you can invest more there. Would it make sense to include the CRO in a follow-up where we focus purely on that alignment and the deal-level insights, rather than "marketing attribution"?

alibaba

Qwen 3.5

Task score

8.5

Thanks for sharing that, [VP Name], and I want to validate that immediately—six months is a lifetime in demand gen. When a team loses half a year to implementation, you lose trust in the tool before you even see the data. That's actually why we redesigned our onboarding entirely. We don't do big-waterfall deployments anymore; we run a phased rollout where we connect your core channels in under three weeks. We can start with just your paid search and LinkedIn data to prove value immediately, so your team isn't bogged down in technical setup while campaigns are running.

That actually leads perfectly into what [CFO Name] just asked as you've joined us. It's a great question, and I'd argue HubSpot is fantastic for managing contacts and workflows, but it isn't built to handle multi-touch revenue attribution across external ad platforms. HubSpot tells you who clicked, but we tell you which dollar spent actually moved the deal. And because we're focusing on that quick, phased implementation I mentioned to [VP Name], we're not asking for a massive budget commitment upfront. We're minimizing risk by proving the ROI on a small scale before expanding, which ensures we're not just adding another line item to the stack, but actually clarifying the spend you already have in HubSpot.

However, having the data is only half the battle, which brings up the point [CMO Name] just made about the CRO. If the sales leadership doesn't believe in the numbers, the tool becomes shelfware, and that's a waste of everyone's time—including the budget [CFO Name] is protecting. This isn't really about convincing the CRO that marketing attribution is valid; it's about changing the narrative so it isn't about marketing credit at all. We shouldn't build dashboards that show marketing performance; we should build a view that helps his sales reps prioritize which leads to call first based on engagement intensity. If we frame this pilot as a sales efficiency tool rather than a marketing reporting tool, we get his buy-in because it helps him close faster, not because it proves marketing did their job.

So, if we agree to that phased, sales-focused pilot, we can respect [VP Name]'s time, protect [CFO Name]'s budget, and ensure [CMO Name] has the internal alignment to make this stick.

anthropic

Claude Opus 4.6

Task score

8.3

**To VP Demand Gen:** "I completely understand that frustration. Six months is way too long, and I can imagine how disruptive that was for your team. The good news is we've learned from those early attribution platforms - we can typically get you seeing initial data within 2 weeks, and fully implemented in 6-8 weeks. But honestly, I'm more concerned about making sure this actually works for your team long-term. What specifically made that previous implementation so painful?"

*[VP explains it was complex field mapping and endless data cleaning]*

"That makes total sense. We've actually built our platform to work with your existing data structure - no massive overhauls needed. In fact, let me show you how Sarah at TechCorp got their first attribution reports running in just 10 days..."

**To CFO:** "Welcome! Great timing actually - we were just discussing implementation timelines. You raise an excellent question about HubSpot. You're right that HubSpot has some basic attribution features, but here's what we typically see: companies using HubSpot can track first-touch or last-touch, but they're missing the full journey. For example, your VP here mentioned you're running LinkedIn ads, Google ads, webinars, and field events. HubSpot will tell you someone came from LinkedIn, but not that they actually engaged with 7 different touchpoints over 3 months before converting. That's probably why you're here looking for better visibility into what's actually driving revenue, right?"

**To CMO:** "You know, I really appreciate you bringing this up because you're absolutely right - without your CRO's buy-in, even the best attribution data won't move the needle. This isn't really about dashboards at all, is it? It's about getting sales and marketing aligned on what's actually working. What if, instead of focusing on attribution reports, we positioned this as a way to help your CRO's team prioritize the hottest accounts? We've found that when sales teams see that marketing-influenced deals close 23% faster and at higher values, they become our biggest champions. Would your CRO be opposed to a pilot where we just focus on showing which accounts are most engaged and ready to buy?"

openai

GPT-5.4

Task score

8.3

Absolutely — that makes sense, and honestly that’s one of the first things we hear from teams who’ve already been through a painful rollout. If your last attribution tool took six months, I wouldn’t expect your team to be excited about doing that again. The difference in our approach is that we’re not asking your ops team to rebuild your entire GTM data model before you see value. We plug into the systems you already use, get the core attribution views live quickly, and then layer in refinement over time. So instead of a massive implementation project, it’s more of a phased rollout: get visibility fast, then mature it. The goal is to reduce work for your team, not create another six-month science project.

**CFO joins**

Totally fair question. If you’re looking at this from a finance lens, “why another tool?” is exactly the right thing to ask. HubSpot is useful, but it’s primarily a system of execution — campaigns, email, forms, basic reporting. Where teams hit the wall is when they need a trustworthy answer to questions like: which programs actually influenced pipeline, how multiple touches contributed across a buying journey, and where to shift budget with confidence. That’s usually where native reporting starts to fall short, especially once you have longer sales cycles, multiple channels, and more than one stakeholder involved in the deal. And this connects to the implementation point we were just talking about: the reason companies add us isn’t to create more software sprawl. It’s to get a decision-making layer on top of HubSpot and CRM data without forcing the team into a huge rebuild. So ideally, this is not “one more tool to manage”; it’s the thing that makes the systems you already pay for more useful to marketing leadership and finance.

**CMO raises objection**

Yeah — and I’m glad you said it that way, because that’s usually not really a dashboard problem. It’s a buy-in problem. If the CRO doesn’t believe in attribution, giving him another reporting login won’t change much. So the right question isn’t “how do we get him to look at dashboards?” It’s “what evidence would he trust enough to use in planning conversations?” What we typically see work is aligning the output to the decisions he already cares about — pipeline creation, conversion through stages, deal velocity, revenue impact by segment — rather than asking him to adopt marketing language or attribution terminology. In other words, we don’t lead with “here’s a beautiful attribution dashboard.” We lead with “here’s a clearer way for marketing and sales to agree on what’s actually driving pipeline.”

So if we moved forward, I’d want to make sure this isn’t just a marketing implementation. We’d involve the CRO early, define a small set of shared questions, and make sure the reporting answers those in a way finance, marketing, and sales can all trust. Because if stakeholder alignment isn’t there, the product can still work — but the business outcome won’t. Would it be helpful if we talked through what your CRO tends to push back on specifically — whether it’s the methodology, trust in the data, or just not seeing how it helps him run sales?

mistral

Mistral Large 3

Task score

8.3

**Sales Rep:** *"That makes complete sense, [VP Demand Gen]. Six months is a lifetime in marketing—especially when you're under pressure to show results. I’ve heard that exact frustration from other teams who felt like they were stuck in implementation purgatory with their last tool. What was the biggest pain point for your team during that process? Was it the data mapping, the training, or something else?"*

*(After they respond, acknowledge their specific pain and transition to the next objection naturally.)*

*"I totally get why that would leave a bad taste—especially when you're trying to move fast. And [CFO], I appreciate you jumping in. That’s actually a great segue because I hear this a lot: ‘Why can’t HubSpot just do this?’ The short answer is that HubSpot is fantastic for execution—emails, landing pages, basic reporting—but it wasn’t built to solve the specific problem we’re talking about: proving which channels and campaigns are *actually* driving revenue, not just leads. HubSpot can tell you how many MQLs you got from LinkedIn, but it can’t show you how many of those turned into closed-won deals *and* which touchpoints influenced them along the way. That’s the gap we fill, and it’s why teams like yours end up here—because they’re tired of flying blind on what’s working and what’s not."*

*(Pause to let that land, then pivot to the CMO’s concern.)*

*"Which brings me to your point, [CMO]. If the CRO isn’t bought in, even the best tool in the world won’t move the needle. I’ve seen this play out a few ways: sometimes the CRO just hasn’t seen attribution data that’s *actionable* for sales—like, ‘Hey, your reps are 3x more likely to close deals when marketing nurtures them with X content’—or they’ve been burned by tools that gave them a firehose of data without clear next steps. What’s your sense of where the disconnect is with your CRO? Is it skepticism about the data itself, or is it more about how the insights get used?"*

*(Listen closely to their answer, then tie it back to the earlier objections.)*

*"That’s really helpful. Here’s what I’m hearing across all three of your concerns: You need something that works *fast* (not a 6-month slog), that *actually* complements HubSpot (not just another dashboard), and that gives your CRO *clear, sales-relevant insights* (not just marketing metrics). That’s why we designed our implementation to be turnkey—most teams are live in 2 weeks, not 6 months—and why our dashboards are built to answer the questions sales leaders care about, like ‘Which leads should my reps prioritize?’ or ‘What’s the ROI of this event we just spent $50K on?’ Would it make sense to walk through how that works in practice, so you can see how it might address each of those points?"*

xai

Grok 4.20

Task score

8.1

**Live Sales Call Dialogue:**

**You:** I totally get that, Sarah. Six months is an eternity when your team’s already stretched. The last thing I want is to put you through another painful implementation. That’s actually why we redesigned our onboarding from the ground up. Most of our customers are fully live and seeing data in under 3 weeks, not months. We can talk about exactly how we’d do that for your stack if that feels better.

*(CFO joins)*

**CFO:** I don’t understand why we need another marketing tool. Can’t HubSpot do this?

**You:** Hey Mike, good to see you. And that’s a fair question. HubSpot is great at lots of things, but it’s not built to give you clean, multi-touch attribution across all your channels and offline touchpoints. What we do is sit on top of HubSpot (and everything else) and actually connect the dots so you can see true influence, not just last-click or HubSpot’s limited view. The goal isn’t to replace HubSpot. It’s to finally make the data you already have in there trustworthy and actionable.

**CMO:** Our CRO doesn’t believe in marketing attribution. Even if we buy this, he won’t look at the dashboards.

**You:** I hear you, and that’s actually the most important objection we’ve had today. This isn’t really about dashboards. It’s about getting your CRO bought in so marketing isn’t just a cost center. Here’s what I’ve seen work really well: instead of trying to change his mind with pretty reports, we use this tool to create a very short, revenue-focused “Source Influence” view that we review together in your existing weekly pipeline meeting. We’re not asking him to log into another tool. We’re bringing the one or two numbers he actually cares about directly into the conversation he’s already having. Would it be valuable if I showed you exactly how we’ve done that with two other companies where the CRO was initially skeptical?

meta

Llama 4 Maverick

Task score

7.7

Here's the dialogue:

VP Demand Gen: "We tried an attribution tool before and it took 6 months to implement. My team can't go through that again."

Me: "I completely understand the pain of a lengthy implementation process. We actually streamlined our onboarding process significantly, and most of our customers are up and running within 6-8 weeks. We achieve this by having a dedicated implementation team that works closely with your team to get you started quickly. Would you be open to discussing some case studies from our existing customers who have seen a fast time-to-value?"

VP Demand Gen: "That sounds great. I'd love to hear more."

CFO: "Hi everyone, sorry to jump in. I don't understand why we need another marketing tool. Can't HubSpot do this?"

Me: "Hi, happy to address that. We actually work closely with HubSpot and many of our customers use both platforms. While HubSpot provides a great foundation for marketing automation, our attribution tool provides a more nuanced view of the customer journey, allowing you to understand the true impact of your marketing efforts across channels and campaigns. We've seen customers achieve a much clearer ROI understanding by using both platforms together. I can walk you through some examples if that would be helpful."

CFO: "Okay, that makes sense. I'll let the team discuss further."

CMO: "To be honest, our CRO is a tough sell on marketing attribution. Even if we buy this, he won't look at the dashboards."

Me: "I understand that having buy-in from all stakeholders is crucial. I'd like to circle back to something we discussed earlier - our implementation process and the value we've delivered to other customers. One of the reasons our attribution tool has been successful is that we've been able to demonstrate value to both marketing and sales stakeholders. We've worked with many CROs who were initially skeptical, but once they saw the insights and data we could provide, they became strong advocates. I'd love to put you in touch with some of our customers who have had similar conversations with their sales leaders. We can also discuss ways to ensure that our tool is aligned with your CRO's priorities and goals."

Advanced explorer

Deep dive only when you need the full field.

This keeps the full comparison view available for evaluators who want every output, the complete task brief, and blind-review mode in one place.

Why this ranking is worth trusting

You don't have to take our word for it.

Every model ran the same 16 GTM tasks under identical conditions. We show you the full prompt, every model's output, and the score on each dimension — so you can verify the call yourself instead of trusting a black box.

GPT-5.4

Leads overall and in deal strategy. Strongest at extracting action from messy stakeholder calls and managing risk under uncertainty.

Gemini 3.1 Pro

Virtually tied for first. Leads prospecting, account intel, and campaigns — the broadest category winner in the field.

Claude Opus 4.6

Most consistent across all four workflows. Lowest variance between categories — the safe pick when you need reliable output everywhere.

Methodology

Built so your team can trust the conclusion.

Every score traces back to a real prompt and a real output you can read. No black-box ranking, no vendor narratives — just the work, judged on the work.

Publication standards

New editions publish when a frontier model release materially changes the field.

Raw task inputs stay visible so teams can reproduce the benchmark.

Human review sits on top of scoring before an edition goes live.

Docket keeps its benchmark model-agnostic and does not filter unfavorable results.

Output Quality

30%

Specific, marketer-ready output without generic filler.

Judgment Accuracy

25%

Picks the right signal, commits to a recommendation, and avoids hedging.

Usability

25%

A practitioner could use it today without heavy editing.

Hallucination Avoidance

20%

No invented stats, tools, or stakeholders outside the prompt.
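The four weighted dimensions above combine into a single task score as a weighted average. The sketch below shows the arithmetic under that assumption; the weights come from the published rubric, while the per-dimension scores and function name are illustrative, not real benchmark data.

```python
# Composite task score as a weighted average of the four rubric
# dimensions listed above. Weights are from the published rubric;
# the per-dimension scores below are illustrative, not real data.
WEIGHTS = {
    "output_quality": 0.30,
    "judgment_accuracy": 0.25,
    "usability": 0.25,
    "hallucination_avoidance": 0.20,
}

def composite_score(dimension_scores: dict[str, float]) -> float:
    """Weighted average of 0-10 dimension scores, rounded to one decimal."""
    if set(dimension_scores) != set(WEIGHTS):
        raise ValueError("scores must cover exactly the four rubric dimensions")
    total = sum(WEIGHTS[d] * s for d, s in dimension_scores.items())
    return round(total, 1)

# Hypothetical reviewer scores for one model on one task:
example = {
    "output_quality": 9.0,
    "judgment_accuracy": 8.5,
    "usability": 8.8,
    "hallucination_avoidance": 8.0,
}
print(composite_score(example))  # 0.30*9.0 + 0.25*8.5 + 0.25*8.8 + 0.20*8.0
```

Because the weights sum to 1.0, a model cannot compensate for hallucinations with polish alone: the 20% hallucination-avoidance weight caps how far invented details can be offset.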

Model version tracking

Claude Opus 4.6

Model ID: claude-opus-4-6

flagship

GPT-5.4

Model ID: gpt-5.4

flagship

Gemini 3.1 Pro

Model ID: gemini-3.1-pro

flagship

Grok 4.20

Model ID: grok-4.20

high-value

DeepSeek V3.2

Model ID: deepseek-v3.2

budget

Llama 4 Maverick

Model ID: llama-4-maverick

open-weight

Mistral Large 3

Model ID: mistral-large-3

open-weight

Qwen 3.5

Model ID: qwen-3.5

open-weight
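The version-tracking table above can be kept as a small machine-readable registry, which makes it easy to pin exact model IDs between editions. The names, IDs, and tiers are taken from the list above; the structure and helper function are illustrative, not part of Docket's actual tooling.

```python
# The model-version table above as a small registry. Names, IDs, and
# tiers mirror the published list; the structure itself is illustrative.
MODELS = [
    {"name": "GPT-5.4",          "id": "gpt-5.4",          "tier": "flagship"},
    {"name": "Gemini 3.1 Pro",   "id": "gemini-3.1-pro",   "tier": "flagship"},
    {"name": "Claude Opus 4.6",  "id": "claude-opus-4-6",  "tier": "flagship"},
    {"name": "Grok 4.20",        "id": "grok-4.20",        "tier": "high-value"},
    {"name": "DeepSeek V3.2",    "id": "deepseek-v3.2",    "tier": "budget"},
    {"name": "Llama 4 Maverick", "id": "llama-4-maverick", "tier": "open-weight"},
    {"name": "Mistral Large 3",  "id": "mistral-large-3",  "tier": "open-weight"},
    {"name": "Qwen 3.5",         "id": "qwen-3.5",         "tier": "open-weight"},
]

def by_tier(tier: str) -> list[str]:
    """Return model names in a given pricing tier, in listed order."""
    return [m["name"] for m in MODELS if m["tier"] == tier]

print(by_tier("open-weight"))
```

Pinning the exact model ID (rather than just the display name) is what lets a future edition state precisely which snapshot improved or regressed.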

What's new this edition

This is the inaugural edition — eight frontier models scored on sixteen GTM tasks. Future editions will refresh the field as new models launch and call out where each one improved or regressed.

Docket Revenue AI Index

The benchmark stays open, and the product pitch stays soft. Visitors should leave with a clearer model decision first; Docket enters the frame only once that trust has been earned.


See how Docket applies GTM judgment to live revenue work.

Same brief in, same review standard out, full benchmark depth only when the visitor asks for it.