The Confidence Layer

Where is AI-addressable work a robust signal, and where is the concept itself contested?

Five models rated the same 17,344 tasks. They did not always agree.

This volume is about what that disagreement reveals and why the pattern matters more than the fact. The dataset established in Volume 0 treats the median model rating as the consensus, and every quantitative claim in this series rests on that consensus. Before those claims can carry any weight, the instrument has to be interrogated. Where models converge, the dataset has an internally stable signal. Where they diverge, the concept of AI-addressable work is being interpreted differently, and that difference is rarely random.

This volume maps where each condition holds. The stable signal is where subsequent instalments make their primary arguments. The divergence zones are used differently, and for reasons the data itself suggests.

The shape of disagreement

First, the good news: most disagreement is small, and most token-weighted demand sits in the stable part of the map.

In 63.6% of tasks, no model deviates from another by more than one band. These high-consensus tasks account for 68.3% of employment and 68.9% of baseline token-weighted demand, which means the stable core carries the bulk of the series' analytical load before a single macro argument has been made.

The contested tail is much smaller: tasks where models are separated by three or four bands account for 6.8% of tasks, 5.0% of employment, and 3.9% of baseline token-weighted demand. On the consensus measure, disagreement concentrates in tasks that are less economically weighty, though that share is partly shaped by the median method itself: a single high rating outvoted by four lower ones will produce a consensus band that understates what the most expansive model is seeing.
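The way the median absorbs a lone high rating can be sketched in a few lines. This is a minimal illustration, not the dataset's actual pipeline: it assumes the five rating labels map onto an ordinal 0 to 4 scale and that a task's band is the gap between its highest and lowest rating. The names `SCALE` and `consensus_and_band` are hypothetical. The ratings are those of the Farmers & Agricultural Managers example shown in this volume.

```python
from statistics import median

# Assumed ordinal mapping for the rating labels used in this volume.
SCALE = {"very low": 0, "low": 1, "medium": 2, "high": 3, "very high": 4}
LABELS = {score: label for label, score in SCALE.items()}

def consensus_and_band(ratings):
    """Return (median consensus label, disagreement band) for one task.

    `ratings` maps a model name to its rating label. The sketch assumes an
    odd number of models, so the median is always one of the given scores.
    """
    scores = [SCALE[label] for label in ratings.values()]
    consensus = LABELS[int(median(scores))]
    band = max(scores) - min(scores)  # gap between highest and lowest rating
    return consensus, band

# Ratings from the polar-split example in this volume.
farm_ratings = {
    "OpenAI": "low",
    "Google": "very high",
    "DeepSeek": "medium",
    "Anthropic": "very low",
    "xAI": "low",
}
print(consensus_and_band(farm_ratings))  # ('low', 4)
```

The consensus lands on "low" even though one model rates the task "very high", which is the understatement the paragraph above describes.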


Band 0 · Unanimous

Biostatisticians

"Write program code to analyze data with statistical analysis software."

OpenAI: medium
Google: medium
DeepSeek: medium
Anthropic: medium
xAI: medium

Band 1 · Near-consensus

Financial Quantitative Analysts

"Develop tools to assess green technologies or green financial products."

OpenAI: medium
Google: high
DeepSeek: medium
Anthropic: high
xAI: high

Band 2 · Moderate

Construction Managers

"Inspect or review projects to monitor compliance with building and safety codes."

OpenAI: low
Google: high
DeepSeek: medium
Anthropic: medium
xAI: medium

Band 3 · Sharp

General Internal Medicine Physicians

"Operate on patients to remove, repair, or improve functioning of diseased body parts."

OpenAI: high
Google: very high
DeepSeek: very high
Anthropic: low
xAI: low

Band 4 · Polar split

Farmers & Agricultural Managers

"Inspect facilities and equipment for signs of disrepair, and perform necessary maintenance work."

OpenAI: low
Google: very high
DeepSeek: medium
Anthropic: very low
xAI: low

% of tasks: Share of the 17,344 O*NET tasks in each band, an unweighted count of distinct work activities.

% of employment: Tasks reweighted by the number of workers who perform them, so larger occupations count for more.

% of token demand: Tasks weighted by estimated AI compute consumed, reflecting both worker count and token intensity of each task.
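The three weightings differ only in what each task contributes to its band's total. The sketch below illustrates this under stated assumptions: the field names (`band`, `workers`, `token_demand`) and the four rows are hypothetical, invented for the example, and are not drawn from the dataset.

```python
def band_shares(tasks, weight_key=None):
    """Share of total weight that falls in each disagreement band.

    With weight_key=None every task counts once (the "% of tasks" view).
    Otherwise each task contributes the value stored under weight_key.
    """
    totals = {}
    for task in tasks:
        weight = 1.0 if weight_key is None else float(task[weight_key])
        totals[task["band"]] = totals.get(task["band"], 0.0) + weight
    grand_total = sum(totals.values())
    return {band: w / grand_total for band, w in sorted(totals.items())}

# Hypothetical rows, invented purely for illustration.
tasks = [
    {"band": 0, "workers": 500, "token_demand": 700},
    {"band": 1, "workers": 300, "token_demand": 200},
    {"band": 1, "workers": 100, "token_demand": 50},
    {"band": 4, "workers": 100, "token_demand": 50},
]

print(band_shares(tasks))                  # {0: 0.25, 1: 0.5, 4: 0.25}
print(band_shares(tasks, "workers"))       # {0: 0.5, 1: 0.4, 4: 0.1}
print(band_shares(tasks, "token_demand"))  # {0: 0.7, 1: 0.25, 4: 0.05}
```

In this toy data the polar-split band holds a quarter of the tasks but only a twentieth of token demand, the same direction of effect the text reports for the contested tail.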

Aggregating from tasks to occupations moderates the extremes. Only 1.7% of occupations fall into a polar split, a two-point or greater gap between the highest and lowest model score, accounting for just 0.3% of employment. At the other end, true stability is also rare: only 2.2% of occupations have a model range below 0.5. The largest shares sit in the middle. Near-consensus occupations account for 36.2% of occupations and 47.3% of baseline token demand; moderate-spread occupations account for 45.2% of occupations and 39.3% of baseline token demand.

Stable: All models agree

Near-consensus: One step apart

Moderate: Two steps apart

Sharp: Three steps apart

Polar split: Full range

Stable · 19 occupations

2.2% of occupations

Elementary School Teachers, Secondary School Teachers, Dining Room Attendants

Near-consensus · 312 occupations

36.2% of occupations

Landscaping Supervisors, Landscaping Workers, Retail Salespersons

Moderate · 390 occupations

45.2% of occupations

Tree Trimmers & Pruners, Farmworkers & Crop Laborers, Stockers & Order Fillers

Sharp · 127 occupations

14.7% of occupations

Heavy Truck Drivers, Agricultural Equipment Operators, Construction Laborers

Polar split · 15 occupations

1.7% of occupations

Slaughterers & Meat Packers, Forest & Conservation Workers, Structural Iron & Steel Workers

% of occupations: Share of the 863 O*NET occupations in each band, with each occupation weighted equally regardless of size.

% of employment: Occupations weighted by total employed workers, giving more weight to large occupations like retail or nursing.

% of token demand: Occupations weighted by summed baseline token demand across all tasks, reflecting total estimated AI compute per occupation.

At the occupation level, aggregation lowers the drama without eliminating the uncertainty. Most occupation-level estimates are directionally informative. Very few are tight enough to bear strong claims without caveat.

What is driving the disagreement

Among the 1,180 tasks where model ratings differ by three or more bands, 814 are outlier-driven: removing a single model brings the remaining four into near-consensus. The main driver is Google Gemini. With all five models included, 6.8% of tasks fall into the high-disagreement zone. Excluding Gemini reduces that share to 0.4%.
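The outlier test described here is a leave-one-out check: drop each model in turn and see whether the remaining four fall into near-consensus. The sketch below is one way to express that, assuming ratings on a 0 to 4 ordinal scale and taking "near-consensus" to mean a gap of at most one step. The function name `single_outliers` and the first set of scores are hypothetical. The second set encodes the General Internal Medicine Physicians example shown earlier in this volume.

```python
def single_outliers(scores, near=1):
    """Models whose removal leaves the remaining ratings within `near` steps.

    `scores` maps a model name to an ordinal rating (0 = very low, 4 = very high).
    An empty result means no single model explains the disagreement.
    """
    found = []
    for name in scores:
        rest = [s for other, s in scores.items() if other != name]
        if max(rest) - min(rest) <= near:
            found.append(name)
    return found

# Hypothetical task: four models cluster low, one rates very high.
one_outlier = {"OpenAI": 1, "Google": 4, "DeepSeek": 1, "Anthropic": 0, "xAI": 1}

# The sharp-disagreement example from this volume: two camps, not one outlier.
two_camps = {"OpenAI": 3, "Google": 4, "DeepSeek": 4, "Anthropic": 1, "xAI": 1}

print(single_outliers(one_outlier))  # ['Google']
print(single_outliers(two_camps))    # []
```

A task like the second one stays contested whichever model is removed, which is why not every high-disagreement task counts as outlier-driven.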

That is a striking number, and it deserves to be treated as one.

It does not mean Gemini is wrong. Consider the boilermaker task: "Clean pressure vessel equipment, using scrapers, wire brushes, and cleaning solvents." Four models rate this as low-token. The physical act, scraping and brushing, is not meaningfully addressable by a language model alone. Gemini rates it high. The gap is quite understandable once you see what Gemini appears to be counting: checking safety procedures, retrieving maintenance history, updating the work order, logging compliance information, documenting the result. The workflow the task sits inside.

External examples also back up this reading. In field service, Microsoft’s Copilot already lets technicians describe completed work in natural language or speech and converts it into structured work-order updates. In maintenance, AI-assisted tagging has been used to extract information from unstructured wind-turbine work orders. In construction, vision-language systems are being tested to connect site observations to safety rules and generate inspection reports. Taken together, these examples show that the administrative trace of physical work is becoming AI-addressable.

This distinction runs through most of the high-disagreement cases. The models are drawing different boundaries around what the task includes. Whether that boundary falls at the described physical act or extends to the administrative layer surrounding it changes the token estimate substantially, and the dataset, built from O*NET task descriptions, cannot fully resolve which reading is correct.

For the series, this means the disagreement can be read as a signal about where the concept of AI-addressable work is genuinely contested, and therefore not yet ready to carry confident policy claims.

Where the token demand sits

The real question is where disagreement changes the economic story.

The scatterplot maps each occupation by two questions. First, how token-intensive does the job look per worker? Second, how much do the models disagree about that estimate? Bubble size adds a third dimension: whether the occupation is large enough to matter for global token demand.

Source: ricky-li.com

The relationship slopes slightly downward: the occupations with the highest tokens per worker tend, on average, to show lower model disagreement. This suggests that the model ecosystem is more internally stable when rating information-intensive work than when rating physical or operational work.

Four types of occupations emerge:

The high-intensity, lower-disagreement quadrant is the strongest foundation. These are jobs where token demand per worker is high and the models broadly agree: general and operations managers, lawyers, and software developers. The macro implication is faster cycle times, higher output expectations, and widening productivity gaps between firms that redesign workflows and firms that only add AI tools on top of old processes.

The lower-intensity, lower-disagreement quadrant is the quiet majority. These jobs may not be highly token-intensive per worker, but they matter because they employ many people. Small AI gains spread across very large occupations can still become macroeconomically material.

The lower-intensity, higher-disagreement quadrant is the scope-ambiguity zone. Many physical and operational jobs sit here. The models are disagreeing about what part of the job counts. Instead of performing the physical act, AI may take over the paperwork around it: work orders, safety notes, compliance checks, customer updates, scheduling and reporting. The near-term opportunity is less worker replacement than removal of administrative drag. The risk is that firms use the same tools to increase monitoring rather than unleashing productivity.

The high-intensity, higher-disagreement quadrant is the caution zone. These jobs look token-intensive, but the estimate is less stable. They could be treated as workflow-validation cases. The open questions are where AI can actually be integrated, where regulation or judgment blocks its use, and whether the models are rating real work or theoretical information content.
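The four-way split above amounts to two threshold tests per occupation. The sketch below shows the logic only. The cut-off values, the function name `quadrant`, and the sample inputs are all hypothetical, because the thresholds that separate "high" from "lower" are not stated in this volume.

```python
# Quadrant names as used in the text, keyed by
# (high intensity?, higher disagreement?).
QUADRANT_NAMES = {
    (True, False): "strongest foundation",
    (False, False): "quiet majority",
    (False, True): "scope-ambiguity zone",
    (True, True): "caution zone",
}

def quadrant(tokens_per_worker, model_range, token_cut, range_cut):
    """Place one occupation in a quadrant.

    token_cut and range_cut are assumed thresholds; the source does not
    say where the real boundaries sit.
    """
    high_intensity = tokens_per_worker >= token_cut
    higher_disagreement = model_range >= range_cut
    return QUADRANT_NAMES[(high_intensity, higher_disagreement)]

# Hypothetical cut-offs and inputs, in arbitrary units.
TOKEN_CUT = 1000.0
RANGE_CUT = 1.0

print(quadrant(2500.0, 0.6, TOKEN_CUT, RANGE_CUT))  # strongest foundation
print(quadrant(300.0, 0.4, TOKEN_CUT, RANGE_CUT))   # quiet majority
print(quadrant(200.0, 1.8, TOKEN_CUT, RANGE_CUT))   # scope-ambiguity zone
print(quadrant(1800.0, 1.5, TOKEN_CUT, RANGE_CUT))  # caution zone
```

Where the cut-offs are placed changes how many occupations land in each quadrant, so any shares computed this way depend on that choice.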

The map changes the role of disagreement in the article. Disagreement tells us how AI may enter different kinds of work: through the core information structure of professional roles, through scale in large occupations, through administrative relief in physical work, and through unresolved workflow constraints in the caution zone.

Interpreting the consensus with caution

The upper-left quadrant seems to be the strongest signal: high-token, lower-disagreement occupations account for 23.5% of occupations and 11.1% of employment, but 30.2% of baseline token demand. This is a concentrated and material part of the work economy.

But this promising statistic should be read with caution. It shows that the models broadly agree these jobs are token-intensive. It does not prove that firms will adopt AI well, that workers will use it effectively, or that token intensity will translate into output gains.

External evidence supports this distinction. In specific workflows, generative AI has produced measurable gains: professional writing tasks became faster and higher-quality in one experiment, and customer-support agents resolved more issues per hour in a firm deployment. But the broader productivity literature warns against treating task-level capability as aggregate productivity. AI gains depend on where the tool is deployed, how work is redesigned, whether workers use it effectively, and whether complementary systems are in place.

That is the distinction this volume needs to preserve: the high-token, lower-disagreement quadrant shows where the measurement is stable enough to carry forward. It does not settle the adoption question, the displacement question, or the productivity question.

Those questions belong to the rest of the series.

What the disagreement lets us ask next

This volume began with a problem: every estimate in the series rests on model ratings, and the models do not always agree. The disagreement map tells us which parts of the work economy can carry the next set of macro claims. The stable core is large enough to matter: most token-weighted demand sits where the models broadly converge. That is the part of the dataset that can support the main estimates in later instalments.

The contested zones carry a different kind of information. In physical and operational work, disagreement exposes a boundary problem: whether AI is being measured against the visible task or the administrative layer around it. In high-token but unstable occupations, disagreement marks the places where workflow-level validation is still needed. That distinction is what this volume contributes to the series. It separates measurement confidence from adoption, displacement and productivity. It shows where the signal is strong enough to use, where it is only directional, and where the concept of AI-addressable work itself is still unsettled.

Volume 2 can now move from confidence to concentration. If this volume asks where the signal is stable, the next asks where that stable signal accumulates: which jobs and sectors generate the structural pull for compute, and what kind of AI capital demand the existing work economy may already be creating.

Cover photo by Bob Brewer on Unsplash