How to Measure AI from Work Up

A methodological foundation for measuring AI’s economic reach from the structure of work upward.

There are now several ways to measure AI’s economic reach. Capability benchmarks track what models can do. Adoption surveys count how many firms use AI tools. Labour exposure scores rate which occupations contain tasks that AI can currently perform. Compute investment data aggregates infrastructure and chip spending.

Each approach measures something real. None connects the others. Capability ratings don’t say how many workers perform the relevant tasks. Exposure scores don’t translate into infrastructure requirements. Adoption surveys don’t reveal which segments of the workforce are driving compute demand.

This series uses a different measurement altogether. It asks: if AI were deployed across existing work at the intensity current systems appear capable of, what volume of token-processing would that work require? The answer, which the series calls AI-addressable demand, requires connecting global employment, occupational task structure, and AI token intensity simultaneously. No existing data source performs that connection.

AI-addressable demand is the theoretical volume of token-processing work that would be generated if every covered occupation’s tasks were AI-supported at the estimated intensity, performed at their observed frequency, for every employed worker globally. It measures work that could require token processing, not work that already does. Think of it as a load calculation: the kind an electrical engineer performs before deciding where to run the grid.

This is Volume 0 of the series “Mapping the AI Economy, One Token at a Time.”

The Finding That Organizes the Series

One structural result belongs at the front because it shapes every subsequent instalment.

AI-addressable demand is determined by two forces that are analytically independent: how token-intensive a task is per AI-supported instance, and how many workers perform it how often. High intensity applied to moderate employment produces the same aggregate demand as moderate intensity applied to massive employment. But the two routes identify entirely different occupational families, with different implications for infrastructure, trade, wages, and adoption dynamics.

The intensity route leads where most AI discourse points: clinical and legal work. Pathologists generate an estimated 4.4 million baseline tokens per worker per year; Emergency Medicine Physicians 4.3 million; Physician Assistants 4.2 million; Judges and Magistrates 3.9 million; Film and Video Editors 3.9 million. Their task profiles (extended case analysis, multi-document synthesis, complex professional reasoning) are genuinely intensive. But their global employment is modest, which limits their share of aggregate demand.

The scale route leads somewhere different. Building and Grounds occupations account for 12.0% of global baseline demand, the largest single major-group share. Sales accounts for 9.1%, Management 7.7%, Healthcare Practitioners 7.0%. Computer and Mathematical occupations contribute 2.8%. Legal, 3.3%.

The implication runs through the entire series: AI’s macroeconomic footprint will be shaped by where task intensity, task repetition, and employment scale converge, and that intersection does not fall where conventional AI discourse has mostly been looking.

Building the Instrument

Layer one: Global employment.

The ILO publishes modelled employment estimates by ISCO-08 sub-major group, broad two-digit categories like “Business and Administration Professionals” or “Metal, Machinery and Related Trades Workers.” Detailed task data, however, lives at the eight-digit O*NET-SOC level. Six classification levels separate them.

The methodology bridges this gap by treating US occupational composition as a proxy for within-group global composition. US Bureau of Labor Statistics Occupational Employment and Wage Statistics data establishes, within each ISCO sub-major group, the employment share each detailed occupation represents. Those shares are then applied to the ILO’s global employment envelope for each group, so if an occupation represents 30% of a given ISCO group’s US employment, it is allocated 30% of that group’s global employment. A chain of four official concordance files connects the classifications: ISCO-08 to SOC 2010, SOC 2010 to SOC 2018, and SOC 2018 to O*NET-SOC 2019. The result covers 863 O*NET-SOC occupations representing approximately 2.84 billion workers globally.
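The allocation step can be sketched in a few lines. This is a minimal illustration, assuming each detailed occupation maps to exactly one ISCO group; the function name, occupation labels, and employment figures are hypothetical and are not taken from the dataset.

```python
def allocate_global_employment(us_employment, global_group_total):
    """Split one ISCO group's global employment across its detailed
    occupations in proportion to their US employment shares."""
    us_total = sum(us_employment.values())
    if us_total <= 0:
        raise ValueError("group has no US employment to derive shares from")
    return {
        occupation: global_group_total * count / us_total
        for occupation, count in us_employment.items()
    }


# Hypothetical ISCO group with three detailed occupations (US counts).
us_counts = {
    "occupation_a": 300_000,
    "occupation_b": 500_000,
    "occupation_c": 200_000,
}
allocated = allocate_global_employment(us_counts, global_group_total=50_000_000)

# occupation_a holds 30% of the group's US employment, so it receives
# 30% of the group's global employment.
print(allocated["occupation_a"])
```

Because the shares are computed within each group, the allocated figures always sum back to the ILO envelope for that group; only the split inside the group depends on the US proxy.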

This approach has a well-defined limitation. The US-proxy assumption is most defensible for OECD economies with similar labour market structures; it is most fragile where the same broad ISCO category describes structurally different work. This affects detailed occupational rankings in the global aggregate, not only country-level estimates. The dataset is strongest for structural patterns across occupational families. It is weaker for precise rankings of individual occupations. It is weakest for country-level estimates in labour markets unlike the United States.

One specific result carries this caveat directly. First-Line Supervisors of Landscaping, Lawn Service, and Groundskeeping Workers receives an estimated global employment of 204 million workers, the single largest occupational figure in the dataset. This is a US-specific O*NET category sitting within a broad ISCO group that encompasses a much wider range of outdoor supervisory roles globally. Whether that allocation reflects genuine global employment in this specific role or is a crosswalk artifact cannot be resolved from the data alone. The structural pattern it represents, that scale-route occupations dominate aggregate demand, does not depend on this single figure. The figure itself should not be treated as a precise count.

A China coverage adjustment, a uniform multiplier of 1.25 derived from the ratio of ILO world employment (approximately 3.67 billion) to non-China world employment (approximately 2.94 billion), is applied because the ILO’s occupational dataset underrepresents China at the sub-major group level. This blunt correction scales all occupational estimates upward uniformly, so it leaves the relative shares of occupations unchanged. It should be read as an adjustment for under-coverage, not as information about China’s occupational mix.
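The arithmetic behind the multiplier is easy to check. The sketch below uses the two approximate employment totals quoted above; the helper name and the sample estimates are hypothetical.

```python
ilo_world_employment = 3.67e9        # approximate, from the text
non_china_world_employment = 2.94e9  # approximate, from the text

coverage_multiplier = ilo_world_employment / non_china_world_employment
print(round(coverage_multiplier, 2))  # 1.25


def apply_coverage_adjustment(estimates, multiplier):
    """Scale every occupational estimate by the same factor."""
    return {occ: value * multiplier for occ, value in estimates.items()}


# Hypothetical pre-adjustment estimates for two occupations.
raw = {"occupation_a": 100.0, "occupation_b": 300.0}
adjusted = apply_coverage_adjustment(raw, 1.25)

# A uniform multiplier raises levels but leaves ratios between occupations intact.
print(adjusted["occupation_b"] / adjusted["occupation_a"])
```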

Layer two: Task structure.

For each of the 863 covered occupations, O*NET provides task descriptions alongside importance ratings (scale 1-5), relevance scores (0-100), and frequency distributions across seven categories. The dataset covers 17,344 occupation-task pairs, approximately 20 tasks per occupation on average.

Task weight is computed as importance multiplied by relevance divided by 100, then normalised within each occupation to sum to 1.0. This normalisation is a key design choice: every occupation occupies the same total weight-space regardless of how many tasks it lists. An occupation with thirty tasks does not outrank one with five tasks of equivalent average intensity simply because it has more of them. Inflation by task count is a failure mode that unnormalised approaches would be vulnerable to.
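A minimal sketch of the weighting rule, assuming each task is represented as an (importance, relevance) pair; the function name and sample ratings are illustrative only.

```python
def task_weights(tasks):
    """tasks: list of (importance on 1-5, relevance on 0-100) for one occupation.
    Returns weights that sum to 1.0 within the occupation."""
    raw = [importance * relevance / 100 for importance, relevance in tasks]
    total = sum(raw)
    if total == 0:
        raise ValueError("occupation has no task with non-zero weight")
    return [value / total for value in raw]


# Two hypothetical occupations with identical per-task ratings
# but different task counts.
five_tasks = task_weights([(4.0, 90.0)] * 5)
thirty_tasks = task_weights([(4.0, 90.0)] * 30)

# Both occupations occupy the same total weight-space after normalisation.
print(round(sum(five_tasks), 6), round(sum(thirty_tasks), 6))
```

Without the final division, the thirty-task occupation would carry six times the raw weight of the five-task one purely because of its task count.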

Frequency is converted to expected annual task instances, ranging from 1 per year for the least frequent category to 500 for the most frequent, with daily tasks assigned 180 instances. This scale is intentionally approximate: an order-of-magnitude ordering rather than a precise count. What depends on it is the relative weight given to tasks that are performed constantly versus those performed rarely; the specific annual figures affect absolute token totals but do not alter the relative ranking of occupations, which is where the analytical weight sits.
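One way to read "expected annual task instances" is as a weighted average over the frequency distribution. The sketch below takes that reading as an assumption. Only three of the seven values (1, 180, and 500) come from the text; the other four are placeholders chosen to keep the scale monotonic, and the category labels are illustrative.

```python
ANNUAL_INSTANCES = {
    "yearly_or_less": 1,         # from the text: least frequent category
    "more_than_yearly": 4,       # placeholder, not from the source
    "more_than_monthly": 24,     # placeholder, not from the source
    "more_than_weekly": 90,      # placeholder, not from the source
    "daily": 180,                # from the text
    "several_times_daily": 360,  # placeholder, not from the source
    "hourly_or_more": 500,       # from the text: most frequent category
}


def expected_annual_instances(distribution):
    """distribution: category -> percentage of respondents reporting it.
    Returns the distribution-weighted annual instance count."""
    total = sum(distribution.values())
    if total <= 0:
        raise ValueError("frequency distribution is empty")
    weighted = sum(ANNUAL_INSTANCES[cat] * pct for cat, pct in distribution.items())
    return weighted / total


# Hypothetical task: half of respondents perform it daily, half hourly or more.
print(expected_annual_instances({"daily": 50, "hourly_or_more": 50}))  # 340.0
```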

Layer three: Token intensity.

Five major language model providers, OpenAI, Google Gemini, DeepSeek, Anthropic Claude, and xAI Grok, each rated all 17,344 occupation-task pairs using an identical prompt. Each model classified the token intensity of one completed AI-supported workflow instance (including input context, output generation, and typical revision) across five bands: very low, low, medium, high, very high.

The consensus is the median ordinal value across valid provider ratings. Median rather than mean prevents a single extreme rating from distorting the consensus. Equal weighting across providers reflects the absence of calibration data that would justify ranking model accuracy on this specific task; no external benchmark exists for occupation-level token intensity ratings.
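The consensus rule can be sketched as follows. Provider keys and ratings are hypothetical, and the tie-breaking choice for an even number of valid ratings (taking the lower middle value via `statistics.median_low`) is an assumption; the text does not say how such ties are resolved.

```python
from statistics import median_low

BANDS = ["very low", "low", "medium", "high", "very high"]


def consensus_band(ratings):
    """ratings: provider -> band label, or None when the rating is invalid.
    Returns the median band across valid ratings, or None if there are none."""
    ordinals = [BANDS.index(r) for r in ratings.values() if r in BANDS]
    if not ordinals:
        return None
    return BANDS[median_low(ordinals)]


# One extreme rating ("very high") does not move the median.
ratings = {
    "provider_1": "low",
    "provider_2": "medium",
    "provider_3": "medium",
    "provider_4": "medium",
    "provider_5": "very high",
}
print(consensus_band(ratings))  # medium
```

A mean over the same ordinals would land at 2.2, above "medium", pulled up by the single outlier; the median stays on the band most providers chose.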

The resulting distribution: medium (42% of task-pairs), low (38%), very low (12%), high (8%), very high (<1%). Weighted by baseline token demand: medium (44%), low (36%), high (15%). The mass of AI-addressable demand sits in low-to-medium intensity tasks.

Of 17,270 task-pairs with complete five-model ratings, 8.5% achieve full consensus across all providers. A further 55% have a spread of one ordinal step. Approximately 6.8% have a spread of three or more steps. These high-disagreement tasks cluster at a recognisable boundary: tasks where a physical performance component and an administrative or analytical component coexist, leaving the models to disagree about which governs the token demand. That disagreement pattern is the subject of Volume 1 of this series.
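Spread here is read as the distance between the highest and lowest band any provider assigned, which is an assumption about the exact definition. The sketch below computes the share of task-pairs at each spread for a handful of hypothetical ratings, using ordinals 0 (very low) to 4 (very high).

```python
from collections import Counter


def ordinal_spread(ordinals):
    """Distance in ordinal steps between the highest and lowest rating."""
    return max(ordinals) - min(ordinals)


def spread_shares(rated_pairs):
    """rated_pairs: list of per-task rating lists (one ordinal per provider).
    Returns spread -> share of task-pairs with that spread."""
    counts = Counter(ordinal_spread(r) for r in rated_pairs)
    n = len(rated_pairs)
    return {spread: count / n for spread, count in sorted(counts.items())}


# Four hypothetical task-pairs, each rated by five providers.
rated_pairs = [
    [2, 2, 2, 2, 2],  # full consensus
    [1, 2, 2, 2, 2],  # one step apart
    [1, 1, 2, 2, 2],  # one step apart
    [0, 1, 2, 3, 3],  # three steps apart: a high-disagreement task
]
print(spread_shares(rated_pairs))  # {0: 0.25, 1: 0.5, 3: 0.25}
```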

Token Intensity in 3 Scenarios

Token intensity bands are ordinal. Converting them to token counts requires scenario anchors for which no externally validated source exists. Under the baseline scenario: very low = 500 tokens; low = 2,000; medium = 8,000; high = 32,000; very high = 128,000. Conservative halves each value; expansive doubles each.

Token-addressable demand for each occupation-task pair: global employment × task weight × annual frequency × token intensity value.
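The product can be sketched directly using the baseline anchors given above. The occupation, its employment figure, and its three tasks are hypothetical; only the band-to-token mapping comes from the text.

```python
BASELINE_TOKENS = {
    "very low": 500,
    "low": 2_000,
    "medium": 8_000,
    "high": 32_000,
    "very high": 128_000,
}


def pair_demand(global_employment, task_weight, annual_frequency, band):
    """Token-addressable demand for one occupation-task pair."""
    return global_employment * task_weight * annual_frequency * BASELINE_TOKENS[band]


def tokens_per_worker(tasks):
    """tasks: list of (task_weight, annual_frequency, band) for one occupation."""
    return sum(w * f * BASELINE_TOKENS[band] for w, f, band in tasks)


# Hypothetical occupation: 1 million workers, three tasks whose weights sum to 1.0.
tasks = [
    (0.5, 180, "medium"),
    (0.3, 24, "high"),
    (0.2, 1, "very high"),
]
per_worker = tokens_per_worker(tasks)
occupation_total = sum(pair_demand(1_000_000, w, f, band) for w, f, band in tasks)

print(round(per_worker))        # roughly 976,000 tokens per worker per year
print(f"{occupation_total:.3e}")
```

The per-worker figure isolates the intensity route; multiplying by employment adds the scale route, which is why the two can be ranked separately.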

Global totals: conservative approximately 7.7 × 10¹⁴ tokens; baseline approximately 1.5 × 10¹⁵; expansive approximately 3.1 × 10¹⁵, a 4x range. Because the multiplier is uniform, occupational rankings are stable across scenarios. What the three scenarios do not test is band-spacing uncertainty: whether medium intensity genuinely requires 4x the tokens of low, and high requires 4x medium. If those ratios changed, rankings could shift, because occupations differ in their distribution across intensity bands. Resolving that uncertainty requires empirical calibration, running representative tasks through live APIs, recording actual token usage, and comparing against the scenario anchors. That calibration has not been performed. It is the primary open methodological question.
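The difference between a uniform multiplier and a change in band spacing can be shown with a toy case. Both occupations and the "steeper" anchor set below are hypothetical; the steeper set widens the ratio between adjacent bands from 4x to 8x above low purely for illustration.

```python
BASELINE = {"very low": 500, "low": 2_000, "medium": 8_000,
            "high": 32_000, "very high": 128_000}


def scale(anchors, factor):
    """Apply one uniform multiplier to every band anchor."""
    return {band: tokens * factor for band, tokens in anchors.items()}


def rank(occupations, anchors):
    """occupations: name -> list of (weight x frequency, band).
    Returns names ordered by per-worker token demand, highest first."""
    totals = {
        name: sum(load * anchors[band] for load, band in tasks)
        for name, tasks in occupations.items()
    }
    return sorted(totals, key=totals.get, reverse=True)


occupations = {
    "frequent_low_intensity": [(300, "low")],  # 600,000 tokens at baseline
    "rare_high_intensity": [(15, "high")],     # 480,000 tokens at baseline
}

conservative, expansive = scale(BASELINE, 0.5), scale(BASELINE, 2)
# Hypothetical alternative spacing: 8x steps between bands above "low".
steeper = {"very low": 500, "low": 2_000, "medium": 16_000,
           "high": 128_000, "very high": 1_024_000}

print(rank(occupations, BASELINE) == rank(occupations, conservative)
      == rank(occupations, expansive))  # True: uniform scaling keeps the order
print(rank(occupations, steeper))       # the order flips under wider spacing
```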

The Distribution

Demand is highly concentrated. The top 5% of occupations, 44 out of 863, account for 54.5% of global baseline demand. The top 10% account for 68.3%. This concentration is stable across all three scenarios and is not an artifact of any single large occupational estimate; it holds when the landscaping supervisor figure is treated with appropriate skepticism.

The curve below makes the same point in shape: a steep initial climb that captures most of the distribution before flattening into a long, nearly horizontal tail.

The dashed diagonal shows what the curve would look like if all 863 occupations contributed equally — at 10% of occupations, exactly 10% of demand. The actual curve's distance above it shows how far the real distribution departs from that baseline.
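A concentration curve of this kind can be computed by sorting occupations from largest to smallest and accumulating their shares. The sketch below uses ten hypothetical demand values. Rounding the top-k count upward is an assumption, chosen because it reproduces the "44 out of 863" figure quoted for the top 5%.

```python
import math


def concentration_curve(demands):
    """Cumulative share of total demand, occupations sorted largest first."""
    ordered = sorted(demands, reverse=True)
    total = sum(ordered)
    shares, running = [], 0.0
    for value in ordered:
        running += value
        shares.append(running / total)
    return shares


def top_share(demands, fraction):
    """Share of total demand held by the top `fraction` of occupations."""
    k = math.ceil(len(demands) * fraction)  # assumption: round the count up
    return sum(sorted(demands, reverse=True)[:k]) / sum(demands)


# Ten hypothetical occupations with a skewed demand distribution.
demands = [50, 20, 10, 5, 5, 4, 3, 1, 1, 1]
curve = concentration_curve(demands)

print(top_share(demands, 0.1))    # 0.5: one occupation holds half the demand
print(math.ceil(863 * 0.05))      # 44, matching the top-5% count in the text
```

Under perfectly equal contributions every step of the curve would add the same share, which is the diagonal the actual curve is compared against.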

What This Instrument Now Makes Possible

A structural ceiling on token-addressable demand is not, by itself, an economic finding. It becomes one when it is used to locate pressure inside the economy before that pressure surfaces in any aggregate statistic.

The instalments that follow use this structural map to pursue a chain of consequential questions. Where is demand concentrated geographically, and how reliable are those signals? If even a fraction of this ceiling were realised, what does that imply for capital investment, and is the shock inflationary or productivity-enhancing? Where does the structural ceiling exist but cannot be reached, because the infrastructure, institutions, or absorptive capacity are absent? If AI-mediated services become internationally tradeable at scale, who runs a structural deficit in intelligence before any balance of payments account captures it? How does AI reshape the economics of offshore work rather than simply eliminating it? And where will the adoption lag be longest, precisely because the conditions for adoption don’t yet exist?

These are macroeconomic questions that cannot be answered from capability benchmarks, adoption surveys, or compute investment data. They are answerable from a map of where the structure of existing work creates the conditions for AI to matter, which is what this dataset provides.

The absolute token totals are illustrative. The structural patterns, the concentration, the scale-intensity decomposition, the occupational distribution, are the instrument. The rest of the series is what it measures.

Cover photo by Sven Mieke on Unsplash