What We Actually Know: A Critical Survey of the Enterprise AI Evidence Base

Working paper, version 2, updated June 2026.

Summary

The dominant discourse on enterprise AI deployment is voluminous, confident, and largely produced by parties with commercial interests in its conclusions. Consulting firms publish surveys that recommend the services they sell. Vendors publish research that vindicates their products. Academic studies of named deployments are conducted in partnership with the firms being studied. When this material is set aside, the genuinely independent evidence paints a narrower, more sober, and more historically familiar picture: a powerful new general-purpose technology has arrived, adoption is rising rapidly, measurable productivity effects are small, and the mistakes visible in independently documented cases are not novel AI problems but recurrences of well-understood failure modes from earlier automation deployments. The aim of this paper is not to offer confident practitioner recommendations but to provide a framework for reading the evidence honestly during a period when independent research has not yet caught up with the technology.

1. The Question We Cannot Yet Answer

Toward the end of 2025, a senior executive at a large bank told a reporter that her organisation was about to spend several hundred million dollars on enterprise AI deployment over the next two years. Asked what evidence she had drawn on in making the case to her board, she named four sources: the McKinsey State of AI survey, the Boston Consulting Group AI Value Gap report, a study from a major university research centre on AI maturity, and a vendor case study from a peer institution. Each of these is a competent piece of work in its own terms. Each was widely circulated. Each contained numbers that supported the investment thesis being defended.

None of them was independent of the parties whose conclusions they were used to support.

This is not a story about that executive, who appears to be doing her job in the manner her role expects, drawing on the most-cited evidence available to her. It is a story about the evidence base she had to work with. The dominant discourse on enterprise AI deployment in 2024 to 2026 is enormous in volume and substantial in apparent authority. It is also, on close examination, predominantly produced by parties with commercial interests in particular conclusions. Consulting firms publish reports that recommend the services they sell. Technology vendors publish research that vindicates their products. Academic case studies of high-profile deployments are conducted in partnership with the deploying firms, with the firms holding approval rights over publication. University research centres that produce influential maturity models and segmentation frameworks are funded substantially by member organisations that include the firms being studied. Even the apparently independent labour-economics and information-systems literatures, when one reads the acknowledgments rather than just the abstracts, turn out to contain industry funding at varying levels of remove.

A serious working paper on enterprise AI deployment cannot pretend this is incidental. The composition of the evidence base is not a footnote; it is the substantive condition under which any inquiry into the field has to operate. A paper that ignores the problem and draws on the dominant sources without flagging them will reproduce the narratives those sources are designed to reproduce. A paper that takes the problem seriously has to do something different. It has to confront the evidence base directly, distinguish what is genuinely independent from what is not, and accept that its conclusions will be narrower and less confident than the consulting-firm literature would suggest.

This paper takes the second path. It is, in the first instance, a paper about what we can and cannot know about enterprise AI deployment given the evidence base that actually exists. It is, in the second instance, a synthesis of what the genuinely independent literatures show: government statistical work on adoption and productivity, public-sector inquiries into algorithmic failure, pre-LLM theoretical and historical research on technology adoption and organisational change, and the small set of peer-reviewed academic papers that are conducted at arm’s length from the firms whose deployments they discuss. It is, in the third instance, an argument that the current moment is more historically familiar than the dominant narratives suggest, and that what the independent evidence does show points to a less dramatic and more sobering picture of enterprise AI than the consulting literature paints.

I should say plainly what this paper is not. It is not a guide to enterprise AI deployment. It is not a recommendation that organisations should or should not invest in AI. It does not segment the population of enterprises into leaders and laggards, because the segmentations that exist in the literature come from sources that fail the independence standard I apply here, and the genuinely independent evidence does not yet support such segmentations. It does not predict the future, because the historical literature on general-purpose technologies suggests that productivity effects from such technologies typically take a decade or more to become measurable, and we are roughly three years into the LLM era. It does not adjudicate between optimistic and pessimistic accounts of AI’s potential, because the evidence to adjudicate does not yet exist.

What the paper does do, I hope, is to clear some ground. It says what we actually know, what we do not yet know, and why so much of the apparent knowledge in the field is less reliable than it appears. It identifies a small number of patterns that are visible in the independent evidence and that are worth taking seriously. And it offers a way of reading the current discourse that allows a careful observer to distinguish what is supported by evidence from what is supported only by interested framings.

The paper proceeds in seven further sections. Section 2 documents the evidence problem in enterprise AI research and lays out the standard I apply for treating sources as evidence. Section 3 surveys what the genuinely independent literature shows about AI adoption, productivity effects, and the broader technological moment. Section 4 examines the small set of AI deployment cases where forced disclosure under adversarial conditions has produced real evidence, almost all of which come from public-sector contexts where parliamentary inquiries, royal commissions, and legal proceedings have compelled organisations to surrender information they would otherwise have controlled. Section 5 draws out the recurring patterns visible across these cases. Section 6 considers what these patterns suggest about how enterprises are actually deploying AI in 2024 to 2026, drawing on the historical literature on general-purpose technology adoption to set the current moment in context. Section 7 offers a framework for reading enterprise AI evidence honestly, intended for practitioners and observers who have to make decisions during a period when the independent research apparatus has not yet caught up with the technology. Section 8 is the honest reckoning with limitations and open questions.

2. The Evidence Problem

To make sense of the current discourse on enterprise AI deployment, one has to start by mapping the field of research that produces it. The mapping I offer here is the product of several months of reading across the available material, and it converges on a finding I had not expected to make when I began: corporate involvement in enterprise AI research is not concentrated in obvious places but is structurally distributed across most of the field, in forms that vary from explicit commercial sponsorship to subtle institutional shaping of research agendas. I want to describe these forms carefully because the right response to them differs by category.

The most obvious form of corporate involvement is direct vendor and consulting-firm research. The major consultancies publish annual or quarterly reports on AI deployment that have come to function as the field’s reference data. McKinsey’s State of AI surveys, the Boston Consulting Group’s AI Value Gap reports, Deloitte’s State of Generative AI in the Enterprise series, the Accenture and IBM Institute for Business Value reports. These are produced with serious methodologies, often with substantial sample sizes, often with careful attention to question design. They are also commercial outputs. The framings of the findings, the segmentations of the populations, and the implicit recommendations are constructed in ways that align with the publishing firm’s interests. A McKinsey report finds that high-performing organisations behave in ways McKinsey recommends. A BCG report finds that “future-built” companies invest in ways BCG advises clients to invest. A process-mining vendor’s report finds that process visibility is the precondition for AI value. These are not surprises; they are the structure of commercial research. The honest treatment is to set this material aside almost entirely, citing it occasionally as evidence of what executives report to surveyors but never as evidence of what is true about deployment outcomes.

A second form of corporate involvement, more difficult because it presents the surface of academic research, is what I will call vendor-academic partnership research. A firm grants a research team access to its data on the condition that the firm is studied as the case, the research is conducted with proper academic methods, and the results are published in a peer-reviewed journal. The Brynjolfsson, Li, and Raymond customer-support study eventually published in the Quarterly Journal of Economics is one example. The Dell’Acqua, McFowland, Mollick, Lifshitz-Assaf, Kellogg, Rajendran, Krayer, Candelon, and Lakhani study of Boston Consulting Group consultants is another. The Dillon, Jaffe, Immorlica, and Stanton field experiment on Microsoft 365 Copilot is a third. Microsoft researchers are co-authors on the last of these. The BCG study has BCG researchers as co-authors and acknowledges Karim Lakhani as a BCG advisor. The Brynjolfsson study does not name the firm, but the firm was a partner with approval over publication. The methods in these studies are real, the data is real, the analysis is competent. They are not, however, arm’s-length observations of the deployments they describe. The firms had a stake in the outcomes being publishable and the framings being favourable. The selection effects in which deployments get studied, in which conditions, with which measurements, and which results get foregrounded, all run through the firms’ interests. A careful paper has to treat these as case material from cooperating organisations rather than as independent evidence of what AI deployments produce.

The third form of corporate involvement, and the one I had been least alert to before this project, is research-centre member funding. Many of the academic research centres that produce influential work on enterprise AI are funded substantially by the firms whose deployments they study. The MIT Center for Information Systems Research operates on a member-organisation funding model that includes the major US banks, technology firms, consulting firms, and large corporates. The research produced is not commissioned in a paid-research sense; the institutional model is closer to a subscription consortium. The output is rigorous within its own terms, and the individual researchers are competent and conscientious. The institutional incentives are nonetheless real. A research centre whose funding depends on member renewal does not produce work that would alienate its members. The framings tend toward “your organisation can become a leader by investing in these capabilities,” because that is what members want to hear. The findings tend to validate the kind of investment the members are already making. This is a softer form of corporate involvement than direct vendor research, but it is corporate involvement, and a careful paper has to account for it.

The fourth form, harder still to spot because it appears in the acknowledgments rather than the author affiliations, is industry-funded academic research conducted by faculty at independent universities and published in peer-reviewed venues. The affiliation looks clean; the methodology looks academic; only the funding declaration reveals the involvement. Daron Acemoglu’s macroeconomic work on AI productivity acknowledges Google funding through the MIT Shaping the Future of Work initiative. David Autor’s “Applying AI to Rebuild Middle Class Jobs,” which would be a natural foundational source for a paper of this kind, discloses funding from Google, the William and Flora Hewlett Foundation, the NOMIS Foundation, and the Smith Richardson Foundation. The Bick, Blandin, and Deming “Rapid Adoption of Generative AI” paper, eventually published in Management Science, was funded in part by the Walmart Foundation. The pattern repeats across the major NBER working papers on AI deployment. The funding may have no effect on the conclusions; competent academics with established reputations are unlikely to shade their findings for a grant. The funding nonetheless shapes the questions asked and the topics pursued, and the cumulative effect of an entire field’s research agenda being partly funded by the firms whose products are being studied is not negligible.

The fifth form is more subtle still: conference and special-issue capture. Academic journals run special issues on emerging topics; conferences run special tracks; the editors and chairs who organise these have industry connections; and the call-for-papers framings reflect the interests of the field as it has been shaped by industry funding. The 2021 MIS Quarterly special issue on Managing AI is competent academic work; the papers in it are mostly written by independent academics; but the framing of “managing AI” as an organisational and managerial problem, rather than a question of whether AI should be deployed at all in certain settings, is itself a framing that benefits the industry by treating deployment as the default and management as the question. This is not the editors’ fault. The result, however, is that the literature is shaped by what the field’s centre of gravity has come to consider worth studying, and that centre of gravity is itself partly an artefact of who funds the work.

The sixth form sits outside academic research but interacts with it: business journalism. The major business publications cover AI deployment, but very little of what they produce is independent observation. The bulk is rewriting of corporate communications, with quotes from executives, framings from public-relations teams, and occasional commentary from sources the deploying firm has approved. There are exceptions: occasional investigative pieces, the small number of reporters who have built genuine independent sources, and the work of certain magazines that still do long-form reporting. The exceptions are rare and valuable. The default condition of business journalism on enterprise AI is closer to laundered corporate self-presentation than to independent observation.

The six forms together can be set out as a spectrum, which is how I will treat them through the rest of the paper. Figure 1 places each form on a single axis running from fully independent material to direct commercial output, and indicates how each category is treated.

Figure 1: The spectrum of source independence in enterprise AI research. Six forms of corporate involvement, arranged from least to most compromising of evidentiary value.

Putting these six forms together, what survives a careful application of an independence standard is much narrower than the apparent volume of the field would suggest. Five categories of material remain genuinely usable as evidence.

Government statistical work is the first, and the cleanest. The US Census Bureau’s Business Trends and Outlook Survey, the Annual Business Survey, the Federal Reserve’s adoption-monitoring work, Eurostat’s surveys, the OECD’s working papers, and the various national statistical offices’ work on AI use. This material is methodologically careful, authoritative as far as it goes, and free of commercial entanglement. It tells us about prevalence of AI use, about the firm characteristics correlated with adoption, and (with appropriate caveats) about early aggregate productivity observations. It does not tell us about deployment outcomes at the firm level or about what does and does not work.

Independent inquiries into public-sector algorithmic failures are the second. The Australian Royal Commission into the Robodebt Scheme, the Dutch parliamentary inquiry into the toeslagenaffaire, the various US state-level investigations into criminal-justice and welfare algorithms, the UK exam-grading inquiry following the A-level algorithm failure of 2020. These are forced disclosures under adversarial conditions, produced by institutions whose job is to extract evidence against the will of the organisations involved. They are, in my view, the highest-quality empirical evidence in the entire AI-in-organisations literature, and they are systematically under-cited in the management research because they sit in a different intellectual tradition (public administration, law, policy studies).

Pre-LLM era foundational theoretical work is the third category. The absorptive-capacity literature beginning with Cohen and Levinthal. The sociotechnical-systems literature beginning with Trist. The IT productivity paradox literature from Brynjolfsson and others in the period before AI funding became pervasive. Paul David’s historical work on electrification and the productivity paradox of the dynamo. James Scott on legibility. Walsh and Ungson on organisational memory. Nelson and Winter on evolutionary economics. This material is genuinely independent because it was produced before AI became the dominant industry concern, and its application to the current moment requires interpretive work but does not depend on observation of the present.

Independent algorithmic accountability research is the fourth. Work coming out of research centres without industry sponsorship, or with sponsorship structured to insulate the research from funder influence. Parts of the AI Now Institute. Parts of the Ada Lovelace Institute. The Alan Turing Institute’s public-interest work. Academic research on automated decision-making in welfare, criminal justice, and immigration enforcement, where the researchers have no relationship with the deploying organisations. The volume is smaller than the management literature but the rigour is higher on the questions it addresses.

Independent peer-reviewed academic work in management and information-systems venues is the fifth category, and the smallest. Papers published in MIS Quarterly, Information Systems Research, Organization Science, the Academy of Management Journal, the Strategic Management Journal, and similar venues, where authors are at independent universities, no industry funding is declared, and no firm partnerships are acknowledged. This category exists but is a minority of the AI-specific publications in these journals over the last three years; most published AI work in these venues has at least some industry funding or partnership in the acknowledgments.

I do not want to overstate the cleanliness of these five categories. Government statistical work is shaped by what governments choose to measure, which is itself shaped by political and economic interests. Public inquiries vary in quality and independence. Pre-LLM theoretical work has the limitation that it predates the technology it is being used to think about. Independent academic work is rare in proportion to the field. The point is not that these categories are pristine but that they are qualitatively different from the consulting-firm and vendor-partnered material that dominates the discourse. They sit at the cleaner end of a spectrum, and a careful paper can lean on them in ways it cannot lean on the rest.

I should also be honest that I am applying a spectrum here rather than a binary cutoff. At one end is fully independent material, the kind I have just described. In the middle is academic work conducted at independent institutions by competent researchers, published in peer-reviewed venues, with some industry funding declared but no firm partnership and no commercial sponsorship of the specific study. The Acemoglu and Autor papers fall here. So do many of the MIS Quarterly and Organization Science papers on AI. This material is usable with explicit acknowledgment of the funding pattern, but it should not be treated as fully arm’s-length. At the other end of the spectrum are direct vendor research, vendor-academic partnerships, and research-centre work funded by the firms being studied. This material is closer to commercial output than to evidence; it can be cited for what it claims and as documentation of the public discourse, but not as evidence of what is true. The paper applies different levels of weight to sources based on where they sit on this spectrum.
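The weighting rule just described can be read as a simple decision procedure. The following is a minimal sketch of one way to encode it in Python. The class names, field names, and the three boolean flags are hypothetical simplifications introduced for illustration, not a formal instrument used in this paper, and real sources will often need judgment that no set of flags can capture.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    """Three positions on the independence spectrum, with their treatment."""
    INDEPENDENT = "usable as evidence"
    MIDDLE = "usable with explicit acknowledgment of the funding pattern"
    COMMERCIAL = "citable for what it claims, not as evidence of what is true"


@dataclass(frozen=True)
class Source:
    # All field names are illustrative assumptions.
    name: str
    industry_funding_declared: bool = False  # funding appears in acknowledgments
    firm_partnership: bool = False           # studied firm granted access or co-authored
    commercial_sponsorship: bool = False     # vendor, consultancy, or member-funded centre


def classify(source: Source) -> Tier:
    """Place a source on the spectrum; the most compromising flag wins."""
    if source.firm_partnership or source.commercial_sponsorship:
        return Tier.COMMERCIAL
    if source.industry_funding_declared:
        return Tier.MIDDLE
    return Tier.INDEPENDENT


if __name__ == "__main__":
    examples = [
        Source("government statistical survey"),
        Source("university paper with declared industry grant",
               industry_funding_declared=True),
        Source("consulting-firm survey", commercial_sponsorship=True),
    ]
    for s in examples:
        tier = classify(s)
        print(f"{s.name}: {tier.name} ({tier.value})")
```

The ordering of the checks is the design choice that matters: a firm partnership or commercial sponsorship places a source at the commercial end regardless of how clean its other attributes look.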

A final observation before I move on. The structural condition I have described is not unique to AI research. Much of management research is partly funded by the firms it studies. Much of medical research is partly funded by the pharmaceutical industry. Much of education research is partly funded by foundations with policy preferences. The pattern is general. What is distinctive about the AI case is that the technology has emerged so quickly, attracted so much money so fast, and become so central to corporate strategy, that the field of independent research has not yet had time to develop. The academic and policy-research apparatus that should be producing independent evidence is still being built; meanwhile, the consulting firms, the vendors, and the firms deploying the technology are producing the literature, because they have the resources and the incentive to do so. This will probably correct over time. In ten years, the AI deployment literature will look different, with a stronger independent core. In the meantime, anyone trying to write seriously about enterprise AI in 2025 and 2026 has to work with a compromised evidence base. The honest thing to do is to say so openly and then to do the best work possible within the constraint.

3. What the Independent Literature Shows

When I apply the standard I have just laid out and ask what the genuinely independent literature shows about enterprise AI in the LLM era, the picture is narrower than the dominant discourse would lead one to expect, but it is not empty. There are several substantive claims that the independent literature does support, and they are worth setting out carefully.

The first concerns the rate of adoption. The US Census Bureau’s Business Trends and Outlook Survey, which is the most rigorous independent measurement of AI use in US firms, has tracked AI adoption bi-weekly since late 2023. The survey asks firms whether they have used AI in the production of goods or services in the previous two weeks; this is a restrictive definition, capturing AI use that has become part of the firm’s actual operations rather than experimental or peripheral use. Under this definition, the share of US firms reporting AI use rose from approximately 3.7 percent at the start of the measurement period in late 2023 to roughly 5.4 percent by early 2024, with an expected rate of about 6.6 percent by the end of 2024. The same survey, after a methodological change in late 2025 that broadened the question to include AI use in “any business function,” reported that about 18 percent of firms had adopted AI by the end of 2025. The Federal Reserve’s synthesis work published in early 2026 confirms this rate at the firm level while showing higher rates at the worker level (around 41 percent of US workers using generative AI for work as of late 2025) and the employment level (around 78 percent of the US labour force works at firms that have adopted some form of AI, but only about 54 percent works at firms using large language models).

These numbers should be set alongside the corporate-survey numbers, which are several times higher. McKinsey’s 2025 State of AI survey reported 88 percent of organisations using AI in at least one function. Stanford’s AI Index, drawing on multiple sources of varying independence, reported 78 percent organisational adoption in 2024 rising to 88 percent in 2025. These figures are roughly five times the broadened end-2025 Census figure and more than ten times the production-use figures for 2024. The gap between the government statistical work and the corporate surveys is substantial and is itself informative. It reflects, at least in part, differences in question wording: corporate surveys typically ask whether the firm “uses AI at all,” while the Census BTOS asks about AI use in production. The narrower definition is more demanding and produces lower numbers. The corporate surveys also draw on samples of executives at large firms, who are likely to over-report AI use relative to the actual operational reality, while the Census surveys draw on a representative sample of firms across the size distribution. The honest reading is that AI adoption is rising rapidly but unevenly, that the highest rates of use are concentrated among large firms and in specific sectors (information, professional services, finance), and that the headline corporate figures substantially overstate the operational reality at the population level.

The gap between the two measurement traditions is set out in Figure 2. The visual comparison is striking; under any reasonable reading, the corporate-survey numbers and the government statistical numbers cannot both be measuring the same thing.
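The size of the gap can be checked with simple arithmetic on the figures quoted above. The sketch below is illustrative only: the pairing of corporate and government figures by year is my own simplification, the function and constant names are hypothetical, and the two traditions survey different populations with different questions, so the ratios indicate scale rather than a like-for-like comparison.

```python
def gap_ratio(corporate_pct: float, government_pct: float) -> float:
    """Return how many times larger the corporate-survey figure is."""
    return corporate_pct / government_pct


# Adoption figures quoted in the text, in percent.
BTOS_PRODUCTION_EARLY_2024 = 5.4    # Census BTOS, AI used in production
BTOS_ANY_FUNCTION_END_2025 = 18.0   # Census BTOS, broadened question
AI_INDEX_2024 = 78.0                # Stanford AI Index, organisational adoption
MCKINSEY_2025 = 88.0                # McKinsey State of AI, at least one function

# Pairing by year is an assumption made for illustration.
ratio_2024 = gap_ratio(AI_INDEX_2024, BTOS_PRODUCTION_EARLY_2024)
ratio_2025 = gap_ratio(MCKINSEY_2025, BTOS_ANY_FUNCTION_END_2025)

print(f"2024: corporate figure is {ratio_2024:.1f} times the Census figure")
print(f"2025: corporate figure is {ratio_2025:.1f} times the Census figure")
```

The ratio narrows between the two years mainly because the Census question was broadened, which is itself a reminder that question wording drives much of the difference.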

The second substantive claim from the independent literature concerns the productivity effect. This is the area where the literature is most contested and where I have to be most careful about which sources I am citing. Daron Acemoglu’s 2024 paper, “The Simple Macroeconomics of AI,” subsequently published in Economic Policy, applies a task-based framework grounded in Hulten’s theorem to existing estimates of AI task exposure and concludes that the macroeconomic productivity effect of AI in the next decade is likely to be modest. His estimate, based on the share of tasks exposed to AI multiplied by the average task-level cost saving, is no more than a 0.66 percent increase in total factor productivity over ten years, which translates to roughly 0.07 percent per year. This is a small number relative to the consulting-firm forecasts of multi-trillion-dollar global GDP impacts. Acemoglu’s paper was funded in part by Google through the MIT Shaping the Future of Work initiative, which under the spectrum approach I am applying puts it in the middle category: usable with explicit acknowledgment of the funding pattern, but not fully arm’s-length. The Acemoglu framework, however, is methodologically careful, the assumptions are conservative, and the result is consistent with the historical pattern of general-purpose technologies, where productivity effects take a decade or more to become measurable. I cite the work because it is the most careful theoretical treatment of the question available and because the framework can be evaluated independently of the funding.
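The arithmetic behind the headline number is short enough to write out. The following sketch shows the structure of the calculation as described above: a share of tasks affected multiplied by an average cost saving, then converted to an annual rate. The two input values are assumptions chosen only because their product reproduces the 0.66 percent figure; they are not stated in this paper, and the function names are hypothetical.

```python
def hulten_tfp_gain(task_share: float, cost_saving: float) -> float:
    """First-order TFP gain: share of tasks affected times average cost saving."""
    return task_share * cost_saving


def annualised(total_gain: float, years: int) -> float:
    """Constant annual growth rate that compounds to total_gain over the period."""
    return (1.0 + total_gain) ** (1.0 / years) - 1.0


# Illustrative inputs (assumptions, picked to reproduce the quoted total).
task_share = 0.046    # share of tasks affected by AI
cost_saving = 0.144   # average cost saving on those tasks

total = hulten_tfp_gain(task_share, cost_saving)
per_year = annualised(0.0066, 10)  # the 0.66 percent ten-year figure from the text

print(f"ten-year TFP gain: {total:.2%}")
print(f"annualised rate:   {per_year:.3%}")
```

Because the total is so small, the compound annual rate and the simple one-tenth share are indistinguishable at two decimal places, which is why the text can round to 0.07 percent per year without specifying which is meant.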

A third claim concerns the historical pattern of general-purpose technology adoption. The pre-LLM literature on this is rich and entirely independent of current industry funding. Paul David’s 1990 paper on the dynamo and the modern productivity paradox is the canonical reference. David showed that the productivity gains from electrification, the previous wave’s general-purpose technology, did not become measurable in aggregate statistics until decades after the technology was first deployed. Factories had to be physically and organisationally redesigned around electric power; the redesign took a generation; the productivity gains followed the redesign rather than the technology. Brynjolfsson and his collaborators’ work on the productivity J-curve, in its pre-AI form (the 2018 to 2021 papers building on the Yang and Brynjolfsson 2001 framework), extends this logic to general-purpose technologies more broadly. The J-curve hypothesis is that general-purpose technologies require substantial complementary intangible investments (process redesign, business-model innovation, human-capital development) that are unmeasured in national accounts, that take years to mature, and that produce measurable productivity gains only after the intangible investments have been made. The hypothesis predicts that the productivity statistics during the deployment phase of a general-purpose technology will look poor (because the intangible investments are large and the measurable outputs are small), and that the statistics during the harvesting phase will look good (because the intangible investments are now producing measurable outputs while the marginal investment costs have declined). The shape of measured productivity over the cycle is a “J.”
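The mechanism can be made concrete with a toy simulation. This is a minimal sketch under strong assumptions, not the model used in the J-curve papers: inputs are fixed at one unit per period, a fixed share of them is diverted to unmeasured intangible investment during a deployment phase, and the accumulated intangible stock later yields measured output at a constant rate. Every parameter value and name below is hypothetical.

```python
def j_curve(periods: int = 12,
            invest_periods: int = 5,
            invest_share: float = 0.10,
            return_rate: float = 0.08) -> list[float]:
    """Measured productivity per period, with inputs normalised to 1.0.

    Period 0 is the pre-deployment baseline. In periods 1..invest_periods,
    a share of inputs builds intangible capital that the statistics do not
    count as output. The stock built in earlier periods adds measured output.
    """
    stock = 0.0
    path = []
    for t in range(periods):
        investing = 1 <= t <= invest_periods
        share = invest_share if investing else 0.0
        # Measured output: ordinary production plus the yield on past intangibles.
        measured_output = (1.0 - share) + return_rate * stock
        path.append(measured_output)  # dividing by inputs of 1.0 changes nothing
        stock += share                # this period's investment pays off later
    return path


if __name__ == "__main__":
    for t, value in enumerate(j_curve()):
        print(f"period {t:2d}: measured productivity {value:.3f}")
```

With these illustrative parameters, measured productivity starts at the baseline, falls when investment begins, recovers slowly as the stock accumulates, and settles above the baseline once investment stops. The depth and length of the dip depend entirely on the assumed parameters, which is consistent with the point that the framework predicts the shape but not the timing.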

The J-curve framework is the cleanest available account of why current AI deployment is producing apparently disappointing aggregate effects while consuming substantial investment. The framework predicts exactly this. It also predicts that the eventual productivity gains, when they arrive, may be substantial, but it does not predict when they will arrive or how they will be distributed. The historical pattern from electrification suggests a multi-decade timeline. The historical pattern from computing suggests something similar; the productivity gains from the IT investments of the 1980s and 1990s did not become clearly visible in US national statistics until the late 1990s and 2000s. The current moment, viewed through this framework, looks like the early phase of a general-purpose technology cycle: large investment, small measurable returns, substantial intangible work being done that will eventually produce measurable outputs but has not yet done so.

Figure 3 places the current AI deployment cycle alongside the historical patterns from electrification and computing. The temporal scales are different (electrification’s cycle ran roughly four decades, computing’s ran roughly three) but the shape of the productivity response is consistent across both, and the current AI moment sits where the deployment phases of the earlier technologies sat. The projected portion of the AI curve is presented as a scenario rather than a forecast; what the available evidence supports is the position of the current moment within the deployment phase, not the precise timing of what follows.

Figure 3: The productivity J-curve in three general-purpose technologies. Stylised measured productivity over the deployment cycle for electrification, computing, and current-generation AI.

A fourth claim, from a different literature, concerns the organisational conditions for successful technology adoption. The absorptive-capacity literature beginning with Cohen and Levinthal (1990) is the foundational reference here. Cohen and Levinthal argued that the ability of a firm to recognise the value of new external information, assimilate it, and apply it to commercial ends depends on the firm’s prior knowledge and accumulated capabilities. A firm with weak prior knowledge cannot effectively absorb new technology even when the technology is available and the firm has the resources to acquire it. The argument has been extended and refined over more than three decades. Zahra and George (2002) distinguished between potential absorptive capacity (acquisition and assimilation) and realised absorptive capacity (transformation and exploitation), allowing for the possibility that firms might acquire knowledge they cannot then act on. The applied literature on technology adoption in organisations, particularly the work in Organization Science and the Academy of Management Journal over the last three decades, has consistently found that organisational characteristics (prior technical capabilities, management quality, complementary investments, organisational learning structures) explain a substantial share of the variation in how firms benefit from any given new technology.

This literature is not specifically about AI, but its findings transfer. The firms that benefit most from a new general-purpose technology are firms that were already positioned to absorb it, and absorptive capacity is not easily or quickly acquired. The implication for the current moment is that we should expect substantial heterogeneity across firms in how they benefit from AI, and we should expect the heterogeneity to be predictable from prior organisational characteristics rather than from differences in AI investment levels or vendor choices. The corporate-survey literature has been claiming something like this, with its segmentations of high performers and laggards and its maturity-model frameworks. The independent academic literature has been saying the same thing, less dramatically, for decades.

A fifth claim, from the algorithmic accountability literature, concerns the failure modes that show up when automated decision-making is deployed at scale. The work coming out of the AI Now Institute, the Ada Lovelace Institute, the Alan Turing Institute, and the academic algorithmic accountability literature more broadly, has identified a consistent set of failure modes across many different settings. These include the systematic disadvantaging of populations that the deploying organisation has less interest in serving well, the opacity of the decision-making process to those subject to it, the difficulty of contesting decisions that are presented as algorithmic and therefore inherently correct, the misalignment of incentives between the deploying organisation and the vendors that build the systems, and the gap between the operational reality of deployment and the formal specification of how the system is supposed to work. These failure modes are documented across welfare administration, criminal justice, immigration enforcement, employment screening, credit decisioning, and a wide range of public-facing deployment contexts. They are not specifically AI failures; they are failure modes of automated decision-making at scale, and they have been observed in rule-based systems, in early machine-learning systems, and now in large-language-model deployments. The fact that the same failure modes recur across very different technical generations of automation suggests that the source of the failure is not the technology but the organisational and institutional conditions of deployment.

A sixth claim is more recent and concerns the dynamics of generative AI adoption specifically. The Bick, Blandin, and Deming work on US adoption rates, published as a Federal Reserve Bank of St. Louis working paper and subsequently in Management Science, provides the most rigorous available data on the speed of generative AI uptake among individuals. Their survey of about 10,000 US workers in 2024 found that, by late 2024, nearly 40 percent of the US working-age population was using generative AI, and roughly 23 percent of employed respondents had used it for work in the previous week. Generative AI’s adoption curve, relative to mass-market product launch, matches the adoption curve of the personal computer almost exactly. This is a fast adoption rate by historical standards, faster than the internet’s, comparable to mobile phones. The Bick, Blandin, and Deming paper has partial Walmart Foundation funding (which under the spectrum approach places it in the middle category), but the methodology is academic and the findings are consistent across multiple independent waves of the same survey, which provides some robustness against funder influence on conclusions.

What the rapid individual adoption of generative AI does not yet show, in any independent literature I have found, is a corresponding effect on aggregate measured productivity. The Federal Reserve has been monitoring this question carefully, and the consistent finding through early 2026 is that the macroeconomic productivity effect of AI is, so far, indistinguishable from zero. This is consistent with the J-curve hypothesis: the technology is being adopted, the intangible work of integration is happening, but the measurable productivity effects have not yet arrived. It is also consistent with several other hypotheses, including the possibility that the productivity effects will be smaller than the dominant narratives expect, or that they will be distributed unevenly across the economy in ways that affect aggregate statistics differently than firm-level statistics.

What this independent literature does not show, and what I want to be honest about, is anything resembling the confident claims that dominate the consulting-firm discourse. The independent literature does not show that 95 percent of enterprise AI pilots fail to deliver measurable value. That figure comes from the MIT NANDA report, which is a preliminary working paper based on 52 interviews and 153 survey responses, conducted by researchers whose own research programme focuses on agentic AI architectures (which the report concludes is the solution to the documented failure rate), and circulated through Fortune magazine in a way that produced a memorable headline rather than careful scientific scrutiny. The independent literature does not show that AI high performers are 3.6 times more likely to redesign workflows; that figure comes from the McKinsey 2025 State of AI survey, which is a commercial output. The independent literature does not show that 5 percent of organisations are “future-built” and 60 percent are laggards; that segmentation comes from BCG. The independent literature does not show that organisations in the higher stages of AI maturity outperform their industry peers financially; that finding comes from MIT CISR, which is funded by member organisations including the major firms whose AI investments validate the maturity framework.

I am not arguing that any of these claims is necessarily false. I am arguing that the independent literature does not yet support them. The corporate sources may be picking up real patterns; we cannot tell from the corporate sources alone. The honest position is that the substantive empirical claims that dominate the discourse are not yet established in any independent research, and the discourse is treating them as if they were established when they are not.

4. What the Public-Record Cases Show

The richest source of genuinely independent evidence on AI deployment failure comes from a small number of cases in which forced disclosure under adversarial conditions has compelled organisations to surrender information they would otherwise have controlled. Almost all of these cases are in the public sector, because public-sector deployments are subject to parliamentary inquiries, royal commissions, freedom-of-information requests, judicial review, and other accountability mechanisms that private-sector deployments do not face. This means the evidence base on AI deployment failures is systematically skewed toward government cases, which is a limitation worth flagging. It also means, however, that what we know about AI deployment failure with any independence is mostly what we know from government cases, and that what we know is consistent enough across cases to suggest patterns worth taking seriously.

I will discuss four cases in this section. The Australian Robodebt scheme is the most thoroughly documented case of automated decision-making failure in any country. The Dutch toeslagenaffaire is comparable in depth of documentation and is geographically distinct. The 2020 UK A-level grading algorithm is a clean case of rapid, visible failure followed by withdrawal and inquiry. The Air Canada chatbot case is smaller in scale but contains an unusually clear adversarial disclosure (a tribunal ruling) of the failure mode. Each case is informative in itself, and together they show a pattern.

4.1 Robodebt

Between 2016 and 2019, the Australian Department of Human Services operated a scheme officially called the Online Compliance Intervention. It became publicly known as Robodebt. The scheme used an automated data-matching process to compare income data held by the Australian Taxation Office with income declarations made by welfare recipients to Centrelink. Where the two sets of figures appeared to be inconsistent, the system raised an automated debt notice against the welfare recipient, calculated by averaging the recipient’s annual income across the relevant period. The recipient was then required to pay the debt or prove that the calculation was wrong; the burden of proof was reversed from the normal administrative arrangement in which the agency must establish a debt before pursuing it.

The Royal Commission into the Robodebt Scheme, which concluded in July 2023 after considering more than 1,000 submissions and hearing from 115 witnesses, found that the scheme was unlawful in its averaging methodology, that the legal advice that had been used to justify the scheme had been misrepresented or ignored, that the scheme had caused substantial distress to its targets including documented cases of suicide, and that the operational structure of the scheme had been designed in a way that made errors difficult or impossible for recipients to contest. The Commission described the scheme as a “crude and cruel mechanism” and made 57 recommendations, including the establishment of a body to monitor and audit automated decision-making in government, reforms to ensure clearer paths of review for those subject to automated decisions, and requirements for greater transparency in the design and operation of automated systems.

Several features of the Robodebt case are worth drawing out, because they recur in other cases.

The first is that the system was deployed at scale before its operational behaviour at scale had been adequately tested. The averaging methodology used by the system, which assumed that income was earned evenly across a year, was wrong for any individual whose income varied during the year, which described a substantial share of welfare recipients (who are disproportionately in casual or part-time work). The error was not subtle; it was inherent in the design. The scheme produced incorrect debt notices at scale because the methodology was wrong, not because of edge cases.
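The arithmetic behind this point can be made concrete with a minimal sketch. The means test below is hypothetical (the payment rate, income-free area, and taper are assumptions chosen for round numbers, not the actual Centrelink rules), and the function names are illustrative. It compares two recipients with the same annual income, one earning evenly and one earning unevenly, and assumes both reported every fortnight accurately, so the true debt in each case is zero.

```python
# Illustrative only: a hypothetical means test, not the actual Centrelink rules.
FULL_RATE = 500.0   # assumed fortnightly payment with no other income
FREE_AREA = 300.0   # assumed income allowed before the payment is reduced
TAPER = 0.5         # assumed reduction per dollar earned above FREE_AREA


def entitlement(fortnightly_income: float) -> float:
    """Payment due for one fortnight under the hypothetical means test."""
    excess = max(0.0, fortnightly_income - FREE_AREA)
    return max(0.0, FULL_RATE - TAPER * excess)


def averaged_debt(actual_incomes: list[float]) -> float:
    """Debt implied by spreading annual income evenly across every fortnight.

    Assumes the recipient reported each fortnight accurately, so the payments
    actually made were correct and the true debt is zero.
    """
    paid = [entitlement(income) for income in actual_incomes]
    average = sum(actual_incomes) / len(actual_incomes)
    assumed = entitlement(average)
    # An apparent overpayment arises wherever the payment made exceeds the
    # payment implied by the averaged income.
    return sum(max(0.0, p - assumed) for p in paid)


steady = [1000.0] * 26                  # same income every fortnight
uneven = [2000.0] * 13 + [0.0] * 13     # half the year in work, half without
print(averaged_debt(steady))  # 0.0
print(averaged_debt(uneven))  # 4550.0
```

Under these assumed parameters, the recipient with steady income is assessed correctly, while the recipient with uneven income is assigned a substantial debt despite having been paid exactly what was due. The error follows from the averaging step itself, not from any unusual input.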

The second is that the organisation deploying the system had institutional incentives to discount evidence of problems. The scheme was projected to generate substantial revenue for the government through debt recovery. Internal staff who raised concerns about the methodology, the legality, or the impact on recipients were not heeded. The Royal Commission found that several senior officials had ignored or suppressed advice that the scheme was unlawful. The institutional culture treated the volume of debts raised as the measure of success, which created pressure to continue and expand the scheme even as evidence of harm accumulated.

The third is that the people subject to the system had no effective means of contesting it. The system presented its conclusions as facts. The burden of proof was reversed. The pathways for review were difficult to navigate, the agency staff who handled complaints had limited ability to override the automated calculations, and the recipients who were affected were disproportionately people with limited capacity to engage with bureaucratic systems (low-income, often with mental health difficulties, often without legal representation). The result was that errors compounded; the system kept producing incorrect debts, the recipients had limited capacity to challenge them, and the system’s outputs were treated as definitive by both the agency and the courts until external pressure forced reconsideration.

The fourth is that the failure was not principally a technical failure. The automation worked in the narrow sense that it generated debt notices according to its specification. The failure was in the specification itself, in the institutional design that placed the system in a context where its errors could not be effectively contested, and in the organisational culture that treated the volume of outputs as evidence of success regardless of the validity of those outputs. The technology, in this sense, was incidental. The same failure would have occurred if the calculations had been done by hand, except that hand calculation would not have permitted the scale at which the errors were produced.

These four features are not unique to Robodebt. Deployment at scale before adequate testing, institutional incentives to discount problems, the absence of effective contestation by those subject to the system, and the location of failure in the surrounding system rather than in the technology all recur in the cases that follow.

4.2 The Dutch toeslagenaffaire

The Dutch childcare benefits scandal, the toeslagenaffaire, has a similar structure. Between 2013 and 2019 (with the most intense period of harm between 2014 and 2018), the Dutch Tax and Customs Administration operated a risk-classification system that used algorithmic profiling to identify applicants for childcare benefits who were considered high risk of fraud. Applicants flagged by the system were subjected to enhanced scrutiny, asked to repay benefits, and in many cases pursued aggressively for amounts they did not in fact owe. At least 35,000 parents were wrongfully accused of fraud, many were forced into debt and bankruptcy, and more than 2,000 children were taken from their parents by child protective services as a consequence of the financial pressure the false fraud accusations created.

A parliamentary inquiry, completed in 2020, concluded that the scheme constituted “unprecedented injustice”; its report led to the resignation of the Dutch government in 2021. The Dutch Data Protection Authority found that the algorithm had used nationality and “non-Western appearance” as risk indicators, which constituted illegal racial discrimination. The Netherlands Institute for Human Rights conducted its own investigation and found that persons of foreign descent were 3.5 times more likely than persons of Dutch descent to be selected for further investigation under the algorithm. The Dutch government subsequently acknowledged that institutional racism had been a root cause of the scandal.
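A figure such as “3.5 times more likely” is a ratio of selection rates between two groups. The sketch below shows how such a ratio is computed; the counts are hypothetical, chosen only to reproduce a ratio of 3.5, and are not the Institute’s data. The function name is illustrative.

```python
from fractions import Fraction


def selection_rate_ratio(selected_a: int, total_a: int,
                         selected_b: int, total_b: int) -> Fraction:
    """Ratio of group A's selection rate to group B's, computed exactly."""
    rate_a = Fraction(selected_a, total_a)
    rate_b = Fraction(selected_b, total_b)
    return rate_a / rate_b


# Hypothetical counts chosen to reproduce a 3.5x ratio; not the Institute's data.
ratio = selection_rate_ratio(selected_a=700, total_a=10_000,
                             selected_b=200, total_b=10_000)
print(float(ratio))  # 3.5
```

The ratio on its own does not reveal the underlying rates: a value of 3.5 is equally compatible with selection rates of 7 percent against 2 percent and of 0.7 percent against 0.2 percent, which is why the absolute numbers affected matter alongside it.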

The toeslagenaffaire shares with Robodebt several of the features I have just described. The system was deployed at scale before its operational behaviour at scale was adequately understood. The institutional incentives within the Tax and Customs Administration favoured aggressive pursuit of fraud over careful adjudication of borderline cases. The people subject to the system had limited effective means of contesting the classifications, particularly given that the criteria the system used were opaque even to the civil servants conducting the investigations. The failure was in the surrounding institutional design and the discriminatory specification of the risk model, not in the technical operation of the algorithm.

The toeslagenaffaire adds one feature that Robodebt did not have in the same form: a discriminatory pattern in the system’s outputs that was traceable to discriminatory inputs in the system’s design. The algorithm had been trained on data that reflected historical patterns of discrimination, and it had been designed with risk indicators (nationality, appearance) that encoded discriminatory assumptions. The output of the system was therefore discriminatory in a systematic and statistically demonstrable way. This is a pattern that has been documented in many other algorithmic-accountability cases (the COMPAS criminal-justice risk assessment in the United States, the various predictive-policing systems, the credit-scoring algorithms that produce disparate outcomes by race) and that I think is best understood as a specific manifestation of the broader failure mode in which the deploying organisation imports its existing biases into the automated system and then treats the system’s outputs as objective.

4.3 The UK A-level algorithm

In 2020, in the early phase of the COVID-19 pandemic, the UK government decided that A-level examinations could not be held in person. For the results issued in August 2020, the exams regulator, Ofqual, deployed an algorithm to standardise teacher-assessed grades against historical school performance, with the intention of preventing grade inflation. The algorithm produced results that were systematically biased against students from disadvantaged backgrounds, downgrading approximately 40 percent of teacher assessments and disproportionately downgrading students at schools with worse historical performance. The reaction was immediate and forceful. Within days, the government withdrew the algorithm and reverted to teacher assessments alone.
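The mechanism of standardising against historical school performance can be illustrated with a deliberately simplified sketch. This is not Ofqual’s model: it assumes a four-grade scale, a single school, and a rule that hands out grades by teacher rank order so that the cohort matches the school’s past grade distribution. All names and numbers are hypothetical.

```python
GRADES = ("A", "B", "C", "D")  # best to worst; a shortened, hypothetical scale


def standardise(ranked_students: list[str],
                historical_percent: tuple[int, ...]) -> dict[str, str]:
    """Assign grades by rank so the cohort matches the school's past distribution.

    `ranked_students` is ordered best to worst by teacher assessment;
    `historical_percent` gives the school's past share of each grade and
    is assumed to sum to 100.
    """
    n = len(ranked_students)
    result = {}
    cumulative = 0
    position = 0
    for grade, percent in zip(GRADES, historical_percent):
        cumulative += percent
        cutoff = round(cumulative * n / 100)
        while position < cutoff:
            result[ranked_students[position]] = grade
            position += 1
    return result


students = [f"s{i}" for i in range(1, 11)]
teacher = dict(zip(students, ["A", "B", "B", "C", "C", "C", "C", "C", "D", "D"]))
# Hypothetical school that has never awarded an A in past years.
final = standardise(students, historical_percent=(0, 20, 50, 30))
downgraded = [s for s in students
              if GRADES.index(final[s]) > GRADES.index(teacher[s])]
print(final["s1"])       # B
print(len(downgraded))   # 3
```

In this sketch the strongest student cannot receive an A however strong the teacher assessment, because the school’s history contains none. The individual’s grade is capped by the institution’s past results, which is the distributional effect described above.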

The A-level case is shorter and cleaner than Robodebt or the toeslagenaffaire because the deployment was withdrawn within days of public visibility, and because the population affected (sixth-form students and their parents) had effective political voice. The inquiry that followed, conducted by the Office for Statistics Regulation among others, identified several recurring features: the algorithm had been designed with insufficient consideration of its distributional effects, the testing of the algorithm before deployment had used measures that did not adequately reveal those effects, the institutional pressure to “prevent grade inflation” had crowded out attention to the harm of downgrading individual students, and the deploying organisation had not anticipated the political and public response.

The A-level case is informative partly because of what did not happen. The algorithm was withdrawn, the affected students were given their teacher-assessed grades, and the deployment did not produce the long-tail harms that Robodebt and the toeslagenaffaire did. The reason was not that the algorithm was better. The reason was that the affected population had political voice. A sixth-former with two parents capable of writing to their member of Parliament, organising on social media, and contacting the press is in a very different position from a welfare recipient with mental health difficulties and no legal representation. The same algorithmic failure, applied to a population with less voice, would have produced the long-tail harm pattern that Robodebt and the toeslagenaffaire show. The political economy of the affected population is part of what determines whether an algorithmic failure becomes visible and is corrected, or remains invisible and continues.

4.4 The Air Canada chatbot

The Air Canada case is much smaller in scale and is a private-sector case rather than a public-sector one, but it is unusual in that adversarial disclosure (a tribunal ruling) produced an independent record of what happened. In November 2022, a customer named Jake Moffatt interacted with an Air Canada chatbot on the airline’s website in order to find out about bereavement fares; his grandmother had just died. The chatbot told him that he could book his ticket immediately and apply for a bereavement rate refund within 90 days. He did so. Air Canada subsequently refused the refund, on the grounds that its formal policy did not permit retroactive bereavement-fare applications. Moffatt took the matter to the British Columbia Civil Resolution Tribunal.

In February 2024, the Tribunal ruled in favour of Moffatt. The Tribunal found that Air Canada was responsible for the misinformation the chatbot had provided, that the airline had failed to take reasonable care to ensure the information was accurate, and that Moffatt was entitled to damages. The Tribunal explicitly rejected Air Canada’s argument that the chatbot should be considered a separate legal entity responsible for its own actions; the chatbot was part of Air Canada’s website and the airline was responsible for everything on its website.

The Air Canada case is small in financial terms but informative as a clean illustration of a recurring pattern. The chatbot had been deployed to handle customer queries; it had been trained on a corpus that included the airline’s general policy information; it had produced an output that was inconsistent with another part of the airline’s website (a specific page on bereavement fares that explained the actual policy); and the airline’s institutional response when the inconsistency produced harm was to disclaim responsibility for the chatbot’s output. The Tribunal compelled disclosure of the failure mode (the chatbot was inconsistent with the airline’s own policy) and assigned responsibility (Air Canada was responsible for the chatbot’s outputs). Without the tribunal ruling, the case would have been invisible; with it, the case is one of the few private-sector enterprise AI deployments for which we have an independent record of the failure mode.

5. The Pattern Across the Independent Evidence

Looking at the four cases I have just described, and considering them alongside the broader algorithmic accountability literature on automated decision-making in welfare, criminal justice, immigration enforcement, and employment screening, several patterns recur. I will set them out as patterns rather than as recommendations, because the independent evidence supports their identification but does not support confident claims about how to avoid them.

Figure 4 sets out the four cases against the recurring features of failure I am about to describe. The pattern is visible at a glance: with one exception (the discriminatory-output row), the same features appear in every case despite the differences in country, sector, technology generation, and deployment scale.

Figure 4: Recurring features across public-record AI deployment failures. Four cases of automated decision-making failure documented through forced disclosure under adversarial conditions.

The first pattern is that deployment failures are concentrated in the gap between the system’s specification and the operational reality in which it is deployed. The Robodebt averaging methodology was wrong for any recipient whose income varied across the year; the system was deployed before this was understood as a fatal flaw rather than a manageable edge case. The Dutch fraud-detection system used risk indicators that encoded discriminatory assumptions; the discriminatory output was inherent in the design, not an emergent property of the algorithm. The A-level algorithm was specified to “prevent grade inflation” but the cost function it minimised did not adequately capture distributional fairness. The Air Canada chatbot was deployed without ensuring consistency between its outputs and the rest of the airline’s policy infrastructure. In each case, the failure was in what the system had been asked to do, not in whether the technology could do what it had been asked.

The second pattern is that the institutional incentives within the deploying organisation systematically discount evidence of problems. The Robodebt scheme generated revenue, which was its primary institutional measure of success; concerns from staff and recipients about errors were not heeded because they did not affect the success metric. The Dutch tax authority was under pressure to combat fraud, which made the false positives (wrongful accusations) institutionally invisible relative to the true positives (caught fraudsters). The UK examinations regulator was under pressure to prevent grade inflation, which made the costs to individual students invisible relative to the institutional goal. The pattern in each case is that the institutional success metric is too narrow to capture the actual costs of the deployment, and the costs accumulate in populations whose voice within the institutional decision-making process is weak.

The third pattern is that the populations subject to the systems have limited effective means of contestation. Welfare recipients facing Robodebt had little capacity to challenge automated debt notices; the system was designed in a way that placed the burden of proof on the recipient. Dutch parents accused of fraud were subjected to administrative processes that gave them limited ability to contest the underlying risk classifications. UK students facing downgraded A-levels initially had no individual review mechanism; only the political reaction at population scale forced reconsideration. Air Canada customers receiving inconsistent information from the chatbot had to take the airline to a tribunal to get a remedy. The pattern is that automated systems shift the burden of contestation onto the people subject to the system, and the people subject to the system are typically less well resourced than the deploying organisation to mount that contestation.

The fourth pattern, less universally present but visible in several cases, is the misalignment between the deploying organisation and the vendor that builds the system. In several of the algorithmic accountability cases (less so in the four I have focused on, more so in the COMPAS and predictive-policing cases), the organisation deploying the system did not fully understand the system, the vendor had built the system to specifications that the organisation had only loosely articulated, and accountability for the system’s behaviour fell between the two parties. The Air Canada case has a small version of this: the airline’s argument that the chatbot was a “separate legal entity” was an attempt to shift accountability to the vendor. The tribunal rejected the argument, but the fact that it was made at all is informative. The vendor relationship in enterprise AI deployment is structurally different from earlier IT vendor relationships because the systems exhibit behaviours that neither the deploying organisation nor the vendor fully predicts or controls, and this creates novel accountability problems that the existing institutional and legal infrastructure has not yet absorbed.

The fifth pattern, which is the one I want to emphasise because it is the easiest to miss, is that the failure modes I have just described are not novel. They are not specifically AI failures. The same patterns have been documented in earlier automation deployments, in rule-based expert systems from the 1980s, in early machine-learning systems from the 2000s, and in the various non-algorithmic bureaucratic systems that operated at scale before automation became technically possible. The Robodebt averaging methodology was a procedural decision, not an algorithmic one; the same scheme could have been operated by clerks using a calculator. The Dutch discrimination problem was structurally similar to the discriminatory enforcement of welfare rules in many countries that operated without automation at all. The A-level algorithm was a specific failure but the underlying problem (standardising assessments across institutions of varying quality, with disparate-impact consequences) is one that human exam boards have wrestled with for decades. The Air Canada chatbot problem was a specific instance of the broader pattern of large organisations producing inconsistent statements across different communication channels; it occurred constantly before chatbots, when call-centre staff and webpage authors operated semi-independently.

The implication of this fifth pattern is that the current AI deployment failures are not telling us something new about AI; they are telling us something familiar about automation, accountability, and decision-making at scale, which AI happens to make more visible because it operates at higher volume and higher speed than earlier forms of automation. The lessons that the genuinely independent evidence supports are not lessons specific to enterprise AI; they are lessons about how organisations deploy any technology that automates decisions or interactions at scale, and they have been documented in the public administration and algorithmic accountability literatures for decades. The fact that they are being rediscovered in the AI literature, often without acknowledgment of the prior work, is itself part of the evidence problem this paper has been describing.

6. What This Tells Us About How Enterprises Are Deploying AI

What can be said, on the basis of this independent evidence, about how enterprises are actually deploying AI in 2024 to 2026? Less than the consulting literature claims, but not nothing.

The first thing that can be said is that adoption is rising rapidly at the individual level and more slowly at the operational level. The Federal Reserve and Census Bureau data converge on the conclusion that AI is being used by a substantial and growing share of workers, but that the share of firms using AI in actual production of goods or services is much lower. The gap between individual use and operational use is large and persistent. The interpretation I find most consistent with the available evidence is that individuals are adopting AI tools for their own work because the tools are useful for their own work; organisations are not deploying AI into operations at the same rate because operational deployment requires coordination, integration, and institutional change that individual adoption does not.

The second thing that can be said is that the productivity effect at the macroeconomic level has not yet appeared. This is consistent with the J-curve hypothesis from the pre-LLM general-purpose-technology literature. It is also consistent with several other hypotheses. The most cautious reading is that the macroeconomic productivity effect of AI is either small, or delayed, or both, and that the confident forecasts of substantial near-term economic transformation that dominate the consulting-firm literature do not yet have empirical support. The independent literature does not say that the productivity effects will not eventually arrive; it says that they have not arrived yet, and that the historical pattern of comparable technologies suggests they would not be expected to have arrived yet.

The third thing that can be said is that enterprise deployment is concentrated in narrow, well-defined task domains where the inputs and outputs are constrained. This is not directly documented in any single independent study, but it is the consistent pattern visible across the cases that have produced independent evidence. The deployments that have been visibly successful (in the sense of being adopted, scaled, and retained) tend to be in narrow domains: contract analysis on standardised contract types, customer support on well-defined ticket categories, code generation on well-scoped tasks, document summarisation for internal use. The deployments that have been visibly unsuccessful (in the sense of being withdrawn or producing public failures) tend to be in broader domains: open-ended customer service across the full range of customer queries, automated decision-making across the full population of welfare recipients, examination grading across the full range of student circumstances. The pattern is consistent with the “jagged technological frontier” framing from the Dell’Acqua et al. study (which I cite with the appropriate caveats about its industry partnership) and with the broader literature on what large language models can and cannot reliably do.

The fourth thing that can be said is that the recurring failure modes are organisational and institutional, not technical. The public-record cases I described in the previous section all show this pattern. The failures occurred not because the technology was incapable of doing what it had been asked to do, but because what it had been asked to do was wrong, because the institutional context did not allow effective contestation of errors, or because the deploying organisation had institutional incentives to discount evidence of problems. This is the most important substantive finding of the paper, and it is the one that most directly contradicts the dominant narrative. The dominant narrative treats AI deployment failure as a technical or operational problem to be solved through better tools, better data, better workflows, or better organisational maturity. The independent evidence suggests that AI deployment failure is principally an organisational and institutional problem that exists independently of the technology, that has been documented in earlier waves of automation, and that has not been solved by any of the proposed technical or operational remedies because the remedies do not address the underlying institutional structure.

The fifth thing that can be said is that the current discourse is misleading practitioners in identifiable ways. I want to be careful here because I am making a claim about the discourse, which is harder to evidence than a claim about the deployments. But the consulting-firm literature consistently presents AI deployment as a problem of organisational maturity, capability investment, and workflow redesign, where the path to success is to follow the playbook of “high performers.” The independent evidence does not support this framing. There are no peer-reviewed independent studies that have established that the consulting-firm segmentations track real differences in deployment outcomes. The maturity models are theoretical constructs that are then validated against survey populations that include the firms whose investments in maturity validate the framework. The “high performer” category is itself a construction of the surveys that produce it. A practitioner reading the consulting literature is being given a confident playbook on the basis of evidence that does not support that confidence.

What the historical literature on general-purpose technologies does suggest, in contrast, is that the firms that eventually capture the productivity gains from a new general-purpose technology are likely to be firms that invest in complementary intangible capabilities over a sustained period, that experiment widely with different applications, that maintain organisational learning structures, and that have the absorptive capacity to integrate new technology into existing operations. This is a less specific recommendation than the consulting playbook offers, and it carries less confidence about timing or distribution of returns, but it is the recommendation the independent evidence actually supports.

The sixth thing that can be said is more uncomfortable. It is that the most reliable way to find out how enterprises are deploying AI is to wait for failed deployments to become visible through adversarial disclosure. The structural reason is that successful deployments are documented by their deployers (who have an interest in presenting them favourably) while failed deployments are documented by external parties only when the failure produces consequences that compel disclosure. This means that the empirical record of enterprise AI deployment will, for the foreseeable future, be biased toward visible failures rather than typical deployments. We will know more about how AI deployment goes wrong than about how it goes well, and the things we know about how it goes wrong will be the things that are large enough or harmful enough to produce inquiries, lawsuits, or regulatory action. This is not a satisfactory state of affairs, but it is the actual state of affairs, and a paper that wants to be honest about the evidence should say so.

7. A Framework for Reading the Evidence

Practitioners and observers who have to make decisions about enterprise AI in 2024 to 2026 are not in a position to wait for the independent research apparatus to mature. They have to act now, on the evidence they have. What this paper has tried to do is to clarify what the available evidence does and does not support, and to identify how much weight different sources can bear. In this section I want to draw out, in a more practical register, what reading the evidence honestly looks like.

The first move is to distinguish, in any piece of research on enterprise AI, between the source’s commercial position and its substantive claims. If a consulting firm publishes a survey, the survey may contain accurate measurement of what executives say to surveyors, but the framing of the findings (the segmentations, the recommendations, the implicit comparisons) will be shaped by what the firm sells. A reader can extract the underlying observations while being sceptical of the framings. If a vendor publishes a case study, the case study may contain real information about a real deployment, but the selection of which deployments get presented as cases is determined by what makes the vendor look good. A reader can take the existence of the case as evidence that the deployment occurred while being sceptical that it is representative.

The second move is to apply the spectrum approach to academic research. Peer-reviewed publication in a serious journal is necessary but not sufficient for treating a paper as independent evidence. The funding declarations, the author affiliations, and the partnerships disclosed in the acknowledgments are part of the evidence about the paper’s evidence. A paper conducted by independent academics at independent universities, with no industry funding and no firm partnership, can be treated as substantive evidence. A paper conducted at independent institutions but with industry funding or with industry co-authors should be treated as substantive but qualified evidence. A paper conducted in partnership with the firm being studied should be treated as case material from a cooperating organisation, not as independent evidence. Most readers do not do this work, but it is the work that is required for honest reading.

The third move is to take government statistical work and independent inquiries seriously as primary sources. The Census Bureau and Federal Reserve work on AI adoption is more reliable than any corporate survey, and is freely available. The Royal Commission and parliamentary inquiry reports on algorithmic failure are more reliable than any consulting case study, and are also freely available. These sources are systematically under-cited in the management literature, partly because they sit in different intellectual traditions and partly because they tell less dramatic stories. They are nonetheless the strongest evidence we have, and a reader who has read them is better informed than one who has read the corporate literature.

The fourth move is to read historically. The literature on general-purpose technology adoption is several decades old and is mostly free of current industry funding. It includes Paul David on electrification, Brynjolfsson’s pre-AI work on the IT productivity paradox, Cohen and Levinthal on absorptive capacity, Trist on sociotechnical systems, Walsh and Ungson on organisational memory, Nelson and Winter on evolutionary economics, and the substantial body of work on technology diffusion. This literature is the cleanest source we have for thinking about the current moment, because it was developed without the current industry influence and because the questions it addresses (how do organisations absorb new general-purpose technologies, what conditions determine which firms benefit, how long does the absorption take) are the same questions that the current discourse is asking about AI. The current discourse mostly does not engage with this literature, which is both a missed opportunity and a sign of how much the discourse has been shaped by parties whose interest is in presenting the current moment as historically unprecedented rather than as a familiar case of a recurring pattern.

The fifth move is to be appropriately humble about prediction. The historical literature suggests that general-purpose technologies take a decade or more to produce measurable productivity gains, that the distribution of those gains across firms is highly uneven, that the firms that capture the gains are typically those with prior absorptive capacity, and that the eventual gains are typically substantial in aggregate but difficult to forecast in their specific forms. The current moment in AI is, on this reading, an early phase of a long cycle. The confident predictions about the next two to three years that dominate the consulting literature are not supported by the historical literature on comparable technologies. Neither, I should be honest, are the confident predictions of catastrophe or stagnation that come from the more pessimistic commentators. The honest position is that the productivity effects will become clearer over the next decade, and that prediction at finer time scales is poorly supported by anything we know.

The sixth move is to attend to who is harmed when AI deployments fail. The public-record cases I described in section 4 all show that the people harmed by deployment failure are systematically less able to make their harm visible than the deploying organisation is to control the narrative. Welfare recipients under Robodebt, parents falsely accused of fraud in the toeslagenaffaire, students downgraded by the A-level algorithm, customers misled by chatbots: in each case, the harm fell on a population whose voice in the institutional process was weak, and the harm only became visible when the population’s collective voice reached a threshold that compelled institutional attention. A reader of enterprise AI deployments should ask, of any deployment under consideration, who is positioned to make harm visible if it occurs, and what institutional structures exist to ensure that harm is identified and remedied. The answers will often be unsatisfactory. The asking itself is a discipline that the consulting literature does not encourage.

8. Limitations and Open Questions

I want to be honest about what this paper does and does not do, and what would change my view if the evidence base shifted.

The paper makes its strongest claims about what the genuinely independent literature does and does not show, and about the recurring patterns visible in the public-record AI deployment cases. These claims rest on government statistical work, pre-LLM theoretical and historical research, and adversarially disclosed public-sector inquiries, which are the cleanest sources available. The claims would be weakened if a serious independent empirical literature on enterprise AI deployment outcomes emerged that contradicted the patterns I have identified. I do not expect this to happen in the near term, but it is possible.

The paper makes weaker claims about the macroeconomic productivity effects of AI, the rate of adoption at the firm level, and the conditions for organisational absorption of new technology. These claims rest on a combination of government statistical work and pre-LLM theoretical work, both of which are reasonably solid, but the application of the older theoretical work to the current AI moment requires interpretive judgement that I have made transparently. The claims would be weakened if the current AI deployment cycle turned out to be qualitatively different from earlier general-purpose-technology cycles in ways that the historical literature does not capture. It is possible that this is the case; I do not think the available evidence yet supports the claim that the current cycle is qualitatively different, but the evidence is thin enough that the claim cannot be ruled out.

The paper does not make strong claims about what enterprises should do. It declines to make those claims because the independent evidence does not yet support them. A reader who wants confident practitioner guidance will find the paper unsatisfying. The honest response is that the consulting literature provides such guidance with more confidence than the evidence supports, and that the right response to the current state of the evidence is to be sceptical of confident guidance rather than to provide alternative confident guidance.

The paper is geographically uneven. The public-record cases I have discussed are Australian, Dutch, British, and Canadian; the algorithmic accountability literature I have drawn on is largely Western. There is substantial public-sector AI deployment in China, India, the Gulf states, parts of Africa, and elsewhere; I do not have a serious independent literature on any of these to draw on, and the paper is correspondingly weighted toward Western cases. This is a limitation of the available evidence rather than a choice; it is, however, worth flagging.

The paper is also weighted toward public-sector cases, because that is where the adversarial disclosure mechanisms exist. The patterns I have identified may or may not transfer to private-sector deployments. They probably do in the broad outlines (the institutional failure modes are similar in structure across sectors), but the specifics of private-sector deployment may differ in ways that the public-sector cases do not capture. The Air Canada case is the one private-sector case where adversarial disclosure produced independent evidence; one case is not enough to establish a pattern, even though it is consistent with the pattern from the public-sector cases.

The paper does not address the question of whether enterprise AI deployment is, on balance, good or bad. This is a question I think is premature given the current state of the evidence. The historical literature on general-purpose technologies suggests that the eventual aggregate effects are likely to be substantial and probably net-positive, but the distribution of effects across populations is likely to be uneven and not entirely predictable from the early phase. The current cycle could produce substantial productivity gains widely shared; it could produce substantial productivity gains narrowly captured; it could produce smaller gains than expected, with significant social costs along the way. The independent evidence does not yet support any of these outcomes confidently over the others, and a careful paper should not pretend it does.

What would change the picture I have drawn? Several developments would update my view. First, the emergence of a substantial independent empirical literature on enterprise AI deployment that meets a reasonable independence standard. This would allow the paper’s claims to be revised against new evidence. Second, the maturation of the agentic AI deployment cycle (the next phase of the technology, currently in single-digit percent deployment across most business functions according to the Stanford AI Index 2026), which may show different patterns from current generative AI deployment. Third, the appearance of clearer productivity effects in the government statistical work, which would either confirm the J-curve pattern (if gains arrive after the intangible investment phase) or contradict it (if gains fail to arrive on the expected timeline). Fourth, the accumulation of more public-record cases, particularly private-sector cases where adversarial disclosure produces independent evidence. Each of these developments would strengthen the empirical base on which honest claims about enterprise AI deployment can be made.

In the meantime, the paper offers what I think is the most honest assessment available: that the current discourse on enterprise AI is too confident given the available evidence, that the independent literature shows a more modest and more historically familiar picture than the consulting literature claims, that the public-record cases show recurring failure modes that are organisational and institutional rather than technical, and that practitioners would be better served by careful scepticism about the dominant narratives than by another confident playbook based on evidence that does not support the confidence.

This is a less marketable conclusion than the consulting literature offers. I think it is also the right one for the moment we are in.

References

References are grouped by the categories used in section 2 to distinguish independence levels. Sources in the corporate-output category are listed because they are cited in the paper as part of the discourse being examined, not as evidence within the paper’s argument.

Government statistical work and official research

Allen, J. S. (2026). Monitoring AI adoption in the U.S. economy. FEDS Notes, Board of Governors of the Federal Reserve System, April 2026.

Bonney, K., Breaux, C., Buffington, C., Dinlersoz, E., Foster, L. S., Goldschlag, N., Haltiwanger, J., Kroff, Z., and Savage, K. (2024). Tracking firm use of AI in real time: A snapshot from the Business Trends and Outlook Survey. NBER Working Paper 32319.

Crane, L., Green, M., and Soto, P. (2025). Measuring AI uptake in the workplace. FEDS Notes, Board of Governors of the Federal Reserve System, February 2025.

McElheran, K., Li, J. F., Brynjolfsson, E., Kroff, Z., Dinlersoz, E., Foster, L. S., and Zolas, N. (2024). AI adoption in America: Who, what, and where. Journal of Economics and Management Strategy 33(2), 375–415.

Royal Commission into the Robodebt Scheme. (2023). Report of the Royal Commission into the Robodebt Scheme. Commonwealth of Australia, July 2023.

Office for Statistics Regulation. (2020). Review of the regulation of statistical models used in 2020 to award grades to students who would have taken examinations in summer 2020. UK Statistics Authority.

Dutch Parliamentary Committee of Inquiry into the Childcare Benefits System. (2020). Ongekend onrecht [Unprecedented Injustice]. Dutch House of Representatives, December 2020.

Autoriteit Persoonsgegevens. (2020). Werkwijze Belastingdienst in strijd met de wet en discriminerend [The Tax and Customs Administration’s working method is unlawful and discriminatory]. Dutch Data Protection Authority.

College voor de Rechten van de Mens. (2024). Onderzoek naar institutioneel racisme bij de kinderopvangtoeslagaffaire [Investigation into institutional racism in the childcare benefits affair]. Netherlands Institute for Human Rights.

Independent inquiries with adversarial disclosure

Moffatt v. Air Canada, 2024 BCCRT 149, British Columbia Civil Resolution Tribunal, February 2024.

Foundational pre-LLM theoretical and historical work

Cohen, W. M., and Levinthal, D. A. (1990). Absorptive capacity: A new perspective on learning and innovation. Administrative Science Quarterly 35(1), 128–152.

David, P. A. (1990). The dynamo and the computer: An historical perspective on the modern productivity paradox. American Economic Review 80(2), 355–361.

Nelson, R. R., and Winter, S. G. (1982). An Evolutionary Theory of Economic Change. Belknap Press.

Orlikowski, W. J. (2007). Sociomaterial practices: Exploring technology at work. Organization Studies 28(9), 1435–1448.

Scott, J. C. (1998). Seeing Like a State: How Certain Schemes to Improve the Human Condition Have Failed. Yale University Press.

Trist, E. L., and Bamforth, K. W. (1951). Some social and psychological consequences of the longwall method of coal-getting. Human Relations 4(1), 3–38.

Walsh, J. P., and Ungson, G. R. (1991). Organizational memory. Academy of Management Review 16(1), 57–91.

Yang, S., and Brynjolfsson, E. (2001). Intangible assets and growth accounting: Evidence from computer investments. MIT Sloan School of Management working paper.

Zahra, S. A., and George, G. (2002). Absorptive capacity: A review, reconceptualization, and extension. Academy of Management Review 27(2), 185–203.

Academic research with industry funding declared

Acemoglu, D. (2025). The simple macroeconomics of AI. Economic Policy 40(121), 13–58. (Funded in part by Google through the MIT Shaping the Future of Work initiative.)

Autor, D. H. (2024). Applying AI to rebuild middle class jobs. NBER Working Paper 32140. (Funded in part by Google, the William and Flora Hewlett Foundation, the NOMIS Foundation, and the Smith Richardson Foundation.)

Berente, N., Gu, B., Recker, J., and Santhanam, R. (2021). Special issue editor’s comments: Managing artificial intelligence. MIS Quarterly 45(3), 1433–1450.

Bick, A., Blandin, A., and Deming, D. J. (2025). The rapid adoption of generative AI. Management Science, forthcoming. (Funded in part by the Walmart Foundation.)

Algorithmic accountability literature

Eubanks, V. (2018). Automating Inequality: How High-Tech Tools Profile, Police, and Punish the Poor. St. Martin’s Press.

O’Neil, C. (2016). Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy. Crown.

Corporate and consulting outputs cited as discourse

Boston Consulting Group. (2025). The widening AI value gap: Build for the future 2025. BCG.

Brynjolfsson, E., Li, D., and Raymond, L. R. (2025). Generative AI at work. Quarterly Journal of Economics 140(2), 889–942. (Vendor-academic partnership; cited as case material, not as independent evidence.)

Challapally, A., Pease, C., Raskar, R., and Chari, P. (2025). The GenAI divide: State of AI in business 2025. MIT Project NANDA, July 2025.

Dell’Acqua, F., McFowland, E., Mollick, E. R., Lifshitz-Assaf, H., Kellogg, K., Rajendran, S., Krayer, L., Candelon, F., and Lakhani, K. R. (2023). Navigating the jagged technological frontier: Field experimental evidence of the effects of AI on knowledge worker productivity and quality. Harvard Business School Working Paper 24-013. (Vendor-academic partnership with BCG; cited as case material, not as independent evidence.)

Deloitte AI Institute. (2024–2026). State of generative AI in the enterprise. Quarterly survey series.

Dillon, E. W., Jaffe, S., Immorlica, N., and Stanton, C. T. (2025). Shifting work patterns with generative AI. Working paper, Microsoft Research and Harvard Business School. (Vendor-academic partnership with Microsoft; cited as case material, not as independent evidence.)

IBM Institute for Business Value. (2025). The 2025 CEO study.

McKinsey & Company. (2025). The state of AI: Global survey 2025. McKinsey QuantumBlack, November 2025.

Stanford Institute for Human-Centered Artificial Intelligence. (2025, 2026). AI Index Report.

Weill, P., Woerner, S. L., and Sebastian, I. M. (2024). Building enterprise AI maturity. MIT CISR Research Briefing, December 2024.