With the recent rapid development of AI tools, many are hailing the promise of digital technologies to enhance civic engagement and improve governance and democracy. Political theorist Hélène Landemore, for example, argues that “AI has the potential to usher in a more inclusive, participatory, and deliberative form of democracy, including at the global scale.”

She is far from alone. Lily Tsai and colleagues at MIT’s Governance Lab say that “online platforms and generative AI give us extraordinary new opportunities to participate in discussions and policy deliberations with each other at scale.” Nils Gilman and Ben Cerveny suggest that “a technologically enabled form of continuous democratic engagement offers the promise of a government that is simultaneously more effective, more efficient and more directly responsive to the will of the public.” Beth Noveck, the first Chief AI Strategist of New Jersey, testified to a U.S. Senate committee about “AI’s unmatched potential to analyze public sentiment, manage feedback, and scale engagement across diverse demographics.” And for some this is also a business opportunity: a May 2023 report found that “the market for online participation and deliberation in Europe is expected to grow to €300mn in the next five years.”

In short, there is increasing optimism among both theorists and practitioners over the potential for technology-enabled civic engagement to rejuvenate or deepen democracy. Is this optimism justified?

The answer depends on how we think about what civic engagement can do. Political representatives are often unresponsive to the preferences of ordinary people. Their misperceptions of public needs and preferences are partly to blame, but the sources of democratic dysfunction run much deeper and are more structural than informational failures alone. Working to ensure many more “citizens’ voices are truly heard” will thus do little to improve government responsiveness in contexts where the distribution of power means that policymakers have no incentive to do what citizens say. And as some critics have argued, it can even distract from recognizing and remedying other problems, creating a veneer of legitimacy—what health policy expert Sherry Arnstein once famously derided as mere “window dressing.”

Still, there are plenty of cases where contributions from citizens can highlight new problems that need addressing, new perspectives through which issues can be understood, and new ideas for solving public problems—from administrative agencies seeking public input to city governments seeking to resolve resident complaints and citizens’ assemblies deliberating on climate policy. But even in these and other contexts, there is reason to doubt AI’s usefulness across the board. The possibilities of AI for civic engagement depend crucially on what exactly it is that policymakers want to learn from the public. For some types of learning, applications of AI can do much to enhance the efficiency and efficacy of information processing. For others, there is no getting around the fundamental need for human attention and context-specific knowledge to adequately make sense of public voices. We need to better understand these differences to avoid wasting resources on tools that might not deliver useful information.


In 2010 the United Kingdom’s coalition government held an online consultation on a promised new “Freedom Bill” aimed at “sweeping away meddlesome legislation and freeing up individuals and business from overbearing rules.” Over 45,000 people submitted suggestions for laws and regulations to be scrapped. But insiders said that Deputy Prime Minister Nick Clegg was left “floundering” in the face of the “sheer volume of information,” until finally he “felt he was being tied up in knots so he washed his hands of it” and the crowdsourcing effort was abandoned.

Large-scale participatory initiatives are plagued by this kind of information overload. Dealing with massive volumes of comments—sometimes numbering in the millions—is a persistent challenge for regulatory rulemaking in the United States. South Korea’s online platform accompanying the 2017 presidential transition received over 180,000 suggestions in just 49 days. Participatory components of Chile’s 2019–2022 constitution-making processes involved over 150,000 participants in over 16,000 online or in-person dialogues. The European Union’s 2021–2022 Conference on the Future of Europe’s online Multilingual Digital Platform received almost 17,000 ideas from over 43,000 contributors. One study of local policy crowdsourcing in the United States called this problem “civic data overload.”

These examples highlight the challenges of civic engagement at scale. Although face-to-face settings like town hall meetings or focus group discussions offer a relatively manageable volume of information to make sense of and process into potentially actionable policy learning, increasing numbers of real-world “democratic innovations” aim to operate at much larger scales, often with massive numbers of participants and sometimes even at national levels of government. These include participatory budgeting, policy crowdsourcing, and citizens’ assemblies. In Taiwan, policymakers are even using deliberative assemblies to inform policy on AI itself. If there is any hope of meaningfully learning from civic engagement in these contexts, policymakers need to be able to overcome the information overload problem.

Potential applications of AI to this problem—including the use of machine learning, LLMs, and tools like chatbots—can be divided into three types depending on the stage of policymaking they aim to enhance.

One type focuses on citizen inputs, aiming to reduce barriers to participation or improve its quality. Automated translation tools can help diverse populations deliberate together. Chatbots can provide automated assistance. LLM-based tools can help individuals draft clearer and more detailed proposals or concerns. Other tools can intervene in deliberations to prevent toxicity, or even serve as moderators, facilitators, or fact-checkers. Still others can help identify inputs that are similar to others already contributed, helping reduce redundancy or identify areas of agreement.

Another type focuses on the outputs of civic engagement and aims to help the broader public better understand or evaluate them. A recent example used a bespoke LLM trained on material from the French citizens’ assembly on end-of-life policy to help members of the public better understand the process and results. One recent proposal for integrating AI into citizens’ assemblies suggests that to better communicate results, they “could generate different versions of text to appeal to different audiences.”

A third type concerns the middle stage between inputs and outputs: an information processing stage in which public contributions are condensed into interpretable and actionable information suitable for meaningful learning by policymakers. This is where the information overload challenge is crucial. Given how widespread this challenge is, there is serious potential for AI to make important contributions. In fact, a wide variety of tools are already available. A recent review of methods for computational text analysis surveys tools for duplicate detection, thematic grouping, argument mining, and sentiment analysis, and the more recent development of LLMs like ChatGPT expands the range of possibilities even further.

But each tool is only useful for certain types of tasks. If there is a mismatch between what policymakers want to learn from the public and what an AI tool can actually do, then results are likely to be disappointing. And for some types of learning, AI is still not very useful at all. To better evaluate the potential of AI for information processing in civic engagement, we need to understand what these different types of learning are, and what their differences mean for potential AI applications.

What is it that policymakers want to learn from the public? For any given form of civic engagement inputs—like comments at a town hall meeting, responses to a consultation, submissions to complaint or crowdsourcing platforms, or even social media posts—there are very different types of conclusions that one might want to draw. Two key characteristics of these are specificity and novelty.

First, information that policymakers want to learn can be holistic—pertaining to the full set of inputs, like the results of a vote among participants, or a summary of all the topics they raised—or specific, pertaining to a specific problem that was reported, a specific proposal that was suggested, or a single particularly compelling perspective. To draw holistic conclusions, information can be aggregated, which provides easy-to-understand summaries but at the cost of destroying the content of individual inputs themselves. But to draw specific conclusions, information needs to be filtered, preserving the content of some individual inputs but limiting their volume to a small enough subset for policymakers to actually pay attention to. This requires individual attention to each input, whether by human or by machine, to decide which ones to keep.

It also matters whether information is expected to be novel or familiar. Could its measures or categories have been anticipated and defined ahead of time? Sometimes this is straightforward, as when citizens are asked a yes-or-no question, prioritize a predetermined list of issues, or propose solutions that are easy to score on a common metric. But sometimes policymakers want to be able to discover entirely new problems that they weren’t already aware of, innovative solutions to “wicked problems,” or perspectives that they wouldn’t have thought to ask about ahead of time. This kind of learning requires some way to recognize these types of inputs and what makes them special, on the basis of context-specific knowledge and extrapolation beyond past experience, without being able to easily apply already familiar definitions or categories.


These two characteristics provide a four-way typology of different types of learning from public engagement: each combination of holistic or specific, familiar or novel. None of these are necessarily more or less useful than the others; it’s simply a matter of what policymakers want to learn in any given setting. What distinguishes them are the potentially costly requirements necessary to process inputs into the outputs policymakers actually want to learn.

How can AI help with these? Consider each case in turn.

First, there’s holistic, familiar learning. This is what a vote between predefined options conveys. If inputs from the public are already in a numeric or categorical form, no further processing is needed beyond simple statistical analysis. No need for AI here! When inputs from the public are open-ended, however, a range of AI tools can help lower the costs of converting them into measures or categories for aggregation. This includes tools like sentiment analysis to measure approval or disapproval, ideological scaling to measure policy preferences, or supervised machine learning to categorize contributions using past examples. Although AI can be effective here, this type of learning is not where information overload is usually a major challenge to begin with. And information about public preferences alone is unlikely to have much impact on issues where entrenched interests and polarized views are already well known.
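
To make this concrete, here is a minimal sketch in Python of that kind of aggregation: an off-the-shelf sentiment classifier (from the Hugging Face transformers library) converts a handful of hypothetical open-ended comments into a single holistic measure. The comments and the positive-share summary are illustrative assumptions, not a description of any particular platform.

```python
# A minimal sketch, not a production tool: convert hypothetical open-ended
# comments into a single holistic measure (share of positive sentiment)
# using an off-the-shelf classifier, then aggregate.
from collections import Counter
from transformers import pipeline

comments = [  # hypothetical examples
    "The new bus routes have made my commute much easier.",
    "Closing the library on weekends is a terrible idea.",
    "I support the proposed bike lanes on Main Street.",
]

classifier = pipeline("sentiment-analysis")  # downloads a default English model
labels = [result["label"] for result in classifier(comments)]

counts = Counter(labels)
share_positive = counts["POSITIVE"] / len(comments)
# Note what is lost: the output is a single number, not any individual comment.
print(f"{share_positive:.0%} of comments classified as positive")
```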

Second, for holistic but novel learning, AI tools can be used to summarize public contributions in ways that enable emergent, potentially unanticipated outputs. So-called “topic models” and other related keyword-clustering tools identify clusters of words that tend to appear together. Unlike a predefined set of categories, topic models can identify unanticipated, emergent themes, but they require human judgment to interpret what those themes actually mean. In similar fashion, LLMs, as well as tools like argument mining, can be used to produce holistic summaries.
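
As an illustration, the sketch below fits a small topic model (latent Dirichlet allocation, via scikit-learn) to a few hypothetical comments. The themes it surfaces are emergent rather than predefined, but a human still has to decide what each cluster of words means.

```python
# A minimal sketch: fit a small topic model (LDA) to hypothetical comments.
# The word clusters it returns are emergent; interpreting them is human work.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

comments = [  # hypothetical examples
    "Potholes on the highway are damaging cars",
    "The bridge needs repairs before winter",
    "More funding for school lunches please",
    "Teachers need smaller class sizes",
    "Fix the cracked pavement near the school crossing",
]

vectorizer = CountVectorizer(stop_words="english")
doc_term = vectorizer.fit_transform(comments)

lda = LatentDirichletAllocation(n_components=2, random_state=0)  # 2 themes, arbitrary
lda.fit(doc_term)

# Show the top words for each emergent topic; a human must decide what they mean.
words = vectorizer.get_feature_names_out()
for i, topic in enumerate(lda.components_):
    top_words = [words[j] for j in topic.argsort()[-5:][::-1]]
    print(f"Topic {i}: {', '.join(top_words)}")
```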

Many recent examples of AI for civic engagement do just this: aggregating across contributions rather than filtering for specifics, but without relying on pre-defined measures or categories. A recent review noted that “Some platforms already use AI to suggest groupings of proposals by common themes or to suggest keywords in comments.” One company offers government entities tools to “effortlessly turn large data sets into key themes within seconds, uncovering the issues that matter most to your community.” The UK government is developing a “Consultation Analyser” based on topic modelling, in the hope of saving some of the roughly 80 million pounds per year spent on consultations. And the U.S. Consumer Financial Protection Bureau uses tools including topic models to “identify emerging trends and statistical anomalies in large volumes of complaints.”

Of course, there is a danger that AI-written summaries of public consultations may be distorted or biased (though they still could be less biased than human-generated summaries). When the Australian Securities and Investments Commission trialled AI summarization of public submissions in early 2024, it found the “summaries were so bad that the assessors agreed that using them could require more work down the line.”

But even if summarization works, there is a more fundamental limitation. Aggregated information outputs are just that: aggregated. There is a difference between learning that 25 percent of comments are about infrastructure, for instance, and learning that one specific bridge has cracks indicating a danger of imminent collapse. The latter information is drowned out by aggregation; it would not feature in a thematic summary. Even sophisticated AI tools like Talk to the City that recognize the importance of “preserving the diversity and nuance of individual opinions”—also being used in Taiwan’s AI alignment assemblies—still primarily aggregate inputs into thematic categories. With the current wave of enthusiasm over AI for civic engagement featuring such a profusion of summarization-based tools, there is a danger that practitioners lose focus on the value of learning from specific contributions too.

Third, for specific, familiar learning, it is relatively straightforward to employ humans who are not subject matter experts, or some form of artificial intelligence, to filter inputs using either clear rules or past examples. While there are a huge variety of methods for supervised machine learning, all operate on the same basic principle of inferring patterns that appear in training data, then using those patterns to classify new cases. For example, past citizen complaints already categorized for assignment to different government agencies could be used as training data to categorize future complaints—provided that one isn’t worried about novel use cases emerging in the future. Likewise, past public suggestions that have either been accepted or declined by policymakers could be used as training data to predict the value of future suggestions—provided that one is comfortable assuming a lack of bias in past decisions, and assuming that the features associated with “usefulness” don’t change over time.
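
Here is a minimal sketch of what such filtering might look like, using scikit-learn and a handful of hypothetical, already labeled complaints as training data. Any real system would need far more examples and careful auditing of the historical labels for exactly the biases just described.

```python
# A minimal sketch of "specific, familiar" filtering: route new complaints to
# agencies using past, already-labeled complaints as training data.
# All data here are hypothetical placeholders.
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

past_complaints = [
    "Streetlight out on Elm Street for two weeks",
    "Garbage collection missed our block again",
    "Water pressure very low since Tuesday",
    "Broken traffic signal at Fifth and Oak",
]
past_labels = ["public_works", "sanitation", "water", "public_works"]

model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(past_complaints, past_labels)

# The model can only reproduce categories it has already seen; a genuinely new
# kind of problem will be forced into one of the existing buckets.
new_complaints = ["The traffic light near the school never turns green"]
print(model.predict(new_complaints))
```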

Other forms of AI can also be used for filtering. LLMs sometimes perform well at classifying open-ended inputs into predefined categories on the basis of clear prompts—identifying the “most important issue” expressed in survey responses, for example. And where policymakers want to filter for particularly high or low values on a single metric, natural language processing tools can be used to produce measures on the basis of similarity scores between near-duplicate contributions, sentiment analysis, linguistic markers, deliberative quality, or topic probabilities derived from topic models. But whether the concepts these methods measure are actually useful to policymakers will vary from case to case. And practitioners still need to be wary of fundamental concerns like algorithmic bias and data selection bias. If complaint processing is automated using training data from web-based platforms only, say, the resulting system may perform worse for demographics who tend to report complaints by telephone—often older individuals.
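
For instance, near-duplicate detection of the kind mentioned above can be sketched with nothing more than TF-IDF vectors and cosine similarity. The comments and the 0.6 threshold below are hypothetical choices that would need tuning in any real deployment.

```python
# A minimal sketch of near-duplicate detection: flag a contribution whose
# TF-IDF cosine similarity to an earlier one exceeds a (hypothetical) threshold.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

comments = [  # hypothetical examples
    "Please extend library opening hours on weekends",
    "Extend the library weekend opening hours please",
    "Build a new skate park in the east district",
]

tfidf = TfidfVectorizer().fit_transform(comments)
similarity = cosine_similarity(tfidf)

THRESHOLD = 0.6  # arbitrary; would need tuning against real data
for i in range(1, len(comments)):
    j = int(np.argmax(similarity[i, :i]))  # most similar earlier comment
    if similarity[i, j] > THRESHOLD:
        print(f"Comment {i} looks like a near-duplicate of comment {j}")
```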

Finally, there’s specific and novel learning. Much of the promise of public participation in government is precisely its ability to inform policymakers about novel problems, novel solutions, or novel perspectives from the public. For example, the French online platform Parlement et Citoyens brought novel information to the attention of legislators through a consultation on draft legislation on pesticide use when “one of the 521 participants spotted a potential loophole . . . and suggested an amendment which was later implemented.” Similarly, the U-Report SMS platform in Uganda, while initially used primarily for opinion polls, ultimately also enabled the government to identify unanticipated public problems when unsolicited messages from participants highlighted disease outbreaks in rural areas.

But for this kind of learning to work, policymakers need to be able to successfully filter the useful public contributions from those that are less useful—and to do so on the basis of measures that cannot easily be defined ahead of time, explained with clear rules, or demonstrated with past examples. While there are a wide variety of ways this kind of information processing can be done, all of them involve unavoidable tradeoffs between minimizing costs, minimizing biases, and minimizing inaccuracies.

Consider Deputy Prime Minister Clegg’s 2010 crowdsourcing attempt. Policy experts with the regulatory knowledge necessary to recognize novel value could have read tens of thousands of submissions themselves, but this would have required an incredible investment of already scarce time. And given age-old concerns about politician-bureaucrat conflicts, Clegg perhaps worried that civil servants would not share his own goals and so would be biased in how they went about filtering.

The filtering task could instead have been delegated to interested third parties outside of government—advocacy or industry groups with the relevant knowledge as well as willingness to bear the costs of reading large volumes of submissions themselves. But precisely because those groups have relevant material or ideological interests, the results would have been highly biased and lacking in public legitimacy. Alternatively, the task of filtering could have been delegated to voluntary review by citizens at large, either individually or in groups. Although volunteers’ time might not be as expensive to the government, there is no guarantee that ordinary people without context-specific knowledge—in this case, regulatory expertise—would be able to recognize the suggestions most useful to take forward (whether “useful” refers to public value or to more political goals).

Given these limitations, it is tempting to think that AI could offer transformational improvements for this type of learning. But how can you automate filtering for unknown criteria? Supervised machine learning requires labelled training data, which means the desired measures or categories to be learned are already known from past examples. And in any setting where relevant features of the world change over time, naïve applications of past training data will fail to identify truly novel problems, solutions, or perspectives.

The same fundamental limits apply to more recently developed LLMs. While they can perform relatively well at classifying according to clear rules or at summarizing in the aggregate, recognizing the truly novel importance of specific individual contributions means operating without any pre-defined instructions or training data. If these existed, it wouldn’t be a novel task. The upshot is that this kind of recognition requires extrapolation beyond the training data, something LLMs generally perform poorly at. And even with increasingly massive volumes of data (often collected unlawfully) to train on, LLMs have been shown to perform worse in settings that were less prevalent in their training data—such as non-Western contexts, less widely spoken languages, or less common subjects. This “prevalence” bias will be a greater problem precisely in settings where context-specific knowledge is more important for recognizing novel value.

Putting these four cases together, we see that the potential of AI to enable better information processing depends crucially on the type of information that policymakers actually want to learn. If they want to be able to learn things that are both specific and novel—where information processing requires both individual attention and context-specific knowledge—then AI is not likely to be very useful unless society is willing to accept either serious biases or major inaccuracies. And restricting the scope of policy learning to already known measures and categories could even serve to simply lock in existing political inequalities.

These dangers will only be compounded if policymakers exhibit “automation bias”—a tendency of decisionmakers to uncritically defer to algorithmic recommendations even where their own expertise and experience say otherwise. The very fact that a summary or a priority score is produced by AI, for instance, might lead policymakers to overlook problems of biased or out-of-date training data, inaccurate scores, or distorted summaries.


Although some more recent developments suggest potential ways out of these tradeoffs, we should still approach them with skepticism. Can local implementations of LLMs using context-specific documents—so-called “retrieval-augmented generation”—help? Although they may ameliorate the prevalence bias problem, they still face the limitation of any form of AI built on training data: those documents pertain to the past, not the future. A complaint platform AI trained on past complaints and responses will fail to recognize an entirely new type of problem that has not appeared before—precisely the type of problem most pressing to identify.

Recently Bruce Schneier and Nathan Sanders proposed a different approach for U.S. regulatory consultations, suggesting that outlier detection can help to identify “those data points that fall outside the mold—comments that don’t use arguments that fit into the neat little clusters.” While this is a potentially fruitful approach, it is likely to have serious accuracy problems: high rates of both false negatives (useful comments that are not detected) and false positives (detected outliers that are not actually useful). In most real-world settings, a large majority of outlier contributions will simply be oddities rather than useful insights. It will still require costly human judgment to separate the useful outliers from the merely odd ones.
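
One simple version of this idea treats the comment that is least similar, on average, to all the others as a candidate outlier, as in the hypothetical sketch below. As the accuracy worries above suggest, whatever gets flagged this way still has to be read and judged by a person.

```python
# A minimal sketch of outlier detection on public comments: flag the comment
# that is least similar, on average, to all the others. Comments are
# hypothetical; flagged outliers still need human review to judge usefulness.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

comments = [  # hypothetical examples
    "Lower the proposed licensing fee for small businesses",
    "The licensing fee is too high for small businesses",
    "Please reduce the licensing fee for small businesses",
    "Paragraph 12(b) conflicts with the 2019 statute and creates a loophole",
]

tfidf = TfidfVectorizer(stop_words="english").fit_transform(comments)
similarity = cosine_similarity(tfidf)
np.fill_diagonal(similarity, 0.0)  # ignore self-similarity

average_similarity = similarity.mean(axis=1)
outlier = int(np.argmin(average_similarity))  # the least typical comment
print(f"Possible outlier worth human attention: {comments[outlier]}")
```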

More optimistically, suggestions like these could still be developed into hybrid approaches, combining artificial intelligence, collective intelligence, and human expertise together in ways that increase efficiency while mitigating some of the limitations of each. Even where AI-based tools might struggle to recognize the most useful novel contributions, they can still help to prune the least useful ones, such as so-called “mass, computer-generated, and malattributed comments.” Topic models and similar clustering approaches, while unlikely to help identify the most useful specific contributions, might be able to help increase the efficiency of domain experts by ensuring that each contribution is reviewed by the individual best able to assess its particular qualities. And more sophisticated tools like Polis, All Our Ideas, or Policy Synth might help extract more informative measures from ratings of crowdsourced proposals by participants who do have context-specific knowledge.
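
As one illustration of that routing idea, the sketch below matches each contribution against hypothetical keyword portfolios for different domain experts using simple TF-IDF similarity, a crude stand-in for the topic-model-based assignment described above. The experts, portfolios, and contributions are all invented for illustration.

```python
# A minimal sketch of routing contributions to domain experts: send each
# contribution to the expert whose keyword portfolio it most resembles.
# Experts, portfolios, and contributions are all hypothetical.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

expert_portfolios = {
    "transport_expert": "road bridge transit bus traffic pavement crack repair",
    "education_expert": "school teacher curriculum classroom class sizes lunch",
}
contributions = [
    "The bridge on Route 9 has visible cracks in two support beams",
    "Class sizes at the primary school have become unmanageable",
]

vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(list(expert_portfolios.values()) + contributions)
portfolio_vectors, contribution_vectors = matrix[:2], matrix[2:]

similarity = cosine_similarity(contribution_vectors, portfolio_vectors)
experts = list(expert_portfolios)
for text, scores in zip(contributions, similarity):
    # Assign each contribution to whichever expert's portfolio it matches best.
    print(f"{experts[scores.argmax()]} <- {text}")
```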

An ensemble of multiple approaches might offer the most promising way forward. Just as recommendations across a wide variety of AI application settings focus on the importance of keeping a “human in the loop,” for civic engagement the best recommendation may be to find ways of keeping “context-specific knowledge in the loop.” This is absolutely not to say that AI shouldn’t be applied to civic engagement, at least with the right guardrails. The key lessons are about the importance of context and ensuring the right match between policymakers’ learning goals and the tools they use. Both the usefulness and the limitations of AI will depend on how it is applied and for what kind of learning.

Perhaps future advances in AI will go on to mitigate some of these tradeoffs. But even the newest wave of LLMs are still based on training data, and so fundamentally bear the limitations and risks that any training data imposes: algorithmic bias; temporal dependence; “garbage in, garbage out.” More broadly, better understanding the importance of context-specificity and temporal extrapolation helps highlight the types of tasks for which this current generation of AI advances still remain poorly suited. Some things simply still require the scarce attention of humans with context-specific knowledge.
