We thank the respondents for their comments. We agree with many, though in some cases we draw different conclusions.

The Role of Evaluations
One of the themes of the responses is that randomized evaluations are not the only method that can help inform policy in developing countries. We feel similarly. Our intention is to review the lessons of randomized evaluations on consumer behavior in health and education in the developing world, not to deny a role for other techniques. We need nationally representative surveys. We need participatory appraisals and the collaborations advocated by José Gómez-Márquez. In many cases, different approaches complement each other. For example, randomized trials can be, and often are, used to test hypotheses generated from analysis of non-experimental data or through participatory research.

Randomized evaluations are not suited for examining every issue. Macroeconomics and the long-run impacts of history, for example, are best left to other methods. However, we have learned over the last fifteen years that randomized evaluations can be used to address a broad range of questions related to environmental regulation, girls’ empowerment, trust-building in post-conflict environments, anti-corruption policy, school-choice programs, decentralization and citizen participation in public services, and other issues.

While many randomized studies do evaluate particular programs, researchers often choose an intervention because it offers the potential to examine an important general question. This is one of the most promising features of the new wave of evaluations. An analysis of a Save the Children program in Bangladesh, for example, was adapted to help shed light on larger questions about how delaying teen marriage affects maternal and child health. There are also cases in which a new theory is developed, empirical implications are derived, and then a randomized evaluation is designed explicitly to test the model’s implications. The ability to prospectively design the evaluation to precisely test a theory is another benefit of the approach.

Many of the respondents point out that context matters, and Pranab Bardhan is undoubtedly correct that we, like most economists, tend to look more for similarities among people than for differences. But researchers are in a much better position to pick up on these differences than we were before randomized evaluations took off. Rather than just relying on cross-country regressions or regressions that implicitly lump together, for example, spending on teachers and spending on textbooks, we now have evidence on specific programs in specific places, allowing us to examine whether the impact of a given kind of program is similar across contexts.

If impacts are sensitive to the details of the context and program, then only meticulous work of the type we discuss can improve our understanding. If subtlety is important, as Eran Bendavid argues, we have come a long way. Of course, as Daniel Posner points out, more extensive trials will fill in those details. They will help us understand how social interaction and peer effects within communities affect behavior, which elements of programs have impact, and how the outcomes of policies vary with circumstance.

It is also worth noting that the modern wave of randomized trials often embodies the continuous processes of learning, adaptation, and retesting that Chloe O’Gara rightly calls for. For example, I (Kremer) and several colleagues have been using an iterative process to learn how farmers make choices about fertilizer.

Ensuring Validity
Bendavid suggests that economists adopt institutional mechanisms from medicine to deal with the risk that researchers will cherry-pick results. In fact, the International Initiative for Impact Evaluation is establishing a registry for randomized and non-randomized evaluations and is commissioning systematic reviews of the literature designed to pick up papers with non-positive results that might not have made it into journals. Economists are also starting to move toward the ex-ante analysis commitments Bendavid recommends, such as declaring upfront which subgroups will be analyzed or how hundreds of metrics of social change will be combined. However, it is important to recognize that there are tradeoffs involved. There may be legitimate reasons to look at a particular subgroup or at an outcome that was not thought of in advance. Safeguards are important, but we don’t want the perfect to be the enemy of the good.

It is hard to see the case for singling out randomized evaluations for criticism on these grounds. The dangers of publication bias are surely greater with non-experimental data. Non-experimental researchers have much more discretion in choosing which observations to include or exclude, variables to control for, subgroups to examine, and structural models to impose on data. Only a small proportion of the analyses conducted wind up being published. The potential for publication bias is huge, and indeed formal tests suggest considerable publication bias prior to the advent of randomized evaluations. Unquestionably there is room for improvement in the conduct of randomized evaluations, but they are a step in the right direction as far as validity concerns go.

Behavioral Economics?
Bardhan and Diane Coyle think some of the results we quote could be explained by standard economics. We agree that standard economics has considerable explanatory power. Incentives and information matter across contexts and culture, as economists have long believed.

But there are patterns of behavior that standard models fail to explain. Coyle points out that the poor are often cash and credit constrained, and indeed, as we note, these constraints may be important factors affecting investments in health and education. Yet limited cash and credit cannot explain, for example, the demand for commitment devices. These devices help people lock their future selves into a behavior that is good for the long run but has short-term costs. Randomized evaluations have found a demand for commitment in many contexts—from the United States, to Kenya, to the Philippines.

Evaluation and Policy
Jishnu Das, Shantayanan Devarajan, and Jeffrey Hammer raise the important point that we cannot make policy recommendations without welfare economics. Indeed the studies of fertilizer use and recent work on water combine randomized evaluations with models that can be used to formally examine welfare issues.

Welfare economics gets harder in a world of irrational economic behavior, where, for instance, preferences are not necessarily consistent over time. But if people’s preferences are indeed inconsistent, it is important to know that, and to figure out how to do welfare economics in such circumstances.

Das, Devarajan, and Hammer also argue that we cannot subsidize everything. However, welfare economics provides a strong prima facie case for heavily subsidizing products such as vaccines, mosquito nets, water treatment, and de-worming medicine that fight infectious disease: to the extent that the products work, they provide benefits that go beyond the user. Contrary to Das, Devarajan, and Hammer’s implication, multiple medical studies have demonstrated not only that mosquito nets are effective in protecting users, but also that they aid those in the vicinity by killing malarial mosquitoes. Subsidies for basic education may be similarly justified in terms of societies’ concern for public welfare.

Das, Devarajan, and Hammer note that governments often perform badly. Many researchers are now using randomized evaluations to examine the impact of various reforms on government operations, looking at issues from corruption to teacher incentives and community control of schools. One colleague is examining whether service-delivery systems are less effective when delivery is free. Indeed, the finding cited by Das, Devarajan, and Hammer, that efforts to curb nurse absenteeism in India failed, was produced by a randomized evaluation. More generally, one type of randomized controlled trial asks, “if implementation is close to perfect, will this program change people’s lives?” Another asks, “if implementation is typically messy, is the program still effective at improving lives?”

A separate issue is that politicians may not implement certain policies, even if they are desirable. Surely that is true. Studies showing the benefits of school choice may not induce a politician dependent on the support of teachers’ unions to implement education reforms. But politicians are sometimes open to making changes based on evidence. Conditional cash transfers spread rapidly from Mexico to more than 30 countries, likely helped by the solid evidence of impact produced by randomized evaluations.

Moreover, as we have emphasized, relatively small, politically non-controversial changes in program design, such as auto-enrollment in a retirement program, sometimes have big impacts.

As we write, the government of Bihar, in eastern India, is implementing an ambitious program of free, school-based de-worming for nearly 10 million children, based on evidence accumulated through randomized evaluations. We do not yet know how well implementation will go in Bihar, but last year’s equivalent program in Kenya reached 3.5 million children. Good evidence can translate into concerted action.