Social experiments
Background
Evidence-based policy
Random assignment
Random assignment of groups
Propensity score matching
Background
Since 1991 SRDC has been working to encourage greater use of social experiments to test innovative social policy ideas. Randomized experiments were rarely used in the social sciences until the latter half of the 20th century. Indeed, even the rigorous assessment of medical interventions, especially new drug therapies, by means of randomized controlled trials, which are so common today, did not become widespread until well after the Second World War.
An excellent history of the development of clinical research can be found in Marks, H. M. (1997). The progress of experiment: Science and therapeutic reform in the United States, 1900–1990. Cambridge, UK: Cambridge University Press.
Much of the foundation for random assignment experiments was actually laid by Ronald A. Fisher’s analysis of agricultural production. Fisher randomly chose fields to receive applications of various fertilizers and demonstrated that systematic differences in crop production were due to fertilization. Fisher went on to help other researchers understand how to implement random assignment studies and how to draw formal statistical inferences from their results.
The classic work by Fisher is Fisher, R. A. (1925). Statistical methods for research workers. Edinburgh, UK: Oliver & Boyd.
Over the past 50 years, random assignment designs have come to be widely accepted as the “gold standard” in evaluation research. A list of some 260 social experiments, initiated between 1962 and 2003 and involving almost a million people, can be found in the latest edition of The Digest of Social Experiments.
Greenberg, D., & Shroder, M. (2004). The digest of social experiments, third edition. Washington, DC: The Urban Institute Press.
For most of the past half century, the increasing use of randomized experiments in social policy was an American phenomenon. The list of experiments in the Digest includes 225 experiments in the US and only 35 from elsewhere (14 in the United Kingdom, 5 in Canada, 3 in the Netherlands, 3 in Norway, and 1 each in Argentina, Australia, Colombia, Denmark, Germany, India, Israel, Mexico, Sweden, and Switzerland).
The Canadian experiments included in the Digest are the Manitoba Basic Annual Income Experiment (MINCOME) from the 1970s and four experiments involving SRDC and Human Resources and Social Development Canada — the Self-Sufficiency Project, the Earnings Supplement Project, the Community Employment Innovation Project, and learn$ave (information on all these experiments can be found elsewhere on this Web site).
The Digest missed an experiment that ran in Ontario between 1997 and 2001 to test a crime reduction strategy aimed at serious youth offenders. It was conducted by the Centre for Children and Families in the Justice System and funded by the Ontario Ministry of Community and Social Services and the National Crime Prevention Centre.
For a report on this experiment, see Centre for Children and Families in the Justice System. (2002). One step forward: Lessons learned from a randomized study of Multisystemic Therapy in Canada. London, Ontario: Centre for Children and Families in the Justice System, accessible at www.lfcc.on.ca/CCFJS_researchreports.html.
Evidence-based policy
The growing interest in randomized experiments is linked to the concept of “evidence-based policy” — an idea that came to prominence in Britain during the 1990s and which was based on the concept of “evidence-based medicine” championed by the Cochrane Collaboration. The Cochrane Collaboration, named for the British epidemiologist, Archie Cochrane, was founded in 1993. It is an international, non-profit, independent organization dedicated to making up-to-date, accurate information about the effects of healthcare readily available worldwide. The library maintained by the Cochrane Collaboration currently contains almost 450,000 articles on randomized experiments in medicine.
The Web site of the Cochrane Collaboration is www.cochrane.org.
More recently, efforts have been underway to adapt the Cochrane approach to a wider range of policy issues. The Campbell Collaboration was inaugurated in February 2000. Its stated aim is to produce, disseminate, and continuously update systematic reviews of studies of the effectiveness of social and behavioural interventions, including education interventions. At the heart of the activities of the Campbell Collaboration are “Campbell reviews”: systematic reviews, conducted according to strict protocols, of the results generated by studies of particular interventions. The goal of these reviews is twofold: first, to sift through evaluation studies and separate the good from the bad and, in the process, “raise the bar” for what constitutes reliable evaluation evidence; second, to make what is known about the effectiveness of social policies and programs accessible to a wide audience, including policy-makers, advocates, service delivery organizations and other practitioners, the media, and the general public (including those who are the subjects of the interventions).
The Web site of the Campbell Collaboration is www.campbellcollaboration.org.
In the UK, a central feature of the efforts to reform and modernize the machinery of government has been a commitment to evidence-based policy. The 1999 British White Paper, Modernizing Government, stated that government policy must be evidence-based, properly evaluated, and based on best practice. A report published the same year by the UK Cabinet Office Strategic Policy Making Team, Professional Policy Making for the 21st Century, stated that policy-making must be soundly based on evidence of what works.
These developments are reviewed in Davies, P. (2004). Is evidence-based government possible? Jerry Lee lecture presented at the 4th Annual Campbell Collaboration Colloquium, Washington, DC, February 19, 2004. Prime Minister’s Strategy Unit, Cabinet Office, London, UK.
This document can be viewed at www.policyhub.gov.uk/downloads/JerryLeeLecture1202041.pdf.
Parallels to this development were seen in the US in the program of educational reform emanating from the No Child Left Behind Act of 2002, which placed greater emphasis on accountability. This legislation was accompanied by the 2002 Education Sciences Reform Act, which had as its goal the transformation of education into an evidence-based field in which decision-makers routinely seek out the best available research and data before adopting programs and practices. One of the earliest activities of the newly created Institute of Education Sciences was to establish the What Works Clearinghouse to serve as a source of scientific evidence about what works in education. The Clearinghouse judges the strongest evidence of program effects to come from studies that use random assignment or regression discontinuity designs.
The Web site for the What Works Clearinghouse is www.w-w-c.org.
Interest in evidence-based policy-making and its link to social experimentation led to the convening in June 2002 of a special symposium on “Randomized Controlled Trials in Social Science” at Nuffield College, Oxford University. This, in turn, resulted in the publication of a special volume of the Annals of the American Academy of Political and Social Science:
Sherman, L. W., special editor. (2003). Misleading evidence and evidence-led policy: Making social science more experimental. Annals of the American Academy of Political and Social Science, 589 (September 2003).
Random assignment
In evidence-based policy-making, the focus is on determining “what works” and investing scarce public resources in initiatives that have demonstrated their effectiveness. In this case, “works” is defined as having the intended impacts, and determining that a program works means establishing a credible causal relationship between the intervention and its observed effects.
The defining characteristic of a randomized experiment is the use of a random assignment design by which participants in the research project are assigned at random to a treatment group that is eligible to receive the intervention being tested or to a control group that is not eligible. This is the only method that is guaranteed to eliminate selection bias and thereby produce unbiased estimates of program impacts.
The process of random assignment ensures that there are no systematic differences between the program group and the control group. It eliminates selection bias by ensuring that the two groups are the same, on average, in terms of all characteristics, whether observed or unobserved, measured or unmeasured. For example, the groups are statistically equivalent in terms of their motivation to participate in the program, their demographic characteristics, and their past life experiences. They differ only in that one group is eligible for the program and the other is not. Therefore, any differences that are observed over time in the experiences of the two groups (and that exceed the statistical fluctuations that can occur due to chance) can be attributed with confidence to the program.
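The contrast between self-selection and random assignment can be illustrated with a small simulation. The sketch below is only an illustration under stated assumptions: the population, the "motivation" variable, the outcome equation, and the true impact of 2.0 are all hypothetical and are not drawn from any of the experiments cited here. It compares the simple difference in mean outcomes when motivated people select themselves into a program with the same difference when a coin flip decides who is eligible.

```python
import random
import statistics

TRUE_IMPACT = 2.0  # hypothetical program effect, chosen only for illustration


def outcome(motivation, in_program, rng):
    """Hypothetical outcome: depends on motivation, program status, and noise."""
    return 10.0 + 3.0 * motivation + TRUE_IMPACT * in_program + rng.gauss(0.0, 1.0)


def difference_in_means(people):
    """Mean outcome of program members minus mean outcome of everyone else."""
    treated = [y for in_program, y in people if in_program]
    untreated = [y for in_program, y in people if not in_program]
    return statistics.mean(treated) - statistics.mean(untreated)


def run(n=20000, seed=1):
    """Return (self-selected estimate, randomized estimate) of the impact."""
    rng = random.Random(seed)
    # Motivation is unobserved by the evaluator but affects outcomes.
    motivations = [rng.gauss(0.0, 1.0) for _ in range(n)]

    # Self-selection: more motivated people are more likely to enrol.
    self_selected = []
    for m in motivations:
        in_program = 1 if m + rng.gauss(0.0, 1.0) > 0 else 0
        self_selected.append((in_program, outcome(m, in_program, rng)))

    # Random assignment: a coin flip decides, regardless of motivation.
    randomized = []
    for m in motivations:
        in_program = 1 if rng.random() < 0.5 else 0
        randomized.append((in_program, outcome(m, in_program, rng)))

    return difference_in_means(self_selected), difference_in_means(randomized)


if __name__ == "__main__":
    naive, experimental = run()
    print(f"self-selected estimate: {naive:.2f}")
    print(f"randomized estimate:    {experimental:.2f}")
    print(f"true impact:            {TRUE_IMPACT:.2f}")
```

In this hypothetical setup the self-selected comparison overstates the impact because the enrolled group is more motivated to begin with, while the randomized comparison should land close to the true value, apart from chance variation.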
This is important because the outcomes that individuals experience are usually not the result of their program status alone (i.e. whether they were eligible to take part in the program or not) but are also likely to be influenced by the characteristics of the individuals themselves at the same time the program is operating. These individual characteristics represent additional causal factors (or covariates) in determining outcomes. If members of the program and control groups differ in terms of important characteristics at baseline, then any observed differences in outcomes will be due both to program participation and to differences between the groups in terms of these other causal factors.
Strictly speaking, in a random assignment design the expected values of the averages for all pre-existing characteristics, or covariates, of the program group and the control group are the same, although their actual values may differ somewhat, especially in small samples. Random assignment ensures that the two groups will not differ systematically, but it does not guarantee that they will be identical. Random differences can still occur. They do not bias the impact estimates, but they do reduce the precision of the estimates. Data on the characteristics of the sample that are collected just prior to random assignment can be used subsequently in regression models to improve the precision of the estimates.
There is a growing body of literature dealing with attempts to use non-experimental methods to replicate the experimental findings of evaluations of employment and training programs in the United States. The results have generally been disappointing.
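The gain in precision from using baseline data can be sketched with a repeated simulation. Everything in the example below is an assumption made for illustration: a single baseline covariate, a linear outcome equation, a true impact of 1.0, and samples of 200 people. The adjusted estimate subtracts the chance imbalance in the covariate, scaled by the pooled within-group slope, which gives the same impact estimate as a regression of the outcome on a program-group indicator and the covariate.

```python
import random
import statistics


def one_experiment(rng, n=200, true_impact=1.0):
    """Simulate one small experiment; return (unadjusted, adjusted) estimates."""
    x = [rng.gauss(0.0, 1.0) for _ in range(n)]      # baseline covariate
    t = [1] * (n // 2) + [0] * (n - n // 2)
    rng.shuffle(t)                                    # random assignment
    y = [2.0 * xi + true_impact * ti + rng.gauss(0.0, 1.0)
         for xi, ti in zip(x, t)]

    # Group means of the covariate and the outcome (1 = program, 0 = control).
    mean_x, mean_y = {}, {}
    for g in (0, 1):
        mean_x[g] = statistics.mean(xi for xi, ti in zip(x, t) if ti == g)
        mean_y[g] = statistics.mean(yi for yi, ti in zip(y, t) if ti == g)

    unadjusted = mean_y[1] - mean_y[0]

    # Pooled within-group slope of the outcome on the covariate.
    sxy = sum((xi - mean_x[ti]) * (yi - mean_y[ti])
              for xi, yi, ti in zip(x, y, t))
    sxx = sum((xi - mean_x[ti]) ** 2 for xi, ti in zip(x, t))
    slope = sxy / sxx

    # Remove the part of the difference explained by chance imbalance in x.
    adjusted = unadjusted - slope * (mean_x[1] - mean_x[0])
    return unadjusted, adjusted


def compare_precision(replications=500, seed=3):
    """Repeat the experiment many times and summarize both estimators."""
    rng = random.Random(seed)
    results = [one_experiment(rng) for _ in range(replications)]
    unadj = [u for u, _ in results]
    adj = [a for _, a in results]
    return {
        "mean_unadjusted": statistics.mean(unadj),
        "mean_adjusted": statistics.mean(adj),
        "sd_unadjusted": statistics.stdev(unadj),
        "sd_adjusted": statistics.stdev(adj),
    }


if __name__ == "__main__":
    for name, value in compare_precision().items():
        print(f"{name}: {value:.3f}")
```

Under these assumptions both estimators should average close to the true impact, since neither is biased, but the adjusted estimates should vary less from one replication to the next.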
See, for example:
Bell, S. H., Orr, L. L., Blomquist, J. D., & Cain, G. C. (1995). Program applicants as a comparison group in evaluating training programs. Kalamazoo, MI: W. E. Upjohn Institute for Employment Research.
Fraker, T. M., & Maynard, R. A. (1987). The adequacy of comparison group designs for evaluations of employment related programs. The Journal of Human Resources, 22: 194–227.
Friedlander, D., & Robins, P. K. (1995). Evaluating program evaluations: New evidence on commonly used nonexperimental methods. American Economic Review, 85: 923–937.
Glazerman, S., Levy, D., & Myers, D. (2003). Nonexperimental versus experimental estimates of earnings impacts. The Annals of the American Academy of Political and Social Science, 589 (September 2003): 63–93.
LaLonde, R. J. (1986). Evaluating the econometric evaluations of training programs with experimental data. American Economic Review, 76: 604–620.
The basic techniques and approaches to random assignment are well established. A number of books and articles have been published that deal with the benefits of random assignment and the techniques associated with fielding a social experiment.
See, for example:
Boruch, R. F. (1997). Randomized experiments for evaluation and planning. Thousand Oaks, CA: Sage Publications.
Borus, M. E. (1979). Measuring the impact of employment-related social programs. Kalamazoo, MI: The W. E. Upjohn Institute for Employment Research.
Burghardt, J., McConnell, S., Meckstroth, A., & Schochet, P. (1997). Implementing random assignment: Lessons from the National Job Corps study. Princeton, NJ: Mathematica Policy Research.
Burtless, G. (1995). The case for randomized field trials in economic and policy research. Journal of Economic Perspectives, 9(2): 63–84.
Burtless, G., & Orr, L. (1986). Are classical experiments needed for manpower policy? The Journal of Human Resources, 21(4): 606–639.
Greenberg, D., & Robins, P. (1986). Social experiments in policy analysis. Journal of Policy Analysis and Management, 5(2): 340–362.
Mohr, L. (1995). Impact analysis for program evaluation, 2nd ed. Thousand Oaks, CA: Sage Publications.
Mosteller, F., & Boruch, R., eds. (2002). Evidence matters: Randomized trials in education research. Washington, DC: Brookings Institution Press.
Myers, D., & Dynarski, M. (2003). Random assignment in program evaluation and intervention research: Questions and answers. Washington, DC: US Department of Education, Institute of Education Sciences, National Center for Education Evaluation and Regional Assistance. Accessible at: www.mathematica-mpr.com/publications/PDFs/randomassign.pdf.
Orr, L. L. (1999). Social experiments: Evaluating public programs with experimental methods. Thousand Oaks, CA: Sage Publications.
Schmidt, C. M. (1999). Knowing what works: The case for rigorous program evaluation. (Discussion Paper No. 77). Bonn, Germany: The Institute for the Study of Labor (IZA).
Treasury Board of Canada, Secretariat (n.d.). Program evaluation methods: Measurement and attribution of program results, third edition. Ottawa: Minister of Public Works and Government Services. Accessible at www.tbs-sct.gc.ca/eval/pubs/meth/pem-mep01_e.asp.
Random assignment of groups
Methodological advances are being made that have the potential to further extend the reach of social experiments. For example, “cluster” random assignment is being used to randomly assign entities larger than an individual. This approach is applicable where the unit of analysis is a group (e.g. a community, a classroom, a school, or a workplace) or where the outcomes of those exposed to a program have the potential to “spill over” and affect the outcomes of others.
For example, this methodology is being used in social experiments that involve the delivery of employment-related services to residents of public housing developments in the US (the Jobs-Plus project) and the provision of cash transfers to low-income families linked to school and health clinic attendance in poor communities in Mexico (PROGRESA).
To find out more about these examples see:
Bloom, H. S., Riccio, J. A., & Verma, N. (2005). Promoting work in public housing: The effectiveness of Jobs-Plus. New York, NY: Manpower Demonstration Research Corporation, which can be accessed at www.mdrc.org/publications/405/overview.html.
Skoufias, E. (2001). PROGRESA and its impacts on the human capital and welfare of households in rural Mexico: A synthesis of the results of an evaluation by IFPRI. Washington, DC: International Food Policy Research Institute, which can be accessed at www.ifpri.org/themes/progresa/pdf/Skoufias_finalsyn.pdf.
This approach is also being examined for evaluations of education programs, where in some cases it may be more appropriate to randomly assign classrooms or whole schools, rather than individual students.
For applications to education programs, see:
Schochet, P. (2005). Statistical power for random assignment evaluations of education programs. Document No. PR05-36. Princeton, NJ: Mathematica Policy Research, Inc., accessible at www.mathematica-mpr.com/publications/PDFs/statisticalpower.pdf.
Bloom, H., Bos, J., & Lee, S. (1999). Using cluster random assignment to measure program impacts: Statistical implications for the evaluation of education programs. New York, NY: Manpower Demonstration Research Corporation, accessible at www.mdrc.org/publications/93/full.pdf.
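As a rough illustration of cluster random assignment, the sketch below assigns whole schools, rather than individual students, to the program or control group, and every student inherits the status of their school. The school and student identifiers are hypothetical. The sketch also includes the standard design effect approximation for equal-sized clusters, 1 + (m - 1) × ICC, where m is the cluster size and ICC is the intra-class correlation. It indicates how much the variance of an impact estimate is inflated relative to assigning the same number of individuals one by one. It is a simplification, not the full power analysis set out in the references above.

```python
import random


def assign_clusters(cluster_ids, seed=0):
    """Randomly assign half of the clusters to the program group, the rest to control."""
    rng = random.Random(seed)
    shuffled = sorted(set(cluster_ids))
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return {cid: ("program" if i < half else "control")
            for i, cid in enumerate(shuffled)}


def assign_members(members, cluster_status):
    """Each member (member_id, cluster_id) inherits the status of their cluster."""
    return {member_id: cluster_status[cluster_id]
            for member_id, cluster_id in members}


def design_effect(cluster_size, icc):
    """Variance inflation from clustering, assuming equal-sized clusters."""
    return 1.0 + (cluster_size - 1) * icc


# Hypothetical example: 10 schools with 25 students each.
schools = [f"school_{k}" for k in range(10)]
students = [(f"student_{k}_{i}", f"school_{k}")
            for k in range(10) for i in range(25)]

status_by_school = assign_clusters(schools, seed=42)
status_by_student = assign_members(students, status_by_school)

if __name__ == "__main__":
    for school in schools:
        print(school, status_by_school[school])
    print("design effect:", design_effect(cluster_size=25, icc=0.1))
```

Because the design effect grows with cluster size and with the similarity of members within a cluster, the number of clusters, rather than the number of individuals, tends to drive the precision of a group-randomized design.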
In the US the W.T. Grant Foundation has been supporting the building of capacity to conduct group-randomized evaluation designs. An outgrowth of that work is a consulting service created in collaboration with Stephen Raudenbush and colleagues at the University of Michigan.
Information on this service can be found at: www.wtgrantfoundation.org/.
Propensity score matching
Although a random assignment design is widely seen as the “gold standard” for evaluating program impacts, it is not always possible or acceptable to implement random assignment. A common alternative is to construct a comparison group whose outcomes will be compared with the outcomes of those who receive program services. One approach to identifying comparison group members that has shown some promise and has recently been attracting attention is propensity score matching. This technique was first developed for evaluating non-experimentally the effects of differing forms of medical treatment.
Ideally, a matched comparison group is constructed by finding, for each member of the program group, a comparison group member who is identical in terms of all characteristics that affect outcomes. In practical terms, this is not possible. In propensity score matching, a “propensity score” is calculated for each program group member and each potential comparison group member. The propensity score is a measure of the estimated likelihood (or “propensity”) of an individual being observed in the program group rather than in the comparison group, given their observed characteristics. The propensity score therefore serves as a composite indicator of multiple individual-specific characteristics (or covariates). For each program group member, the potential comparison group member with the closest propensity score is selected for the comparison group.
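The two steps described above, estimating a propensity score and then selecting the closest comparison group member, can be sketched as follows. This is a minimal illustration rather than a recommended implementation. The data are simulated, the two covariates and the true impact of 1.0 are hypothetical, the score is estimated with a small hand-written logistic regression, and matching is one-to-one nearest neighbour with replacement. The simulation also assumes that every characteristic that influences participation is observed, which is exactly the assumption that cannot be verified in a real non-experimental study.

```python
import math
import random
import statistics


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(X, d, steps=200, lr=1.0):
    """Fit P(d = 1 | x) by gradient ascent; returns weights, intercept first."""
    n, k = len(X), len(X[0])
    w = [0.0] * (k + 1)
    for _ in range(steps):
        grad = [0.0] * (k + 1)
        for xi, di in zip(X, d):
            err = di - sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi)))
            grad[0] += err
            for j, xj in enumerate(xi):
                grad[j + 1] += err * xj
        w = [wj + lr * gj / n for wj, gj in zip(w, grad)]
    return w


def nearest_matches(program_scores, comparison_scores):
    """For each program score, the index of the closest comparison score."""
    return [
        min(range(len(comparison_scores)),
            key=lambda j: abs(comparison_scores[j] - s))
        for s in program_scores
    ]


def simulate(n=2000, true_impact=1.0, seed=7):
    """Hypothetical data in which participation depends on observed covariates."""
    rng = random.Random(seed)
    X, d, y = [], [], []
    for _ in range(n):
        x1, x2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        di = 1 if rng.random() < sigmoid(x1 + 0.5 * x2) else 0
        yi = 5.0 + 1.5 * x1 + 2.0 * x2 + true_impact * di + rng.gauss(0.0, 1.0)
        X.append((x1, x2))
        d.append(di)
        y.append(yi)
    return X, d, y


def naive_impact(d, y):
    """Unmatched difference in mean outcomes."""
    return (statistics.mean(yi for yi, di in zip(y, d) if di == 1)
            - statistics.mean(yi for yi, di in zip(y, d) if di == 0))


def matched_impact(X, d, y):
    """Mean outcome difference between program members and their matches."""
    w = fit_logistic(X, d)
    scores = [sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi)))
              for xi in X]
    prog = [i for i, di in enumerate(d) if di == 1]
    comp = [i for i, di in enumerate(d) if di == 0]
    matches = nearest_matches([scores[i] for i in prog],
                              [scores[j] for j in comp])
    return statistics.mean(y[i] - y[comp[m]] for i, m in zip(prog, matches))


if __name__ == "__main__":
    X, d, y = simulate()
    print(f"naive estimate:   {naive_impact(d, y):.2f}")
    print(f"matched estimate: {matched_impact(X, d, y):.2f}")
```

In this simulated setting the matched estimate should fall much closer to the true impact than the unmatched difference. That result depends on the simulation's assumption that no unobserved characteristic affects both participation and outcomes.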
Propensity score matching has been applied to the evaluation of social programs, and, in some cases at least, the use of this technique has permitted researchers to reproduce the findings from experimental studies.
For a promising result, see:
Dehejia, R. H., & Wahba, S. (1999). Causal effects in nonexperimental studies: Reevaluating the evaluation of training programs. Journal of the American Statistical Association, 94: 1053–1062.
However, researchers in two more recent studies were unable, using propensity score matching, to replicate experimental results, raising questions about how effective a tool propensity score matching can be.
To read about an attempt to replicate findings from an experimental evaluation of youth drop-out prevention programs, see:
Agodini, R., & Dynarski, M. (2001). Are experiments the only option? A look at dropout prevention programs. Document No. PR01-71. Princeton, NJ: Mathematica Policy Research, Inc., accessible at www.mathematica-mpr.com/publications.
To read about an attempt to replicate findings from an experimental evaluation of welfare-to-work programs, see:
Bloom, H. S., Michalopoulos, C., Hill, C. J., & Lei, Y. (2002). Can nonexperimental comparison group methods match the findings from a random assignment evaluation of mandatory welfare-to-work programs? New York, NY: Manpower Demonstration Research Corporation, accessible at www.mdrc.org/publications/66/full.pdf.
