The value of variation: Why we need to attend to heterogeneity in intervention research

The fact that intervention effects vary is incontestable–and yet that variation is often treated as a profound problem by the research community. Certainly, an intervention that works the same everywhere, under every context, is easier to understand and explain (Bryan, Tipton, & Yeager, 2021). But solving real problems and addressing real inequities in schools requires us to complexify our thinking beyond the sample average treatment effect. For teachers and school leaders, the idea that an intervention may not work in every context isn’t a problem to explain away; it is one that is imperative to address. To deny this is to deny the humanity, culture, and assets of their communities and to render research irrelevant for the decisions they face. 

If the research community wants our research to be useful and used, then it is time that we rethink our approach to variation. Rather than viewing variation in intervention effects as problematic, what if we took understanding this variation as an essential question–maybe even THE essential question–for applied education research? That idea is a central theme in the recent National Academies of Sciences, Engineering, and Medicine report, The Future of Education Research at IES: Advancing an Equity-Oriented Science (2022). The report challenges researchers evaluating interventions to see variation in outcomes as an important signal, not as noise to be minimized, controlled, or removed.

A typical quantitative impact study is designed to estimate a single average treatment effect as a means of summarizing the effect of an intervention. Researchers expend great effort on expunging sources of variation other than the intervention itself from the study context to ensure that this average treatment effect is clearly identified and well estimated, in the belief that studies with the strongest internal validity provide the best, most useful evidence for decision makers. But in reality, this approach results in findings so specific to the context in which the intervention was studied that it’s hard for practitioners to know how the results might translate elsewhere. Worse, the specific conditions that enabled success (or failure) aren’t typically well documented, further limiting usability. As Sameroff (2010) noted, “the unexplained variance, the noise, might contain the signals of many other dimensions of the individual or context that are necessary for meaningful long-term predictive models” (p. 7). Yet none of these dimensions are well captured in traditional quantitative impact studies.

One might think that the solution is to focus not on a single study but on a collection of studies. But the same issue arises in meta-analysis, which aims to aggregate the findings of multiple studies to tell us something about the typical impact of an intervention across settings. Meta-analyses often find that the range of treatment effects across studies is quite wide, sometimes even including effects that point in opposite directions. At times meta-analyses can tell us something about specific features that are correlated with treatment effects—for example, the size or scale of a program or the nature of the focal population studied. But because meta-analyses are quantitative, these features must be quantifiable to be analyzed in this way, limiting the ways in which contextual factors can be operationalized and explored. Often, most of the variation in outcomes cannot be explained by the factors coded in the studies.

If our goal in education research is to produce knowledge that is useful and used, then seeking generalizability by removing variation is doomed to fail. Education settings are not laboratories. The contexts and conditions under which an intervention is implemented are part of the intervention, particularly because “interventions themselves are contested spaces, filled with tensions and resistance from a range of stakeholders” (Gutiérrez & Penuel, 2014, p. 20). We need to study the implementation of interventions just as seriously as we study the interventions themselves–especially because this is exactly the type of information practitioners seek when trying to apply findings to their own contexts.

As an example, consider interventions on student learning. Researchers in the learning sciences, cultural psychology, and educational anthropology have advanced situated models that regard learning as located in institutional and historical contexts, in which local socio-cultural practices and perspectives mediate how people learn (Greeno, 2005; Nasir et al., 2022). In contrast, traditional intervention research typically uses tasks or procedures to assess students’ use of skills outside of the routine circumstances in which they learned them. This critique has been leveled for decades by cultural psychologists and anthropologists. They call attention, for instance, to the risks associated with crafting theoretical principles of human cognition inferred from test and experimental tasks or observations to explain “the wide variety of intellectual behavior observed in non-laboratory settings (everyday life)” (Cole et al., 1978, p. 2; see also Lave, 1997). Cole and colleagues argued that this practice is an axiom of experimental cognitive psychology and that it produces ecologically invalid research. Intervention research therefore offers only a partial understanding of how study participants use or learn the target skills or tools in naturalistic settings (Artiles & Kozleski, 2010). Given the lack of relevant contextual information in most published intervention research (Joyce & Cartwright, 2019), it is not surprising that many school professionals turn to nearby peer schools, not the research literature, when making decisions.

What would it take for use and usefulness to be valued in research as highly as understanding causal impacts? First, we would need a commonly agreed-upon framework for what information is likely to predict the effectiveness of an intervention, so that researchers would know what contextual data to collect and report. This might build from work such as Munro et al. (2016), who proposed that if a tested intervention is to be effective in a new setting, then “the intervention must be capable of helping to produce the targeted result in your setting; the support factors necessary for it to do so are there, or you can arrange to get them there; and nothing will happen in the setting to derail the intervention” (p. 30, emphasis in original). Similarly, learning scientists have recently called upon researchers to “more clearly attend to the ways in which the for what, for whom, and with whom of teaching and learning are necessarily intertwined with the how of learning” (Philip et al., 2018, emphasis in original). The work of building agreement on the essential elements of this framework mostly lies ahead of us. Our goal should be to build something akin to the standards researchers have developed for reporting the direction, magnitude, and statistical significance of impact estimates—a standard that is common across all publication outlets and seen as a required element of strong research practice.

Second, we need better data on context and how it varies across educational settings. The United States has long invested heavily in monitoring the outputs of our education system (e.g., the National Assessment of Educational Progress) yet has done little to monitor the inputs and processes that mediate outcomes. We know little about current practices in schools: What curricula are commonly used? How many minutes of instruction do students receive each day in each subject? Which programs are purchased versus developed in-house? Which teachers are working with instructional coaches? Which educational opportunities are made available to students, and which students are able to take advantage of them? What is the learning culture like? Systematically collecting this type of information would help guide researchers in deciding where and with whom to evaluate interventions.

But more data will not be enough. We also need to collaborate more across disciplinary and professional boundaries, because no single disciplinary approach or research method will illuminate all the questions we should ask or all the forms of data we will need. The National Study of Learning Mindsets, an evaluation of a growth mindset intervention, provides an example of the benefits of such interdisciplinary work (Yeager et al., 2019). The study was designed by a team of experts in psychology, sociology, economics, education, and statistics, and the benefits of this collaboration can be seen in its unique study design and findings. For example, to address concerns with generalizability, the study drew a random sample of U.S. high schools, while also randomly assigning students to the intervention to isolate causality. To address concerns with heterogeneity, the study pre-registered hypotheses about subgroups and contextual factors, over-sampled important subgroups, developed new measures of school context and teacher behaviors, and employed new statistical methods for detecting, explaining, and reporting this variation. The investigators found that the growth mindset intervention was most effective with lower-achieving students and in classrooms in which peer norms were aligned with the intervention.

These interdisciplinary collaborations will also need to push beyond the simplistic quantitative/qualitative divide that has plagued the history of education research. Quantitative methods are best at helping us know whether an intervention worked, but they tell us little about how or why. Far too commonly, we find that interventions are not implemented well, even under the optimal conditions of an efficacy study–and we stop there. We need to know much more about why this is the case. Some of these concerns with implementation have to do with how well an intervention can be adjusted to fit a local context. Did certain contexts or conditions result in stronger implementation? How do teachers and students make subtle adjustments to intervention procedures in the course of daily interactions in the dynamic and complex contexts of schools (Gutiérrez & Penuel, 2014)? What are the consequences of making these adjustments for the intervention’s efficacy (Arzubiaga et al., 2008; Gueron, 2001)? Other questions have to do with how the very act of studying an intervention might affect its implementation, and what this means outside of the study. For example, are those who brokered involvement in the study considered “insiders” in these communities, and how do we know? What do school faculty think about randomly assigning students to an intervention that perhaps most or all learners need, and how does this perception mediate their implementation of the intervention? These questions and others require observing and talking with educators and students, and these skills are exactly those mastered by qualitative and mixed-methods scholars. If we are taking variation seriously as a subject of inquiry, we need a model of educational research that combines nomothetic and idiographic approaches to produce knowledge about the nature of phenomena in their cultural and historical contexts (Artiles, 2019).

All the better if collaborative research teams also include practitioners, who have unique understandings of their context. Producing research through research-practice partnerships (RPPs) is one pathway to foregrounding context, because the long-term nature of RPPs affords researchers the opportunity to better understand and describe the context in which findings are produced. But right now, few RPPs operate outside the very largest school districts. We will need to find a way to make participation in research feasible for a wider range of districts without overburdening stretched practitioners. Another pathway might be designing studies that allow practitioners to make structured adaptations to an intervention to suit their context. A recent study compared a standardized summer reading intervention to one in which teachers could adapt certain parts of the program: for example, adjusting the timing of lessons or selecting different books than those recommended by the program. A randomized controlled trial found that students in the schools that adapted the program outperformed those in the traditional program (Kim et al., 2017). The scholarly community still needs to grapple with key conceptual issues around research done in partnership: for example, how teams conceptualize and award credit for members’ roles and contributions, and variation in how partnerships identify problems of practical value and account for contextual influences (Penuel et al., 2020). And scholars have raised questions about equity and power, particularly racial equity, in partnership-based research (Diamond, 2021; Vetter et al., 2022). Nevertheless, we see promise in this and related collaborative research approaches to address the challenges we have identified.

Finally, and perhaps most challenging, we will need to let go of some old methods and frameworks. Rather than designing our enterprise around what can be studied well using our existing methods, let’s center use and usefulness as our goal and agree to use “relevance to practice” as a core index of rigor in educational research (Gutiérrez & Penuel, 2014). Rather than putting a laser focus on internal validity and hand-waving at external validity as an afterthought, let’s put these concerns on equal footing. Rather than assuming that if an intervention is effective somewhere, it is effective everywhere until proven otherwise, let’s begin with the assumption that context matters and design our studies to answer how and why this is so. If we do this hard work as a research community, we will be better positioned to help those doing the hard work of educating our children and young adults.

About the authors:

Carrie Conaway (carrie_conaway@gse.harvard.edu) is a Senior Lecturer on Education at Harvard Graduate School of Education.

Elizabeth Tipton (tipton@northwestern.edu) is Associate Professor of Statistics at Northwestern University.

Alfredo J. Artiles (aartiles@stanford.edu) is Lee L. Jacks Professor of Education at Stanford University.

References

Artiles, A. J. (2019). 14th annual Brown Lecture in Education Research - Re-envisioning equity research: Disability identification disparities as a case in point. Educational Researcher, 48, 325-335. https://doi.org/10.3102/0013189X19871949 

Artiles, A. J., & Kozleski, E. B. (2010). What counts as Response and Intervention in RTI? A sociocultural analysis. Psicothema, 22, 949-954.

Arzubiaga, A., Artiles, A.J., King, K., & Harris-Murri, N. (2008). Beyond research on cultural minorities: Challenges and implications of research as situated cultural practice. Exceptional Children, 74, 309-327. https://doi.org/10.1177/001440290807400303 

Bryan, C. J., Tipton, E., & Yeager, D. S. (2021). Behavioural science is unlikely to change the world without a heterogeneity revolution. Nature Human Behaviour, 5(8), 980–989. https://doi.org/10.1038/s41562-021-01143-3

Cole, M., Hood, L., & McDermott, R. (1978). Ecological niche picking: Ecological invalidity as an axiom of experimental cognitive psychology. Laboratory of Comparative Human Cognition and Institute for Comparative Human Development. (Technical report). https://doi.org/10.13140/2.1.4727.1204    

Diamond, J. B. (2021, July 20). Racial equity and research practice partnerships 2.0: A critical reflection. William T. Grant Foundation. Retrieved October 31, 2022, from https://wtgrantfoundation.org/racial-equity-and-research-practice-partnerships-2-0-a-critical-reflection   

Greeno, J. (2005). Learning in activity. In R. K. Sawyer (Ed.), The Cambridge handbook of the learning sciences. Cambridge University Press.

Gueron, J. M. (2001). The politics of random assignment: Implementing studies and affecting policy. In F. Mosteller & R. Boruch (Eds.), Evidence matters: Randomized trials in education research (pp. 15-49). Brookings Institution Press.

Gutiérrez, K.D., & Penuel, W.R. (2014). Relevance to practice as a criterion for rigor. Educational Researcher, 43, 19-23. https://doi.org/10.3102/0013189X13520289 

Joyce, K. E., & Cartwright, N. (2019). Bridging the gap between research and practice: Predicting what will work locally. American Educational Research Journal, 57(3), 1045–1082. https://doi.org/10.3102/0002831219866687

Kim, J. S., Burkhauser, M. A., Quinn, D. M., Guryan, J., Kingston, H. C., & Aleman, K. (2017). Effectiveness of structured teacher adaptations to an evidence-based summer literacy program. Reading Research Quarterly, 52, 443–467. https://doi.org/10.1002/rrq.178

Lave, J. (1997). What’s special about experiments as contexts for thinking. In M. Cole, Y. Engestrom, & O. Vasquez (Eds.), Mind, culture and activity: Seminal papers from the Laboratory of Comparative Human Cognition (pp. 57-69). Cambridge University Press.

Munro, E., Cartwright, N., Hardie, J., & Montuschi, E. (2016). Improving child safety: Deliberation, judgment and empirical research (ISSN 2053-2660). Retrieved October 31, 2022 from https://www.dur.ac.uk/resources/chess/ONLINE_Improvingchildsafety-15_2_17-FINAL.pdf 

Nasir, N. S., Lee, C., Pea, R., & McKinney de Royston, M. (Eds.). (2022). Handbook of the cultural foundations of learning. Routledge. https://doi.org/10.4324/9780203774977

National Academies of Sciences, Engineering, and Medicine (2022). The Future of Education Research at IES: Advancing an Equity-Oriented Science. The National Academies Press. https://doi.org/10.17226/26428 

Penuel, W. R., Riedy, R., Barber, M. S., Peurach, D. J., LeBoeuf, W. A., & Clark, T. (2020). Principles of collaborative education research with stakeholders: Toward requirements for a new research and development infrastructure. Review of Educational Research, 90, 627-674. https://doi.org/10.3102/0034654320938126

Philip, T.M., Bang, M., & Jackson, K. (2018). Articulating the “how,” the “for what,” the “for whom,” and the “with whom” in concert: A call to broaden the benchmarks of our scholarship. Cognition and Instruction, 36, 83-88. https://doi.org/10.1080/07370008.2018.1413530 

Sameroff, A. (2010). A unified theory of development: A dialectic integration of nature and nurture. Child Development, 81, 6-22. https://doi.org/10.1111/j.1467-8624.2009.01378.x 

Vetter, A., Faircloth, B. S., Hewitt, K. K., Gonzalez, L. M., He, Y., & Rock, M. L. (2022). Equity and social justice in research practice partnerships in the United States. Review of Educational Research, 92(5), 829–866. https://doi.org/10.3102/00346543211070048

Yeager, D. S., Hanselman, P., Walton, G. M., et al. (2019). A national experiment reveals where a growth mindset improves achievement. Nature, 573, 364–369. https://doi.org/10.1038/s41586-019-1466-y
