5.2 Experimental Design

Learning objectives.

  • Explain the difference between between-subjects and within-subjects experiments, list some of the pros and cons of each approach, and decide which approach to use to answer a particular research question.
  • Define random assignment, distinguish it from random sampling, explain its purpose in experimental research, and use some simple strategies to implement it.
  • Define several types of carryover effect, give examples of each, and explain how counterbalancing helps to deal with them.

In this section, we look at some different ways to design an experiment. The primary distinction we will make is between approaches in which each participant experiences one level of the independent variable and approaches in which each participant experiences all levels of the independent variable. The former are called between-subjects experiments and the latter are called within-subjects experiments.

Between-Subjects Experiments

In a between-subjects experiment, each participant is tested in only one condition. For example, a researcher with a sample of 100 university students might assign half of them to write about a traumatic event and the other half to write about a neutral event. Or a researcher with a sample of 60 people with severe agoraphobia (fear of open spaces) might assign 20 of them to receive each of three different treatments for that disorder. It is essential in a between-subjects experiment that the researcher assign participants to conditions so that the different groups are, on average, highly similar to each other. Those in a trauma condition and a neutral condition, for example, should include a similar proportion of men and women, and they should have similar average intelligence quotients (IQs), similar average levels of motivation, similar average numbers of health problems, and so on. This is a matter of controlling these extraneous participant variables across conditions so that they do not become confounding variables.

Random Assignment

The primary way that researchers accomplish this kind of control of extraneous variables across conditions is called random assignment, which means using a random process to decide which participants are tested in which conditions. Do not confuse random assignment with random sampling. Random sampling is a method for selecting a sample from a population, and it is rarely used in psychological research. Random assignment is a method for assigning participants in a sample to the different conditions, and it is an important element of all experimental research in psychology and other fields too.

In its strictest sense, random assignment should meet two criteria. One is that each participant has an equal chance of being assigned to each condition (e.g., a 50% chance of being assigned to each of two conditions). The second is that each participant is assigned to a condition independently of other participants. Thus one way to assign participants to two conditions would be to flip a coin for each one. If the coin lands heads, the participant is assigned to Condition A, and if it lands tails, the participant is assigned to Condition B. For three conditions, one could use a computer to generate a random integer from 1 to 3 for each participant. If the integer is 1, the participant is assigned to Condition A; if it is 2, the participant is assigned to Condition B; and if it is 3, the participant is assigned to Condition C. In practice, a full sequence of conditions—one for each participant expected to be in the experiment—is usually created ahead of time, and each new participant is assigned to the next condition in the sequence as he or she is tested. When the procedure is computerized, the computer program often handles the random assignment.
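As a rough illustration of this procedure, here is a minimal Python sketch (function and variable names are just for this example) that pre-generates a full sequence of conditions using independent, equal-probability draws:

```python
import random

def random_assignment_sequence(n_participants, conditions=("A", "B", "C"), seed=None):
    """Assign each expected participant a condition independently and with
    equal probability, producing the pre-generated sequence described above."""
    rng = random.Random(seed)
    return [rng.choice(conditions) for _ in range(n_participants)]

# Each new participant receives the next condition in the sequence as they arrive.
sequence = random_assignment_sequence(10, seed=123)
print(sequence)  # note that the group sizes may well come out unequal
```

Because each draw is independent, the groups will usually not come out exactly equal in size, which is the problem block randomization (below) addresses.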

One problem with coin flipping and other strict procedures for random assignment is that they are likely to result in unequal sample sizes in the different conditions. Unequal sample sizes are generally not a serious problem, and you should never throw away data you have already collected to achieve equal sample sizes. However, for a fixed number of participants, it is statistically most efficient to divide them into equal-sized groups. It is standard practice, therefore, to use a kind of modified random assignment that keeps the number of participants in each group as similar as possible. One approach is block randomization. In block randomization, all the conditions occur once in the sequence before any of them is repeated. Then they all occur again before any of them is repeated again. Within each of these “blocks,” the conditions occur in a random order. Again, the sequence of conditions is usually generated before any participants are tested, and each new participant is assigned to the next condition in the sequence. An example of such a sequence for assigning nine participants to three conditions is shown below. The Research Randomizer website ( http://www.randomizer.org ) will generate block randomization sequences for any number of participants and conditions. Again, when the procedure is computerized, the computer program often handles the block randomization.
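For illustration, one such block-randomized sequence for nine participants and three conditions (A, B, C) might be (the order within each block of three is random, so any particular run will differ):

Participant:  1  2  3  4  5  6  7  8  9
Condition:    C  A  B  B  C  A  A  B  C

A minimal Python sketch of the procedure (function and variable names are illustrative):

```python
import random

def block_randomization(n_participants, conditions=("A", "B", "C"), seed=None):
    """Build a sequence in which every condition occurs exactly once, in a
    fresh random order, within each successive block of len(conditions) slots."""
    rng = random.Random(seed)
    sequence = []
    while len(sequence) < n_participants:
        block = list(conditions)
        rng.shuffle(block)           # random order within this block
        sequence.extend(block)
    return sequence[:n_participants]

print(block_randomization(9, seed=7))
```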

Random assignment is not guaranteed to control all extraneous variables across conditions. The process is random, so it is always possible that, just by chance, the participants in one condition might turn out to be substantially older, less tired, more motivated, or less depressed on average than the participants in another condition. However, there are some reasons that this possibility is not a major concern. One is that random assignment works better than one might expect, especially for large samples. Another is that the inferential statistics that researchers use to decide whether a difference between groups reflects a difference in the population take the “fallibility” of random assignment into account. Yet another reason is that even if random assignment does result in a confounding variable and therefore produces misleading results, this confound is likely to be detected when the experiment is replicated. The upshot is that random assignment to conditions, although not infallible in terms of controlling extraneous variables, is always considered a strength of a research design.

Matched Groups

An alternative to simple random assignment of participants to conditions is the use of a matched-groups design. Using this design, participants in the various conditions are matched on the dependent variable or on some extraneous variable(s) prior to the manipulation of the independent variable. This guarantees that these variables will not be confounded across the experimental conditions. For instance, if we want to determine whether expressive writing affects people’s health, then we could start by measuring various health-related variables in our prospective research participants. We could then use that information to rank-order participants according to how healthy or unhealthy they are. Next, the two healthiest participants would be randomly assigned to complete different conditions (one would be randomly assigned to the traumatic experiences writing condition and the other to the neutral writing condition). The next two healthiest participants would then be randomly assigned to complete different conditions, and so on, down to the two least healthy participants. This method ensures that participants in the traumatic experiences writing condition are matched to participants in the neutral writing condition with respect to health at the beginning of the study. If a difference in health were detected across the two conditions at the end of the experiment, we could then be more confident that it was due to the writing manipulation and not to pre-existing differences in health.
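A minimal Python sketch of this matching procedure (the baseline health scores and all names here are hypothetical): participants are rank-ordered on the baseline measure, taken in adjacent pairs, and one member of each pair is randomly assigned to each writing condition.

```python
import random

def matched_groups(participants, score_key, seed=None):
    """Rank-order participants on a baseline measure, then randomly split
    each adjacent pair between the two writing conditions."""
    rng = random.Random(seed)
    ranked = sorted(participants, key=lambda p: p[score_key], reverse=True)
    trauma_group, neutral_group = [], []
    for i in range(0, len(ranked) - 1, 2):
        pair = [ranked[i], ranked[i + 1]]
        rng.shuffle(pair)                 # random assignment within the matched pair
        trauma_group.append(pair[0])
        neutral_group.append(pair[1])
    return trauma_group, neutral_group

# Hypothetical baseline health scores (higher = healthier).
sample = [{"id": f"P{i:02d}", "health": score}
          for i, score in enumerate([88, 65, 93, 71, 54, 80, 77, 62, 90, 69])]
trauma, neutral = matched_groups(sample, "health", seed=3)
print([p["id"] for p in trauma])
print([p["id"] for p in neutral])
```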

Within-Subjects Experiments

In a within-subjects experiment, each participant is tested under all conditions. Consider an experiment on the effect of a defendant’s physical attractiveness on judgments of his guilt. Again, in a between-subjects experiment, one group of participants would be shown an attractive defendant and asked to judge his guilt, and another group of participants would be shown an unattractive defendant and asked to judge his guilt. In a within-subjects experiment, however, the same group of participants would judge the guilt of both an attractive and an unattractive defendant.

The primary advantage of this approach is that it provides maximum control of extraneous participant variables. Participants in all conditions have the same mean IQ, same socioeconomic status, same number of siblings, and so on, because they are the very same people. Within-subjects experiments also make it possible to use statistical procedures that remove the effect of these extraneous participant variables on the dependent variable and therefore make the data less “noisy” and the effect of the independent variable easier to detect. We will look more closely at this idea later in the book. However, not all experiments can use a within-subjects design, nor would it always be desirable to do so.

One disadvantage of within-subjects experiments is that they make it easier for participants to guess the hypothesis. For example, a participant who is asked to judge the guilt of an attractive defendant and then is asked to judge the guilt of an unattractive defendant is likely to guess that the hypothesis is that defendant attractiveness affects judgments of guilt. This knowledge could lead the participant to judge the unattractive defendant more harshly because he thinks this is what he is expected to do. Or it could make participants judge the two defendants similarly in an effort to be “fair.”

Carryover Effects and Counterbalancing

The primary disadvantage of within-subjects designs is that they can result in order effects. An order effect occurs when participants’ responses in the various conditions are affected by the order of the conditions to which they were exposed. One type of order effect is a carryover effect. A carryover effect is an effect of being tested in one condition on participants’ behavior in later conditions. One type of carryover effect is a practice effect, where participants perform a task better in later conditions because they have had a chance to practice it. Another type is a fatigue effect, where participants perform a task worse in later conditions because they become tired or bored. Being tested in one condition can also change how participants perceive stimuli or interpret their task in later conditions. This type of effect is called a context effect (or contrast effect). For example, an average-looking defendant might be judged more harshly when participants have just judged an attractive defendant than when they have just judged an unattractive defendant.

Carryover effects can be interesting in their own right. (Does the attractiveness of one person depend on the attractiveness of other people that we have seen recently?) But when they are not the focus of the research, carryover effects can be problematic. Imagine, for example, that participants judge the guilt of an attractive defendant and then judge the guilt of an unattractive defendant. If they judge the unattractive defendant more harshly, this might be because of his unattractiveness. But it could be instead that they judge him more harshly because they are becoming bored or tired. In other words, the order of the conditions is a confounding variable. The attractive condition is always the first condition and the unattractive condition the second. Thus any difference between the conditions in terms of the dependent variable could be caused by the order of the conditions and not the independent variable itself.

There is a solution to the problem of order effects, however, that can be used in many situations. It is counterbalancing, which means testing different participants in different orders. The best method of counterbalancing is complete counterbalancing, in which an equal number of participants complete each possible order of conditions. For example, half of the participants would be tested in the attractive defendant condition followed by the unattractive defendant condition, and the other half would be tested in the unattractive condition followed by the attractive condition. With three conditions, there would be six different orders (ABC, ACB, BAC, BCA, CAB, and CBA), so some participants would be tested in each of the six orders. With four conditions, there would be 24 different orders; with five conditions, there would be 120 possible orders. With counterbalancing, participants are assigned to orders randomly, using the techniques we have already discussed. Thus, random assignment plays an important role in within-subjects designs just as in between-subjects designs. Here, instead of participants being randomly assigned to conditions, they are randomly assigned to different orders of conditions. In fact, it can safely be said that if a study does not involve random assignment in one form or another, it is not an experiment.
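A short Python sketch of complete counterbalancing (names are illustrative): all possible orders are enumerated with itertools.permutations, and participants are randomly assigned to orders, cycling through the orders so each is used equally often.

```python
from itertools import permutations
import random

def complete_counterbalance(participant_ids, conditions=("A", "B", "C"), seed=None):
    """Assign participants to condition orders so that every possible order
    is used (roughly) equally often, with people assigned to orders at random."""
    rng = random.Random(seed)
    orders = list(permutations(conditions))   # all 6 orders for 3 conditions
    ids = list(participant_ids)
    rng.shuffle(ids)                          # random assignment of people to orders
    return {pid: orders[i % len(orders)] for i, pid in enumerate(ids)}

assignment = complete_counterbalance([f"P{i:02d}" for i in range(12)], seed=5)
for pid, order in sorted(assignment.items()):
    print(pid, "->", " ".join(order))
```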

A more efficient way of counterbalancing is through a Latin square design, in which the number of orders used equals the number of conditions and each condition appears exactly once in each row and each column. For example, if you have four treatments, you need only four orders (versions). Like a Sudoku puzzle, no treatment can repeat in a row or column. For four versions of four treatments, the Latin square design would look like the example below.
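For illustration, one balanced 4 × 4 Latin square for conditions A, B, C, and D is shown here (this particular arrangement is one of several that satisfy the constraints; each row is the order of conditions given to one quarter of the participants):

Order 1:  A  B  D  C
Order 2:  B  C  A  D
Order 3:  C  D  B  A
Order 4:  D  A  C  B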

You can see in the diagram above that the square has been constructed to ensure that each condition appears at each ordinal position (A appears first once, second once, third once, and fourth once) and that each condition precedes and follows each other condition exactly once. A Latin square for an experiment with 6 conditions would be 6 × 6 in dimension, one for an experiment with 8 conditions would be 8 × 8 in dimension, and so on. So while complete counterbalancing of 6 conditions would require 720 orders, a Latin square requires only 6 orders.
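As a rough sketch of how such a square can be built programmatically (the function name is illustrative and not from the text; this standard construction balances immediate precedence only when the number of conditions is even, which is why odd-sized squares are often used together with their mirror-reversed rows):

```python
def balanced_latin_square(n):
    """Return n orders of conditions 0..n-1. Each condition appears once at
    each ordinal position; for even n, each condition also immediately
    precedes every other condition exactly once."""
    first_row, low, high = [0], 1, n - 1
    take_low = True
    while len(first_row) < n:                 # pattern: 0, 1, n-1, 2, n-2, ...
        first_row.append(low if take_low else high)
        low, high = (low + 1, high) if take_low else (low, high - 1)
        take_low = not take_low
    # Each later row shifts every condition up by one (mod n).
    return [[(c + shift) % n for c in first_row] for shift in range(n)]

labels = "ABCD"
for row in balanced_latin_square(4):
    print(" ".join(labels[c] for c in row))   # reproduces the 4 x 4 square shown earlier
```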

Finally, when the number of conditions is large, experiments can use random counterbalancing, in which the order of the conditions is randomly determined for each participant. Using this technique, one of the possible orders of conditions is randomly selected for each participant. This is not as powerful a technique as complete counterbalancing or partial counterbalancing using a Latin square design. Use of random counterbalancing will result in more random error, but if order effects are likely to be small and the number of conditions is large, this is an option available to researchers.
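A minimal sketch of random counterbalancing (condition labels and names are hypothetical): each participant simply receives an independently shuffled order of the conditions.

```python
import random

def random_counterbalance(n_participants, conditions, seed=None):
    """Give each participant an independently shuffled order of the conditions."""
    rng = random.Random(seed)
    orders = []
    for _ in range(n_participants):
        order = list(conditions)     # copy so the original list is untouched
        rng.shuffle(order)           # a fresh random order for this participant
        orders.append(order)
    return orders

for order in random_counterbalance(3, ["A", "B", "C", "D", "E", "F"], seed=1):
    print(" ".join(order))
```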

There are two ways to think about what counterbalancing accomplishes. One is that it controls the order of conditions so that it is no longer a confounding variable. Instead of the attractive condition always being first and the unattractive condition always being second, the attractive condition comes first for some participants and second for others. Likewise, the unattractive condition comes first for some participants and second for others. Thus any overall difference in the dependent variable between the two conditions cannot have been caused by the order of conditions. A second way to think about what counterbalancing accomplishes is that if there are carryover effects, it makes it possible to detect them. One can analyze the data separately for each order to see whether it had an effect.

When 9 Is “Larger” Than 221

Researcher Michael Birnbaum has argued that the lack of context provided by between-subjects designs is often a bigger problem than the context effects created by within-subjects designs. To demonstrate this problem, he asked participants to rate two numbers on how large they were on a scale of 1 to 10, where 1 was “very very small” and 10 was “very very large”. One group of participants was asked to rate the number 9 and another group was asked to rate the number 221 (Birnbaum, 1999)[1]. Participants in this between-subjects design gave the number 9 a mean rating of 5.13 and the number 221 a mean rating of 3.10. In other words, they rated 9 as larger than 221! According to Birnbaum, this difference occurs because participants spontaneously compared 9 with other one-digit numbers (in which case it is relatively large) and compared 221 with other three-digit numbers (in which case it is relatively small).

Simultaneous Within-Subjects Designs

So far, we have discussed an approach to within-subjects designs in which participants are tested in one condition at a time. There is another approach, however, that is often used when participants make multiple responses in each condition. Imagine, for example, that participants judge the guilt of 10 attractive defendants and 10 unattractive defendants. Instead of having people make judgments about all 10 defendants of one type followed by all 10 defendants of the other type, the researcher could present all 20 defendants in a sequence that mixed the two types. The researcher could then compute each participant’s mean rating for each type of defendant. Or imagine an experiment designed to see whether people with social anxiety disorder remember negative adjectives (e.g., “stupid,” “incompetent”) better than positive ones (e.g., “happy,” “productive”). The researcher could have participants study a single list that includes both kinds of words and then have them try to recall as many words as possible. The researcher could then count the number of each type of word that was recalled. 
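Here is a rough Python sketch of this mixed-trials approach for the defendant example (all stimulus labels and the ratings themselves are simulated placeholders, just so the sketch runs end to end): the two trial types are shuffled together, and a separate mean is then computed for each type.

```python
import random
from statistics import mean

random.seed(42)

# Hypothetical trial list: 10 attractive and 10 unattractive defendants,
# presented intermixed rather than blocked by type.
trials = ([{"type": "attractive", "stimulus": f"att_{i}"} for i in range(10)]
          + [{"type": "unattractive", "stimulus": f"unatt_{i}"} for i in range(10)])
random.shuffle(trials)

for trial in trials:
    # In a real experiment this rating would come from the participant;
    # a fake 1-7 guilt rating stands in for it here.
    trial["guilt"] = random.randint(1, 7)

# One mean per condition for this (simulated) participant.
for defendant_type in ("attractive", "unattractive"):
    scores = [t["guilt"] for t in trials if t["type"] == defendant_type]
    print(defendant_type, round(mean(scores), 2))
```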

Between-Subjects or Within-Subjects?

Almost every experiment can be conducted using either a between-subjects design or a within-subjects design. This possibility means that researchers must choose between the two approaches based on their relative merits for the particular situation.

Between-subjects experiments have the advantage of being conceptually simpler and requiring less testing time per participant. They also avoid carryover effects without the need for counterbalancing. Within-subjects experiments have the advantage of controlling extraneous participant variables, which generally reduces noise in the data and makes it easier to detect a relationship between the independent and dependent variables.

A good rule of thumb, then, is that if it is possible to conduct a within-subjects experiment (with proper counterbalancing) in the time that is available per participant—and you have no serious concerns about carryover effects—this design is probably the best option. If a within-subjects design would be difficult or impossible to carry out, then you should consider a between-subjects design instead. For example, if you were testing participants in a doctor’s waiting room or shoppers in line at a grocery store, you might not have enough time to test each participant in all conditions and therefore would opt for a between-subjects design. Or imagine you were trying to reduce people’s level of prejudice by having them interact with someone of another race. A within-subjects design with counterbalancing would require testing some participants in the treatment condition first and then in a control condition. But if the treatment works and reduces people’s level of prejudice, then they would no longer be suitable for testing in the control condition. This difficulty is true for many designs that involve a treatment meant to produce long-term change in participants’ behavior (e.g., studies testing the effectiveness of psychotherapy). Clearly, a between-subjects design would be necessary here.

Remember also that using one type of design does not preclude using the other type in a different study. There is no reason that a researcher could not use both a between-subjects design and a within-subjects design to answer the same research question. In fact, professional researchers often do exactly this.

Key Takeaways

  • Experiments can be conducted using either between-subjects or within-subjects designs. Deciding which to use in a particular situation requires careful consideration of the pros and cons of each approach.
  • Random assignment to conditions in between-subjects experiments or counterbalancing of orders of conditions in within-subjects experiments is a fundamental element of experimental research. The purpose of these techniques is to control extraneous variables so that they do not become confounding variables.
Exercises

For each of the following topics, list the pros and cons of a between-subjects and a within-subjects design, and decide which would be better:

  • You want to test the relative effectiveness of two training programs for running a marathon.
  • Using photographs of people as stimuli, you want to see if smiling people are perceived as more intelligent than people who are not smiling.
  • In a field experiment, you want to see if the way a panhandler is dressed (neatly vs. sloppily) affects whether or not passersby give him any money.
  • You want to see if concrete nouns (e.g.,  dog ) are recalled better than abstract nouns (e.g.,  truth).
  • Birnbaum, M. H. (1999). How to show that 9 > 221: Collect judgments in a between-subjects design. Psychological Methods, 4(3), 243-249.


1.3: Threats to Internal Validity and Different Control Techniques



Internal validity is often the focus from a research design perspective. To understand the pros and cons of various designs, and to be able to better judge specific designs, we identify specific threats to internal validity. Before we do so, it is important to note that the primary challenge to establishing internal validity in the social sciences is the fact that most of the phenomena we care about have multiple causes and are often the result of some complex set of interactions. For example, X may be only a partial cause of Y, or X may cause Y but only when Z is present. Multiple causation and interactive effects make it very difficult to demonstrate causality. Turning now to more specific threats, common threats to internal validity include history, maturation, testing, instrumentation, regression to the mean, selection bias, and attrition.

Different Control Techniques

All of the common threats mentioned above can introduce extraneous variables into your research design, which can potentially confound your research findings. In other words, we won't be able to tell whether it is the independent variable (i.e., the treatment we give participants) or the extraneous variable that causes the changes in the dependent variable. Controlling for extraneous variables reduces their threat to the research design and gives us a better chance of claiming that the independent variable causes the changes in the dependent variable, i.e., internal validity. There are different techniques we can use to control for extraneous variables.

Random assignment

Random assignment is the single most powerful control technique we can use to minimize the potential threats of confounding variables in a research design. As we saw in Dunn and her colleagues' study earlier, participants are not allowed to self-select into either condition (spending $20 on themselves or spending it on others). Instead, they are randomly assigned to a group by the researcher(s). By doing so, the two groups are likely to be similar on all other factors except the independent variable itself. One confounding variable mentioned earlier is whether individuals had a happy childhood to begin with. With random assignment, those who had a happy childhood are likely to end up in each condition group, and those who did not have a happy childhood are likely to end up in each condition group too. As a consequence, we can expect the two condition groups to be very similar on this confounding variable. Applying the same logic, we can use random assignment to minimize all potential confounding variables (assuming the sample size is large enough!). If the only difference between the two groups is the condition participants are assigned to, which is the independent variable, then we can be confident in inferring that the independent variable actually causes the differences in the dependent variable.

It is critical to emphasize that random assignment is the only control technique that controls for both known and unknown confounding variables. With all of the other control techniques mentioned below, we must first know what the confounding variable is before we can control for it. Random assignment does not require this. With the simple act of randomly assigning participants to different conditions, we take care of both the confounding variables we know of and the ones we don't even know about that could threaten the internal validity of our study. As the saying goes, "what you don't know will hurt you." Random assignment takes care of it.

Matching

Matching is another technique we can use to control for extraneous variables. We must first identify the extraneous variable that can potentially confound the research design. Then we rank-order the participants on this extraneous variable, listing them in ascending or descending order. Participants who are similar on the extraneous variable are placed into different treatment groups; in other words, they are "matched" on the extraneous variable. Then we carry out the intervention/treatment as usual. If the treatment groups do show differences on the dependent variable, we know it is not because of the extraneous variable, since participants were matched (i.e., equivalent) on it. Rather, the differences are more likely due to the independent variable (i.e., the treatments).

Take the example above (self-spending vs. other-spending on happiness) with the same extraneous variable of whether individuals had a happy childhood to begin with. Once we identify this extraneous variable, we first need to collect some kind of data from the participants to measure how happy their childhood was. Sometimes data on the extraneous variable are already available (for example, if you want to examine the effect of different types of tutoring on students' performance in a Calculus I course and plan to match them on college entrance test scores, which are already collected by the Admissions Office). In either case, getting data on the identified extraneous variable is a necessary step before matching. Returning to childhood happiness: once we have the data, we sort it in a particular order, for example from the highest score (participants reporting the happiest childhood) to the lowest score (participants reporting the least happy childhood). We then match the participants with the highest levels of childhood happiness and place them into different treatment groups. We move down the scale, matching participants with relatively high levels of childhood happiness and placing them into different treatment groups, and we repeat in descending order until we match the participants with the lowest levels of childhood happiness and place them into different treatment groups. By now, each treatment group contains participants with a full range of levels of childhood happiness (which is a strength in terms of the variation and representativeness of the sample), and the treatment groups are similar, or equivalent, on this extraneous variable. If the treatments, self-spending vs. other-spending, eventually show differences in individual happiness, then we know it is not due to how happy participants' childhoods were, and we can be more confident that it is due to the independent variable.

You may be thinking: but wait, we have only taken care of one extraneous variable. What about the others? Good thinking; that's exactly right. We mentioned several extraneous variables but have only matched participants on one. This is the main limitation of matching. You can match participants on more than one extraneous variable, but it is cumbersome, if not impossible, to match them on 10 or 20. More importantly, the more variables we try to match participants on, the less likely we are to find close matches. In other words, it may be easy to match participants on one particular extraneous variable (a similar level of childhood happiness), but it is much harder to find participants who are similar on 10 different extraneous variables at once.

Holding Extraneous Variable Constant

The technique of holding the extraneous variable constant is self-explanatory: we use participants at only one level of the extraneous variable, in other words, holding the extraneous variable constant. Using the same example, suppose we only want to study participants with a low level of childhood happiness. We need to go through the same steps as in matching: identifying the extraneous variable that can potentially confound the research design and getting data on it. Once we have the childhood happiness scores, we include only participants on the lower end of the scale, place them into different treatment groups, and carry out the study as before. If the condition groups, self-spending vs. other-spending, eventually show differences in individual happiness, then we know it is not due to how happy participants' childhoods were (since we included only those on the lower end of childhood happiness). We can be more confident that it is due to the independent variable.

As with matching, we have to do this one extraneous variable at a time, and as we increase the number of extraneous variables to be held constant, the more difficult it gets. The other limitation is that by holding an extraneous variable constant we exclude a big chunk of participants, in this case anyone who is not low on childhood happiness. This is a major weakness: by reducing the variability on the spectrum of childhood happiness levels, we decrease the representativeness of the sample, and generalizability suffers.

Building Extraneous Variables into Design

The last control technique, building extraneous variables into the research design, is widely used. As the name suggests, we identify the extraneous variable that can potentially confound the research design and include it in the design by treating it as an independent variable. This technique takes care of the limitation of the previous one, holding the extraneous variable constant: we do not need to exclude participants based on where they stand on the extraneous variable(s), and can instead include participants with a wide range of levels on them. Multiple extraneous variables can be built into the design at once. However, the more variables you include in the design, the larger the sample size required for the statistical analyses, which may be difficult to obtain due to limitations of time, staff, cost, access, and so on.
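As a rough sketch of what building the extraneous variable into the design looks like in practice (all names and numbers below are simulated for illustration): childhood happiness, split into low and high, is crossed with the manipulated spending condition, and the outcome is then examined within each cell of the resulting 2 × 2 design.

```python
import random
from statistics import mean

random.seed(1)

# Simulated participants: childhood happiness (low/high) is treated as a
# second factor, crossed with the manipulated spending condition.
participants = []
for i in range(40):
    childhood = "low" if i < 20 else "high"
    condition = "self-spending" if i % 2 == 0 else "other-spending"
    outcome = random.gauss(5.0, 1.0)   # fake happiness score, for illustration only
    participants.append({"childhood": childhood, "condition": condition,
                         "happiness": outcome})

# Cell means for the 2 x 2 (spending condition x childhood happiness) design.
for cond in ("self-spending", "other-spending"):
    for child in ("low", "high"):
        cell = [p["happiness"] for p in participants
                if p["condition"] == cond and p["childhood"] == child]
        print(f"{cond:15} childhood={child:5} n={len(cell):2} mean={mean(cell):.2f}")
```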

Random Assignment in Psychology: Definition & Examples


In psychology, random assignment refers to the practice of allocating participants to different experimental groups in a study in a completely unbiased way, ensuring each participant has an equal chance of being assigned to any group.

In experimental research, random assignment, or random placement, organizes participants from your sample into different groups using randomization. 

Random assignment uses chance procedures to ensure that each participant has an equal opportunity of being assigned to either a control or experimental group.

The control group does not receive the treatment in question, whereas the experimental group does receive the treatment.

When using random assignment, neither the researcher nor the participant can choose the group to which the participant is assigned. This ensures that any differences between and within the groups are not systematic at the onset of the study. 

In a study to test the success of a weight-loss program, investigators randomly assigned a pool of participants to one of two groups.

Group A participants participated in the weight-loss program for 10 weeks and took a class where they learned about the benefits of healthy eating and exercise.

Group B participants read a 200-page book that explains the benefits of weight loss. The investigator randomly assigned participants to one of the two groups.

The researchers found that those who participated in the program and took the class were more likely to lose weight than those in the other group that received only the book.

Importance 

Random assignment helps ensure that the groups in an experiment are comparable before the independent variable is applied.

In experiments , researchers will manipulate an independent variable to assess its effect on a dependent variable, while controlling for other variables. Random assignment increases the likelihood that the treatment groups are the same at the onset of a study.

Thus, any changes that result from the independent variable can be assumed to be a result of the treatment of interest. This is particularly important for eliminating sources of bias and strengthening the internal validity of an experiment.

Random assignment is the best method for inferring a causal relationship between a treatment and an outcome.

Random Selection vs. Random Assignment 

Random selection (also called probability sampling or random sampling) is a way of randomly selecting members of a population to be included in your study.

On the other hand, random assignment is a way of sorting the sample participants into control and treatment groups. 

Random selection ensures that everyone in the population has an equal chance of being selected for the study. Once the pool of participants has been chosen, experimenters use random assignment to assign participants into groups. 

Random assignment is used mainly in between-subjects experimental designs, while random selection can be used in a variety of study designs.

Random Assignment vs Random Sampling

Random sampling refers to selecting participants from a population so that each individual has an equal chance of being chosen. This method enhances the representativeness of the sample.

Random assignment, on the other hand, is used in experimental designs once participants are selected. It involves allocating these participants to different experimental groups or conditions randomly.

This helps ensure that any differences in results across groups are due to manipulating the independent variable, not preexisting differences among participants.

When to Use Random Assignment

Random assignment is used in experiments with a between-groups or independent measures design.

In these research designs, researchers will manipulate an independent variable to assess its effect on a dependent variable, while controlling for other variables.

There is usually a control group and one or more experimental groups. Random assignment helps ensure that the groups are comparable at the onset of the study.

How to Use Random Assignment

There are a variety of ways to assign participants into study groups randomly. Here are a handful of popular methods: 

  • Random Number Generator : Give each member of the sample a unique number, then use a computer program to randomly draw numbers from the list to form each group.
  • Lottery : Give each member of the sample a unique number. Place all numbers in a hat or bucket and draw numbers at random for each group (a short code sketch of this method follows the list).
  • Flipping a Coin : Flip a coin for each participant to decide whether they will be in the control group or the experimental group (this method can only be used when you have just two groups).
  • Roll a Die : For each person on the list, roll a die to decide which group they will be in. For example, rolling a 1, 2, or 3 places them in the control group, and rolling a 4, 5, or 6 places them in the experimental group.
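As a small illustration, here is one way the lottery method above could be scripted in Python (the identifiers are made up for the example); shuffling the full list and dealing participants out in turn also keeps the group sizes nearly equal:

```python
import random

def lottery_assignment(participant_ids, n_groups=2, seed=None):
    """'Lottery' random assignment: shuffle all participant IDs, then deal
    them out to the groups in turn, keeping group sizes nearly equal."""
    rng = random.Random(seed)
    ids = list(participant_ids)
    rng.shuffle(ids)                        # draw names from the hat in random order
    groups = {g: [] for g in range(n_groups)}
    for i, pid in enumerate(ids):
        groups[i % n_groups].append(pid)    # deal to groups like cards
    return groups

print(lottery_assignment([f"P{i:02d}" for i in range(6)], seed=42))
```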

When is Random Assignment not used?

  • When it is not ethically permissible: Randomization is only ethical if the researcher has no evidence that one treatment is superior to the other or that one treatment might have harmful side effects. 
  • When answering non-causal questions : If the researcher is just interested in predicting the probability of an event, the causal relationship between the variables is not important and observational designs would be more suitable than random assignment. 
  • When studying the effect of variables that cannot be manipulated: Some risk factors cannot be manipulated and so it would not make any sense to study them in a randomized trial. For example, we cannot randomly assign participants into categories based on age, gender, or genetic factors.

Drawbacks of Random Assignment

While randomization assures an unbiased assignment of participants to groups, it does not guarantee the equality of these groups. There could still be extraneous variables that differ between groups or group differences that arise from chance. Additionally, there is still an element of luck with random assignments.

Thus, researchers cannot produce perfectly equal groups for each specific study. Differences between the treatment group and control group might still exist, and the results of a randomized trial may sometimes be wrong, but this is an accepted part of the process.

Scientific evidence is a long and continuous process, and the groups will tend to be equal in the long run when data is aggregated in a meta-analysis.

Additionally, external validity (i.e., the extent to which the researcher can use the results of the study to generalize to the larger population) can be compromised with random assignment.

Random assignment is challenging to implement outside of controlled laboratory conditions and might not represent what would happen in the real world at the population level. 

Random assignment can also be more costly than simple observational studies, where an investigator is just observing events without intervening with the population.

Randomization also can be time-consuming and challenging, especially when participants refuse to receive the assigned treatment or do not adhere to recommendations. 

What is the difference between random sampling and random assignment?

Random sampling refers to randomly selecting a sample of participants from a population. Random assignment refers to randomly assigning participants to treatment groups from the selected sample.

Does random assignment increase internal validity?

Yes, random assignment ensures that there are no systematic differences between the participants in each group, enhancing the study’s internal validity .

Does random assignment reduce sampling error?

Yes, with random assignment, participants have an equal chance of being assigned to either a control group or an experimental group, so the resulting groups are, in theory, equivalent at the start of the study.

Random assignment does not completely eliminate sampling error because a sample only approximates the population from which it is drawn. However, random sampling is a way to minimize sampling errors. 

When is random assignment not possible?

Random assignment is not possible when the experimenters cannot control the treatment or independent variable.

For example, if you want to compare how men and women perform on a test, you cannot randomly assign subjects to these groups.

Participants are not randomly assigned to different groups in this study, but instead assigned based on their characteristics.

Does random assignment eliminate confounding variables?

Random assignment minimizes the influence of confounding variables on the treatment because it distributes them at random among the study groups. On average, randomization breaks any systematic relationship between a confounding variable and the treatment, although chance differences can still arise in any single study.

Why is random assignment of participants to treatment conditions in an experiment used?

Random assignment is used to ensure that all groups are comparable at the start of a study. This allows researchers to conclude that the outcomes of the study can be attributed to the intervention at hand and to rule out alternative explanations for study results.

Further Reading

  • Bogomolnaia, A., & Moulin, H. (2001). A new solution to the random assignment problem. Journal of Economic Theory, 100(2), 295-328.
  • Krause, M. S., & Howard, K. I. (2003). What random assignment does and does not do. Journal of Clinical Psychology, 59(7), 751-766.


Treatment and Control Conditions

Between-subjects experiments are often used to determine whether a treatment works. In psychological research, a treatment is any intervention meant to change people’s behavior for the better. This includes psychotherapies and medical treatments for psychological disorders but also interventions designed to improve learning, promote conservation, reduce prejudice, and so on. To determine whether a treatment works, participants are randomly assigned to either a treatment condition, in which they receive the treatment, or a control condition, in which they do not receive the treatment. If participants in the treatment condition end up better off than participants in the control condition (for example, they are less depressed, learn faster, conserve more, or express less prejudice), then the researcher can conclude that the treatment works. In research on the effectiveness of psychotherapies and medical treatments, this type of experiment is often called a randomized clinical trial.

There are different types of control conditions. In a no-treatment control condition, participants receive no treatment whatsoever. One problem with this approach, however, is the existence of placebo effects. A placebo is a simulated treatment that lacks any active ingredient or element that should make it effective, and a placebo effect is a positive effect of such a treatment. Many folk remedies that seem to work, such as eating chicken soup for a cold or placing soap under the bedsheets to stop nighttime leg cramps, are probably nothing more than placebos. Although placebo effects are not well understood, they are probably driven primarily by people’s expectations that they will improve. Having the expectation to improve can result in reduced stress, anxiety, and depression, which can alter perceptions and even improve immune system functioning (Price, Finniss, & Benedetti, 2008).

Placebo effects are interesting in their own right (see “The Powerful Placebo” below), but they also pose a serious problem for researchers who want to determine whether a treatment works. Figure 6.2 shows some hypothetical results in which participants in a treatment condition improved more on average than participants in a no-treatment control condition. If these conditions (the two leftmost bars in Figure 6.2) were the only conditions in this experiment, however, one could not conclude that the treatment worked. It could be instead that participants in the treatment group improved more because they expected to improve, while those in the no-treatment control condition did not.

Figure 6.2 Hypothetical Results From a Study Including Treatment, No-Treatment, and Placebo Conditions


Fortunately, there are several solutions to this problem. One is to include a placebo control condition, in which participants receive a placebo that looks much like the treatment but lacks the active ingredient or element thought to be responsible for the treatment’s effectiveness. When participants in a treatment condition take a pill, for example, then those in a placebo control condition would take an identical-looking pill that lacks the active ingredient in the treatment (a “sugar pill”). In research on psychotherapy effectiveness, the placebo might involve going to a psychotherapist and talking in an unstructured way about one’s problems. The idea is that if participants in both the treatment and the placebo control groups expect to improve, then any improvement in the treatment group over and above that in the placebo control group must have been caused by the treatment and not by participants’ expectations. This is what is shown by a comparison of the two outer bars in Figure 6.2.

Of course, the principle of informed consent requires that participants be told that they will be assigned to either a treatment or a placebo control condition, even though they cannot be told which until the experiment ends. In many cases the participants who had been in the control condition are then offered an opportunity to have the real treatment. An alternative approach is to use a waitlist control condition, in which participants are told that they will receive the treatment but must wait until the participants in the treatment condition have already received it. This allows researchers to compare participants who have received the treatment with participants who are not currently receiving it but who still expect to improve (eventually). A final solution to the problem of placebo effects is to leave out the control condition completely and compare any new treatment with the best available alternative treatment. For example, a new treatment for simple phobia could be compared with standard exposure therapy. Because participants in both conditions receive a treatment, their expectations about improvement should be similar. This approach also makes sense because once there is an effective treatment, the interesting question about a new treatment is not simply “Does it work?” but “Does it work better than what is already available?”

The Powerful Placebo

Many people are not surprised that placebos can have a positive effect on disorders that seem fundamentally psychological, including depression, anxiety, and insomnia. However, placebos can also have a positive effect on disorders that most people think of as fundamentally physiological. These include asthma, ulcers, and warts (Shapiro & Shapiro, 1999). There is even evidence that placebo surgery—also called “sham surgery”—can be as effective as actual surgery.

Medical researcher J. Bruce Moseley and his colleagues conducted a study on the effectiveness of two arthroscopic surgery procedures for osteoarthritis of the knee (Moseley et al., 2002). The control participants in this study were prepped for surgery, received a tranquilizer, and even received three small incisions in their knees. But they did not receive the actual arthroscopic surgical procedure. The surprising result was that all participants improved in terms of both knee pain and function, and the sham surgery group improved just as much as the treatment groups. According to the researchers, “This study provides strong evidence that arthroscopic lavage with or without débridement [the surgical procedures used] is not better than and appears to be equivalent to a placebo procedure in improving knee pain and self-reported function” (p. 85).


Research has shown that patients with osteoarthritis of the knee who receive a “sham surgery” experience reductions in pain and improvement in knee function similar to those of patients who receive a real surgery.


Within-Subjects Experiments

In a within-subjects experiment , each participant is tested under all conditions. Consider an experiment on the effect of a defendant’s physical attractiveness on judgments of his guilt. Again, in a between-subjects experiment, one group of participants would be shown an attractive defendant and asked to judge his guilt, and another group of participants would be shown an unattractive defendant and asked to judge his guilt. In a within-subjects experiment, however, the same group of participants would judge the guilt of both an attractive and an unattractive defendant.

The primary advantage of this approach is that it provides maximum control of extraneous participant variables. Participants in all conditions have the same mean IQ, same socioeconomic status, same number of siblings, and so on—because they are the very same people. Within-subjects experiments also make it possible to use statistical procedures that remove the effect of these extraneous participant variables on the dependent variable and therefore make the data less “noisy” and the effect of the independent variable easier to detect. We will look more closely at this idea later in the book.

Carryover Effects and Counterbalancing

The primary disadvantage of within-subjects designs is that they can result in carryover effects. A carryover effect is an effect of being tested in one condition on participants’ behavior in later conditions. One type of carryover effect is a practice effect, where participants perform a task better in later conditions because they have had a chance to practice it. Another type is a fatigue effect, where participants perform a task worse in later conditions because they become tired or bored. Being tested in one condition can also change how participants perceive stimuli or interpret their task in later conditions. This is called a context effect. For example, an average-looking defendant might be judged more harshly when participants have just judged an attractive defendant than when they have just judged an unattractive defendant. Within-subjects experiments also make it easier for participants to guess the hypothesis. For example, a participant who is asked to judge the guilt of an attractive defendant and then is asked to judge the guilt of an unattractive defendant is likely to guess that the hypothesis is that defendant attractiveness affects judgments of guilt. This could lead the participant to judge the unattractive defendant more harshly because he thinks this is what he is expected to do. Or it could make participants judge the two defendants similarly in an effort to be “fair.”

Carryover effects can be interesting in their own right. (Does the attractiveness of one person depend on the attractiveness of other people that we have seen recently?) But when they are not the focus of the research, carryover effects can be problematic. Imagine, for example, that participants judge the guilt of an attractive defendant and then judge the guilt of an unattractive defendant. If they judge the unattractive defendant more harshly, this might be because of his unattractiveness. But it could be instead that they judge him more harshly because they are becoming bored or tired. In other words, the order of the conditions is a confounding variable. The attractive condition is always the first condition and the unattractive condition the second. Thus any difference between the conditions in terms of the dependent variable could be caused by the order of the conditions and not the independent variable itself.

There is a solution to the problem of order effects, however, that can be used in many situations. It is counterbalancing, which means testing different participants in different orders. For example, some participants would be tested in the attractive defendant condition followed by the unattractive defendant condition, and others would be tested in the unattractive condition followed by the attractive condition. With three conditions, there would be six different orders (ABC, ACB, BAC, BCA, CAB, and CBA), so some participants would be tested in each of the six orders. With counterbalancing, participants are assigned to orders randomly, using the techniques we have already discussed. Thus random assignment plays an important role in within-subjects designs just as in between-subjects designs. Here, instead of being randomly assigned to conditions, participants are randomly assigned to different orders of conditions. In fact, it can safely be said that if a study does not involve random assignment in one form or another, it is not an experiment.

There are two ways to think about what counterbalancing accomplishes. One is that it controls the order of conditions so that it is no longer a confounding variable. Instead of the attractive condition always being first and the unattractive condition always being second, the attractive condition comes first for some participants and second for others. Likewise, the unattractive condition comes first for some participants and second for others. Thus any overall difference in the dependent variable between the two conditions cannot have been caused by the order of conditions. A second way to think about what counterbalancing accomplishes is that if there are carryover effects, it makes it possible to detect them. One can analyze the data separately for each order to see whether it had an effect.
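To make the mechanics concrete, here is a minimal sketch in R (the condition labels, the number of participants, and the helper function all_orders are illustrative, not from the chapter) that enumerates every possible order of a set of conditions and randomly assigns each participant to one of those orders.

  # Enumerate all orders of the conditions and assign participants to orders at random.
  conditions <- c("attractive", "unattractive")   # illustrative condition labels

  all_orders <- function(x) {
    # returns a list of all permutations of x (for 2 conditions: AB and BA;
    # for 3 conditions this yields the 6 orders listed in the text)
    if (length(x) == 1) return(list(x))
    out <- list()
    for (i in seq_along(x)) {
      for (rest in all_orders(x[-i])) out <- c(out, list(c(x[i], rest)))
    }
    out
  }
  orders <- all_orders(conditions)

  n_participants <- 20
  set.seed(1)                                     # reproducible assignment
  # roughly equal numbers of participants per order, in a shuffled sequence
  assigned <- sample(rep(seq_along(orders), length.out = n_participants))
  data.frame(participant = 1:n_participants,
             order = sapply(assigned, function(i) paste(orders[[i]], collapse = " -> ")))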

When 9 Is “Larger” Than 221

Researcher Michael Birnbaum has argued that the lack of context provided by between-subjects designs is often a bigger problem than the context effects created by within-subjects designs. To demonstrate this, he asked one group of participants to rate how large the number 9 was on a 1-to-10 rating scale and another group to rate how large the number 221 was on the same 1-to-10 rating scale (Birnbaum, 1999). Participants in this between-subjects design gave the number 9 a mean rating of 5.13 and the number 221 a mean rating of 3.10. In other words, they rated 9 as larger than 221! According to Birnbaum, this is because participants spontaneously compared 9 with other one-digit numbers (in which case it is relatively large) and compared 221 with other three-digit numbers (in which case it is relatively small).

Simultaneous Within-Subjects Designs

So far, we have discussed an approach to within-subjects designs in which participants are tested in one condition at a time. There is another approach, however, that is often used when participants make multiple responses in each condition. Imagine, for example, that participants judge the guilt of 10 attractive defendants and 10 unattractive defendants. Instead of having people make judgments about all 10 defendants of one type followed by all 10 defendants of the other type, the researcher could present all 20 defendants in a sequence that mixed the two types. The researcher could then compute each participant’s mean rating for each type of defendant. Or imagine an experiment designed to see whether people with social anxiety disorder remember negative adjectives (e.g., “stupid,” “incompetent”) better than positive ones (e.g., “happy,” “productive”). The researcher could have participants study a single list that includes both kinds of words and then have them try to recall as many words as possible. The researcher could then count the number of each type of word that was recalled. There are many ways to determine the order in which the stimuli are presented, but one common way is to generate a different random order for each participant.
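A brief R sketch of this kind of randomized stimulus sequencing, assuming hypothetical stimulus labels: each participant receives their own shuffled sequence of the 10 attractive and 10 unattractive defendants.

  # One independently randomized presentation order per participant.
  stimuli <- c(paste0("attractive_", 1:10), paste0("unattractive_", 1:10))
  n_participants <- 5
  set.seed(2)
  presentation_orders <- lapply(1:n_participants, function(p) sample(stimuli))
  # Each participant's mean rating would then be computed separately for the
  # attractive and unattractive stimuli within their own sequence.
  presentation_orders[[1]]   # the randomized sequence shown to participant 1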

Between-Subjects or Within-Subjects?

Almost every experiment can be conducted using either a between-subjects design or a within-subjects design. This means that researchers must choose between the two approaches based on their relative merits for the particular situation.

Between-subjects experiments have the advantage of being conceptually simpler and requiring less testing time per participant. They also avoid carryover effects without the need for counterbalancing. Within-subjects experiments have the advantage of controlling extraneous participant variables, which generally reduces noise in the data and makes it easier to detect a relationship between the independent and dependent variables.

A good rule of thumb, then, is that if it is possible to conduct a within-subjects experiment (with proper counterbalancing) in the time that is available per participant—and you have no serious concerns about carryover effects—this is probably the best option. If a within-subjects design would be difficult or impossible to carry out, then you should consider a between-subjects design instead. For example, if you were testing participants in a doctor’s waiting room or shoppers in line at a grocery store, you might not have enough time to test each participant in all conditions and therefore would opt for a between-subjects design. Or imagine you were trying to reduce people’s level of prejudice by having them interact with someone of another race. A within-subjects design with counterbalancing would require testing some participants in the treatment condition first and then in a control condition. But if the treatment works and reduces people’s level of prejudice, then they would no longer be suitable for testing in the control condition. This is true for many designs that involve a treatment meant to produce long-term change in participants’ behavior (e.g., studies testing the effectiveness of psychotherapy). Clearly, a between-subjects design would be necessary here.

Remember also that using one type of design does not preclude using the other type in a different study. There is no reason that a researcher could not use both a between-subjects design and a within-subjects design to answer the same research question. In fact, professional researchers often do exactly this.

Key Takeaways

  • Experiments can be conducted using either between-subjects or within-subjects designs. Deciding which to use in a particular situation requires careful consideration of the pros and cons of each approach.
  • Random assignment to conditions in between-subjects experiments or to orders of conditions in within-subjects experiments is a fundamental element of experimental research. Its purpose is to control extraneous variables so that they do not become confounding variables.
  • Experimental research on the effectiveness of a treatment requires both a treatment condition and a control condition, which can be a no-treatment control condition, a placebo control condition, or a waitlist control condition. Experimental treatments can also be compared with the best available alternative.

Discussion: For each of the following topics, list the pros and cons of a between-subjects and within-subjects design and decide which would be better.

  • You want to test the relative effectiveness of two training programs for running a marathon.
  • Using photographs of people as stimuli, you want to see if smiling people are perceived as more intelligent than people who are not smiling.
  • In a field experiment, you want to see if the way a panhandler is dressed (neatly vs. sloppily) affects whether or not passersby give him any money.
  • You want to see if concrete nouns (e.g., dog ) are recalled better than abstract nouns (e.g., truth ).
Discussion: Imagine that an experiment shows that participants who receive psychodynamic therapy for a dog phobia improve more than participants in a no-treatment control group. Explain a fundamental problem with this research design and at least two ways that it might be corrected.

Birnbaum, M. H. (1999). How to show that 9 > 221: Collect judgments in a between-subjects design. Psychological Methods, 4, 243–249.

Moseley, J. B., O’Malley, K., Petersen, N. J., Menke, T. J., Brody, B. A., Kuykendall, D. H., … Wray, N. P. (2002). A controlled trial of arthroscopic surgery for osteoarthritis of the knee. The New England Journal of Medicine, 347, 81–88.

Price, D. D., Finniss, D. G., & Benedetti, F. (2008). A comprehensive review of the placebo effect: Recent advances and current thought. Annual Review of Psychology, 59, 565–590.

Shapiro, A. K., & Shapiro, E. (1999). The powerful placebo: From ancient priest to modern physician. Baltimore, MD: Johns Hopkins University Press.

  • Research Methods in Psychology. Provided by : University of Minnesota Libraries Publishing. Located at : http://open.lib.umn.edu/psychologyresearchmethods . License : CC BY-NC-SA: Attribution-NonCommercial-ShareAlike


Matching and Randomization in Experiments

Thoughts on a classic paper on causality.

Jeremy Salfen


I recently read Donald Rubin’s classic paper Estimating Causal Effects of Treatments in Randomized and Nonrandomized Studies as part of the Kickstarter Data team’s reading group.

Two arguments in this paper jumped out at me, the first about the value of matching and the second about the costs and benefits of conducting a randomized versus observational study.

Rubin proposes a hypothetical experiment on 2N units in which the experimental treatment E is assigned to N units while the control treatment C is assigned to a different set of N units. If for every unit receiving E there is a matched unit receiving C such that we expect the pair to react identically to the same treatment, then there are N identically matched pairs. Rubin observes that

if one had N identically matched pairs, a “thoughtless” random assignment could be worse than a nonrandom assignment of E to one member of the pair and C to the other. By “thoughtless” we mean some random assignment that does not assure that the members of each matched pair get different treatments. (692)

In this case, “thoughtless” randomization “could be worse” in the sense that the statistical power of the experiment will suffer, and you will be less likely to detect an effect if there really is one.

Geography is a good example of this. If you’re running an experiment on users across the United States, matching similar geographic regions might be more effective than a completely randomized trial because the variation between regions might be higher than the variation within regions, diluting the effect.

Of course, a multilevel model that takes into account region is one solution. For an insightful description of how Google has approached this problem, see Estimating causal effects using geo experiments .
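As a concrete illustration of Rubin’s point, here is a minimal R sketch (the pair count and labels are illustrative) contrasting within-pair randomization, which guarantees that each matched pair contributes one E and one C, with “thoughtless” complete randomization, which does not.

  set.seed(3)
  N <- 10                                    # number of matched pairs (illustrative)

  # Within-pair ("blocked") randomization: each pair gets one E and one C.
  within_pair <- t(sapply(1:N, function(i) sample(c("E", "C"))))
  colnames(within_pair) <- c("member_1", "member_2")

  # "Thoughtless" complete randomization: N units get E and N get C overall,
  # but nothing guarantees the two members of a pair get different treatments.
  complete <- sample(rep(c("E", "C"), each = N))

  within_pair
  table(complete)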

Randomized vs. Observational Studies

Another one of Rubin’s claims that stood out to me is a comparison of the costs and benefits of typical randomized and observational studies.

One major advantage of randomized studies is that, with a large enough sample size, you often don’t have to worry about controlling for confounding factors that might bias your results.

There can be downsides to randomized studies though. They can have nontrivial setup costs, and running a randomized study over a long window (e.g. years) is often not feasible. Moreover, because randomized trials are often conducted in a controlled environment, Rubin claims that they tend to be less natural than an observational study — that is, the units of analysis are often constrained to a particular setting or selected to be a subset of the population of interest.

Granted, this is more the case for experiments in fields like psychology than in online experiments on the web, but it does suggest that generalizability is a factor we should consider when interpreting the results of a randomized trial.

Comparing these costs and benefits, Rubin argues that

the first issue, the effect of variables not explicitly controlled, is usually more serious in nonrandomized than in randomized studies, while the second, the applicability of the results to a population of interest, is often more serious in randomized than in nonrandomized studies. (698)

I take this as a reminder to think carefully about the generalizability of the results of an experiment. When we run experiments on specific parts of a website or on particular subsets of users, often our goal is to generalize these results to the entire website or to all users. Rubin reminds us that observational studies, when analyzed properly, may in fact be better suited to those kinds of claims, particularly when matching can be used.


Dr. Lee A. Becker


Statistical Analysis of Quasi-Experimental Designs: I. A Priori Selection Techniques


I. Overview

Random assignment is used in experimental designs to help ensure that different treatment groups are equivalent prior to treatment. With small n's, randomization can be messy: the groups may not end up equivalent on some important characteristic.

In general, matching is used when you want to make sure that members of the various groups are equivalent on one or more characteristics. If you want to make absolutely sure that the treatment groups are equivalent on some attribute, you can use matched random assignment.

When you can't randomly assign to conditions you can still use matching techniques to try to equate groups on important characteristics. This set of notes makes the distinction between normative group matching and normative group equivalence. In normative group matching you select an exact match from the normative comparison group for each participant in the treatment group. In normative group equivalence you select a comparison group that has approximately equivalent characteristics to the treatment group.

II. Matching in Experimental Designs: Matched Random Assignment

In an experimental design, matched random assignment can be used to equate the groups on one or more characteristics. Whitley (in chapter 8) uses an example of matching on IQ.

The Matching Process

Note: Tx = Treatment group, Ctl = Control Group.

Analysis of a Matched Random Assignment Design

If the matching variable is related to the dependent variable, (e.g., IQ is related to almost all studies of memory and learning), then you can incorporate the matching variable as a blocking variable in your analysis of variance. That is, in the 2 x 3 example, the first 6 participants can be entered as IQ block #1, the second 6 participants as IQ block #2. This removes the variance due to IQ from the error term, increasing the power of the study.

The analysis is treated as a repeated measures design where the measures for each block of participants are considered to be repeated measures. For example, in setting up the data for a two-group design (experimental vs. control), the data would be arranged with one row per matched block and one column for each group's score.

The analysis would be run as a repeated measures design with group (control vs. experimental) as a within-subjects factor.

If you were interested in analyzing the equivalence of the groups on the IQ score variable you could enter the IQ scores as separate variables.  An analysis of variance of  the IQ scores with treatment group (Treatment vs. Control) as a within-subjects factor should show no mean differences between the two groups. Entering the IQ data would allow you to find the correlation between IQ and performance scores within each treatment group.

One of the problems with this type of analysis is that if any score is missing then the entire block is set to missing. None of the performance data from Block 4 in Table 2 would be included in the analysis because the performance score is missing for the person in the control group. If you had 6 cells in your design you would lose the data on all 6 people in a block that had only one missing data point.

I understand that Dr. Klebe has been writing a new data analysis program to take care of this kind of missing data problem.

SPSS Note 

The SPSS syntax commands for running the data in Table 2 as a repeated measures analysis of variance are shown in Table 3.  The SPSS syntax commands for running the data in Table 2 as a paired t test are shown in Table 4. 
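The SPSS syntax itself is not shown above, so a rough R equivalent of the same analysis may be useful; the performance scores below are hypothetical, and the block structure follows the description in the text (one treatment member and one control member per matched block).

  # Hypothetical scores: each position is one matched block (pair).
  treatment <- c(12, 15, 11, 14, 16, 13)     # performance, treatment member of each block
  control   <- c(10, 13, 12, 11, 14, 12)     # performance, control member of each block

  # Paired t test (equivalent to the two-condition within-subjects analysis).
  t.test(treatment, control, paired = TRUE)

  # The same comparison as a randomized-block ANOVA, with block as the blocking factor;
  # the F for 'group' here equals the square of the paired t statistic.
  scores <- data.frame(
    score = c(treatment, control),
    group = rep(c("treatment", "control"), each = length(treatment)),
    block = factor(rep(seq_along(treatment), times = 2))
  )
  summary(aov(score ~ group + Error(block), data = scores))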

III. Matching in Quasi-Experimental Designs: Normative Group Matching

Suppose that you have a quasi-experiment where you want to compare an experimental group (e.g., people who have suffered mild head injury) with a sample from a normative population. Suppose that there are several hundred people in the normative population.

One strategy is to randomly select the same number of people from the normative population as you have in your experimental group. If the demographic characteristics of the normative group approximate those of your experimental group, then this process may be appropriate. But, what if the normative group contains equal numbers of males and females ranging in age from 6 to 102, and people in your experimental condition are all males ranging in age from 18 to 35? Then it is unlikely that the demographic characteristics of the people sampled from the normative group will match those of your experimental group. For that reason, simple random selection is rarely appropriate when sampling from a normative population.

The Normative Group Matching Procedure

Determine the relevant characteristics (e.g., age, gender, SES, etc.) of each person in your experimental group. For example, if Exp person #1 is a 27-year-old male, randomly select one of the 27-year-old males from the normative population as a match for Exp person #1. If Exp person #2 is a 35-year-old male, randomly select one of the 35-year-old males as a match for Exp person #2. If you have done randomized normative group matching, then the matching variable should be used as a blocking factor in the ANOVA.

If you have a limited number of people in the normative group then you can do caliper matching. In caliper matching you select the matching person based on a range of scores; for example, you can caliper match within a range of 3 years, so Exp person #1 would be randomly selected from males whose age ranged from 26 to 28 years. If you used a five-year caliper for age, then for Exp person #1 you would randomly select a male from those whose age ranged from 25 to 29 years. You would want a narrower age caliper for children and adolescents than for adults.
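A minimal R sketch of the matching procedure just described, assuming hypothetical data frames (normative and experimental) and matching on gender plus an age caliper; for simplicity this sketch does not prevent the same normative participant from being selected for more than one experimental participant.

  set.seed(5)
  normative <- data.frame(id = 1:500,
                          age = sample(18:60, 500, replace = TRUE),
                          gender = sample(c("M", "F"), 500, replace = TRUE))
  experimental <- data.frame(id = 1:30,
                             age = sample(18:35, 30, replace = TRUE),
                             gender = "M")

  caliper <- 2   # match within +/- 2 years of age
  matched_ids <- sapply(seq_len(nrow(experimental)), function(i) {
    candidates <- subset(normative,
                         gender == experimental$gender[i] &
                           abs(age - experimental$age[i]) <= caliper)
    if (nrow(candidates) == 0) return(NA)    # no normative match within the caliper
    candidates$id[sample(nrow(candidates), 1)]
  })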

This procedure becomes very difficult to accomplish when you try to start matching on more than one variable. Think of the problems of finding exact matches when several variables are used, e.g., an exact match for a 27-year old, white female with an IQ score of 103 and 5 children.

Analysis of a Normative Group Matching Design

The analysis is the same as for a matched random assignment design. If the matching variable is related to the dependent variable, then you can incorporate the matching variable as a blocking variable in your analysis of variance.

IV. Matching in Quasi-Experimental Designs: Normative Group Equivalence

Because of the problems in selecting people in a normative group matching design and the potential problems with the data analysis of that design, you may want to make the normative comparison group equivalent on selected demographic characteristics. You might want the same proportion of males and females, and the mean age (and SD) of the normative group should be the same as those in the experimental group. If the ages of the people in the experimental group ranged from 18 to 35, then your normative group might contain an equal number of participants randomly selected from those in the age range from 18 to 35 in the normative population.

Analysis of a Normative Group Equivalence Design

In the case of normative group equivalence there is no special ANOVA procedure as there is in normative group matching. In general, demographic characteristics themselves rarely predict the dependent variable, so you haven’t lost anything by using the group equivalence method.

A Semantic Caution

The term "matching" implies a one-to-one matching and it implies that you have incorporated that matched variable into your ANOVA design. Please don’t use the term "matching" when you mean mere "equivalence."


Random Selection vs. Random Assignment

Random selection and random assignment are two techniques in statistics that are commonly used but often confused.

Random selection  refers to the process of randomly selecting individuals from a population to be involved in a study.

Random assignment  refers to the process of randomly  assigning  the individuals in a study to either a treatment group or a control group.

You can think of random selection as the process you use to “get” the individuals in a study and you can think of random assignment as what you “do” with those individuals once they’re selected to be part of the study.

The Importance of Random Selection and Random Assignment

When a study uses  random selection , it selects individuals from a population using some random process. For example, if some population has 1,000 individuals then we might use a computer to randomly select 100 of those individuals from a database. This means that each individual is equally likely to be selected to be part of the study, which increases the chances that we will obtain a representative sample – a sample that has similar characteristics to the overall population.

By using a representative sample in our study, we’re able to generalize the findings of our study to the population. In statistical terms, this is referred to as having  external validity – it’s valid to externalize our findings to the overall population.

When a study uses  random assignment , it randomly assigns individuals to either a treatment group or a control group. For example, if we have 100 individuals in a study then we might use a random number generator to randomly assign 50 individuals to a control group and 50 individuals to a treatment group.

By using random assignment, we increase the chances that the two groups will have roughly similar characteristics, which means that any difference we observe between the two groups can be attributed to the treatment. This means the study has  internal validity  – it’s valid to attribute any differences between the groups to the treatment itself as opposed to differences between the individuals in the groups.
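A minimal R sketch of both steps, using the same numbers as the examples in this article (a population of 1,000, a sample of 100, and two groups of 50); the person labels are placeholders.

  set.seed(6)
  population <- paste0("person_", 1:1000)          # stand-in for a database of names

  selected   <- sample(population, 100)            # random selection (supports external validity)
  assignment <- sample(rep(c("treatment", "control"), each = 50))  # random assignment (supports internal validity)

  study <- data.frame(id = selected, group = assignment)
  table(study$group)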

Examples of Random Selection and Random Assignment

It’s possible for a study to use both random selection and random assignment, or just one of these techniques, or neither technique. A strong study is one that uses both techniques.

The following examples show how a study could use both, one, or neither of these techniques, along with the effects of doing so.

Example 1: Using both Random Selection and Random Assignment

Study:  Researchers want to know whether a new diet leads to more weight loss than a standard diet in a certain community of 10,000 people. They recruit 100 individuals to be in the study by using a computer to randomly select 100 names from a database. Once they have the 100 individuals, they once again use a computer to randomly assign 50 of the individuals to a control group (e.g. stick with their standard diet) and 50 individuals to a treatment group (e.g. follow the new diet). They record the total weight loss of each individual after one month.


Results:  The researchers used random selection to obtain their sample and random assignment when putting individuals in either a treatment or control group. By doing so, they’re able to generalize the findings from the study to the overall population  and  they’re able to attribute any differences in average weight loss between the two groups to the new diet.

Example 2: Using only Random Selection

Study:  Researchers want to know whether a new diet leads to more weight loss than a standard diet in a certain community of 10,000 people. They recruit 100 individuals to be in the study by using a computer to randomly select 100 names from a database. However, they decide to assign individuals to groups based solely on gender. Females are assigned to the control group and males are assigned to the treatment group. They record the total weight loss of each individual after one month.


Results:  The researchers used random selection to obtain their sample, but they did not use random assignment when putting individuals in either a treatment or control group. Instead, they used a specific factor – gender – to decide which group to assign individuals to. By doing this, they’re able to generalize the findings from the study to the overall population but they are  not  able to attribute any differences in average weight loss between the two groups to the new diet. The internal validity of the study has been compromised because the difference in weight loss could actually just be due to gender, rather than the new diet.

Example 3: Using only Random Assignment

Study:  Researchers want to know whether a new diet leads to more weight loss than a standard diet in a certain community of 10,000 people. They recruit 100 male athletes to be in the study. Then, they use a computer program to randomly assign 50 of the male athletes to a control group and 50 to the treatment group. They record the total weight loss of each individual after one month.


Results:  The researchers did not use random selection to obtain their sample since they specifically chose 100 male athletes. Because of this, their sample is not representative of the overall population so their external validity is compromised – they will not be able to generalize the findings from the study to the overall population. However, they did use random assignment, which means they can attribute any difference in weight loss to the new diet.

Example 4: Using Neither Technique

Study:  Researchers want to know whether a new diet leads to more weight loss than a standard diet in a certain community of 10,000 people. They recruit 50 male athletes and 50 female athletes to be in the study. Then, they assign all of the female athletes to the control group and all of the male athletes to the treatment group. They record the total weight loss of each individual after one month.


Results:  The researchers did not use random selection to obtain their sample since they specifically chose 100 athletes. Because of this, their sample is not representative of the overall population so their external validity is compromised – they will not be able to generalize the findings from the study to the overall population. Also, they split individuals into groups based on gender rather than using random assignment, which means their internal validity is also compromised – differences in weight loss might be due to gender rather than the diet.



Establishing Equivalence: Methodological Progress in Group-Matching Design and Analysis

Sara T. Kover

University of Wisconsin-Madison, Waisman Center, 1500 Highland Avenue, Madison, WI 53705

Amy K. Atwood

University of Wisconsin-Madison

This methodological review draws attention to the challenges faced by intellectual and developmental disabilities researchers in the appropriate design and analysis of group comparison studies. We provide a brief overview of matching methodologies in the field, emphasizing group-matching designs utilized in behavioral research on cognition and language in neurodevelopmental disorders, including autism spectrum disorder, fragile X syndrome, Down syndrome, and Williams syndrome. The limitations of relying on p -values to establish group equivalence are discussed in the context of other existing methods: equivalence tests, propensity scores, and regression-based analyses. Our primary recommendation for advancing research on intellectual and developmental disabilities is the use of descriptive indices of adequate group matching: effect sizes (i.e., standardized mean differences) and variance ratios.

With the ultimate goal of understanding their causal effects on development, much of behavioral research on intellectual and developmental disabilities (IDDs) is designed to (1) characterize phenotypic strengths and weaknesses in behavior and cognition and/or (2) identify syndrome-specific aspects of these profiles. Such aims are often addressed with group-matching designs, in which statistical comparisons between nonrandomized groups (e.g., autism spectrum disorder [ASD] versus typical development) provide the basis for conclusions. Despite the considerable attention matching has received ( Abbeduto, 2010 ; Burack, 2004 ), methodological issues in group matching remain at the forefront of concerns regarding the progress of behavioral research on neurodevelopmental disorders ( Beeghly, 2006 ; Eigsti, de Marchena, Schuh, & Kelley, 2011 ).

The purpose of this article is to introduce methodological improvements to group-matching designs frequently used in IDD research. To that end, we discuss the pitfalls of common group-matching strategies and suggest metrics for establishing adequate group equivalence that are not novel, but are new to the field: effect sizes and variance ratios. Because our primary goal is to provide a foundation from which informed decisions on research design, analysis, and interpretation can be made, we highlight several other study designs worthy of consideration. We conclude by emphasizing the need for thoughtful research questions and responsible use of equivalence thresholds.

Challenges of Group Matching in IDD Research

Frameworks for causality

The ability to draw conclusions about causality has traditionally hinged upon random assignment of participants to the target group (e.g., treatment, intervention, diagnosis—in our case) and comparison group. Properly implemented, random assignment allows estimation of causal effects because it ensures, in the long run, that any differences between the target and comparison groups (i.e., bias or selection bias) aside from group assignment prior to the study are due to chance. One approach to causality, the Rubin Causal Model, defines the causal effect—that is, the effect of a manipulable treatment—in terms of potential outcomes: what the outcome would have been for participants in the comparison group had they received the treatment and what the outcome would have been for those in the treatment group had they not received it ( Holland, 1986 ; Rubin, 1974 ). In quasi-experimental designs (e.g., regression discontinuity, interrupted time series), it is possible to test hypotheses about the effects of causes without random assignment. A nonequivalent control group design is one that seeks to remove the bias associated with nonrandom assignment by matching the target and comparison groups to establish equivalence, or balance ( Shadish, Cook, & Campbell, 2002 ).

Methods in IDD Research

Although IDDs are attributable to neurodevelopmental disorders, those disorders can scarcely be considered manipulable causes. Research on IDDs is further constrained by ethical parameters (e.g., inability to randomly assign the circumstances that lead to a diagnosis of fetal alcohol spectrum disorder) and relatively small samples due to low prevalence. As such, the use of more desirable techniques, such as random assignment or sophisticated matching that relies on large datasets, is precluded. Instead, in the simplest and perhaps most common group-matching design in the field, two groups composed of participants with preexisting diagnoses are matched on a single variable, such as nonverbal cognitive ability, and then compared on some dependent variable of interest, such as vocabulary ability. These groups are selected in such a manner as to presume they are equivalent on a dimension of ability thought to be relevant to the dependent variable of interest. Differences between groups on the dependent variable are taken to indicate strengths or weaknesses on the construct of interest relative to the matching construct. How to select constructs and variables on which to match is discussed elsewhere and is beyond the current scope (see Burack, Iarocci, Bowler, & Mottron, 2002 ; Mervis & Klein-Tasman, 2004 ; Mervis & Robinson, 1999 ; Strauss, 2001 ). We focus here on a specific aspect of matching: establishing when groups are equivalent.

A customary group-matching procedure is to iteratively exclude participants from one or both groups until an independent samples t -test of the group mean difference on the matching variable yields a sufficiently high p -value, showing that the groups do not significantly differ. This process is achieved by first testing the difference between the groups on the matching variable. For example, a hypothetical target group of 30 participants with a mean score of 60.10 on the matching variable would not be considered matched to a comparison group of 30 participants with a mean score of 71.70 because the p -value for the t -test on the matching variable is less than .05 (hypothetical data are given in the Appendix ). Two matched groups might be attained by next removing all participants outside of the overlapping range of the groups or according to some other criterion, and then testing the group difference again, usually yielding iteratively higher p -values ( Mervis & John, 2008 ). This procedure might be repeated an unreported number of times by a researcher and usually yields higher p -values as participants are removed.

P-value Thresholds

The most persuasive standard in the field for group matching has been generated by Mervis and colleagues (Mervis & Klein-Tasman, 2004; Mervis & Robinson, 1999), who drew important attention to the matching procedures used to study individuals with IDDs. Mervis and colleagues highlighted that accepting groups as matched when a t-test on the matching variable yields a p-value greater than .05 is not sufficient (Mervis & Klein-Tasman, 2004; Mervis & Robinson, 1999). As such, they substantially improved upon the common practice of accepting the null hypothesis that population means are equal given any nonsignificant p-value. Mervis and Klein-Tasman (2004) suggested that when considering a p-value threshold for matching groups, “…it is important to show that the group distributions on the matching variable overlap strongly, as evidenced, we suggest, by a p-level of at least .50 on the test of mean differences for the control variable(s)” (p. 9). While p-values below .05 are taken as clear indication of a group difference, Mervis and colleagues (2004, 1999) proposed that p-values between .20 and .50 are ambiguous; p-values of .50 and above are sufficient evidence of equivalence. The .50 p-value threshold was based on Frick’s (1995) “good-effort criterion” for accepting the null hypothesis, which included p-value thresholds in combination with a small effect.

According to Mervis’ guideline, groups might be considered matched on a measure of cognitive ability only when the p -value for the test of the group difference on the matching variable is greater than or equal to .50. In our hypothetical example, a subset of participants ( n = 20 from each group) could be selected to improve the overlap of the groups on the matching variable by removing the lowest scoring participants of the target group and the highest scoring participants of the comparison group, yielding a mean of 68.10 for the target group and a mean of 67.10 for the comparison group. The t -test on the matching variable for these subgroups is not significant, but instead gives a p -value of .55, which is a value that might be taken as evidence that the groups are matched.

The Trouble with P-value Thresholds

Mervis and colleagues (2004 , 1999 ) were correct in emphasizing that groups ought not be considered matched solely on the basis of a p -value greater than .05. Most importantly, their recommendations increased awareness in the field that some p -values should lead to a conclusion of failing to reject the null hypothesis that the population means are equal. Nonetheless the p -value threshold proposed by Mervis et al. (2004 , 1999 ) and Frick (1995) is not without limitations ( Edgell, 1995 ; Imai, King, & Stuart, 2008 ). Difficulties hinge around the interpretation of p -values and the role of power in hypothesis testing.

Interpretation of a P-value

A p-value is defined as the probability of observing the sampled data or data more extreme, given that the null hypothesis is true. The hypotheses in question in the traditional matching procedure are H0: Δ = 0 and H1: Δ ≠ 0, where Δ is the population mean difference on the matching variable. The p-value represents the probability of observing the sample mean difference (or one more extreme) when the population mean difference is zero. When there is no difference in the population on the matching variable, one would expect a p-value to be less than or equal to .05 exactly 5% of the time. As such, using p > .05 as a threshold for declaring groups to be matched will result in groups being considered matched 95% of the time when there is no true population mean difference. This can be seen in the first panel of Figure 1. Likewise, one would expect a p-value to be less than or equal to .20 exactly 20% of the time; a p-value of less than or equal to .50 would be expected 50% of the time when the populations are equivalent. Thus, using a matching threshold of p ≥ .50, one would conclude that the group samples are matched just 50% of the time on average when the groups are truly matched in the population.

Figure 1. Proportion of samples considered to be adequately matched according to p-value thresholds by sample size and population effect size (Cohen’s d).

Failure to reject H0: Δ = 0 does not allow one to conclude that the groups come from populations with the same mean because a p-value denotes nothing about the probability of the truth of H0 given the observed data (Schervish, 1996). Null hypothesis significance testing allows for rejecting or failing to reject the null hypothesis; the option of accepting the null hypothesis simply does not exist. Thus, observing a p-value of .50 leads only to a conclusion that the groups are not significantly different (i.e., a failure to reject the null hypothesis).

Power and P-values

It is possible to observe a large p -value due to a lack of effect, due to a small effect, and/or due to the lack of power to detect that effect because of a small sample size. Eliminating participants until a test of the mean difference results in a large enough p -value may decrease the difference between the observed sample means for the matching variable, but also decreases the power (due to the decreasing sample size) to detect any effect at all, including on the dependent variable of interest. In other words, increasing p -values on the matching variable may have less to do with achieving equivalence between groups and more to do with a reduction in power, particularly when sample sizes are initially small. Mervis and John (2008) provide a substantive example of this for a sample of participants with Williams syndrome.

Impact of sample size on p-value threshold matching

To further illustrate these difficulties, we simulated the process of sampling groups of various sizes from populations with known mean differences in a Monte Carlo simulation using R version 2.10.1 ( R Development Core Team, 2010 ). Over 10,000 iterations, we tracked the frequency with which various p -value matching thresholds resulted in concluding that the groups were matched. We defined groups to be matched when a t -test comparing group means on the matching variable resulted in a p -value greater than p -value thresholds of .05, .20, and .50. We based the matching variable on a standardized assessment of cognitive ability, such as IQ ( M = 100, SD = 15), because it is often used in this context and is readily interpretable. The true difference in populations was calculated in terms of the standardized mean difference effect size, ranging from 0 (i.e., no difference) to 0.5 (i.e., a medium effect size using Cohen's [1988] guidelines).

The impact of sample size on the p -value threshold method for determining when groups are matched becomes apparent when the population mean difference is truly greater than zero. As seen in Figure 1 , when the population mean difference is a medium effect of d = 0.50, using a threshold of p ≥ .50, one concludes that the groups are matched between 5% and 35% of the time across sample sizes of 10 to 50. This notable discrepancy in rate of meeting the p -value threshold and concluding that the groups are matched is due to variability in sample size alone.
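The following is a simplified reconstruction in R of the kind of simulation described above, not the authors’ original code: it estimates how often two samples drawn from populations a given effect size apart would be declared “matched” under a chosen p-value threshold.

  set.seed(7)
  prop_matched <- function(n, d, threshold, reps = 10000) {
    # Proportion of simulated studies in which the t-test on the matching
    # variable exceeds the threshold and the groups would be called "matched".
    declared_matched <- replicate(reps, {
      target     <- rnorm(n, mean = 100,          sd = 15)   # IQ-like matching variable
      comparison <- rnorm(n, mean = 100 + d * 15, sd = 15)   # shifted by d standard deviations
      t.test(target, comparison)$p.value > threshold
    })
    mean(declared_matched)
  }

  # Example: a true medium difference (d = 0.5) with the p >= .50 threshold,
  # at two different sample sizes per group.
  prop_matched(n = 10, d = 0.5, threshold = .50)
  prop_matched(n = 50, d = 0.5, threshold = .50)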

Inferential statistics should not be used in isolation for establishing equivalence because the results of a t -test hinge on both the observed mean difference between groups and the statistical power of the test, which has a direct relation to sample sizes—dropping participants reduces power and increases a p -value without respect to the mean difference ( Imai et al., 2008 ).

Improved Equivalence Thresholds: Recommendations

Descriptive statistics for group matching

Contemporary reviews of matching methodologies highlight descriptive statistics pre- and post-matching as an alternative to inferential statistics for determining the adequacy of group equivalence ( Steiner & Cook, 2012 ; Stuart, 2010 ). As the basis for equivalence thresholds—a term from Stegner et al. (1996) —for IDD research, we suggest two descriptive metrics: effect sizes (i.e., standardized mean differences) and variance ratios. These metrics are used widely in quasi-experimental designs with propensity score analysis, which we describe below. Importantly, effect sizes and variance ratios yield interpretable estimates of group matching adequacy and reduce the influence of sample size ( Breaugh & Arnold, 2007 ; Imai et al., 2008 ).

Sometimes referred to as standardized bias, standardized mean differences are a simple and effective index of matching (Rosenbaum & Rubin, 1985; Rubin, 2001). Where x̄_t and x̄_c are the means of the target and comparison groups on the matching variable, respectively, and s_t² and s_c² are the corresponding variances, Cohen’s d should be calculated as (x̄_t − x̄_c) / √[(s_t² + s_c²) / 2] when population variances are assumed equal a priori and sample sizes are equal. Note that Cohen’s d is calculated as (x̄_t − x̄_c) / √{[(n_t − 1)s_t² + (n_c − 1)s_c²] / [n_t + n_c − 2]} for equal or unequal sample sizes, but the formula simplifies as above when sample sizes are equal. When variances are not assumed equal and/or interpreting the mean difference with respect to the variance of the comparison group is preferred, Cohen’s d can be calculated as (x̄_t − x̄_c) / s_c, that is, the mean difference divided by the comparison group’s standard deviation. For our hypothetical example of the n = 20 groups (see Appendix), Cohen’s d is (68.10 − 67.10) / √[(36.00 + 20.20) / 2] = .19. Effect sizes should be reported as best practice for tests of the dependent variable (American Psychological Association, 2010; Bakeman, 2006), but also reporting standardized mean differences alongside p-values on the matching variable provides context to the comparison of groups. The strategy of reporting effect sizes has been utilized in investigations of language and cognitive abilities in boys with ASD to aid the reader in interpreting the equivalence between the target and comparison samples (Brown, Aczel, Jimenez, Kaufman, & Grant, 2010; Kover, McDuffie, Hagerman, & Abbeduto, under revision).

The weakness of matching groups on just one aspect of their distributions (i.e., means) has been noted (Facon, Magis, & Belmont, 2011); however, using p-value threshold tests on variance, skewness, and kurtosis may exacerbate the issues associated with using p-value thresholds for means alone. We do not recommend p-value threshold matching for variances in addition to means. Rather, we favor Rubin’s (2001) guideline, which avoids use of an inferential statistic: reporting the ratio of the variance of the target group to the variance of the comparison group. The variance ratio should be calculated as s_t² / s_c². For our hypothetical example of the n = 20 groups, the variance ratio is 36.00 / 20.20 = 1.78.
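Both indices can be computed directly from the group means and variances reported above (the raw Appendix data are not reproduced here); a brief R sketch:

  # Matching indices for the hypothetical n = 20 groups, from the summary statistics in the text.
  mean_t <- 68.10; var_t <- 36.00    # target group mean and variance
  mean_c <- 67.10; var_c <- 20.20    # comparison group mean and variance

  cohens_d       <- (mean_t - mean_c) / sqrt((var_t + var_c) / 2)  # equal-n pooled formula
  variance_ratio <- var_t / var_c

  round(cohens_d, 2)        # approximately 0.19
  round(variance_ratio, 2)  # approximately 1.78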

Thresholds for Effect Sizes and Variance Ratios

Of course, the issue of where to set the equivalence thresholds for effect sizes and variance ratios remains ( Shadish & Steiner, 2010 ). Researchers will need to decide on meaningful thresholds based on seminal substantive studies because general guidelines are not universally applicable and should be used only when other references are not available ( Cohen, 1988 ):

"The terms 'small,' 'medium,' and 'large' are relative, not only to each other, but to the area of behavioral science or even more particularly to the specific content and research method being employed in any given investigation…In the face of this relativity, there is a certain risk inherent in offering conventional operational definitions for these terms for use in power analysis in as diverse a field of inquiry as behavioral science. This risk is nevertheless accepted in the belief that more is to be gained than lost by supplying a common conventional frame of reference which is recommended for use only when no better bases for estimating the ES [effect size] index is available" (p. 25).

An adequately small effect size for matched groups might be defined as the smallest value at which a difference in groups would be clinically meaningful ( Piaggio et al., 2006 ). Rubin (2001) proposed that the standardized mean difference be close to zero (less than half a standard deviation apart; d ≤ .5) and that the ratio of variances be near 1 (0.5 and 2 serve as endpoints that indicate very poor variance matches). Others have been more specific in defining equivalence as a standardized mean difference near zero such that it is within .1 standard deviation and a variance ratio greater than .8 and less than 1.25 ( Steiner, Cook, Shadish, & Clark, 2010 ). In research on ASD, a Cohen’s d of less than .20 has been described as trivial, but this threshold has yet to be evaluated in terms of group-matching adequacy ( Cicchetti et al., 2011 ). Steiner and Cook (2012) point out that a given effect size on the matching variable must be interpreted together with the expected effect size of the variable of interest (e.g., a Cohen’s d of 0.15 on the matching variable would not be sufficiently small if the effect of interest was expected to be 0.20).

We suggest that groups be considered adequately matched when they fall within the field’s standards for both the absolute value of Cohen’s d and the variance ratio. Table 1 lists a variety of effect sizes and variance ratios with illustrative corresponding means and variances on a matching variable for two groups. A Cohen’s d of 0.00 reflects well-matched group means; a Cohen’s d of 1.00 reflects poorly matched groups. A variance ratio of 1 indicates no difference in variances; a ratio of 2 reflects an unacceptable magnitude of difference in the spread of the distributions. For our hypothetical example, the effect size of .19 might be sufficiently small in some contexts for some researchers to conclude that the groups are matched, but taken together with the variance ratio of 1.78, it is unlikely that these two groups should be considered matched in most fields of study.

Example Standardized Mean Differences (Cohen’s d ) and Variance Ratios as Thresholds

Note . Matched-group adequacy should be evaluated with respect to both means (i.e., absolute value of the effect size) and variances. We emphasize that decisions regarding the adequacy of group matches must be reached through consensus within individual fields; we merely provide starting points based on Rubin (2001) and Steiner and Cook (2012) . Sample statistics reflect a matching variable with M = 100 and SD = 15.

Although negotiating appropriate equivalence thresholds will be far from a trivial feat, these descriptive indices of group matching have several strengths. First, effect sizes are less directly affected by sample size than are p -values. Second, effect sizes and variance ratios can be used in combination with other metrics of equivalence, including visual inspection of plots and p -values from the t -test on the matching variable. Furthermore, because means and standard deviations are usually reported for matching variables in published studies, an interested reader can calculate effect sizes and variance ratios to aid in interpreting extant findings. In Table 2 , we summarize the strengths and weaknesses of the indices of equivalence discussed, as well as the methods described in the next section.

Brief Summary of Strengths and Weaknesses of Methodologies for IDD Research

Existing Methodologies Applied to IDD Research

Simple group-matching designs are ubiquitous in research on IDDs; however, other methodological options are available. We briefly describe three classes of methodologies with strengths and weaknesses that may be unfamiliar to the reader: equivalence tests, propensity score matching, and regression-based techniques.

Equivalence Tests

Often used in medical studies to demonstrate that the difference between two treatments does not exceed some clinically meaningful equivalence threshold, equivalence tests can also be applied to behavioral research (Rogers, Howard, & Vessey, 1993; Serlin & Lapsley, 1985; Stegner et al., 1996). Schuirmann (1987) suggested a “two one-sided tests” procedure wherein one may conclude that Δ lies within the equivalence bounds (−Δ_B, Δ_B) by simultaneously rejecting both H0: Δ ≤ −Δ_B and H0: Δ ≥ Δ_B. For Westlake’s (1979) confidence interval method, equivalence is established if the confidence interval (constructed in the usual manner, but with coverage of 0.90) for Δ̂, the observed mean difference, falls entirely within the equivalence bounds (−Δ_B, Δ_B). Finally, the range equivalence confidence intervals proposed by Serlin and Lapsley (1985; 1993) stem from a good-enough principle and provide an additional alternative to “strawman” point null hypothesis testing. It is important to note that limited sample sizes may prevent equivalence methods from having the necessary power to detect a truly ‘trivial’ effect, or else triviality may need to be set at a higher magnitude than would be desired. For example, Brown et al. (2010) concluded based on equivalence testing that implicit learning is unimpaired in individuals with ASD relative to typical development, though their choice of threshold value may have been unusually large.
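A minimal R sketch of Schuirmann’s two one-sided tests built from base t-tests; the equivalence bound and the scores are hypothetical, and a real application would choose the bound on substantive grounds.

  set.seed(9)
  target     <- rnorm(20, mean = 68, sd = 6)   # hypothetical matching-variable scores
  comparison <- rnorm(20, mean = 67, sd = 5)
  bound <- 5                                   # hypothetical equivalence bound on the raw-score scale

  # Test 1 rejects H0: difference <= -bound; test 2 rejects H0: difference >= +bound.
  # Rejecting both at alpha = .05 corresponds to a 90% confidence interval inside (-bound, bound).
  p_lower <- t.test(target, comparison, mu = -bound, alternative = "greater")$p.value
  p_upper <- t.test(target, comparison, mu =  bound, alternative = "less")$p.value

  equivalent <- max(p_lower, p_upper) < .05    # both one-sided tests must reject
  c(p_lower = p_lower, p_upper = p_upper, equivalent = equivalent)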

Propensity Scores

The state-of-the-art for matching nonequivalent control groups in quasi-experimental design is propensity score analysis. With the goal of removing selection bias by modeling the probability of being in the target group, propensity scores are aggregated variables that predict group membership using logistic regression ( Fraser & Guo, 2009 ; Shadish et al., 2002 ). Propensity score analysis involves creating a single score from many variables that could be related to group membership and then matching the groups on those propensity scores ( Fraser & Guo, 2009 ). The nonequivalent control groups are often matched utilizing algorithms that, for example, select comparison participants who have scores within a defined absolute difference from a given target participant (i.e., caliper matching) or minimize the total difference between pairs of target and comparison participants (i.e., optimal matching; Rosenbaum, 1989 ).
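A minimal base-R sketch of the two steps just described, estimating propensity scores with logistic regression and then performing greedy nearest-neighbor matching; the data are simulated, and a real analysis would more likely use a dedicated package such as MatchIt.

  set.seed(10)
  n   <- 200
  age <- rnorm(n, 30, 8)
  ses <- rnorm(n, 0, 1)
  group <- rbinom(n, 1, plogis(-1 + 0.05 * (age - 30) + 0.5 * ses))  # 1 = target, 0 = comparison
  dat <- data.frame(age, ses, group)

  # Propensity score: predicted probability of target-group membership.
  ps_model <- glm(group ~ age + ses, data = dat, family = binomial)
  dat$pscore <- fitted(ps_model)

  # Greedy nearest-neighbor matching on the propensity score, without replacement.
  targets     <- which(dat$group == 1)
  comparisons <- which(dat$group == 0)
  matched_comparison <- sapply(targets, function(i) {
    j <- comparisons[which.min(abs(dat$pscore[comparisons] - dat$pscore[i]))]
    comparisons <<- setdiff(comparisons, j)    # remove the used comparison participant
    j
  })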

Propensity scores are best suited to the analysis of large datasets in which it is reasonable to assume that all variables relevant to group membership have been measured and those with complete overlap between the groups on the range of propensity scores ( Shadish et al., 2002 ). In addition, propensity score analysis may be no better than regression techniques unless the primary concern is the large number of matching variables included in the analysis ( Shadish & Steiner, 2010 ). Despite the fact that these conditions are rarely met in IDD research, there are cases in which propensity score matching has been applied. For example, Blackford (2009) used propensity score matching for data from State of Tennessee administrative databases to test whether infants with Down syndrome have lower birth weight than those without. Unfortunately, such large databases are yet unavailable to answer many research questions relevant to neurodevelopmental disorders.

Importantly, matching groups on propensity scores escapes neither the problem of determining when groups are adequately matched nor the other problems associated with matching groups on a single variable. Even with large samples and sophisticated matching algorithms, matching can be problematic when the populations of interest do not completely overlap in range. As such, group-matching procedures can lead researchers to analyze data from samples of participants that are not representative of the populations from which they were drawn or to which the researcher wishes to generalize (Shadish et al., 2002). Furthermore, when participants are chosen from the ends of their distributions to satisfy matching criteria and when matching variables are measured with error, regression to the mean becomes a concern: participants selected for their extreme, apparently nonrepresentative scores are likely to have less extreme scores on the dependent variable and/or over time (Breaugh & Arnold, 2007; Marsh, 1998; Shadish et al., 2002). Thus, propensity scores are not a panacea for researchers interested in a single matching construct or for those with limited resources to collect large samples with all measurable variables relevant to group membership.

Regression-based Methods

Analysis of covariance (ANCOVA) is sometimes used as an alternative to group matching. ANCOVA is well suited to experimental designs, where it reduces error variance and adjusts for group differences on the covariate that arise by chance. Its assumptions include that group membership is independent of the covariate, that the covariate and the outcome are linearly related, and that the slope relating the covariate to the dependent variable is identical across groups. When ANCOVA is used with preexisting groups, a researcher can expect difficult interpretation, at best, and spurious findings, at worst, because the adjustment attempts to remove part of what the group effect is thought to be (Brock, Jarrold, Farran, Laws, & Riby, 2007; Miller & Chapman, 2001). For neurodevelopmental disorders, the "selection bias" being removed is often integrally related to the causal effect of interest (e.g., background genes, maternal interaction styles, family stress, world experiences; see Newcombe, 2003, for an example related to children's socioemotional adjustment). In these cases, statistical adjustment between groups diminishes true population differences that are attributes of the disorder, yielding uninterpretable results (Dennis et al., 2009; Miller & Chapman, 2001; Tupper & Rosenblood, 1984). A particularly strong argument has been made against the use of IQ as a covariate in studies of neurodevelopmental disorders because it is inseparable from the disorder itself (Dennis et al., 2009).
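For readers less familiar with how these assumptions are examined in practice, the sketch below shows a standard ANCOVA specification and a check of the homogeneity-of-slopes assumption in R, using hypothetical data. It illustrates the mechanics only; the interpretive cautions above apply whenever the groups are preexisting.

    # Sketch (hypothetical data): ANCOVA and a homogeneity-of-slopes check.
    set.seed(3)
    dat <- data.frame(group     = factor(rep(c("target", "comparison"), each = 40)),
                      covariate = rnorm(80, 100, 15))
    dat$outcome <- 0.5 * dat$covariate + ifelse(dat$group == "target", -5, 0) + rnorm(80, 0, 5)

    ancova      <- lm(outcome ~ group + covariate, data = dat)  # covariate-adjusted group effect
    slopes_test <- lm(outcome ~ group * covariate, data = dat)  # allows the slopes to differ
    anova(ancova, slopes_test)  # a significant interaction indicates heterogeneous slopes,
                                # violating an ANCOVA assumption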

More generally, the process of choosing a matching variable or covariate should be deliberate. Preliminary tests of significance, including tests on the matching variable to decide whether it should be used as a covariate, are not recommended (Atwood, Swoboda, & Serlin, 2010; Zimmerman, 2004). Above all, the choice of covariate or matching variable is likely to have a greater impact on the conclusions drawn than the choice of analytic method and, thus, should be carefully justified on theoretical grounds (Breaugh & Arnold, 2007; Steiner et al., 2010).

Developmental trajectories and residuals

Distinct from ANCOVA, Thomas and colleagues (2009) have put forth a regression-based approach, termed cross-sectional developmental trajectories analysis, in which within-group trajectories are constructed with respect to theoretically motivated predictors and then compared across groups. From this perspective, trajectories are estimated for the dependent variable of interest relative to age or other predictors, such as nonverbal cognitive ability, and these trajectories are compared between a target group and a large comparison group. Conclusions can be drawn about group differences in intercepts (i.e., level of ability) and slopes (i.e., the relationship between a given predictor and the variable of interest). This approach has been applied to aspects of cognitive development in individuals with Williams syndrome (Karmiloff-Smith et al., 2004) and vocabulary ability in individuals with ASD (Kover et al., under revision). We refer the interested reader to the detailed substantive examples and thorough characterization of the approach provided by Thomas et al. (2009), which include an online worksheet that walks through trajectory analyses step by step.
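A minimal version of such a comparison can be expressed as a regression with a group-by-predictor interaction, as in the hypothetical sketch below; Thomas et al. (2009) describe the full procedure, including rescaling the predictor so that the intercept is interpretable at the youngest age tested. The data and group sizes here are invented solely for illustration.

    # Sketch (hypothetical data): comparing cross-sectional trajectories across groups.
    set.seed(4)
    traj <- data.frame(group = factor(rep(c("comparison", "target"), times = c(120, 30))),
                       age   = runif(150, 4, 12))
    traj$score <- ifelse(traj$group == "target", 2 + 1.0 * traj$age, 4 + 1.5 * traj$age) +
                  rnorm(150, 0, 2)
    traj$age0  <- traj$age - min(traj$age)   # rescale so the intercept falls at the youngest age

    fit <- lm(score ~ age0 * group, data = traj)
    summary(fit)   # the group term tests the intercept difference (level of ability);
                   # the age0:group term tests the slope (trajectory) difference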

A special case of this type of analysis involves standardizing the performance of the target group on the basis of residual scores (the difference between observed and predicted values) from the trajectory of the comparison group (Jarrold & Brock, 2004). These residuals, expressed as z-scores (or, alternatively, divided by the standard error of the regression estimate), can be used to assess relative deficits on multiple tasks of interest that have been standardized using the same predictor (Jarrold, Baddeley, & Phillips, 2007; Jarrold & Brock, 2004). For example, Jarrold and colleagues (2007) examined the performance of individuals with Down syndrome and Williams syndrome on memory tasks with respect to multiple control variables (e.g., age, vocabulary ability), standardized against the performance of 110 typically developing children. By standardizing performance relative to these constructs, Jarrold et al. (2007) identified distinct relationships among abilities relative to the comparison group and differentiated the nature of the long-term memory deficits in individuals with Down syndrome from those in individuals with Williams syndrome.
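The residual-based standardization can be sketched as follows, with hypothetical vocabulary and task scores standing in for the control and dependent variables: the regression is estimated in the comparison group only, and target-group residuals are divided by the standard error of the estimate from that regression.

    # Sketch (hypothetical data): standardizing target-group scores against the
    # regression estimated in the comparison group.
    set.seed(5)
    comparison      <- data.frame(vocab = rnorm(110, 100, 15))
    comparison$task <- 20 + 0.4 * comparison$vocab + rnorm(110, 0, 5)
    target          <- data.frame(vocab = rnorm(25, 70, 10))
    target$task     <- 20 + 0.3 * target$vocab + rnorm(25, 0, 5)

    fit       <- lm(task ~ vocab, data = comparison)     # trajectory in the comparison group only
    predicted <- predict(fit, newdata = target)
    resid_z   <- (target$task - predicted) / summary(fit)$sigma
    mean(resid_z)   # negative values indicate performance below the level predicted from the
                    # comparison-group relationship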

The developmental trajectories method carries fewer assumptions than ANCOVA because the regression involving the matching variable is estimated for the comparison group alone, avoiding potential violations of the assumption of independence between the covariate and group membership (Brock et al., 2007). Although it allows simultaneous analysis and "comparison" of disparate participant groups, the procedure is not without limitations. First, a very large comparison group is required. Second, transformations of the matching and dependent variables limit the extent to which results can be transparently interpreted. Finally, like other methods, this technique still requires linearity and complete overlap between the groups on the matching variable. As data sharing and access to national datasets (e.g., the National Database for Autism Research; NDAR) become more common, analytic techniques like the developmental trajectories approach will only become more valuable because of the availability of larger samples.

Summary of Recommendations for Researchers

Having brought attention to some of the methodological challenges in research on IDDs, we close with comments on the relationship between research questions and design, and on the responsible use of effect sizes and variance ratios as descriptive equivalence thresholds.

Choose Productive Research Questions

Thoughtful research questions that yield interpretable results should drive study design. We have focused on the simplest type of group-matching design (i.e., two groups and one matching variable); however, many research questions call for other applications of nonequivalent comparison designs. For example, pair-wise matching on one or more control variables might ensure more closely matched groups, but it might also call into question the generalizability of the findings (Mervis & Robinson, 1999). In some cases, studies might be strengthened by including multiple comparison groups (Burack et al., 2002; Eigsti et al., 2011) or by matching on control tasks that align very closely with the skill of interest (Jarrold & Brock, 2004). Another alternative is creating individual profiles of ability (e.g., case-study analysis), rather than group-level profiles that might fail to represent any individual from the population from which the sample was drawn (Mervis & Robinson, 1999; Towgood, Meuwese, Gilbert, Turner, & Burgess, 2009). Regardless of the research question, reporting results based on multiple matching and analysis techniques will leave the reader informed and free to draw conclusions based on maximal information (Breaugh & Arnold, 2007; Brock et al., 2007; Kover et al., under revision; Mervis & John, 2008).

Shifting the focus toward understanding individual variability avoids some of the difficulties associated with group matching, while also leading researchers closer to understanding the sources of difficulty that result in phenotypic strengths and weaknesses. Comparing unrepresentative samples provides little advantage over studying the entire range of variability within a given phenotype and identifying the foundational cognitive skills that account for individual variation (Beeghly, 2006; Eigsti et al., 2011). Adopting an individual differences approach can highlight phenotypic variability and emphasize the prerequisite skills necessary for development, ultimately supporting research that emphasizes learning mechanisms rather than outcomes. Of course, some research questions will nonetheless necessitate group comparisons.

Use Effect Sizes and Variance Ratios for Equivalence Thresholds

Group-matching studies that appropriately compare groups presumed to be equivalent on a single matching variable have the potential to provide the groundwork for stronger, well-controlled studies of greater scope. Researchers will benefit from including as many sources of information as possible when establishing group equivalence: plots of the distributions, effect sizes, variance ratios, and so on. Given the complexities faced by IDD researchers, our recommendation is that groups be considered adequately matched when both the effect size (e.g., Cohen's d) and the variance ratio fall within acceptable ranges for a particular area of research. We have provided a table of effect sizes and variance ratios that demonstrates how this technique can be applied to decisions about the adequacy of group matching; however, this table is meant to be thought-provoking, not prescriptive. In published reports, best practice would be to report effect sizes and variance ratios in all cases, for both the matching variable and the dependent variable of interest, to allow the reader to judge where meaningful differences exist.
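As a simple illustration of these descriptive diagnostics, the sketch below computes Cohen's d (using the pooled standard deviation) and the variance ratio for a hypothetical matching variable in two groups of 20; the scores are simulated and stand in for the matching-variable data a researcher would report.

    # Sketch (hypothetical scores): effect size and variance ratio for a matching variable.
    set.seed(6)
    target     <- rnorm(20, 48, 9)
    comparison <- rnorm(20, 50, 12)

    pooled_sd <- sqrt(((length(target) - 1) * var(target) +
                       (length(comparison) - 1) * var(comparison)) /
                      (length(target) + length(comparison) - 2))
    cohens_d  <- (mean(target) - mean(comparison)) / pooled_sd
    var_ratio <- var(target) / var(comparison)   # ratios far from 1 flag unequal spread
    c(d = cohens_d, variance_ratio = var_ratio)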

Conclusions and Future Directions

We have discussed the limitations of p-value thresholds and the ways in which using descriptive diagnostics (effect sizes and variance ratios) as equivalence thresholds will benefit research on neurodevelopmental disorders. Drawing the interest of methodological specialists to the study of IDDs will also be key to advancing the field. Open dialogue concerning current practices, paired with the development of improved methods for defining and testing meaningful differences, will significantly improve the design and implementation of research on IDDs.

Acknowledgments

This work was supported in part by NIH P30 HD03352 to the Waisman Center and NIH T32 DC05359. We thank Peter Steiner for his comments on an earlier draft. Following Strauss (2001), we chose to maintain a methodological focus and avoided citing substantive studies as examples, with the exception of those that have used methodologies likely to be less familiar to the reader.

Table: Hypothetical Scores on a Matching Variable from Two Groups (data not reproduced here)

Note. The shaded cells show the subset of 20 participants in each group retained in the analysis while a higher p-value was sought. They were chosen simply to demonstrate the calculation of the effect size and variance ratio, not to demonstrate adequate equivalence.

A preliminary paper was presented at the 2011 annual meeting of the American Educational Research Association in New Orleans.

Contributor Information

Sara T. Kover, University of Wisconsin-Madison, Waisman Center, 1500 Highland Avenue, Madison, WI 53705.

Amy K. Atwood, University of Wisconsin-Madison.

  • Abbeduto L. Editorial. American Journal on Intellectual and Developmental Disabilities. 2010;115(1):1–2.
  • American Psychological Association. Publication manual of the American Psychological Association. 6th ed. Washington, DC: Author; 2010.
  • Atwood AK, Swoboda CM, Serlin RC. The impact of selection procedures for nonnormal covariates on the Type I error rate and power of ANCOVA. Paper presented at the annual meeting of the American Educational Research Association; Denver, CO; 2010.
  • Bakeman R. VII. The practical importance of findings. Monographs of the Society for Research in Child Development. 2006;71(3):127–145.
  • Beeghly M. Translational research on early language development: Current challenges and future directions. Development and Psychopathology. 2006;18(3):737–757.
  • Blackford JU. Propensity scores: Method for matching on multiple variables in Down syndrome research. Intellectual and Developmental Disabilities. 2009;47(5):348–357.
  • Breaugh JA, Arnold J. Controlling nuisance variables by using a matched-groups design. Organizational Research Methods. 2007;10(3):523–541.
  • Brock J, Jarrold C, Farran EK, Laws G, Riby DM. Do children with Williams syndrome really have good vocabulary knowledge? Methods for comparing cognitive and linguistic abilities in developmental disorders. Clinical Linguistics & Phonetics. 2007;21(9):673–688.
  • Brown J, Aczel B, Jimenez L, Kaufman SB, Grant KP. Intact implicit learning in autism spectrum conditions. Quarterly Journal of Experimental Psychology (Hove). 2010;63(9):1789–1812.
  • Burack J. Editorial preface. Journal of Autism and Developmental Disorders. 2004;34(1):3–5.
  • Burack JA, Iarocci G, Bowler D, Mottron L. Benefits and pitfalls in the merging of disciplines: The example of developmental psychopathology and the study of persons with autism. Development and Psychopathology. 2002;14(2):225–237.
  • Cicchetti DV, Koenig K, Klin A, Volkmar FR, Paul R, Sparrow S. From Bayes through marginal utility to effect sizes: A guide to understanding the clinical and statistical significance of the results of autism research findings. Journal of Autism and Developmental Disorders. 2011;41(2):168–174.
  • Cohen J. Statistical power analysis for the behavioral sciences. 2nd ed. Hillsdale, NJ: L. Erlbaum Associates; 1988.
  • Dennis M, Francis DJ, Cirino PT, Schachar R, Barnes MA, Fletcher JM. Why IQ is not a covariate in cognitive studies of neurodevelopmental disorders. Journal of the International Neuropsychological Society. 2009;15(3):331–343.
  • Edgell SE. Commentary on "Accepting the null hypothesis". Memory & Cognition. 1995;23(4):525–526.
  • Eigsti IM, de Marchena AB, Schuh JM, Kelley E. Language acquisition in autism spectrum disorders: A developmental review. Research in Autism Spectrum Disorders. 2011;5(2):681–691.
  • Facon B, Magis D, Belmont JM. Beyond matching on the mean in developmental disabilities research. Research in Developmental Disabilities. 2011;32(6):2134–2147.
  • Fraser MW, Guo S. Propensity score analysis: Statistical methods and applications. SAGE Publications; 2009.
  • Frick RW. Accepting the null hypothesis. Memory & Cognition. 1995;23(1):132–138.
  • Holland PW. Statistics and causal inference. Journal of the American Statistical Association. 1986;81(396):945–960.
  • Imai K, King G, Stuart EA. Misunderstandings between experimentalists and observationalists about causal inference. Journal of the Royal Statistical Society, Series A (Statistics in Society). 2008;171:481–502.
  • Jarrold C, Baddeley AD, Phillips C. Long-term memory for verbal and visual information in Down syndrome and Williams syndrome: Performance on the Doors and People test. Cortex. 2007;43(2):233–247.
  • Jarrold C, Brock J. To match or not to match? Methodological issues in autism-related research. Journal of Autism and Developmental Disorders. 2004;34(1):81–86.
  • Karmiloff-Smith A, Thomas M, Annaz D, Humphreys K, Ewing S, Brace N, Campbell R. Exploring the Williams syndrome face-processing debate: The importance of building developmental trajectories. Journal of Child Psychology and Psychiatry. 2004;45(7):1258–1274.
  • Kover S, McDuffie A, Hagerman R, Abbeduto L. Receptive vocabulary in boys with autism spectrum disorder: Cross-sectional developmental trajectories. (under revision)
  • Marsh HW. Simulation study of nonequivalent group-matching and regression-discontinuity designs: Evaluations of gifted and talented programs. Journal of Experimental Education. 1998;66(2):163–192.
  • Mervis CB, John AE. Vocabulary abilities of children with Williams syndrome: Strengths, weaknesses, and relation to visuospatial construction ability. Journal of Speech, Language, and Hearing Research. 2008;51(4):967–982.
  • Mervis CB, Klein-Tasman BP. Methodological issues in group-matching designs: α levels for control variable comparisons and measurement characteristics of control and target variables. Journal of Autism and Developmental Disorders. 2004;34(1):7–17.
  • Mervis CB, Robinson BF. Methodological issues in cross-syndrome comparisons: Matching procedures, sensitivity (Se), and specificity (Sp). Monographs of the Society for Research in Child Development. 1999;64(1):115–130.
  • Miller GA, Chapman JP. Misunderstanding analysis of covariance. Journal of Abnormal Psychology. 2001;110(1):40–48.
  • Newcombe NS. Some controls control too much. Child Development. 2003;74(4):1050–1052.
  • Piaggio G, Elbourne DR, Altman DG, Pocock SJ, Evans SW, for the CONSORT Group. Reporting of noninferiority and equivalence randomized trials: An extension of the CONSORT statement. Journal of the American Medical Association. 2006;295(10):1152–1160.
  • R Development Core Team. R: A language and environment for statistical computing (Version 2.10.1). Vienna, Austria: R Foundation for Statistical Computing; 2010. Retrieved from http://www.R-project.org
  • Rogers JL, Howard KI, Vessey JT. Using significance tests to evaluate equivalence between two experimental groups. Psychological Bulletin. 1993;113(3):553–565.
  • Rosenbaum PR. Optimal matching for observational studies. Journal of the American Statistical Association. 1989;84(408):1024–1032.
  • Rosenbaum PR, Rubin DB. Constructing a control group using multivariate matched sampling methods that incorporate the propensity score. American Statistician. 1985;39(1):33–38.
  • Rubin DB. Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of Educational Psychology. 1974;66(5):688–701.
  • Rubin DB. Using propensity scores to help design observational studies: Application to the tobacco litigation. Health Services and Outcomes Research Methodology. 2001;2(3):169–188.
  • Schervish MJ. P values: What they are and what they are not. American Statistician. 1996;50(3):203–206.
  • Schuirmann DJ. A comparison of the two one-sided tests procedure and the power approach for assessing the equivalence of average bioavailability. Journal of Pharmacokinetics and Biopharmaceutics. 1987;15(6):657–680.
  • Serlin RC, Lapsley DK. Rationality in psychological research: The good-enough principle. American Psychologist. 1985;40(1):73–83.
  • Serlin RC, Lapsley DK. Rational appraisal of psychological research and the good-enough principle. In: Keren G, Lewis C, editors. A handbook for data analysis in the behavioral sciences: Methodological issues. Hillsdale, NJ: Erlbaum; 1993. pp. 199–228.
  • Shadish WR, Cook TD, Campbell DT. Experimental and quasi-experimental designs for generalized causal inference. Houghton Mifflin; 2002.
  • Shadish WR, Steiner PM. A primer on propensity score analysis. Newborn and Infant Nursing Reviews. 2010;10(1):19–26.
  • Stegner BL, Bostrom AG, Greenfield TK. Equivalence testing for use in psychosocial and services research: An introduction with examples. Evaluation and Program Planning. 1996;19(3):193–198.
  • Steiner PM, Cook DL. Matching and propensity scores. In: Little TD, editor. Oxford handbook of quantitative methods. New York: Oxford University Press; 2012.
  • Steiner PM, Cook TD, Shadish WR, Clark MH. The importance of covariate selection in controlling for selection bias in observational studies. Psychological Methods. 2010;15(3):250–267.
  • Strauss ME. Demonstrating specific cognitive deficits: A psychometric perspective. Journal of Abnormal Psychology. 2001;110(1):6–14.
  • Stuart EA. Matching methods for causal inference: A review and a look forward. Statistical Science. 2010;25(1):1–21.
  • Thomas MS, Annaz D, Ansari D, Scerif G, Jarrold C, Karmiloff-Smith A. Using developmental trajectories to understand developmental disorders. Journal of Speech, Language, and Hearing Research. 2009;52(2):336–358.
  • Towgood KJ, Meuwese JDI, Gilbert SJ, Turner MS, Burgess PW. Advantages of the multiple case series approach to the study of cognitive deficits in autism spectrum disorder. Neuropsychologia. 2009;47(13):2981–2988.
  • Tupper DE, Rosenblood LK. Methodological considerations in the use of attribute variables in neuropsychological research. Journal of Clinical Neuropsychology. 1984;6(4):441–453.
  • Westlake WJ. Statistical aspects of comparative bioavailability trials. Biometrics. 1979;35(1):273–280.
  • Zimmerman DW. A note on preliminary tests of equality of variances. British Journal of Mathematical and Statistical Psychology. 2004;57(1):173–181.
