
## What is a Covariate in Statistics?

In statistics, researchers are often interested in understanding the relationship between one or more explanatory variables and a response variable.

However, there may be other variables that affect the response variable but are not of interest to the researchers. These variables are known as covariates.

Covariates: Variables that affect a response variable, but are not of interest in a study.

For example, suppose researchers want to know if three different studying techniques lead to different average exam scores at a certain school. The studying technique is the explanatory variable and the exam score is the response variable.

However, there is bound to be some variation in the students’ studying abilities within the three groups. If this variation isn’t accounted for, it becomes unexplained variation in the study and makes it harder to see the true relationship between studying technique and exam score.

One way to account for this is to use each student’s current grade in the class as a covariate. A student’s current grade is likely to be correlated with their future exam scores.

Thus, although current grade is not a variable of interest in this study, it can be included as a covariate so that researchers can see if studying technique affects exam scores even after accounting for the student’s current grade in the class.

Covariates appear most often in two types of settings: ANOVA (analysis of variance) and Regression.

## Covariates in ANOVA

When we perform an ANOVA (whether it’s a one-way ANOVA, two-way ANOVA, or something more complex), we’re interested in finding out whether or not there is a difference between the means of three or more independent groups.

In our previous example, we were interested in understanding whether or not there was a difference in mean exam scores between three different studying techniques. To understand this, we could have conducted a one-way ANOVA.

However, since we knew that a student’s current grade was also likely to affect exam scores, we could include it as a covariate and instead perform an ANCOVA (analysis of covariance).

This is similar to an ANOVA, except that we include a continuous variable (student’s current grade) as a covariate so that we can understand whether or not there is a difference in mean exam scores between the three studying techniques, even after accounting for the student’s current grade.
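As a rough illustration of that adjustment, the sketch below fits the ANCOVA as an ordinary linear model with NumPy. Everything in it is an assumption made for illustration: the nine students, their grades and exam scores are invented, and the names `technique`, `grade`, and `exam` are hypothetical. It compares a model with the covariate only against a model with the covariate plus technique indicators, which is one common way to test for group differences after accounting for the covariate.

```python
import numpy as np

# Hypothetical data: 3 students per studying technique (A, B, C).
technique = np.array(["A", "A", "A", "B", "B", "B", "C", "C", "C"])
grade = np.array([70, 80, 90, 65, 75, 85, 72, 82, 92], dtype=float)  # covariate
exam = np.array([67, 73, 82, 67, 76, 82, 68.6, 77.6, 86.6])          # response

# Indicator (dummy) columns for techniques B and C; A is the reference group.
is_b = (technique == "B").astype(float)
is_c = (technique == "C").astype(float)
ones = np.ones_like(exam)


def fit(X, y):
    """Least-squares fit; returns coefficients and residual sum of squares."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    return coef, float(resid @ resid)


# Full model: intercept + technique indicators + covariate.
X_full = np.column_stack([ones, is_b, is_c, grade])
# Reduced model: intercept + covariate only (no technique effect).
X_reduced = np.column_stack([ones, grade])

coef_full, rss_full = fit(X_full, exam)
_, rss_reduced = fit(X_reduced, exam)

# F statistic for the technique effect, adjusted for current grade.
df_effect = 2                               # two indicator columns were added
df_resid = len(exam) - X_full.shape[1]      # n minus number of parameters
f_stat = ((rss_reduced - rss_full) / df_effect) / (rss_full / df_resid)

# Unadjusted differences in mean exam score relative to technique A.
raw_b_minus_a = exam[technique == "B"].mean() - exam[technique == "A"].mean()
raw_c_minus_a = exam[technique == "C"].mean() - exam[technique == "A"].mean()

print("raw differences (B-A, C-A):     ", raw_b_minus_a, raw_c_minus_a)
print("adjusted differences (B-A, C-A):", coef_full[1], coef_full[2])
print("F statistic for technique:      ", f_stat)
```

In this made-up data set, the raw group means make technique C look best, mostly because the students in group C started with higher grades. Once current grade is included as a covariate, technique B shows the largest adjusted advantage over A.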

## Covariates in Regression

When we perform a linear regression, we’re interested in quantifying the relationship between one or more explanatory variables and a response variable.

For example, we could run a simple linear regression to quantify the relationship between square footage and house prices in a certain city. However, it may be known that the age of a house is also a variable that affects house price.

In particular, older houses may tend to sell for lower prices. In this case, the age of the house would be a covariate since we’re not actually interested in studying it, but we know that it has an effect on house price.

Thus, we could include house age as an explanatory variable and run a multiple linear regression with square footage and house age as explanatory variables and house price as the response variable.

The regression coefficient for square footage would then tell us the average change in house price associated with a one-unit increase in square footage after accounting for house age.
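To see what "after accounting for house age" does to the coefficient, here is a minimal sketch using NumPy least squares. The six houses are invented, and the prices are generated from an assumed rule (a base price plus 100 per square foot minus 1,000 per year of age) purely so that the true coefficients are known. The helper `fit_ols` is a hypothetical name, not a library function.

```python
import numpy as np

# Hypothetical houses: in this made-up sample, larger houses also tend to be newer.
sqft = np.array([1000, 1200, 1500, 1800, 2000, 2400], dtype=float)
age = np.array([40, 35, 30, 20, 10, 5], dtype=float)

# Assumed data-generating rule (no noise), chosen only for illustration.
price = 50_000 + 100 * sqft - 1_000 * age


def fit_ols(y, *predictors):
    """Return least-squares coefficients: intercept first, then one per predictor."""
    X = np.column_stack([np.ones_like(y), *predictors])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef


# Simple linear regression: price on square footage only.
simple = fit_ols(price, sqft)

# Multiple linear regression: square footage plus the covariate (house age).
adjusted = fit_ols(price, sqft, age)

print("sqft coefficient, ignoring age:     ", simple[1])
print("sqft coefficient, accounting for age:", adjusted[1])
print("age coefficient:                     ", adjusted[2])
```

Because age and square footage are correlated in this sample, the simple regression credits square footage with part of the age effect and overstates its coefficient. Including age as a covariate recovers the per-square-foot value used to generate the data.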

## Additional Resources

- An Introduction to ANCOVA (Analysis of Covariance)
- How to Interpret Regression Coefficients
- How to Perform an ANCOVA in Excel
- How to Perform Multiple Linear Regression in Excel

## Featured Posts

Hey there. My name is Zach Bobbitt. I have a Masters of Science degree in Applied Statistics and I’ve worked on machine learning algorithms for professional businesses in both healthcare and retail. I’m passionate about statistics, machine learning, and data visualization and I created Statology to be a resource for both students and teachers alike. My goal with this site is to help you learn statistics through using simple terms, plenty of real-world examples, and helpful illustrations.

## One Reply to “What is a Covariate in Statistics?”

It is really easy to understand. Thank you, Zach!


## Confusing Statistical Terms #5: Covariate

by Karen Grace-Martin

Covariate really has only one meaning, but it gets tricky because the meaning has different implications in different situations, and people use it in slightly different ways. And these different ways of using the term have BIG implications for what your model means.

The most precise definition is its use in Analysis of Covariance, a type of General Linear Model in which the independent variables of interest are categorical, but you also need to adjust for the effect of an observed, continuous variable–the covariate.

In this context, the covariate is always continuous, never the key independent variable, and always observed (i.e. observations weren’t randomly assigned its values, you just measured what was there).

A simple example is a study looking at the effect of a training program on math ability. The independent variable is the training condition–whether participants received the math training or some irrelevant training. The dependent variable is their math score after receiving the training.

But even within each training group, there is going to be a lot of variation in people’s math ability. If you don’t adjust for that, it is just unexplained variation. Having a lot of unexplained variation makes it pretty tough to see the actual effect of the training–it gets lost in all the noise.

So if you use pretest math score as a covariate, you can adjust for where people started out. So you get a clearer picture of whether people do well on the final test due to the training or due to the math ability they had coming in.

Okay, great. Where’s the confusion?

## Covariates as Continuous Predictor Variables

The confusion is that, really, the model doesn’t care that the covariate is something you don’t have a hypothesis about. Something you’re just adjusting for. Mathematically, it’s the same model, and you run it the same way.
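As a sketch of what "the same model" means for the math training example, the model can be written as a single linear equation. The notation here is assumed for illustration and does not come from the original post: $Y_i$ is participant $i$'s math score after training, $T_i$ is 1 for the math training and 0 for the irrelevant training, and $X_i$ is the pretest score.

$$
Y_i = \beta_0 + \beta_1 T_i + \beta_2 X_i + \varepsilon_i
$$

Here $\beta_1$ is the training effect adjusted for pretest score, and $\beta_2$ is the slope for the pretest. Nothing in the equation marks $X_i$ as "only a control"; that distinction lives in the research question, not in the math.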

And so people who understand this often use the term covariate to mean ANY continuous predictor variable in your model, whether it’s just a control variable or the most important predictor in your hypothesis. And I’m guilty as charged. It’s a lot easier to say covariate than continuous predictor variable.

But SPSS does this too. You can run a linear regression model with only continuous predictor variables in SPSS GLM by putting them in the Covariate box. All the Covariate box does is define the predictor variable as continuous.

(SAS’s PROC GLM does the same thing, but it doesn’t specifically label them as Covariates. In PROC GLM, the assumption is all predictor variables are continuous. If they’re categorical, it’s up to you, the user, to specify them as such in the CLASS statement.)

## Covariates as Control Variables

But the other part of the original ANCOVA definition is that a covariate is a control variable.

So sometimes people use the term Covariate to mean any control variable. Because really, you can covary out the effects of a categorical control variable just as easily.

In our little math training example, you may be unable to pretest the participants. Maybe you can only get them for one session. But it’s quick and easy to ask them, even after the test, “Did you take Calculus?” It’s not as good of a control variable as a pretest score, but you can at least get at their previous math training.

You’d use it in the model in the exact same way you would the pretest score. You’d just have to define it as categorical.
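A minimal sketch of that idea, assuming invented data: the yes/no answer is turned into a 0/1 indicator and entered alongside the training condition. The eight participants and the names `training`, `took_calculus`, and `score` are hypothetical, and the scores were constructed so that the training effect is 8 points and having taken Calculus adds 10.

```python
import numpy as np

# Hypothetical participants: 1 = math training, 0 = irrelevant training.
training = np.array([1, 1, 1, 1, 0, 0, 0, 0], dtype=float)
took_calculus = np.array(["yes", "yes", "yes", "no", "yes", "no", "no", "no"])
score = np.array([78, 78, 78, 68, 70, 60, 60, 60], dtype=float)

# Define the control variable as categorical by coding it as a 0/1 indicator.
calc_dummy = (took_calculus == "yes").astype(float)

# Unadjusted comparison: difference in mean score between the two groups.
raw_difference = score[training == 1].mean() - score[training == 0].mean()

# Adjusted comparison: training effect from a model that also includes
# the categorical control variable.
X = np.column_stack([np.ones_like(score), training, calc_dummy])
coef, *_ = np.linalg.lstsq(X, score, rcond=None)
adjusted_difference = coef[1]

print("unadjusted training difference:", raw_difference)
print("adjusted for Calculus:         ", adjusted_difference)
```

In this made-up sample, more of the trained participants had taken Calculus, so the unadjusted difference overstates the training effect. Adding the indicator brings the estimate back to the value used to build the data.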

Once again, there isn’t really a good term for categorical control variable, so people sometimes refer to it as the covariate.

## So what is a covariate then?

It’s hard to say. There is no disputing the first definition, so it’s clear there.

I prefer to just be careful, in setting up hypotheses, running analyses, and in writing up results, to be clear about which variables I’m hypothesizing about and which ones I’m controlling for, and whether each variable is continuous or categorical.

The names that the variables, or the models, end up having, aren’t important as long as you’re clear about what you’re testing and how.


## Reader Interactions

May 23, 2022 at 10:53 am

Hello everyone,

In my study, I have plant (plant population) density as factor, is it necessary to include the same plant density (number of plants at harvest) as covariate? I have been advised to do so and I urgently need your help on this.

May 14, 2020 at 3:36 am

This really helped me a lot – thanks!

September 14, 2019 at 2:53 am

Hi Thank you for your article here. I’ve read a paper which said “including gender as a covariate”, but I learned the covariate must be continuous in ANCOVA, so it really baffled me until I found your explanation here!

But there is still a question, if the categorical covariate has an interaction with the IV, how can i report it? Report the main effect of IV hierarchically? Or if there is an interaction ,the variate can not be taken as a covariate?

November 29, 2021 at 9:20 pm

Your definition of a covariate in ANCOVA is completely at odds with that given in Whitlock and Schluter (2020). The Analysis of Biological Data, 3rd ed. They specify that the covariate is categorical, and the main effect (factor/explanatory variable) is numerical (pretty much the opposite of what you state). And by the way, a numerical variable does not have to be “continuous”; it may be discrete (e.g, counts, which are integers). So you’ve even confused the term continuous with numerical.

November 30, 2021 at 3:39 pm

Very interesting. I don’t have that book. Yes, it’s absolutely true that many authors use “covariate” to mean different things. But I’ve never, ever, heard of a factor in an ANCOVA being numerical. Look in any other book on ANOVA. That is so backward from usual usage I wonder if it’s a typo. I’d have to see the exact wording to really comment. That said, Factor is also another confusing term, in that it means something entirely different in the context of Factor Analysis, where it is continuous.

And you’re absolutely correct that numerical variables don’t have to be continuous. The difference is very important for dependent/outcome/response variables, since it affects the type of model you use. It’s not a big difference for predictors, since if you fit a line to a discrete predictor, you’re technically treating it as continuous.

April 14, 2019 at 7:22 pm

Brilliant post, that really unlogged a lot for me.

I was just wondering if you had information on what the actual math looked like for adding a covariate into a regression? Just as a basic example. i.e. How do you ‘adjust for the effect of said covariate’?

Sorry if I’m getting my terms mixed up.

January 31, 2019 at 6:50 am

I am confused by your example? Why do you use pre math scores as a covariate and not as timepoint 1 in a repeated measures design? Within variables: time (pre post), dependent variable (training 1 or two), no covariate. Thank you in advance Jan

March 4, 2019 at 11:17 am

It depends on the research question. See this: https://www.theanalysisfactor.com/pre-post-data-repeated-measures/

December 6, 2018 at 8:43 pm

Hi Karen, Our study’s respondents are the left behind emerging adults. The IV is the psychological distress they are undergoing and it includes academic distress. Is it right to ask if they are in college level? Though it won’t be ideal to use it because not all emerging adults are college students but mostly are. It would also help determine the academic distress which I said earlier. Thank you!

November 30, 2018 at 8:34 am

I really need your help. I don’t know what test (and I cant find any source of information about it) should I perform in this circumstances:

One IV (e.g. Gender) Two DV (continuous) One or Two Covariates (both ordinal).

What to do if the Covariates are ordinal?

November 30, 2018 at 11:52 am

Hi Armindo, I can’t give advice without really digging into the details of things like your research questions and the roles of these variables.

But I can comment on a couple things. Ordinal predictors usually need to be treated as categorical: https://www.theanalysisfactor.com/pros-and-cons-of-treating-ordinal-variables-as-nominal-or-continuous/

Once you’ve got more than one DV, you’re into multivariate statistics. So some version of a MANOVA or Multivariate linear model. https://www.theanalysisfactor.com/multiple-regression-model-univariate-or-multivariate-glm/

August 22, 2018 at 3:56 pm

What should you do if you have dramatically different sample sizes across levels of a categorical variable you are including as a covariate? For example, if you are controlling for gender with 4 categories (i.e., man, woman, prefer to self-describe, prefer not to say), is there a citation that supports either collapsing the last gender categories or even excluding them from the analysis because of their extremely small sample size to avoid skewing your results? Even with collapsing, however, you could still run into the same sample size problem. Is there a general consensus about how to handle this type of issue?

October 12, 2018 at 11:28 am

I don’t know there is a consensus, other than to be thoughtful about it and consider the pros and cons of different approaches. You may want to read this: https://www.theanalysisfactor.com/when-unequal-sample-sizes-are-and-are-not-a-problem-in-anova/

August 31, 2017 at 12:26 pm

I want to run a report where I think I will need an ANCOVA for the analysis.

I want to test 2 types of clinical outcomes for one rehab programme to see how they compare in picking up changes in pain and function. I have asked participants to fill out the 2 outcomes pre and post intervention (2 times points only).

I have the data and was thinking to run simple t-tests to show that the intervention has been successful in reducing pain and increasing function for each of the outcomes used.

Then, I want to ensure that both outcomes have picked up change successfully. In order to do this I want to run an ANCOVA, using the baseline pain scores as the covariate. This is in order to show that those with higher pain to start were perhaps less likely to end up with lowest pain scores post. Does this make sense? Here the IV would be the outcome used (either A or B) and the DV would be the pain score recorded post programme.

I would hope that the p-value from the ANCOVA printout would be non-significant, showing that both outcomes (A and B) had no difference in means. However, I am getting a little confused about how I would report on the data from the covariate.

June 19, 2017 at 7:12 am

I want to do an analysis, and I think it has to be an ANCOVA, but my independent variables are probably (hopefully) not independent of each other (which is an assumption of the ANCOVA, right?). I want to analyse: first independent variable: condition (4) and second independent variable: the difference score of two measurement points.

But the difference score should be dependent of the conditions (at least I hope so) My dependent variable is the recognition score, which should not be dependent of the conditions and the difference score

So my questions is if I can use an ANCOVA because my two independent variables are linked to each other…

Thank you very much in advance!!!

Verda Simsek

May 17, 2017 at 1:58 pm

Hi Karen, I just want to ask if the covariates will have their own F value in the ANOVA. I was running a linear mixed model ANOVA with 3 fixed factors and 1 random factor but with an extra 2 covariates. But in the ANOVA output, the covariates are not there. Should their significance be shown in SPSS too?

April 10, 2018 at 2:07 pm

I’m not sure when your reply was posted, but I figured I would reply. If too late to help you, at least others may benefit:

The primary purpose of a covariate is to illustrate an effect above an observed effect that goes beyond your manipulation. Therefore, there is already an assumption that your covariates are significantly correlated with the dependent variable. If this were not the case, there would be no point of putting them into your analysis. Basically, SPSS has no need to tell you of your significance of the covariate because you should already know that. Instead, what you should do to observe the ANCOVA at work is to analyze your model with and without the 2 covariates. What you should see is that you had more power without the covariates. However, in exchange for less power, you provide evidence that your effect extends beyond known potential confounds.

Best of luck!!

April 11, 2018 at 10:25 am

Actually, I’m surprised that SPSS didn’t include those covariates in the ANOVA table. Yes, they should be there and yes, you need to test them.

I disagree that covariates should only be there to look for effects beyond a manipulation. That may be true in experimental studies that actually have a manipulation. But many good models have both categorical predictors and continuous ones. Whether that categorical predictor is a manipulation only affects the interpretation, not the model.

May 5, 2017 at 1:54 am

Thanks for this. It’s very clear. It’s very helpful for communicating with my economist collaborator, who keeps insisting on using regression instead of ANCOVA.

I must choose if I will analyze my data using regression or ANCOVA. The main problem is, if I use ancova then I will have to use all variables as covariate and use no fixed factor at all.

So my problem is to decide if a sensory data (ranking of respondents liking 1-4, like most to dislike most) can be used as fixed factor. Or should I dummy code them and put under covariate instead of fixed factor?

Please helpppp.

Best regards Mulia

March 18, 2017 at 6:23 pm

Thanks for this awesome article. I was analysing paper( https://eric.ed.gov/?id=EJ850774 ) that used “covariate” term in its abstract and it was really hard to get an idea of what the authors think covariate is. But now, after reading your explanations, it is much more clear!

February 21, 2017 at 5:26 am

hi karen, I would like to ask how to compare scores of anxiety for two means between independent variables [group 1 VS group 2]. with no experimental design or manipulation.

however, i want to prove that the difference in mean scores is accounted for even after the introduction of a control variable [which is a continuous score on a Life Stress Scale].

please advice on how i should do this? using SPSS. thank you so much 🙂

November 25, 2016 at 5:19 am

Hi Karen, I want to do a covariate analysis. My response variable is binomial (Responder vs Non-Responder) but the covariate is a continuous variable. Which statistical tool would you suggest for a robust analysis?

Thanks@Ashwani

September 27, 2016 at 2:43 am

One other issue with covariates arises when actually using the results in a subsequent decision process. Variables controlled for are akin to what is lost when partially differentiating. It is important to understand the need to put this information back when making decisions. By this I mean, if we take the math training example you used above, within the population studied, there may be subsets who do well with a particular method and those who don’t. If you simply control for this variation with pre-test scores, you effectively average the variation away in your analysis. When this is done, it is critically important to understand that you are seeing an averaged picture (and that many potentially important pieces of information are absent). I loved how Good and Hardin saw fit at the beginning of the first chapter of their Common Errors in Statistics, to state unequivocally that statistics should never be used as the sole basis for making decisions. Thus ‘missing or discarded information’ is in part, to me, why. (Essentially you need either strong uniformity conditions on a population, or explicit inspection, before you can be confident a statistical result applies to a particular instance.)

April 10, 2018 at 2:18 pm

I’m not sure if you’re mistakenly generalizing one concept to another. A successful ANCOVA will discard unwanted information. For example, let’s say a cognitive task is known to have a gender effect. I could use gender as a discrete covariate (either in ANCOVA or a multiple regression with gender as a grouping variable) to show an effect exists beyond gender. Essentially, I would be saying gender does not matter for my decision.

The topic you are talking about (I believe) is how many statistics deal with averages. In doing so, we need to keep in mind that individual differences exist. While this is true of covariates as well, I think that while a single measure is never enough for determining an effect, a covariate can help someone decide if something should not be a factor.

April 11, 2018 at 10:30 am

Agree. Much of what you’re describing sounds like situations where there is an interaction involved.

But yes, there is a fundamental concept in the decision making literature that statistics apply to groups, not individuals, and that what is best for the average may not be best for any given individual or situation.

August 25, 2015 at 1:40 pm

Hi Karen, I have 45 participants who received an exercise intervention. I am observing the change in their physical performance pre-post and after 8 months of the intervention. I have found group differences by using RM ANOVA. Now I need to control some covariates/ confounding variables (categorical variables) like age group, marital status, education level etc. How shall I do it? I am using SPSS version 22. Please suggest the most suitable tool to use. Please also comment if I could get that tool in the following options:

Option 1: While defining the independent variable in the process of RM ANOVA, there are two more spaces available to put data on ‘between subject factor’ and ‘covariate’. Shall I put all 8 covariates under the space labelled ‘covariates’? If I can do that, Q1. Will that actually control these variables all together? Q2. Will it lose power in doing so? Q3. How shall I then term the tool of analysis? RM ANOVA with covariates or RM ANOVA or any other term? Q4. Shall I use them as time*cov for all the covariate separately?

Option 2: Shall I use mixed method ANOVA by putting one particular covariate in the ‘between subject factor’ and see if it has any effect? Q.1 Shall I look at the between subject effect in the output file to see the impact? Or need to look at the effect size or both? Q2. Shall I need to put other 7 covariates under the space labelled ‘covariates’ while considering one covariate as a between subject factor? Q3. How shall I then term the tool of analysis?

Looking forward to hearing from you.

Many thanks in advance.

July 5, 2015 at 8:01 am

I really like your explanation about the term. Enjoyed reading it. Thanks a lot. It certainly helps to stop the arguments between my students and me.

Best, Claudia

May 31, 2015 at 3:57 pm

Hi Karen, Thanks for the helpful article. I’m trying to decide which variables to include as control variables in my regression model. Aside from theoretical reasons, I have examined my correlation matrix. I have a variable that correlates with my IV, but not with my DV…Am I correct in assuming that this variable should NOT be controlled for in the model (since it is unrelated to the DV)?

Thanks for your help!

April 10, 2018 at 2:22 pm

Not sure if it is too late to help, but generally, you are correct. Covariates are variables that vary with the DV, and can be an alternative explanation for the effect of your IV. If no correlation exists with the DV, there is no need to control for it. However, keep in mind that for an ANCOVA your IV needs to be a discrete variable, and the DV needs to be continuous. Linear regression would not properly diagnose a relationship between these two.

May 22, 2015 at 12:28 pm

Thanks so much for this clarification. So, in other words, if I want to control for a categorical variable, I still run ANOVA, and if I want to control for a continuous variable, I run ANCOVA. *phew*

June 3, 2016 at 9:15 am

Yes, exactly.

May 10, 2015 at 2:27 pm

Hi Karen, thank a lot for the site. Its really great. Currently, I’m designing an experiment on stress treatment of plants with single (heat or drought) or combined (heat+drought) fixed effect IVs to measure one or more response DVs. Each stress treatment will be applied with several levels at (combined with) 3 time points. Measurement(s) are from different experimental units and not from the same individual plant. I’m thinking using either 2-way ANOVA or MANOVA depends on the no. of DV to be measured. However, I’m a little bit confused about the time point factor. I would expect that the measurements from control plants will not be significantly changed by time (as there is no stress), but under (e.g. heat treatment) I would expect to find a significant change in the main effect factor between the 3 time points. Shall I consider the time point as a covariance factor in this case, and use, for example, one-way ANOVA instead of two-way to measure one DV under one heat. Do you recommend any specific analysis or model different from above. I would appreciate a lot your advice and help

March 14, 2015 at 9:28 pm

Hi Karen, I need some help with choosing the statistical analysis. Details of my study. all participants will complete surveys measuring self compassion, whether their psychological needs are met, how high on non attachment scale they are and other trait scales.

then, Half will be given some induction training on how to meditate. half will not.

after that, everyone will be tested to see how mindful and meditative they were using some breath counting measures.

i want to see relationship between the outcome and the survey responses and the training received.

for eg. someone whose needs are met and who received training did well on being mindful. what about someone whose needs are met but was part of the group that did not receive training, what if he does well or what if he doesn’t.

is the survey responses the covariates? How would i write up the research question?

would the test be correlation or regression? ANCOVA?

My original hypothesis was needs met led to being good at meditation. But now the control group has been included, I am confused.

Please help. Thanks in advance

January 8, 2015 at 2:46 pm

Just wanted to say thank you for the site and the easy to follow text. Great job!

December 5, 2014 at 11:07 am

First of all, thank you for your site because it has helped me a lot of times 😉

I have a doubt in the statistical analysis of my study. I am studying stress on mothers and fathers (two independent samples). An important variable in my study is the number of children (usually mothers or fathers with more children feel more stress) and my subjects have between 1, 2 or 3 children. However because I am only looking for gender (mothers vs fathers) differences in stress I want to control this variable (number of children) in my sample. Thus, I decided to apply a chi-square analysis to see if my sample of mothers differed from my sample of fathers in the number of children. The test is non significant so i assumed that my two independent samples do not differ in number of children and that number of children is not a covariate when comparing the mothers and fathers of my sample. Is this correct?

Thanks a lot*

December 1, 2014 at 11:20 pm

Very helpful. Thanks

November 25, 2014 at 2:16 pm

Maybe you could clarify something for me. I have a model that consists of 3 independent variables and one dependent variable. However, in my research I identify two other concepts that act as mediators (social exchange and perceived organizational support). Would these concepts be considered as covariates?

November 30, 2014 at 11:56 am

Mediators are a little different than covariates. See this: https://www.theanalysisfactor.com/five-common-relationships-among-three-variables-in-a-statistical-model/

July 30, 2014 at 1:12 am

I found your review of covariates really helpful. My research involves exploring the impact of anxiety on communication. I’m using a transmission chain methodology where one person reads a story and reproduces that for the next person in the chain, who reproduces it for the next person in the chain and so on until 4 people have read and reproduced the story. I used a mood induction procedure, which is my between subjects factor. Typically these studies classify the ‘generations’ in the transmission chain as a within subjects factor as the output of person 2-4 is dependent on what they received from the previous person (as if you are taking measurements at different points in time). I am measuring the number of positive and negative statements produced by each person in their reproductions. So what I have done is a 2x4x2 ANOVA.

What I also want to do is explore the impact of trait anxiety, which I measured prior to testing and is a continuous variable. Is it possible to enter trait anxiety as a covariate in an ANOVA/ANCOVA to determine if trait anxiety was related to/or differentially impacted performance under each condition? The hypothesis is that high trait anxiety participants, under a negative mood induction would show a different pattern of results to low trait anxious.

I can perform a median split and enter it as a between subjects factor but this will result in a small number of observations, particularly once I try to look at more than one generation and I’m worried about the loss of power.

I would really appreciate your advice.

All the best, Keely

May 26, 2016 at 6:18 am

I’m having the same problem 🙁

June 12, 2014 at 7:45 am

Dear Karen,

i was wondering if a covariate is the same as the mediator in a repeated measurement model? I am using it that way. But maybe there is another way to test a mediation?

Thanks in advance for your help

April 22, 2014 at 3:46 pm

What does the phrase “covary out the effects” mean? Your article has been very helpful in helping to clear up confusion with terms!! But I’m still at a loss as to why this phrase was used in the context of a test evaluation. In discussing potential changes to a control group (outside the bounds of the test), we were told they could “covary out the effects” of a change.

May 7, 2014 at 10:59 am

Great question, and one that will take another article to describe. 🙂 I’ll add that to my list of future articles to write.

October 15, 2013 at 4:29 pm

Hi, I am new to this site and this has always confused me. How do you “control” for a variable? Could you please explain, how “controlling “works? One of the ways to control a variable might be just taking random samples, i.e., if you want to control for age, then take a rs from all age groups. Another example would be like a “control group” (placebo in a drug experiment). Am I correct in the above examples? Also, I assume there are other standard techniques, could you please clarify how they work?

October 16, 2013 at 9:56 am

That’s a big question. Or rather, a small question with a big answer. I will see if I can write a post (or two or 10) explaining it.

October 17, 2013 at 10:11 am

Thanks Karen! Looking forward to it!!

July 11, 2013 at 7:50 pm

Hi Karen, Thanks for this article! I just want to further clarify some of the discussion on dichotomous covariates/fixed factors. Specifically, I am running a chi-squared test with one dichotomous outcome and one dichotomous predictor, to see how well group membership of the two-level predictor discriminates group membership of the two-level outcome. Now, I want to ensure that gender (I think my covariate/fixed factor) does not modify my predictor’s ability to discriminate membership of my outcome variable. To answer this question I plan to run the model where both genders are combined and separately for each gender. The answer is that the discriminative ability of my predictor does depend on gender to some extent.

So my question is – is gender a fixed factor here? is there a better word for it? Random factor?

Thank you so much!

July 15, 2013 at 3:44 pm

Within the context of SPSS GLM, Gender is a fixed factor. Don’t make it random–that’s a whole other thing!

But if you’re doing a chi-square, Fixed Factor and covariate aren’t really issues. Just add it in as another variable.

June 16, 2013 at 2:06 pm

Easy to understand definition of Covariate for ANCOVA. Thank you!

May 31, 2013 at 5:01 am

I’m running an ANOVA with repeated measures. As I can see from the correlation matrix, there are significant correlations between my possible confounding variables and my dependent variable, but here is my problem: if the possible covariate correlates with the dependent variable at time 2 but not at time 1, do I have to include it as a “normal” covariate when I perform the GLM?

Thanks for any help. Anika

June 6, 2013 at 5:30 pm

It would be a good idea to include a covariate*time interaction. That will allow the effect of the covariate to be different at time 1 and time 2.

April 24, 2013 at 3:05 pm

I have a follow-up question please. I am looking at memory performance in young and older adults under two conditions: when a negative or a positive stereotype is activated. I’m using a mixed model to compare performance across time (before/after the intervention) across the two age groups and conditions.

I obtain a significant difference between the two age groups on levels of verbal IQ (NART scores) which is continuous, and hence a covariate (that exerts a significant effect).

The paper that I am basing my study on also obtains a difference between age groups over verbal IQ scores. They do not obtain a significant difference within age groups but between conditions, however, and so have not included it as a covariate. This seems wrong to me. Surely if a significant difference over a background variable occurs on one of your IVs you should include the covariate in the model, regardless of whether there’s a difference between groups on the second IV?

If you could clear this up for me I’d really appreciate it, as I am confused!

April 29, 2013 at 6:45 pm

I am missing something. Does verbal IQ relate to the DV? (It seems it would but you don’t mention that). So in this paper, the two stereotype condition groups have different verbal IQ scores, but age groups didn’t? It really comes down to whether the potential covariate is related to the DV.

March 21, 2013 at 9:58 pm

Thanks for the information. The way you write is very clear and easy to understand. Thanks~

April 2, 2013 at 5:43 pm

March 13, 2014 at 9:15 pm

So can we say based on your answer that every confounding variable is a covariate but not every covariate is a confounding variable?

April 4, 2014 at 9:55 am

Depends on how you’re using them. 🙂

October 9, 2012 at 2:37 pm

What is the difference between a confound and a covariate in simple terms?

October 23, 2012 at 3:36 pm

Alexa, that’s a really great question. A confound is a variable that is so closely associated with another variable (perfectly, or so nearly perfectly that you can’t distinguish them) that you can’t tell their effects apart.

In most areas of the US, for example, neighborhood and school attended overlap so much because most kids from a neighborhood all go to the same school. So you couldn’t separate out the school effects on say, grade 3 test scores, from the neighborhood effects.

A covariate is a variable that affects the DV in addition to the IV. It doesn’t have to be correlated with the independent variable. If not, it may just explain some of the otherwise unexplained variation in the DV.

October 6, 2012 at 12:12 am

Thanks Karen! I’m so happy this website exists. I found this page because I am stuck on something related but at a way lower level (I am no statistician). I’m running a repeated measures ANOVA in SPSS, using GLM. I need to control for a between-subjects categorical variable that might be adding noise to the data and washing out any effects of my factors. I can’t figure out if the right way to do this is to put it in as a between-subjects factor, or pretend that is a continuous variable and put it in the covariate box. Can you help?

October 8, 2012 at 9:02 am

Yes. In fact, there is already an article here on that exact topic. It’s the same in all SPSS GLM procedures, whether you’re using univariate, repeated measures, etc. https://www.theanalysisfactor.com/spss-glm-choosing-fixed-factors-and-covariates/

February 11, 2013 at 7:25 am

I’m still a little confused on the same issue as Emily, despite reading the article you suggested. Although the suggested article seems to clearly spell out that any true categorical variable, including dichotomous variables, should be included in an ANOVA model as a fixed factor rather than a covariate, the article above states

“In our little math training example, you may be unable to pretest the participants. Maybe you can only get them for one session. But it’s quick and easy to ask them, even after the test, “Did you take Calculus?” It’s not as good of a control variable as a pretest score, but you can at least get at their previous math training…You’d use it in the model in the exact same way you would the pretest score. You’d just have to define it as categorical.”

It’s the last sentence that I get stuck on – because if you included a dichotomous variable in the model in the exact same way you would a continuous pre-test score, you would include it as a covariate, not as a fixed factor.

Sorry if this seems obvious, perhaps I’m getting caught up in the terminology too! Any clarification would be greatly appreciated, I’ve found your explanations to be more helpful than most!

February 13, 2013 at 3:04 pm

SPSS’s definition of “Covariate” is “continuous predictor variable.” Its definition of “fixed factor” is “categorical predictor variable.” (There’s actually more to this in comparing fixed and random factors, but that’s a tangent here).

“It’s the last sentence that I get stuck on – because if you included a dichotomous variable in the model in the exact same way you would a continuous pre-test score, you would include it as a covariate, not as a fixed factor. ”

I don’t mean define it the same way, I mean use it as a control variable in the same way. I’m trying to separate out the use of the variable in the model (as something to control for vs. something about whose effect you have a hypothesis) from the way it was measured and therefore needs to be defined (categorical vs. continuous).

So yes, it goes into Fixed Factors because it’s categorical.

And don’t apologize for getting confused with terminology–that’s my whole point. The imprecision of the terminology is what makes it so confusing! 🙂 Karen

October 31, 2013 at 5:29 pm

Hello Karen

I was going through the discussion and had the same confusion. My study evaluates the effectiveness of a school-based program on preschoolers’ behavior problems. My intervention and control groups differ on strength of students in class (measured in categories) and father’s education (also measured in categories). I need to see if the ANCOVA results remain significant after controlling for these baseline differences. The covariate option in SPSS should be a continuous measure, so how should I do the analysis with categorical variables? Please answer soon.

November 8, 2013 at 11:36 am

If a control variable is simply a categorical variable, put it into “Fixed Factors” instead of Covariate in Univariate GLM. By default, SPSS will also add in an interaction term, but you can take that out in the Design dialog box.

fyi, if it’s helpful, we have a workshop available on demand that goes through all these details of SPSS GLM: http://theanalysisinstitute.com/spss-glm-ondemand-workshop/

August 15, 2012 at 10:40 am

Please, what is the relationship between the covariate’s partial eta squared in an ANCOVA result and the partial eta squared of the treatment and moderator variables? What is the implication if that of the covariate (which can be the pretest) is higher than that of the main treatment or moderator variable?

September 11, 2012 at 4:54 pm

I don’t know–I’d need more information on the model. For example, is the covariate different from the moderator? Which interactions are being included?

Thanks, Karen

July 19, 2012 at 9:26 am

Thanks a lot Karen. Your notes were very helpful. I have been looking for the answers in tens of books for several months. Thank God, many of my uncertainties about the GLM command are answered today on your site. FYI, in my area of study (accountancy), the GLM command is almost nonexistent in the literature.

July 19, 2012 at 10:43 am

I’m so glad.

Most of the time in the literature it will be called ANOVA, ANCOVA or linear regression. But they’re all the same model and all can be run in GLM.

May 2, 2012 at 1:05 am

Very clear. Found the answer I was looking for. Thanks much! I’ll pass on word of your site.

May 3, 2012 at 2:39 pm

Thanks, Ben! Glad it was helpful.

April 11, 2012 at 9:33 am

Hi Akinboboye,

I would suggest starting with the SPSS category link at the right.

If you need more help at the beginning level, I’d be happy to send you my book (I have a few extra copies) for the cost of shipping. Please email me directly.

If you want more help with the concepts discussed above, you really want the Running Regressions and ANCOVAs in SPSS GLM workshop. It walks you through the univariate GLM procedure step-by-step and shows where it’s the same and where it’s different from the regression procedure. You can get to that here: http://www.theanalysisinstitute.com/workshops/SPSS-GLM/index.html

It’s not running right now, but you can use our contact form to get access to it as a home study workshop.

Best, Karen

April 10, 2012 at 7:57 pm

I really enjoy this write-up. Kindly send me details on how to use spss. Thanks


## What is a Covariate in Statistics?

In statistics, researchers are often interested in understanding the relationship between one or more explanatory variables and a response variable.

However, occasionally there may be other variables that can affect the response variable that are not of interest to researchers. These variables are known as covariates.

Covariates: Variables that affect a response variable, but are not of interest in a study.

For example, suppose researchers want to know if three different studying techniques lead to different average exam scores at a certain school. The studying technique is the explanatory variable and the exam score is the response variable.

However, there’s bound to be some variation in the students’ studying abilities within the three groups. If this isn’t accounted for, it will be unexplained variation within the study and will make it harder to see the true relationship between studying technique and exam score.

One way to account for this could be to use each student’s current grade in the class as a covariate. A student’s current grade is likely to be correlated with their future exam scores.

Thus, although current grade is not a variable of interest in this study, it can be included as a covariate so that researchers can see if studying technique affects exam scores even after accounting for the student’s current grade in the class.

Covariates appear most often in two types of settings: ANOVA (analysis of variance) and Regression.

## Covariates in ANOVA

When we perform an ANOVA (whether it’s a one-way ANOVA, two-way ANOVA, or something more complex), we’re interested in finding out whether or not there is a difference between the means of three or more independent groups.

In our previous example, we were interested in understanding whether or not there was a difference in mean exam scores between three different studying techniques. To understand this, we could have conducted a one-way ANOVA.

However, since we knew that a student’s current grade was also likely to affect exam scores we could include it as a covariate and instead perform an ANCOVA (analysis of covariance).

This is similar to an ANOVA, except that we include a continuous variable (student’s current grade) as a covariate so that we can understand whether or not there is a difference in mean exam scores between the three studying techniques, even after accounting for the student’s current grade.
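As a rough illustration of what "accounting for current grade" means, the sketch below fits an ANCOVA-style model by ordinary least squares using NumPy. The data are hypothetical and noiseless, built from effects chosen for this example (the numbers, the variable names, and the `fit_ancova` helper are assumptions, not part of the original study), so the fitted coefficients can be checked against the values that generated them. It returns point estimates only; a full ANCOVA would also report F-tests.

```python
import numpy as np

# Hypothetical data: 3 studying techniques (coded 0, 1, 2), 4 students each.
technique = np.array([0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2])
current_grade = np.array([60., 70., 80., 90., 65., 75., 85., 95., 55., 65., 75., 85.])

# Exam scores generated from known (made-up) effects so the fit can be checked:
# baseline 20, +0.8 per grade point, technique 1 adds 5, technique 2 adds 2.
true_effect = np.array([0.0, 5.0, 2.0])
exam_score = 20.0 + 0.8 * current_grade + true_effect[technique]

def fit_ancova(y, group, covariate):
    """Least-squares fit of y ~ intercept + group dummies + covariate."""
    levels = np.unique(group)
    # 0/1 dummy columns for every level except the first (the reference level).
    dummies = np.column_stack([(group == lv).astype(float) for lv in levels[1:]])
    X = np.column_stack([np.ones(len(y)), dummies, covariate])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef  # [intercept, effect of each non-reference level..., covariate slope]

coef = fit_ancova(exam_score, technique, current_grade)
raw_means = np.array([exam_score[technique == g].mean() for g in range(3)])

print("raw group means:", raw_means)
print("adjusted effects of techniques 1 and 2 vs. technique 0:", coef[1:3])
print("slope for current grade:", coef[3])
```

In this made-up data set the students using technique 2 happen to have lower current grades, so their raw mean exam score is lower than that of technique 0 even though the technique itself adds points. The adjusted effects separate the two.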

## Covariates in Regression

When we perform a linear regression, we’re interested in quantifying the relationship between one or more explanatory variables and a response variable.

For example, we could run a simple linear regression to quantify the relationship between square footage and house prices in a certain city. However, it may be known that the age of a house is also a variable that affects house price.

In particular, house age may be negatively correlated with house price. In this case, the age of the house would be a covariate since we’re not actually interested in studying it, but we know that it has an effect on house price.

Thus, we could include house age as an explanatory variable and run a multiple linear regression with square footage and house age as explanatory variables and house price as the response variable.

The regression coefficient for square footage would then tell us the average change in house price associated with a one unit increase in square footage after accounting for house age.
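The difference between the simple and the multiple regression can be sketched with a small, hypothetical data set. The house sizes, ages, and the pricing rule below are invented for illustration, and `ols` is a helper defined here, not a library function. Because the larger houses in this example also tend to be newer, the simple regression credits square footage with part of the effect of age.

```python
import numpy as np

# Hypothetical houses: the larger ones also tend to be newer.
sqft = np.array([1000., 1200., 1500., 1800., 2000., 2400.])
age = np.array([40., 35., 30., 20., 12., 5.])

# Prices generated from a known (made-up) rule so the fit can be checked:
# 50,000 + 100 per square foot - 1,000 per year of age.
price = 50_000 + 100 * sqft - 1_000 * age

def ols(y, *predictors):
    """Least-squares coefficients [intercept, slope_1, slope_2, ...]."""
    X = np.column_stack([np.ones(len(y)), *predictors])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

simple = ols(price, sqft)         # price ~ sqft
multiple = ols(price, sqft, age)  # price ~ sqft + age

print("sqft slope, ignoring age:      ", simple[1])
print("sqft slope, accounting for age:", multiple[1])
print("age slope:                     ", multiple[2])
```

With house age in the model, the square footage coefficient returns to the value used to generate the prices; without it, the coefficient is inflated.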

## Additional Resources

- An Introduction to ANCOVA (Analysis of Covariance)
- How to Interpret Regression Coefficients
- How to Perform an ANCOVA in Excel
- How to Perform Multiple Linear Regression in Excel


## Chapter 16. Understanding covariates: simple regression and analyses that combine covariates and factors

This chapter introduces approaches to model continuous data as an independent variable. We refer to continuous independent variables as ‘covariates’.

Until now, we have considered statistical tools that allow one to compare and estimate differences of averages between groups (e.g., t-tests, 1- and multi-Factor GLMs): i.e., we have modeled data where the independent variable comprises ‘treatments’ or ‘levels’. Sometimes we wish to examine effects of a ‘continuous’ variable, instead. Height, weight, speed, mass, volume, length and density are all examples of ‘continuous’ variables: they can have any real-number value within a given range.

We first introduce linear regression. This technique fits a straight line through data with continuous data on both the x- and y-axes (the independent and dependent variables, respectively). Linear regression allows one to estimate the slope (and y-intercept) for the relationship between the x- and y-variables, with appropriate estimates of uncertainty (i.e., standard errors, 95% confidence intervals); it also provides evidence (p-values) to judge whether the estimates of the slope or intercept differ from zero. (That said, one could also use the output from linear regression to judge whether the line’s slope differs from any arbitrary value (not just zero), which we describe, below.)

This technique allows us to ask questions like, “does a flower’s width (independent variable) affect the amount of pollen removed from the flower (dependent variable)?”, or “does metabolic rate (dependent variable) change with ambient temperature (independent variable)?”. Note that, when conducting linear regression, we assume that:

- The covariate (i.e., x-axis; independent variable) affects the dependent variable (y-axis). Therefore, we must consider the variables’ functional relationship to decide which will be the independent variable and which the dependent variable. For example, in the flower size example, above, we would model flower size and amount of pollen removed as the independent and dependent variables, respectively. This is because it makes biological sense to hypothesize that flower size affects pollen removal (e.g., by affecting how a pollinator handles a flower), but it makes little sense to hypothesize that the amount of pollen removed would determine how big a flower was.
- The covariate (x-variable) is measured precisely (i.e., with little measurement error) or is controlled by the experimenter.

Note that some biological disciplines commonly analyze models that include multiple covariates (i.e., ‘multiple regression’). We will discuss analyses with multiple covariates in the future.
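The quantities that linear regression provides (slope, intercept, standard error of the slope, and a test of the slope against zero or any other value) can be sketched in plain Python. The flower measurements below are hypothetical, and the function names are chosen for this example only. The resulting t-statistic would be compared against a t-distribution with n − 2 degrees of freedom to obtain a p-value, a step omitted here.

```python
import math

def simple_regression(x, y):
    """Return (slope, intercept, standard error of the slope) for y ~ x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual variance uses n - 2 degrees of freedom (two estimated parameters).
    residual_ss = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    residual_var = residual_ss / (n - 2)
    se_slope = math.sqrt(residual_var / sxx)
    return slope, intercept, se_slope

def t_statistic(slope, se_slope, hypothesized_slope=0.0):
    """t-statistic for testing the slope against an arbitrary hypothesized value."""
    return (slope - hypothesized_slope) / se_slope

# Hypothetical measurements: flower width (x) and amount of pollen removed (y).
flower_width = [1.0, 2.0, 3.0, 4.0, 5.0]
pollen_removed = [2.1, 3.9, 6.2, 7.8, 10.1]

slope, intercept, se_slope = simple_regression(flower_width, pollen_removed)
t_vs_zero = t_statistic(slope, se_slope)
t_vs_two = t_statistic(slope, se_slope, hypothesized_slope=2.0)

print("slope:", slope, "intercept:", intercept, "SE(slope):", se_slope)
print("t for slope = 0:", t_vs_zero)
print("t for slope = 2:", t_vs_two)
```

For these invented numbers the slope is far from zero relative to its standard error but very close to 2, which is the kind of distinction the second test makes.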

We next introduce models that include a single covariate as well as one or more factors. This approach was once called ‘ANCOVA’ (i.e., ANalysis of CO-VAriance); in a general linear model context, we simply note that a glm includes both a covariate(s) and a factor(s).

As we saw in the Chapter, “Analyzing experiments with multiple factors”, models that include both a covariate and at least one factor allow a researcher to assess evidence for multiple hypotheses, simultaneously.

For example, imagine that we wished to compare the dispersal of seeds from maple trees vs. ash trees. Both trees produce seeds with ‘wings’, but their morphology differs (see pictures, below). We might conduct an experiment involving a random sample of seeds from each tree species. We could drop a seed from a known height and measure the distance it travels; we could repeat this process for many seeds for each species at a variety of known heights (ranging from, say, 3 to 25 meters, which spans biologically plausible heights). With these data, we could address several hypotheses:

- Does the covariate (Height) affect Dispersal distance after accounting for effects of the factor (Tree Species)? i.e., this hypothesis tests whether we have evidence for a linear relationship between Height and Dispersal distance, accounting for differences between species.
- Does Dispersal distance differ between levels of the factor (Tree Species) after accounting for effects of the covariate (Height)?
- Do the slopes of the relationships between Height and Dispersal distance differ between levels of the factor (Tree Species)? i.e., do we find evidence for an interaction between the covariate and the factor?
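The three hypotheses above correspond to terms in a single linear model: a Height term, a Species term, and a Height × Species interaction. A minimal sketch with NumPy follows. The drop heights, the distances, and the rule that generates them are hypothetical and noiseless, so the fitted coefficients simply recover the values assumed for this example. Real data would also need standard errors and tests.

```python
import numpy as np

# Hypothetical drop heights (m), the same set for both species.
height = np.array([3., 8., 13., 18., 23., 3., 8., 13., 18., 23.])
is_ash = np.array([0., 0., 0., 0., 0., 1., 1., 1., 1., 1.])  # 0 = maple, 1 = ash

# Distances generated from a made-up rule with different slopes per species:
# maple: 1.0 + 0.5 * height;  ash: 2.0 + 0.3 * height.
distance = np.where(is_ash == 1, 2.0 + 0.3 * height, 1.0 + 0.5 * height)

# Design matrix: intercept, covariate, factor (dummy), covariate x factor.
X = np.column_stack([np.ones(len(height)), height, is_ash, height * is_ash])
coef, *_ = np.linalg.lstsq(X, distance, rcond=None)
intercept, slope_maple, ash_offset, slope_difference = coef

print("slope for maple (reference species):", slope_maple)
print("ash minus maple at height 0:", ash_offset)
print("difference in slopes (interaction):", slope_difference)
```

Note that when the interaction term is in the model, the Height coefficient is the slope for the reference species only, and the Species coefficient is the difference between species at Height = 0. If the interaction is dropped, those two coefficients become the common slope and the height-adjusted species difference.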

As noted in the Chapter, Analyzing experiments with multiple factors, these hypotheses differ qualitatively from those we might ask with, say, 1-factor glm. Therefore, understanding analyses that include covariates increases the scope of biological questions we might investigate beyond simpler methods. Studies in ecology and evolution analyze models with covariates on a regular basis. However, my experience is that analyses in biomedical sciences rarely include covariates (with notable exceptions, e.g., epidemiology), but might benefit from doing so.

We might include a covariate in an analysis for several reasons. Foremost, we could include a covariate in a model because we’re specifically interested in its biological effect; this reason should be self-evident. However, we might also model the effects of a covariate not because the covariate interests us biologically, per se, but because including the covariate may help us understand effects of another term in our model (e.g., a factor).

First, we might include a covariate to account for confounding effects in a study. For example, imagine that we wished to test whether blood pressure differed between adult human females vs. males. To test this, we might measure blood pressure for an appropriate sample of many females and males and analyze the data with a 1-factor general linear model (blood pressure and Sex would be the dependent and independent variables, respectively).

However, we might also know that body size can affect blood pressure and that, on average, mass differs between females and males. Therefore, if we found evidence for differences in blood pressure between females and males in our 1-factor glm, we might wonder whether an apparent effect of Sex arose due to differences in body size between the Sexes, rather than other biological aspects of Sex. A model that included body size as a covariate would help resolve this issue because the results for the effect of Sex would have accounted for effects of body size; i.e., we test whether Sex affects blood pressure independent of differences due to body size.

Clearly, this approach can deepen understanding of biology. Second, we might include a covariate because, if the covariate accounts for a reasonable amount of variation in the dependent variable, we increase statistical power to examine effects of a factor that interests us; again, this provides clear benefits.

Conversely, including a covariate that does not explain reasonable variation in the dependent variable decreases power to examine effects of a factor. Therefore, we should think carefully about including a given covariate in an analysis. But, with this careful thinking, covariates improve analyses and provide deeper biological understanding.
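The confounding scenario described above (blood pressure, Sex, and body size) can be illustrated with a small sketch. The data are hypothetical and constructed so that blood pressure depends on mass only, with no direct effect of Sex. The masses, the generating rule, and the `ols` helper are all assumptions made for this example.

```python
import numpy as np

# Hypothetical sample: males are, on average, heavier than females.
is_male = np.array([0., 0., 0., 0., 1., 1., 1., 1.])
mass = np.array([55., 60., 65., 70., 70., 75., 80., 85.])  # kg

# Blood pressure generated from mass only (made-up rule): 80 + 0.5 * mass.
blood_pressure = 80.0 + 0.5 * mass

def ols(y, *predictors):
    """Least-squares coefficients [intercept, slope_1, slope_2, ...]."""
    X = np.column_stack([np.ones(len(y)), *predictors])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

one_factor = ols(blood_pressure, is_male)            # blood pressure ~ Sex
with_covariate = ols(blood_pressure, is_male, mass)  # blood pressure ~ Sex + mass

print("apparent Sex effect, 1-factor model:", one_factor[1])
print("Sex effect after accounting for mass:", with_covariate[1])
print("mass slope:", with_covariate[2])
```

In the 1-factor model the Sexes appear to differ, but once mass is included the Sex coefficient is essentially zero: the apparent difference was carried entirely by body size.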

The videos and practice problems, below, equip you with the skills to implement models with a covariate.

## Introduction to linear regression

An introduction to GLM with 1 covariate, and comparison with 1-Factor GLM


## Covariate vs 1-Factor

This video demonstrates that a 1-Factor GLM works in a similar manner as a 1-covariate GLM

## Example regression

An example regression analysis (with 1 covariate).

Please note that this video needs to be updated to also discuss the third residual plot (where the square root of standardized residuals lies on the y-axis). If you pause the video at this point, you will see that the red line is not perfectly flat, but it is sufficiently flat that we do not worry about equal variance.

The p-value should also be described as strong evidence for an effect.

## Beware of extrapolating and a summary of regression

This video discusses perils of extrapolating beyond the data and provides a summary overview of regression.

## An introduction to analyzing factors and covariates simultaneously

This video provides a conceptual introduction to GLMs that include both Factors and Covariates as independent variables. It considers the types of biological questions that can be addressed, lists assumptions of this approach, and briefly compares this approach to GLMs with multiple Factors.

## GLM with factor and covariate Example: Blood Pressure

This video walks through an analysis of measurements of undergraduate students at the University of Edinburgh: we test whether weight and sex affect systolic blood pressure. The video provides:

i) simple suggestions to plot data;

ii) two approaches to analyze the data.

iii) guidance when an interaction between a factor and covariate appears present vs. absent in the data

Please note that I mis-speak at the very end of the video, where I say there's a typo about d.f., when reporting the results (362 vs 361) (there is no typo; the df come from different models, which I forgot under the pressure of arriving to the end of a long video!)

This video needs to be edited to consider the third residual plot (where we find the square root of standardized residuals on the y-axis). If you pause the video on this plot (you have to be quick!) you will see that this plot indicates the data meet the assumption of equal variance: the red line is flat and the points are evenly spread around the line.

The video should also be edited with respect to interpreting p-values. The interaction has p = 0.917, which constitutes (at most) weak evidence for an interaction. The p-value for the effect of Adj.Weight is interpreted as strong evidence for an effect; similarly, we eventually find p-values that provide strong evidence for an effect of Sex.


## Practice problems and answers

## Recommended Reading

Grafen & Hails: Modern statistics for the life sciences, Chapter 2. A nice introduction to Regression

Ruxton & Colegrave: Experimental Design for the life sciences (4th Edition), Chapter 9. This Chapter includes some materials that are not directly related to our current chapter on ‘covariates’. However, Ruxton & Colegrave’s chapter does nicely discuss experimental design with respect to covariates, and discusses interactions between covariates and factors.

Whitlock & Schluter: The analysis of biological data, Chapter 17. Another nice introduction to regression.

Whitlock & Schluter, ‘The analysis of biological data’; Chapter: ‘Multiple explanatory variables’. This Chapter provides a generally great introduction to models with multiple explanatory variables, including models with both factors and covariates. It includes some materials that we discuss in our previous Chapter, dealing with multi-factor models.

## Covariate in Statistics: Definition and Examples

## What is a covariate?

In statistical experiments, researchers often measure the independent variable (main treatment variable) and the dependent variable (response to treatment).

Additionally, there could be other variables known to affect the dependent variable besides the main treatment variable. Generally, such a variable is not of main interest in the experiment and is referred to as a covariate.

Definition: A covariate is a variable that is not of main interest in an experiment but can affect the dependent variable and the relationship of the independent variable with the dependent variable.

The covariate is not a planned variable but often arises in experiments due to underlying experimental conditions.

Covariates should be identified and included in the analysis to increase the accuracy and reduce the unexplained variation of the statistical model. Hence, a covariate is also known as a control variable.

## Covariate example

For example, a plant researcher wants to test whether the yield of a plant depends on its genotype. The researcher collects yield data for the different plant genotypes.

However, the researcher also knows that the height of the plants affects the plant yield. The plant height is not of primary interest to the researcher, but it should be considered in a statistical model to get accurate results.

In this example, plant genotype is an independent variable (main treatment variable), plant yield is a dependent variable (response variable), and plant height is a covariate.

Note: In ANCOVA, the covariate is conventionally a continuous variable; a categorical control variable is entered in the model as a factor instead.

## Covariate in Statistical analysis

ANCOVA (Analysis of Covariance) and regression analysis are commonly used statistical methods that include a covariate in the model.

## ANCOVA (Analysis of Covariance)

ANCOVA is an extension to ANOVA in which a covariate is considered in the statistical model.

ANCOVA analyzes the effect of the independent variable (main treatment variable) on the dependent variable while there is a covariate in the study.

In the example discussed above, the researcher could have used one-way ANOVA to study the effect of plant genotypes on yield. But these results could be misleading without considering the effect of plant height (covariate) on plant yield.

ANCOVA includes the covariate in the model and estimates the differences between genotypes while adjusting for the effect of plant height (covariate). ANCOVA increases the precision of the model by removing the variance associated with the covariate from the error term.
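The idea of "removing the variance associated with the covariate" can be made concrete by comparing the residual sum of squares of a model with genotype only against a model with genotype plus plant height. The sketch below uses NumPy and hypothetical data. The yields, heights, and small deviations are invented for illustration, and `residual_ss` is a helper defined here.

```python
import numpy as np

# Hypothetical data: two genotypes (0, 1), four plants each.
genotype = np.array([0., 0., 0., 0., 1., 1., 1., 1.])
height = np.array([50., 60., 70., 80., 55., 65., 75., 85.])  # cm

# Made-up rule: yield = 10 + 0.2 * height + 3 * genotype + small deviations.
deviation = np.array([0.3, -0.2, -0.3, 0.2, -0.1, 0.4, -0.4, 0.1])
plant_yield = 10.0 + 0.2 * height + 3.0 * genotype + deviation

def residual_ss(y, X):
    """Residual sum of squares after a least-squares fit of y on the columns of X."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    residuals = y - X @ coef
    return float(residuals @ residuals)

ones = np.ones(len(plant_yield))
rss_anova = residual_ss(plant_yield, np.column_stack([ones, genotype]))
rss_ancova = residual_ss(plant_yield, np.column_stack([ones, genotype, height]))

print("residual SS, genotype only (ANOVA-style):     ", rss_anova)
print("residual SS, genotype + height (ANCOVA-style):", rss_ancova)
```

In this example most of the unexplained variation in the genotype-only model is due to plant height. Once height is in the model the error term shrinks sharply, which is what gives the genotype comparison more power.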

## Regression analysis

Simple and multiple regression analyses are useful for studying the relationship between independent variables and a dependent variable.

Simple regression analysis is used for understanding the relationship between one independent variable and the dependent variable, whereas multiple regression analysis is used for understanding the relationship between multiple independent variables and the dependent variable.

For example, the effect of plant height on plant yield can be quantified using simple regression analysis.

In addition to plant height, the researcher knows that ambient temperature also affects the yield of the plant. In this case, the ambient temperature could be used as a covariate in the model.

A multiple regression analysis can be performed with plant height and ambient temperature as independent variables and plant yield as the dependent variable.

The multiple regression analysis reports the adjusted R-squared, which summarizes how well the model fits, and the regression coefficients (slopes). The coefficient for plant height estimates the effect of plant height on plant yield after adjusting for ambient temperature.
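For reference, adjusted R-squared is computed from the ordinary R-squared, the number of observations n, and the number of predictors p as 1 − (1 − R²)(n − 1)/(n − p − 1). A small sketch of that standard formula follows. The function name and the example numbers are chosen for illustration only.

```python
def adjusted_r_squared(r_squared, n_obs, n_predictors):
    """Adjusted R-squared: penalizes R-squared for the number of predictors."""
    residual_df = n_obs - n_predictors - 1
    if residual_df <= 0:
        raise ValueError("need more observations than predictors plus one")
    return 1.0 - (1.0 - r_squared) * (n_obs - 1) / residual_df

# Hypothetical example: R-squared of 0.80 from 30 plants and 2 predictors
# (plant height and ambient temperature).
value = adjusted_r_squared(0.80, n_obs=30, n_predictors=2)
print("adjusted R-squared:", value)
```

Because of the penalty, adding a covariate that explains little variation can lower the adjusted R-squared even though the ordinary R-squared never decreases.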

Note : Sometimes in regression analysis, the independent variables are also known as covariates. This is because they predict the outcome of the dependent variable and can be of primary interest.


## Understanding covariates

## What is a covariate?

Covariates are usually used in ANOVA and DOE. In these models, a covariate is any continuous variable, which is usually not controlled during data collection. Including covariates in the model allows you to include and adjust for input variables that were measured but not randomized or controlled in the experiment. Adding covariates can greatly improve the accuracy of the model and may significantly affect the final analysis results. Including a covariate in the model can reduce the error in the model to increase the power of the factor tests. Common covariates include ambient temperature, humidity, and characteristics of a part or subject before a treatment is applied.

For example, an engineer wants to study the level of corrosion on four types of iron beams. The engineer exposes each beam to a liquid treatment to accelerate corrosion, but cannot control the temperature of the liquid. Temperature is a covariate that should be considered in the model.

In a DOE, an engineer may be interested in the effect of the covariate ambient temperature on the drying time of two different types of paint.

## Example of adding a covariate to a general linear model

A textile company uses three different machines to manufacture monofilament fibers. They want to determine whether the breaking strength of the fiber differs based on which machine is used. They collect data on the strength and diameter for 5 randomly selected fibers from each machine. Because fiber strength is related to its diameter, they also record the fiber diameter for use as a possible covariate.

- Choose Stat > Regression > Fitted Line Plot.
- In Response (Y), enter Strength.
- In Predictor (X), enter Diameter.
- Assess how closely the data fall beside the fitted line and how close R² is to a "perfect fit" (100%).

The fitted line plot indicates a strong linear relationship (87.2%) between diameter and strength.

- Choose Stat > ANOVA > General Linear Model > Fit General Linear Model.
- In Responses, enter Strength.
- In Factors, enter Machine.
- In Covariates, enter Diameter.

For the fiber production data, Minitab displays the following results:

## General Linear Model: Strength versus Diameter, Machine

The F-statistic for machines is 2.61 and the p-value is 0.118. Because the p-value is greater than 0.05, you fail to reject, at the 5% significance level, the null hypothesis that fiber strength does not differ by machine. In other words, once diameter is accounted for, the data do not provide enough evidence that strength differs among the machines. Notice that the F-statistic for diameter (the covariate) is 69.97 with a reported p-value of 0.000. This indicates that the covariate effect is significant: diameter has a statistically significant effect on fiber strength.

Now, suppose you rerun the analysis and omit the covariate. This will result in the following output:

## General Linear Model: Strength versus Machine

Notice that the F-statistic is 4.09 with a p-value of 0.044. Without the covariate in the model, you reject the null hypothesis at the 5% significance level and conclude the fiber strengths do differ based on which machine is used.

This conclusion is completely opposite the conclusion you got when you performed the analysis with the covariate. This example shows how the failure to include a covariate can produce misleading analysis results.
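The two analyses above can be reproduced outside Minitab as nested-model F-tests. The sketch below uses plain Python with a small least-squares helper. The worksheet itself is not shown in this article, so the `data` list is an assumption: these are values commonly published with this textbook fiber example and should be checked against the actual Minitab worksheet. The helper names `sse` and `f_stat` are illustrative, not Minitab functions.

```python
# Assumed worksheet (not shown in the article): (machine, strength, diameter).
data = [
    (1, 36, 20), (1, 41, 25), (1, 39, 24), (1, 42, 25), (1, 49, 32),
    (2, 40, 22), (2, 48, 28), (2, 39, 22), (2, 45, 30), (2, 44, 28),
    (3, 35, 21), (3, 37, 23), (3, 42, 26), (3, 34, 21), (3, 32, 15),
]

def sse(X, y):
    """Residual sum of squares after a least-squares fit of y on the columns of X."""
    n, p = len(X), len(X[0])
    # Augmented normal equations [X'X | X'y].
    A = [[sum(X[i][r] * X[i][c] for i in range(n)) for c in range(p)]
         + [sum(X[i][r] * y[i] for i in range(n))] for r in range(p)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(p):
        piv = max(range(col, p), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        for r in range(p):
            if r != col:
                f = A[r][col] / A[col][col]
                A[r] = [a - f * b for a, b in zip(A[r], A[col])]
    beta = [A[r][p] / A[r][r] for r in range(p)]
    return sum((y[i] - sum(b * x for b, x in zip(beta, X[i]))) ** 2 for i in range(n))

def f_stat(sse_reduced, sse_full, df_num, df_den):
    """F-statistic comparing a reduced model with a larger model that contains it."""
    return ((sse_reduced - sse_full) / df_num) / (sse_full / df_den)

y = [float(s) for _, s, _ in data]
n = len(y)
diam = [float(d) for _, _, d in data]
# Indicator columns for machines 2 and 3 (machine 1 is the reference level).
dummies = [[1.0 if m == 2 else 0.0, 1.0 if m == 3 else 0.0] for m, _, _ in data]

X_mean = [[1.0] for _ in data]                               # intercept only
X_machine = [[1.0] + d for d in dummies]                     # machine only
X_diam = [[1.0, x] for x in diam]                            # diameter only
X_full = [[1.0] + d + [x] for d, x in zip(dummies, diam)]    # machine + diameter

f_machine_unadjusted = f_stat(sse(X_mean, y), sse(X_machine, y), 2, n - 3)
f_machine_adjusted = f_stat(sse(X_diam, y), sse(X_full, y), 2, n - 4)
f_diameter = f_stat(sse(X_machine, y), sse(X_full, y), 1, n - 4)

print(round(f_machine_unadjusted, 2))  # about 4.09 (machine, no covariate)
print(round(f_machine_adjusted, 2))    # about 2.61 (machine, adjusted for diameter)
print(round(f_diameter, 2))            # about 69.97 (diameter)
```

With the assumed values, the output agrees with the F-statistics quoted above, which suggests the machine effect in the unadjusted model is largely a diameter effect.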


## 9.1: Role of the Covariate


- Penn State's Department of Statistics
- The Pennsylvania State University


To illustrate the role the covariate has in the ANCOVA, let’s look at a hypothetical situation wherein investigators are comparing the salaries of male vs. female college graduates. A random sample of 5 individuals for each gender is compiled, and a simple one-way ANOVA is performed:

\(H_{0}: \ \mu_{\text{males}} = \mu_{\text{females}}\)

## SAS Example

SAS coding for the One-way ANOVA:

Here is the output we get:

## Minitab Example

To perform a one-way ANOVA test in Minitab, you can first open the data ( ANCOVA Example Minitab Data ) and then select Stat > ANOVA > One Way…

In the pop-up window that appears, select salary as the Response and gender as the Factor.

Click OK , and the output is as follows.

## Analysis of Variance

Model summary.

- Load the ANCOVA example data.
- Obtain the ANOVA table.
- Plot the data.

1. Load the ANCOVA example data and obtain the ANOVA table by using the following commands:

2. Plot for the data, salary by gender, by using the following commands:

3. Plot for the data, salary vs years, by using the following commands:

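Because the \(p\)-value exceeds \(\alpha = 0.05\), the investigators cannot reject \(H_{0}\): the one-way ANOVA finds no significant difference in mean salary between males and females.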

A plot of the data shows the situation:

However, it is reasonable to assume that the length of time since graduation from college is also likely to influence one's income. So more appropriately, the duration since graduation, a continuous variable, should be also included in the analysis, and the required data is shown below.

The plot above indicates an upward linear trend between salary and the number of years since graduation, which could be a marker for experience and/or postgraduate education. The fundamental idea of including a covariate is to take this trend into account and to "control" for it. In other words, once the covariate is included, the ANOVA compares males and females after accounting for years since graduation.
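To make the idea of "controlling" for the covariate concrete, here is a minimal Python sketch of the classical covariate-adjusted comparison: the raw difference in group means is corrected by the pooled within-group slope times the difference in covariate means. The numbers are hypothetical and deliberately noise-free (salary in thousands of dollars); they are not the course data set, and the names `within_sums`, `pooled_slope`, and `adjusted_diff` are illustrative.

```python
from statistics import mean

# Hypothetical, constructed data (NOT the course data set).
males = {"years": [1, 2, 3, 4, 5], "salary": [55, 60, 65, 70, 75]}
females = {"years": [3, 4, 5, 6, 7], "salary": [55, 60, 65, 70, 75]}

def within_sums(group):
    """Within-group sum of squares for years and cross-product with salary."""
    mx, my = mean(group["years"]), mean(group["salary"])
    sxx = sum((x - mx) ** 2 for x in group["years"])
    sxy = sum((x - mx) * (y - my) for x, y in zip(group["years"], group["salary"]))
    return sxx, sxy

sxx_m, sxy_m = within_sums(males)
sxx_f, sxy_f = within_sums(females)

# Common (pooled within-group) slope of salary on years.
pooled_slope = (sxy_m + sxy_f) / (sxx_m + sxx_f)

# Raw comparison of means, as in the one-way ANOVA.
raw_diff = mean(males["salary"]) - mean(females["salary"])

# Comparison at equal years since graduation.
adjusted_diff = raw_diff - pooled_slope * (mean(males["years"]) - mean(females["years"]))

print(pooled_slope)   # 5.0
print(raw_diff)       # 0
print(adjusted_diff)  # 10.0
```

In this constructed example the two groups have identical mean salaries, yet they differ by 10 at any fixed number of years since graduation. A one-way ANOVA would miss that difference, while the covariate-adjusted comparison recovers it.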


## Quick Reference

(covariable) n. (in statistics) a continuous variable that is not part of the main experimental manipulation but has an effect on the dependent variable. The inclusion of covariates increases the power of the statistical test and removes the bias of confounding variables (which have effects on the dependent variable that are indistinguishable from those of the independent variable).

From: covariate in Concise Medical Dictionary


NCBI Bookshelf. A service of the National Library of Medicine, National Institutes of Health.

Velentgas P, Dreyer NA, Nourjah P, et al., editors. Developing a Protocol for Observational Comparative Effectiveness Research: A User's Guide. Rockville (MD): Agency for Healthcare Research and Quality (US); 2013 Jan.

## Developing a Protocol for Observational Comparative Effectiveness Research: A User's Guide.


## Chapter 7 Covariate Selection

Brian Sauer, PhD, M. Alan Brookhart, PhD, Jason A Roy, PhD, and Tyler J VanderWeele, PhD.

This chapter addresses strategies for selecting variables for adjustment in nonexperimental comparative effectiveness research (CER), and uses causal graphs to illustrate the causal network relating treatment to outcome. While selection approaches should be based on an understanding of the causal network representing the common cause pathways between treatment and outcome, the true causal network is rarely known. Therefore, more practical variable selection approaches are described, which are based on background knowledge when the causal structure is only partially known. These approaches include adjustment for all observed pretreatment variables thought to have some connection to the outcome, all known risk factors for the outcome, and all direct causes of the treatment or the outcome. Empirical approaches, such as forward and backward selection and automatic high-dimensional proxy adjustment, are also discussed. As there is a continuum between knowing and not knowing the causal, structural relations of variables, a practical approach to variable selection is recommended, which involves a combination of background knowledge and empirical selection using the high-dimensional approach. The empirical approach could be used to select from a set of a priori variables on the basis of the researcher's knowledge, and to ultimately select those to be included in the analysis. This more limited use of empirically derived variables may reduce confounding while simultaneously reducing the risk of including variables that could increase bias.

## Introduction

Nonexperimental studies that compare the effectiveness of treatments are often strongly affected by confounding. Confounding occurs when patients with a higher risk of experiencing the outcome are more likely to receive one treatment over another. For example, consider two drugs used to treat hypertension—calcium channel blockers (CCB) and diuretics. Since many clinicians perceive CCBs as particularly useful in treating high-risk patients with hypertension, patients with a higher risk for experiencing cardiovascular events are more likely to be channeled into the CCB group, thus confounding the relation between antihypertensive treatment and the clinical outcomes of cardiovascular events. 1 The difference in treatment groups is a result of the differing baseline risk for the outcome and the treatment effects (if any). Any attempt to compare the causal effects of CCBs and diuretics on cardiovascular events would require taking patients' underlying risk for cardiovascular events into account through some form of covariate adjustment. The use of statistical methods to make the two treatment groups similar with respect to measured confounders is sometimes called statistical adjustment, control, or conditioning.

The purpose of this chapter is to address the complex issue of selecting variables for adjustment in order to compare the causative effects of treatments. The reader should note that the recommended variable selection strategies discussed are for nonexperimental causal models and not prediction or classification models, for which approaches may differ. Recommendations for variable selection in this chapter focus primarily on fixed treatment comparisons when employing the so-called “incident user design,” which is detailed in chapter 2 .

This chapter contains three sections. In the first section, we explain causal graphs and the structural relations of variables. In the second section, we discuss proxy, mismeasured, and unmeasured variables. The third section presents variable selection approaches based on full and partial knowledge of the data generating process as represented in causal graphs. We also discuss approaches to selecting covariates from a high-dimensional set of variables on the basis of statistical association, and suggest how these approaches may be used to complement variable selection based on background knowledge. Ideally, when information is available, causal graph theory would be used to complement any variable selection technique. We provide a separate supplement ( supplement 2 ) on directed acyclic graphs for the more advanced reader.

## Causal Models and the Structural Relationship of Variables

This section introduces notation to illustrate basic concepts. Causal graphs are used to represent relationships among variables and to illustrate situations that generate bias and confounding.

## Treatment Effects

The goal of comparative effectiveness research (CER) is to determine if a treatment is more effective or safer than another. Treatments should be “well defined,” as described in chapter 4 , and should represent manipulable units; e.g., drug treatments, guidelines, and devices. Causal graphs are often used to illustrate relationships among variables that lead to confounding and other types of bias. The simple causal graph in Figure 7.1 indicates a randomized trial in which no unmeasured or measured variables influence treatment assignment where A 0 is the assigned treatment at baseline (time zero) and Y 1 is the outcome after followup (time 1). The arrow connecting treatment assignment ( A 0 ) to the outcome ( Y 1 ) indicates that treatment has a causal effect on the outcome. Causal graphs are used to represent the investigator's beliefs about the mechanisms that generated the data. Knowledge of the causal structure that generates the data allows the investigator to better interpret statistical associations observed in the data.

Figure 7.1. Causal graph illustrating a randomized trial where assigned treatment ( A 0 ) has a causal effect on the outcome ( Y 1 ).

## Risk Factors

We now let C 0 be one or more baseline covariates measured at time zero. Covariates that are predictive of the outcome but have no influence on treatment status are often referred to as pure risk factors, depicted in Figure 7.2 . Conditioning on such risk factors is unnecessary to remove bias but can result in efficiency gains in estimation 2 - 3 and does not induce bias in regression or propensity score models. 4 Researchers need to be careful not to include variables affected by the outcome, as adjustment for such variables can increase bias. 2 We recommend including risk factors in statistical models to increase the efficiency/precision of an estimated treatment effect without increasing bias. 4

Figure 7.2. Causal graph illustrating a baseline risk factor ( C 0 ) for the outcome ( Y 1 ).
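The efficiency gain from adjusting for a pure risk factor can be seen in a small simulation. The sketch below is a hypothetical setup, not taken from the chapter: treatment is randomized, a baseline covariate strongly predicts the outcome, and the true treatment effect is 1.0. It compares the spread of the unadjusted difference in means with a covariate-adjusted estimate across repeated trials. All names and parameter values are assumptions chosen for illustration.

```python
import random
from statistics import mean, variance

random.seed(1)

def avg(values):
    return sum(values) / len(values)

def one_trial(n=200, effect=1.0):
    """Simulate one randomized trial; return (unadjusted, adjusted) effect estimates."""
    a = [0, 1] * (n // 2)
    random.shuffle(a)                                   # randomized treatment
    c = [random.gauss(0, 1) for _ in range(n)]          # pure risk factor C0
    y = [effect * ai + 2.0 * ci + random.gauss(0, 1) for ai, ci in zip(a, c)]
    groups = {g: [(ci, yi) for ai, ci, yi in zip(a, c, y) if ai == g] for g in (0, 1)}
    mc = {g: avg([ci for ci, _ in groups[g]]) for g in (0, 1)}
    my = {g: avg([yi for _, yi in groups[g]]) for g in (0, 1)}
    # Pooled within-group slope of the outcome on the risk factor.
    sxx = sum((ci - mc[g]) ** 2 for g in (0, 1) for ci, _ in groups[g])
    sxy = sum((ci - mc[g]) * (yi - my[g]) for g in (0, 1) for ci, yi in groups[g])
    unadjusted = my[1] - my[0]
    adjusted = unadjusted - (sxy / sxx) * (mc[1] - mc[0])
    return unadjusted, adjusted

results = [one_trial() for _ in range(500)]
unadjusted_estimates = [u for u, _ in results]
adjusted_estimates = [v for _, v in results]

print(mean(unadjusted_estimates), variance(unadjusted_estimates))
print(mean(adjusted_estimates), variance(adjusted_estimates))
```

Both estimators should center near the true effect, because randomization already removes bias. The adjusted estimator should vary much less from trial to trial, which is the precision gain described above.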

## Confounding

The central threat to the validity of nonexperimental CER is confounding. Due to the ways in which providers and patients choose treatments, the treatment groups may not have similar underlying risk for the outcome.

Confounding is often illustrated as a common cause pathway between the treatment and outcome. Measured variables that influence treatment assignment, are predictive of the outcome, and remove confounding when adjusted for are often called confounders. Unmeasured variables on a common cause pathway between treatment and outcome are referred to as unmeasured confounders. For example, in Figure 7.3, unmeasured variables U1 and U2 are causes of treatment assignment and outcome. In general, sources of confounding in observational comparative effectiveness studies include provider actions, patient actions, and social and environmental factors. Unmeasured variable U1 has a measured proxy, the confounder C 0, so conditioning on C 0 removes confounding by U1; the unmeasured variable U2 has no measured proxy, so confounding by U2 remains.

Figure 7.3. A causal graph illustrating confounding from the unmeasured variable U2. Conditioning on the measured variable ( C 0 ), as indicated by the box around the variable, removes confounding from U1. Measured confounders are often proxies for unmeasurable constructs.

## Provider Actions

Confounding by indication : Confounding by indication, also referred to as “channeling bias,” is common and often difficult to control in comparative effectiveness studies. 5 - 9 Prescribers choose treatments for patients who they believe are most likely to benefit or least likely to be harmed. In a now historic example, Huse et al. surveyed United States physicians about their use of various classes of antihypertensive medications and found that physicians were more likely to prescribe CCBs to high-risk patients than for uncomplicated hypertension. 1 Any attempt to compare the safety or effectiveness between CCBs and other classes of antihypertensive medication would need to adequately account for the selective use of CCBs for higher risk patients. If underlying disease severity and prognosis are not precisely measured and correctly modeled, CCBs would appear more harmful or less effective simply because higher risk patients are more likely to receive CCBs. Variables measuring risk for the outcome being investigated need to be adequately measured and modeled to address confounding by indication.
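A short simulation can show how channeling alone makes a treatment look harmful. The Python sketch below uses a hypothetical data-generating process: high-risk patients are more likely to receive a CCB, baseline risk drives the event rate, and the drug itself has no effect. The probabilities and variable names are assumptions chosen for illustration, not estimates from any study.

```python
import random

random.seed(0)
n = 200_000
rows = []
for _ in range(n):
    high_risk = random.random() < 0.5
    ccb = random.random() < (0.8 if high_risk else 0.2)      # channeling into CCB
    event = random.random() < (0.30 if high_risk else 0.10)  # depends on risk only
    rows.append((high_risk, ccb, event))

def risk(records):
    records = list(records)
    return sum(e for _, _, e in records) / len(records)

def risk_difference(records):
    """Event risk among CCB users minus event risk among non-users."""
    records = list(records)
    return risk(r for r in records if r[1]) - risk(r for r in records if not r[1])

crude_rd = risk_difference(rows)
rd_high = risk_difference(r for r in rows if r[0])
rd_low = risk_difference(r for r in rows if not r[0])

# Stratum-size-weighted average of the within-stratum risk differences.
n_high = sum(1 for r in rows if r[0])
adjusted_rd = (n_high * rd_high + (n - n_high) * rd_low) / n

print(round(crude_rd, 3))     # clearly positive: CCBs look harmful
print(round(adjusted_rd, 3))  # close to zero once baseline risk is held fixed
```

The crude comparison shows an excess risk of roughly 12 percentage points for CCB users even though the simulated drug does nothing. Stratifying on baseline risk removes it, but only because risk is fully measured here, which is rarely the case in practice.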

Selective treatment and treatment discontinuation of preventive therapy in frail and very sick patients : Patients who are perceived by a physician to be close to death or who face serious medical problems may be less likely to receive preventative therapies. Similarly, preventative treatment may be discontinued when health deteriorates. This may explain the substantially decreased mortality observed among elderly users of statins and other preventive medications compared with apparently similar nonusers. 10 - 11 Even though concerns with discontinuation of therapy may be addressed using time-varying measures of treatment, this type of selective discontinuation presents problems when analyzing fixed treatments. For example, when conducting database studies, data are extracted and analyzed on the basis of the specified study period. The more frail elderly who discontinued treatment prior to the study window would appear to have never received treatment.

Patients with certain chronic diseases or patients who take many medications may also have a lower probability of being prescribed a potentially beneficial medication due to concerns regarding drug-drug interactions or metabolic problems. 8 For example, patients with end-stage renal disease are less likely to receive medications for secondary prevention after myocardial infarction. 12 Additionally, in a study assessing the potential for bias in observational studies evaluating use of lipid-lowering agents and mortality risk, the authors found evidence of bias due to an association between noncardiovascular comorbidities and the likelihood of treatment. 11 Due to these findings, researchers have recommended statin use and other chronic therapies as markers for health status in their causal models. 11 , 13

## Patient Actions

Healthy user/adherer bias : Patients who initiate a preventive therapy may be more likely than other patients to engage in other healthy, prevention-oriented behaviors. Patients who start a preventive medication may have a disposition that makes them more likely to seek out preventive health care services, exercise regularly, moderate their alcohol consumption, and avoid unsafe and unhealthy activities. 14 Incomplete adjustment for such behaviors representative of specific personality traits can make preventative medications spuriously or more strongly associated with reduced risk of a wide range of adverse health outcomes.

Similar to patients who initiate preventive medications, patients who adhere to treatment may also engage in more healthful behaviors. 14 - 15 Strong evidence of this “healthy adherer” effect comes from a meta-analysis of randomized controlled trials where good adherence to placebo was found to be associated with mortality benefits and other positive health outcomes. 16 The benefit can be explained by the healthy behaviors of the patients who use the medication as prescribed rather than placebo effects. Treatment adherence is an intermediate variable between treatment assignment and health outcomes. Any attempt to evaluate the effectiveness of treatment rather than the effect of assigned treatment would require time-varying treatment analysis where subjects are censored when treatment is discontinued. Proper adjustment for predictors of treatment discontinuation is required to resolve the selection bias that occurs when conditioning on patients who adhered to assigned treatment. 17 - 18

Physician assessment that patients are functionally impaired (defined as having difficulty performing activities of daily living) may also influence their treatment assignment and health outcomes. Functionally impaired patients may be less able to visit a physician or pharmacy; therefore, such patients may be less likely to collect prescriptions and receive preventive health care services. 8 This phenomenon could exaggerate the benefit of prescription medications, vaccines, and screening tests. 8

## Environmental and Social Factors

Access to health care : Within large populations analyzed in multi-use health care databases, patients may vary substantially in their ability to access health care. Patients living in rural areas, for example, may have to drive long distances to receive specialized care. 8 Other patients face different obstacles to accessing health care, such as cultural factors (e.g., trust in the medical system), economic factors (e.g., ability to pay), and institutional factors (e.g., prior authorization programs, restrictive formularies), all of which may have some direct or indirect relation to treatment and study outcomes. 8

## Intermediate Variables

An intermediate variable is generally thought of as a post-treatment variable influenced by treatment that may or may not lie on the causal pathway between the treatment and the outcome. Figures 7.4 and 7.5 illustrate variables affected by treatment. In Figure 7.4, C 0 is a baseline confounder and must be adjusted for, but a subsequent measurement of the variable at a later time ( C 1 ) is on the causal pathway between treatment and outcome. For example, consider the study previously described comparing classes of antihypertensive medications ( A 0 ) on the risk for cardiovascular events ( Y 1 ). The baseline measure of blood pressure is represented by C 0. Blood pressure measured after treatment is initiated, with adequate time for the treatment to reach therapeutic effectiveness and before the outcome assessment, is considered an intermediate variable and is represented by C 1 in Figure 7.4. When the goal of CER is to estimate the total causal effect of the treatment on the outcome, adjustment for variables on the causal pathway between treatment and outcome, such as blood pressure after treatment is initiated ( C 1 ), is unnecessary and is likely to induce bias 2 toward a relative risk of 1.0, though the bias can sometimes be in the opposite direction. The magnitude of bias is greatest if the primary mechanism of action is through the intermediate pathway. Thus, it would be incorrect to adjust for blood pressure measured after the treatment was initiated ( C 1 ), because most of the medication's effects on cardiovascular disease are mediated through improvements in blood pressure. This kind of overadjustment would mask the antihypertensive effect of the treatment A 0.

Figure 7.4. A causal graph representing an intermediate causal pathway. Blood pressure after treatment initiation ( C 1 ) is on the causal pathway between antihypertensive treatment ( A 0 ) and cardiovascular events ( Y 1 ). Baseline blood pressure ( C 0 ) is a measured confounder ...

Figure 7.5. A causal diagram illustrating the problem of adjustment for the intermediate variable, low birth weight ( M 1 ), when evaluating the causal effect of maternal smoking ( A 0 ) on infant mortality ( Y 1 ) after adjustment for measured baseline confounders ( C 0 ) between ...
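The masking effect of adjusting for post-treatment blood pressure can be sketched with a small simulation. In the hypothetical linear model below, treatment is randomized, it lowers post-treatment blood pressure, and the outcome (a continuous risk score) depends on treatment only through blood pressure. The coefficient of treatment in a model that also includes blood pressure is computed by regressing the outcome on the part of treatment not explained by blood pressure (the Frisch-Waugh-Lovell result). All numbers and names are assumptions for illustration.

```python
import random

random.seed(2)
n = 50_000

def slope(x, y):
    """Slope from a simple linear regression of y on x."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    return sxy / sxx

def residuals(x, y):
    """Residuals of y after a simple linear regression on x."""
    b = slope(x, y)
    mx, my = sum(x) / len(x), sum(y) / len(y)
    return [yi - (my + b * (xi - mx)) for xi, yi in zip(x, y)]

a0 = [random.randint(0, 1) for _ in range(n)]             # treatment A0
c1 = [140 - 10 * a + random.gauss(0, 10) for a in a0]     # post-treatment BP C1
y1 = [0.05 * c + random.gauss(0, 1) for c in c1]          # outcome Y1, via BP only

total_effect = slope(a0, y1)                     # Y1 ~ A0
adjusted_effect = slope(residuals(c1, a0), y1)   # coefficient of A0 in Y1 ~ A0 + C1

print(round(total_effect, 2))     # near -0.5, the true total effect
print(round(adjusted_effect, 2))  # near 0: the effect is adjusted away
```

Because the whole effect runs through blood pressure in this setup, conditioning on it makes an effective treatment appear to do nothing.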

Pharmacoepidemiological studies that do not restrict analyses to incident episodes of treatments are subject to this type of overadjustment. Measurement of clinical covariates such as blood pressure at the time of registry enrollment rather than at the time of treatment initiation in an established medication user is such an example. For such patients, a true baseline measurement is unobtainable. The clinical variables for established users at the time of enrollment have already been influenced by investigational treatments and are considered intermediate variables rather than baseline confounders. The ability to adequately adjust for baseline confounders and not intermediate variables is one reason the new user design described in chapter 2 is so highly valued.

Investigators are sometimes interested in separating total causal effects into direct and indirect effects. In mediation analysis, the investigator intentionally measures and adjusts intermediate variables to estimate direct and indirect effects. Mediation analysis requires a stronger set of identifiability assumptions and is discussed in several articles. 19 - 33

When conditioning on an intermediate, biases can also arise for “direct effects” if the intermediate is a common effect of the exposure and an unmeasured variable that influences the outcome as in Figure 7.5 . The “birth-weight paradox” is one of the better known clinical examples of this phenomenon. 27 , 32 , 34 Maternal smoking seems to have a protective effect on infant mortality in infants with the lowest birth weight. The seemingly protective effect of maternal smoking is a predictable association produced from conditioning on an intermediate without adequate control for confounding between the low birth weight (intermediate) and infant mortality (outcome). This is illustrated in Figure 7.5 . The problem of conditioning on a common effect of two variables will be further discussed below in the section on colliders.

## Time-Varying Confounding

The intention-to-treat analogue of a randomized trial, where subjects are assigned to the treatment they are first exposed to regardless of discontinuation or switching treatments, may not be the optimal design for all nonexperimental CER. Researchers interested in comparing adverse effects of medications that are thought to occur only in proximity to using the medication may, for example, want to censor subjects who discontinue treatment. This type of design is described as a “per protocol” analysis. An “as treated” analysis allows subjects to switch treatment groups on the basis of their use of treatment. Both the “as treated” and “per protocol” analysis can be used to evaluate time-varying treatment.

In a nonexperimental setting, time-varying treatments are expected to have time-varying confounders. For example, if we are interested in comparing cardiovascular events between subjects who are completely adherent to CCBs versus completely adherent to diuretics, then we may consider a time-varying treatment design where subjects are censored when they discontinue the treatment to which they were first assigned (as illustrated in Figure 7.6 ). If joint predictors of compliance and the outcome are present, then some sort of adjustment for the time-varying predictors must be made. Standard adjustment methods may not produce unbiased effects when the predictors of adherence and the outcome are affected by prior adherence, and a newer class of causal effect estimators, such as inverse-probability-of-treatment weights or g-estimation, may be warranted. 18 , 35

Figure 7.6. A simplified causal graph illustrating adherence to initial antihypertensive therapy as a time-varying treatment ( A 0 , A 1 ), joint predictors of treatment adherence and the outcome ( C 0 , C 1 ). The unmeasured variable ( U1 ) indicates this is a nonexperimental ...
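As a rough illustration of the weighting idea mentioned above, the Python sketch below applies inverse-probability-of-treatment weights in the simplest possible case: one treatment decision at a single time point and one measured binary confounder. It is not the time-varying estimator that the chapter refers to, which requires weights built over successive time points. The data-generating values and names are hypothetical.

```python
import random

random.seed(5)
n = 50_000
rows = []
for _ in range(n):
    c0 = random.random() < 0.5                        # measured confounder
    a0 = random.random() < (0.8 if c0 else 0.2)       # treatment depends on C0
    y1 = 1.0 * a0 + 2.0 * c0 + random.gauss(0, 1)     # true treatment effect = 1.0
    rows.append((c0, a0, y1))

# Propensity P(A0 = 1 | C0), estimated within each level of C0.
prop = {}
for level in (False, True):
    stratum = [a for c, a, _ in rows if c == level]
    prop[level] = sum(stratum) / len(stratum)

def plain_mean(treated):
    ys = [y for _, a, y in rows if a == treated]
    return sum(ys) / len(ys)

def weighted_mean(treated):
    """Outcome mean in one arm, weighting each subject by 1 / P(received arm | C0)."""
    num = den = 0.0
    for c, a, y in rows:
        if a == treated:
            w = 1.0 / (prop[c] if treated else 1.0 - prop[c])
            num += w * y
            den += w
    return num / den

crude = plain_mean(True) - plain_mean(False)
iptw = weighted_mean(True) - weighted_mean(False)

print(round(crude, 2))  # about 2.2: confounded by C0
print(round(iptw, 2))   # about 1.0: close to the true effect
```

The weights create a pseudo-population in which treatment no longer depends on the confounder, so a simple difference in weighted means recovers the effect. This works here only because the single confounder is fully measured.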

## Collider Variables

Colliders are the result of two independent causes having a common effect. When we include a common effect of two independent causes in our statistical model, the previously independent causes become associated, thus opening a backdoor path between the treatment and outcome. This phenomenon can be explained intuitively if we think of two causes of a lawn being wet: the sprinklers being on, and rain. If we know the lawn is wet, and we know the value of one of the causes (it is not raining), then we can predict the value of the other (the sprinklers must be on). Therefore, conditioning on a common effect induces an association between two previously independent causes, that is, sprinklers being on and rain.
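The sprinkler example is easy to check by simulation. In the Python sketch below, rain and sprinkler use are generated independently, and the lawn is wet if either occurs. The probabilities are arbitrary assumptions.

```python
import random

random.seed(3)
n = 100_000
days = []
for _ in range(n):
    rain = random.random() < 0.3
    sprinkler = random.random() < 0.4     # generated independently of rain
    wet = rain or sprinkler               # common effect (collider)
    days.append((rain, sprinkler, wet))

def p_sprinkler(records):
    """Proportion of the given days on which the sprinkler was on."""
    records = list(records)
    return sum(s for _, s, _ in records) / len(records)

# Unconditionally, rain tells us nothing about the sprinkler.
p_given_rain = p_sprinkler(d for d in days if d[0])
p_given_dry = p_sprinkler(d for d in days if not d[0])

# Among wet lawns only, the two causes become associated.
p_wet_rain = p_sprinkler(d for d in days if d[2] and d[0])
p_wet_dry = p_sprinkler(d for d in days if d[2] and not d[0])

print(round(p_given_rain, 2), round(p_given_dry, 2))  # both about 0.4
print(round(p_wet_rain, 2), p_wet_dry)                # about 0.4 versus exactly 1.0
```

Across all days the sprinkler is on about 40% of the time whether or not it rains. Restricting to wet lawns, a dry sky implies the sprinkler was certainly on, so conditioning on the collider has created a dependence that did not exist before.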

Bias resulting from conditioning on a collider when attempting to remove confounding by covariate adjustment is referred to as M -collider bias. 36 Pure pretreatment M -type structures that statistically behave like confounders may be rare; nevertheless, any time we condition on a variable that is not a direct cause of either the treatment or outcome but merely associated with the two, we have the potential to introduce M -bias. 37

A hypothetical example of how two independent variables can become conditionally associated and increase bias follows. Consider a highly simplified hypothetical study to compare rates of acute liver failure between new users of CCB and diuretics using administrative data from a distributed network of managed care organizations. As illustrated in Figure 7.7 , if some of the managed care organizations had a formulary policy ( U1 ) that caused a lower proportion of patients to be initiated on a CCB ( A 0 ), and that same policy reduced the chance of receiving medical treatment for erectile dysfunction ( F 0 ), and patients with a long history of unmeasured alcohol abuse ( U2 ) are more likely to receive treatment for erectile dysfunction ( F 0 ), then adjustment for erectile dysfunction treatment may introduce bias by generating an association and opening a backdoor path that did not previously exist between formulary policy ( U1 ) and alcohol abuse ( U2 ).

Figure 7.7. Hypothetical causal diagram illustrating M -type collider stratification bias. Formulary policy ( U1 ) influences treatment with CCB ( A 0 ) and treatment for erectile dysfunction ( F 0 ). Unmeasured alcohol use ( U2 ) influences impotence and erectile dysfunction ...

Although conditioning on a common effect of two variables can induce an association between two otherwise independent variables, we currently lack many compelling examples of pure M -bias for pretreatment covariates. Such structures do, however, arise more commonly in the analysis of social network data. 38 Compelling examples of collider stratification bias (i.e., selection bias) do exist when conditioning on variables affected by treatment (as illustrated in Figure 7.5 ). Collider stratification bias can give rise to other biases in case-control studies and studies with time-varying treatments and confounding. 39

## Instrumental Variables

An instrumental variable is a pretreatment variable that is a cause of treatment but has no causal association with the outcome other than through its effect on treatment, such as Z 0 in Figure 7.8 . When treatment has an effect on the outcome, an instrumental variable will be associated with both treatment and the outcome, and can thus statistically appear to be a confounder. An instrumental variable will also be associated with the outcome even when conditioning on the treatment variable whenever there is an unmeasured common cause of the treatment and the outcome. It has been established that inclusion in statistical models of variables strongly associated with treatment ( A 0 ) but not independently associated with the outcome ( Y 1 ) will increase the standard error and decrease the precision of the treatment effect estimate. 2 , 4 , 40 - 41 It is less well known, however, that the inclusion of such instrumental variables in statistical models intended to remove confounding can increase the bias of an estimated treatment effect. The bias produced by the inclusion of such variables has been termed “ Z -bias,” as Z is often used to denote an instrumental variable. 8

Bias is amplified ( Z -bias) when an instrumental variable ( Z 0 ) is added to a model with unmeasured confounders ( U1 ).

Z -bias arises when the variable set is insufficient to remove all confounding, and for this reason Z -bias has been described as bias-amplification. 42 - 43 Figure 7.8 illustrates a data-generating process where unmeasured confounding exists along with an instrumental variable. In this situation, the variation in treatment ( A 0 ) can be partitioned into three components: the variation explained by the instrument ( Z 0 ), the variation explained by U1 , and the unexplained variation. The magnitude of unmeasured confounding is determined by the proportion of variation explained by U1 , along with the association between U1 and Y 1 . When Z 0 is statistically adjusted, one source of variation in A 0 is removed making the variation explained by U1 a larger proportion of the remaining variation. This is what amplifies the residual confounding bias. 44
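The amplification described above can be illustrated with a small simulation. The following is a minimal sketch under assumed linear relationships, not an analysis from the chapter: the coefficients `a_z`, `a_u`, `g`, and `beta`, and the function names, are hypothetical choices made for illustration. It compares the bias of the estimated treatment effect with and without adjustment for the instrument when the confounder U1 is left unmeasured.

```python
import random


def slope(x, y):
    """OLS slope of y on x (with an intercept)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    return sxy / sxx


def residualize(x, z):
    """Residuals of x after an OLS regression of x on z (with an intercept)."""
    b = slope(z, x)
    n = len(x)
    mx, mz = sum(x) / n, sum(z) / n
    return [xi - mx - b * (zi - mz) for xi, zi in zip(x, z)]


def simulate_bias(n=20_000, beta=1.0, a_z=2.0, a_u=1.0, g=1.0, seed=0):
    """Return (bias without adjusting for Z, bias after adjusting for Z).

    Assumed data-generating process (hypothetical values):
      A = a_z * Z + a_u * U + noise   (Z is the instrument, U is unmeasured)
      Y = beta * A + g * U + noise    (beta is the true treatment effect)
    """
    rng = random.Random(seed)
    z = [rng.gauss(0, 1) for _ in range(n)]
    u = [rng.gauss(0, 1) for _ in range(n)]
    a = [a_z * zi + a_u * ui + rng.gauss(0, 1) for zi, ui in zip(z, u)]
    y = [beta * ai + g * ui + rng.gauss(0, 1) for ai, ui in zip(a, u)]

    # Model 1: regress Y on A only (U is unmeasured, Z is ignored).
    unadjusted = slope(a, y)
    # Model 2: regress Y on A and Z. By the Frisch-Waugh-Lovell theorem the
    # coefficient on A equals the slope of Y on the part of A not explained by Z.
    adjusted = slope(residualize(a, z), y)
    return unadjusted - beta, adjusted - beta


if __name__ == "__main__":
    bias_without_z, bias_with_z = simulate_bias()
    print(f"bias ignoring Z:      {bias_without_z:.3f}")
    print(f"bias adjusting for Z: {bias_with_z:.3f}")
```

Under these assumed values, removing the variation in A 0 explained by Z 0 leaves U1 accounting for a larger share of the remaining variation, so the residual confounding bias grows rather than shrinks.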

Any plausible instrumental variable can potentially introduce Z -bias in the presence of uncontrolled confounding. Indication for treatment was found to be a strong instrument 45 and provider and ecologic causes of variation in treatment choice have been proposed as potential instrumental variables that may amplify bias in nonexperimental CER. 8

A simulation study evaluating the impact of adjusting for instruments of varying strength in the presence of uncontrolled confounding demonstrated that the impact of adjusting for instrumental variables was small in certain situations, a result that led the authors to suggest that over-adjustment is less of a concern than under-adjustment. Analytic formulae, on the other hand, indicate that this bias may be quite large, especially when dealing with multiple instruments. 42 We have discussed bias amplification due to adjusting for instrumental variables. Instrumental variables can, however, also be employed as an alternative strategy to deal with unmeasured confounding. 46 This strategy is discussed in detail in chapter 10 .

We have presented multiple types of variable structures, with a focus on variables that either remove or increase bias when adjusted. The dilemma is that many of these variable types statistically behave like confounders, which are the only structural type needing adjustment to estimate the average causal effect of treatment. 47 - 48 For this reason, researchers should be hesitant to rely on statistical associations alone to select variables for adjustment. The variable structure must be considered when attempting to remove bias through statistical adjustment.

## Proxy, Mismeasured, and Unmeasured Confounders

It is not uncommon for a researcher to be aware of an important confounding variable and to lack data on that variable. A measured proxy can sometimes stand in for an unmeasured confounder. For example, use of oxygen canisters could be a proxy for failing health and functional impairment; use of preventive services, such as a flu shot, is sometimes thought to serve as a proxy for healthy behavior and treatment adherence. Likewise, important confounders sometimes are measured with error. For example, self-reported body mass index will often be subject to underreporting.

Researchers routinely adjust analyses using proxy confounders and mismeasured confounders. Adjusting for a proxy or mismeasured confounder will reduce bias relative to the unadjusted estimate, provided the effects of the confounder on the treatment and the outcome are “monotonic.” 48 In other words, any increase in the confounder should on average always affect treatment in the same direction, and should always affect the outcome in the same direction for both the treated and untreated groups. If an increase in the confounder increased the outcome for the treated group and decreased the outcome for the untreated group, then adjustment for the proxy or mismeasured confounder could increase bias. Unfortunately, there are cases, even when the measurement error of the confounder is nondifferential (i.e., does not depend on treatment or outcome), where adjustment for proxy or mismeasured confounders can increase, rather than decrease, bias. 49

Another common problem in trying to estimate causal effects is that of unmeasured confounding. Sensitivity analysis techniques have been developed to address misclassified and unmeasured confounding. The reader is referred to chapter 11 for further discussion of sensitivity analyses.

## Selection of Variables To Control Confounding

We present two general approaches to selecting variables in order to control confounding in nonexperimental CER. The first approach selects variables on the basis of background knowledge about the relationship of the variable to treatment and outcome. The second approach relies primarily on statistical associations to select variables for control of confounding, using what can be described as high-dimensional automatic variable selection techniques. The use of background knowledge and causal graph theory is strongly recommended when there is sufficient knowledge of the causal structure of the variables. Sufficient knowledge, however, is likely rare when conducting studies across a wide geography and many providers and institutions. For this reason, we also present practical approaches to variable selection that empirically select variables on the basis of statistical associations.

## Variable Selection Based on Background Knowledge

Causal graph theory.

Assuming a well-defined fixed treatment employing an intention-to-treat paradigm, and assuming that no set of covariates predicts treatment assignment with 100 percent accuracy, control of confounding is all that is needed to estimate causal effects with nonexperimental data. 47 - 48 The problem, as described above, is that colliders, intermediate variables, and instruments can all statistically behave like confounders. For this reason, an understanding of the causal structure of variables is required to separate confounders from other potential bias-inducing variables. This dilemma has led many influential epidemiologists to take a strong position for selecting variables for control on the basis of background knowledge of the causal structure connecting treatment to outcome. 50 - 54

When sufficient knowledge is available to construct a causal graph, a graphical analysis of the structural basis for evaluating confounding is the most robust approach to selecting variables for adjustment. The goal is to use the graph to identify a sufficient set of variables to achieve unconfoundedness, sometimes also called conditional exchangeability. 24 , 55 The researchers specify background causal assumptions using causal graph criteria (see supplement 2 of this User's Guide ). If the graph is correct, it can be used to identify a sufficient set of covariates ( C ) for estimating an effect of treatment ( A 0 ) on the outcome ( Y 1 ). A set C is sufficient when no variable in C is a descendant of A 0 and C blocks every open path between A 0 and Y 1 that contains an arrow into A 0 . Control of confounding using graphical criteria is usually described as control through the “back-door” criterion, the idea being that variables that influence treatment assignment (that is, variables that have arrows pointing to treatment assignment) provide backdoor paths between A 0 and Y 1 . It is the open back-door pathways that generate dependencies between A 0 and Y 1 and can produce spurious associations when no causal effect of A 0 on Y 1 is present, and that alter the magnitude of the association when A 0 causally affects Y 1 .
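As a worked illustration of what a sufficient set buys the analyst, the standard back-door adjustment (standardization) formula from the causal graph literature can be written as follows. The symbols A 0 , Y 1 , and C follow the text above; the "do" notation for setting treatment to a value a, and the assumption that C is discrete, are added here for illustration and are not taken from the chapter.

```latex
P\bigl(Y_1 = y \mid \mathrm{do}(A_0 = a)\bigr)
  \;=\; \sum_{c} P\bigl(Y_1 = y \mid A_0 = a,\; C = c\bigr)\, P(C = c)
```

In words: if C satisfies the back-door criterion, the outcome distribution under an intervention on treatment can be recovered by averaging the observed stratum-specific outcome distributions over the marginal distribution of C, rather than over the distribution of C among those who happened to receive treatment a.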

Although it is quite technical, causal graph theory has formalized the theoretical justification for variable selection, added precision to our understanding of bias due to under- and over-adjustment, and unveiled problems with historical notions of statistical confounding. The main limitation of causal graph theory is that it presumes that the causal network is known and that the only unknown is the magnitude of the causal contrast between A 0 and Y 1 being examined. In practice, where observational studies include large multi-use databases spanning vast geographic regions, such complete knowledge of causal networks is unlikely. 56 - 57

Since we rarely know the true causal network that represents all common-cause pathways between treatment and outcome, investigators have proposed more practical variable selection approaches based on background knowledge when the causal structure is only partially known. These strategies include adjusting for all observed pretreatment variables thought to have some connection to the outcome, 58 all known risk factors for the outcome, 4 , 44 , 59 and all direct causes of the treatment or the outcome. 57 The benefits and limitations to each approach to removing confounding are briefly discussed.

## Adjustment for All Observed Pretreatment Covariates

Emphasis is often placed on the treatment assignment mechanism and on trying to reconstruct the hypothetical broken randomized experiment that led to the observational data. 58 Propensity score methods are often employed for this purpose and are discussed in chapter 10 ; they can be used in health care epidemiology to statistically control large numbers of variables when outcomes are infrequent. 60 , 61 Propensity scores are the probability of receiving treatment given the set of observed covariates. The probability of treatment is estimated conditional on a set of covariates and the predicted probability is then used as a balancing score or matching variable across treatment groups to estimate the treatment effect.
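The two steps described above, estimating the probability of treatment and then using it as a balancing score, can be sketched as follows. This is a minimal illustration on simulated data, with a single hypothetical binary confounder `c` and made-up effect sizes. The propensity score is estimated here as the observed treated fraction within each covariate pattern; in practice a regression model such as logistic regression is typically used instead.

```python
import random
from collections import defaultdict


def simulate(n=20_000, effect=2.0, seed=1):
    """Hypothetical data: binary confounder c raises both P(treatment) and the outcome."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        c = 1 if rng.random() < 0.5 else 0
        p_treat = 0.7 if c else 0.3
        a = 1 if rng.random() < p_treat else 0
        y = effect * a + 3.0 * c + rng.gauss(0, 1)
        rows.append((c, a, y))
    return rows


def mean(values):
    return sum(values) / len(values)


def propensity_scores(rows):
    """Estimate P(A = 1 | C = c) as the treated fraction within each covariate pattern."""
    treated, total = defaultdict(int), defaultdict(int)
    for c, a, _ in rows:
        total[c] += 1
        treated[c] += a
    return {c: treated[c] / total[c] for c in total}


def naive_difference(rows):
    """Crude treated-minus-untreated difference in mean outcome."""
    return mean([y for _, a, y in rows if a == 1]) - mean([y for _, a, y in rows if a == 0])


def stratified_difference(rows, ps):
    """Treated-minus-untreated difference within strata of equal estimated
    propensity score, averaged with weights proportional to stratum size."""
    strata = defaultdict(lambda: {0: [], 1: []})
    for c, a, y in rows:
        strata[ps[c]][a].append(y)
    n = len(rows)
    estimate = 0.0
    for groups in strata.values():
        size = len(groups[0]) + len(groups[1])
        estimate += (size / n) * (mean(groups[1]) - mean(groups[0]))
    return estimate


if __name__ == "__main__":
    data = simulate()
    scores = propensity_scores(data)
    print("naive difference:     ", round(naive_difference(data), 2))
    print("stratified difference:", round(stratified_difference(data, scores), 2))
```

In this constructed example the crude comparison mixes the treatment effect with the effect of `c`, while the comparison within propensity score strata recovers a value close to the simulated effect. That only holds because the single confounder is measured; the sketch says nothing about unmeasured confounding.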

The greatest importance is often placed on balancing all pretreatment covariates. However, when attempts are made to balance all pretreatment covariates, regardless of their structural form, biases, for example from including strong instruments and colliders, can result, 37 , 57 , 62 though, as noted above, in practice, pretreatment colliders are likely rarer than ordinary confounding variables.

## Adjustment for All Possible Risk Factors for the Outcome

Confounding pathways require common cause structures between the outcome and treatment. A common strategy for removing confounding without incidentally including strong instruments and colliders is to include in propensity score models only variables thought to be direct causes of the outcome, that is, risk factors. 4 , 59 , 63 This approach requires only background knowledge of causes of the outcome, and it does not require an understanding of the treatment assignment mechanism or how variables that influence treatment are related to risk factors for the outcome. This strategy, however, may fail to include measured variables that predict treatment assignment but have an unmeasured ancestor that is an outcome risk factor ( A 0 ← C 0 ← U1→Y 1 ) as illustrated in Figure 3. 57

## Disjunctive Cause Criterion

The main practical use of causal graphs is to ensure adjustment for confounders and avoid adjusting for known colliders. 51 In practice, one only needs to partly know the causal structure of variables relating treatment to the outcome. The disjunctive cause criterion is a formal statement of the conditions in which variable selection based on partial knowledge of the causal structure can remove confounding. 57 It states that all observed variables that are a cause of treatment, a cause of outcome, or a cause of both should be included for statistical adjustment. It can be shown that when any subset of observed variables is sufficient to control confounding, the set obtained by applying the disjunctive cause criterion will also constitute a sufficient set. 57 This approach requires more knowledge of the variables' relationship to the treatment and outcome than using all pretreatment covariates or all risk factors, but less knowledge than the back-door path criterion.

Whenever there exists some set of observed variables that blocks all back-door paths (even if the researcher does not know which subset this is), the disjunctive cause criterion, when applied correctly by the investigators, will identify a set of variables that also blocks all back-door paths. The other variable selection criteria based on all pretreatment covariates and risk factors do not have this property. 57 The approach performs well when the measured variables include some sufficient set, but presents problems when unmeasured confounding remains. In this case, conditioning on an instrument can amplify the bias due to unmeasured confounding. Thus, in practice, known instruments should be excluded before applying the criterion. The best approach to variable selection is less clear when unmeasured confounding may remain after statistical adjustment for measured variables, which is often expected in nonexperimental CER. In this case, every variable selection approach will result in bias. The focus would then be on minimizing bias, which requires thoughtful consideration of the tradeoff between over- and underadjustment. Strong arguments exist for erring on the side of overadjustment (adjusting for instruments and colliders) rather than failing to adjust for measured confounders (underadjustment). 36 , 44 Nevertheless, adjustments for instrumental variables have been found to amplify bias in practice. 45
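The selection rule described in this section can be written as a simple filter. The sketch below is illustrative only: the `Covariate` record, its field names, and the example variables are hypothetical, and the flags stand for the investigator's background knowledge rather than anything that can be computed from data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Covariate:
    """An observed pretreatment variable with the investigator's causal judgments."""
    name: str
    causes_treatment: bool
    causes_outcome: bool
    known_instrument: bool = False


def disjunctive_cause_set(covariates, drop_known_instruments=True):
    """Keep every variable judged to cause treatment, outcome, or both.

    Known instruments are dropped by default, following the advice that they
    should be excluded because they can amplify bias from unmeasured confounding.
    """
    selected = []
    for cov in covariates:
        if not (cov.causes_treatment or cov.causes_outcome):
            continue  # causes neither treatment nor outcome, for example a pure collider
        if drop_known_instruments and cov.known_instrument:
            continue
        selected.append(cov.name)
    return selected


# Hypothetical judgments for a study of drug initiation and liver failure.
EXAMPLE = [
    Covariate("age", causes_treatment=True, causes_outcome=True),
    Covariate("prior_liver_disease", causes_treatment=False, causes_outcome=True),
    Covariate("prescriber_preference", causes_treatment=True, causes_outcome=False,
              known_instrument=True),
    Covariate("erectile_dysfunction_treatment", causes_treatment=False,
              causes_outcome=False),
]

if __name__ == "__main__":
    print(disjunctive_cause_set(EXAMPLE))
```

The filter is trivial by design. The difficult part is filling in the flags, which is where the partial causal knowledge discussed above comes in.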

## Empirical Variable Selection Approaches

Historically, data for nonexperimental studies was primarily collected prospectively, and thoughtful planning was needed to ensure complete measurement of all important study variables. We now live in an era where every interaction between the patient and the health care system produces hundreds, if not thousands, of data points that are recorded for clinical and administrative purposes. 64 These large multi-use data sources are highly dimensional in that every disease, medication, laboratory result, and procedure code, along with any electronically accessible narrative statements, can be treated as variables.

The new challenge to the researcher is to select a set of variables from this high-dimensional space that characterizes the patient's baseline status at the time of treatment selection to enable identification of causal effects, or that at least produces the least biased estimates. Advances in computer performance and the availability of high-dimensional data have provided unprecedented opportunities to use data empirically to “learn” associational relationships. Empirical variable selection techniques include methods that use statistical associations with the treatment and/or outcome to identify a subset of variables from an original set chosen on the basis of background knowledge, as well as methods that are considered fully automated, where all variables are initially selected on the basis of statistical associations.

## Forward and Backward Selection Procedures

When using traditional regression it is not uncommon to use, for the purposes of covariate selection, what are sometimes called forward and backward selection procedures. Forward selection procedures begin with an empty set of covariates and then consider, for each covariate, whether it is associated with the outcome conditional on treatment (usually using a p-value cutoff in a regression model of 0.05 or 0.10). The variable that is most strongly associated with the outcome (based on having the smallest p-value below the cutoff) is then added to the collection of variables for which control will be made. Then the process begins again, and one considers whether each remaining covariate is associated with the outcome conditional on the treatment and the covariate already selected; the covariate that is most strongly associated is again added to the list. The process repeats until all remaining covariates are independent of the outcome conditional on the treatment and the covariates that have been previously selected for control.

Backward selection begins with all covariates in the model; then the investigator considers whether, for each covariate, that covariate is independent of the outcome conditional on the treatment and all other covariates (generally using a p-value cutoff in a regression model of 0.05 or 0.10). The covariate with the largest p-value above the cutoff is then discarded from the list of covariates for which control is made. The process begins again, and the investigator considers whether, for each covariate, that covariate is independent of the outcome, conditional on the treatment and the other covariates not yet discarded; the next covariate with the weakest association with the outcome based on p-value is again discarded. The process repeats itself until all variables still in the list are associated with the outcome conditional on the treatment and the other covariates that have not been discarded.
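The forward procedure described above can be sketched in a few dozen lines. This is a minimal illustration, not a recommended implementation: it assumes a linear outcome model, uses a large-sample normal approximation for the p-values (statistical packages would normally use a t distribution), and runs on simulated data in which the variable names `c1`, `noise1`, and `noise2` are hypothetical.

```python
import math
import random
from statistics import NormalDist


def invert(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination with pivoting."""
    k = len(matrix)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(k)]
           for i, row in enumerate(matrix)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(k):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    return [row[k:] for row in aug]


def ols(columns, y):
    """Regress y on the given columns plus an intercept; return (coefficients, standard errors)."""
    n = len(y)
    X = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(X[0])
    xtx = [[sum(row[r] * row[c] for row in X) for c in range(k)] for r in range(k)]
    xty = [sum(row[r] * yi for row, yi in zip(X, y)) for r in range(k)]
    inv = invert(xtx)
    beta = [sum(inv[r][c] * xty[c] for c in range(k)) for r in range(k)]
    resid = [yi - sum(b * x for b, x in zip(beta, row)) for row, yi in zip(X, y)]
    sigma2 = sum(e * e for e in resid) / (n - k)
    se = [math.sqrt(sigma2 * inv[r][r]) for r in range(k)]
    return beta, se


def forward_select(treatment, candidates, y, cutoff=0.05):
    """Add, one at a time, the candidate with the smallest p-value below the cutoff,
    conditional on treatment and the covariates already selected."""
    selected, remaining = [], dict(candidates)
    while remaining:
        p_values = {}
        for name, col in remaining.items():
            cols = [treatment] + [candidates[s] for s in selected] + [col]
            beta, se = ols(cols, y)
            z = abs(beta[-1] / se[-1])  # the candidate is the last column
            p_values[name] = 2 * (1 - NormalDist().cdf(z))
        best = min(p_values, key=p_values.get)
        if p_values[best] >= cutoff:
            break
        selected.append(best)
        del remaining[best]
    return selected


def make_data(n=2000, seed=0):
    """Hypothetical data: c1 affects treatment and outcome; the noise variables affect neither."""
    rng = random.Random(seed)
    c1 = [rng.gauss(0, 1) for _ in range(n)]
    noise1 = [rng.gauss(0, 1) for _ in range(n)]
    noise2 = [rng.gauss(0, 1) for _ in range(n)]
    a = [1.0 if ci + rng.gauss(0, 1) > 0 else 0.0 for ci in c1]
    y = [1.0 * ai + 2.0 * ci + rng.gauss(0, 1) for ai, ci in zip(a, c1)]
    return a, {"c1": c1, "noise1": noise1, "noise2": noise2}, y


if __name__ == "__main__":
    treatment, candidates, outcome = make_data()
    print(forward_select(treatment, candidates, outcome))
```

Because each noise variable has roughly a 5 percent chance of passing a 0.05 cutoff by chance alone, a run of this sketch may occasionally retain one of them. The procedure also has no way of telling an instrument or collider from a confounder.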

Provided that the original set of covariates with which one begins suffices for unconfoundedness of treatment effect estimates, then if the backward selection process correctly discards variables that are independent of the outcome conditional on the treatment and other covariates, the final set of covariates selected by the backward selection procedure will also suffice for conditional exchangeability. 57 Likewise, under an additional assumption of “faithfulness,” 57 the forward selection procedure will identify a set of covariates that suffices for unconfoundedness provided that the original set of covariates with which one begins suffices to achieve unconfoundedness and that the forward selection process correctly identifies the variables that are and are not independent of the outcome conditional on the treatment and other covariates. The forward and backward procedures can thus be useful for covariate reduction, but both of them suffer from the need to specify a starting set of covariates that suffices for unconfoundedness. Thus, even if an investigator intends to employ forward or backward selection procedures for covariate reduction, other approaches will be needed to decide on what set of covariates these forward and backward procedures should begin with. Moreover, when the initial set of covariates does not suffice for unconfoundedness, it is not clear how forward and backward selection procedures will perform. Variable selection procedures also suffer from the fact that estimates of treatment effects are made after having already used the data to decide on covariates.

Similar but more sophisticated approaches using machine learning algorithms such as boosting, random forest, and other ensemble methods have become increasingly common, as have sparsity-based methods such as LASSO, in dealing with high-dimensional data. 65 All of these empirically driven methods are limited, however, in that they are in general unable to distinguish between instruments, colliders, and intermediates on the one hand and genuine confounders on the other. Such differentiation needs to be made a priori on substantive grounds.

## Automatic High-Dimensional “Proxy” Adjustment

In an attempt to capture important proxies for unmeasured confounders, Schneeweiss and colleagues proposed an algorithm that creates a very large set of empirically defined variables from health care utilization data. 56 The created variables capture the frequency of codes for procedures, diagnoses, and medication fills during a pre-exposure period. The variables created by the algorithm are required to have a minimum prevalence in the source population and to have some marginal association with both treatment and outcome. After they are defined, the variables can be entered into a propensity score model. In several example studies where the true effect of a treatment was approximately known from randomized controlled trials, the algorithm appeared to perform as well as or better than approaches based on simply adjusting for an a priori set of variables. 45 , 66 By defining variables prior to treatment, propensity score methods will not “over-adjust” by including causal intermediates. Using statistical associations to select potential confounders can result in selection and adjustment of colliders and instruments. Therefore, the analyst should attempt to remove such variables from the set of identified variables. For example, variables that are strong predictors of treatment but have no obvious relation to the outcome should be considered potential sources of Z -bias.

## A Practical Approach Combining Causal Analysis With Empirical Selection

There is a continuum between knowing and not knowing the causal, structural relations of variables. We suggest that a practical approach to variable selection may involve a combination of (1) a priori variable selection based on the researcher's knowledge of causal relationships together with (2) empirical selection using the high-dimensional approach described above. 8 The empirical approach could be applied to a set of variables chosen a priori on the basis of the researcher's knowledge, in order to select those ultimately included in the analysis. This more limited use of empirically derived variables may reduce confounding while simultaneously reducing the risk of including variables that could increase bias.

In practice, the particular approach that one adopts for observational research will depend on the researcher's knowledge, the data quality, and the number of covariates. A deep understanding of the specific clinical and public health risks and opportunities that lie behind the research question often drives these decisions.

Regardless of the strategy employed, researchers should clearly describe how variables are measured and provide a rationale for a priori selection of potential confounders, ideally in the form of a causal graph. If the researchers decide to further eliminate variables using an empiric variable selection technique, then they should present both models and describe what criteria were used to determine inclusion and exclusion. Researchers should consider whether or not they believe adequate measurement is available in the dataset when employing a specific variable selection strategy. In addition, all variables included for adjustment should be listed in the manuscript or final report. When empirical selection procedures are newly developed or modified, researchers are encouraged to make the protocol and code publicly available to improve transparency and reproducibility.

Even when researchers use the methods we describe in this chapter, confounding can persist. Sensitivity analysis techniques are useful for assessing residual confounding resulting from unmeasured and imperfectly measured variables. 67 - 75 Sensitivity analysis techniques assess the extent to which an unmeasured variable would have to be related to the treatment and outcome of interest in order to substantially change the conclusions drawn about causal effects. We refer the reader to chapter 11 for discussion of sensitivity analysis techniques.

## Checklist: Guidance and key considerations for covariate selection in CER protocols


Developing a Protocol for Observational Comparative Effectiveness Research: A User’s Guide is copyrighted by the Agency for Healthcare Research and Quality (AHRQ). The product and its contents may be used and incorporated into other materials on the following three conditions: (1) the contents are not changed in any way (including covers and front matter), (2) no fee is charged by the reproducer of the product or its contents for its use, and (3) the user obtains permission from the copyright holders identified therein for materials noted as copyrighted by others. The product may not be sold for profit or incorporated into any profitmaking venture without the expressed written permission of AHRQ.

- Cite this Page Sauer B, Brookhart MA, Roy JA, et al. Covariate Selection. In: Velentgas P, Dreyer NA, Nourjah P, et al., editors. Developing a Protocol for Observational Comparative Effectiveness Research: A User's Guide. Rockville (MD): Agency for Healthcare Research and Quality (US); 2013 Jan. Chapter 7.

- Review Article
- Epidemiology and Global Health

## Misstatements, misperceptions, and mistakes in controlling for covariates in observational research

- David A Fluharty
- Luis M Mestre
- Danny Valdez
- Carmen D Tekwe
- Colby J Vorland
- Yasaman Jamshidi-Naeini
- Sy Han Chiou
- Stella T Lartey
- Department of Epidemiology and Biostatistics, Indiana University School of Public Health-Bloomington, United States ;
- Department of Applied Health Science, Indiana University School of Public Health-Bloomington, United States ;
- Department of Statistics and Data Science, Southern Methodist University, United States ;
- University of Memphis, School of Public Health, United States ;

- Roger S Zoh
- David B Allison

## Introduction


We discuss 12 misperceptions, misstatements, or mistakes concerning the use of covariates in observational or nonrandomized research. Additionally, we offer advice to help investigators, editors, reviewers, and readers make more informed decisions about conducting and interpreting research where the influence of covariates may be at issue. We primarily address misperceptions in the context of statistical management of the covariates through various forms of modeling, although we also emphasize design and model or variable selection. Other approaches to addressing the effects of covariates, including matching, have logical extensions from what we discuss here but are not dwelled upon heavily. The misperceptions, misstatements, or mistakes we discuss include accurate representation of covariates, effects of measurement error, overreliance on covariate categorization, underestimation of power loss when controlling for covariates, misinterpretation of significance in statistical models, and misconceptions about confounding variables, selecting on a collider, and p value interpretations in covariate-inclusive analyses. This condensed overview serves to correct common errors and improve research quality in general and in nutrition research specifically.

In observational or nonrandomized research, it is common and often wise to control for certain variables in statistical models. Such variables are often referred to as covariates . Covariates may be controlled through multiple means, such as inclusion on the ‘right-hand side’ or ‘predictor side’ of a statistical model, matching, propensity score analysis, and other methods ( Cochran and Rubin, 1961 ; Streeter et al., 2017 ). Authors of observational research reports will frequently state that they controlled for a particular covariate and, therefore, that bias due to (often phrased as ‘confounding by’) that covariate is not present ( Box 1 ). However, authors may write ‘We controlled for…’ when in fact they did not because of common misstatements, misperceptions, and mistakes in controlling for covariates in observational research.

“Since both tickets had an equal probability of winning the same payoff, uncertainty about the true value of the goods exchanged could not confound results.” ( Arlen and Tontrup, 2015 )

“In our study population, NSAIDs other than Aspirin was not associated to PC risk and, therefore, could not confound result.” ( Perron et al., 2004 )

“Most of the demographic, social, and economic differences between patients in different countries were not associated significantly with acquired drug resistance and, therefore, could not confound the association.” ( Cegielski et al., 2014 )

“Furthermore, study time, as well as self-expectation regarding educational achievements (another potential confounder), could be controlled in the IV models. Therefore, these potential channels could not confound our analysis.” ( Shih and Lin, 2017 )

Herein, we describe these multiple misperceptions, misstatements, and mistakes involving the use of covariates or control variables. We discuss misperceptions that, in our collective years of experience as authors, reviewers, editors, and readers in areas including but not limited to aging and geroscience, obesity, nutrition, statistical teaching, cancer research, behavioral science, and other life-science domains, we have observed to prevail in the literatures of these fields. Determining the frequency with which these misperceptions are held would require very extensive and rigorous survey research. Instead, we offer them as those we find pertinent, and readers may decide for themselves which they wish to study. Some terms we use are defined in the Glossary in Box 2 . Because of the critical role of attempting to minimize or eliminate biases in association and effect estimates derived from observational research, as recently pointed out elsewhere ( Brown et al., 2023 ), we primarily focus on misperceptions, misstatements, or mistakes leading to decisions about whether and how to control for a covariate that fail to actually control for and minimize or eliminate the possibility of bias. We also consider other errors ( Brenner and Loomis, 1994 ) in implementation, interpretation, and understanding around analyses that involve covariate adjustment.

## Terminology/Glossary

Bias. Here, we define bias as either bias in coefficients in a model or bias in frequentist statistical significance tests. Frequentist statistical significance tests, or the ordinary tests of statistical significance using p values, are commonly reported in this journal and are described more fully elsewhere ( Mayo and Cox, 2006 ). Under the null hypothesis that there is no true association or effect to detect in a situation, a proper unbiased frequentist test of statistical significance with continuous data and a continuous test statistic yields a uniform (i.e. rectangular) sampling distribution of p values on the interval from zero to one. The distribution is such that the probability of observing any p value less than or equal to α, where α is the preset statistical significance level (most often 0.05), is α itself. Any statistical significance test that does not meet this standard can be said to be biased. With respect to coefficients or parameter estimates, we can say that bias is equal to the expected value of the coefficient or parameter estimate minus the actual value of the parameter or quantity to be estimated. In an unbiased estimation procedure, that quantity will be zero, meaning that the expected value of the estimate is equivalent to the value to be estimated.
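Both senses of bias defined here can be checked by simulation. The sketch below is illustrative only: the two-sided z-test with known standard deviation, the sample size, and the number of simulated studies are assumptions chosen for simplicity, not settings taken from any cited work.

```python
import math
import numpy as np

rng = np.random.default_rng(0)
n_sims, n, alpha = 20_000, 30, 0.05  # assumed simulation settings

# Each row is one simulated study under the null: true mean 0, known SD 1.
samples = rng.normal(loc=0.0, scale=1.0, size=(n_sims, n))
sample_means = samples.mean(axis=1)

# Two-sided z-test of H0: mean = 0, so z ~ N(0, 1) when H0 is true.
z = sample_means * math.sqrt(n)
p_values = np.array([math.erfc(abs(v) / math.sqrt(2.0)) for v in z])

# Unbiased test: the share of p values <= alpha should be close to alpha.
rejection_rate = float(np.mean(p_values <= alpha))

# Unbiased estimator: mean estimate minus the true value (0) should be near 0.
estimator_bias = float(sample_means.mean() - 0.0)

# Uniformity: each tenth of [0, 1] should hold about 10% of the p values.
decile_shares = np.histogram(p_values, bins=10, range=(0.0, 1.0))[0] / n_sims

print(rejection_rate, estimator_bias, decile_shares.round(3))
```

A test whose rejection rate under the null departs materially from α, or an estimator whose average departs from the true value, would be biased in the sense used in this glossary.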

Replicability. The National Academies of Sciences uses the following working definition for replicability: “Obtaining consistent results across studies aimed at answering the same scientific question, each of which has obtained its own data” ( National Academies of Sciences, Engineering, and Medicine, 2019 ).

Reproducibility. The National Academies of Sciences uses the following working definition for reproducibility: “Obtaining consistent results using the same input data; computational steps, methods, and code; and conditions of analysis. This definition is synonymous with ‘computational reproducibility’” ( National Academies of Sciences, Engineering, and Medicine, 2019 ). Disqualifying reproducibility criteria include nonpublic data and code, inadequate record keeping, nontransparent reporting, obsolescence of the digital artifacts, flawed attempts to reproduce others’ research, and barriers in the culture of research ( National Academies of Sciences, Engineering, and Medicine, 2019 ).

Confounder. There are many definitions of confounder and not all are equivalent. One definition is “(…) A pre-exposure covariate C [can] be considered a confounder for the effect of A on Y if there exists a set of covariates X such that the effect of the exposure on the outcome is unconfounded conditional on (X, C) but for no proper subset of (X, C) is the effect of the exposure on the outcome unconfounded given the subset. Equivalently, a confounder is a member of a minimally sufficient adjustment set” ( VanderWeele and Shpitser, 2013 ).

Collider . “A collider for a certain pair of variables is any variable that is causally influenced by both of them” ( Rohrer, 2018 ).

Covariate . We utilize the word covariate to indicate a variable which could, in principle, be included in a statistical model assessing the relations between an independent variable (IV) and a dependent variable (DV).

Residual. The difference between the observed and fitted value of the outcome ( Bewick et al., 2003 ).

Independent Variable. “Independent variables (IVs) generally refer to the presumed causes that are deliberately manipulated by experimenters” ( Chen and Krauss, 2005 ) or observed in non-interventional research.

Dependent Variable. “Dependent variables (DVs) are viewed as outcomes that are affected by the independent variables” ( Chen and Krauss, 2005 ).

Association. Two variables are associated when they are not independent, i.e., when the distribution of one of the variables depends on the level of the other variable ( Hernán, 2004 ).

Related. We say that two variables are related when the distribution of one variable depends on the level of the other variable. In this context, we use the words ‘related’, ‘associated’, and ‘dependent’ interchangeably, as the complement of ‘independent’ ( Dawid, 1979 ).

Causal effect : “A difference between the counterfactual risk of the outcome had everybody in the population of interest been exposed and the counterfactual risk of the outcome had everybody in the population been unexposed” ( Hernán, 2004 ).

Statistical Model : A model used to represent the data-generating process embodying a set of assumptions, and including the uncertainties about the model itself ( Cox, 2006 ).

Precision : How dispersed the measurements are between each other ( ISO, 1994 ).

Mediator: Variable that is on the causal pathway from the exposure to outcome ( VanderWeele and Vansteelandt, 2014 ).

*We have used some definitions as phrased in this glossary in some of our other manuscripts currently under review, published, or in-press articles.

We sometimes use the words confound , confounder , confounding , and other variants by convention or for consistency with the literature we are citing. However, because of the difficulty and inconsistency in defining confounding ( Pearl and Mackenzie, 2018 ), we will minimize such use and try to refer primarily to potentially biasing covariates (PBCs). We define PBCs as variables other than the independent variable (IV) or dependent variable (DV) for which decisions about whether and how to condition on them, including by incorporation into a statistical model, can affect the extent to which the expected value of the estimated association of the IV with the DV deviates from the causal effect of the IV on the DV.

Simply because we believe an observed variable (e.g. highest educational degree earned) is a measure of a construct (e.g. socioeconomic status), it does not mean that the observed variable accurately measures that construct or that it has sufficient validity for the elimination of it as a source of covariation biasing estimation of a parameter. This scenario is a misperception attributed to construct validity, which is defined as the extent to which a test or measure accurately measures what it is intended to measure. This misperception is conceptually defined as the assumption that a measure or set of measures accurately measures the outcome of interest; however, associations between tested variables may not adequately or appropriately represent the outcome of interest. This specific type of construct validity is perhaps best exemplified through the use of proxy variables, or variables believed to measure a construct of interest while not necessarily holding a strong linear relationship with that construct. In psychology, the Patient Health Questionnaire-9 (PHQ-9) is a highly reliable, nine-item psychological inventory for screening, diagnosing, and monitoring depression as an internalizing disorder. Although these nine items have been extensively tested as an appropriate measure for depression and other internalizing disorders ( Bell, 1994 ), it is not uncommon for researchers to modify this scale for shorter surveys ( Poongothai et al., 2009 ). However, because the PHQ-9 has been empirically tested with a specific item set, any modification may not measure depressive symptomatology as accurately as when the PHQ-9 is used as intended. This problem is also salient in nutritional epidemiology for food categorization ( Hanley-Cook et al., 2023 ). For example, ongoing debate remains about ‘food addiction’ as a measurable construct despite limited evidence to suggest such a phenomenon exists and can be empirically measured ( Fletcher and Kenny, 2018 ).

## Why misperception 1 occurs

This misperception persists simply because issues with construct validity are difficult to identify. First, owing to continuous scientific innovations, we are finding new ways to measure complex behaviors. However, the production of new instruments or tests remains greatly outpaced by such innovation. As such, scientists may rely on old, established instruments to measure problems germane to the 21st century. However, the use of these instruments has not been tested in such scenarios, e.g., measuring screen time as a predictor/construct/measure of depression and other internalizing disorders. Second, although it is easy to create a new test or instrument, testing the instrument to ensure construct validity is time-consuming and tedious. If a new instrument is not tested, then no certainty exists as to whether the instrument measures what it is intended to measure. Additionally, outcomes measured from old, adapted, and new measures may only be marginally incorrect. Thus, any ability to identify unusual metrics or outcomes becomes impeded, allowing this misperception to continue.

## How to avoid misperception 1

We offer two practical recommendations to avoid this misperception. First, if using an established test or instrument that measures many constructs, then the instrument should be used in its entirety. Any alteration to the instrument (particularly relating to question wording, format, and question omission) may alter response patterns to a large enough degree that the instrument no longer appropriately measures what it is intended to measure. Second, in cases where measures are adapted, tailored to specific populations, or created anew, the instrument will ideally be empirically tested using a variety of psychometric analyses (e.g. confirmatory factor analysis) to compare factor weights and loadings between new and adapted measures. Ideally, adaptations to an existing instrument will perform the same such that scores reflect the outcome of interest equally across versions. Other options beyond a confirmatory factor analysis include test/retest reliability (a measure of how consistently an instrument yields similar data from the same participants across repeated administrations) as a secondary metric to again test the reliability and validity of an instrument relative to a measured construct.

Measurement errors can take many forms ( Fuller, 2009 ; Carroll et al., 2006 ) and are not limited to random, independent, or normally distributed errors. The errors themselves may be correlated with one another, or the errors in measurement may be correlated with true values of the covariate or with true values of other variables or errors in other variables. The distribution of a covariate’s measurement errors, including their variance and their associations with other variables, can greatly influence the extent to which controlling for that error-contaminated covariate will reduce, increase, or have no appreciable impact on the bias of model parameter estimation and significance testing. That is, the extent to which including a PBC will eliminate, reduce, not affect, or even potentially increase bias in estimating some elements of the model is also influenced by the measurement error distributions. Indeed, a recent review by Yland et al., 2022 delineates seven ways in which even so-called ‘non-differential’ measurement error can lead to biases away from the null hypothesis in observational epidemiologic research. We do not include all of them here but refer the reader to this cogent paper.

A frequent misleading statement in the epidemiologic literature is that ‘classical’ measurement error only attenuates effects. For example, Gibson and Zezza state, “Classical measurement errors provide comfort …since they don’t cause bias if on the left-hand side, and just attenuate if on the right-hand side, giving a conservative lower bound to any estimated causal impacts” ( Gibson and Zezza, 2018 ). That this is untrue is knowable from theory ( Fuller, 2009 ; Carroll et al., 2006 ) and has been demonstrated empirically on multiple occasions. While it is well known that the presence of measurement error in simple linear regression models leads to attenuation, the influence of measurement errors in more complex statistical models depends on the outcomes and the statistical models. Therefore, measurement error and covariates, as well as outcomes, need to be considered ( Tosteson et al., 1998 ; Buonaccorsi et al., 2000 ; Yi et al., 2012 ; Tekwe et al., 2014 ; Tekwe et al., 2016 ; Tekwe et al., 2018 ; Tekwe et al., 2019 ).

Measurement error in the covariates is often ignored or not formally modeled. This may be the result of a general lack of awareness of the consequences on estimation and conclusions drawn regarding the covariates in regression models. This may also be the result of insufficient information regarding the measurement error variance to be included in the modeling. Yet, as a field, we should move toward analyses that account for measurement error in the covariates whenever possible ( Tekwe et al., 2019 ).

## Why misperception 2 occurs

The influence of measurement error depends on the regression model. Therefore, it cannot be generalized that measurement error always attenuates covariate effects. In some models, the presence of measurement error does lead to attenuation, while in others, it leads to inflated effects of the covariates. A simple way to think about how measurement error can lead to bias is by exploring the nature of random measurement error itself. Let us assume that the random measurement error in our covariate exists. By random we mean that all the errors are independent of each other and of all other factors in the model or pertinent to the model. We know that under such circumstances, the variance in the measured values of the covariate will simply be the sum of the true variance of the construct the covariate represents plus the variance of the random measurement errors. As the ratio of the variance of the random errors over the variance of the true construct approaches infinity, the proportion of variance due to the true value of the construct approaches zero and the covariate itself is effectively nothing more than random noise. For example, we wouldn’t expect that simply controlling for the random noise generated from a random number generator would reduce the bias of the IV–DV relationship from any PBC. Although this is an extreme and exaggerated hypothetical, it makes the point that the greater the error variance, the less that controlling for the covariate actually controls for the PBC of interest. Because we know that many PBCs in the field of nutrition and obesity, perhaps most notably those involving self-reported dietary intake, are measured with error, we cannot assume that when we have controlled for a covariate, we have eliminated its biasing influence. 
If we allow for the possibility—indeed the virtual certainty ( Dhurandhar et al., 2015 ; Gibney, 2022 ; Gibney et al., 2020 )—that the errors are not all random but in some cases will be correlated with important factors in the model, then ‘all bets are off.’ We cannot predict what the effect on the model will be and the extent to which biases will be created, reduced, or both by the inclusion of such covariates without fully specifying the nature of the error structure relative to the model.
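The random-noise argument can be illustrated with a small simulation. The sketch below is a hypothetical setup, not an analysis from the cited literature: the PBC, IV, and DV are standard normal constructions with unit error variances, the true IV effect is zero, and the variable names are placeholders. Only the error variance of the measured covariate changes between runs.

```python
import numpy as np

def ols_coefs(columns, y):
    """Least-squares coefficients of y on an intercept plus the given columns."""
    design = np.column_stack([np.ones(len(y))] + list(columns))
    return np.linalg.lstsq(design, y, rcond=None)[0]

rng = np.random.default_rng(1)
n = 100_000
pbc = rng.normal(size=n)        # true potentially biasing covariate (assumed)
iv = pbc + rng.normal(size=n)   # IV is influenced by the PBC
dv = pbc + rng.normal(size=n)   # DV is influenced by the PBC only: true IV effect is 0

iv_coef_by_error_variance = {}
for error_variance in (0.0, 1.0, 100.0):
    # Measured covariate = true PBC + independent random measurement error.
    measured = pbc + rng.normal(scale=np.sqrt(error_variance), size=n)
    iv_coef_by_error_variance[error_variance] = float(ols_coefs([iv, measured], dv)[1])

unadjusted_iv_coef = float(ols_coefs([iv], dv)[1])
print(iv_coef_by_error_variance, unadjusted_iv_coef)
```

Under these assumed variances, the large-sample IV coefficient after adjustment is s²/(1 + 2s²), where s² is the error variance: zero with a perfectly measured covariate, and approaching the unadjusted value of 0.5 as the measured covariate becomes pure noise. Correlated (non-random) errors are not covered by this sketch.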

## How to avoid misperception 2

One way to reduce the concerns of such measurement error is through measurement error correction methods. Fully elucidating them is beyond the scope of this article, but thorough discussions are available ( Fuller, 2009 ). Of course, the best way of dealing with measurement error is not to have it, but that is unachievable, particularly in observational studies. Nevertheless, we should continue to strive for ever better measurements in which measurement error is minimized ( Westfall and Yarkoni, 2016 ) to levels that plausibly have far less biasing capacity.

## Misperception 3a. Continuous covariates divided into polychotomous categories for better interpretation are still well-controlled

## Why misperception 3a occurs

Another way in which controlling for PBCs can fail involves the intersection of residual confounding and nonlinearity discussed later (see Misperception 5b).

An astute investigator may recognize the potential for nonlinearity and, therefore, choose to allow for nonlinear effects or associations of the covariate with the outcome by breaking the covariate into categories that could also allow for easier interpretation ( Blas Achic et al., 2018 ).

This is most commonly done through the use of quantiles (on a terminological note, the adjacent bins into which subjects can be placed when the covariate is ‘chopped up’ in this manner might better be termed ‘quantile-defined categories’ and not quantiles, quintiles, quartiles, etc.). The quantiles are the cut points, not the bins formed by the cutting ( Altman and Bland, 1994 ). Yet doing so yields, as many have explained ( Veiel, 1988 ; Fitzsimons, 2008 ; Hunter and Schmidt, 1990 ; Irwin and McClelland, 2003 ; Naggara et al., 2011 ), ‘coarse categorization’ that effectively creates additional measurement error in the covariate. This is true even if there was no measurement error to begin with, unless the true relationship between the covariate and the outcome miraculously happens to be exactly a series of step functions with the stepping occurring exactly at the points of cutting. In contrast, if the true association is more monotonic, then this categorization loses information and increases the likely residual bias (aka ‘residual confounding’). The result is an apparent control for the covariate of interest that does not truly eliminate bias from the PBC.
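The residual bias left by coarse categorization can be seen in a simple simulation. The following sketch assumes a hypothetical linear data-generating process in which the PBC is measured without error and the true IV effect is zero; the helper names `iv_coef` and `bin_dummies` are illustrative and not from any library.

```python
import numpy as np

def iv_coef(iv, dv, adjusters):
    """Coefficient on the IV from OLS of dv on an intercept, iv, and adjusters."""
    design = np.column_stack([np.ones(len(dv)), iv, adjusters])
    return float(np.linalg.lstsq(design, dv, rcond=None)[0][1])

def bin_dummies(values, n_bins):
    """Indicator columns for quantile-defined categories (first bin is the reference)."""
    cut_points = np.quantile(values, np.linspace(0, 1, n_bins + 1)[1:-1])
    codes = np.digitize(values, cut_points)  # integer codes 0 .. n_bins - 1
    return (codes[:, None] == np.arange(1, n_bins)).astype(float)

rng = np.random.default_rng(2)
n = 100_000
pbc = rng.normal(size=n)        # error-free continuous covariate (assumed)
iv = pbc + rng.normal(size=n)
dv = pbc + rng.normal(size=n)   # true IV effect is 0

coef_continuous = iv_coef(iv, dv, pbc[:, None])             # adjust for PBC as measured
coef_quintile_bins = iv_coef(iv, dv, bin_dummies(pbc, 5))   # five quantile-defined bins
coef_median_split = iv_coef(iv, dv, bin_dummies(pbc, 2))    # dichotomized at the median

print(coef_continuous, coef_quintile_bins, coef_median_split)
```

In this setup the IV coefficient is close to zero only when the covariate enters in its continuous form. The binned versions leave within-bin variation in the PBC uncontrolled, and the fewer the bins, the larger the residual bias.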

## How to avoid misperception 3a

Researchers should avoid dichotomizing or otherwise categorizing continuous covariates wherever possible, because doing so needlessly degrades the analysis.

## Misperception 3b. Covariates categorized in coarse rather than fine categories are more reliable in the presence of measurement error

## Why misperception 3b occurs

A misperception similar to 3a is that in the presence of certain forms of measurement error, coarse categorization will make the covariate data more reliable because the original data cannot support fine-grained distinctions. As described by MacCallum et al., 2002 :

In questioning colleagues about their reasons for the use of dichotomization, we have often encountered a defense regarding reliability. The argument is that the raw measure of X is viewed as not highly reliable in terms of providing precise information about individual differences but that it can at least be trusted to indicate whether an individual is high or low on the attribute of interest. Based on this view, dichotomization, typically at the median, would provide a ‘more reliable’ measure.

It may be true that for some communication purposes, data measured with low precision merit being communicated only in broad categories and not with more precise numbers. Yet, as MacCallum et al. explain after studying dichotomization (a special case or ‘the lower limit’ of polychotomization or categorization), “...the foregoing detailed analysis shows that dichotomization will result in moderate to substantial decreases in measurement reliability under assumptions of classical test theory, regardless of how one defines a true score. As noted by Humphreys, 1978 , this loss of reliable information due to categorization will tend to attenuate correlations involving dichotomized variables, contributing to the negative statistical consequences described earlier in this article. To argue that dichotomization increases reliability, one would have to define conditions that were very different from those represented in classical measurement theory” ( MacCallum et al., 2002 ).
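The loss of reliability under dichotomization can be illustrated numerically. The sketch below assumes a classical test theory setup with two parallel forms (the same normally distributed true score plus independent normal errors of equal variance) and takes the correlation between the forms as the reliability estimate. These modeling choices are assumptions made for illustration and are not a reproduction of the analysis by MacCallum et al.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200_000
true_score = rng.normal(size=n)
form_a = true_score + rng.normal(size=n)  # parallel form A: true score + error
form_b = true_score + rng.normal(size=n)  # parallel form B: independent error

def median_split(x):
    """Dichotomize at the sample median: 1.0 for 'high', 0.0 for 'low'."""
    return (x > np.median(x)).astype(float)

# Parallel-forms reliability before and after dichotomization.
reliability_continuous = float(np.corrcoef(form_a, form_b)[0, 1])
reliability_dichotomized = float(
    np.corrcoef(median_split(form_a), median_split(form_b))[0, 1]
)

print(reliability_continuous, reliability_dichotomized)
```

With these settings the continuous measure has a reliability near 0.5, and the median-split version is noticeably lower, which is consistent with the quoted conclusion that dichotomization reduces reliability under classical assumptions.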

## How to avoid misperception 3b

Researchers are advised to refrain from dichotomizing covariates that have low reliability because this can have a negative impact on the analysis. Claiming that dichotomization will improve reliability would require defining conditions that deviate significantly from classical measurement theory ( MacCallum et al., 2002 ), and such conditions are difficult to verify in real applications.

## Why misperception 4 occurs

Investigators are often reluctant to control for covariates because they believe that doing so will reduce the power to detect the association or effect of interest between the IV and the DV or outcome. Therefore, if they perceive that the covariate is one that has a bivariate unadjusted correlation of zero with the IV, they may seize upon this as an opportunity to dismiss that nonsignificant covariate from further consideration. Ironically, this is the very situation in which controlling for the covariate may be most helpful for detecting a statistically significant association between the IV and the DV. This is most clearly recognized by statistical methodologists in randomized experiments or randomized controlled trials, but is frequently misunderstood by non-statistician investigators.

If a covariate is correlated (especially if it is strongly correlated) with the outcome of interest but uncorrelated with (orthogonal to in linear models) the IV (e.g. treatment assignment in a randomized experiment), then controlling for that covariate reduces residual variance in the DV without affecting the parameter estimate for the association or effect of the IV with the DV. Unless the sample size is extremely small such that the loss of a degree of freedom by including the covariate in the analysis makes an important difference (again, it rarely will in observational studies of any size), then this increases power, often quite substantially, by reducing the residual variance and thereby lowering the denominator of the F-statistic in a regression or ANOVA context or related statistics with other testing. Omission of orthogonal covariates has been well described in the literature ( Allison et al., 1997 ; Allison, 1995 ). Although omission of orthogonal covariates is ‘cleanest and clearest’ in the context of randomized experiments, conditions may prevail in observational studies in which a variable is strongly related to the DV but minimally related to the IV or exposure of interest.

Such covariates are ideal to help the investigator explore his or her hypothesis, or better yet, to formally test it with frequentist significance testing methods. Doing so will increase statistical power and precision of estimation (i.e. yield narrower confidence intervals on the estimated associations or effects of interest).
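The gain in precision from an orthogonal covariate can be shown in a small simulated randomized experiment. The sketch below is a minimal illustration under assumed values (a treatment effect of 0.5, a covariate effect of 2.0, unit error variance, and 200 subjects); the `ols` helper is written here for the example and uses the standard OLS covariance formula.

```python
import numpy as np

def ols(columns, y):
    """OLS with intercept; returns coefficients and their standard errors."""
    design = np.column_stack([np.ones(len(y))] + list(columns))
    coefs, _, _, _ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ coefs
    dof = len(y) - design.shape[1]
    sigma2 = residuals @ residuals / dof  # residual variance estimate
    cov = sigma2 * np.linalg.inv(design.T @ design)
    return coefs, np.sqrt(np.diag(cov))

rng = np.random.default_rng(4)
n = 200
treatment = rng.permutation(np.repeat([0.0, 1.0], n // 2))  # randomized IV
covariate = rng.normal(size=n)  # generated independently of treatment
outcome = 0.5 * treatment + 2.0 * covariate + rng.normal(size=n)

coefs_unadj, se_unadj = ols([treatment], outcome)
coefs_adj, se_adj = ols([treatment, covariate], outcome)

print("unadjusted:", coefs_unadj[1], se_unadj[1])
print("adjusted:  ", coefs_adj[1], se_adj[1])
```

Because the covariate is unrelated to treatment by design, both models target the same treatment effect, but the adjusted model has a much smaller residual variance and therefore a much smaller standard error for the treatment coefficient.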

## How to avoid misperception 4

When conducting an analysis, it is important to base the decision to control for covariates on the scientific knowledge of the problem at hand, rather than solely on the desire for a powerful test. Researchers should keep in mind that the main purpose of adjusting for covariates is to eliminate any influence of PBCs that may distort the estimate of the desired effect. To finish, we also note that including too many variables in the model can be detrimental because one runs the risk of inducing excessive multicollinearity and overfitting.

## Misperception 5a. If, when controlling for X and Z simultaneously in a statistical model as predictors of an outcome Y, X is significant with Z in the model, but Z is not significant with X in the model, then X is a ‘better’ predictor than Z

## Why misperception 5a occurs

Investigators may also incorrectly conclude that X has a true causal effect on Y and that Z does not, that X has a stronger causal effect on Y than does Z , or that Z may have a causal effect on Y but only through X as a mediating variable. None of the above conclusions necessarily follow from the stated conditions. An example of a context in which these misperceptions occur was discussed recently in a podcast in which the interlocutors considered the differential associations or effects of muscle size versus muscle strength on longevity in humans ( Attia, 2022 ). After cogently and appropriately noting the limitations of observational research in general and in the observational study under consideration in particular, the discussants pointed out that when a statistical model was used in which both muscle size and muscle strength measurements were included at the same time, muscle size was not a significant predictor of mortality rate conditional upon muscle strength, but muscle strength was a significant predictor of mortality rate conditional upon muscle size. The discussants thus tentatively concluded that muscle strength had a causal effect on longevity and that muscle size either had no causal effect, conditional upon muscle strength, or had a lesser causal effect.

While the discussants’ conclusions may be entirely correct, as the philosophers of science say, the hypotheses are underdetermined by the data. That is, the data are consistent with the discussants’ interpretation, but that interpretation is not the only one with which the data are consistent. Therefore, the data do not definitively demonstrate the correctness of the discussants' tentative conclusions. There are alternative possibilities. In Figure 1 , we show two DAGs consistent with the discussants’ conclusions. Yet they imply completely different causal associations between X and Y. Figure 1a is a simple DAG and agrees with the discussants’ conclusion. Figure 1b also agrees with the discussants’ conclusion, but X has no causal relationship with Y (no arrows). Yet, in some settings and at some levels of correlation between X and Z, X appears significant in a regression model with Z’ included in the model in lieu of Z.

## Agree (a) vs. disagree (b) with the interpretation of Misperception 5a.

Figure 2 demonstrates a nonlinear and non-monotonic association between body mass index (BMI) and mortality among U.S. adults aged 18–85 years old. This figure suggests that BMI ranging between 23–26 kg/m² formed the nadir of the curve with the best outcome while persons with BMI levels below or above the nadir of the curve experienced increased mortality on average. Source: ( Fontaine et al., 2003 ).

First, there is tremendous collinearity between muscle mass and muscle strength. Given that almost all the pertinent human studies have non-experimental designs, the collinearity makes it especially difficult to determine whether there is cause and effect here and, if so, which of the two variables has a greater effect. With such strong multicollinearity between the strength and the mass measurements, any differential measurement error could make it appear that the more reliably measured variable had a greater causal effect over the less reliably measured variable, even if the opposite were true. Similarly, any differential nonlinearity of the effects of one of the two variables on the outcome relative to the others, if not effectively captured in the statistical modeling, could lead one variable to appear more strongly associated or effective than the other. In fact, the variable may just be more effectively modeled in this statistical procedure because of its greater linearity or greater conformity of its nonlinear pattern to the nonlinear model fit. We note that variance inflation factors are often used to diagnose multicollinearity in regressions.
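How differential measurement error can reverse the apparent ordering of two collinear predictors is sketched below. The scenario is entirely hypothetical and makes no claim about real muscle physiology: it assumes that only "size" affects the outcome, that "strength" is merely correlated with size (about 0.8), and that size is measured with substantial random error while strength is measured exactly.

```python
import numpy as np

def ols_coefs(columns, y):
    """Least-squares coefficients of y on an intercept plus the given columns."""
    design = np.column_stack([np.ones(len(y))] + list(columns))
    return np.linalg.lstsq(design, y, rcond=None)[0]

rng = np.random.default_rng(5)
n = 100_000
size_true = rng.normal(size=n)
strength_true = 0.8 * size_true + 0.6 * rng.normal(size=n)  # correlated with size
outcome = size_true + rng.normal(size=n)  # assumed: only size affects the outcome

# Size is observed with heavy random error; strength is observed without error.
size_noisy = size_true + rng.normal(scale=np.sqrt(2.0), size=n)

coef_size_clean, coef_strength_clean = ols_coefs([size_true, strength_true], outcome)[1:]
coef_size_noisy, coef_strength_noisy = ols_coefs([size_noisy, strength_true], outcome)[1:]

print("error-free:", coef_size_clean, coef_strength_clean)
print("noisy size:", coef_size_noisy, coef_strength_noisy)
```

With error-free measurements, the model attributes the association to size, as constructed. Once size is measured poorly, the well-measured but causally inert variable absorbs most of the association, so the more reliably measured variable looks like the stronger predictor.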

Finally, even in linearly related sets of variables, the power to detect an association between a postulated cause and a postulated effect is highly dependent on the degree of variability in the causal factor in the population. If the variance were to be increased, the significance of the relationship would likely be accentuated. Thus, without an understanding of the measurement properties, the variability in the population, the variability which could exist in the population, and the causal structure among the variables, such analyses can only indicate hypotheses that are provisionally consistent with the data. Such analyses do not demonstrate that one variable does or does not definitively have a greater causal effect than the other or that one variable has a causal effect and the other variable does not. Note that regression coefficients within a model can be tested for equivalence in a straightforward manner. Tests for non-trivial (non-zero) equivalence of some regression parameters can be done when it makes sense. In the linear regression model, testing for equivalence between parameters amounts to comparing the reduction in the sum of squared errors between a larger (in terms of number of parameters) model and a smaller model (with selected parameters constrained to be equal) relative to the sum of squared errors of the larger model. The test statistic then has an F distribution from which we can obtain the critical values and compute the p value ( Neter et al., 1996 ).
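For concreteness, the comparison just described can be written out for two predictors. The notation below is assumed for illustration: the subscripts F and R denote the full and reduced models, SSE the error sum of squares, df the error degrees of freedom, and n the sample size.

```latex
% Full model (coefficients unconstrained), error df: df_F = n - 3
Y_i = \beta_0 + \beta_X X_i + \beta_Z Z_i + \varepsilon_i

% Reduced model under H_0 : \beta_X = \beta_Z = \beta_c, error df: df_R = n - 2
Y_i = \beta_0 + \beta_c \,(X_i + Z_i) + \varepsilon_i

% General linear test statistic
F^{*} = \frac{\left(\mathrm{SSE}_R - \mathrm{SSE}_F\right) / \left(df_R - df_F\right)}
             {\mathrm{SSE}_F / df_F}
\;\sim\; F\!\left(df_R - df_F,\; df_F\right) \quad \text{under } H_0
```

Here the numerator has one degree of freedom because a single equality constraint is imposed. A large value of F* indicates that forcing the two coefficients to be equal worsens the fit by more than chance would explain.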

## How to avoid misperception 5a

Researchers should ensure that the variables to be adjusted for in the model are not so highly correlated as to create multicollinearity problems. Variance inflation factors (VIFs), available in most statistical software, can be used to diagnose the presence of multicollinearity. Additionally, if measurement error or low covariate reliability is suspected, measurement error correction should be considered if possible.
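A variance inflation factor can also be computed directly from its definition, VIF_j = 1 / (1 − R_j²), where R_j² is obtained by regressing predictor j on the remaining predictors. The sketch below is a minimal implementation on simulated data; the function name and the example correlation of 0.9 are assumptions made for illustration.

```python
import numpy as np

def variance_inflation_factors(predictors):
    """VIF for each column: 1 / (1 - R^2) from regressing it on the other columns."""
    predictors = np.asarray(predictors, dtype=float)
    n_rows, n_cols = predictors.shape
    vifs = []
    for j in range(n_cols):
        target = predictors[:, j]
        others = np.delete(predictors, j, axis=1)
        design = np.column_stack([np.ones(n_rows), others])
        fitted = design @ np.linalg.lstsq(design, target, rcond=None)[0]
        ss_res = np.sum((target - fitted) ** 2)
        ss_tot = np.sum((target - target.mean()) ** 2)
        r_squared = 1.0 - ss_res / ss_tot
        vifs.append(1.0 / (1.0 - r_squared))
    return np.array(vifs)

rng = np.random.default_rng(6)
n = 50_000
x1 = rng.normal(size=n)
x2 = 0.9 * x1 + np.sqrt(1 - 0.81) * rng.normal(size=n)  # correlation about 0.9 with x1
x3 = rng.normal(size=n)                                 # unrelated to x1 and x2

vifs = variance_inflation_factors(np.column_stack([x1, x2, x3]))
print(vifs.round(2))
```

The two correlated predictors receive VIFs well above 1, while the unrelated predictor stays near 1. Thresholds for what counts as problematic vary by field and are not prescribed here.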

## Misperception 5b. Controlling for the linear effect of a covariate is equivalent to controlling for the covariate

## Why misperception 5b occurs

This assumption is not necessarily true because the relationships between some variables can be nonlinear. Thus, if one controls for only the linear term (which is typical) of a quantitative variable, say Z , as a PBC, then one does not effectively control for all the variance and potential bias induced by Z . The residual bias in Y that remains after controlling for Z only through its linear effect or association may be large or small depending on the degree of nonlinearity involved. In practice, much nonlinearity is monotonic. However, this is not true in all cases. For many risk factors such as body mass index (BMI), cholesterol, and nutrient intakes like sodium, there are often U-shaped (or more accurately concave upward) relationships in which persons with intermediate levels have the best outcomes and persons with covariate levels below or above the nadir of the curve have poorer outcomes, on average. An example of the nonlinear and non-monotonic relationship between BMI (the explanatory variable) and mortality (the outcome variable) is illustrated in Figure 2 ( Fontaine et al., 2003 ). In this example, mortality was treated as a time-to-event outcome modeled via survival analysis. This relationship has often been demonstrated to be U- or J-shaped ( Fontaine et al., 2003 ; Flegal et al., 2007 ; Flegal et al., 2005 ; Pavela et al., 2022 ). Thus, when BMI is modeled linearly, the estimates may be highly biased compared to when it is modeled nonlinearly.

## Association between body mass index and hazard ratio for death among U.S. adults aged 18–85 years old.

## How to avoid misperception 5b

It is important that one assesses for residual relationships (the relationships between nonlinear functions of Z and the model residuals after controlling for a linear function of Z ) or chooses to allow for nonlinearity from the outset of the analysis. Nonlinearity can be accommodated through models that are nonlinear in the parameters (e.g. having parameters be exponents on the covariates) ( Meloun and Militký, 2011 ; Andersen, 2009 ) or through use of techniques like Box-Tidwell transformations ( Armstrong, 2017 ), splines ( Schmidt et al., 2013 ; Oleszak, 2019 ), knotted regressions ( Holmes and Mallick, 2003 ), categorical values (although see Misperception 3a for caveats around coarse categorization) ( Reed Education, 2021 ), or good old-fashioned polynomials ( Oleszak, 2019 ; Reed Education, 2021 ; Hastie et al., 2017 ) or in some cases fractional polynomials ( Sauerbrei et al., 2020 ; Binder et al., 2013 ; Royston and Altman, 1994 ; Royston and Sauerbrei, 2008 ).
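The residual check and one simple remedy (a polynomial term) can be sketched as follows. The data-generating process is an assumption made for illustration: both the IV (x) and the DV (y) depend on the square of the covariate z, giving a U-shaped relationship, and the true effect of x on y is zero.

```python
import numpy as np

def fit(columns, y):
    """OLS with intercept; returns coefficients and residuals."""
    design = np.column_stack([np.ones(len(y))] + list(columns))
    coefs = np.linalg.lstsq(design, y, rcond=None)[0]
    return coefs, y - design @ coefs

rng = np.random.default_rng(7)
n = 50_000
z = rng.normal(size=n)
x = z**2 + rng.normal(size=n)   # IV depends on z through a U-shaped function
y = z**2 + rng.normal(size=n)   # DV depends on z the same way; true effect of x is 0

# Controlling only for the linear term of z leaves the x coefficient biased.
coefs_linear, resid_linear = fit([x, z], y)

# Residual check: residuals remain related to a nonlinear function of z.
resid_vs_z_squared = float(np.corrcoef(resid_linear, z**2)[0, 1])

# Allowing for nonlinearity (here a quadratic term) removes the bias.
coefs_quadratic, _ = fit([x, z, z**2], y)

print(coefs_linear[1], resid_vs_z_squared, coefs_quadratic[1])
```

A quadratic term suffices here only because the simulated relationship is exactly quadratic. In real data the functional form is unknown, which is why more flexible options such as splines or fractional polynomials are often preferred.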

## Why misperception 6 occurs

This is not true. It is a common misperception that variables included in a parametric statistical model must be normally distributed. In fact, there is no requirement that any variable included in standard parametric regression or general linear models ( Allison et al., 1993 ), either as a predictor or as a DV, be normally distributed. What is assumed for exact inference in ordinary least-squares regression models (the models typically used in this journal), in addition to the Gauss-Markov assumptions ( Berry, 1993 ), is that the residuals of the model be normally distributed. That is, the differences between the observed value of a DV for each subject and the predicted value of that DV from the model (and not any observed variable itself) are assumed to be normally distributed.

Moreover, this assumption about residuals applies only to the residuals of the DV. No assumption about the distribution of the predictor variables, covariates, or IV is made other than that they have finite mean and variance. Therefore, there is no need to assess the distributions of predictive variables, to take presumed corrective action if they are not normally distributed, or to suspect that the model is violated or biased if predictor variables are not normally distributed. One might be concerned with highly skewed or kurtotic covariates in that such distributions may contain extreme values, or outliers, that may serve as leverage points in the analysis, but that is a different issue. For an overview of outlier detection and the influence detection statistics best for managing concerns in this domain, see Fox, 2019 .

## How to avoid misperception 6

This misperception can be avoided by recalling that in the regression model, the analysis is done conditional on the IVs (or covariates), which are assumed to be fixed. Thus, their distributions are irrelevant in the analysis. However, it is required that the residuals be uncorrelated with the IVs.
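As a small illustration of this point, the following Python sketch simulates a strongly skewed predictor with normally distributed errors. The data-generating values (intercept 1, slope 2, exponential predictor) are hypothetical and chosen only for the example. The slope is recovered well and the residuals are roughly symmetric even though the predictor is far from normal.

```python
import random

def mean(v):
    return sum(v) / len(v)

def skewness(v):
    """Sample skewness: third central moment over the cubed standard deviation."""
    m = mean(v)
    m2 = mean([(a - m) ** 2 for a in v])
    m3 = mean([(a - m) ** 3 for a in v])
    return m3 / m2 ** 1.5

def fit_line(x, y):
    """Intercept and slope of the least-squares line y ~ b0 + b1*x."""
    mx, my = mean(x), mean(y)
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return my - slope * mx, slope

rng = random.Random(1)
n = 20000
# Skewed (exponential) predictor: clearly not normally distributed.
x = [rng.expovariate(1.0) for _ in range(n)]
# Assumed model for illustration: Y = 1 + 2*X + normal error.
y = [1.0 + 2.0 * xi + rng.gauss(0, 1) for xi in x]

intercept, slope = fit_line(x, y)
resid = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

skew_x = skewness(x)
skew_resid = skewness(resid)

print(f"estimated slope:        {slope:.3f}")
print(f"skewness of predictor:  {skew_x:.2f}")
print(f"skewness of residuals:  {skew_resid:.2f}")
```

Extreme values of a skewed predictor can still act as leverage points, so influence diagnostics remain worthwhile; that is a separate question from the distributional assumption.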

In this misperception, the emphasis is on a relation that is not statistically significant instead of merely not related . This strategy is often implemented through stepwise regression techniques that are available in most statistical software. Statistical-significance-based criteria for including covariates can, if the predictor variable in question is actually a confounder (we rely on the word ‘confounder’ here for consistency with much of the scientific literature), lead to bias in both coefficient estimates and tests of statistical significance ( Maldonado and Greenland, 1993 ; Greenland, 1989 ; Lee, 2014 ). As Greenland has pointed out, this “too often leads to deletion of important confounders (false negative decisions)” ( Greenland, 2008 ). This is because the statistical-significance-based approach does not directly account for the actual degree of confounding produced by the variable in question.

## Why misperception 7 occurs

There could be confusion in understanding the nature of the question asked when selecting a variable for its confounding potential and the question asked in usual statistical significance testing ( Dales and Ury, 1978 ). These two questions are fundamentally different. Even though a plausible confounder may not have a statistically significant association with the IV or the DV, or a statistically significant conditional association in the overall model, its actual association may still not be zero. That non-zero association in the population, even though not statistically significant in the sample, can still produce sufficient biases to allow false conclusions to occur at an inflated frequency. Additionally, a motivation for using significance testing to select confounders may be to fit a more parsimonious final model when the number of covariates is large and the sample size is relatively modest ( VanderWeele, 2019 ). That is, false-positive decisions (i.e. selecting a harmless nonconfounder) are considered more deleterious than false-negative decisions (deleting a true confounder). It has been argued that the opposite applies: deleting a true confounder is more deleterious than including a harmless nonconfounder. The reason is that deleting a true confounder introduces bias and is only justified if the action is worth the precision gained, whereas including a harmless nonconfounder reduces precision, which is the price of protection against confounding ( Greenland, 2008 ). We note that including a nonconfounder is not ‘harmless’ in all circumstances ( Pearl, 2011 ).

## How to avoid misperception 7

Selection of confounders may be best when relying on substantive knowledge informing judgments of plausibility, the knowledge gained from previous studies in which similar research questions were examined, or a priori hypotheses and expectations for relationships among variables. If a variable is plausibly a confounder, it should be included in the model regardless of its statistical significance. As an additional approach, one can conduct the analysis with and without the confounder as a form of sensitivity analysis ( VanderWeele and Ding, 2017 ; Rosenbaum, 2002 ) and report the results of both analyses. Such an approach is often referred to as the approach of the wise data analyst, who is willing to settle for, as Tukey defines, “an approximate answer to the right question, which is often vague, [rather] than an exact answer to the wrong question, which can always be made precise” ( Tukey, 1962 ). We note that serious criticisms have been leveled against the use of E-values in a sensitivity analysis as they tend to understate the residual confounding effect ( Greenland, 2020 ; Sjölander and Greenland, 2022 ). However, attending to those criticisms is not within the scope of the current review.

## Why misperception 8 occurs

This is untrue. As Maxwell pointed out several decades ago, the effects of analyzing residuals as opposed to including the PBC of interest in the model will depend on how those residuals are calculated ( Maxwell et al., 1985 ). As Maxwell puts it, ANOVA on residuals is not ANCOVA. Maxwell shows that if the residuals are calculated separately for different levels of the IV, bias may accrue in one direction. In contrast, if residuals are calculated for the overall sample, bias may accrue in a different manner.

Although this conceptualization of an equivalence between the two procedures [ANOVA on residuals vs ANCOVA] may be intuitively appealing, it is mathematically incorrect. If residuals are obtained from the pooled within-groups regression coefficient ($b_w$), an analysis of variance on the residuals results in an inflated α-level. If the regression coefficient for the total sample combined into one group ($b_T$) is used, ANOVA on the residuals yields an inappropriately conservative test. In either case, analysis of variance of residuals fails to provide a correct test, because the significance test in analysis of covariance requires consideration of both $b_w$ and $b_T$, unlike analysis of residuals ( Maxwell et al., 1985 ).

Notably, this procedure can introduce bias in the magnitude of the coefficients (effect sizes) characterizing the effects or associations of the IV of interest, and not just the test of statistical significance.
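The effect on coefficient magnitude can be illustrated with a short simulation. This is a sketch under assumed conditions, not a reproduction of Maxwell's analysis: two groups that differ on the covariate, a true group effect of 1.0, a covariate slope of 1.0, and residuals computed from the total-sample slope ($b_T$). All numeric values and helper names are hypothetical.

```python
import random

def mean(v):
    return sum(v) / len(v)

def fit_line(x, y):
    """Intercept and slope of the least-squares line y ~ b0 + b1*x."""
    mx, my = mean(x), mean(y)
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return my - slope * mx, slope

def fit_plane(x1, x2, y):
    """Least-squares fit of y ~ b0 + b1*x1 + b2*x2 (2x2 normal equations)."""
    m1, m2, my = mean(x1), mean(x2), mean(y)
    s11 = sum((a - m1) ** 2 for a in x1)
    s22 = sum((b - m2) ** 2 for b in x2)
    s12 = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    s1y = sum((a - m1) * (c - my) for a, c in zip(x1, y))
    s2y = sum((b - m2) * (c - my) for b, c in zip(x2, y))
    det = s11 * s22 - s12 ** 2
    b1 = (s22 * s1y - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    return my - b1 * m1 - b2 * m2, b1, b2

rng = random.Random(3)
n = 20000
# Group indicator (the IV); the groups differ on the covariate Z by one unit.
g = [1.0 if rng.random() < 0.5 else 0.0 for _ in range(n)]
z = [gi + rng.gauss(0, 1) for gi in g]
# Assumed true model: group effect 1.0, covariate slope 1.0.
y = [1.0 * gi + 1.0 * zi + rng.gauss(0, 1) for gi, zi in zip(g, z)]

# ANCOVA-style analysis: covariate included in the model.
_, effect_ancova, _ = fit_plane(g, z, y)

# Residualize Y on Z using the total-sample slope, then compare groups.
c0, b_total = fit_line(z, y)
resid = [yi - (c0 + b_total * zi) for zi, yi in zip(z, y)]
_, effect_resid = fit_line(g, resid)

print(f"group effect with covariate in the model: {effect_ancova:.3f}")
print(f"group effect from analysis of residuals:  {effect_resid:.3f}")
```

Under these assumed values the analysis of residuals returns a group effect near 0.8 rather than 1.0, because the total-sample slope absorbs part of the group difference before the groups are compared.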

## How to avoid misperception 8

As Maxwell points out, there are ways to use residualization that do not permit these biases to occur. Hence, in some situations where models become so complex that residualizing for covariate effects beforehand makes the analysis that would otherwise be intractable tractable, this may be a reasonable approach. Nevertheless, additional concerns may emerge ( Pain et al., 2018 ) and under ordinary circumstances, it is best to include PBCs in the model instead of residualizing for them first outside the model.

## Why misperception 9 occurs

This is referred to as the suppressor effect. Adenovirus 36 (Ad36) infection provides an example of a suppressor effect. Although Ad36 increases adiposity, which is commonly linked to impaired glucoregulatory function and negative lipid profiles, Ad36 infection surprisingly leads to improved glucoregulatory function and serum lipid profiles ( Akheruzzaman et al., 2019 ).

To illustrate the point, we set $\beta_A = 0.5$, $\beta_B = -0.24$, $\lambda_A = 0.8$, and $\lambda_B = 0.6$, which implies that the zero-order correlation between the intake of fats of type B and Y is zero. Yet, by controlling for fats of type B in the model, we would obtain an unbiased estimate of the effect of fats of type A on Y as $\beta_A$, whereas if we did not control for fats of type B, we would mistakenly take the correlation between fats of type A and Y, $\beta_A + \lambda_A \lambda_B \beta_B$, as that effect. This example demonstrates that failing to control for the suppressor variable, or the PBC that creates omitted variable bias, could result in a biased estimate of IV effects on the outcome, even when the suppressor variable has no correlation with the outcome. This disputes the premise that a covariate uncorrelated with the outcome cannot be biasing the results of an association test between another variable and the outcome as an indicator of a causal effect, thus undermining the original assumption. Whereas in the psychometrics literature, such patterns have commonly been termed suppressor effects , in a nutrition epidemiology paper they were referred to as negative confounders ( Choi et al., 2008 ). We provide both theoretical and empirical justifications for these observations in Appendix A in the supplementary text file.

## How to avoid misperception 9

This misperception is easily avoided if we refrain from relying only on marginal correlation to select covariates to include in the model and instead apply the backdoor criterion ( Pearl and Mackenzie, 2018 ) to help decide which variables to adjust for and which not to adjust for. Provided that the directed acyclic graph (DAG) in Figure 3 conforms to the true DAG, intake of fats B meets the backdoor criterion and must be adjusted for when estimating the effect of intake of fats A on the outcome Y.

## Causal relationships of health outcome, dietary fat consumption, and the belief that consumption of dietary fat is not dangerous.

Direction of arrows represents causal directions and λ A , λ B , β A , and β B are structural coefficients.

Figure 3 shows a simple causal model. On the left side of the figure is a variable representing an individual’s belief about the danger of dietary fat consumption. This belief affects their consumption of two types of fats, A and B. Fat type A is harmful and has a negative impact on health, while fat type B has a positive effect and improves health outcomes. The Greek letters on the paths indicate the causal effects in the model. Without loss of generality, we assume all variables have been standardized to have a variance of 1.0. From the rules of path diagrams ( Alwin and Hauser, 1975 ; Bollen, 1987 ; Cheong and MacKinnon, 2012 ), we can calculate the correlation between Y and intake of fats of type B to be $\rho_{YB} = \beta_B + \lambda_B \lambda_A \beta_A$. This correlation is zero when $\lambda_A \lambda_B = -\beta_B / \beta_A$.
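The suppressor pattern can be checked numerically with the coefficient values quoted in the text (β A = 0.5, β B = –0.24, λ A = 0.8, λ B = 0.6). The following Python sketch is a hedged illustration, not the authors' simulation code: it assumes normally distributed variables, a linear model Y = β A · A + β B · B + error, and error variances chosen so that every variable has unit variance. The helper names are hypothetical.

```python
import math
import random

def mean(v):
    return sum(v) / len(v)

def corr(a, b):
    """Pearson correlation between two equal-length lists."""
    ma, mb = mean(a), mean(b)
    sab = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    saa = sum((x - ma) ** 2 for x in a)
    sbb = sum((y - mb) ** 2 for y in b)
    return sab / (saa * sbb) ** 0.5

def fit_line(x, y):
    """Intercept and slope of the least-squares line y ~ b0 + b1*x."""
    mx, my = mean(x), mean(y)
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return my - slope * mx, slope

def fit_plane(x1, x2, y):
    """Least-squares fit of y ~ b0 + b1*x1 + b2*x2 (2x2 normal equations)."""
    m1, m2, my = mean(x1), mean(x2), mean(y)
    s11 = sum((a - m1) ** 2 for a in x1)
    s22 = sum((b - m2) ** 2 for b in x2)
    s12 = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    s1y = sum((a - m1) * (c - my) for a, c in zip(x1, y))
    s2y = sum((b - m2) * (c - my) for b, c in zip(x2, y))
    det = s11 * s22 - s12 ** 2
    b1 = (s22 * s1y - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    return my - b1 * m1 - b2 * m2, b1, b2

# Structural coefficients quoted in the text.
beta_a, beta_b = 0.5, -0.24
lam_a, lam_b = 0.8, 0.6

# Error standard deviations chosen so that all variables have unit variance
# (an assumption matching the standardization described in the text).
sd_eta = math.sqrt(1 - lam_a ** 2)
sd_gamma = math.sqrt(1 - lam_b ** 2)
sd_eps = math.sqrt(1 - (beta_a ** 2 + beta_b ** 2
                        + 2 * beta_a * beta_b * lam_a * lam_b))

rng = random.Random(4)
fat_a, fat_b, y = [], [], []
for _ in range(50000):
    belief = rng.gauss(0, 1)
    a = lam_a * belief + rng.gauss(0, sd_eta)
    b = lam_b * belief + rng.gauss(0, sd_gamma)
    fat_a.append(a)
    fat_b.append(b)
    y.append(beta_a * a + beta_b * b + rng.gauss(0, sd_eps))

r_yb = corr(fat_b, y)                            # expected near 0
_, slope_unadjusted = fit_line(fat_a, y)         # expected near 0.3848
_, slope_adjusted, _ = fit_plane(fat_a, fat_b, y)  # expected near 0.5

print(f"corr(B, Y):                     {r_yb:.3f}")
print(f"slope of Y on A, B omitted:     {slope_unadjusted:.3f}")
print(f"slope of Y on A, B adjusted:    {slope_adjusted:.3f}")
```

With these values the unadjusted slope settles near 0.5 + 0.8 × 0.6 × (–0.24) = 0.3848, while adjusting for fat B recovers a value close to 0.5, even though fat B is essentially uncorrelated with Y.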

## Why misperception 10 occurs

This misperception is based on the same premises as stated above but manifests differently. Let us replace ‘confounding variable’ with ‘PBC,’ which we defined earlier. For Misperception 10, the presumption is that a PBC, if not properly included and controlled for in the design or analysis, will only bias the extent to which the association between the IV and the DV represents the cause or effect of the IV on the DV if the PBC is related to both the IV and the DV.

Under those assumptions, if we consider a PBC and find that it is one that has a bivariate unadjusted correlation of zero with the IV, then it cannot be creating bias. Yet, this is not true. Multiple circumstances could produce a pattern of results in which a biasing variable has a correlation of zero, as well as no nonlinear association with the outcome, and yet creates a bias if not properly accommodated by design or analysis. Moreover, there may be circumstances in which statistically adjusting for a variable does not reduce the bias even though in other circumstances such adjustment would. Consider the causal model depicted in Figure 4 , which follows the same notational conventions as Figure 3 .

## Causal relationships of outcome, covariate, and potentially biasing covariate (PBC).

Direction of arrows represents causal directions and λ z , α z , α x , β z , and β x are structural coefficients. The error terms e 1 and e 2 have variances chosen so Y 1 and Y 2 have variances 1 (see the Appendices for more details).

In this case, both X and Z have a causal effect on Y 1 . Y 1 can then be referred to as a ‘collider’ (see Glossary). It is well established that conditioning on a collider will alter the association between joint causes of it. Most often, collider bias is discussed in terms of creating associations. For example, in the figure shown here, if Z and X were not correlated, but both caused increases in Y 1 , then conditioning on (i.e. ‘controlling for’) Y 1 would create a positive or negative correlation between X and Z . However, as Munafò et al., 2018 explain, collider bias need not simply create associations, but can also reduce or eliminate associations: “Selection can induce collider bias… which can lead to biased observational… associations. This bias can be towards or away from any true association, and can distort a true association or a true lack of association.”

In Figure 4 , there is an association between X and Z , and Z would be the PBC (confounding variable in conventional terminology) of the relationship between X and Y 1 and Y 2 . But, if we set up the coefficients to have certain values, selecting on Y 1 (for example, by studying only people with diagnosed hypertension defined as a systolic blood pressure greater than 140 mm Hg) could actually drive the positive association between X and Z to zero. Specifically, for these coefficient values [β x = 0.1857, β z = 0.8175, λ z = 0.4, α x = 0.0, α z = 0.6], if all variables were normally distributed with mean zero and standard deviation 1 (this would be after standardization of the variables), then using a cutoff of approximately 1.8276 standard deviations above the mean of Y 1 would cause the correlation in that subsample between X and Z to be zero ( Arnold and Beaver, 2000 ; Azzalini and Capitanio, 2013 ). In the derivation in Appendix 2, we assume that all variables have a joint multivariate normal distribution; whether the result applies to cases in which the data are not multivariate normal is not something we have proven one way or another.

Furthermore, let us assume that all the relations in this hypothetical circumstance are linear. This can include linear relationships of zero, but no nonlinear or curved relationships. Here, when we control for the PBC Z in the selected sample of persons with hypertension, it will have no effect on the estimated slope of the regression of Y 2 on X . The collider bias has altered the association between Z and X such that controlling for Z in a conventional statistical model, i.e., an ordinary least-squares linear regression, no longer removes the bias. And yet, the bias is there. We justify this through mathematical arguments along with a small simulation to elucidate the manifestation of this misperception in Appendix 2.
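The selection effect on the X–Z association can be checked with a short Python sketch using the coefficient values and cutoff quoted in the text. The structural equations used here (X = λ z · Z + error, Y 1 = β z · Z + β x · X + error, with error variances giving unit variances) are an assumption made for the illustration, as is the use of the standard truncated-normal variance formula for the analytic check. The sketch does not model Y 2.

```python
import math
import random

def mean(v):
    return sum(v) / len(v)

def corr(a, b):
    """Pearson correlation between two equal-length lists."""
    ma, mb = mean(a), mean(b)
    sab = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    saa = sum((x - ma) ** 2 for x in a)
    sbb = sum((y - mb) ** 2 for y in b)
    return sab / (saa * sbb) ** 0.5

# Values quoted in the text.
beta_x, beta_z, lam_z = 0.1857, 0.8175, 0.4
tau = 1.8276  # selection cutoff on Y1, in standard deviations

# Analytic check under joint normality with unit variances (assumed model).
cov_xz = lam_z
cov_xy1 = beta_x + beta_z * lam_z
cov_zy1 = beta_z + beta_x * lam_z
pdf = math.exp(-tau ** 2 / 2) / math.sqrt(2 * math.pi)
tail = 0.5 * math.erfc(tau / math.sqrt(2))       # P(Y1 > tau)
hazard = pdf / tail                              # inverse Mills ratio
var_drop = hazard * (hazard - tau)               # 1 - Var(Y1 | Y1 > tau)
cov_xz_selected = cov_xz - cov_xy1 * cov_zy1 * var_drop

# Simulation check.
sd_ex = math.sqrt(1 - lam_z ** 2)
sd_e1 = math.sqrt(1 - (beta_z ** 2 + beta_x ** 2 + 2 * beta_z * beta_x * lam_z))
rng = random.Random(2)
all_x, all_z, sel_x, sel_z = [], [], [], []
for _ in range(400000):
    z = rng.gauss(0, 1)
    x = lam_z * z + rng.gauss(0, sd_ex)
    y1 = beta_z * z + beta_x * x + rng.gauss(0, sd_e1)
    all_x.append(x)
    all_z.append(z)
    if y1 > tau:  # keep only the selected ("hypertensive") subsample
        sel_x.append(x)
        sel_z.append(z)

corr_full = corr(all_x, all_z)
corr_selected = corr(sel_x, sel_z)

print(f"analytic Cov(X, Z | Y1 > tau): {cov_xz_selected:.4f}")
print(f"corr(X, Z), full sample:       {corr_full:.3f}")
print(f"corr(X, Z), selected sample:   {corr_selected:.3f}")
```

In the full sample the correlation between X and Z stays near 0.4, while in the selected subsample it is close to zero, so adjusting for Z in that subsample changes the estimated slope on X very little.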

More sophisticated models involving missing data approaches and other approaches could also be brought to bear ( Groenwold et al., 2012 ; Yang et al., 2019 ; Greenwood et al., 2006 ), but this simple example shows that just because a PBC has no association with a postulated IV (i.e. cause), this does not mean that the variable cannot be creating bias (confounding) in the estimated relationship between the postulated IV and the postulated result or outcome. In the end, as Pedhazur put it, quoting Fisher, “If…we choose a group of social phenomena with no antecedent knowledge of the causation or the absence of causation among them, then the calculation of correlation coefficients, total or partial, will not advance us a step towards evaluating the importance of the causes at work…In no case, however, can we judge whether or not it is profitable to eliminate a certain variate unless we know, or are willing to assume, a qualitative scheme of causation” ( Fisher, 1970 ).

In the end, there is no substitute for either randomization or, at a minimum, informed argument and assumptions about the causal structure among the variables. No simple statistical rule will allow one to decide whether a covariate or its exclusion is or is not creating bias.

## How to avoid misperception 10

Selecting or conditioning on a collider can bias estimated effects in unforeseeable ways. Given a causal DAG, the use of the backdoor criterion can help the analyst identify variables that can safely be adjusted for and those that can bias (confound) the effect estimate of interest. In Figure 4 , for example, Y 1 does not meet the backdoor criterion for the effect of X on Y 2 , and adjusting for it or selecting on it will bias the effect estimate.

This is, in essence, a statement of the unbiasedness of an analytic approach. By this we mean that the method of controlling for the covariate is not chosen, intentionally or unintentionally, to achieve a particular study finding, and that the answer obtained does not deviate from the answer one would get if one optimally controlled for the covariate. By ‘optimally controlled,’ we mean using a method that would eliminate or reduce to the greatest extent possible any effects of not controlling for the covariate and that is commensurate with the stated goals of the analysis (which is more important than the interests of the investigator).

Unfortunately, we have substantial evidence from many sources that many investigators instead choose analytical approaches, including the treatment of covariates, that serve their interests (e.g. Head et al., 2015 ; Wicherts et al., 2016 ; Bruns and Ioannidis, 2016 ). Conventionally, this is termed ‘p-hacking’ ( Simonsohn et al., 2014 ), ‘investigator degrees of freedom’ ( Simmons et al., 2011 ), ‘taking the garden of forking paths’ ( Gelman and Loken, 2013 ), and so on. If such methods are used, that is, if investigators try multiple choices of which covariates to control for, how many, or in what form until they select the one that produces the results most commensurate with those they wish for, the results will most certainly be biased ( Sturman et al., 2022 ; Kavvoura et al., 2007 ; Banks et al., 2016 ; O’Boyle et al., 2017 ; Simmons et al., 2011 ; Christensen et al., 2021 ; Stefan and Schönbrodt, 2022 ; Austin and Brunner, 2004 ).

## Why misperception 11 occurs

To our knowledge, surveys do not exist describing the extent to which authors are aware of the consequences of intentionally choosing and reporting models that control for covariates to obtain a certain result. Some evidence exists, however, that suggests authors do sometimes intentionally select covariates to achieve statistical significance, such as a survey by Banks et al. of active management researchers ( Banks et al., 2016 ). O’Boyle et al. observed changes in how control variables were used in journal articles compared with dissertations of the same work, with the final publications reporting more statistically significant findings than the dissertations ( O’Boyle et al., 2017 ). Research on the motivations of these practices may help to focus preventive interventions.

## How to avoid misperception 11

This concern with P -hacking is one of the major impetuses behind those in our field encouraging investigators in observational studies to preregister their analyses ( Dal Ré et al., 2014 ). Many steps in the model-building process could consciously or unconsciously influence the probability of type I error, from the conceptualization of the research question (e.g. the quality of prior literature review, discussions with collaborators and colleagues that shape modeling choices), to any prior or exploratory analysis using that dataset, or to the numerous analytical decisions in selecting covariates, selecting their forms, accounting for missing data, and so on. Future theoretical and empirical modeling is needed to inform which decisions have the least likelihood of producing biased findings.

However, that is not to say that investigators must eliminate all flexibility in each of these steps, or that they should never engage in exploratory analyses or change their minds after the fact, or that we do not do so ourselves. But this should be disclosed so that the reader can make an informed decision about what the data and results mean. Within our group, we often say colloquially that we are going to analyze the heck out of these data and try many models, but we are then going to disclose this to the reader. Indeed, transparency is often lacking for how the inclusion or form of adjustment is determined in observational research ( Lenz and Sahn, 2021 ). In situations where authors want to explore how covariate selection flexibility may affect results, so-called multiverse-style methods ( Steegen et al., 2016 ) (also called vibration of effects; Patel et al., 2015 ) or specification curve analysis ( Simonsohn et al., 2020 ) can be used, although careful thought is needed to ensure such analyses do not also produce misleading conclusions ( Del Giudice and Gangestad, 2021 ).

## Why misperception 12 occurs

This is not necessarily true. An article from many years ago discusses the problem of a reproducible ‘Six Sigma’ finding from physics ( Linderman et al., 2003 ). A Six Sigma finding is simply a finding whose test statistic is six or more standard deviations from the expectation under the null hypothesis. Six Sigma findings should be indescribably rare based on known probability theory (Actually, they are exactly describably rare and should occur, under the null hypothesis, 10e-10 proportion of the time.). However, it seems that all too often, Six Sigma findings, even in what might be seen as a mature science like physics, are regularly overturned ( Harry and Schroeder, 2005 ; Daniels, 2001 ). Why is this? There are likely multiple reasons, but one plausible reason is that the assumptions made about the measurement properties of the data, the distributions of the data, and the performance of the test statistics under violations of their pure assumptions were not fully understood or met ( Hanin, 2021 ). This issue involving violations of assumptions of statistical methods ( Greenland et al., 2016 ) may be especially important when dealing with unusually small alpha levels (i.e. significance levels) ( Bangalore et al., 2009 ). This is because a test statistic that is highly robust to even modest or large violations of some assumptions at higher alpha levels such as 0.05 may be highly sensitive to even small violations of assumptions at much smaller alpha levels, such as those used with Six Sigma results in physics. Another example is with the use of multiple testing ‘corrections’ in certain areas like genetic epidemiology with genome-wide association testing in nutrition and obesity research, where significance levels of 10e-8 are commonly used and p values far, far lower than that are not infrequently reported.
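The tail probability of a Six Sigma result, and its sensitivity to a small violation of assumptions, can be computed directly. The Python sketch below uses the standard normal tail via `math.erfc`. The violation considered, a test statistic whose true standard deviation is 5% larger than assumed, is a hypothetical example chosen for illustration and is not taken from the cited articles.

```python
import math

def one_sided_p(z):
    """P(Z >= z) for a standard normal Z."""
    return 0.5 * math.erfc(z / math.sqrt(2))

def two_sided_p(z):
    """P(|Z| >= z) for a standard normal Z."""
    return math.erfc(z / math.sqrt(2))

# Probability of a result six or more standard deviations above expectation.
p_six_sigma = one_sided_p(6.0)

# Hypothetical violation: the statistic's true SD is 1.05 instead of 1,
# so a nominal critical value c is exceeded with probability P(|Z| > c / 1.05).
scale = 1.05
inflation_05 = two_sided_p(1.96 / scale) / two_sided_p(1.96)
inflation_6 = two_sided_p(6.0 / scale) / two_sided_p(6.0)

print(f"one-sided P(Z >= 6):                  {p_six_sigma:.3e}")
print(f"error-rate inflation at alpha = 0.05: {inflation_05:.2f}x")
print(f"error-rate inflation at six sigma:    {inflation_6:.2f}x")
```

Under this assumed violation, the false-positive rate at the conventional 0.05 threshold rises by roughly a quarter, whereas the rate at the six-sigma threshold rises several-fold, which illustrates why robustness at one significance level need not carry over to a much smaller one.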

## How to avoid misperception 12

In short, robustness at one significance level does not necessarily imply robustness at a different significance level. Independent replication not only takes into account purely stochastic sources of error but also potentially allows one to detect the inadvertent biasing effects of other unknown and unspecifiable factors beyond stochastic variation.

We have discussed 12 issues involving the use of covariates. Although our description of each misperception is mostly done in a linear model setting, we note that these issues also remain in the nonlinear model. We hope that our attention to these issues will help readers better understand how to most effectively control for potential biases, without inducing further biases, by choosing how and when to include certain covariates in the design and analysis of their studies. We hope the list is helpful, but we wish to note several things. First, the list of issues we provide is not exhaustive. No single source, that we are aware of, will necessarily discuss them all, but some useful references exist ( Cinelli et al., 2020 ; Gelman et al., 2020 ; Ding and Miratrix, 2015 ). Second, by pointing out a particular analytical approach or solution, we do not mean to imply that these are the only analytic approaches or solutions available today or that will exist in the future. For example, we have not discussed the Bayesian approach much. Bayesian approaches differ from their non-Bayesian counterparts in that the researcher first posits a model describing how observable and unobservable quantities are interrelated, which is often done via a graph. Many of the misconceptions detailed here are related to covariate selection bias and omitted or missing covariates bias, which can be corrected for in a Bayesian analysis provided it is known how the unobserved variables are related to other model terms (see McElreath, 2020 for an accessible and concise introduction to Bayesian analysis and its computational aspects). Third, most of the misconceptions discussed here and ways to avoid them have a direct connection with causal inference. Namely, assuming a DAG depicting the data-generating process, we can use the back-door or front-door criterion derived from the do-calculus framework of Pearl and Mackenzie, 2018 ; Pearl et al., 2016 .
Determination of the adjusting set in a DAG can sometimes be challenging, especially in larger DAGs. The freely available web application dagitty ( https://www.dagitty.net/ ) allows users to specify their DAGs and the application provides the set of controlling variables ( Textor et al., 2016 ).

We encourage readers to seek the advice of professional statisticians in designing and analyzing studies around these issues. Furthermore, it is important to recognize that no one statistical approach to the use or nonuse of any particular covariate or set of covariates in observational research will guarantee that one will obtain the ‘right’ answer or an unbiased estimate of some parameter without demanding assumptions. There is no substitute for the gold standard of experimentation: strictly supervised double-blind interventional experiments with random selection and random assignment. This was aptly illustrated in a study by Ejima et al., 2016 . This does not mean that one should not try to estimate associations or causal effects in observational research. Indeed, as Hernán effectively argues ( Hernán, 2018 ), we should not be afraid of causation. When we do much observational research, we are interested in estimating causal effects. But we must be honest: what we are actually estimating is associations, and we can then discuss the extent to which those estimates of associations may represent causal effects. Our ability to rule out competing explanations for the associations observed, other than causal effects, strengthens the argument that the associations may represent causal effects, and that is where the wise use of covariates comes in. But such arguments used with covariates do not demonstrate causal effects, they merely make more or less plausible in the eyes of the beholder that an association does or does not represent causation. In making such arguments, as cogently noted on the value of epistemic humility and how to truly enact it, “Intellectual humility requires more than cursory statements about these limitations; it requires taking them seriously and limiting our conclusions accordingly” ( Hoekstra and Vazire, 2021 ). 
That is, consideration of arguments about the plausibility of causation from association should not be given in such a way as to convince the reader, but rather to truly give a fair and balanced consideration of the notion that an association does or does not represent a particular causal effect. As Francis Bacon famously said, “Read not to contradict and confute; nor to believe and take for granted; nor to find talk and discourse; but to weigh and consider” ( Bacon, 2022 ).

## Data availability

All data generated or analyzed during this study are included in the manuscript and supplementary files; the R code used for the description and illustration of misperception 5a, misperception 9, and misperception 10 is publicly available on GitHub.

Consider a simple causal model depicted in Appendix 1—figure 1 ( Figure 3 in the main text). At the left side of Appendix 1—figure 1 , we have a variable that is the degree of one’s belief that dietary fat consumption is not dangerous or, conceived alternatively, one minus the strength of belief that dietary fat consumption is dangerous or should be avoided. Suppose this variable relates to dietary consumption of two kinds of dietary fats, A and B , where dietary fat of type A increases some negative health outcome of interest (i.e. is harmful). In contrast, dietary fat of type B decreases the negative health outcome (i.e. is helpful).

We can use the following linear model to describe the causal effects in Appendix 1—figure 1 :

$$Y = \beta_A X_A + \beta_B X_B + \varepsilon, \tag{1}$$

where $Y$ is the response variable, $X_A$ and $X_B$ are IVs representing the consumption of dietary fat types A and B, respectively, and $\varepsilon$ is an independent error term with variance $\sigma_\varepsilon^2$. Of the two covariates, we suppose $X_A$ is the exposure of interest that is correlated with $Y$ and $X_B$ is a confounding variable that is uncorrelated with $Y$, resulting in the correlations $\rho(X_A, Y) \neq 0$ and $\rho(X_B, Y) = 0$, respectively. Following the causal diagram in Appendix 1—figure 1 , we generate $X_A$ and $X_B$ from a latent variable $Z$, where

$$X_A = \lambda_A Z + \eta, \qquad X_B = \lambda_B Z + \gamma,$$

where $\lambda_A \neq 0$, $\lambda_B \neq 0$, and $\eta$ and $\gamma$ are independent error terms with variances $\sigma_\eta^2$ and $\sigma_\gamma^2$, respectively. Without loss of generality, we assume that variables $X_A$, $X_B$, and $Z$ have been standardized to unit variance, and the additional regression parameters are chosen so that $Y$ also has unit variance. This then implies the correlations $\rho(Y, X_A) = \beta_A + \beta_B \lambda_A \lambda_B$, $\rho(Y, X_B) = \beta_B + \beta_A \lambda_A \lambda_B$, and $\rho(X_A, X_B) = \lambda_A \lambda_B$.

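Consider the following reduced model, in which the confounding variable $X_B$ is excluded from the full model (1):

$$Y = \beta_A X_A + \varepsilon^{*}, \tag{2}$$

where $\varepsilon^{*} \equiv \beta_B X_B + \varepsilon$. With standardized variables, the least-squares estimate for $\beta_A$ under the reduced model (2) converges to

$$\hat{\beta}_A \to \frac{\mathrm{Cov}(X_A, Y)}{\mathrm{Var}(X_A)} = \beta_A + \beta_B \lambda_A \lambda_B.$$

Under the assumption that $\rho(Y, X_B) = 0$, we have $\beta_B = -\beta_A \lambda_A \lambda_B$. Plugging this into the expression for $\hat{\beta}_A$, we have

$$\hat{\beta}_A \to \beta_A \left(1 - \lambda_A^2 \lambda_B^2\right) \neq \beta_A,$$

since $\lambda_A \neq 0$ and $\lambda_B \neq 0$. The above derivation demonstrates that the omitted variable bias would vanish only if $\lambda_A = 0$ or $\lambda_B = 0$. However, those requirements contradict the imposed assumption in the causal model of Appendix 1—figure 1 , indicating that the omitted variable bias cannot be avoided.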

In addition to the theoretical justification, we conducted simulation studies to illustrate our points. To generate simulated data under the imposed assumptions, we select regression parameters following the restrictions:

where restrictions (5), (6), and (7) are required to have $\mathrm{Var}(X_A) = 1$, $\mathrm{Var}(X_B) = 1$, and $\mathrm{Var}(Y) = 1$, respectively. Plugging (5) and (6) into (3) yields $\sigma_\gamma^2 + \sigma_\eta^2 - \sigma_\gamma^2 \sigma_\eta^2 \neq 0$. We consider simulation settings based on the parameter specifications presented in Appendix 1—table 1 , where variables $Z$, $\varepsilon$, $\eta$, and $\gamma$ were generated from independent normal distributions with zero means. For all scenarios considered, the empirical Pearson’s correlations between $Y$ and $X_B$ are close to zero. With the simulated data, we examined the bias of the least-squares estimator for $\beta_A$ under the full model (1) and the reduced model (2). With 10,000 replications and three levels of sample size, $n \in \{500, 1000, 2000\}$, the summary of bias is presented in Appendix 1—table 2 . As expected, the bias of $\hat{\beta}_A$ is virtually zero when controlling for $X_B$ in the full model. In contrast, when failing to control for $X_B$ in the model, one would mistakenly estimate the causal effect of $X_A$ on $Y$, resulting in a bias that agrees closely with $\beta_B \lambda_A \lambda_B$. Our simulation results confirm that excluding confounding variables from the model could bias the coefficient estimates, hence introducing omitted variable biases. In addition, our results dispute the premise we began with, namely that a covariate that is uncorrelated with the outcome cannot be biasing the results of an association test between another variable and the outcome as an indicator of a causal effect. Whereas in the psychometrics literature such patterns have usually been termed suppressor effects , in a nutrition epidemiology paper they were referred to as negative confounders ( Choi et al., 2008 ).

## Parameters used to generate simulated data for the simulation studies under Misperception 9.

Summary of bias when fitting the full model (\(M_F\)) and the reduced model (\(M_R\)).

The bias is defined as \(\hat{\beta} - \beta_A\), where \(\hat{\beta}\) is the least-squares estimate under the corresponding model.

Direction of arrows represents causal directions, and \(\lambda_A\), \(\lambda_B\), \(\beta_A\), and \(\beta_B\) are structural coefficients.

## Misperception 10. If a plausible confounding variable is unrelated to the IV, then it does not create bias in the association of the IV with the outcome

To illustrate this misconception, let’s consider the causal diagram shown in Appendix 2—figure 1 .

Data-generating model:

Let us consider the setting we had in the description of Misperception 9 in Appendix 1. Here, \(Y_1\) and \(Y_2\) are the outcomes or DVs, \(X\) is the covariate of primary interest, and \(Z\) is the confounder in the causal association of the covariate with each response. We have the following model:

We further assume that \(Z\) has a Gaussian distribution with mean 0 and variance 1; \(\epsilon_x\) has a normal distribution with mean zero and variance \(\sigma_x^2\); \(e_1\) has a normal distribution with mean zero and variance \(\sigma_{\epsilon,1}^2\); and \(e_2\) has a normal distribution with mean zero and variance \(\sigma_{\epsilon,2}^2\). Without loss of generality, we select the variance terms so that the outcomes \(Y_1\) and \(Y_2\), the exposure \(X\), and the confounder \(Z\) all have unit variance. The joint distribution of \((Z, X, Y_2, Y_1)\) is multivariate normal with zero mean vector and the correlation matrix provided in Appendix 2—table 1.

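where \(\sigma_{\epsilon,1}^2 = 1 - \beta_z^2 + \beta_x^2 \lambda_z^2 + 2\beta_z \beta_x \lambda_z + \beta_x^2\); \(\sigma_{\epsilon,2}^2 = 1 - \alpha_z^2 + \alpha_x^2 \lambda_z^2 + 2\alpha_z \alpha_x \lambda_z + \alpha_x^2\); and \(\sigma_x^2 = 1 - \lambda_z^2\). Thus, \(\mathrm{Var}(Y_1) = \mathrm{Var}(Y_2) = 1\). This implies the following constraints on the parameters: \(0 < \beta_z < 1 - \beta_x^2 - \lambda_z \beta_x\) and \(0 < \alpha_z < 1 - \alpha_x^2 - \lambda_x \alpha_x\).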

## The correlation matrix among \(Z\), \(X\), \(Y_2\), and \(Y_1\) without selecting on \(Y_1\).

Suppose we restrict the sample to values of \(Y_1 > E(Y_1) + \tau \sigma_{Y_1}\), with \(\mathrm{Var}(Y_1) = 1\). This will ultimately perturb the joint distribution of \((Z, X, Y_2)\). We can analytically derive the joint distribution of \((Z, X, Y_2) \mid Y_1 > \tau\). Using results from (2), the joint distribution of \((Z, X, Y_2) \mid Y_1 > \tau\) is an extended multivariate skew-normal. Namely, the density of the vector is

where \(\phi(\cdot)\) and \(\Phi(\cdot)\) denote the density function and the cumulative distribution function of the normal distribution. After selecting on values of \(Y_1 > \tau\)

and \(\rho_i\) is the \(i\)th element of \(\rho\) and \(\sigma_{ij}^2\) is the entry of the matrix in Appendix 2—table 1 at row \(i\) and column \(j\). To find \(\tau\) so that \(\sigma_{12}^2 = \mathrm{cor}(X, Z \mid Y_1 > \tau) = 0\), we need to solve the following equation for \(\tau\)

Since the quantity on the right-hand side is non-negative and takes on a value between 0 and 1, for any choice of the triplet \((\lambda_z, \beta_x, \beta_z)\) with \(\lambda_z < (\beta_z + \beta_x \lambda_z)(\beta_x + \beta_z \lambda_z)\), we can find a \(\tau\) so that \(\mathrm{corr}(X, Z) = 0\) based on the data for which \(Y_1 > \tau\). Let’s further assume that for a given λ x , β x = a 1 + λ x 2 and β z = b 1 + λ x 2 + a 2 - λ x a 1 + λ x 2 = b a β x 1 + λ x 2 + a 2 - λ x a , for fixed \(0 < \lambda_z < 1\) and \(0 \le a, b \le 1\). Thus, for any pair \((a, b)\) with \(0 \le a, b < 1\), we have a set of possible values of \(\lambda_z\) that satisfies all the constraints enumerated above. In this setting, \(\beta_x\) and \(\beta_z\) involve the constants \(a\), \(b\), and \(\lambda_z\) in a nonlinear fashion. We rely on numerical approaches to identify values of \(\lambda_z\) consistent with the values of \(a\) and \(b\). We illustrate this with the case where \(a, b \in \{0.2, 0.9\}\), which results in four possible pairs: (0.2, 0.2), (0.2, 0.9), (0.9, 0.2), and (0.9, 0.9). In Appendix 2—figure 2, we show the set of possible values of \(\lambda_z\), combined with the values of \(\beta_x\) and \(\beta_z\), for which we can find a value of \(\tau\).

Selecting on \(Y_1\) affects the joint dependence between these variables. To see this, consider Appendix 2—table 1, the correlation matrix among \(X\), \(Z\), \(Y_1\), and \(Y_2\) without selecting on \(Y_1\). Contrast this with A1, the correlation matrix among \(X\), \(Z\), \(Y_1\), and \(Y_2\) with selecting on \(Y_1\).

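Let \(a_0 = \frac{\phi(-\alpha)}{\Phi(-\alpha)}\left(\alpha - \frac{\phi(-\alpha)}{\Phi(-\alpha)}\right)\), \(a_1 = \alpha_z + \alpha_x \lambda_z\), \(a_2 = \alpha_x + \alpha_z \lambda_z\), \(\rho_1 = \beta_z + \beta_x \lambda_z\), \(\rho_2 = \beta_x + \beta_z \lambda_z\), and \(\rho_3 = (\alpha_z + \alpha_x \lambda_z)\beta_z + (\alpha_x + \alpha_z \lambda_z)\beta_x\).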

From the elements of Appendix 2—table 1 and , we can derive the elements of Appendix 2—table 2 , which shows the squared (partial or zero-order) correlation and (partial or univariable) slope of the regression of Y 2 on X with and without controlling for Z and with and without selecting on Y 1 .

## The squared correlation and slope of regression.

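\(\tilde{\sigma}_{12}\), \(\tilde{\sigma}_{13}\), \(\tilde{\sigma}_{23}\), \(\tilde{\sigma}_{11}\), \(\tilde{\sigma}_{22}\), and \(\tilde{\sigma}_{33}\) are the entries of the matrix \(\tilde{\Sigma}_{12}\), and \(\tilde{\rho}\) are the entries of the correlation matrix obtained from \(\tilde{\Sigma}_{12}\).

When \(\alpha_x = 0\), then \(a_1 = \alpha_z\), \(a_2 = \alpha_z \lambda_z\), \(\rho_3 = \alpha_z \rho_1\), and \(\tilde{\sigma}_{12} = \lambda_z + a_0 \rho_1 \rho_2\)

and \(\tilde{\sigma}_{23}^{*}\), \(\tilde{\sigma}_{22}^{*}\), and \(\tilde{\sigma}_{33}^{*}\) come from the inverse of \(\tilde{\Sigma}_{12}\).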

In all, our derivations show that selecting on \(Y_1\) can have some impact on the causal estimate of the effect of the covariate \(X\) on \(Y_2\). To bring our point home, we perform a simulation study in which we randomly generate 1000 data sets according to the data-generating model above. We consider the four pairs \((a, b)\) discussed above, and for each pair, we choose two values of \(\tau\) (high and low). Each value of \(\tau\) has an associated value of \((\lambda_z, \beta_x, \beta_z)\). We set \(\alpha_x = 0\) and \(\alpha_z = 0.6\). The sample size for each simulated data set is 50,000, and we report the average bias for the adjusted (controlling for the confounder \(Z\)) and unadjusted causal effect of the covariate \(X\) on the response \(Y_2\) based on the full data (All Data) and the data obtained after selecting on \(Y_1\) (Select on \(Y_1 > \tau\)). We report our findings in Appendix 2—table 3. Adjusting for the confounder yields an unbiased estimate of the causal effect of \(X\) on \(Y_2\) under both the full and selected data scenarios. However, when the confounder is omitted, the estimated causal effect is biased for the full data and unbiased for the selected data.
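A single run of this kind of experiment can be sketched as follows. The sketch rests on several assumptions that are not taken from the appendix: the data-generating equations are assumed to be \(X = \lambda_z Z + \epsilon_x\), \(Y_1 = \beta_z Z + \beta_x X + e_1\), and \(Y_2 = \alpha_z Z + \alpha_x X + e_2\); the residual variances are re-derived under that assumed model so that every variable has unit variance; \(a_0\) is read as the usual truncated-normal factor with \(\alpha = \tau\); and the values of \(\lambda_z\), \(\beta_x\), and \(\beta_z\) are hypothetical. The threshold \(\tau\) is found by bisection on \(\lambda_z + a_0 \rho_1 \rho_2 = 0\).

```python
import math

import numpy as np


def a0(tau):
    """a_0 = h * (tau - h), where h = phi(-tau) / Phi(-tau) is the inverse Mills ratio."""
    pdf = math.exp(-0.5 * tau * tau) / math.sqrt(2 * math.pi)  # phi(-tau) = phi(tau)
    cdf = 0.5 * math.erfc(tau / math.sqrt(2))                  # Phi(-tau)
    h = pdf / cdf
    return h * (tau - h)


def solve_tau(lam_z, rho1, rho2, lo=-6.0, hi=6.0):
    """Bisection for tau with lam_z + a0(tau) * rho1 * rho2 = 0 (left side decreases in tau)."""
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if lam_z + a0(mid) * rho1 * rho2 > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def simulate(n, lam_z, beta_x, beta_z, alpha_x, alpha_z, tau, rng):
    """One data set under the assumed model; returns slope biases and corr(X, Z | Y1 > tau)."""
    z = rng.normal(size=n)
    x = lam_z * z + rng.normal(scale=math.sqrt(1 - lam_z**2), size=n)
    # Residual variances that give Var(Y1) = Var(Y2) = 1 under the assumed model.
    s1 = 1 - (beta_z**2 + beta_x**2 + 2 * beta_z * beta_x * lam_z)
    s2 = 1 - (alpha_z**2 + alpha_x**2 + 2 * alpha_z * alpha_x * lam_z)
    y1 = beta_z * z + beta_x * x + rng.normal(scale=math.sqrt(s1), size=n)
    y2 = alpha_z * z + alpha_x * x + rng.normal(scale=math.sqrt(s2), size=n)

    def slope_bias(mask, adjust):
        # Least-squares slope of Y2 on X, optionally adjusting for Z, minus the true alpha_x.
        cols = [np.ones(int(mask.sum())), x[mask]] + ([z[mask]] if adjust else [])
        coef = np.linalg.lstsq(np.column_stack(cols), y2[mask], rcond=None)[0]
        return coef[1] - alpha_x

    everyone = np.ones(n, dtype=bool)
    selected = y1 > tau
    return {
        "corr_xz_selected": np.corrcoef(x[selected], z[selected])[0, 1],
        "all_adjusted": slope_bias(everyone, True),
        "all_unadjusted": slope_bias(everyone, False),
        "selected_adjusted": slope_bias(selected, True),
        "selected_unadjusted": slope_bias(selected, False),
    }


lam_z, beta_x, beta_z = 0.2, 0.5, 0.5  # hypothetical values, not from Appendix 2
alpha_x, alpha_z = 0.0, 0.6            # values stated in the text
rho1 = beta_z + beta_x * lam_z
rho2 = beta_x + beta_z * lam_z
tau = solve_tau(lam_z, rho1, rho2)
result = simulate(400_000, lam_z, beta_x, beta_z, alpha_x, alpha_z, tau,
                  np.random.default_rng(1))
for name, value in result.items():
    print(f"{name}: {value:+.4f}")
```

Under these assumptions, the unadjusted slope in the full data should be off by roughly \(\alpha_z \lambda_z\), whereas the adjusted slopes, and the unadjusted slope in the selected data where \(X\) and \(Z\) are uncorrelated, should be close to the true \(\alpha_x\).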

## Estimated average bias of \(\alpha_x\) under various scenarios.

Here, \(\tau\), \(\beta_x\), and \(\beta_z\) are selected to induce a zero correlation between \(X\) and \(Z\) after selecting on \(Y_1\). Results are based on a sample size of n = 50,000 and 1000 samples obtained from the data-generating model described above.

## Causal relationships of outcome, covariate, and confounding.

Direction of arrows represents causal directions, and \(\lambda_z\), \(\alpha_z\), \(\alpha_x\), \(\beta_z\), and \(\beta_x\) are structural coefficients.

## Possible values of \(\lambda_z\) based on each choice of the pair \((a, b)\).

The area shaded in green denotes the area for which a \(\lambda_z\) value has a value \(\tau\) that makes Equation 13 equal zero.

National Institutes of Health (R25HL124208), National Institutes of Health (R25DK099080), National Institutes of Health (1R01DK136994-01), National Institutes of Health (1R01DK132385-01).

The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.

## Acknowledgements

DBA and CJV are supported in part by NIH grants R25HL124208 and R25DK099080. RSZ is supported in part by NIH grant 1R01DK136994-01 and CDT is supported in part by NIH grant 1R01DK132385-01. The authors thank the following colleagues for critical comments on the manuscript: Boyi Guo, Joseph Kush, Cai Li, Sanjay Shete, Lehana Thabane, Ahmad Zia Wahdat, and Rafi Zad. Jennifer Holmes, ELS, provided medical editing and editorial assistance.

## Senior Editor

- Detlef Weigel, Max Planck Institute for Biology Tübingen, Germany

## Reviewing Editor

- Jameel Iqbal, DaVita Labs, United States

## Version history

- Received: July 29, 2022
- Accepted: April 2, 2024
- Version of Record published: May 16, 2024 (version 1)

© 2024, Yu, Zoh et al.

This article is distributed under the terms of the Creative Commons Attribution License , which permits unrestricted use and redistribution provided that the original author and source are credited.


- independent variable
- covariate measurement error
- confounding
- causal effect
- association

## Statistical inference for linear quantile regression with measurement error in covariates and nonignorable missing responses

- Published: 18 May 2024

- Xiaowen Liang 1 &
- Boping Tian 1

In this paper, we consider quantile regression estimation for linear models with covariate measurement errors and nonignorable missing responses. First, the influence of measurement errors is eliminated through a bias-corrected quantile loss function. To handle the identifiability issue under nonignorable missingness, a nonresponse instrument is used. Then, based on the inverse probability weighting approach, we propose a weighted bias-corrected quantile loss function that can handle both nonignorable missingness and covariate measurement errors. Under certain regularity conditions, we establish the asymptotic properties of the proposed estimators. The finite sample performance of the proposed method is illustrated by Monte Carlo simulations and an empirical data analysis.

Carroll RJ, Ruppert D, Stefanski LA (1995) Measurement error in nonlinear models. Chapman and Hall, London

Chen J, Shao J, Fang F (2021) Instrument search in pseudo-likelihood approach for nonignorable nonresponse. Ann Inst Stat Math 73(3):519–533. https://doi.org/10.1007/s10463-020-00758-z

Ding X, Chen J, Chen X (2020) Regularized quantile regression for ultrahigh-dimensional data with nonignorable missing responses. Metrika 83(5):545–568. https://doi.org/10.1007/s00184-019-00744-3

Hansen L (1982) Large sample properties of generalized method of moments estimators. Econometrica 50(4):1029–1054. https://doi.org/10.1016/j.jeconom.2012.05.008

He X, Liang H (2000) Quantile regression estimates for a class of linear and partially linear errors-in-variables models. Stat Sin 10(1):129–140

Jiang D, Zhao P, Tang N (2016) A propensity score adjustment method for regression models with nonignorable missing covariates. Comput Stat Data Anal 94:98–119. https://doi.org/10.1016/j.csda.2015.07.017

Jiang R, Qian W, Zhou Z (2016) Weighted composite quantile regression for single-index models. J Multivar Anal 148:34–48. https://doi.org/10.1016/j.jmva.2016.02.015

Jiang R, Qian W, Zhou Z (2018) Weighted composite quantile regression for partially linear varying coefficient models. Commun Stat Theory Methods 47(16):3987–4005. https://doi.org/10.1080/03610926.2017.1366522

Kai B, Li R, Zou H (2011) New efficient estimation and variable selection methods for semiparametric varying-coefficient partially linear models. Ann Stat 39(1):305–332. https://doi.org/10.1214/10-AOS842

Koenker R, Bassett G (1978) Regression quantiles. Econometrica 46(1):33–50

Koenker R, Chernozhukov V, He X, Peng L (2017) Handbook of Quantile Regression. Chapman and Hall, New York

Little R, Rubin D (2002) Statistical Analysis with Missing Data. Wiley, New York

Ma W, Zhang T, Wang L (2022) Improved multiple quantile regression estimation with nonignorable dropouts. J Korean Stat Soc 52:1–32. https://doi.org/10.1007/s42952-022-00185-1

Qin G, Zhang J, Zhu Z (2016) Quantile regression in longitudinal studies with dropouts and measurement errors. J Stat Comput Simul 86(17):3527–3542. https://doi.org/10.1080/00949655.2016.1171867

Wang H, Stefanski L, Zhu Z (2012) Corrected-loss estimation for quantile regression with covariate measurement errors. Biometrika 99(2):405–421. https://doi.org/10.1093/biomet/ass005

Wang L, Shao J, Fang F (2021) Propensity model selection with nonignorable nonresponse and instrument variable. Stat Sin 31(2):647–672. https://doi.org/10.5705/ss.202019.0025

Wang S, Shao J, Kim JK (2014) An instrumental variable approach for identification and estimation with nonignorable nonresponse. Stat Sin 24(3):1097–1116. https://doi.org/10.5705/ss.2012.074

Wei Y, Carroll RJ (2009) Quantile regression with measurement error. J Am Stat Assoc 104(487):1129–1143. https://doi.org/10.1198/jasa.2009.tm08420

White H (1980) Nonlinear regression on cross-sectional data. Econometrica 48:721–746

Yu A, Zhong Y, Feng X, Wei Y (2022) Quantile regression for nonignorable missing data with its application of analyzing electronic medical records. Biometrics 79(3):2036–2049. https://doi.org/10.1111/biom.13723

Yu K, Lu Z (2004) Local linear additive quantile regression. Scand J Stat 31(3):333–346. https://doi.org/10.1111/j.1467-9469.2004.03_035.x

Zhao P, Zhao H, Tang N, Li Z (2017) Weighted composite quantile regression analysis for nonignorable missing data using nonresponse instrument. J Nonparametric Stat 29(2):189–212. https://doi.org/10.1080/10485252.2017.1285030

## Author information

Authors and affiliations.

School of Mathematics, Harbin Institute of Technology, Xidazhi, Harbin, 150001, Heilongjiang, China

Xiaowen Liang & Boping Tian

## Corresponding author

Correspondence to Boping Tian .

## Ethics declarations

Conflict of interest.

No potential conflict of interest was reported by the authors.

## Additional information

Publisher's note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

## Proofs of theorems

To establish the asymptotic properties of the proposed estimators, the following regularity conditions are imposed.

(C1) The samples \(\{({\textbf{X}}_i,Y_i,\delta _i):i=1,\ldots ,n\}\) are independent and identically distributed.

(C2) The parameter space of \(\varvec{\beta }\) , denoted by \(\Theta _{\varvec{\beta }}\) , is a compact set. The vector \(\varvec{\beta }\) is an interior point of \(\Theta _{\varvec{\beta }}\) .

(C3) The expectation \({\mathbb {E}}\left[ \left\| {\textbf{X}}_i\right\| ^2\right] \) is bounded, and \({\mathbb {E}}\left[ {\textbf{X}}_i {\textbf{X}}_i^\top \right] \) is a positive definite \(p \times p\) matrix.

(C4) The probability density function of \(e_{i}\) conditional on \({\textbf{X}}_{i}\) is bounded away from infinity, and it is bounded away from zero and has a bounded first derivative in the neighbourhood of zero.

(C5) For each i , \({\mathbb {E}}\left[ e_i^2 \mid {\textbf{X}}_i\right] \) is bounded as a function of \(\tau \) .

(C6) The kernel function K ( x ) is a bounded probability density function with a finite fourth moment. Moreover, K ( x ) is twice differentiable, and its second derivative \(K^{(2)}(x)\) is bounded and Lipschitz continuous on \((-\infty , \infty )\) .

(C7) Let \(\rho ^\star (Y, {\textbf{W}},\varvec{\beta },h,\delta ,\varvec{\alpha })\) and \(\rho (Y, {\textbf{W}}, \varvec{\beta }, h)\) respectively denote the weighted bias-corrected quantile loss function and the bias-corrected quantile loss function for either normal or Laplace measurement errors. Denote \(\psi _1(Y, {\textbf{W}}, \varvec{\beta },h,\delta , \varvec{\alpha }) =\partial \rho ^\star (Y, {\textbf{W}}, \varvec{\beta },h,\delta , \varvec{\alpha }) / \partial \varvec{\beta }\) , \(\psi _2(Y, {\textbf{W}}, \varvec{\beta },h,\delta , \varvec{\alpha })=\) \(\partial ^2 \rho ^\star (Y, {\textbf{W}}, \varvec{\beta }, h, \delta ,\varvec{\alpha })/ \partial \varvec{\beta }\partial \varvec{\beta }^\top \) and \({\tilde{\psi }}_1(Y, {\textbf{W}}, \varvec{\beta }, h )\) \(=\partial \rho (Y, {\textbf{W}}, \varvec{\beta }, h) / \partial \varvec{\beta }\) . As \(n \rightarrow \infty \) and \(h \rightarrow 0\) , there exist positive definite matrices \({\textbf{A}}\) , \({\textbf{B}}\) and \({\textbf{H}}\) such that \({\mathbb {E}}[\psi _2(Y, {\textbf{W}}, \varvec{\beta }_0, h, \delta ,\varvec{\alpha }_0)] \rightarrow {\textbf{A}}\) and \({\mathbb {E}}[\psi _1(Y, {\textbf{W}}, \varvec{\beta }_0, h,\delta ,\varvec{\alpha }_0)^{\otimes 2}] \rightarrow {\textbf{B}}\) , \({\mathbb {E}}\left[ \frac{\partial \Delta \left( {\textbf{V}}, Y, \varvec{\alpha }_0\right) / \partial \varvec{\alpha }}{\Delta \left( {\textbf{V}}, Y, \varvec{\alpha }_0\right) } {\tilde{\psi }}_1\left( Y, {\textbf{W}}, \varvec{\beta }_0, h\right) \right] \rightarrow {\textbf{H}}\) .

(C8) The propensity \(\Delta \left( {\textbf{V}},Y, \varvec{\alpha }\right) \) satisfies: (a) it is twice differentiable with respect to \(\varvec{\alpha }\) ; (b) \(0< c<\Delta \left( {\textbf{V}},Y,\varvec{\alpha }\right) < 1\) for a positive constant c ; (c) \(\partial \Delta ({\textbf{V}},Y,\varvec{\alpha })/\partial \varvec{\alpha }\) is uniformly bounded.

(C2) ensures the existence of \(\hat{\varvec{\beta }}_{\mathcal {N}}\) and \(\hat{\varvec{\beta }}_{\mathcal {L}}\) , and the uniformity of the convergence of the minimand over \(\Theta _{\varvec{\beta }}\) , as required to prove consistency. (C3)-(C4) ensure that \(\varvec{\beta }_0\) is the unique minimizer of \({\mathbb {E}}\{\rho (Y,{\textbf{X}},\varvec{\beta })\}\) . (C7) is assumed in order to establish the asymptotic normality of the estimators \({\varvec{\hat{\beta }}}_{\mathcal {N}}\) and \({\varvec{\hat{\beta }}}_{\mathcal {L}}\) .

## Proof of Theorem 1

Following Theorem 2 in Wang et al. ( 2012 ), we have \({\mathbb {E}} [M_{\mathcal {L}}\left( {\textbf{W}}, \varvec{\beta }, h, \varvec{\alpha }_0\right) ]={\mathbb {E}} [{\dot{M}}_{\mathcal {L}}\left( {\textbf{X}}, \varvec{\beta }, h, \varvec{\alpha }_0\right) ]\) , which leads to

For the first term on the right-hand side of the inequality, by Taylor expansion,

where \(\varvec{\alpha }^*\) lies between \(\varvec{\alpha }_0\) and \(\hat{\varvec{\alpha }}\) . According to Wang et al. ( 2014 ), we have \(\hat{\varvec{\alpha }}-\varvec{\alpha }_0=O_p(n^{-1 / 2})\) , and then

Using arguments similar to those in Wang et al. ( 2012 ) and Qin et al. ( 2016 ), the other four terms on the right-hand side of the inequality satisfy, respectively,

Combining Eqs. ( A2 )–( A6 ), when \(h \rightarrow 0\) and \((n h)^{-1 / 2} \log (n) \rightarrow 0\) , we can obtain

By Conditions (C3) and (C4), \(\varvec{\beta }_0\) uniquely minimizes \({\mathbb {E}}[ M\left( {\textbf{X}}, \varvec{\beta }, \varvec{\alpha }_0\right) ]\) over \(\Theta _{\varvec{\beta }}\) . According to Lemma 2.2 in White ( 1980 ), \(\hat{\varvec{\beta }}_{\mathcal {L}}\) converges to \(\varvec{\beta }_0\) in probability. \(\square \)

## Proof of Theorem 2

The proof of Theorem 2 is similar to the proof of Theorem 1 and the proof of Part (ii) of Theorem 3 in Wang et al. ( 2012 ), so it is omitted here. \(\square \)

## Proof of Theorem 3

In the proof of Theorem 3 , we drop \({\mathcal {L}}\) in \(\hat{\varvec{\beta }}_{\mathcal {L}}\) for notational simplicity. Define

Furthermore, let

By Taylor expansion, we can obtain

After sorting, we get

Under the Conditions (C1)-(C6), we have

By Condition (C7),

For \(I_{1}\) , there is

By using the results of Theorem 2 in Wang et al. ( 2012 ) and methods like those used to obtain the asymptotic means and variances of kernel density estimators, we have

as \(n \rightarrow \infty \) and \(h \rightarrow 0\) . Therefore

Together with ( A8 ), ( A9 ), and ( A10 ), Condition (C7) and the central limit theorem, we can derive that

In terms of \(I_{2}\) , noticing that

thus we have

by the law of large numbers and Condition (C7). Furthermore, according to Wang et al. ( 2014 ), we have \(\hat{\varvec{\alpha }}-\varvec{\alpha }_0=O_p(n^{-1 / 2})\) and \(n^{1 / 2}\left( \hat{\varvec{\alpha }}-\varvec{\alpha }_0\right) {\mathop {\longrightarrow }\limits ^{d}} {\mathcal {N}}(0, \varvec{\Sigma }_{\alpha })\) with \(\varvec{\Sigma }_{\alpha }=\left\{ \varvec{\Lambda }^{\top } \varvec{\Omega }^{-1} \varvec{\Lambda }\right\} ^{-1}\) . As a result,

In addition, it is not difficult to verify that \({\mathbb {E}}[I_{i 1}+I_{i 2}]=o_p(1)\) . Note that

Direct calculation yields that

Employing an idea similar to that in the proof for \(I_{i 2}\) , we get \(D_1=o_p(1)\) . On the other hand,

For \(i \ne j\) , note that \(\frac{-\delta _{{j}}\partial \Delta ({\textbf{V}}_{{j}},Y_{{j}},\varvec{\alpha }_{0})/\partial \varvec{\alpha }}{\Delta ({\textbf{V}}_{{j}},Y_{{j}},\varvec{\alpha }_0)^2}{\tilde{\psi }}_1\left( Y_{{j}},{\textbf{W}}_{{j}}, \varvec{\beta }_0, h \right) \left( \hat{\varvec{\alpha }}-\varvec{\alpha }_0\right) \) and \(\psi _1\big (Y_i,{\textbf{W}}_{i}, \) \(\varvec{\beta }_0, h,\delta _i,\varvec{\varvec{\alpha }}_0\big )\) are independent. Similar to the proofs of \(I_{i 1}\) and \(I_{i 2}\) , it leads to

Thus, we have \(D_2=o_p(1)\) . To conclude, \(\left( n^{-1 / 2} \sum _{i=1}^n I_{i 1}\right) \cdot \left( n^{-1 / 2} \sum _{i=1}^n I_{i 2}\right) =\) \(o_p(1)\) , then \({\text {Cov}}\left( I_{i 1}, I_{i 2}\right) \) \(=o_p(1)\) . Hence, it can be derived that \({\text {Cov}}\left( I_{i 1}+I_{i 2}\right) ={\textbf{B}}+{\textbf{H}} \varvec{\Sigma }_{\alpha } {\textbf{H}}^{\top }\triangleq {\textbf{D}}\) . As a result,

Combining ( A7 ) and ( A14 ), we obtain

The asymptotic normality of \(\hat{\varvec{\beta }}_{\mathcal {N}}\) can be obtained by replacing \(\rho _{\mathcal {L}}(Y, {\textbf{W}}, \varvec{\beta }, h) \) and \(\dot{\rho }_{\mathcal {L}}\left( Y, {\textbf{X}}, \varvec{\beta }, h\right) \) with \(\rho _{\mathcal {N}}(Y, {\textbf{W}}, \varvec{\beta }, h) \) and \(\dot{\rho }_{\mathcal {N}}\left( Y, {\textbf{X}}, \varvec{\beta }, h\right) \) in the above proof. \(\square \)

## Additional simulation studies

## 1.1 Simulation I

Consider model

where \((X_{i1},X_{i2})^\top \sim {\mathcal {N}}\left( {\textbf{1}}, \varvec{\Sigma }_x\right) \) with \(\varvec{\Sigma }_x=\left( 0.5^{|j-k|}\right) _{1\le j,k\le 2}\) . The measurement error model is \({\textbf{W}}_{i}={\textbf{X}}_{i }+{\textbf{U}}_{i}\) , where \({\textbf{U}}_{i} \sim {\mathcal {L}}({\textbf{0}},\varvec{\Sigma })\) with

In this example, we choose \(K(x)=\frac{1}{\sqrt{2\pi }}e^{-\frac{x^2}{2}}\) . The other settings are the same as those in the example in Sect. 4.2 , and the missing rate in this example is between \(30\%\) and \(48\%\) . The simulation results are presented in Tables 7 and 8 . Figure 2 presents the boxplots of \({\hat{\beta }}_k-\beta _{0k},(k=1,2)\) at \((\tau ,n)=(0.25,300)\) for all four methods. We reach conclusions similar to those in the first example.
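The covariate and measurement-error part of this design can be sketched as below. Two things are assumptions of the sketch rather than details from the paper: the error covariance `SIGMA_U` is a hypothetical placeholder, since the matrix \(\varvec{\Sigma }\) is not reproduced here, and \({\mathcal {L}}({\textbf{0}},\varvec{\Sigma })\) is read as the symmetric multivariate Laplace distribution, drawn as the square root of a standard exponential variable times a \({\mathcal {N}}({\textbf{0}},\varvec{\Sigma })\) vector, which has covariance \(\varvec{\Sigma }\).

```python
import numpy as np

# Covariance of the true covariates: entries 0.5^{|j-k|} for j, k in {1, 2}.
SIGMA_X = np.array([[1.0, 0.5], [0.5, 1.0]])
# Hypothetical measurement-error covariance (the paper's Sigma is not shown here).
SIGMA_U = 0.25 * np.eye(2)


def gaussian_kernel(x):
    """K(x) = exp(-x^2 / 2) / sqrt(2 * pi), the kernel chosen in this example."""
    return np.exp(-0.5 * np.square(x)) / np.sqrt(2 * np.pi)


def simulate_covariates(n, sigma_u, rng):
    """Draw true covariates X ~ N(1, SIGMA_X) and error-prone surrogates W = X + U."""
    x = rng.multivariate_normal(mean=np.ones(2), cov=SIGMA_X, size=n)
    # Symmetric multivariate Laplace draw (assumed reading of L(0, Sigma)):
    # U = sqrt(E) * G with E ~ Exp(1) and G ~ N(0, sigma_u), so Cov(U) = sigma_u.
    e = rng.exponential(scale=1.0, size=(n, 1))
    g = rng.multivariate_normal(mean=np.zeros(2), cov=sigma_u, size=n)
    u = np.sqrt(e) * g
    return x, x + u


x, w = simulate_covariates(200_000, SIGMA_U, np.random.default_rng(2))
print(np.cov(w, rowvar=False))  # should be close to SIGMA_X + SIGMA_U
```

The observed covariates are more dispersed than the true ones by the error covariance, which is the source of the bias that a corrected loss function is meant to remove.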

Boxplots of \({\hat{\beta }}_{1} -\beta _{01}\) (left) and \({\hat{\beta }}_2-\beta _{02}\) (right) for different error distributions in Simulation I at \((\tau ,n)=(0.25,300)\)

## 1.2 Simulation II

Boxplots of \({\hat{\beta }}_1-\beta _{01}\) (left) and \({\hat{\beta }}_2-\beta _{02}\) (right) for different error distributions in Simulation II with \({\textbf{U}}_{i}\sim {\mathcal {N}}({\textbf{0}},\varvec{\Sigma })\)

Boxplots of \({\hat{\beta }}_{1} -\beta _{01}\) (left) and \({\hat{\beta }}_2-\beta _{02}\) (right) for different error distributions in Simulation II with \({\textbf{U}}_{i}\sim {\mathcal {L}}({\textbf{0}},\varvec{\Sigma })\)

In order to verify whether the proposed method is robust to misspecification of the measurement error distribution, in this example we exchange the distribution of the measurement error \({\textbf{U}}_{i}\) between the example in Sect. 4.2 and the example in Simulation I, without changing the other settings. More specifically, we define two estimators \({\hat{\beta }}_{{\mathcal {L}}{{\mathcal {N}}}}\) and \({\hat{\beta }}_{{\mathcal {N}}{{\mathcal {L}}}}\) as follows

Table 9 reports the bias and RMSE of the estimators \(\hat{\varvec{\beta }}_{{\mathcal {L}}{{\mathcal {N}}}}\) and \(\hat{\varvec{\beta }}_{{\mathcal {N}}{{\mathcal {L}}}}\) over 200 simulation replicates. Figures 3 and 4 present the boxplots of \({\hat{\beta }}_k-\beta _{0k},(k=1,2)\) at \(\tau =0.50\) . The simulation results show that both proposed estimators are robust to misspecification of the measurement error distribution.

## Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

## About this article

Liang, X., Tian, B. Statistical inference for linear quantile regression with measurement error in covariates and nonignorable missing responses. Metrika (2024). https://doi.org/10.1007/s00184-024-00967-z

Received : 11 June 2023

Accepted : 09 April 2024

Published : 18 May 2024

DOI : https://doi.org/10.1007/s00184-024-00967-z

- Measurement errors
- Nonignorable missing
- Quantile regression
- Inverse probability weighting

- Open access
- Published: 10 May 2024

## Spatiotemporal trends and covariates of Lyme borreliosis incidence in Poland, 2010–2019

- Joanna Kulisz 1 ,
- Selwyn Hoeks 2 ,
- Renata Kunc-Kozioł 1 ,
- Aneta Woźniak 1 ,
- Zbigniew Zając 1 ,
- Aafke M. Schipper 2 ,
- Alejandro Cabezas-Cruz 3 &
- Mark A. J. Huijbregts 2

Scientific Reports volume 14, Article number: 10768 (2024)

- Ecological epidemiology
- Ecological modelling
- Environmental impact
- Infectious diseases

Lyme borreliosis (LB) is the most commonly diagnosed tick-borne disease in the northern hemisphere. Since an efficient vaccine is not yet available, prevention of transmission is essential. This, in turn, requires a thorough comprehension of the spatiotemporal dynamics of LB transmission as well as underlying drivers. This study aims to identify spatiotemporal trends and unravel environmental and socio-economic covariates of LB incidence in Poland, using consistent monitoring data from 2010 through 2019 obtained for 320 (aggregated) districts. Using yearly LB incidence values, we identified an overall increase in LB incidence from 2010 to 2019. Additionally, we observed a large variation of LB incidences between the Polish districts, with the highest risks of LB in the eastern districts. We applied spatiotemporal Bayesian models in an all-subsets modeling framework to evaluate potential associations between LB incidence and various potentially relevant environmental and socio-economic variables, including climatic conditions as well as characteristics of the vegetation and the density of tick host species. The best-supported spatiotemporal model identified positive relationships between LB incidence and forest cover, the share of parks and green areas, minimum monthly temperature, mean monthly precipitation, and gross primary productivity. A negative relationship was found with human population density. The findings of our study indicate that LB incidence in Poland might increase as a result of ongoing climate change, notably increases in minimum monthly temperature. Our results may aid in the development of targeted prevention strategies.

## Introduction

Lyme borreliosis (LB) is the most commonly diagnosed tick-borne disease in the northern hemisphere 1 , 2 , 3 , 4 . It is estimated that over 1,000,000 people worldwide are affected by LB each year, with approximately 25% of cases occurring in Europe. The highest incidence rates among European countries are recorded in Belgium, Finland, the Netherlands, and Switzerland (> 100/100,000 inhabitants per year), and the lowest in Belarus, Croatia, Denmark, France, Ireland, Portugal, and the United Kingdom (except Scotland) (< 20/100,000 inhabitants per year) 5 , 6 . In Poland, the incidence of LB is at a level similar to that reported from neighboring countries, i.e., Czech Republic and Germany (20–40 cases per 100,000 inhabitants per year). The yearly economic costs associated with diagnosing and treating LB are substantial, surpassing $20 million in the Netherlands, over $40 million in Germany, and ranging from $800 million to over $3 billion in the United States 7 , 8 , 9 , 10 .

The agents of LB are spirochetes of the Borreliella genus, primarily B. burgdorferi in North America and B. afzelii , B. garinii, B. bavariensis and B. spielmanii in Europe. Ticks of the Ixodes genus, including I. scapularis and I. pacificus in North America and I. persulcatus and I. ricinus in Eurasia, are the primary vectors for Borreliella spirochetes 1 , 11 , 12 . Ixodes ricinus is widely distributed across Europe, has a non-specific host range, and plays a crucial role in the enzootic circulation of Borreliella spp. spirochetes 13 , 14 , 15 . The broad feeding capability of I. ricinus , enabling it to feed on over 300 vertebrate species from diverse taxonomic groups occurring in natural, urban, and suburban environments, greatly enhances the circulation and transmission of LB across the European continent 16 , 17 , 18 .

People spending time in tick-populated habitats are at the highest risk of infection. Since an effective vaccine is currently not available, prevention of transmission is essential 19 , 20 , 21 . This, in turn, requires a good understanding of the spatio-temporal dynamics of LB transmission as well as the underlying factors. It is well known that climatic factors, including temperature and precipitation, affect tick distribution and the prevalence of tick-borne pathogens 12 , 22 . The presence of potential tick hosts and the structure of their communities also influence ticks’ behavior, pathogen prevalence, and the risk of pathogen transmission 16 , 23 . Tick distribution and pathogen prevalence are further related to vegetation. Deciduous forests in particular provide a suitable habitat for I. ricinus , due to favorable humidity conditions and the support of its hosts. However, urban green areas are also inhabited by pathogen-infected ticks and pose a risk for disease transmission to humans 16 , 21 , 24 . The multitude of underlying environmental factors makes it challenging to understand their relative and combined contribution to LB incidence.

Statistical modeling based on reliable and consistent data is a useful tool to identify and unravel the different factors, which in turn can help to predict disease incidence, identify high-risk areas, and develop LB-focused educational and prophylaxis programs 21 , 25 . Unfortunately, large-scale modeling can be challenging due to differences in data collection systems, storage, availability, and legal requirements for reporting LB incidence within and between countries 26 . For instance, the European Centre for Disease Prevention and Control (ECDC) collects and presents reported data only on neuroborreliosis, which for Poland in 2019 constituted about 1.5% of all LB cases. In Poland, all diagnosed forms of LB are reported to the National Institute of Public Health – National Institute of Hygiene (NIPH-NIH) 27 , 28 .

In this study, we aimed to identify spatiotemporal trends and unravel covariates of LB incidence in Poland by using a unique and consistent dataset with yearly incidences collected from 380 Polish districts (organized into 320 territorial units) from 2010 through 2019. The dataset enabled us to systematically explore how the LB incidence is related to various relevant environmental and socio-economic factors, including climatic conditions, vegetation and tick host community characteristics, and human population density, covering the territory of the country up to the highest possible spatial resolution (district level). To provide a comprehensive understanding of potential associations between LB incidence and environmental or socio-economic variables, we utilized conditional autoregressive Bayesian models, accounting for spatiotemporal autoregressive processes 29 , 30 .

## Lyme borreliosis incidence in Poland

Our analysis of the trends in LB incidence indicated a clear overall increase of LB cases in Poland during the studied period, yet with considerable variation in the trends between the districts (Fig. 1 ). Based on mean incidence data over 2010–2019, we identified high-risk LB transmission regions in the eastern, north-eastern and southern districts, with mean values per 100,000 inhabitants reaching up to 276.2 in a district located in the Masovian Voivodeship, 213.3 in the Podlachian, 192.3 in the Lesser Poland, 176.5 in the Lublin and 156.8 in the Warmian-Masurian Voivodeship (Fig. 2 ; Supplementary Fig. S1 , Supplementary Table S1 ).

Figure 1. Relative change in Lyme borreliosis incidence per 100,000 inhabitants from 2010 to 2019. The red line depicts the median percentage change across all 320 (aggregated) districts. The grey-shaded area shows the interquartile range (25th and 75th percentiles) of all data. The light-grey lines show the individual district trends. This figure was generated with the use of R (version 4.3.2) using ggplot2 (version 3.4.4).

Figure 2. Mean Lyme borreliosis incidence per 100,000 inhabitants from 2010 to 2019 (log10-transformed) for all 320 aggregated districts. This map was generated based on the compiled LB incidence data using an online tool ( https://www.datawrapper.de/ ).

## Modelling results

The most parsimonious spatiotemporal Bayesian model revealed positive relationships between the incidence of LB and the following factors: annual mean monthly precipitation (mm/month), the share of parks, lawns, and green areas in housing estate areas (%), annual minimum monthly temperature (°C), yearly mean 8-day gross primary productivity (gC/m²) and the percentage of area covered by forest (Fig. 3 ; Supplementary Table S2 ). In contrast, we found a negative relationship between LB incidence and human population density (individuals/km²) (Fig. 3 ). The posterior distributions of the factors retained in the final model showed 95% credible intervals (95% CI) that did not include zero (Fig. 3 ), except for the mean monthly precipitation (mean of 0.27 and 95% CI of −0.03 to 0.56; Supplementary Table S2 ). The most parsimonious model had a Deviance Information Criterion (DIC) value that was approximately five points lower than the second-lowest DIC value found during the model selection.

Figure 3. Partial response plots based on the posterior distributions for the fixed variables retained in the most parsimonious model. Blue lines depict the mean posterior response, and the grey lines show the individual Markov chain Monte Carlo samples ( n = 1000). Black dots represent the raw data points. This figure was generated with the use of R (version 4.3.2) using ggplot2 (version 3.4.4).

The spatial and temporal autoregressive parameters estimated in the final model had a mean of 0.74 (95% CI: 0.64 to 0.84) and a mean of 0.88 (95% CI: 0.85 to 0.91), respectively, confirming strong spatiotemporal autocorrelation in the data (Supplementary Table S2 ). Applying the Moran’s I test to the residuals of our final model showed that the spatial autocorrelation in the residuals was removed ( p > 0.05).

The time trend of predicted median LB incidence across all districts showed a strong resemblance to the yearly median observed values (Supplementary Fig. S2 ), indicating that the model captures the temporal trend well. Using the symmetric mean absolute percentage error (sMAPE), we evaluated the predictive accuracy of our model on the district level. The calculated sMAPE-values ranged between 1.1% and 31.2%, with an average sMAPE-value of 5.63% across the districts.

## Discussion

Tick-borne diseases, including LB, are among the most frequently diagnosed human infectious diseases in Europe 5 , 27 . Hence, a solid understanding of the factors influencing their incidence is extremely important. Our results indicate that LB transmission in Poland is especially prevalent in the eastern regions (Fig. 2 , Supplementary Fig. S1 , Supplementary Table S1 ). Mean LB incidence in Poland during the studied period ranged from 2.7 up to 276 cases per 100,000 inhabitants across the districts (Supplementary Table S1 ), which is higher than values reported from countries from the region 5 , 27 . Moreover, our spatiotemporal analysis revealed a substantial overall increase in LB incidence in Poland, with a median increase of ca. 80% from 2010 to 2019 (Fig. 1 ).

Based on consistent, long-term data covering the whole country up to the highest possible spatial resolution (district level), our research revealed that LB incidence in Poland is related to various environmental and socio-economic factors. Specifically, our analysis revealed a positive relationship between LB incidence and forest cover in Poland, which is in line with findings reported previously in the United States 31 . While ticks tend to hide in forest vegetation and litter, where they can reabsorb water and conserve energy during periods of non-feeding, forests also support a diverse range of ticks’ potential hosts, including mammals and birds 32 , 33 , 34 , 35 . Increased forest cover in a region, including forests accessible to the public, raises the risk of ticks’ interactions with humans, leading to pathogen transmission 36 . Next to forests, other green areas, such as parks, gardens, and lawns, may also provide suitable microhabitats for ticks and their hosts and are linked to LB transmission, as revealed by our analysis and in line with previous studies 37 , 38 . People living near forests and other types of green areas or engaging in recreational or professional activities, such as foresters, outdoor workers, gardeners, and athletes, face an elevated risk of tick bites and Borreliella spirochetes transmission 19 , 20 .

We further found that LB incidence increases with increasing minimum monthly temperature (Fig. 3 ). The importance of climatic conditions is in line with the findings of previous studies showing that tick activity depends on local as well as large-scale climatic conditions, including temperature and humidity 33 , 35 , 39 . As I. ricinus ticks overwinter buried in the litter, severe winter conditions can affect their survival and subsequent spring activity 40 . Low temperatures may also negatively affect rodents, which are primary hosts for juvenile tick stages, and deer, which are one of the hosts of I. ricinus adults 41 . Therefore, increases in minimum temperatures caused by climate change may lead to an increased risk of LB transmission. In this context, it is worth noting that particular microorganisms may also promote ticks’ winter survival, as it was reported that B. burgdorferi -infected females of I. scapularis had increased overwintering ability in comparison to uninfected ticks 42 .

Our model revealed a positive relationship between LB incidence and annual average 8-day gross primary productivity (GPP) (gC/m²). As GPP is a measure of ecosystem productivity in terms of energy and/or biomass production by primary producers, the positive relationship indicates that increases in ecosystem productivity may support local tick populations by creating suitable microhabitats (including critters), and by mitigating adverse environmental conditions 43 , 44 , 45 . A higher ecosystem productivity may also benefit herbivores and predators, which are common tick hosts, thus contributing to an increased transmission to humans 46 . However, the density of mammals was not retained as a covariate in our best-supported model.

Moreover, our results indicated a positive relationship between LB incidence and mean monthly precipitation, although the 95% credible interval for this covariate narrowly included zero (− 0.03 to 0.56). This is consistent with studies showing that ticks, including I. ricinus, are associated with microhabitats characterized by high humidity 47 . A suitable range of humidity promotes ticks’ host-seeking behavior and development, including embryogenesis, hatching, and molting 40 . Furthermore, humid conditions in the ticks’ microhabitats also affect the biology and phenology of their hosts, which may impact ticks’ success in both host-seeking and feeding 33 , 48 . It is worth underlining that changes in humidity may have long-term and delayed consequences for the risk of pathogen transmission since the life cycle of I. ricinus could last up to several years 40 . As humidity is linked to precipitation, changes in the magnitude and frequency of precipitation events due to climate change may alter the risk of LB transmission 32 , 33 , 39 .

Finally, we found a negative relationship between the incidence of LB and human population density (see Fig. 3 ). Highly urbanized areas, including large cities and agglomerations, have the highest human population density in Poland 49 . These areas are drastically influenced by human activities, likely reducing the availability of habitats of ticks and their hosts, resulting in decreased tick abundance and reduced risk of human-tick contact. On the other hand, increased human presence in developing suburbs and rural areas, as in the construction of houses and settlements in tick occurrence areas, can elevate the risk of pathogen transmission to humans and pets 18 . Adverse characteristics of urban habitats may be mitigated by improved microclimatic conditions, as cities located in temperate zones may be more suitable for ticks (even if their local populations are relatively small), due to slightly higher mean annual temperatures compared to surrounding areas 50 .

Finally, the spatial autocorrelation presented by the posterior distribution of the spatiotemporal random effects can be attributed to differences between districts not captured by the covariates included in our model (Supplementary Table S2 ). For example, the territory of Poland is characterized by varied topography, from lowlands in the north to mountains in the south of the country, including areas of lakelands and highlands, which may locally impact microclimatic conditions influencing both ticks and their hosts. Additional factors that may affect LB incidence include land cover types, specifically agriculturally used and fallow lands, as well as ecotones – transition zones between diverse types of ecosystems 37 , 51 , 52 . We also note that we did not consider the density of tick populations and the proportion of Borreliella -infected specimens. Although available data indicate that I. ricinus , the main vector of LB in Poland, can be found across the country, its occurrence is characterized by a patchy distribution 53 , 54 . Furthermore, Borreliella spp. prevalence in ticks varies across Poland 55 , 56 . However, the incorporation of these factors into country-wide analyses is hampered by differences in the methodologies applied for tick collection and pathogen detection 55 . Moreover, human behavior may also affect the risk of LB transmission, for example through encroachment into ticks’ habitats during recreation and traveling, as well as the ‘urbanization’ of tick species together with their hosts 18 . Follow-up research is needed to get a better understanding of the influence of these factors on LB incidence.

## Conclusions

Based on detailed epidemiological data gathered on the level of districts in Poland for the period 2010–2019, we were able to analyze spatio-temporal trends in the incidence of LB in Poland and link it to vegetation characteristics, climate factors, and socio-economic variables. The overall increase in LB incidence and potential future increases due to climate change justify increased attention in national health policy, for example via pathogen screening programs covering people in occupations associated with a high risk of LB transmission. Educating citizens about the disease, its vector, transmission routes, and preventive measures could also be a key component of the national health policy.

## Lyme borreliosis incidence data collection and preparation

We obtained data on yearly Lyme borreliosis (LB) cases for 380 districts (in Polish: powiat) between 2010 and 2019 from each of the 16 Voivodeship Sanitary Stations in Poland, upon request. We excluded the years 2020–2022 from the analysis to avoid bias caused by the SARS-CoV-2 pandemic, as suggested by previously published papers 57 . The dataset encompassed a total of 3,140 yearly reported LB cases. In some cases, a single Sanitary Station covered multiple administrative districts and reported accumulated epidemiological data. As a result, our analysis is based on 320 individual or aggregated districts. To calculate LB incidences (cases per 100,000 inhabitants), we divided the number of cases by the number of inhabitants per district or aggregated district 58 and we log-transformed the result to reduce the positive skew in the data distribution, as follows:

where \({N}_{LB}\) and \(N\) are the total number of LB cases and the total number of inhabitants for a specific year and district, respectively.
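
The displayed equation is not reproduced in this copy. A form consistent with the description above would be the following sketch. It assumes a base-10 logarithm (as in the caption of Fig. 2) and no offset for districts with zero cases; neither assumption is confirmed by the text, and the symbol \(I_{LB}\) is introduced here only for illustration. This is the quantity the text later refers to as Eq. (1).

```latex
\[
I_{LB} = \log_{10}\!\left( \frac{N_{LB}}{N} \times 100{,}000 \right)
\]
```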

Subsequently, we calculated for each (aggregated) district the change in LB incidence relative to the first year using the following equation:
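
The equation is not reproduced in this copy. One reading consistent with the caption of Fig. 1 (percentage change relative to 2010) is sketched below. The symbols \(I_{d,t}\) (incidence in district \(d\) and year \(t\)) and \(\Delta_{d,t}\) are introduced only for illustration, and the text does not state whether the change was computed on raw or log-transformed incidence.

```latex
\[
\Delta_{d,t} = \frac{I_{d,t} - I_{d,2010}}{I_{d,2010}} \times 100\%
\]
```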

## Covariate data collection and processing

We compiled data on potentially relevant environmental and socio-economic factors associated with the incidence of LB for each year and district. We identified relevant covariates based on the literature (for details see Supplementary Table S3 ). We collected information on forest cover, the percentage of forest area dominated by specific tree species, the surface area of green spaces (including parks, lawns, and residential areas) as well as the total population of wild mammal species in each district from the Forest Data Bank 59 and Local Data Bank 58 (Supplementary Table S3 ). Regarding the forest cover, we aggregated the reported percentages of different tree species into two groups: deciduous or coniferous. For each district, the sum of the deciduous and coniferous groups equals 100%. These values were then multiplied by the fraction of the district area covered by forests to derive the relative area covered by deciduous or coniferous trees on a district level. Since data on tree dominance were available for 2012 to 2019, we assumed that the values of 2012 were representative of 2010 and 2011. For covariates with data available only from 2016 to 2021, we extrapolated linearly to estimate values for the missing years.
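
The two processing steps described above (scaling a tree-group share by the forested fraction of the district, and linear extrapolation to years without data) can be sketched as follows. This is only an illustration: the function names are hypothetical, and the least-squares line is an assumption, since the text does not say how the linear extrapolation was fitted.

```python
def species_group_area_share(group_share_pct, forest_fraction):
    """Percent of district area under one tree group.

    group_share_pct: share of the forest area dominated by the group (0-100).
    forest_fraction: fraction of the district covered by forest (0-1).
    """
    return group_share_pct * forest_fraction


def linear_extrapolate(years, values, target_years):
    """Fit a least-squares line to (years, values) and evaluate it at target_years."""
    n = len(years)
    mean_y = sum(years) / n
    mean_v = sum(values) / n
    sxx = sum((y - mean_y) ** 2 for y in years)
    sxy = sum((y - mean_y) * (v - mean_v) for y, v in zip(years, values))
    slope = sxy / sxx
    intercept = mean_v - slope * mean_y
    return {t: intercept + slope * t for t in target_years}


# Hypothetical district: 40% of the forest is deciduous, 25% of the district is forest.
print(species_group_area_share(40.0, 0.25))

# Hypothetical covariate observed 2016-2021, extrapolated back to 2010-2015.
observed_years = [2016, 2017, 2018, 2019, 2020, 2021]
observed_values = [10.0, 12.0, 14.0, 16.0, 18.0, 20.0]
print(linear_extrapolate(observed_years, observed_values, range(2010, 2016)))
```

Extrapolated values can leave the plausible range (for example, become negative), so a real workflow would probably need to inspect or clip them.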

To obtain climatic covariate data, we used several sources (TerraClimate, MODIS, LANDSAT 7, and ERA5), accessed through Google Earth Engine with the rgee R package 60 , 61 , 62 , 63 . We obtained year-specific data for climate variables from 2010 to 2019 (Supplementary Table S3 ). Specifically, we extracted the daily mean temperatures and calculated the growing degree days (GDD) based on the daily mean temperature values exceeding 5 °C. We also extracted the yearly mean 8-day gross primary productivity (GPP) values (gC/m²) for each district. Furthermore, we gathered data on the yearly average monthly precipitation (mm/month) and the yearly number of days with snow cover (Supplementary Table S3 ). Since the climatic covariate data were in raster format, we computed average values per district for each year using district polygons from Humanitarian Data Exchange v1.72.0 PY3 64 .
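
As an illustration of the growing degree day calculation, the sketch below accumulates the daily excess of mean temperature over the 5 °C threshold. This accumulated-excess definition is an assumption: the text only says the index was based on daily means exceeding 5 °C, so the authors may have used a different variant. The function name and the temperature values are hypothetical.

```python
def growing_degree_days(daily_mean_temps_c, base_c=5.0):
    """Sum, over all days, of the amount by which the daily mean exceeds base_c."""
    return sum(max(0.0, t - base_c) for t in daily_mean_temps_c)


# Five hypothetical daily means (degrees Celsius); only 7.5 and 10.0 exceed the base.
temps = [2.0, 5.0, 7.5, 10.0, 4.0]
print(growing_degree_days(temps))  # 7.5
```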

Lastly, we compiled socio-economic covariate data from the Local Data Bank 58 , including the number of nurses and medical doctors per 100,000 inhabitants, as well as the total population count for each district (Supplementary Table S3 ). To calculate population density (n/km²) for each year, we divided the number of inhabitants by the surface area of the district polygon. The polygon area was calculated using the st_area() function included in the sf R package 65 .

## Model fitting

Before fitting models, we log10-transformed several variables because of their skewed distribution (see Supplementary Table S3 ; Supplementary Fig. S3 ; S4 ). Additionally, we assessed potential multicollinearity by calculating the variance inflation factor (VIF) for each covariate. We removed variables with a VIF greater than 3 to mitigate multicollinearity concerns. As a result, we excluded the maximum monthly temperature from the model selection procedure. Next, we assessed whether the remaining covariates were able to capture the spatial and temporal autocorrelation in the LB incidence data. To that end, we specified a naive global regression model relating LB incidence (transformed via Eq. (1)) to all remaining covariates (see Supplementary Table S3 ), ignoring any potential autocorrelation structure, and applied the Moran’s I statistic 66 to the residuals of this model for each year in the dataset. The Moran’s I test showed strong residual spatial autocorrelation for the individual years in our data set ( p < 0.05), with values ranging from 0.15 to 0.21.
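
For readers unfamiliar with the statistic, the following minimal sketch computes global Moran's I for a vector of residuals and a spatial weights matrix. It covers only the statistic itself, not the significance test, and the four-unit layout and residual values are hypothetical.

```python
def morans_i(values, weights):
    """Global Moran's I: (n / sum of weights) * (weighted cross-products / sum of squares)."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    total_weight = sum(sum(row) for row in weights)
    cross = sum(weights[i][j] * dev[i] * dev[j]
                for i in range(n) for j in range(n))
    denom = sum(d * d for d in dev)
    return (n / total_weight) * (cross / denom)


# Four hypothetical units arranged in a line: 0-1-2-3 (binary contiguity weights).
w = [[0, 1, 0, 0],
     [1, 0, 1, 0],
     [0, 1, 0, 1],
     [0, 0, 1, 0]]

print(morans_i([1.0, 2.0, 3.0, 4.0], w))    # smooth gradient: positive autocorrelation
print(morans_i([1.0, -1.0, 1.0, -1.0], w))  # alternating pattern: negative autocorrelation
```

Values near zero suggest no spatial pattern in the residuals; the 0.15 to 0.21 range reported above indicates neighbouring districts had similar residuals.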

To account for the spatiotemporal autocorrelation, we continued our analysis with a conditional autoregressive (CAR) Bayesian modeling framework relying on Markov chain Monte Carlo (MCMC) simulations 30 . Because the data compiled on LB incidence and potential covariates are partitioned into a set of areal units (districts) with multiple consecutive annual observations (from 2010 to 2019), we selected a Bayesian hierarchical model with first-order autoregressive processes 29 . This model includes random effects to account for any residual spatiotemporal autocorrelation remaining in the data after considering the effects of the main covariates. We fitted the model using the CARBayesST R package 67 . We specifically used the ST.CARar() function, which implements the model suggested by Rushworth et al. 29 . The spatial association between the districts was described using functions from the spdep R package 68 , which generated a neighborhood matrix that indicates whether a pair of district polygons share a border, relying on the district polygons from Humanitarian Data Exchange v1.72.0 PY3 64 .
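
The neighbourhood matrix described above is a binary, symmetric matrix with a 1 wherever two districts share a border. The sketch below builds such a matrix from a list of bordering pairs. It skips the polygon geometry entirely (the study derived neighbours from district polygons with spdep), and the function name and the four-district example are hypothetical.

```python
def adjacency_matrix(n_units, shared_borders):
    """Binary, symmetric neighbourhood matrix from pairs of bordering unit indices."""
    w = [[0] * n_units for _ in range(n_units)]
    for a, b in shared_borders:
        if a != b:  # a unit is not its own neighbour
            w[a][b] = 1
            w[b][a] = 1
    return w


# Hypothetical layout: district 0 borders 1, 1 borders 2, 2 borders 3.
borders = [(0, 1), (1, 2), (2, 3)]
for row in adjacency_matrix(4, borders):
    print(row)
```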

Following an all-subsets modeling approach, we fitted models with all possible combinations of covariates and selected the model with the lowest Deviance Information Criterion (DIC) as the best-supported model. The DIC is tailored to Bayesian model selection, where the posterior distributions have been generated by MCMC iterations 69 . Similar to the widely used Akaike’s information criterion (AIC), the DIC selects a model based on both the goodness of fit and the effective number of parameters 69 . In the model selection process, we excluded candidate models that included both the percentage of district area with forest cover and the forest cover dominated by either deciduous or coniferous tree species as covariates. We did this because the percentage of district area with forest cover already incorporates the combined effect of forest cover dominated by deciduous and coniferous tree species.
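
The all-subsets enumeration with the exclusion rule described above can be sketched as follows. The covariate names are hypothetical placeholders, the list is shorter than the one used in the study, and `dic_of` stands in for whatever routine fits a model and returns its DIC; none of these come from the paper.

```python
from itertools import combinations


def candidate_models(covariates, total, components):
    """All covariate subsets, skipping those that mix `total` with any of its `components`."""
    component_set = set(components)
    models = []
    for k in range(len(covariates) + 1):
        for subset in combinations(covariates, k):
            chosen = set(subset)
            if total in chosen and chosen & component_set:
                continue  # total forest cover already contains the component covers
            models.append(subset)
    return models


def best_by_dic(models, dic_of):
    """Return the candidate with the lowest DIC; dic_of maps a subset to its DIC."""
    return min(models, key=dic_of)


covs = ["forest_cover", "deciduous_cover", "coniferous_cover", "min_temp", "precip"]
models = candidate_models(covs, "forest_cover", ["deciduous_cover", "coniferous_cover"])
print(len(models))  # 20 of the 32 possible subsets remain (the empty subset included)
```

Because the number of subsets doubles with every added covariate, each candidate fit has to be comparatively cheap, which fits with using a shorter MCMC run during selection than for the final model.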

For the model selection procedure, we ran the ST.CARar() function with 220,000 MCMC samples. Of these samples, 20,000 were discarded as burn-in, and the remaining samples were thinned by a factor of 10 to remove most of the autocorrelation. After iterating through all possible combinations of covariates, we refitted the best model according to the obtained DIC values. The global model containing all factors was fitted using three separate MCMC chains to quantify the between-to-within chain variation in the MCMC samples using the Gelman–Rubin diagnostic 70 and to detect whether longer MCMC chains would potentially achieve a scale reduction. The final model was run with 1,100,000 MCMC samples, with a burn-in period of 1,000,000 and a thinning of 100, resulting in 1000 samples for model inference. The MCMC sample size used in the model selection procedure was smaller than that of the final fit (by a factor of 5) for computational reasons. The MCMC sample size used for model selection (sample = 220,000; burn-in = 20,000; thinning = 10) was assessed by computing the Gelman–Rubin diagnostic 70 for 3 individual chains for the global model (containing all variables). This resulted in a Gelman–Rubin diagnostic of 1.01, indicating that the selected MCMC sample size was sufficient (< 1.1). The convergence diagnostics for MCMC runs are presented as trace plots in Supplementary Fig. S4 . The MCMC sample size used to refit the final model (sample = 1,100,000; burn-in = 1,000,000; thinning = 100) also proved sufficiently large according to the Gelman–Rubin diagnostic (< 1.1).
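
To make the sample accounting and the convergence check concrete, the sketch below computes the number of retained draws after burn-in and thinning, and a basic Gelman–Rubin potential scale reduction factor for one parameter. It uses the classic between/within-chain formula, which may differ in detail from the implementation the authors used, and the chains are simulated rather than taken from the study.

```python
import math
import random


def retained_samples(n_sample, burn_in, thin):
    """Number of posterior draws left after dropping burn-in and keeping every thin-th draw."""
    return (n_sample - burn_in) // thin


def gelman_rubin(chains):
    """Potential scale reduction factor for equal-length chains of a single parameter."""
    m = len(chains)
    n = len(chains[0])
    means = [sum(c) / n for c in chains]
    grand_mean = sum(means) / m
    # B: between-chain variance; W: mean of the within-chain sample variances.
    b = n / (m - 1) * sum((mu - grand_mean) ** 2 for mu in means)
    w = sum(sum((x - mu) ** 2 for x in c) / (n - 1)
            for c, mu in zip(chains, means)) / m
    var_hat = (n - 1) / n * w + b / n
    return math.sqrt(var_hat / w)


print(retained_samples(220_000, 20_000, 10))        # 20000 (model selection runs)
print(retained_samples(1_100_000, 1_000_000, 100))  # 1000 (matches the stated final sample)

# Three simulated chains drawn from the same distribution should pass the < 1.1 rule.
rng = random.Random(1)
chains = [[rng.gauss(0.0, 1.0) for _ in range(2000)] for _ in range(3)]
print(gelman_rubin(chains) < 1.1)
```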

Using the symmetric mean absolute percentage error (sMAPE), we evaluated the predictive accuracy of our model by comparing the yearly predicted time trend to the observed time trend for each district individually. The entire analysis was performed in R version 4.3.2, using the ‘base’, ‘sf’, ‘gstat’, ‘maptools’, ‘ggplot2’, ‘fst’, ‘spdep’, and ‘CARBayesST’ packages 67 , 68 .
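
Several definitions of sMAPE are in use, and the text does not say which one was applied. The sketch below uses one common variant, in which each absolute error is divided by the mean of the absolute observed and predicted values; the function name and the example numbers are hypothetical.

```python
def smape(observed, predicted):
    """Symmetric MAPE in percent: mean over t of |F_t - A_t| / ((|A_t| + |F_t|) / 2)."""
    terms = []
    for a, f in zip(observed, predicted):
        denom = (abs(a) + abs(f)) / 2.0
        # Treat a pair of exact zeros as a perfect prediction instead of dividing by zero.
        terms.append(0.0 if denom == 0 else abs(f - a) / denom)
    return 100.0 * sum(terms) / len(terms)


# Hypothetical yearly observed and predicted values for one district.
observed = [100.0, 200.0]
predicted = [110.0, 180.0]
print(smape(observed, predicted))
```

Under this variant the score is bounded between 0% and 200%, so it is not dominated by years with very low observed incidence.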

## Data availability

All data compiled in this study has been published in the manuscript or Supplementary Information files.

## References

Marques, A., Strle, F. & Wormser, G. P. Comparison of Lyme disease in the United States and Europe. Emerg. Infect. Dis. 27 , 2017–2024 (2021).

Kugeler, K. J., Schwartz, A. M., Delorey, M. J., Mead, P. S. & Hinckley, A. F. Estimating the frequency of Lyme disease diagnoses, United States, 2010–2018. Emerg. Infect. Dis. 27 , 616–619 (2021).

Surveillance Data | Lyme Disease | CDC. n.d. https://www.cdc.gov/lyme/datasurveillance/surveillance-data.html .

Vandekerckhove, O., De Buck, E. & Van Wijngaerden, E. Lyme disease in Western Europe: an emerging problem? A systematic review. Acta Clin. Belg. 76 , 244–252 (2019).

Burn, L. et al. Incidence of Lyme borreliosis in Europe: A systematic review (2005–2020). Vector Borne Zoonotic Dis. 23 , 172–194 (2023).

Burn, L. et al. Seroprevalence of Lyme Borreliosis in Europe: Results from a Systematic Literature Review (2005–2020). Vector Borne Zoonotic Dis. 23 , 195–220 (2023).

Lohr, B. et al. Epidemiology and cost of hospital care for Lyme borreliosis in Germany: Lessons from a health care utilization database analysis. Ticks Tick Borne Dis. 6 , 56–62 (2015).

Van Den Wijngaard, C. C. et al. The cost of Lyme borreliosis. Eur. J. Public Health 27 , 538–547 (2017).

Mac, S., Da Silva, S. R. & Sander, B. The economic burden of Lyme disease and the cost-effectiveness of Lyme disease interventions: A scoping review. PLoS One 14 , e0210280 (2019).

Adrion, E., Aucott, J. N., Lemke, K. & Weiner, J. P. Health care costs, utilization and patterns of care following lyme disease. PLoS One 10 , e0116767 (2015).

Steere, A. C. et al. Lyme borreliosis. Nat. Rev. Dis. Primers https://doi.org/10.1038/nrdp.2016.90 (2016).

Kahl, O. & Gray, J. The biology of Ixodes ricinus with emphasis on its ecology. Ticks Tick Borne Dis. 14 , 102114 (2023).

Wolcott, K., Margos, G., Fingerle, V. & Becker, N. S. Host association of Borrelia burgdorferi sensu lato: A review. Ticks Tick Borne Dis. 12 , 101766 (2021).

Phelan, J. et al. Genome-wide screen identifies novel genes required for Borrelia burgdorferi survival in its Ixodes tick vector. PLoS Pathog. 15 , e1007644 (2019).

Caimano, M. J., Drecktrah, D., Kung, F. & Samuels, D. S. Interaction of the Lyme disease spirochete with its tick vector. Cell Microbiol. 18 , 919–927 (2016).

Estrada-Peña, A., De La Fuente, J., Ostfeld, R. S. & Cabezas-Cruz, A. Interactions between tick and transmitted pathogens evolved to minimise competition through nested and coherent networks. Sci. Rep. https://doi.org/10.1038/srep10361 (2015).

Estrada-Peña, A. et al. Nested coevolutionary networks shape the ecological relationships of ticks, hosts, and the Lyme disease bacteria of the Borrelia burgdorferi (s.l.) complex. Parasit. Vectors https://doi.org/10.1186/s13071-016-1803-z (2016).

Rizzoli, A. et al. Ixodes ricinus and Its Transmitted Pathogens in Urban and Peri-Urban Areas in Europe: New Hazards and relevance for public health. Front. Public Health https://doi.org/10.3389/fpubh.2014.00251 (2014).

Roome, A. et al. Tick magnets: The occupational risk of tick-borne disease exposure in forestry workers in New York. Health Sci. Rep. https://doi.org/10.1002/hsr2.509 (2022).

Donohoe, H., Pennington-Gray, L. & Omodior, O. Lyme disease: Current issues, implications, and recommendations for tourism management. Tour. Manag. 46 , 408–418 (2015).

Kilpatrick, A. M. et al. Lyme disease ecology in a changing world: consensus, uncertainty and critical gaps for improving control. Phil. Trans. R. Soc. Lond. B. Biol. Sci. 372 , 20160117 (2017).

Simon, J. A. et al. Climate change and habitat fragmentation drive the occurrence of Borrelia burgdorferi , the agent of Lyme disease, at the northeastern limit of its distribution. Evol. App. 7 , 750–764 (2014).

Levi, T., Keesing, F., Holt, R. D., Barfield, M. & Ostfeld, R. S. Quantifying dilution and amplification in a community of hosts for tick-borne pathogens. Ecol. App. 26 , 484–498 (2016).

Hansford, K. M., Wheeler, B. W., Tschirren, B. & Medlock, J. M. Questing Ixodes ricinus ticks and Borrelia spp. in urban green space across Europe: A review. Zoonoses Public Health 69 , 153–166 (2022).

Bisanzio, D., Del Pilar Fernández, M., Martello, E., Reithinger, R. & Diuk-Wasser, M. A. Current and future spatiotemporal patterns of Lyme disease reporting in the Northeastern United States. JAMA Netw. Open 3 , e200319 (2020).

Blanchard, L. et al. Comparison of national surveillance systems for Lyme disease in humans in Europe and North America: A policy review. BMC Public Health 22 , 1307 (2022).

Surveillance Atlas of Infectious Diseases. European Centre for Disease Prevention and Control, https://www.ecdc.europa.eu/en/surveillance-atlas-infectious-diseases . (2017).

Meldunki epidemiologiczne. Narodowy Instytut Zdrowia Publicznego. Państwowy Instytut Badawczy n.d. https://www.pzh.gov.pl/serwisy-tematyczne/meldunki-epidemiologiczne/ .

Rushworth, A., Lee, D. & Mitchell, R. A spatio-temporal model for estimating the long-term effects of air pollution on respiratory hospital admissions in Greater London. Spat. Spatiotemporal Epidemiol. 10 , 29–38 (2014).

Lee, D. A tutorial on spatio-temporal disease risk modelling in R using Markov chain Monte Carlo simulation and the CARBayesST package. Spat. Spatiotemporal Epidemiol. 34 , 100353 (2020).

Gardner, A. et al. Landscape features predict the current and forecast the future geographic spread of Lyme disease. Proc. R. Soc. B Biol. Sci. 287 , 20202278 (2020).

Randolph, S. E. Tick-borne disease systems. Rev. Sci. Tech. Off. Int. Epiz. 27 , 1–15 (2008).

Randolph, S. & Storey, K. M. Impact of microclimate on immature Tick-Rodent host interactions (Acari: Ixodidae): Implications for parasite transmission. J. Med. Entom. 36 , 741–748 (1999).

Tack, W., Madder, M., Baeten, L., De Frenne, P. & Verheyen, K. The abundance of Ixodes ricinus ticks depends on tree species composition and shrub cover. Parasitology 139 , 1273–1281 (2012).

Li, S., Gilbert, L., Harrison, P. A. & Rounsevell, M. Modelling the seasonality of Lyme disease risk and the potential impacts of a warming climate within the heterogeneous landscapes of Scotland. J. R. Soc. Interf. 13 , 20160140 (2016).

Garcia-Martí, I., Zurita-Milla, R. & Swart, A. Modelling tick bite risk by combining random forests and count data regression models. PLoS One 14 , e0216511 (2019).

Heylen, D. et al. Ticks and tick-borne diseases in the city: Role of landscape connectivity and green space characteristics in a metropolitan area. Sci. Total Environ. 670 , 941–949 (2019).

Article ADS CAS PubMed Google Scholar

Oechslin, C. P. et al. Prevalence of tick-borne pathogens in questing Ixodes ricinus ticks in urban and suburban areas of Switzerland. Parasit. Vectors https://doi.org/10.1186/s13071-017-2500-2 (2017).

Cunze, S., Glock, G., Kochmann, J. & Klimpel, S. Ticks on the move—climate change-induced range shifts of three tick species in Europe: current and future habitat suitability for Ixodes ricinus in comparison with Dermacentor reticulatus and Dermacentor marginatus . Parasitol. Res. 121 , 2241–2252 (2022).

Randolph, S., Green, R. M., Hoodless, A. N. & Peacey, M. An empirical quantitative framework for the seasonal population dynamics of the tick Ixodes ricinus . Int. J. Parasitol. 32 , 979–989 (2002).

DelGiudice, G. D., Riggs, M. R., Joly, P. & Pan, W. Winter severity, survival, and cause-specific mortality of female white-tailed deer in north-central Minnesota. J. Wildl. Manag. 66 , 698 (2002).

Nabbout, A. E., Ferguson, L. V., Miyashita, A. & Adamo, S. A. Female ticks ( Ixodes scapularis ) infected with Borrelia burgdorferi have increased overwintering survival, with implications for tick population growth. Insect Sci. 30 , 1798–1809 (2023).

Cumming, G. S. Comparing climate and vegetation as limiting factors for species ranges of African ticks. Ecology 83 , 255–268 (2002).

LoGiudice, K., Ostfeld, R. S., Schmidt, K. A. & Keesing, F. The ecology of infectious disease: effects of host diversity and community composition on Lyme disease risk. PNAS 100 , 567–571 (2003).

Article ADS CAS PubMed PubMed Central Google Scholar

Letnic, M. & Ripple, W. J. Large-scale responses of herbivore prey to canid predators and primary productivity. Glob. Ecol. Biogeogr. 26 , 860–866 (2017).

McNaughton, S. J., Oesterheld, M., Frank, D. A. & Williams, K. J. Ecosystem-level patterns of primary productivity and herbivory in terrestrial habitats. Nature 341 , 142–144 (1989).

Grigoryeva, L. A. Influence of air humidity on the survival rate, lifetime, and development of Ixodes ricinus (L., 1758) and Ixodes persulcatus Schulze, 1930 (Acari: Ixodidae). Syst. Appl. Acaro. 27 , 2241 (2022).

Ostfeld, R. S., Canham, C. D., Oggenfuss, K., Winchcombe, R. J. & Keesing, F. Climate, deer, rodents, and acorns as determinants of variation in Lyme-Disease risk. PLoS Biol. 4 , e145 (2006).

Ciupa, T. & Suligowski, R. Green-blue spaces and population density versus COVID-19 cases and deaths in Poland. Int. J. Environ. Res. Public Health 18 , 6636 (2021).

Dautel, H. & Kahl, O. Ticks (Acari: Ixodoidea) and their medical importance in the urban environment. In Proceedings of the Third International Conference on Urban Pests: 19-22 July 1999 (ed. Dautel, H.) (Czech Republic, 1999).

Pfäffle, M. P., Littwin, N., Muders, S. V. & Petney, T. N. The ecology of tick-borne diseases. Int. J. Parasitol. 43 , 1059–1077 (2013).

Brownstein, J. S., Skelly, D. K., Holford, T. R. & Fish, D. Forest fragmentation predicts local scale heterogeneity of Lyme disease risk. Oecologia 146 , 469–475 (2005).

Article ADS PubMed Google Scholar

Zając, Z. et al. Environmental determinants of the occurrence and activity of Ixodes ricinus ticks and the prevalence of tick-borne diseases in eastern Poland. Sci. Rep. https://doi.org/10.1038/s41598-021-95079-3 (2021).

Nowak-Chmura, M. 2013 Fauna kleszczy (Ixodida) Europy Środkowej. Kraków:WNUP; (2013).

Zając, Z. et al. Tick activity, host range, and tick-borne pathogen prevalence in mountain habitats of the Western Carpathians Poland. Pathogens 12 , 1186 (2023).

Strzelczyk, J. K. et al. Prevalence of Borrelia burgdorferi sensu lato in Ixodes ricinus ticks collected from southern Poland. Acta Parasitol. 60 , 666–674 (2015).

Zając, Z., Bartosik, K., Kulisz, J. & Woźniak, A. Incidence of tick-borne encephalitis during the COVID-19 pandemic in selected European countries. J. Clin. Med. 11 , 803 (2022).

Statistics Poland – Local Data Bank; GUS - Bank Danych Lokalnych. n.d. https://bdl.stat.gov.pl/bdl/start .

The Forest Data Bank; Bank Danych o Lasach. n.d. https://www.bdl.lasy.gov.pl/portal/zestawienia-en .

Abatzoglou, J. T., Dobrowski, S. Z., Parks, S. A. & Hegewisch, K. C. TerraClimate, a high-resolution global dataset of monthly climate and climatic water balance from 1958–2015. Sci. Data https://doi.org/10.1038/sdata.2017.191 (2018).

Hall, D. K. Riggs, G. A. MODIS/Terra Snow Cover Daily L3 Global 500m SIN Grid, Version 6. Boulder, Colorado USA. NASA National Snow and Ice Data Center Distributed Active Archive Center. (2016).

Copernicus Climate Data Store n.d. https://cds.climate.copernicus.eu/cdsapp#!/home .

Aybar, C., Wu, Q., Bautista, L., Yali, R. & Barja, A. rgee: An R package for interacting with google earth engine. J. Open Source Softw. 5 , 2272 (2020).

Article ADS Google Scholar

Search for a Dataset—Humanitarian Data Exchange, https://data.humdata.org/dataset , (2020).

Pebesma, E. Simple features for R: Standardized support for spatial vector data. R. J. 10 , 439 (2018).

Moran, P. A. Notes on continuous stochastic phenomena. Biometrika 37 , 17–23 (1950).

Article MathSciNet CAS PubMed Google Scholar

Lee, D., Rushworth, A. & Napier, G. Spatio-temporal areal unit modeling in R with conditional autoregressive priors using the CARBayesST package. J. Stat. Softw. 84 , 1–39 (2018).

Bivand, R. et al. Package ‘spdep’. Compr. R Arch. Netw. 604 , 605 (2015).

Spiegelhalter, D. J., Best, N. G., Carlin, B. P. & Van Der Linde, A. Bayesian measures of model complexity and fit.. SJ. R. Tat. Soc. Series B Stat. Methodol. 64 , 583–639 (2002).

Article MathSciNet Google Scholar

Gelman, A., Carlin, J. B., Stern, H. S. & Rubin, D. B. Bayesian data analysis (Chapman and Hall/CRC, 1995).

Book Google Scholar

Download references

## Author information

Authors and affiliations.

Chair and Department of Biology and Parasitology, Medical University of Lublin, Radziwiłłowska St. 11, 20-080, Lublin, Poland

Joanna Kulisz, Renata Kunc-Kozioł, Aneta Woźniak & Zbigniew Zając

Department of Environmental Science, Radboud Institute for Biological and Environmental Sciences, Radboud University, P.O. Box 9010, 6500 GL Nijmegen, The Netherlands

Selwyn Hoeks, Aafke M. Schipper & Mark A. J. Huijbregts

Anses, UMR BIPAR, Laboratoire de Santé Animale, INRAE, Ecole Nationale Vétérinaire d’Alfort, 94700, Maisons-Alfort, France

Alejandro Cabezas-Cruz

## Contributions

J.K.: conceptualization, methodology, data acquisition, writing (original draft), writing (review and editing); S.H.: conceptualization, methodology, data acquisition, data analysis, writing (original draft), writing (review and editing); R.K.-K.: data acquisition, writing (original draft), writing (review and editing); A.W.: data acquisition, writing (original draft), writing (review and editing); Z.Z.: conceptualization, methodology, data acquisition, writing (original draft), writing (review and editing); A.M.S.: conceptualization, methodology, writing (review and editing); A.C.-C.: writing (original draft), writing (review and editing); M.A.J.H.: conceptualization, methodology, writing (review and editing).

## Corresponding author

Correspondence to Joanna Kulisz.

## Ethics declarations

Competing interests.

The authors declare no competing interests.

## Additional information

Publisher's note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

## Rights and permissions

Open Access. This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

## About this article

Cite this article.

Kulisz, J., Hoeks, S., Kunc-Kozioł, R. et al. Spatiotemporal trends and covariates of Lyme borreliosis incidence in Poland, 2010–2019. Sci Rep 14, 10768 (2024). https://doi.org/10.1038/s41598-024-61349-z

Received: 12 January 2024

Accepted: 05 May 2024

Published: 10 May 2024

DOI: https://doi.org/10.1038/s41598-024-61349-z

## COMMENTS

Covariates are continuous independent variables (or predictors) in a regression or ANOVA model. These variables can explain some of the variability in the dependent variable. That definition of covariates is simple enough. However, the usage of the term has changed over time. Consequently, analysts can have drastically different contexts in ...

A covariate can be an independent variable (i.e. of direct interest) or it can be an unwanted, confounding variable. Adding a covariate to a model can increase the accuracy of your results. Meaning in ANCOVA: covariates are controlled for, and the independent variables are categorical variables.

The most precise definition is its use in Analysis of Covariance, a type of General Linear Model in which the independent variables of interest are categorical, but you also need to adjust for the effect of an observed, continuous variable: the covariate. In this context, the covariate is always continuous, never the key independent variable ...

Covariance in statistics measures the extent to which two variables vary linearly. The covariance formula reveals whether two variables move in the same or opposite directions. Covariance is like variance in that it measures variability. While variance focuses on the variability of a single variable around its mean, the covariance formula ...

The sign (+ or −) and size of the correlation coefficient between the dependent variable and covariate should be the same at each level of the qualitative variable. In other words, if we draw a regression line for the relationship between the dependent variable and covariate at each level of the qualitative variable, the slope of the regression lines should be the same at all levels ...
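One informal way to look at the equal-slopes condition described above is to fit a separate least-squares line at each level and compare the slopes. The sketch below does this in plain Python. The data, the level names, and the `slope` helper are illustrative assumptions, and a formal check would normally test a covariate-by-group interaction term instead.

```python
def slope(xs, ys):
    """Least-squares slope of ys regressed on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


# Hypothetical (covariate, response) data at each level of the qualitative variable.
data = {
    "technique_1": ([60, 70, 80, 90], [62, 71, 82, 91]),
    "technique_2": ([60, 70, 80, 90], [65, 75, 85, 95]),
    "technique_3": ([60, 70, 80, 90], [58, 69, 78, 89]),
}

# One slope per level; similar values are consistent with a common slope.
slopes = {level: slope(xs, ys) for level, (xs, ys) in data.items()}
for level, value in slopes.items():
    print(level, round(value, 3))
```

Here the three slopes come out close to one another, which is the pattern the assumption requires. Clearly different slopes would suggest that the covariate's effect depends on the group.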

Clearly, this approach can deepen understanding of biology. Second, we might include a covariate because, if the covariate accounts for a reasonable amount of variation in the dependent variable, we increase statistical power to examine effects of a factor that interests us; again, this provides clear benefits.

Adjusted covariates are $\tilde{x}_{ij} = x_{ij} - \bar{x}_{i\cdot}$. These are the residuals from a model fitting the covariate as response to the treatments. If we use adjusted covariates, then we get variance reduction, but we do not get covariance adjustment of the means. That is, the covariance-adjusted means for this adjusted covariate are just the $\bar{y}_{i\cdot}$s.
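The adjustment described above amounts to subtracting each treatment group's covariate mean from that group's covariate values. A minimal sketch follows, assuming the data are held in a dictionary keyed by treatment label. The function name and the numbers are hypothetical.

```python
def adjust_covariate(groups):
    """Subtract each treatment group's covariate mean from its values.

    `groups` maps a treatment label to the covariate values observed in it.
    The result holds the residuals from fitting the covariate on treatment.
    """
    adjusted = {}
    for label, values in groups.items():
        group_mean = sum(values) / len(values)
        adjusted[label] = [x - group_mean for x in values]
    return adjusted


# Hypothetical covariate values for two treatments.
covariate = {"A": [10.0, 12.0, 14.0], "B": [20.0, 21.0, 25.0]}
adjusted = adjust_covariate(covariate)
print(adjusted)
```

Every group's adjusted covariate has mean zero, so the groups no longer differ on the covariate. That is why the adjusted treatment means coincide with the raw treatment means of the response.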

Covariance in Excel: Steps. Step 1: Enter your data into two columns in Excel. For example, type your X values into column A and your Y values into column B. Step 2: Click the "Data" tab and then click "Data analysis." The Data Analysis window will open. Step 3: Choose "Covariance" and then click "OK."

Definition: A covariate is a variable that is not of main interest in an experiment but can affect the dependent variable and the relationship of the independent variable with the dependent variable. The covariate is not a planned variable but often arises in experiments due to underlying experimental conditions. Covariates should be identified ...

Covariates are usually used in ANOVA and DOE. In these models, a covariate is any continuous variable, which is usually not controlled during data collection. Including covariates in the model allows you to include and adjust for input variables that were measured but not randomized or controlled in the experiment.

A covariate is a variable that is related to both the independent variable(s) and the dependent variable in a research study or statistical analysis. The confounder is a specific type of covariate that is associated with both the exposure (independent variable) and the outcome (dependent variable) and can distort or falsely ...

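For example, the covariance between two random variables X and Y can be calculated using the following formula (for a population):

$$\operatorname{Cov}(X, Y) = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{n}$$

For a sample covariance, the formula is slightly adjusted:

$$\operatorname{Cov}(X, Y) = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{n - 1}$$

Where: $X_i$ - the values of the X-variable. $Y_i$ - the values of the Y-variable. $\bar{X}$ - the mean (average) of the X-variable. $\bar{Y}$ - the mean (average) of ...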
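A minimal sketch of the population and sample covariance calculations described above, in plain Python. The function name and the study-hours data are illustrative assumptions. The only difference between the two versions is the divisor: n for a population, n − 1 for a sample.

```python
def covariance(xs, ys, sample=False):
    """Return the covariance of two equal-length sequences.

    sample=False divides by n (population); sample=True divides by n - 1.
    """
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two observations")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of paired deviations from the means.
    cross = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return cross / (n - 1 if sample else n)


# Hypothetical data: hours studied and exam score for five students.
hours = [1, 2, 3, 4, 5]
scores = [52, 58, 61, 70, 74]

pop_cov = covariance(hours, scores)
samp_cov = covariance(hours, scores, sample=True)
print(pop_cov, samp_cov)
```

Both values are positive here, which matches the reading that larger values of one variable tend to go with larger values of the other.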

ANCOVA, or the analysis of covariance, is a powerful statistical method that analyzes the differences between three or more group means while controlling for the effects of at least one continuous covariate. ANCOVA is a potent tool because it adjusts for the effects of covariates in the model. By isolating the effect of the categorical ...
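To make the idea of adjusting for a covariate concrete, the sketch below computes covariate-adjusted group means for a one-way design with a single continuous covariate. It assumes a common within-group slope and uses the textbook adjustment: group mean of the response minus the pooled slope times the gap between the group's covariate mean and the grand covariate mean. The function name and data are hypothetical, and the sketch covers only the adjustment step, not the F-test.

```python
def ancova_adjusted_means(groups):
    """Adjusted group means for a one-way ANCOVA with one covariate.

    `groups` maps a label to a list of (covariate, response) pairs.
    Assumes the same within-group slope in every group.
    """
    all_x = [x for pairs in groups.values() for x, _ in pairs]
    grand_mean_x = sum(all_x) / len(all_x)

    sxy = 0.0
    sxx = 0.0
    means = {}
    for label, pairs in groups.items():
        mean_x = sum(x for x, _ in pairs) / len(pairs)
        mean_y = sum(y for _, y in pairs) / len(pairs)
        means[label] = (mean_x, mean_y)
        # Pool the within-group sums of squares and cross-products.
        sxy += sum((x - mean_x) * (y - mean_y) for x, y in pairs)
        sxx += sum((x - mean_x) ** 2 for x, _ in pairs)
    pooled_slope = sxy / sxx

    adjusted = {
        label: mean_y - pooled_slope * (mean_x - grand_mean_x)
        for label, (mean_x, mean_y) in means.items()
    }
    return pooled_slope, adjusted


# Hypothetical (current grade, exam score) pairs for three studying techniques.
study = {
    "technique_1": [(60, 65), (70, 75), (80, 85)],
    "technique_2": [(70, 72), (80, 82), (90, 92)],
    "technique_3": [(80, 80), (90, 90), (100, 100)],
}
pooled_slope, adjusted_means = ancova_adjusted_means(study)
print(pooled_slope, adjusted_means)
```

In this made-up data the raw exam means rank technique_3 highest, but that group also started with the highest current grades. After adjustment the ordering reverses, which shows how ignoring a covariate can mask or distort a group effect.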

This page titled 9.1: Role of the Covariate is shared under a CC BY-NC 4.0 license and was authored, remixed, and/or curated by Penn State's Department of Statistics via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request. The role of the covariate in the ANCOVA.

This video describes the characteristics of independent and dependent variables as well as covariates.

Covariance in probability theory and statistics is a measure of the joint variability of two random variables. The sign of the covariance, therefore, shows the tendency in the linear relationship between the variables. If greater values of one variable mainly correspond with greater values of the other variable, and the same holds ...

The terms independent variables, covariates, confounding variables, and confounding by indication are often imprecisely used in the context of regression. Independent variables are the full set of variables whose influence on the outcome is studied. Covariates are the independent variables that are included not because they are of interest but because their influence on the outcome can be ...

Example (salt tolerance experiment) Independent variables (aka treatment variables) Variables you manipulate in order to affect the outcome of an experiment. The amount of salt added to each plant's water. Dependent variables (aka response variables) Variables that represent the outcome of the experiment.

Quick Reference. (covariable) n. (in statistics) a continuous variable that is not part of the main experimental manipulation but has an effect on the dependent variable. The inclusion of covariates increases the power of the statistical test and removes the bias of confounding variables (which have effects on the dependent variable that are ...

Analysis of covariance (ANCOVA) is a general linear model that blends ANOVA and regression. ANCOVA evaluates whether the means of a dependent variable (DV) are equal across levels of one or more categorical independent variables (IV) and across one or more continuous variables. For example, the categorical variable(s ...

To control for covariates (typically continuous or variables on a particular scale) that aren't the main focus of your study. To study combinations of categorical and continuous variables, or variables on a scale as predictors. In this case, the covariate is a variable of interest (as opposed to one you want to control for).

This chapter addresses strategies for selecting variables for adjustment in nonexperimental comparative effectiveness research (CER), and uses causal graphs to illustrate the causal network relating treatment to outcome. While selection approaches should be based on an understanding of the causal network representing the common cause pathways between treatment and outcome, the true causal ...

In observational or nonrandomized research, it is common and often wise to control for certain variables in statistical models. Such variables are often referred to as covariates. Covariates may be controlled through multiple means, such as inclusion on the 'right-hand side' or 'predictor side' of a statistical model, matching, propensity score analysis, and other methods (Cochran and ...

The critical importance of justifying the inclusion of covariates is a facet often overlooked in data analysis. While the incorporation of covariates typically follows informal guidelines, we argue for a comprehensive exploration of underlying principles to avoid significant statistical and interpretational challenges. Our focus is on addressing three common yet problematic practices: the ...

In this paper, we consider quantile regression estimation for linear models with covariate measurement errors and nonignorable missing responses. Firstly, the influence of measurement errors is eliminated through the bias-corrected quantile loss function. To handle the identifiability issue in the nonignorable missing, a nonresponse instrument is used. Then, based on the inverse probability ...

Follow-up research is needed to get a better understanding of the influence of these factors on LB incidence. ... for each covariate. We removed variables with a VIF greater than 3 to mitigate ...
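The excerpt above mentions dropping covariates with a variance inflation factor (VIF) above 3 but does not say how the VIFs were computed. As a hedged illustration, the sketch below uses the standard identity that the VIFs are the diagonal entries of the inverse of the predictors' correlation matrix. All function names and data are hypothetical, and the plain-Python matrix inversion is only meant for small examples.

```python
def correlation_matrix(columns):
    """Pearson correlation matrix for a list of equal-length columns."""
    n = len(columns[0])
    centered = []
    for col in columns:
        mean = sum(col) / n
        centered.append([v - mean for v in col])
    norms = [sum(v * v for v in col) ** 0.5 for col in centered]
    k = len(columns)
    return [
        [
            sum(a * b for a, b in zip(centered[i], centered[j]))
            / (norms[i] * norms[j])
            for j in range(k)
        ]
        for i in range(k)
    ]


def invert(matrix):
    """Invert a square matrix by Gauss-Jordan elimination with partial pivoting."""
    k = len(matrix)
    aug = [
        list(row) + [1.0 if i == j else 0.0 for j in range(k)]
        for i, row in enumerate(matrix)
    ]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular (perfect collinearity)")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(k):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[k:] for row in aug]


def vifs(columns):
    """VIF of each column: the diagonal of the inverse correlation matrix."""
    inverse = invert(correlation_matrix(columns))
    return [inverse[i][i] for i in range(len(columns))]


# Hypothetical covariates: x1 and x2 are correlated, x3 is unrelated to both.
x1 = [1.0, 2.0, 3.0, 4.0]
x2 = [1.0, 3.0, 2.0, 4.0]
x3 = [1.0, -1.0, -1.0, 1.0]

values = vifs([x1, x2, x3])
# Keep covariates whose VIF does not exceed the threshold of 3 quoted above.
keep = [i for i, v in enumerate(values) if v <= 3]
print(values, keep)
```

With these numbers the two correlated covariates have a VIF just under 3 and the unrelated one has a VIF of about 1, so all three would be retained under that threshold.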