*After completing this chapter, the reader will be able to*

Define the population being studied and describe the method most appropriate to sample a given population.

Identify and describe the dependent and independent variables and indicate whether any covariates were included in analysis.

Identify and define the four scales of variable measurement.

Describe the difference between descriptive and inferential statistics.

Describe the mean, median, variance, and standard deviation and why they are important to statistical analysis.

Describe the properties of the normal distribution and when an alternative distribution should be, or should have been, used.

Describe several common epidemiological statistics.

Identify and describe the difference between parametric and nonparametric statistical tests and when their use is most appropriate.

Determine whether the appropriate statistical test has been performed when evaluating a study.

There are four scales of variable measurement consisting of nominal, ordinal, interval, and ratio scales that are critically important to consider when determining the appropriateness of a statistical test.

Measures of central tendency are useful to quantify the distribution of a variable’s data numerically. The most common measures of central tendency are the mean, median, and mode, with the most appropriate measure of central tendency dictated by the variable’s scale of measurement.

Variance is a key element inherent in all statistical analyses, but standard deviation is presented more often. Variance and standard deviation are related mathematically.

The key benefit to using the standard normal distribution is that converting the original data to z-scores allows researchers to compare different variables regardless of the original scale.

The last observation carried forward (LOCF) technique, often used with data from clinical trials, can introduce significant bias into the results of statistical tests.

The central limit theorem states that when equally sized samples are drawn from a non-normal distribution, the plotted mean values from each sample will approximate a normal distribution as long as the non-normality was not due to outliers.

There are numerous misconceptions about *p* values, and it is important to know how to interpret them correctly. Clinical significance is far more important than statistical significance. Clinical significance can be quantified by using various measures of effect size.

The selection of the appropriate statistical test is based on several factors including the specific research question, the measurement scale of the dependent variable (DV), distributional assumptions, the number of DV measurements as well as the number and measurement scale of independent variables (IVs) and covariates, among others.

Knowledge of statistics and statistical analyses is essential to constructively evaluate literature in the biomedical sciences. This chapter provides a general overview of both descriptive and inferential statistics that will enhance the ability of the student or evidence-based practitioner to interpret results of empirical literature within the biomedical sciences by evaluating the appropriateness of statistical tests employed, the conclusions drawn by the authors, and the overall quality of the study.

Alongside Chapters 4 and 5, diligent study of the material presented in this chapter is an important first step to critically analyze the often avoided methods or results sections of published biomedical literature. Be aware, however, that this chapter cannot substitute for more formal didactic training in statistics, as the material presented here is not exhaustive with regard to either statistical concepts or available statistical tests. Thus, when reading a journal article, if doubt emerges about whether a method or statistical test was used and interpreted appropriately, do not hesitate to consult appropriate references or an individual who has more formal statistical training. This is especially true if the empirical evidence is being considered for implementation in practice. Asking questions is the key to obtaining knowledge!

For didactic purposes, this chapter can be divided into two sections. The first section presents a general overview of the processes underlying most statistical tests used in the biomedical sciences. It is recommended that all readers take the time required to thoroughly study these concepts. The second section, beginning with the Statistical Tests section, presents descriptions, assumptions, examples, and results of numerous statistical tests commonly used in the biomedical sciences. This section does not present the mathematical underpinnings, calculation, or programming of any specific statistical test. It is recommended that this section serve as a reference to be used concurrently alongside a given journal article to determine the appropriateness of a statistical test or to gain further insight into why a specific statistical test was employed.

When investigating a particular research question or hypothesis, researchers must first define the population to be studied. A population refers to any set of objects in the universe, while a sample is a fraction of the population chosen to be representative of the specific population of interest. Thus, samples are chosen to make specific generalizations about the population of interest. Researchers typically do not attempt to study an entire population because data cannot be collected for every member. Instead, a sample is studied, and for the sample to be representative it should ideally be chosen at random. That is, each member of the population must have an equal probability of being included in the sample.

For example, consider a study to evaluate the effect a calcium channel blocker (CCB) has on blood glucose levels in Type 1 diabetes mellitus (DM) patients. In this case, all Type 1 DM patients would constitute the study population; however, because data could never be collected from all Type 1 DM patients, a sample that is representative of the Type 1 DM population would be selected. There are numerous sampling strategies, many beyond the scope of this text. Although only a few are discussed here, interested readers are urged to consult the list of suggested readings at the end of this chapter for further information.

A random sample does not imply that the sample is drawn haphazardly or in an unplanned fashion; rather, there are several systematic approaches to selecting a random sample. The most common method employs a random number table. A random number table is, in theory, an endless sequence of integers arranged without any trends or patterns. For example, consider the hypothetical process of selecting a random sample of Type 1 DM patients from the population. First, each patient in the population is assigned a number, say 1 to *N*, where *N* is the total number of Type 1 DM patients in the population. To draw a sample of 200 patients, the first 200 unique numbers taken from the table identify the patients selected. There are numerous free random number tables and generators available online; simply search for random number table or random number generator in any search engine.
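In practice, software usually takes the place of a printed random number table. As a minimal sketch (the population size *N* and the seed are hypothetical), Python's standard `random` module can draw such a sample directly:

```python
import random

# Hypothetical population: N = 10,000 Type 1 DM patients numbered 1..N.
N = 10_000
population = range(1, N + 1)

random.seed(42)  # fixed seed so the example is reproducible
sample = random.sample(population, k=200)  # 200 patients, drawn without replacement

# Every selected patient is unique and comes from the population.
print(len(sample), len(set(sample)))
```

Because `random.sample` draws without replacement, no patient can appear in the sample twice, mirroring the mutually exclusive selection a random number table provides.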

Depending on the study design, a random sample may not be most appropriate when selecting a representative sample. On occasion, it may be necessary to separate the population into mutually exclusive groups called strata, where a specific factor (e.g., race, gender) will create separate strata to aid in analysis. In this case, the random sample is drawn within each stratum individually. This process is termed stratified random sampling. For example, consider a situation where the race of the patient was an important variable in the Type 1 DM study. To ensure the proportion of each race in the population is represented accurately, the researcher stratifies by race and randomly selects patients within each stratum to achieve a representative study sample.
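Stratified random sampling can be sketched in a few lines; the strata labels, population size, and 10% sampling fraction below are all hypothetical:

```python
import random
from collections import defaultdict

random.seed(1)

# Hypothetical population: (patient_id, race) records.
patients = [(i, random.choice(["Caucasian", "African American", "Hispanic"]))
            for i in range(1, 1001)]

# Partition the population into mutually exclusive strata by race.
strata = defaultdict(list)
for patient_id, race in patients:
    strata[race].append(patient_id)

# Draw a 10% random sample within each stratum so every race is
# represented in proportion to its share of the population.
sample = []
for race, members in strata.items():
    sample.extend(random.sample(members, k=round(0.10 * len(members))))

print(len(sample))
```

Sampling the same fraction within each stratum is what preserves the population's racial proportions in the final sample.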

Another method of randomly sampling a population is known as cluster sampling. Cluster sampling is appropriate when there are natural groupings within the population of interest. For example, consider a researcher interested in the patient counseling practices of pharmacists across the United States. It would be impossible to collect data from all pharmacists across the United States. However, the researcher has read literature suggesting regional differences within various pharmacy practices, not necessarily including counseling practices. Thus, he or she may decide to randomly sample within the four regions of the United States (U.S.) defined by the U.S. Census Bureau (i.e., West, Midwest, South, and Northeast) to assess for differences in patient counseling practices across regions.^{1}

Another sampling method is known as systematic sampling. This method is used when information about the population is provided in list format, such as in the telephone book, election records, class lists, or licensure records, among others. Systematic sampling uses an equal-probability method where one individual is selected initially at random and every *n*th individual is then selected thereafter. For example, the researchers may decide to take every 10th individual listed after the first individual is chosen.
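Systematic sampling reduces to list slicing once the list is in hand; the roster below is hypothetical:

```python
import random

random.seed(7)

# Hypothetical licensure list of 500 individuals, provided in list format.
roster = list(range(1, 501))

step = 10                       # take every 10th individual ...
start = random.randrange(step)  # ... after a random starting point
sample = roster[start::step]

print(len(sample))  # 50 of the 500 individuals are selected
```

Whatever the random starting point, exactly one individual is taken from each block of ten, so the sample spreads evenly across the list.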

Finally, researchers often use convenience sampling, selecting participants based on the convenience of the researcher. That is, no attempt is made to ensure the sample is representative of the population; however, within the convenience sample, participants may be selected randomly. This type of sampling is often used in educational research. For example, a researcher evaluating a new classroom instructional method to increase exam scores will typically use the convenience sample of students in his or her own class or university. Obvious weaknesses arise when using this type of sampling, most notably limited generalizability.

A variable is the characteristic that is being observed or measured. Data are the measured values assigned to the variable for each individual member of the population. For example, a variable would be the participant's biological sex, while the data are whether each participant is male or female.

In statistics, there are three types of variables: dependent (DV), independent (IV), and confounding. The DV is the response or outcome variable for a study, while an IV is a variable that is manipulated. A confounding variable (often referred to as a covariate) is any variable that has an effect on the DV over and above the effect of the IV, but is not of specific research interest. Putting these definitions together, consider a study to evaluate the effect a new oral hypoglycemic medication has on glycosylated hemoglobin (HbA1c) compared to placebo. Here, the DV consists of the HbA1c data for each participant, and the IV is treatment group with two levels (i.e., treatment versus placebo). Initially, results may suggest the medication is very effective across the entire sample; however, previous literature has suggested participant race may affect the effectiveness of this type of medication. Thus, participant race is a confounding variable and would need to be included in the statistical analysis. After statistically controlling for participant race, the results may indicate the medication was significantly more effective in the treatment group.

❶ *There are four scales of variable measurement consisting of nominal, ordinal, interval, and ratio scales that are critically important to consider when determining the appropriateness of a statistical test*.^{2} Think of these four scales as relatively fluid; that is, as the data progress from nominal to ratio, the information about each variable being measured is increased. The scale of measurement of DVs, IVs, and confounding variables is an important consideration when determining whether the appropriate statistical test was used to answer the research question and hypothesis.

A **nominal scale** consists of categories that have no implied rank or order. Examples of nominal variables include gender (e.g., male versus female), race (e.g., Caucasian versus African American versus Hispanic), or disease state (e.g., absence versus presence). It is important to note that with nominal data, the participant is categorized into one, and only one, category. That is, the categories are mutually exclusive.

An **ordinal scale** has all of the characteristics of a nominal variable with the addition of rank ordering. It is important to note that the distance between rank-ordered categories cannot be considered equal; that is, the data points can be ranked but the distances between them may differ greatly. For example, a commonly used pain scale in medicine is the Wong-Baker Faces Pain Rating Scale.^{3} Here, the participant rates pain on a 0 to 10 scale; however, while it is known that a rating of eight indicates more pain than a rating of four, there is no indication that a rating of eight hurts twice as much as a rating of four.

An **interval scale** has all of the characteristics of an ordinal scale with the addition that the distance between any two adjacent values is constant and meaningful. However, it is important to note that interval scales do not have an absolute zero point. For example, temperature on the Celsius scale is measured on an interval scale (i.e., the difference between 10°C and 5°C is equivalent to the difference between 20°C and 15°C). However, 20°C cannot be described as twice as hot as 10°C because there is no absolute zero (i.e., the selection of 0°C was arbitrary).

Finally, a **ratio scale** has all of the characteristics of an interval scale, but ratio scales have an absolute zero point. The classic example of a ratio scale is temperature measured on the Kelvin scale, where zero Kelvin represents the absence of molecular motion. In theory, researchers should take care not to confuse absolute and arbitrary zero points. In practice, however, the difference between interval and ratio scales is generally trivial, as these data are analyzed by identical statistical procedures.

**Continuous variables** generally consist of data measured on interval or ratio scales. However, if the number of ordinal categories is large (e.g., seven or more), they may be considered continuous.^{4} Be aware that continuous variables may also be referred to as quantitative in the literature. Examples of continuous variables include age, body mass index (BMI), or uncategorized systolic or diastolic blood pressure values.

**Categorical variables** consist of data measured on nominal or ordinal scales because these scales of measurement have naturally distinct categories. Examples of categorical variables include gender, race, or blood pressure status (e.g., hypotensive, normotensive, prehypertensive, hypertensive). In the literature, categorical variables are often termed discrete or if a variable is measured on a nominal scale with only two distinct categories it may be termed binary or **dichotomous**.

Note that it is common in the biomedical literature for researchers to categorize continuous variables. While not wholly incorrect, categorizing a continuous variable will always result in a loss of information about the variable. For example, consider the role participant age has on the probability of experiencing a cardiac event. Although age is a continuous variable, younger individuals typically have a much lower probability compared to older individuals. Thus, the researcher may divide age into four discrete categories: <30, 31–50, 51–70, and 70+. Note that assigning category cutoffs is generally an arbitrary process. In this example, information is lost by categorizing age because, after categorization, the exact age of the participant is unknown. That is, an individual's age is defined only by his or her age category. For example, a 50-year-old individual is considered identical to a 31-year-old individual because they are in the same age category.
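The information loss can be made concrete with a simple binning function. The cutoffs mirror the arbitrary categories above; note that the published cutoffs leave age 30 unassigned, so this hypothetical sketch folds it into the second category:

```python
def age_category(age):
    """Collapse a continuous age into one of four arbitrary categories."""
    if age < 30:
        return "<30"
    elif age <= 50:
        return "31-50"   # ages 30-50 land here in this sketch
    elif age <= 70:
        return "51-70"
    else:
        return "70+"

# Information loss: a 31-year-old is now indistinguishable from a 50-year-old.
print(age_category(31) == age_category(50))  # True
```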

There are two types of statistics—descriptive and inferential. Descriptive statistics present, organize, and summarize a variable’s data by providing information regarding the appearance of the data and distributional characteristics. Examples of descriptive statistics include measures of central tendency, variability, shape, histograms, boxplots, and scatterplots. These descriptive measures are the focus of this section.

Inferential statistics indicate whether a difference exists between groups of participants or whether an association exists between variables. Inferential statistics are used to determine whether the difference or association is real or whether it is due to some random process. Examples of inferential statistics are the statistics produced by each statistical test described later in the chapter. For example, the *t* statistic produced by a *t*-test.

❷ *Measures of central tendency are useful to quantify the distribution of a variable’s data numerically. The most common measures of central tendency are the mean, median, and mode, with the most appropriate measure of central tendency dictated by the variable’s scale of measurement*.

The **mean** (indicated as M in the literature) is the most common and appropriate measure of central tendency for normally distributed data (see the Common Probability Distributions section later in chapter) measured on an interval or ratio scale. The mean is the arithmetic average of a variable’s data. Thus, the mean is calculated by summing a variable’s data and dividing by the total number of participants with data for the specific variable. It is important to note that data points that are severely disconnected from the other data points can significantly influence the mean. These extreme data points are termed outliers.

The **median** (indicated as Mdn in the literature) is the most appropriate measure of central tendency for data measured on an ordinal scale. The median is the absolute middle value in the data; therefore, exactly half of the data lie above the median and exactly half lie below it. The median is also known as the 50th percentile. Note that the median can also be presented for continuous variables with skewed distributions (see the Measures of Distribution Shape section later in the chapter) and for continuous variables with outliers. A direct comparison of a variable's mean and median can give insight into how much influence outliers had on the mean.

Finally, the **mode** is the most appropriate measure of central tendency for nominal data. The mode is a variable’s most frequently occurring data point or category. Note that it is possible for a variable to have multiple modes; a variable with two or three modes is referred to as bimodal and trimodal, respectively.
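All three measures are available in Python's standard `statistics` module; a small hypothetical data set shows how differently they respond to an outlier:

```python
import statistics

# Hypothetical body weights in pounds; 400 is an outlier.
weights = [170, 175, 180, 185, 185, 190, 400]

mean = statistics.mean(weights)      # pulled upward by the outlier
median = statistics.median(weights)  # resistant to the outlier
mode = statistics.mode(weights)      # the most frequent value

print(round(mean, 1), median, mode)  # the mean far exceeds the median
```

Comparing the mean against the median, as the text suggests, immediately reveals the outlier's influence.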

Measures of variability are useful in indicating the spread of a variable’s data. The most common measures of variability are the range, interquartile range, variance, and standard deviation. These measures are also useful when considered with appropriate measures of central tendency to assess how the data are scattered around the mean or median. For example, consider the two histograms in Figure 8–1. Both histograms present data for 100 participants that have identical mean body weight of 185 pounds. However, notice the variability (or dispersion) of the data is much greater for Group 2. If the means for both groups were simply taken at face value, the participants in these two groups would be considered similar; however, assessing the variability or spread of data within each group illustrates an entirely different picture. This concept is critically important to the application of any inferential statistical test because the test results are heavily influenced by the amount of variability.

The simplest measure of variability is the range, which can be used to describe data measured on an ordinal, interval, or ratio scale. The range is a crude measure of variability calculated by subtracting a variable's minimum data point from its maximum data point. For example, in Figure 8–1 the range for Group 1 is 30 (i.e., 200 − 170), whereas the range for Group 2 is 95 (i.e., 235 − 140).

The interquartile range (indicated as IQR in the literature) is another measure of dispersion used to describe data measured on an ordinal scale; as such, the IQR is usually presented alongside the median. The IQR is the difference between the 75th and 25th percentiles. Therefore, because the median represents the 50th percentile, the IQR presents the middle 50% of the data and always includes the median.
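Both measures can be computed directly. Note that quartile conventions differ slightly across software packages, so the exact IQR may vary; this hypothetical sketch uses the default method of Python's `statistics.quantiles`:

```python
import statistics

# Hypothetical body weights in pounds.
weights = [140, 150, 160, 170, 180, 185, 190, 200, 220, 235]

data_range = max(weights) - min(weights)         # 235 - 140 = 95
q1, q2, q3 = statistics.quantiles(weights, n=4)  # quartiles (convention-dependent)
iqr = q3 - q1                                    # spread of the middle 50%

print(data_range, q2, iqr)  # the median (q2) always lies inside the IQR
```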

The final two measures of variability described are the variance and standard deviation. These measures are appropriate for normally distributed continuous variables measured on interval or ratio scales. ❸ *Variance is a key element inherent in all statistical analyses, but standard deviation is presented more often. Variance and standard deviation are related mathematically.* Variance is the average squared deviation from the mean for all data points within a specific variable. For example, consider Figure 8–1, where the mean for both Group 1 and 2 was 185 pounds. Say a specific participant in Group 1 had a body weight of 190 pounds. The squared deviation from the mean for this participant would be 25. That is, 190 − 185 = 5 and 5^{2} = 25. To calculate variance, the squared deviations are calculated for all data points. These squared deviations are then summed across participants and divided by the total number of data points (i.e., *N*) to obtain the average squared deviation. Some readers may be asking why the deviations from the mean are squared. This is a great question! The reason is that summing unsquared deviations across participants would equal zero; that is, deviations resulting from values above and below the mean would cancel each other out. While the calculation of variance may seem esoteric, conceptually all that is required to understand variance is that larger variance values indicate greater variability. For example, the variances of Groups 1 and 2 in Figure 8–1 were 25 and 225, respectively. While the histograms did not present these numbers explicitly, the greater variability in Group 2 can be observed clearly. The importance of variance cannot be overstated: it is the primary parameter used in all parametric statistical analyses, which is why it is presented in such detail here. With that said, variance is rarely presented as a descriptive statistic in the literature. Instead, variance is converted into a standard deviation, as described in the next paragraph.

As a descriptive statistic, the standard deviation (indicated as SD in the literature) is often preferred over variance because it indicates the average deviation from the mean presented on the same scale as the original variable, whereas variance presents the average deviation in squared units. It is important to note that the standard deviation and variance are directly related mathematically, with the standard deviation equal to the square root of the variance (i.e., $SD = \sqrt{variance}$). Thus, if the standard deviation is known, the variance can be calculated directly, and vice versa. When comparing variability between groups of participants, the standard deviation can provide insight into the dispersion of scores around the mean, and, similar to variance, larger standard deviations indicate greater variability in the data. For example, from Figure 8–1, Group 1 had a standard deviation of 5 (i.e., $\sqrt{25}$) and Group 2 had a standard deviation of 15 (i.e., $\sqrt{225}$). Again, the greater variability within Group 2 is evident.
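The Group 1 numbers can be reproduced with a small hypothetical data set whose mean is 185 and variance is 25. Python's `pvariance` and `pstdev` divide by *N*, matching the population formulas described above:

```python
import math
import statistics

# Hypothetical body weights with mean 185, mirroring Group 1.
weights = [180, 180, 180, 190, 190, 190]

mean = statistics.mean(weights)           # 185
variance = statistics.pvariance(weights)  # average squared deviation: 25
sd = statistics.pstdev(weights)           # square root of the variance: 5

print(mean, variance, sd, math.isclose(sd, math.sqrt(variance)))
```

Each weight deviates from the mean by ±5, so every squared deviation is 25, the variance is 25, and the standard deviation is back on the original scale at 5 pounds.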

Skewness and kurtosis are appropriate measures of shape for variables measured on interval or ratio scales, and indicate asymmetry and peakedness of a distribution, respectively. They are typically used by researchers to evaluate the distributional assumptions of a parametric statistical analysis.

Skewness indicates the asymmetry of the distribution of data points and can be either positive or negative. Positive (or right) skewness occurs when the mode and median are less than the mean, whereas negative (or left) skewness occurs when the mode and median are greater than the mean. As stated previously, the mean is sensitive to extremely disconnected data points termed outliers; thus, it is important to know the difference between true skewness and skewness due to outliers. True skewness is indicated by a steady decrease in data points toward the tails (i.e., ends) of the distribution. Skewness due to outliers is indicated when the mean is heavily influenced by data points that are extremely disconnected from the rest of the distribution. For example, consider the histograms in Figure 8–2. The data for Group 1 provide an example of true positive skewness, whereas the data for Group 2 provide an example of negative skewness due to outliers.

Kurtosis indicates the peakedness of the distribution of data points and can be either positive or negative. Plotted data with a narrow, peaked distribution and a positive kurtosis value is termed leptokurtic. A leptokurtic distribution has small range, variance, and standard deviation with the majority of data points near the mean. In contrast, plotted data with a wide, flat distribution and a negative kurtosis value is referred to as platykurtic. A platykurtic distribution is an indicator of great variability, with large range, variance, and standard deviation. Examples of leptokurtic and platykurtic data distributions are presented for Group 1 and 2, respectively, in Figure 8–2.
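Skewness and kurtosis can be sketched with the population-moment formulas; be aware that statistical packages often report bias-corrected variants, so their values may differ slightly. This hypothetical sketch uses only the standard library:

```python
import statistics

def skewness(data):
    """Third standardized moment; positive values indicate right skew."""
    m, sd, n = statistics.fmean(data), statistics.pstdev(data), len(data)
    return sum((x - m) ** 3 for x in data) / (n * sd ** 3)

def excess_kurtosis(data):
    """Fourth standardized moment minus 3, so a normal curve scores near 0."""
    m, sd, n = statistics.fmean(data), statistics.pstdev(data), len(data)
    return sum((x - m) ** 4 for x in data) / (n * sd ** 4) - 3

print(skewness([1, 1, 2, 2, 3, 10]) > 0)      # long right tail: positive skew
print(abs(skewness([1, 2, 3, 4, 5])) < 1e-9)  # symmetric data: skew near 0
```

Negative excess kurtosis on flat, spread-out data corresponds to the platykurtic shape described above, and positive values to the leptokurtic shape.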

Graphical representations of data are incredibly useful, especially when sample sizes are large, as they allow researchers to inspect the distribution of individual variables. There are typically three graphical representations presented in the literature—histograms, boxplots, and scatterplots. Note that graphical representations are typically used for continuous variables measured on ordinal, interval, or ratio scales. By contrast, nominal, dichotomous, or categorical variables are best presented as count data, which are typically reported in the literature as frequency and percentage.

A histogram presents data as frequency counts over some interval; that is, the *x*-axis presents the values of the data points, whether individual data points or intervals, while the *y*-axis presents the number of times the data point or interval occurs in the variable (i.e., frequency). Figures 8–1 and 8–2 provide examples of histograms. When data are plotted, it is easy to observe skewness, kurtosis, or outlier issues. For example, reconsider the distribution of body weight for Group 2 in Figure 8–2, where each vertical bar represents the number of participants having body weight within 10-unit intervals. The distribution of data has negative skewness due to outliers, with outliers being participants weighing between 90 and 120 pounds.

A boxplot, also known as a box-and-whisker plot, provides the reader with five descriptive statistics.^{5} Consider the boxplot in Figure 8–3, which presents the same data as the Group 2 histogram in Figure 8–2. The thin-lined box in a boxplot indicates the IQR, which contains the 25th to 75th percentiles of the data. Within this thin-lined box is a thick, bold line depicting the median, or 50th percentile. From both ends of the thin-lined box extends a tail, or whisker, depicting the minimum and maximum data points lying within 1.5 IQRs of the edges of the box. Beyond the whiskers, outliers and extreme outliers are identified with circles and asterisks, respectively. Note that other symbols may be used to identify outliers depending on the statistical software used. Outliers are defined as data points 1.5 to 3.0 IQRs beyond the edges of the box (i.e., the 25th and 75th percentiles), whereas extreme outliers are defined as data points greater than 3.0 IQRs beyond the box. The boxplot corroborates the information provided by the histogram in Figure 8–2, as participants weighing between 90 and 120 pounds are flagged as outliers.
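A common convention, used by many statistical packages, flags data points more than 1.5 IQRs beyond the quartiles as outliers. A minimal sketch with hypothetical weights (quartile conventions vary by package, so fence values may differ slightly elsewhere):

```python
import statistics

# Hypothetical body weights; the two lightest values sit far below the rest.
weights = [90, 100, 165, 170, 172, 175, 178, 180, 182, 185, 188, 190]

q1, _, q3 = statistics.quantiles(weights, n=4)  # 25th and 75th percentiles
iqr = q3 - q1

lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in weights if x < lower_fence or x > upper_fence]

print(outliers)  # the disconnected low weights are flagged
```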

Finally, a scatterplot presents data for two variables both measured on a continuous scale. That is, the *x*-axis contains the range of data for one variable, whereas the *y*-axis contains the range of data for a second variable. In general, the axis choice for a given variable is arbitrary. Data are plotted in a similar fashion to plotting data on a coordinate plane during an introductory geometry class. Figure 8–4 presents a scatterplot of height in inches and body weight in pounds. The individual circles in the scatterplot are a participant’s height in relation to their weight. Because the plot is bivariate (i.e., there are two variables), participants must have data for both variables in order to be plotted. Scatterplots are useful in visually assessing the association between two variables as well as assessing assumptions of various statistical tests such as linearity and absence of outliers.

Up to this point, data distributions have been discussed using very general terminology. There are numerous distributions available to researchers, far too many to list completely; what matters is that each has different characteristics to fit the unique requirements of the data. Globally, these distributions are termed probability distributions, and they are incredibly important to every statistical analysis conducted. Thus, the choice of distribution used in a given statistical analysis is nontrivial, as the use of an improper distribution can lead to incorrect statistical inference. Of the distributions available, the normal and binomial distributions are used most frequently in the biomedical literature; therefore, these are discussed in detail. To provide the reader with a broader listing, this section also presents brief information about other common distributions used in the biomedical literature and when they are appropriately used in statistical analyses.

The normal distribution, also called Gaussian distribution, is the most commonly used distribution in statistics and one that occurs frequently in nature. It is used only for continuous variables measured on interval or ratio scales. The normal distribution has several easily identifiable properties based on the numerical measures of central tendency, variability, and shape. Specifically, this includes the following characteristics:

The primary shape is bell-shaped.

The mean, median, and mode are equal.

The distribution has one mode, is symmetric, and reflects itself perfectly when folded at the mean.

The skewness and kurtosis are zero.

The area under a normal distribution is, by definition, equal to one.

It should be noted that the five properties above are considered the gold standard. In practice, however, each of these properties will be approximated; namely, the curve will be roughly bell-shaped, the mean, median, and mode will be roughly equal, and skewness and kurtosis may be evident but not greatly exaggerated. For example, the distribution of data for Group 2 in Figure 8–1 is approximately normal.

Several additional properties of the normal distribution are important; consider Figure 8–5. First, the distribution is completely defined by the mean and standard deviation. Consequently, there are an infinite number of possible normal distributions because there are an infinite number of mean and standard deviation combinations. In the literature, this property is often stated as the mean and standard deviation being sufficient statistics for describing the normal distribution. Second, the mean can always be identified as the peak (or mode) of the distribution. Third, the standard deviation will always dictate the spread of the distribution. That is, as the standard deviation increases, the distribution becomes wider. Finally, roughly 68% of the data will occur within one standard deviation above and below the mean, roughly 95% within two standard deviations, and roughly 99.7% within three standard deviations.

Among the infinite number of potential normal distributions, the standard normal distribution serves as the common reference against which all normal distributions can be compared. Although this may seem confusing on the surface, a clearer understanding of the standard normal distribution is made possible by considering the standard deviation. To convert a normal distribution to the standard normal distribution, the data must first be converted into standardized scores referred to as **z-scores**. A z-score converts the units of the original data into standard deviation units. That is, a z-score indicates how many standard deviations a data point is from the mean. When converted to z-scores, the new standard normal distribution will always have a mean of zero and a standard deviation of one. The standard normal distribution is presented in Figure 8–5.

It is a common misconception that converting data into *z*-scores creates a standard normal distribution from data that was not normally distributed. This is never true. A standardized distribution will always have the same characteristics as the distribution from which it originated. That is, if the original distribution was skewed, the standardized distribution will also be skewed.

❹ *The key benefit to using the standard normal distribution is that converting the original data to z-scores allows researchers to compare different variables regardless of the original scale*. Remember, a standardized variable will always be expressed in standard deviation units with a mean of zero and standard deviation of one. Therefore, differences between variables may be more easily detected and understood. For example, it is possible to compare standardized variables across studies. It should go without saying that the only requirement is that both variables measure the same construct. After z-score standardization, the age of two groups of participants from two different studies can be compared directly.
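A sketch of z-score standardization (the ages below are hypothetical, not drawn from any study cited in this chapter):

```python
def z_scores(data):
    """Convert raw data into standard deviation units (z-scores)."""
    n = len(data)
    mean = sum(data) / n
    # Sample standard deviation (n - 1 denominator)
    sd = (sum((x - mean) ** 2 for x in data) / (n - 1)) ** 0.5
    return [(x - mean) / sd for x in data]

# Hypothetical participant ages from two different studies
ages_study1 = [34, 45, 52, 61, 48]
ages_study2 = [29, 38, 40, 55, 63, 47]

# After standardization each variable has mean 0 and SD 1,
# so the two age distributions can be compared directly.
z1, z2 = z_scores(ages_study1), z_scores(ages_study2)
print([round(z, 2) for z in z1])
```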

Many discrete variables can be dichotomized into one of two mutually exclusive groups, outcomes, or events (e.g., dead versus alive). Using the binomial distribution, a researcher can calculate the exact probability of experiencing either binary outcome. The binomial distribution can only be used when an experiment assumes the four characteristics listed below:

The trial occurs a specified number of times (analogous to sample size, *n*).

Each trial has only two mutually exclusive outcomes (success versus failure in a generic sense, *x*). Also, be aware that a single trial with only two outcomes is known as a Bernoulli trial, a term that may be encountered in the literature.

Each trial is independent, meaning that one outcome has no effect on the other.

The probability of success remains constant throughout the trial.

An example may assist with the understanding of these characteristics. The binomial distribution consists of the number of successes and failures during a given study period. When all trials have been run, the probability of achieving exactly *x* successes (or failures) in *n* trials can be calculated. Consider flipping a fair coin. The coin is flipped for a set number of trials (i.e., *n*), there are only two possible outcomes (i.e., heads or tails), each trial is not affected by the outcome of the last, and the probability of flipping heads or tails remains constant throughout the trial (i.e., 0.50).

By definition, the mean for the binomial distribution is equal to the proportion of successes in a given trial. That is, if a fair coin is flipped 10 times and heads turns up on six flips, the mean is 0.60 (i.e., 6/10). Further, the variance of the binomial distribution is fixed by the mean. While the equation for the variance is not presented, know that the variance is largest at a mean of 0.50, decreases as the mean diverges from 0.50, and is symmetric (e.g., the variance for a mean of 0.10 is equal to the variance for a mean of 0.90). Therefore, because variance is fixed by the mean, the mean is the sufficient statistic for the binomial distribution.

In reality, the probability of experiencing an outcome is rarely 0.50. For example, biomedical studies often use all-cause mortality as an outcome variable, and the probability of dying during the study period is generally lower than staying alive. At the end of the trial, participants can experience only one of the outcomes—dead or alive. Say a study sample consists of 1000 participants, of which 150 die. The binomial distribution allows for the calculation of the exact probability of having 150 participants die in a sample of 1000 participants.
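The exact probabilities described above follow from the binomial probability mass function; a sketch using only Python's standard library (p = 0.15 is an assumed per-participant probability of dying, consistent with the example's 150/1000 proportion):

```python
from math import comb

def binomial_pmf(x, n, p):
    """Exact probability of exactly x successes in n independent trials."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

# Coin example: probability of exactly 6 heads in 10 fair flips
print(round(binomial_pmf(6, 10, 0.5), 4))      # → 0.2051

# Mortality example: exactly 150 deaths among 1000 participants,
# assuming the probability of dying is p = 0.15
print(round(binomial_pmf(150, 1000, 0.15), 4))
```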

As mentioned in the introduction to this section, there are numerous probability distributions available to researchers, with their use determined by the scale of measurement of the DV as well as the shape (e.g., skewness) of the distribution. When reading journal articles, it is important to know whether the appropriate distribution has been used for statistical analysis as inappropriate use of any distribution can lead to inaccurate statistical inference.

Table 8–1 provides a short list of several commonly used distributions in the biomedical literature. Of note here is that alternative distributions are available when statistically analyzing non-normally distributed continuous data or data that cannot be normalized such as categorical data. The take-away message here is that non-normal continuous data do not need to be forced to conform to a normal distribution.

| Distribution Name | DV Scale | Distribution Characteristics | Comments |
|---|---|---|---|
| Normal | Continuous | | |
| Gamma | Continuous | Positive skew | |
| Inverse Gaussian | Continuous | Positive skew | |
| Exponential | Continuous | Positive skew | Data has to be > 0 |
| Log-normal^{a} | Continuous | Positive skew | Data has to be > 0 |
| Weibull | Continuous | Positive or negative skew | |
| Gompertz | Continuous | Positive or negative skew | |
| Poisson | Count | Mean = variance | Data ≥ 0 |
| Negative binomial | Count | Mean ≠ variance | Data ≥ 0 |
| Bernoulli | Binary | One trial; 2 categories | |
| Binomial | Binary | Repeated Bernoulli trials; 2 categories | |
| Categorical | Categorical | One trial; > 2 categories | |
| Multinomial | Categorical | Repeated trials; > 2 categories | |

The only DV scale presented in Table 8–1 that has not been discussed thus far is count data. An example of count data would be a count of the number of hospitalizations during a 5-year study period. It is clear from the example that count data cannot take on negative values. That is, a participant cannot have a negative number of hospitalizations. Both the Poisson and negative binomial distributions are used when analyzing count data. The Poisson distribution assumes the mean and the variance of the data are identical. However, in situations where the mean and variance are not equal, the negative binomial distribution allows the variance of the distribution to increase or decrease as needed. As a point of possible confusion, the negative binomial distribution does not allow negative values. The negative in negative binomial is a result of using a negative exponent in its mathematical formula.
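A practical way to choose between the two count distributions is to compare the sample mean and variance; a minimal sketch using hypothetical hospitalization counts:

```python
def mean_and_variance(counts):
    """Sample mean and (n - 1) sample variance for count data."""
    n = len(counts)
    mean = sum(counts) / n
    var = sum((c - mean) ** 2 for c in counts) / (n - 1)
    return mean, var

# Hypothetical 5-year hospitalization counts for 10 participants
counts = [0, 1, 0, 2, 5, 0, 1, 7, 0, 4]
mean, var = mean_and_variance(counts)
print(f"mean = {mean:.1f}, variance = {var:.1f}")
# Variance well above the mean (overdispersion) suggests the
# negative binomial distribution rather than the Poisson.
```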

Transformations are often employed by researchers to convert a non-normal distribution into a distribution that is approximately normal. Although on the surface this may appear reasonable, data transformation is an archaic technique. As such, transformation is not recommended for the three reasons provided below.

First, although parametric statistical analyses assume normality, the distribution of the actual DV data is not required to be normally distributed. As stated in the previous section, and highlighted in Table 8–1, if non-normality is observed, alternative distributions exist allowing proper statistical analysis of non-normal data without transformation. For this reason alone, transformation is rarely necessary.

Second, transforming the DV data potentially prevents effects of an IV from being observed. As an overly simplistic example, consider the bimodal distribution of body weight data presented in Figure 8–6. Say the body weight data were collected from a sample of men and women. Obviously, the data in Figure 8–6 is bimodal and not normally distributed. In this situation, a researcher may attempt to transform the data, but transformation would obscure the effects of an obvious IV—gender. That is, women tend to be lighter compared to men, and this is observed in these data, with women being grouped in the left distribution and men grouped in the right distribution. If transformation were performed successfully, inherent differences between men and women would be erased.

Third, while data transformation will not affect the rank order of the data, it will significantly cloud interpretation of statistical analyses. That is, after transformation all results must be interpreted in the transformed metric, which can become convoluted quickly making it difficult to apply results to the untransformed data used in the real world. Thus, when reading a journal article where the authors employed any transformation, be aware the results are in the transformed metric and ensure the authors’ interpretations remain consistent with this metric.

Although transformation is not recommended, it can be appropriate in some circumstances, and it will inevitably be encountered when reading the literature, especially for skewed data or data with outliers. Therefore, it is useful to become familiar with several of the techniques used to transform data. Transformations for positive skewness differ from those suggested for negative skewness. Mild positive skewness is typically remedied by a square root transformation. Here, the square root of all data is calculated and this square root data is used in analysis. When positive skewness is severe, a natural log transformation is used. When data are skewed negatively, researchers may choose to reflect their data to make it positively skewed and apply the transformations for positive skewness described above. To reflect data, a 1 is added to the absolute value of the highest data point and all data points are subtracted from this new value. For example, if the highest value in the data is 10, a 1 is added to create 11, and then all data points are subtracted from 11 (e.g., 11 − 10 = 1, 11 − 1 = 10, etc.). In this manner, the highest values become the lowest and the lowest become the highest. It should be clear that this process considerably convolutes data interpretation!
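These transformations are straightforward to express in code; a sketch of the reflection procedure and the two positive-skew transformations (all data values are hypothetical):

```python
import math

def reflect(data):
    """Reflect negatively skewed data so positive-skew transforms apply.

    Add 1 to the absolute value of the highest data point, then subtract
    each point from that constant; the highest values become the lowest
    and vice versa.
    """
    pivot = max(abs(x) for x in data) + 1
    return [pivot - x for x in data]

data = [1, 3, 7, 9, 10]             # highest value 10, so pivot = 11
print(reflect(data))                # → [10, 8, 4, 2, 1]

skewed = [1, 2, 3, 50]              # positively skewed
print([round(math.sqrt(x), 2) for x in skewed])  # mild skew: square root
print([round(math.log(x), 2) for x in skewed])   # severe skew: natural log
```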

The field of epidemiology investigates how diseases are distributed in the population and the various factors (or exposures) influencing this distribution.^{6} Epidemiological statistics are not unique to the field of epidemiology, as much of the literature in the biomedical sciences incorporates some form of these statistics such as odds ratios. Thus, it is important to have at least a basic level of understanding of these statistics. In this section, the most commonly used epidemiological statistics are discussed briefly including ratios, proportions, and rates, incidence and prevalence, relative risk and odds ratios as well as sensitivity, specificity, and predictive values.

Ratios, proportions, and rates are terms often used interchangeably in the medical literature without regard for their actual mathematical definitions. Further, there are a considerable number of proportions and rates available to researchers, each providing unique information. Thus, it is important to be aware of how each of these measures is defined and calculated.^{7}

A ratio expresses the relationship between two numbers. For example, consider the ratio of men to women diagnosed with multiple sclerosis (MS). If, in a sample consisting of only MS patients, 100 men and 50 women are diagnosed, the ratio of men to women is 100 to 50, or 100:50, or 2:1. Remember that the order in which the ratio is presented is vitally important; that is, 100:50 is not the same as 50:100.

A proportion is a specific type of ratio indicating the probability or percentage of the total sample that experienced an outcome or event without respect to time. Here, the numerator of the proportion, representing patients with the disease, is included in the denominator, representing all individuals at risk. For example, say 850 non-MS patients are added to the sample of 150 MS patients from the example above to create a total sample of 1000 patients. Thus, the proportion of patients with MS is 0.15 or 15% (i.e., 150/1000).

A rate is a special form of a proportion that includes a specific study period, typically used to assess the speed at which the event or outcome is developing.^{8} A rate is equal to the number of events in a specified time period divided by the length of the time period. For example, say over a 1-year study period, 50 new cases of MS were diagnosed out of the 850 previously undiagnosed individuals. Thus, the rate of new cases of MS within this sample is 50 per year.

Incidence quantifies the occurrence of an event or outcome over a specific study period within a specific population of individuals. The incidence rate is calculated by dividing the number of new events by the population at risk, with the population at risk defined as the total number of people who have not experienced the outcome. For example, consider the 50 new cases of MS that developed from the example above from the 850 originally undiagnosed individuals. The incidence rate is approximately 0.06 (i.e., 50/850). Note the denominator did not include the 150 patients already diagnosed with MS because these individuals could no longer be in the population at risk.

Prevalence quantifies the number of people who have already experienced the event or outcome at a specific time point. Prevalence is calculated by dividing the total number of people experiencing the event by the total number of individuals in the population. Note that the denominator is everyone in the population, not just individuals in the population at risk. For example, the diagnosed MS cases (i.e., 50 + 150 = 200) would be divided by the population that includes them. That is, the prevalence of MS in this sample is 0.20 (i.e., 200/1000).
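The MS example can be worked through directly; each measure differs only in its numerator and denominator:

```python
# MS example from the text: 1000 total patients, 150 previously
# diagnosed, and 50 new cases over 1 year among the 850 at risk
total = 1000
existing_cases = 150
new_cases = 50
at_risk = total - existing_cases                    # 850 undiagnosed individuals

proportion = existing_cases / total                 # 150/1000 = 0.15
incidence = new_cases / at_risk                     # 50/850 ≈ 0.06
prevalence = (existing_cases + new_cases) / total   # 200/1000 = 0.20

print(proportion, round(incidence, 2), prevalence)  # → 0.15 0.06 0.2
```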

Finally, it is important to consider both incidence and prevalence when describing events or outcomes. This is because prevalence varies directly with incidence and the duration of the sickness or disease. For example, consider influenza, where the duration of the sickness is relatively short. Thus, while the incidence of new influenza cases may be high, the overall prevalence may be low because most individuals recover quickly. By contrast, consider individuals diagnosed with asthma. Because asthma is incurable, the prevalence of the disease may be high, whereas the incidence may be low depending on the total number of new cases diagnosed throughout the year.

Relative risk is defined as the ratio (or probability) of the incidence of an event occurring in individuals exposed to a stimulus compared to the incidence of the event in those not exposed to the stimulus. Relative risk can be calculated directly from the cohort study design (see Chapter 5). Briefly, this design is typically a prospective observational design comparing the incidence of experiencing an event in cohorts of exposed and unexposed individuals over time.

Relative risk is used when comparing the probability of an event occurring to all possible events considered in a study. For example, consider the risk of developing lung cancer in those who are exposed and unexposed to second-hand smoke over a 10-year study period. Upon study conclusion, the 2 × 2 contingency table, shown in Figure 8–7, is created containing frequency counts of events for two groups exposed and unexposed to the second-hand smoke stimulus. This table provides all data necessary to calculate the incidence of the event for both exposed and unexposed individuals. Relative risk is calculated by dividing the proportion of individuals who suffered the event in the exposed group (i.e., A/(A + B)) by the proportion of individuals who suffered the event in the unexposed group (i.e., C/(C + D)). Relative risk provides a single number ranging from zero to infinity, and there are three resulting interpretations provided below.^{6}

If relative risk equals 1, the risk of experiencing the event was equal within the exposed and unexposed groups. Thus, there is no association of being exposed to the stimulus.

If relative risk is greater than 1, the exposed group has a greater risk of experiencing the event compared to the unexposed group. Thus, there is a positive association or detrimental effect (risk factor) of being exposed to the stimulus.

If relative risk is less than 1, the exposed group has a lower risk of experiencing the event compared to the unexposed group. Thus, there is a negative association or protective effect of being exposed to the stimulus.

When relative risk cannot be calculated, researchers will often present an odds ratio. Odds are calculated by dividing the proportion of people experiencing an event by the proportion of people not experiencing an event. Thus, an odds ratio is a ratio of two odds; one for individuals exposed to the stimulus and the other for those not exposed to the stimulus. Odds ratios range from zero to infinity and have three interpretations identical to those presented above for relative risk; simply substitute the odds ratio for relative risk. Odds ratios can be calculated for both cohort and case-control designs. A case-control study compares cases that have experienced the event and controls who have not, and then assesses whether each individual was exposed to a stimulus or not. Thus, a case-control study is retrospective.

Odds ratios are used when comparing events to nonevents, with the calculation depending on the study design. For example, consider comparing a group of individuals who developed measles to those who did not and then determining whether they received all recommended vaccinations. In a cohort study, the odds ratio is calculated by dividing the odds of experiencing the event in the exposed group (i.e., A/B) by the odds of experiencing the event in the unexposed group (i.e., C/D). In a case-control study, the odds ratio is calculated by dividing the odds that cases were exposed to the risk (i.e., A/C) by the odds that the controls were exposed (i.e., B/D).
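Both measures can be computed from the same 2 × 2 table cells (A, B, C, D as defined in the text); the counts below are hypothetical:

```python
def relative_risk(a, b, c, d):
    """RR from a 2x2 table: exposed (a events, b nonevents),
    unexposed (c events, d nonevents)."""
    return (a / (a + b)) / (c / (c + d))

def odds_ratio(a, b, c, d):
    """Cohort-style OR from the same 2x2 table: (A/B) / (C/D)."""
    return (a / b) / (c / d)

# Hypothetical cohort: 30 of 100 exposed and 10 of 100 unexposed
# individuals develop the event
a, b, c, d = 30, 70, 10, 90
print(round(relative_risk(a, b, c, d), 2))   # → 3.0
print(round(odds_ratio(a, b, c, d), 2))      # → 3.86
# With a common outcome like this, the OR (3.86) exceeds the RR (3.0),
# illustrating why ORs overestimate risk for non-rare outcomes.
```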

Relative risk and odds ratios are comparable in magnitude only when the outcome under study is rare (e.g., some cancers). It is important to consider that odds ratios consistently overestimate risk when the outcome is more common (e.g., hyperlipidemia). As a result, relative risk should be used if possible and caution should be exercised when interpreting odds ratios.

Sensitivity, specificity, and positive and negative predictive values indicate the ability of a test to identify correctly those experiencing the event and those who did not. For example, consider the ability of a blood glucose screening test to correctly identify individuals with diabetes. Four outcomes result from this test, which are required for the calculation of sensitivity, specificity, and the predictive values:

True positives (TP) have the disease and have a positive test result.

False positives (FP) do not have the disease, but have a positive test result.

True negatives (TN) do not have the disease and have a negative test result.

False negatives (FN) have the disease, but have a negative test result.

**Sensitivity** is the probability a diseased individual will have a positive test result and is the true positive rate of the test. It is calculated by dividing true positives by all individuals who actually have the disease (i.e., TP/(TP + FN)). **Specificity** is the probability a disease-free individual will have a negative test result and is the true negative rate of the screening test. It is calculated by dividing true negatives by all disease-free individuals (i.e., TN/(TN + FP)).

Positive and negative predictive values are calculated to measure the accuracy of the screening test. Both predictive values depend on disease prevalence; as prevalence increases, the positive predictive value increases while the negative predictive value decreases.^{6} Positive predictive value provides the proportion of individuals who test positive for the disease that actually have the disease. It is calculated by dividing true positives by all individuals with a positive test result (i.e., TP/(TP + FP)). Negative predictive value provides the proportion of individuals who test negative who are actually disease-free. It is calculated by dividing true negatives by all individuals with a negative test result (i.e., TN/(TN + FN)).
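The four definitions translate directly into code; a sketch with hypothetical counts from a glucose screening test:

```python
def screening_stats(tp, fp, tn, fn):
    """Sensitivity, specificity, and predictive values from test counts."""
    sensitivity = tp / (tp + fn)   # true positive rate
    specificity = tn / (tn + fp)   # true negative rate
    ppv = tp / (tp + fp)           # P(disease | positive test)
    npv = tn / (tn + fn)           # P(no disease | negative test)
    return sensitivity, specificity, ppv, npv

# Hypothetical results: 100 diseased and 900 disease-free individuals
sens, spec, ppv, npv = screening_stats(tp=90, fp=40, tn=860, fn=10)
print(sens, round(spec, 3), round(ppv, 3), round(npv, 3))
# → 0.9 0.956 0.692 0.989
```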

It is important to identify the implications all four values have for new and existing research. When designing a study involving a screening test, researchers must specify a standard cutoff score for the screening; that is, define who is to be considered diseased and who disease-free. This decision must weigh the repercussions of classifying individuals as false negatives or false positives. For example, consider a screening tool for early stage breast cancer, where there are considerable consequences for both false positives and false negatives. On the one hand, a patient with a false positive may be referred for unnecessary testing that is painful and expensive, not to mention emotionally taxing. On the other hand, a false negative has more serious implications, since the patient may not receive any treatment until the disease has progressed.

The distinction between experimental, quasi-experimental, and nonexperimental research is important, both from a study design perspective and when evaluating literature. Although experimental designs are considered the gold standard by many, do not discount research conducted using quasi-experimental and nonexperimental designs, as long as the limitations are considered. Each of these types of designs will be discussed below.

In the biomedical sciences, experimental designs are typically referred to as randomized controlled trials (RCTs). A full treatment of experimental design is presented in Chapters 4 and 5, so the discussion provided here will only scratch the surface of experimental designs. Interested readers are encouraged to consult the suggested readings provided at the end of this chapter for more information.

First, experimental designs always allow the researcher to manipulate levels of an IV. For example, consider a drug trial assessing the effectiveness of a new cancer medication. For this trial, four groups of participants are randomly assigned to a different dose of a medication. Thus, there are four levels of the IV. The researcher, within ethical and theoretical constraints, can manipulate the size of the dose, if the participants will be measured multiple times, and the length of the study period.

Second, in experimental design participants are randomly assigned to levels of the IV; thus, any participant has a chance of being placed in any single group. There are many different methods and theories of randomization and the chance of being in one group versus another does not necessarily have to be equal. For example, consider a study that has two treatment groups and one placebo group where the researcher is interested in the difference between the treatment groups. Because the difference in the DV between the placebo and either treatment group will usually be larger than the difference in the DV between the treatment groups, the researcher may randomize more participants to the treatment groups to increase statistical power to detect the difference between treatments. Note that the concept of statistical power is discussed in the Statistical Inference section below as well as in Chapter 4. Often a 2:2:1 or 3:3:1 randomization schedule will be used with two or three times as many participants, respectively, being randomized to treatment as opposed to placebo.

Third, causality can be determined with proper experimental control of error. For example, the effectiveness of a cancer drug can be better explained by reducing sources of error due to the participant (e.g., age, health status), setting (e.g., prescriber’s office), or diagnostic tests (e.g., measurement accuracy).

Finally, it should be noted RCTs have limitations. The primary limitation is cost, as RCTs are extremely costly requiring many considerations including space, personnel, and participants. RCTs also may have limited external validity and generalizability due to extreme control over experimental conditions (e.g., efficacy study) that do not necessarily translate to real world situations (e.g., effectiveness study). Third, it is difficult, if not impossible, to study rare events with an RCT due to ethical concerns and the considerable sample size required.

Quasi-experimental designs are used more often in the social sciences, but they can be observed in the biomedical literature. On the surface, these types of designs appear to be experimental; however, they lack one key aspect, random assignment.

For example, consider examining the effectiveness of a new dialysis treatment. Most dialysis patients are already in the care of a nephrologist at a specific clinic. Because nephrologists typically see numerous patients, randomizing patients to specific levels of treatment (i.e., the IV) within a group that is under the care of the same nephrologist may be unfeasible logistically or may lead to medication errors. Thus, the entire clinic must be randomized. That is, all patients within a specific clinic will receive one treatment.

Advantages of quasi-experimental designs include reduced cost and a quicker timeline compared to RCTs, with the addition of possible increases in external validity due to conditions being more consistent with the real world. The disadvantages, however, are considerable. This is most notable with the lack of random assignment. Nonrandom assignment may create dissimilar groups based on any number of characteristics that are potentially related to the success of the treatment. For example, consistent differences in patient demographics (e.g., disease severity) may result when a dialysis clinic in the suburbs is compared to a dialysis clinic in a more urban setting. Further, causation cannot be implied and statistical analysis can potentially be rendered uninformative.^{9}

Nonexperimental designs have several advantages over RCTs, primarily lower cost, a quicker timeline to publication, and a broader range of participants.^{10} The advantages of nonexperimental studies over RCTs have prompted their widespread use in the biomedical sciences. Overall, these studies tend to be nonrandomized, retrospective, and correlational in nature and are distinct because the researcher cannot manipulate the IV(s).

For example, consider a 5-year retrospective study assessing the effectiveness of statin therapy on preventing cardiac events. The researcher has knowledge of which patients initiated statin therapy, but has no control over the drug, dose, adherence, etc. Although the researcher may assign patients to groups based on dose size, the researcher cannot randomly assign patients to a specific drug nor can they manipulate the dose. In addition, nonexperimental research often fails to indicate causality, which is due to lack of experimental control and randomization as well as inability to identify all confounding variables. Finally, it is important to note that while nonexperimental designs are ubiquitous in the biomedical sciences, treatment effects may be different when compared to RCTs.^{11}

The human body needs vitamin D to absorb calcium. Without sufficient calcium absorption the body will extract calcium from its bone stores thereby weakening bone and increasing the probability of bone fractures. An endocrinologist interested in bone metabolism is considering a 6-month prospective study to evaluate whether differing doses of vitamin D will affect rates of calcium absorption in postmenopausal women. The researcher plans to enroll her own patients from those seen at her clinic. Calcium absorption will be measured by the dual isotope tracer method and vitamin D will be measured as serum 25-hydroxyvitamin D (25OHD). Both will be treated as continuous variables in statistical analysis. After baseline calcium absorption and 25OHD measurements are collected, participants will be randomized into four groups—one placebo group and three groups ingesting a different orally administered vitamin D supplement daily for 6 months (i.e., 500 international units, 2500 international units, and 5000 international units). At the end of the 6-month study period, calcium absorption will be measured again.

1.Describe the population of interest.

2.What sampling strategy or strategies were used?

3.Would this study be considered a randomized controlled trial? Why or why not?

4.What is the DV for this study? What is the scale of measurement for the DV?

5.What is the IV for this study? What is the scale of measurement for the IV? How many levels does the IV have? What about the IV is manipulated by the researcher?

6.Based on the study description, were any confounding variables or covariates considered for this study?

7.The researcher is planning on using the binomial distribution to evaluate the probability the participant is calcium deficient. Is this correct given the scale of the DV? If not, what distribution would be a better option to consider?

8.The researcher is planning on presenting calcium absorption and 25OHD as median and IQR. Is this the most appropriate measure of central tendency for these variables?

9.A histogram of the calcium absorption variable indicated severe positive skewness, but no outliers. The researcher is considering using a square root transformation of the DV prior to analysis. Is a data transformation appropriate? Why or why not?

The U.S. National Institutes of Health (NIH) defines five different types of clinical trials—treatment, prevention, diagnostic, screening, and quality of life.^{12} In this chapter, two specific types of treatment clinical trials are discussed—the randomized controlled trial (RCT) and adaptive clinical trial (ACT). The experimental designs of RCTs and ACTs are discussed at length in Chapters 4 and 5, so the descriptions provided in this chapter are minimal. Briefly, RCTs and ACTs are both protocol based, meaning that every step of the study from design to analysis is identified *a priori*. They are prospective studies where participants are followed over time using strict experimental control to reliably establish causality between the manipulated IV and the DV.

The most common RCT is a parallel-groups design, where the IV typically involves participants randomized into fixed levels of treatment, also known as treatment arms, with each arm indicating a different treatment or comparison.^{13–16} That is, participants are randomly assigned to one, and only one, treatment arm. As stated in the previous section, the sample size within each group does not have to be equal and can vary depending on estimated statistical power to detect treatment effects.

There are two parallel-groups designs frequently used in the biomedical sciences—group comparison and matched pairs.^{15} Briefly, a group comparison design simultaneously compares at least two groups of participants after each group is randomized to a different level of the IV. In a matched pairs design, participants are matched based on one or more characteristics (e.g., age, race) and then randomized to levels of the IV.

The second most common RCT is a crossover design, which has the key advantage of each participant serving as his or her own control.^{13,15} Consequently, a crossover design requires fewer participants than a parallel-groups design because, by the end of the study, all participants will have received all treatment arms. A disadvantage is that a crossover design cannot be used in a study where the first drug may cure the participant (e.g., an antibiotic given for an infection), since there would be no reason for the participant to crossover to the other agent.

For example, consider a 1-month study that includes two treatment arms. For a parallel-groups design, say 20 participants are required; that is, 10 participants are randomized to each treatment arm. By contrast, in a crossover design, only 10 participants are required because each participant receives both treatments—10 participants receive the first treatment and the same 10 participants receive the second treatment. While both designs have two total measurements, in the parallel-groups design, two individual groups of participants provide one measurement each, whereas in the crossover design the same group of participants provides both measurements.

At this point, a common question is whether the effect of the first treatment carries over to influence the effect of the second treatment. This is a considerable concern in crossover designs and is dealt with by including a washout period between treatments. The duration of the washout period is somewhat arbitrary; it must be long enough for the effect of the first treatment to be absent prior to beginning the second treatment, but not so long that participant attrition becomes a concern.

A more recent advancement to the RCT is the adaptive design or ACT. Although an ACT is possibly cheaper and more ethical than an RCT, this design is much more complex to both implement and analyze. Briefly, ACTs implement changes or adaptations in the design of the study based on the results of a predetermined set of interim statistical analyses. Interim analyses can be based on blinded or unblinded data, with the resulting adaptations aimed at establishing a more efficient, safer, and informative trial that is more likely to demonstrate treatment effects.^{17}

For example, consider a 3-year study examining the effect of three different large doses of vitamin D on parathyroid hormone. Because implementing large doses of vitamin D is controversial, ethical considerations require this study to be adaptive. Here, the interim analyses would provide important information regarding the effectiveness and safety of the doses. Say, for example, that the ACT has interim analyses scheduled quarterly. Further, say that at the end of the second quarter of the first year the interim analyses indicated that the group receiving the highest dose of vitamin D had twice the risk of developing kidney stones compared to the other two groups. Thus, the group receiving the highest dose of vitamin D could have their dose reduced or the group could be dropped from the study completely. After these adaptations are implemented, the study continues as designed.

The analysis of clinical trials typically involves evaluating repeated measures data where participants are followed prospectively over time or conducting an endpoint analysis using only the last observation or measurement. Regardless of the study design, the choice of the appropriate statistical test to analyze these data is based primarily on the scale of measurement of the DV, but other factors should also be considered (see the Selecting the Appropriate Statistical Test section later in the chapter).

Because an endpoint analysis is straightforward conceptually, a brief discussion of repeated measures design and analysis considerations is presented. Using a repeated measures design allows researchers to study the treatment effects over time in a smaller sample of participants due to the increased statistical power. Briefly, a repeated measures design increases statistical power by removing the error variance due to the participant. That is, each participant serves as their own control. Removing error variance is important because it results in larger test statistics and an increased probability of achieving statistical significance. However, a repeated measures design has limitations relevant to clinical trials, primarily participant attrition and nonadherence.

Attrition and nonadherence can assume many forms in a clinical trial. For example, participants may drop out of the study, fail to complete all required measurements, receive incorrect treatment or doses, or a myriad of other possible protocol violations. Thus, the first part of this section will discuss how the analysis of clinical trials typically handles missing data due to attrition and nonadherence. The final sections discuss the analysis of parallel-group, crossover, and adaptive designs.

Two analytical approaches exist for clinical trials—intent-to-treat (ITT) and per-protocol (PP). The ITT approach is often employed in the presence of violations to protocol or patients being lost to follow-up. ITT is the approach most often used in the biomedical literature. ITT requires the analysis to include all participants in the arm to which they were randomized originally. That is, treatment effects are best evaluated by the planned treatment protocol rather than the actual treatment given.^{18}

For example, consider a participant randomized to receive treatment A but instead receives treatment B. For analysis, this participant would be considered as receiving treatment A. It should be noted that if a large number of protocol violations of this nature occur, the study would be discontinued; thus, these occurrences are relatively rare. It is important to note that ITT has considerable weaknesses. Most notably, for ITT to be unbiased, attrition and nonadherence must be assumed to occur completely at random.^{14} However, this is an untestable assumption. Further, the ITT approach can dilute treatment effects simply by including nonadherent participants or by employing the last observation carried forward (LOCF) technique discussed later in this section.

By comparison, the PP approach evaluates only compliant participants with complete data. Although this analysis is straightforward analytically and allows researchers to evaluate a more accurate treatment effect, it has substantial limitations, primarily reduced statistical power compared to an ITT approach, because participants with incomplete data are not considered in analysis.^{18}

Because the ITT approach considers all participants with at least one measurement, an imputation or replacement method is often employed for missing measurements. The LOCF technique is one of the most commonly used imputation methods. This method uses the last recorded measurement for a participant for every missing measurement. For example, consider a study in which HbA1c is measured on six occasions over a 1-year study period. If a participant had only the first three measurements, the third (i.e., last) measurement would be imputed for measurements four through six. ❺ *From this example, it is clear the LOCF technique may not only dilute treatment effects, but also introduce significant bias into the results of statistical tests.* Therefore, be cautious when evaluating a study that has employed the LOCF technique. As a result of these limitations, better imputation approaches have been suggested, including multiple imputation and the use of maximum likelihood estimation. These techniques are beyond the scope of this chapter, but are valid and produce unbiased estimates if data are considered missing at random.^{19,20}
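
The imputation just described can be sketched in a few lines of Python. This is an illustration only (the HbA1c values are hypothetical), and, per the caution above, it is shown to clarify the mechanics of LOCF rather than to endorse it:

```python
def locf(measurements):
    """Carry the last observed (non-None) value forward over missing entries."""
    filled, last = [], None
    for value in measurements:
        if value is not None:
            last = value
        filled.append(last)
    return filled

# Hypothetical participant: HbA1c recorded at the first three of six visits
hba1c = [8.2, 7.9, 7.5, None, None, None]
print(locf(hba1c))  # the visit-3 value (7.5) is imputed for visits 4-6
```

Notice that the imputed values assume the participant's HbA1c never changed after dropout, which is exactly how LOCF can dilute or bias an estimated treatment effect.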

When analyzing a parallel-groups design, the traditional approach is to conduct an endpoint analysis using only the final measurement. If the DV is continuous, analysis will typically require an independent samples t-test for two groups, one-way analysis of variance (ANOVA) for more than two groups, or analysis of covariance (ANCOVA) for two or more groups, statistically controlling for a baseline DV measurement. If the DV is categorical, an endpoint analysis will typically require a chi-square test or logistic regression analysis. Note that these analyses are discussed in detail in the Statistical Tests section later in the chapter.
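
For a continuous DV with two arms, the endpoint analysis described above reduces to an independent-samples t-test. The sketch below computes the pooled-variance form of the test statistic on hypothetical endpoint data (equal variances assumed); in practice, the statistic would be compared against a critical value from the t distribution, or a statistical package would report the p value directly.

```python
import math
import statistics

def independent_t(group_a, group_b):
    """Pooled-variance independent-samples t statistic."""
    na, nb = len(group_a), len(group_b)
    va, vb = statistics.variance(group_a), statistics.variance(group_b)
    pooled = ((na - 1) * va + (nb - 1) * vb) / (na + nb - 2)
    se = math.sqrt(pooled * (1 / na + 1 / nb))  # standard error of the difference
    return (statistics.mean(group_a) - statistics.mean(group_b)) / se

# Hypothetical week-4 symptom counts for two treatment arms
arm_a = [2.0, 3.0, 1.0, 2.0, 4.0]
arm_b = [5.0, 6.0, 4.0, 7.0, 5.0]
t_stat = independent_t(arm_a, arm_b)  # negative: arm A had fewer symptoms
```

Note how the statistic is a mean difference divided by a standard error, which is why reducing the standard error (e.g., via a very large sample) mechanically inflates the test statistic.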

For example, consider a study designed to assess the effect of lubiprostone compared to placebo in treating chronic constipation associated with Parkinson disease. Following randomization, this 1-month study will assess constipation symptoms twice—at the end of the second and fourth week. An endpoint analysis would only consider the treatment effect at the end of week 4, ignoring the measurement at the end of week 2. Thus, it is clear that this type of analysis does not consider the repeated measures, and, as a result, does not consider the changes occurring over time. Further, an endpoint analysis does not take full advantage of statistical power increases from a longitudinal design as discussed in the Design and Analysis of Clinical Trials section.

By contrast, to assess for change over time using repeated measures for a continuous DV, researchers often employ a mixed between-within ANOVA or mixed-effects linear regression. Briefly, these analyses assess differences between treatment groups as well as changes over time within treatment groups. The major benefit of these analyses is provided by the interaction effect, which evaluates whether the change over time was different between the two treatment arms (see Figure 8–8). To evaluate group differences in change over time for a categorical DV, researchers are required to employ a mixed-effects logistic regression. This analysis allows the researcher to evaluate group differences and interaction effects. In general, mixed-effects analyses are extremely complex and even a brief overview of this analysis is well beyond the scope of this chapter. With that said, guidance on when these analyses are most appropriate is provided in the Selecting the Appropriate Statistical Tests section. Interested readers are encouraged to consult the suggested readings at the end of the chapter for a treatment of this analysis.

As an example of analyzing a parallel-groups design, reconsider the lubiprostone example. Say the results are presented graphically in Figure 8–8. Notice that the two groups experience drastically different changes in constipation symptoms between the end of week 2 and the end of week 4. The between-group difference in constipation symptoms over time is the interaction effect. Because a lower number of symptoms is indicative of treatment success, Figure 8–8 shows that lubiprostone is more effective than placebo over the study period.

Following a statistically significant interaction effect, researchers can conduct follow-up or *post hoc* tests to determine where the significant difference occurred. In Figure 8–8, there is likely no statistically significant difference in constipation symptoms between groups at the end of week 2, but there is a likely statistically significant difference at the end of week 4. Therefore, *post hoc* tests can assist researchers in identifying the shortest treatment time or minimum effective dose by indicating where treatment effects diminish (e.g., the slope plateaus) or indicating when differences between doses converge and are no longer statistically significant.

The purpose of the crossover design is to study treatment effects using the participant as his or her own control. Remember, in a crossover design each participant receives all treatment arms, with an adequate washout period occurring between arms to prevent the carryover of treatment effects. For example, consider the lubiprostone example described in the Analyzing Parallel-Groups Designs section. Instead of having two treatment arms as in a parallel-groups design, the crossover design would randomize participants into a different treatment order. That is, Group A would receive lubiprostone and Group B would receive placebo for the first 2 weeks. At the end of the 2-week study period, constipation symptoms are assessed. Next, all participants are required to have a 3-week washout period purported to effectively eliminate any carryover effects of the lubiprostone. Note that during the washout period the placebo is not given either. After the washout period, Group A would receive placebo and Group B would receive lubiprostone for 2 weeks. At the end of this second 2-week study period, constipation symptoms are assessed again.

Statistically, a crossover design requires an initial test for order effects. If the DV is continuous, order effects are assessed by evaluating the interaction effect between order of treatment and the IV using a mixed between-within ANOVA or a mixed-effects linear regression. These analyses are described in detail in the Selecting the Appropriate Statistical Test section later in the chapter, but for now consider both analyses useful when evaluating interaction effects. When testing for order effects, a statistically significant interaction indicates that the order in which the treatments were received influenced the treatment effect. For example, say that receiving lubiprostone prior to placebo had a different treatment effect than receiving placebo prior to lubiprostone. A clear order effect is presented in Figure 8–9. Notice the effect of placebo differs depending on the order in which it was received. The presence of a statistically significant order effect can have multiple explanations. Considering the example and Figure 8–9, it is clear the washout period may not have been long enough, as the effectiveness of lubiprostone carried over to measurement of the placebo. In addition, the groups may have been initially different following randomization. Whenever a statistically significant order effect is identified, no further analysis is conducted as any subsequent analyses are biased by this order effect. However, if the interaction is nonsignificant and the DV is continuous, an endpoint analysis is typically evaluated via paired-samples t-test (also explained in detail in the Selecting the Appropriate Statistical Test section later in the chapter). That is, treatment differences between lubiprostone and placebo are assessed without respect to the order in which the treatments were received.
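
If no statistically significant order effect is found, the endpoint paired-samples t-test mentioned above can be sketched directly from the within-participant differences. The symptom counts below are hypothetical; as with the independent-samples case, the resulting statistic would be compared against a critical value from the t distribution.

```python
import math
import statistics

def paired_t(scores_a, scores_b):
    """Paired-samples t statistic computed from within-participant differences."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    return statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(n))

# Hypothetical symptom counts for the same participants on each treatment
on_lubiprostone = [2.0, 3.0, 1.0, 4.0, 2.0]
on_placebo = [5.0, 4.0, 4.0, 6.0, 5.0]
t_stat = paired_t(on_lubiprostone, on_placebo)  # negative: fewer symptoms on drug
```

Because the differencing removes each participant's baseline level, between-participant variability drops out of the denominator — the same mechanism by which crossover and repeated measures designs gain statistical power.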

The statistical analyses and considerations used when analyzing adaptive designs are similar to those used for parallel-groups and crossover designs. If the DV is continuous, statistical tests used during the interim analyses or on the study endpoint typically include an independent-samples t-test for comparing two groups, one-way ANOVA for comparing more than two groups, or ANCOVA for statistically controlling a baseline DV measurement. Note that the DV for the interim analyses is the most recent measurement. If the DV is categorical, an interim analysis will typically require a chi-square test or logistic regression analysis. All of the analyses mentioned are discussed in detail in the Statistical Tests section later in the chapter.

Several important considerations are required when analyzing and interpreting the results from adaptive designs.^{17} First, all interim and endpoint analyses suffer the potential risk of inflated Type I errors. Briefly, a Type I error can be thought of as a false positive. That is, the statistical test could indicate a statistically significant result when, in truth, no effect exists. Type I error has been discussed in Chapter 4 and in the Statistical Inference section later in this chapter. With this definition in mind, as the number of interim analyses increases, the probability of finding a false positive increases as well. To adjust for this possibility, researchers will often make the criteria for achieving statistical significance more conservative by adjusting alpha (see the Statistical Inference section later in the chapter for a full description of alpha). Note that adjusting alpha is not a ubiquitous practice and there is no universal recommendation for doing so. Just be aware that inflated Type I errors may be an issue in an ACT.

Second, estimates of population parameters may be biased. That is, any adaptation can reduce the generalizability to the original population sampled, produce underestimated or overestimated parameter estimates, and produce misleading confidence intervals. Researchers must carefully document and provide a rationale for adaptations resulting from interim analyses. Failure to do so indicates the results should be viewed with extreme caution.

Finally, when all adaptations are considered, the overall results of the endpoint analyses may actually be invalid, providing inaccurate support for treatment effects. Consumers of research are urged strongly to consider these factors when interpreting and evaluating research using adaptive designs.

Inferential statistics provide the probability that a difference or association observed in sample data actually exists in the population. Inferential statistics allow researchers to make rational decisions in the presence of random processes and variation. This section presents several requirements that need to be considered prior to conducting and evaluating the result of a statistical test. First, the sampling distribution and application of the **central limit theorem** are discussed, followed by hypothesis testing, as well as Type I and Type II errors and statistical power. Then, the difference between statistical and clinical significance is presented. Finally, the appropriate uses of parametric and nonparametric statistical tests are provided as well as a brief description of degrees of freedom.

Statistical inference uses sampled data to make conclusions about a specific population. Because quality samples are chosen randomly, the means produced from these samples are also random.^{21} Given this information, it is important to remember that the mean may not be exactly representative of the population and will vary from sample to sample. However, the law of large numbers states that as the size of the sample increases, the sample mean will move closer to the population mean. Further, as the number of samples increases, the mean of the sample means will begin to approximate the population mean. ❻ *The central limit theorem states when equally sized samples are drawn from a non-normal distribution, the plotted mean values from each sample will approximate a normal distribution as long as the non-normality was not due to outliers*.

A distribution of the sample means calculated from repeated samples is termed the distribution of sampling means. For example, consider a study to analyze the mean value of blood urea nitrogen (BUN) in the general, healthy population, where the researcher selects 100 random samples of 10 healthy participants. Each sample of 10 will provide a mean BUN value. Although mean BUN will vary from sample to sample, when the 100 sample means are plotted in a histogram, this distribution of sampling means will begin to approximate a normal distribution centered on the true population mean.
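
The BUN example can be simulated in a few lines. This sketch assumes a skewed (exponential) population with a mean of 15 mg/dL purely for illustration; the point is that the 100 sample means cluster around the population mean even though the population itself is far from normal.

```python
import random
import statistics

random.seed(42)  # for reproducibility

POP_MEAN = 15.0  # assumed population mean BUN (mg/dL); illustration only

# 100 random samples of 10 "participants" each from a skewed distribution
sample_means = [
    statistics.mean(random.expovariate(1 / POP_MEAN) for _ in range(10))
    for _ in range(100)
]

grand_mean = statistics.mean(sample_means)  # close to POP_MEAN
```

Plotting `sample_means` as a histogram would show the roughly bell-shaped distribution of sampling means the central limit theorem predicts.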

The central limit theorem states that sufficiently large samples should approximate a normal distribution of sampling means as long as the data do not contain outliers. A sufficiently large sample is generally considered to consist of 30 or more participants or a situation where the degrees of freedom for the statistical test are greater than 20.^{22} Note that degrees of freedom are discussed later in this section. In addition, researchers must be careful not to confuse having a large enough sample to achieve statistical significance with having a large enough sample to be representative of the population.

As with any normal distribution, the standard deviation of the distribution of sampling means can be calculated. This is termed the standard error of the mean (SEM). The SEM is equal to the standard deviation divided by the square root of the sample size, and reflects variability among the sample means. Further, standard error is used in the majority of statistical tests. It is important when evaluating the literature to understand the relationship between the standard deviation and the SEM. Researchers often present the SEM to show variability or noise in their data. Note that the SEM will always be smaller than the standard deviation. Thus, presenting the SEM can make the data appear less variable and, therefore, more appealing.
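
The relationship between the standard deviation and the SEM is easy to verify directly. The LDL values below are hypothetical:

```python
import math
import statistics

def sem(values):
    """Standard error of the mean: sample SD divided by the square root of n."""
    return statistics.stdev(values) / math.sqrt(len(values))

ldl = [100.0, 110.0, 95.0, 120.0, 105.0, 98.0]  # hypothetical LDL values (mg/dL)
sd = statistics.stdev(ldl)

# The SEM is always smaller than the SD, so "error bars" drawn from the SEM
# make the data look less variable than SD error bars would
assert sem(ldl) < sd
```

When evaluating a figure, always check whether the whiskers represent SD, SEM, or a confidence interval before judging how noisy the data are.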

A hypothesis indicates a theory about the population regarding an outcome the researcher is interested in studying. The statistical analyses discussed in this chapter evaluate two types of hypotheses, the **null hypothesis** and the alternative or research hypothesis. That is, the analyses discussed employ procedures generally known as null hypothesis significance testing (NHST). The null hypothesis (H_{0}) assumes no difference or association between the different study groups or variables, whereas the **alternative hypothesis** (H_{A} or H_{1}) states there is a difference or association between the different study groups or variables. A representative, ideally random, sample is then drawn from the population of interest to estimate the difference or relationship and test whether this difference or relationship rejects or fails to reject the null hypothesis. It is important to note that failing to reject the null hypothesis does not indicate the null hypothesis is true. This is a common misconception observed frequently in the literature. There are often many other reasons for failing to reject the null hypothesis including inadequate experimental design, inadequate control over extraneous variables, and inadequate sample size to detect the effect of interest, among others.

When testing a specific hypothesis, a researcher is required to determine whether their hypothesis is directional or not. A directional hypothesis requires a **one-tailed** hypothesis test, whereas a nondirectional hypothesis requires a **two-tailed** hypothesis test. For example, consider a hypothesis stating that initiating statin therapy will lower low-density lipoproteins (LDL). Note that use of the term lower implies directionality and requires a one-tailed test. If, however, the researchers were looking for any effect of statin therapy, whether lowering or raising LDL, the hypothesis is nondirectional and would require a two-tailed test. In the literature, it is generally more acceptable to use a two-tailed test, even if the hypothesis is directional, because a two-tailed test is more conservative statistically, thereby reducing the probability of spurious statistical significance.

It is essential that researchers establish how much error they are willing to accept before initiating a study. NHST can only result in four possible outcomes, which can be observed in Table 8–2. Type I and Type II errors have been discussed at length in Chapter 4, but briefly, a **Type I error** occurs when a statistical test rejects the null hypothesis by indicating a statistically significant difference when, in fact, the null hypothesis is true (i.e., false positive). A **Type II error** occurs when the researcher fails to reject the null hypothesis by not indicating a statistically significant difference when, in fact, the null hypothesis is false (i.e., false negative). Type I and Type II errors are interconnected; that is, as the probability of one error increases the other decreases. Researchers must consider these two errors carefully when designing studies, weighing whether a false positive is more or less concerning than a false negative.

| Decision | Truth: False H_{0} | Truth: True H_{0} |
|---|---|---|
| Reject H_{0} | Correctly reject H_{0} | Type I error |
| Fail to reject H_{0} | Type II error | Correctly fail to reject H_{0} |

Statistical power is the probability that a statistical test will find a statistically significant result when, in fact, a true effect exists. This topic has been discussed in Chapter 4. Essentially, increasing statistical power reduces the probability of committing a Type II error; however, it can also increase the probability of committing a Type I error as described in the previous paragraph. Statistical power is influenced by four factors: alpha, defined as the probability value at which the null hypothesis is rejected; effect size, defined as the size of the treatment effect (discussed later); error variance, which reflects the precision of the measurement instrument; and the sample size. Statistical power can be increased by increasing alpha, effect size, or sample size as well as by decreasing error variance. Note that statistical power of 0.80 has been defined as adequate.^{23} However, some researchers use 0.90 or higher in the biomedical sciences, indicating that a false negative is more detrimental than a false positive, such as when evaluating the effectiveness of a novel breast cancer treatment.
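
The effect of sample size on power can be illustrated by simulation. This sketch is an illustration under strong assumptions (a one-sample, two-sided z-test with a known SD of 1 and a true effect of 0.5 SD), not a formal power analysis; it simply counts how often the null hypothesis is correctly rejected.

```python
import math
import random
import statistics
from statistics import NormalDist

random.seed(1)  # for reproducibility

def simulated_power(effect, n, alpha=0.05, trials=2000):
    """Approximate power of a two-sided one-sample z-test by simulation.

    Samples of size n are drawn from N(effect, 1); power is the fraction
    of trials in which the null hypothesis (mean = 0) is rejected."""
    crit = NormalDist().inv_cdf(1 - alpha / 2)  # 1.96 for alpha = 0.05
    rejections = 0
    for _ in range(trials):
        sample = [random.gauss(effect, 1) for _ in range(n)]
        z = statistics.mean(sample) * math.sqrt(n)  # SD known to be 1
        if abs(z) > crit:
            rejections += 1
    return rejections / trials

# Larger samples yield higher power for the same effect size
print(simulated_power(0.5, 10), simulated_power(0.5, 50))
```

Rerunning with a larger `effect` or a more liberal `alpha` also raises the estimated power, matching the four factors listed above.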

The next step in the research process is to employ a statistical test to assess whether a difference or relationship is due to random variation. The researcher is interested in determining whether the observed difference or relationship rejects or fails to reject the null hypothesis. **Alpha** (*α*) is the conventionally designated decision criterion for rejecting the null hypothesis and ranges from 0 to 1. Alpha is defined as the theoretical probability of rejecting the null hypothesis conditional on the null hypothesis being true. Alpha does not represent the exact Type I error rate; instead, alpha is the upper bound of the Type I error rate. Although most studies typically set alpha at 0.05, this value is arbitrary. A more conservative (e.g., *α* = 0.01) or liberal (e.g., *α* = 0.10) alpha may be used in an attempt to show greater support for rejecting or retaining the null hypothesis, respectively. As an example, conservative alpha levels are often used to protect against Type I errors, whereas liberal alpha values are often used in drug equivalency trials where researchers are using data to show nonsignificant differences between the drugs.

Conceptually related to alpha is the probability value (i.e., **p value**). Statistical tests produce *p* values that range from 0 to 1. The formal definition of a *p* value is the probability of obtaining a test statistic as large as or larger than the one obtained, conditional on the null hypothesis being true. Graphically, a *p* value is directly indicative of the area under the probability distribution used by the statistical test. Note that the area under any proper probability distribution is 1. A quick glance at the appendices of any introductory statistics textbook will provide area under the curve values for various distributions. In more general terms, *p* = 0.05 indicates 5% of the distribution’s area is to the right or left of the associated test statistic, depending on whether the statistic is positive or negative.

For example, consider a one-tailed statistical test with a positive test statistic and reconsider Figure 8–5. A *z*-score of 1.645 leaves approximately 5% of the distribution to the right of this value. This example highlights how a *p* value less than 0.05 indicates that less than 5% of the values (or area) lies beyond a specific test statistic value. As an alternative example, consider a two-tailed (i.e., nondirectional) statistical test. A *z*-score of 1.96 leaves approximately 2.5% of the distribution to the right of this value, whereas a *z*-score of −1.96 leaves approximately 2.5% of the distribution to the left of this value. Thus, a two-tailed test also leaves 5% of the area under the distribution when the two tails are aggregated. Regardless of whether a one- or two-tailed test was used, in general, if a *p* value is less than the specified alpha, the researcher rejects the null hypothesis and the difference or relationship is considered statistically significant. Alternatively, if the *p* value is equal to or greater than alpha, the researcher has failed to reject the null hypothesis and the difference or relationship is not considered statistically significant.
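
To make the tail areas concrete, the one- and two-tailed areas for the familiar cutoffs z = 1.645 and z = 1.96 can be computed directly from the standard normal CDF. This minimal sketch uses only the Python standard library:

```python
from statistics import NormalDist

z = NormalDist()  # standard normal distribution (mean 0, SD 1)

# One-tailed: area to the right of z = 1.645 is approximately 0.05
one_tailed_p = 1 - z.cdf(1.645)

# Two-tailed: 2.5% beyond +1.96 plus 2.5% beyond -1.96 totals about 0.05
two_tailed_p = 2 * (1 - z.cdf(1.96))

print(round(one_tailed_p, 3), round(two_tailed_p, 3))  # → 0.05 0.05
```

The same calculation underlies the printed z tables in statistics textbook appendices.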

❼ *Be aware that there is continuing difficulty when interpreting* p *values, even among statisticians*.^{24}*Therefore, it is important to be cognizant of several misconceptions about p values that are stated commonly in the literature.* First, a *p* value is not the exact probability of committing a Type I error. Second, the *p* value is not the probability that the null hypothesis is true, nor is 1 − *p* the probability that the alternative hypothesis is true. Third, a small *p* value is not evidence that the results will replicate nor can *p* values be compared directly across studies. Fourth, a *p* value indicates nothing about the magnitude of a difference or relationship. For example, a small *p* value (e.g., *p* = 0.00001) does not indicate a larger treatment effect than a larger *p* value (e.g., *p* = 0.049). Finally, for better or worse, alpha in NHST is treated like a cliff. For example, if alpha is set at 0.05 and a *p* value of 0.051 is obtained, a researcher will often state that a *p* value of 0.051 was trending toward significance or that the *p* value indicated marginal or moderate significance. These statements are often wholly incorrect! Thinking back to the discussion of one-tailed versus two-tailed significance tests provided in the Hypothesis Testing section, trending can only occur if the hypothesis is directional using a one-tailed test and the result obtained was in the hypothesized direction. Results can never be trending toward significance if the hypothesis was nondirectional using a two-tailed test because the alternative claim that the result was trending away from significance cannot be challenged.

Statistical significance can also be established by calculating a confidence interval around the estimated population parameters (e.g., sample means, slopes) or test statistics (e.g., *t*). A confidence interval provides a range of scores likely to contain the unknown population parameter, and generally, a confidence interval is reported using a 95% confidence level. The calculation of confidence intervals includes both sample size and variability where smaller confidence intervals indicate less variability in the data. A 95% confidence interval is calculated by multiplying 1.96 by the SEM and adding or subtracting this value from the estimated parameter to find the upper and lower confidence limits, respectively. A 95% confidence interval indicates that if repeated random sampling occurs within the population of interest under consistent conditions (e.g., sample size), the true population parameter would be included in the interval 95% of the time. Thus, every value within the interval is considered a possible value of the population parameter.
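
The interval calculation just described can be sketched in a few lines. The HbA1c values are hypothetical, and 1.96 is the two-tailed 5% cutoff from the standard normal distribution:

```python
import math
import statistics

def ci95(values):
    """Normal-approximation 95% CI for a mean: mean ± 1.96 × SEM."""
    mean = statistics.mean(values)
    sem = statistics.stdev(values) / math.sqrt(len(values))
    return (mean - 1.96 * sem, mean + 1.96 * sem)

# Hypothetical HbA1c values for one treatment group
hba1c = [7.1, 6.8, 7.4, 7.0, 6.9, 7.3, 7.2, 6.7]
lower, upper = ci95(hba1c)
```

Note that with small samples, researchers would typically replace 1.96 with the appropriate critical value from the t distribution; the z-based interval is shown here only to mirror the calculation described in the text.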

Confidence intervals can be used to indicate statistical significance without the use of statistical tests. This process varies according to whether the researcher is examining population parameters or test statistics. For example, consider Figures 8–10, 8–11, and 8–12. In each figure, HbA1c values are being compared for a treatment and placebo group. Further, mean HbA1c for each group is presented as the circle, while the 95% confidence intervals are the whiskers extending above and below the means. The overlap of the confidence intervals between groups is directly related to *p* values; that is, less overlap is indicative of larger differences resulting in smaller *p* values. In Figure 8–10, notice that the confidence intervals do not overlap; thus, this difference can be assumed statistically significant at least at *p* < 0.05. Statistical significance can also be indicated when the confidence intervals overlap, as long as the overlap is less than approximately 50% of a whisker, as in Figure 8–11.^{25} Finally, as shown in Figure 8–12, substantial overlap in confidence intervals indicates a nonstatistically significant difference (i.e., *p* > 0.05).

Determining statistical significance using confidence intervals around test statistics (e.g., *t*) uses procedures that vary based on the statistical test employed. For most parametric tests of group differences and correlation, a 95% confidence interval around the test statistic containing zero is not considered statistically significant at an alpha of 0.05. That is, these types of analyses essentially test whether the difference or relationship is different from zero (i.e., H_{0} = 0). Thus, a 95% confidence interval containing zero essentially indicates that it is plausible the true population difference or relationship could be zero.^{26} For example, consider the commonly used independent-samples t-test to evaluate for a difference between two group means (this statistical test is discussed in the Statistical Tests section below). Say that the test statistic produced was 2.0, but the confidence interval ranged from −0.50 to 4.50. Because this interval contains zero, the difference would not be considered statistically significant; it remains plausible that the true population difference is, in fact, zero.

Alternatively, a 95% confidence interval for a test statistic based on a ratio (e.g., an odds ratio) that contains 1 is not considered statistically significant at an alpha of 0.05. Remember from the Epidemiological Statistics section that an odds ratio or relative risk of 1 indicates no difference. Thus, a 95% confidence interval for a relative risk or an odds ratio that contains 1 indicates that it is plausible the true population parameter could in fact be 1. For example, say the result of a logistic regression analysis produced an odds ratio of 2.5 (this statistical test is discussed in the Statistical Tests section below). This value is greater than 1, indicating an increase in the odds of experiencing the event. However, say the 95% confidence interval ranged from 0.50 to 15.0. Because the confidence interval contains 1, the odds ratio of 2.5 is not statistically significant using an alpha of 0.05.
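One way to see this in practice is to compute a Wald-type 95% confidence interval for an odds ratio from a 2 × 2 table. This is a sketch with invented counts, using the standard large-sample interval built on the log-odds-ratio scale:

```python
import numpy as np
from scipy import stats

# Hypothetical 2 x 2 table (counts invented):
# rows = exposed / unexposed, columns = event / no event
a, b = 15, 85
c, d = 10, 90

odds_ratio = (a * d) / (b * c)
# Wald 95% confidence interval, computed on the log-odds-ratio scale
se_log_or = np.sqrt(1/a + 1/b + 1/c + 1/d)
z = stats.norm.ppf(0.975)
lower = np.exp(np.log(odds_ratio) - z * se_log_or)
upper = np.exp(np.log(odds_ratio) + z * se_log_or)

# An interval containing 1 means "no difference" remains plausible
significant = not (lower <= 1.0 <= upper)
print(round(odds_ratio, 2), (round(lower, 2), round(upper, 2)), significant)
```

Here the point estimate is above 1, but the interval spans 1, so the result would not be declared statistically significant at an alpha of 0.05.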

❽ *When evaluating the significance of a finding, keep in mind that statistical significance (e.g., p < 0.05) does not indicate clinical significance. Statistical significance can be manipulated in several ways, most easily by increasing sample size drastically*. Increasing the sample size reduces the standard error on which the test statistic is based; reducing the standard error increases the value of the test statistic, necessarily resulting in a smaller *p* value.

As an example, using a sample of 10,000 patients, researchers may find a CCB reduced blood glucose significantly in Type 1 DM patients. However, on examining the estimated parameters, the statistically significant difference in blood glucose was only 2 mg/dL, a decrease considered clinically insignificant.
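How a huge sample renders a trivial difference "significant" can be sketched by simulation. In this hedged illustration, the true difference (2 mg/dL) and the between-patient SD (20 mg/dL) are both invented for the sake of the example:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
true_diff = 2.0   # a clinically trivial 2 mg/dL glucose difference (invented)
sd = 20.0         # assumed between-patient SD (invented)

def glucose_p_value(n_per_group):
    """Simulate one two-group trial and return its t-test p value."""
    control = rng.normal(100.0, sd, n_per_group)
    treated = rng.normal(100.0 - true_diff, sd, n_per_group)
    return stats.ttest_ind(control, treated).pvalue

p_small = glucose_p_value(20)      # small trial: typically nonsignificant
p_large = glucose_p_value(10_000)  # huge trial: the same tiny difference reaches p < 0.05
print(p_small, p_large)
```

The estimated difference is the same tiny 2 mg/dL in both trials; only the standard error, and therefore the *p* value, changes with sample size.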

This example highlights the importance of identifying and interpreting the clinical significance, or effect size, of all studies. While a complete discussion of effect size is beyond the scope of this chapter, larger effect size values are always preferred. Briefly, two different types of effect sizes exist. First, standardized difference effect sizes indicate a standardized difference between groups in standard deviation units. Examples commonly seen in the literature include Cohen’s **d**, Glass’ **Δ**, and Hedges’ **g**. As indicated by their name, standardized difference effect sizes are used when evaluating mean differences between groups. Further, because these effect sizes are standardized, it may be easier to think of them as z-scores. Standardized difference effect sizes range from −∞ to +∞, with larger absolute values indicating larger effects. Second, effect size may also be reported as the proportion of variance explained; that is, how much of the reason a participant had a specific value of the DV is due to the IV. Examples commonly observed in the literature include R^{2}, ω^{2}, and η^{2}. Proportion of variance effect sizes are specific to tests of relationships or association as described in the Statistical Tests section below. Their values range from 0 to 1, with higher values indicating larger effects.
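A standardized difference effect size is straightforward to compute by hand. The sketch below implements Cohen's **d** using the pooled standard deviation and applies it to two invented comparisons that share the same 2 mg/dL mean difference but have very different spreads:

```python
import numpy as np

def cohens_d(x, y):
    """Standardized mean difference: (mean(x) - mean(y)) / pooled SD."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * x.var(ddof=1) + (ny - 1) * y.var(ddof=1)) / (nx + ny - 2)
    return (x.mean() - y.mean()) / np.sqrt(pooled_var)

# Hypothetical values: the same 2 mg/dL mean difference in both comparisons,
# but very different within-group spreads, hence very different effect sizes
d_tight = cohens_d([98, 100, 102], [100, 102, 104])  # within-group SD of 2
d_wide = cohens_d([80, 100, 120], [82, 102, 122])    # within-group SD of 20
print(d_tight, d_wide)  # → -1.0 and -0.1
```

The identical raw difference is a large effect (|d| = 1.0) against a tight spread but a small effect (|d| = 0.1) against a wide one, which is exactly why raw differences alone cannot convey clinical meaning.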

Given this information, it is important to remember that the definition of clinical significance varies by substantive area; thus, the definition of clinically significant to a bench researcher may be qualitatively different from an evidence-based practitioner. Finally, not all studies will provide an effect size estimate, especially in the biomedical sciences. Thus, research must be viewed with warranted skepticism until it can be determined whether the statistically significant difference or relationship is clinically meaningful.

The primary difference between parametric and nonparametric statistical tests is that parametric tests make assumptions regarding the descriptive characteristics of the normal distribution (i.e., mean, variance, skewness, and kurtosis), whereas nonparametric tests make few or no distributional assumptions. In general, **parametric tests** are used only for interval and ratio scales, whereas **nonparametric tests** can be employed for any scale of measurement. Regardless, if the DV is measured on a continuous scale, the decision of which statistical test to employ typically begins with parametric tests. Because they assume a normal distribution, all parametric tests have several restrictive and easily violated assumptions. These assumptions are testable, but vary depending on the statistical test employed; therefore, in the Statistical Tests section below, the assumptions are described for each statistical test. It is important to know that employing a parametric test when its distributional assumptions are violated will lead to inaccurate, biased, and unreliable parameter estimates. Therefore, when reading a journal article, if the authors neglect to provide information regarding assumption testing, caution should be exercised, as it is unknown how much bias exists in their results.

Some assumption violations challenge the robustness of parametric tests more than others. **Robustness** is defined as the ability of a statistical test to produce correct inferences in the presence of assumption violations. In most situations, an assumption violation requires the researcher to employ a nonparametric test, and most parametric tests have a widely used nonparametric alternative. Most, but not all, nonparametric statistical tests are distribution free, meaning they do not make inferences based on a defined probability distribution. Additional strengths of nonparametric tests include the ability to analyze small samples and the ability to analyze data from several different populations.^{27} However, it must be noted that if the assumptions of a parametric test are tenable, the parametric test has greater statistical power to detect real effects than its nonparametric alternative(s).^{28}

**Degrees of freedom** (df) are a vital component of all statistical tests, as the probability distributions used to determine statistical significance are based on them. Degrees of freedom are provided for all statistical tests and are a useful indicator of adequate sample size in the presence of assumption violations. For example, recall that the central limit theorem states the distribution of sampling means is approximately normal when degrees of freedom are greater than 20. Thus, the central limit theorem operates independently of the distribution of the actual raw data. Therefore, if a researcher indicates that the degrees of freedom for the statistical test are greater than 20, the results can typically be viewed as robust.
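The central limit theorem behind this rule of thumb is easy to demonstrate by simulation. The sketch below draws many equally sized samples from a deliberately skewed (exponential) distribution and shows that the distribution of sample means is far less skewed than the raw data; all parameters here are invented for illustration:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# A clearly non-normal, right-skewed "population" (exponential)
population = rng.exponential(scale=1.0, size=100_000)

# Draw many equally sized samples and keep only each sample's mean
sample_means = rng.exponential(scale=1.0, size=(2_000, 30)).mean(axis=1)

# The raw data are strongly skewed; the distribution of the means
# is much closer to symmetric, as the central limit theorem predicts
print(stats.skew(population), stats.skew(sample_means))
```

Plotting `sample_means` as a histogram would show the familiar bell shape forming even though every individual observation came from a skewed distribution.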

The formal definition of degrees of freedom is obscure and beyond the scope of this chapter; however, a brief description is provided. Degrees of freedom are conceptually defined as the number of data points that are free to vary. Not all data points are free to vary, because some values are fixed by a specific sample parameter, such as the group mean. For example, consider a group of three participants who have a mean age of 50. Because the mean is 50, two of the participants can be of almost any age greater than zero. The third participant, however, must be the age that creates the mean of 50. Thus, if participant A is 40 and participant B is 45, participant C must be 65. That is, (40 + 45 + 65)/3 = 50. If instead participant A is 75 and participant B is 40, then participant C must be 35. Therefore, because the ages of two of the three participants could be just about any age greater than zero, there are two degrees of freedom (i.e., 2 df).

Calculating degrees of freedom becomes increasingly complex in accordance with the complexity of the statistical test. That is, degrees of freedom for bivariate tests, such as those evaluating the difference between two group means, are easier to calculate and conceptualize than those for multivariate tests with multiple DVs. For the analyses described in the Statistical Tests section later in the chapter, the calculation of degrees of freedom is not explained explicitly, but it is important to note that the distribution, probability, and statistical significance of all statistical tests are based on degrees of freedom. Further, degrees of freedom will usually be subscripted next to the test statistic for all parametric tests as well as for some nonparametric tests, such as the chi-square test. Subscripted degrees of freedom are provided in the brief results sections for all statistical tests described.

While the information presented in this chapter has outlined the underlying processes of most statistical tests used in the biomedical sciences, the remainder of the chapter uses all previous information as a base to begin to integrate more directly useful information. This section presents decision trees that allow for determination of whether the appropriate statistical test was used in a journal article.

❾ *The selection of the appropriate statistical test is based on several factors including the specific research question, the measurement scale of the DV, distributional assumptions, the number of DV measurements as well as the number and measurement scale of IVs and covariates, among others*. Tables 8–3 and 8–4 provide decision trees to identify the most appropriate statistical test based on the unique set of factors for statistical tests of group differences and statistical tests of association, respectively.

| Research Question | DV Scale | Distributional Assumptions Met | Repeated DV Measurements | Number of DV Measurements | Number of IVs | IV Levels | Covariates Allowed | Appropriate Statistical Test |
|---|---|---|---|---|---|---|---|---|
| Differences from known population | Continuous | Yes | No | 1 | 0 | | No | One-sample z test |
| | Continuous | Yes | No | 1 | 0 | | No | One-sample t-test |
| | Dichotomous | | No | 1 | 0 | | No | Binomial test |
| | Continuous | No | No | 1 | 0 | | No | Kolmogorov-Smirnov test |
| Between-group differences | Continuous | Yes | No | 1 | 1 | 2 | No | Independent-samples t-test |
| | Ordinal or higher | No | No | 1 | 1 | 2 | No | Mann-Whitney test |
| | Ordinal or higher | No | No | 1 | 1 | 2 | No | Median test |
| | Continuous | Yes | No | 1 | 1 | ≥2 | No | One-way ANOVA |
| | Ordinal or higher | No | No | 1 | 1 | ≥2 | No | Kruskal-Wallis test |
| | Continuous | Yes | No | 1 | 1 | ≥2 | Yes | ANCOVA |
| | Continuous | Yes | No | 1 | ≥2 | ≥2 | No | Factorial ANOVA |
| Within-group differences | Continuous | Yes | Yes | 2 | 0 | | No | Paired-samples t-test |
| | Ordinal or higher | No | Yes | 2 | 0 | | No | Signed-rank test |
| | Ordinal or higher | No | Yes | 2 | 0 | | No | Sign test |
| | Continuous | Yes | Yes | ≥2 | 0 | | No | One-way RM-ANOVA |
| | Ordinal or higher | No | Yes | ≥2 | 0 | | No | Friedman test |
| | Continuous | Yes | Yes | ≥2 | 1 | ≥2 | No | Mixed BW-ANOVA |
| | Dichotomous | | Yes | 2 | 0 | | No | McNemar test |
| | Dichotomous | | Yes | ≥2 | 0 | | No | Cochran Q test |

| Research Question | DV Scale | Distributional Assumptions Met | Number of IVs | IV Scale | Covariates Allowed | Appropriate Statistical Test |
|---|---|---|---|---|---|---|
| One-sample association | Categorical | | 1 | Categorical | No | Chi-square test |
| | Categorical | | 1 | Categorical | No | Fisher’s exact test |
| | Categorical | | 1 | Categorical | Yes | Mantel-Haenszel test |
| Correlation and regression | Continuous | Yes | 1 | Continuous | No | Pearson’s correlation |
| | Continuous | No | 1 | Continuous | No | Spearman’s rank order correlation |
| | Continuous | Yes | 1 | Any | No | Simple linear regression |
| | Continuous | Yes | ≥2 | Any | Yes | Multivariable linear regression |
| | Dichotomous | | 1 | Any | No | Simple logistic regression |
| | Dichotomous | | ≥2 | Any | Yes | Multivariable logistic regression |
| Within-group regression | Continuous | Yes | ≥0 | Any | Yes | Mixed-effects linear regression^{a} |
| | Dichotomous | | ≥0 | Any | Yes | Mixed-effects logistic regression^{a} |
| Time-to-event | Continuous | Yes | ≥2 | | Yes | Cox proportional-hazards model |
| Reliability | Nominal | | 0 | | No | Kappa |

^{a}Mixed-effects linear and logistic regression models are complex analyses beyond the scope of this chapter. Therefore, they are not described in the Statistical Tests section. However, be aware that these analyses are available. Also, note that there is varying terminology used among researchers when describing these types of models; thus, in the literature mixed-effects models may be termed multilevel, hierarchical, nested, random effects, random coefficient, or random parameter models. The actual procedure for conducting the analyses remains identical regardless of the label attached to them.

The first factor that needs to be addressed when selecting the most appropriate statistical test is whether the research question is phrased to evaluate differences or associations. In general, the last paragraph of the Introduction to any journal article should explicitly state the research questions, so pay close attention to the phrasing of these questions. This may seem mundane, but it is important to note that all parametric tests of group differences are mathematically equivalent to parametric tests of association. Therefore, various parametric statistical tests can often be used interchangeably to answer the same research question. This can be seen in Tables 8–3 and 8–4, where one-way ANOVA can be used in the same situations as simple linear regression analysis; the overall statistical inference would be identical. This can be a confusing concept for readers with less statistical training. However, in general, when a research question is phrased to evaluate group differences (e.g., to determine whether three treatment groups differ on some outcome), follow the decision tree provided in Table 8–3, and when a research question is phrased to evaluate associations between variables (e.g., to determine whether a patient’s age is associated with some outcome), follow Table 8–4.
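The equivalence of difference and association tests can be verified directly in the simplest (two-group) case: an equal-variance independent-samples t-test and a Pearson correlation between a dummy-coded group variable and the outcome produce identical *p* values. A sketch with invented data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
# Hypothetical data: dummy-coded group membership and a continuous outcome
group = np.repeat([0, 1], 25)                      # 0 = placebo, 1 = treatment
outcome = rng.normal(7.5, 1.0, 50) - 0.6 * group   # invented HbA1c values

# Phrased as a test of group differences
t_stat, p_diff = stats.ttest_ind(outcome[group == 1], outcome[group == 0],
                                 equal_var=True)

# Phrased as a test of association between the same two variables
r, p_assoc = stats.pearsonr(group, outcome)

print(p_diff, p_assoc)  # the two p values are identical
```

The same logic extends to one-way ANOVA and simple linear regression with a categorical predictor: the framing differs, but the underlying mathematics, and therefore the statistical inference, is the same.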

The second factor to consider is the measurement scale of the DV. This is a much more concrete concept, but it requires a thorough understanding of the four scales of variable measurement. Remember, nominal and ordinal scales are roughly classified as discrete or categorical, whereas interval and ratio scales are classified as continuous. Although it is often inappropriate, continuous variables may also be categorized into discrete variables. For example, categorization often occurs in the biomedical sciences with variables such as blood pressure, where exact systolic or diastolic blood pressure values are combined and categorized into low, normal, or high blood pressure.

The distributional assumptions of a statistical test involve complex explanations beyond the scope of this chapter. However, there are a few concepts to remember regarding distributions, most of which have been described in the Common Probability Distributions section above. First, distributional assumptions are required for all statistical tests using a continuous DV. Remember, the search for the most appropriate statistical test usually begins with parametric options, and all parametric statistical tests require a normal distribution (or application of the central limit theorem). Second, the distribution of the actual DV data is never considered when assessing distributional assumptions. Distributional assumptions are based on residual values, which represent the difference between the outcome predicted by the statistical model and the actual, observed outcome. Residuals are discussed in detail later in the simple linear regression analysis section. Third, if the distributional assumptions are violated, two options are generally available: the researcher could use a more appropriate distribution (see Table 8–1), or the researcher could employ a nonparametric statistical test. It is also common to see data transformed to attempt to force a normal distribution, but the considerable downside of this archaic technique was discussed previously in the Transforming Non-Normal Distributions section. It is more common in the biomedical literature to see a nonparametric test used, but be aware that more statistically savvy researchers will use alternative distributions if the distributional assumption is violated.
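The point that assumptions are checked on residuals, not on the raw DV, can be sketched with invented data: a skewed predictor makes the raw DV look non-normal even though the model's errors are perfectly normal. Here the normality check uses the Shapiro-Wilk test as one common (though not the only) choice:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
# Hypothetical data: a skewed predictor with normally distributed errors
x = rng.exponential(scale=2.0, size=200)
y = 3.0 + 1.5 * x + rng.normal(0.0, 1.0, size=200)

# Fit a simple linear regression and extract the residuals
slope, intercept = np.polyfit(x, y, 1)
residuals = y - (intercept + slope * x)

_, p_raw = stats.shapiro(y)            # normality test on the raw DV
_, p_resid = stats.shapiro(residuals)  # normality test on the residuals
print(p_raw, p_resid)
```

Judging normality from the raw DV here would wrongly condemn a model whose residuals, the quantity the assumption actually concerns, are well behaved.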

The number of DV measurements is critically important to determine whether an appropriate statistical test was used. Note that studies using one DV measurement are known as **cross-sectional**, whereas studies using two or more DV measurements are known as **longitudinal**. If the DV was measured on multiple occasions, there is inherent association or correlation across DV measurements. That is, DV values from the same person inherently have a higher correlation compared to DV values from different people. Therefore, if a statistical test does not account for this correlation, the standard errors will be biased and improper statistical inference will occur.
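The consequence of ignoring the correlation between repeated measurements can be sketched by comparing a paired t-test (which accounts for it) with an independent-samples t-test (which does not) on the same invented longitudinal data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(11)
n = 40
# Hypothetical longitudinal data: two measurements from the same patients
baseline = rng.normal(150.0, 20.0, n)                # e.g., systolic BP at baseline
followup = baseline - 5.0 + rng.normal(0.0, 4.0, n)  # correlated follow-up; true drop of 5

p_paired = stats.ttest_rel(baseline, followup).pvalue    # accounts for the pairing
p_unpaired = stats.ttest_ind(baseline, followup).pvalue  # wrongly ignores the pairing
print(p_paired, p_unpaired)
```

Because the within-person correlation removes most of the between-person variability, the paired analysis has a much smaller standard error; the cross-sectional analysis of the same data can miss a real change entirely.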

It is also critically important to consider the number of IVs and covariates, their scale of measurement, the number of categorical IV levels, and whether the IVs and/or covariates interact. Note that the distribution of the IV or covariate is never considered in the statistical analysis. With that said, the scale of measurement for each IV and covariate is important when deciding which statistical test to employ. In general, ANOVA will usually be employed for categorical IVs, whereas regression analysis is required for continuous IVs. Note that the wording of the last sentence was chosen specifically. That is, although it was stated earlier that ANOVA and regression are mathematically equivalent, ANOVA cannot be used with continuous IVs. However, regression analysis can be used with any combination of IVs and covariates measured on any scale. This can be a confusing distinction. Another important concept is the number of levels for each categorical IV. Remember, a level can be thought of as the number of groups being studied and evaluated. When testing group differences, researchers will typically employ a t-test for two levels and ANOVA for three or more groups. Determining the number of levels is important because with three or more groups, the statistical test is an **omnibus test**. That is, an overall test result will be provided indicating a difference between at least two of the groups, but the test will not indicate specifically which groups differ. The concept of an omnibus test is discussed throughout the Statistical Tests section later in the chapter. Finally, whether an IV interacts with another IV or covariate is critically important. It is important to note that ANOVA can handle interactions between categorical IVs. Remember, an interaction indicates that the value of one IV is dependent on the value of another IV. For example, treatment group differences may be smaller in older patients (i.e., a treatment group-by-age interaction). 
However, if there is an interaction between a categorical IV and a continuous covariate, an ANOVA-type analysis, such as ANCOVA, cannot be used and a form of regression analysis must be used instead. Whether an interaction exists is an empirical question that is testable and described in the ANCOVA section later in the chapter.

Finally, other factors exist when determining whether the appropriate statistical test was used, but many are beyond the scope of this chapter. One key factor worth considering, however, is based on the concept of clustering (also known as nesting). Classic examples of clustering include children nested within the same classroom or patients nested within the same doctor. Clustering creates statistical issues that are similar to using a cross-sectional analysis on longitudinal data. That is, DV measurements from children nested within the same classroom have greater associations compared to DV measurements from children in different classrooms. Failing to account for clustering will result in biased standard errors and incorrect statistical inference. Although a description of the statistical analysis for clustered data is too complex for this chapter, recognizing when an analysis should account for (or should have accounted for) clustering is relatively easy, and the number of clustering levels can get as complex as the researcher desires. For example, patients could be nested within a doctor, the doctor could be nested within a clinic, the clinic could be nested within a hospital system, the hospital system could be nested within a city, and so on. When reading a journal article, take time to consider whether clustering should have been considered by the researchers. Do not be disheartened by the possibility of seeing an exorbitant number of clustering levels in any journal article. If clustering is considered, most studies in the biomedical sciences will only use two or three clustering units. The takeaway message is that careful thought must be undertaken when evaluating a study, especially when considering clustering. If clustering levels were not considered, but should have been, interpret all results with caution.

As an example of how to use Tables 8–3 and 8–4, say a researcher wanted to examine the effect a new statin medication had on low-density lipoprotein (LDL) cholesterol using a sample of 200 healthy patients. Although LDL has received a bad reputation, research has shown that the size of the LDL particles carrying the cholesterol is more predictive of future cardiovascular problems than the absolute LDL value measured in mg/dL.^{29} More specifically, cholesterol carried by large, buoyant LDLs (i.e., Pattern A) has little association with cardiovascular problems, whereas cholesterol carried by small, dense LDLs (i.e., Pattern B) has been associated with a myriad of cardiovascular problems.^{30} Therefore, a statin that only targets cholesterol carried by small, dense LDL is needed. Based on lipoprotein particle profile (LPP) testing, patients were placed into one of two groups based on LDL particle size (i.e., the IV: Pattern A versus Pattern B). The outcome for the study was LDL measured in mg/dL. Baseline measurements and demographic data were used as covariates and included categorized age (i.e., 40–49, 50–59, etc.), race, socioeconomic status indicated by whether the patient’s mother graduated from high school, comorbid conditions, concurrent medications, and baseline LDL. The researchers hypothesized that the statin should reduce LDL significantly more in Pattern B patients compared to Pattern A patients. The analysis was completed at the end of a 6-month study period. In the method section, the researchers stated that baseline characteristics and demographic data were compared between the two groups of patients. Due to the presence of outliers, Mann-Whitney tests were used for all continuous variables and chi-square tests were used for categorical variables. For the primary analysis, multiple linear regression was used. No assumption violations were indicated.

The example above contains information similar to what is typically provided in the overwhelming majority of published literature. Thus, using the example information above, Tables 8–3 and 8–4 can be used to determine whether the statistical tests employed were appropriate. Note that all three of these tests are described in detail later in the chapter. For the baseline and demographic data, using Table 8–3, the Mann-Whitney test was used because the DV scale was continuous, distributional assumptions were not met due to outliers, the DV was only measured on one occasion, there was one IV with two levels, and no covariates were considered. Further, using Table 8–4, chi-square tests were used because both the DV and IV were categorical and no covariates were considered. For the primary analysis, using Table 8–4, multiple linear regression analysis was used because the scale of the DV was continuous, the distributional assumptions were met, there was one IV with two levels, covariates measured on both categorical and continuous scales were included, and, although the DV was measured twice, the baseline measurement was used only as an additional covariate. Based on this information, all three statistical tests were used appropriately.

Tranexamic acid (TXA) and epsilon-aminocaproic acid (EACA) are two antifibrinolytics used to reduce blood loss following total joint arthroplasty. Because TXA is considerably more expensive than EACA, researchers at a local hospital currently dosing patients with TXA are interested in whether they can reduce costs by dosing EACA with no additional risk of blood loss. Therefore, a study was designed to evaluate differences in blood loss prevention between TXA and EACA following total joint arthroplasty. A sample of patients will be consented to participate in the study, with equal numbers randomized to receive either TXA or EACA. Blood loss will be measured using hemoglobin (Hgb). On the day of surgery, during the preoperative process, blood will be drawn to measure the patient’s baseline Hgb to serve as a covariate in analysis. Hgb will be measured again from blood drawn 2 days postoperatively to serve as the primary outcome. Hgb will be treated as a continuous variable for analysis.

Because the local hospital only has three orthopedic surgeons who complete total hip and knee replacements, the decision was made to recruit orthopedic surgeons from other hospitals in the metropolitan area to participate in the study. Thirty additional orthopedic surgeons within five hospitals agreed to participate. Because surgeons often have privileges to perform surgery at multiple hospitals, they were asked to include surgeries at only one hospital of their choosing.

Questions:

1. Would you consider this study to be a randomized controlled trial?

2. Is the design parallel-groups, crossover, or adaptive? Why?

3. The researchers plan to use a one-tailed hypothesis for this study. Is this appropriate? Could the researchers have specified a two-tailed hypothesis?

4. Are the researchers interested in a statistical test of differences or association?

5. How many times was the DV measured?

6. The distributional assumptions have been met and the authors have indicated they will use an independent-samples t-test to analyze their data. Is this correct given the study design? If so, why? If not, indicate which statistical test(s) of group differences would be more appropriate.

7. The distributional assumptions have been met and the authors have indicated they will use some form of regression analysis to analyze their data. Is this appropriate? What type of regression is most appropriate?

8. Should the researchers have been concerned with clustering or nesting? Why or why not? If so, describe the levels of nesting within their study design.

The remainder of this chapter describes the application and assumptions of numerous parametric and nonparametric statistical tests commonly used in the biomedical sciences. Note that this section only covers statistical tests applicable to study designs with one measured DV; no truly multivariate tests are discussed. A description and list of assumptions are provided for each statistical test as well as an example with an associated results section as it would likely appear in the literature. It is important to take careful note of the assumptions for each statistical test, as these assumptions are vital in determining whether correct statistical inference can be drawn from the statistical test results. The discussion of statistical tests begins with tests for nominal and categorical data, followed by statistical tests for evaluating group differences and associations.

Pearson’s chi-square test (or simply the chi-square test) is one of the most common statistical tests used in the biomedical sciences. It is used to assess for significant differences between two or more mutually exclusive groups for two variables measured on a nominal scale. Note that the data may also be ordinal if the number of rank-ordered categories is small; however, the test does not consider rank order. The chi-square test assesses for differences between the actual, or observed, frequency counts and the frequency counts that would be expected if there actually were no differences in the data.

As an example, consider a study to determine whether a significant difference in gender exists between three treatment groups. In most journal articles, a 2 × 3 (i.e., gender by treatment group) contingency table containing the observed frequency counts within each cell will typically be presented. This table would appear similar to Figure 8–7 but with another row or column. The expected frequencies are rarely presented in the literature. Next, a chi-square (χ^{2}) statistic should be presented with the appropriate degrees of freedom subscripted next to the chi-square symbol (e.g., $\chi^2_2$ for two degrees of freedom). As stated above, the number of degrees of freedom will vary based on the analysis. If the probability of the difference is below alpha, that is, if the observed frequencies are sufficiently different from the expected frequencies, the test is considered statistically significant, indicating a significant gender difference between the groups. The results of a statistically significant gender difference are presented as follows:

*The results of the chi-square test indicated a statistically significant gender difference between treatment groups (*$\chi^2_2 = 11.59$*, p < 0.05)*.

When the chi-square test is based on a contingency table larger than 2 × 2, the test is considered an omnibus test. That is, in the example above for the 2 × 3 table the chi-square test indicated a statistically significant gender difference existed between at least two treatment groups, but failed to indicate specifically which treatment groups differed. In these situations, *post hoc* chi-square tests (or Fisher’s exact tests if expected frequencies are low, see the next section) are used to determine where statistically significant differences occurred. For example, in the 2 × 3 chi-square above, three 2 × 2 *post hoc* chi-square tests are required. Specifically, gender compared between groups A and B, gender compared between groups A and C, and gender compared between groups B and C. A sample results section including the *post hoc* chi-square test results is presented below:

*Statistically significant gender differences were indicated across the three treatment groups (*$\chi^2_2 = 11.59$*, p < 0.05). Post hoc chi-square tests indicated statistically significant gender differences between groups A and B (*$\chi^2_1 = 6.54$*, p < 0.05) and between groups B and C (*$\chi^2_1 = 10.26$*, p < 0.05), with group B including significantly more males compared to both groups A and C. Further, no statistically significant gender difference was indicated between groups A and C*.
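The omnibus-plus-post-hoc workflow can be sketched with `scipy.stats.chi2_contingency`. The counts below are invented for illustration (they are not the chapter's χ² = 11.59 example):

```python
import numpy as np
from scipy import stats

# Hypothetical 2 x 3 table of observed counts (invented):
# rows = female / male, columns = treatment groups A, B, C
observed = np.array([[30, 12, 28],
                     [20, 38, 22]])

# Omnibus chi-square test across all three groups
chi2, p, df, expected = stats.chi2_contingency(observed, correction=False)
print(f"chi-square({df}) = {chi2:.2f}, p = {p:.4f}")

# Post hoc 2 x 2 chi-square tests, one per pair of groups
for i, j, label in [(0, 1, "A vs B"), (0, 2, "A vs C"), (1, 2, "B vs C")]:
    _, p_pair, _, _ = stats.chi2_contingency(observed[:, [i, j]], correction=False)
    print(label, round(p_pair, 4))
```

A significant omnibus result only licenses the pairwise follow-ups; in practice, researchers should also adjust the post hoc alpha for multiple comparisons.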

The assumptions of the chi-square test include:

Data for both variables being compared must be categorical.

Note that continuous data can be categorized; however, information will be lost via categorization.

The categories must be mutually exclusive.

That is, each individual can fall into one, and only one, category.

The total sample size must be large.

The expected frequencies in each cell must not be too small. For chi-square tests with degrees of freedom greater than 1 (i.e., when the number of columns and/or rows is greater than 2), no more than 20% of the cells should have expected frequencies less than 5. Further, no cell should have an expected frequency less than 1.^{31}

This is a difficult assumption to verify from the literature, outside of calculating the expected frequencies by hand. However, if an article fails to indicate this assumption was tested, view results with caution.
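Calculating the expected frequencies by hand is straightforward with software: `chi2_contingency` returns them directly, so the small-cell assumption can be checked explicitly. A sketch with invented counts:

```python
import numpy as np
from scipy import stats

# Hypothetical observed counts (invented)
observed = np.array([[5, 8, 3],
                     [10, 12, 7]])

# chi2_contingency also returns the expected frequencies under "no difference"
_, _, df, expected = stats.chi2_contingency(observed)

# No more than 20% of cells may have expected counts below 5,
# and no cell may have an expected count below 1
pct_below_5 = (expected < 5).sum() / expected.size
assumption_met = (pct_below_5 <= 0.20) and (expected >= 1).all()
print(expected.round(2))
print(assumption_met)
```

Each expected count is simply (row total × column total) / grand total, so the check can also be done by hand when reading a published table.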

Fisher’s exact test is ubiquitous in the biomedical literature. The test can only be applied to 2 × 2 contingency tables and is most useful when the sample size is small. It is often used when the third assumption of the chi-square test is violated. Conceptually, Fisher’s exact test is identical to the chi-square test in that the two variables being compared must be nominal and have mutually exclusive categories.

For example, consider a study assessing for differences in cardiac events in dialysis patients who initiated beta blocker therapy compared to patients who did not initiate therapy. Note, both variables are dichotomous (i.e., event versus no event; beta blocker versus no beta blocker). Fisher’s exact test provides the exact probability of observing this particular set of frequencies within each cell of the contingency table. Results of a statistically significant Fisher’s exact test are presented below. Notice only a *p* value is provided:

*The results of a Fisher’s exact test indicated patients initiating beta blocker therapy had significantly fewer cardiac events compared to patients failing to initiate therapy (p < 0.05).*
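Because Fisher's test computes an exact hypergeometric probability, it can be sketched in a few lines of pure Python. The 2 × 2 counts below are hypothetical stand-ins for the beta-blocker example, and the two-sided p value is defined here as the sum of all tables (with the same margins) no more probable than the observed one:

```python
from math import comb

def fisher_exact(a, b, c, d):
    """Two-sided Fisher's exact p for a 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d
    def prob(x):  # hypergeometric probability of x in the upper-left cell
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)
    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(col1, row1)
    # sum every table (same margins) at least as extreme as the observed one
    return sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs + 1e-12)

# Hypothetical dialysis data: cardiac events by beta-blocker initiation
#              event  no event
# beta blocker   2       10
# no therapy     9        3
print(round(fisher_exact(2, 10, 9, 3), 4))   # -> 0.0123
```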

The assumptions of Fisher’s exact test include:

Data for both variables being compared must be dichotomous.

Note that continuous data can be dichotomized; however, information will be lost via categorization.

The dichotomous categories must be mutually exclusive.

That is, each individual can fall into one, and only one, category.

The Mantel-Haenszel chi-square test (also known as the Cochran-Mantel-Haenszel test or Mantel-Haenszel test) measures the association among three discrete variables, which usually consist of two dichotomous IVs and one categorical confounding variable or covariate used as a stratification variable.

For example, consider a study assessing the presence or absence of lung cancer in smokers and nonsmokers (the IVs) after stratifying for frequent exposure to secondhand smoke (the dichotomous covariate; exposure versus no exposure). A 2 × 2 contingency table is created at each level of secondhand smoke. That is, a contingency table for exposure and another for no exposure. This test produces a chi-square statistic (χ^{2}_{MH}), with a statistically significant result indicating a significant difference in the presence of lung cancer for smokers and nonsmokers across the levels of the covariate (i.e., exposed versus unexposed). A statistically significant result is presented as follows:

*The results of a Mantel-Haenszel chi-square test indicated the proportion of smokers developing lung cancer was significantly greater than the proportion of nonsmokers after stratifying for exposure to secondhand smoke (χ*^{2}_{MH}*= 29.67, 1 df, p < 0.05)*.
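The Mantel-Haenszel chi-square pools the evidence across the stratum-specific 2 × 2 tables. A minimal sketch (without the continuity correction some texts apply), using hypothetical counts for the smoking example:

```python
def cmh_chi_square(strata):
    """Mantel-Haenszel chi-square (no continuity correction) over 2x2 strata.
    Each stratum is [[a, b], [c, d]]."""
    num, var = 0.0, 0.0
    for (a, b), (c, d) in strata:
        n = a + b + c + d
        num += a - (a + b) * (a + c) / n        # observed minus expected a-cell
        var += (a + b) * (c + d) * (a + c) * (b + d) / (n ** 2 * (n - 1))
    return num ** 2 / var

# Hypothetical lung-cancer counts, stratified by secondhand-smoke exposure:
#                 cancer  no cancer
exposed   = [[30, 70],    # smokers
             [10, 90]]    # nonsmokers
unexposed = [[20, 80],
             [5, 95]]
print(round(cmh_chi_square([exposed, unexposed]), 2))   # -> 22.62
```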

The assumptions of the Mantel-Haenszel chi-square test include:

Data of the IVs must be dichotomous.

Note that continuous data can be dichotomized; however, information will be lost via categorization.

The dichotomous categories must be mutually exclusive.

That is, each individual can fall into one, and only one, category.

Data of the covariate must be categorical.

Again, note that continuous data can be dichotomized; however, information will be lost via categorization.

The kappa statistic (also known as Cohen’s kappa or κ) is a measure of inter-rater reliability or agreement for a categorical variable measured on a nominal scale. That is, kappa indicates how often individual raters using the same measurement scale indicate identical scores. Kappa provides the proportion of agreement corrected for chance and typically ranges from 0, indicating agreement no better than chance, to 1, indicating perfect agreement. Be aware that there are no methodological studies defining a threshold for what could be considered good agreement. The definition of good agreement varies by substantive area. That is, what is considered good agreement in medical literature may not be considered good agreement in psychology, or vice versa. In general, however, higher kappa values indicate better agreement.

Because this statistic corrects for chance agreement, it is more appropriate than simply calculating overall percent agreement.^{32} In fact, percent agreement should rarely be used and published results using percent agreement should be viewed with caution. Kappa can be applied to a variable with any number of categories, with the understanding that as the number of categories increase, overall agreement will undoubtedly decrease. That is, the more choices two raters have, the less likely they are to agree.

As an example, consider 100 professional school applicants, who each interview with two faculty members. After the interview is complete, each faculty member rates the applicant as accept, deny, or waitlist. Kappa is then used to calculate the agreement between faculty members. Results using the kappa statistic are presented as follows:

*Cohen’s kappa was employed to measure the agreement between faculty members in determining whether applicants should be accepted, denied, or waitlisted. Results indicated moderate agreement between faculty members (κ = 0.75)*.
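Kappa's chance correction can be seen directly in code. The agreement matrix below is hypothetical (rows are the first faculty member's decisions, columns the second's); it does not reproduce the κ = 0.75 in the example:

```python
def cohens_kappa(matrix):
    """Cohen's kappa from a square agreement matrix (rater 1 rows, rater 2 cols)."""
    n = sum(sum(row) for row in matrix)
    p_obs = sum(matrix[i][i] for i in range(len(matrix))) / n   # raw agreement
    row_m = [sum(row) / n for row in matrix]
    col_m = [sum(col) / n for col in zip(*matrix)]
    p_chance = sum(r * c for r, c in zip(row_m, col_m))          # chance agreement
    return (p_obs - p_chance) / (1 - p_chance)

# Hypothetical ratings of 100 applicants (accept / waitlist / deny)
ratings = [[40,  5,  2],
           [ 6, 25,  4],
           [ 2,  3, 13]]
print(round(cohens_kappa(ratings), 2))   # -> 0.65
```

Note how the raw agreement here is 0.78, but correcting for chance pulls kappa down to about 0.65, which is why percent agreement alone overstates reliability.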

The assumptions of the kappa statistic include:

Each object (e.g., the applicant in the example above) is rated only one time.

The outcome variable is nominal with mutually exclusive categories.

That is, each individual can fall into one, and only one, category.

There are at least two independent raters.

That is, each rater provides one, and only one, response for each applicant.

The one-sample *z* test is used to assess for a difference between the mean of the study sample and a known population mean using a continuous DV.

For example, consider data collected from a random sample of 1000 patients with borderline high cholesterol, whose mean serum total cholesterol was 210.01 mg/dL. The researcher is interested in determining whether the total cholesterol of this sample is significantly higher than the mean total cholesterol within the general population. The 2007-2008 National Health and Nutrition Examination Survey (NHANES) determined the mean serum total cholesterol level for individuals in the United States aged 6 years and older is 186.67 mg/dL with a standard deviation of 42.15.^{33} A one-sample *z* test provides a *z*-score indicating how many standard errors the sample mean is from the known population mean and whether this difference is large enough to be considered statistically significant. Strictly speaking, the *z* test is based on the standard normal distribution and does not depend on degrees of freedom, although some authors subscript the sample size minus 1 next to the *z*-score (e.g., *z*_{999}). Results of a statistically significant one-sample *z* test with no assumption violations are provided as follows:

*Results of a one-sample z test indicated a statistically significant difference in total cholesterol between the study sample and population (z*_{999} *= 2.10, p < 0.05), with the study sample having significantly higher total cholesterol compared to the general population (210.01 mg/dL* versus *186.67 mg/dL, respectively)*.
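The one-sample z test reduces to a few lines once the population mean and standard deviation are known. The sketch below uses the NHANES population values quoted above with a hypothetical sample mean (the standard normal CDF is available through math.erf, so no statistics library is needed):

```python
from math import sqrt, erf

def one_sample_z(xbar, mu, sigma, n):
    """z statistic and two-sided p for a sample mean vs. a known population."""
    z = (xbar - mu) / (sigma / sqrt(n))             # standard errors from mu
    phi = 0.5 * (1 + erf(abs(z) / sqrt(2)))         # standard normal CDF
    return z, 2 * (1 - phi)

# NHANES population values from the text; the sample mean here is hypothetical
z, p = one_sample_z(xbar=195.0, mu=186.67, sigma=42.15, n=1000)
print(round(z, 2), p < 0.05)
```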

Assumptions of the one-sample *z* test include:

The DV is measured on an interval or ratio scale.

The sampling distribution of means for the DV is normal.

This can be ensured by applying the central limit theorem.

The population mean and standard deviation is known.

The observations are independent.

That is, each participant provides one, and only one, observation (i.e., data or response).

Only in rare cases is the population standard deviation known; thus, test statistics often must be based on sample data (i.e., standard deviation and sample size). The one-sample t-test is used in situations where only the population mean is known, or can at least be estimated by very large amounts of data. It is used only for a continuous DV.

For example, consider a study to compare the mean total cholesterol of a random sample of 1000 adults aged 20 years or older with borderline high cholesterol (mean total cholesterol = 231.26 mg/dL) to the mean total cholesterol of the general population. In 2006, the National Center for Health Statistics determined the mean serum total cholesterol for adults in the United States aged 20 years and older was 199.00 mg/dL.^{34} Notice, no population standard deviation is available; thus, a one-sample t-test is required. Note that this test produces a *t* statistic, which can be considered similar to a *z*-score when samples are large. In general, a t-test will approximate a *z*-test with sample sizes of around 30 or more. The result of a statistically significant one-sample t-test with no assumption violations is presented as follows:

*Results of a one-sample t-test indicated a statistically significant difference in total cholesterol between the study sample and population (t*_{999} *= 2.23, p < 0.05), with the study sample having significantly higher total cholesterol compared to individuals aged 20 or older in the general population (231.26 mg/dL* versus *199.00 mg/dL, respectively).*

Assumptions of the one-sample t-test include:

The DV is measured on an interval or ratio scale.

The sampling distribution of means for the DV is normal.

This can be ensured by applying the central limit theorem.

The population mean is known.

The observations are independent.

That is, each participant provides one, and only one, observation (i.e., data or response).

The binomial test is used when the DV is dichotomous and all of the possible data or outcomes fall into one, and only one, of the two categories. The binomial test uses the binomial distribution to test the exact probability of whether the sample proportion differs from the population proportion. Further, the binomial test is often used in the literature when sample sizes are small and violate the assumptions of the chi-square test, specifically low expected frequencies.^{27}

For example, say a fair coin is flipped 10 times, and lands on heads 6 of the 10 flips. The expected population proportion is 0.50; that is, if the coin is fair, as the number of flips increases the coin should land on heads 50% of the time. Because the coin landed on heads 6 of the 10 flips, the statistical test is whether this proportion (i.e., 6/10 or 0.60) is statistically different from the expected proportion (i.e., 0.50). In this case, the binomial test indicates the difference between these proportions is not significant and results are presented as follows:

*Results of the binomial test indicated the probability of flipping 6 heads in 10 flips was not statistically different from the expected population proportion of 0.50 (p > 0.05)*.
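The exact two-sided p value for the coin example can be reproduced by summing the binomial probabilities of every outcome no more likely than the observed one (one common definition of the two-sided exact test):

```python
from math import comb

def binomial_test(k, n, p0=0.5):
    """Exact two-sided binomial p: sum outcomes no more likely than observed."""
    def pmf(x):
        return comb(n, x) * p0 ** x * (1 - p0) ** (n - x)
    p_obs = pmf(k)
    return sum(pmf(x) for x in range(n + 1) if pmf(x) <= p_obs + 1e-12)

# The chapter's coin example: 6 heads in 10 flips against p = 0.50
print(round(binomial_test(6, 10), 4))   # -> 0.7539, not significant
```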

The assumptions of the binomial test include:

Data for the variable being tested must be dichotomous.

Note that continuous data can be dichotomized; however, information will be lost via categorization.

The dichotomous categories must be mutually exclusive.

That is, each individual can fall into one, and only one, category.

The population proportion is known.

The observations are independent.

That is, each participant provides one, and only one, observation (i.e., data or response).

The Kolmogorov-Smirnov one-sample test is a goodness-of-fit test used to determine the degree of agreement between the distribution of a researcher’s sample data and a theoretical population distribution.^{27} That is, it allows researchers to compare the distribution of their sample data against a given probability distribution for a continuous DV (see Table 8–1).

For example, consider a study where HbA1c data was collected for a random sample of 100 patients with diabetes. The researcher is interested in determining whether the distribution of HbA1c data was sampled from a population of patients with an underlying normal distribution. That is, the researcher is interested in whether the sample data is normally distributed. A nonsignificant Kolmogorov-Smirnov test indicates the sample distribution and the hypothesized normal distribution are not statistically different; that is, the distribution of sample data can be considered normally distributed. Results of the Kolmogorov-Smirnov test are presented as follows:

*Results of the Kolmogorov-Smirnov test indicated the HbA1c variable had a nonsignificant departure from normality (p > 0.05); thus, the data are considered to come from a normal distribution.*
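The Kolmogorov-Smirnov statistic D is simply the largest gap between the empirical CDF of the sample and the theoretical CDF. A sketch with hypothetical HbA1c values tested against a fully specified normal distribution (note that estimating the mean and SD from the same sample formally calls for the Lilliefors correction):

```python
from math import sqrt, erf

def ks_statistic(data, cdf):
    """One-sample K-S D: max gap between empirical and theoretical CDFs."""
    xs = sorted(data)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        f = cdf(x)
        d = max(d, i / n - f, f - (i - 1) / n)   # check gap on both sides
    return d

def normal_cdf(mu, sigma):
    return lambda x: 0.5 * (1 + erf((x - mu) / (sigma * sqrt(2))))

# Hypothetical HbA1c values tested against a fully specified N(7.0, 1.0)
hba1c = [5.8, 6.2, 6.5, 6.8, 7.0, 7.1, 7.3, 7.6, 8.0, 8.4]
d = ks_statistic(hba1c, normal_cdf(7.0, 1.0))
print(round(d, 3), d < 1.36 / sqrt(len(hba1c)))  # 1.36/sqrt(n): approx 5% critical value
```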

The assumptions of the Kolmogorov-Smirnov one-sample test include:

The DV is measured on an interval or ratio scale.

The underlying population distribution is theorized or known.

That is, the researcher must specify the correct probability distribution to test the sample data against. If the distribution is unknown, the test is inappropriate.

The observations are independent.

That is, each participant provides one, and only one, observation (i.e., data or response).

The independent-samples t-test (also known as Student’s t-test) is used to assess for a statistically significant difference between the means of two independent, mutually exclusive groups using a continuous DV.

For example, consider testing for a mean difference in a methacholine challenge at the end of an 8-week study period in two groups of asthma patients receiving either rosiglitazone or placebo. Response to the methacholine challenge was measured as the concentration producing a 20% decrease in forced expiratory volume in one second (FEV1), denoted PC20. The independent-samples *t*-test provides the *t* statistic and probability of obtaining a difference of this size or larger based on specific degrees of freedom, which are usually subscripted (e.g., *t*_{31}). Results of a statistically significant independent-samples t-test with no assumption violations are presented as follows:

*The results of an independent-samples t-test indicated a statistically significant difference between groups (t*_{31} = *9.654, p < 0.05), with asthma patients receiving rosiglitazone displaying significantly better lung function compared to placebo (mean PC20 = 10.7 mg/mL versus 3.8 mg/mL, respectively).*
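The pooled-variance t statistic, along with the crude variance-ratio check for homogeneity of variance, can be sketched as follows. The PC20 values are hypothetical, chosen only so the group means match the 10.7 and 3.8 mg/mL quoted above:

```python
from math import sqrt
from statistics import mean, variance

def independent_t(x, y):
    """Pooled-variance independent-samples t statistic and its df."""
    nx, ny = len(x), len(y)
    sp2 = ((nx - 1) * variance(x) + (ny - 1) * variance(y)) / (nx + ny - 2)
    t = (mean(x) - mean(y)) / sqrt(sp2 * (1 / nx + 1 / ny))
    return t, nx + ny - 2

# Hypothetical PC20 values (mg/mL) for rosiglitazone vs. placebo
rosi    = [9.8, 11.2, 10.5, 12.0, 10.1, 10.6]
placebo = [3.5, 4.2, 3.9, 3.6, 4.1, 3.5]
t, df = independent_t(rosi, placebo)
print(round(t, 2), df)

# crude homogeneity-of-variance check: largest-to-smallest variance ratio < 10
ratio = max(variance(rosi), variance(placebo)) / min(variance(rosi), variance(placebo))
print(ratio < 10)
```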

The assumptions of the independent-samples t-test include:

The DV is measured on an interval or ratio scale.

The sampling distribution of means for the DV within each level of the IV (i.e., group) is normal.

This can be ensured by applying the central limit theorem.

The IV is dichotomous.

Note that continuous data can be dichotomized; however, information will be lost via categorization.

The IV categories are mutually exclusive.

That is, each individual can fall into one, and only one, category.

Homogeneity of variance is ensured.

That is, the variance within each group is similar. A crude indicator of a violation of this assumption (i.e., heterogeneity) is the ratio of the largest variance to smallest variance being greater than 10:1.^{22}

For example, most studies do not provide the variance for each variable; however, the standard deviation is reported consistently. Remember, variance is simply the standard deviation squared. Thus, consider two variables with standard deviations of 5 and 10. The homogeneity of variance assumption can be tested by squaring the standard deviations (i.e., 5^{2} = 25 and 10^{2} = 100, respectively) and finding their ratio (i.e., 100/25 = 4). In this case, the ratio is less than 10:1; thus, the assumption is not violated.

The observations are independent.

That is, each participant provides one, and only one, observation (i.e., data or response).

A one-way between-groups analysis of variance (ANOVA) is an extension of the independent-samples t-test to situations, where researchers want to assess for mean differences between three or more mutually exclusive groups using a continuous DV.

For example, consider the rosiglitazone example from the Independent-Samples *t*-Test section above, but in addition to the placebo group, include two groups receiving different doses of rosiglitazone (e.g., 4 and 8 mg). The use of three independent-samples t-tests to test for mean differences between groups (i.e., 4 mg versus placebo, 8 mg versus placebo, 4 mg versus 8 mg) is inappropriate due to a possible increase in Type I error. Instead, one-way ANOVA is used to partition the variance between and within groups to determine if a statistically significant group difference exists. This partitioning can be observed in the two numbers presented for degrees of freedom (i.e., *F*_{2,27} indicates 2 between-group degrees of freedom and 27 within-group degrees of freedom). The result of a statistically significant one-way ANOVA with no assumption violations is presented as follows:

*Results of a one-way ANOVA indicated a statistically significant difference between groups (F*_{2,27} *= 6.89, p < 0.05).*

ANOVA provides an omnibus *F* test; that is, an overall test assessing the statistical significance between the three or more group means. A statistically significant omnibus *F* test indicates a statistically significant difference between at least two group means. To determine which groups differ specifically, a series of *post hoc* tests are conducted. *Post hoc* tests are simply tests comparing individual groups to one another; thus, *post hoc* tests can be viewed as a series of independent-samples *t*-tests; that is, two-group comparisons. Because *post hoc* tests increase the number of statistical tests used, they often use a more conservative alpha to control for potential Type I errors. With this in mind, the significant one-way ANOVA in the example above would require three adjusted *post hoc* tests: 4 mg versus placebo, 8 mg versus placebo, and 4 mg versus 8 mg. The most commonly used *post hoc* tests in the literature are the Tukey and Scheffé tests. Be aware that the Scheffé test is the most conservative *post hoc* test available, and some methodologists argue that it may be too conservative, increasing the probability of committing a Type II error. A suitable alternative is the Tukey test, which is conservative but to a lesser degree. In most cases, the two tests will indicate similar results and both are viewed as acceptable.

The results of a statistically significant one-way ANOVA including *post hoc* tests are presented as follows:

*Results of a one-way ANOVA indicated a statistically significant difference between groups (F*_{2,27} *= 6.89, p < 0.05). Post hoc Tukey tests indicated statistically significant differences (at p < 0.05) between placebo (3.8 mg/mL) and 4 mg dose of rosiglitazone (10.7 mg/mL) as well as between placebo and the 8 mg dose of rosiglitazone (12.2 mg/mL). No statistically significant differences were indicated between the 4 mg and 8 mg doses of rosiglitazone.*
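The between/within partition underlying the omnibus F test can be sketched in pure Python. The PC20 values are hypothetical; the p value and Tukey or Scheffé post hoc tests would require additional distributional machinery from a statistics library:

```python
from statistics import mean

def one_way_anova(groups):
    """Between/within variance partition; returns (F, df_between, df_within)."""
    grand = mean(x for g in groups for x in g)
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_b = len(groups) - 1
    df_w = sum(len(g) for g in groups) - len(groups)
    return (ss_between / df_b) / (ss_within / df_w), df_b, df_w

# Hypothetical PC20 values (mg/mL): placebo, 4 mg, and 8 mg rosiglitazone
placebo = [3.5, 4.2, 3.9, 3.6, 4.1]
low     = [9.8, 11.2, 10.5, 12.0, 10.1]
high    = [11.9, 12.5, 12.0, 12.6, 11.8]
f, df_b, df_w = one_way_anova([placebo, low, high])
print(round(f, 1), df_b, df_w)
```

The two degrees-of-freedom values printed here correspond directly to the subscripts in a report such as *F*_{2,12}.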

The assumptions for one-way ANOVA include:

The DV is measured on an interval or ratio scale.

The sampling distribution of means for the DV within each level of the IV is normal.

This can be ensured by applying the central limit theorem.

The levels of the IV are mutually exclusive.

That is, each individual can fall into one, and only one, category.

Homogeneity of variance is ensured.

That is, the variance within each group is similar. A crude indicator of a violation of this assumption (i.e., heterogeneity) is the ratio of the largest variance to smallest variance being greater than 10:1.^{22}

For example, most studies do not provide the variance for each variable; however, the standard deviation is reported consistently. Remember, variance is simply the standard deviation squared. Thus, consider two variables with standard deviations of 5 and 10. The homogeneity of variance assumption can be tested by squaring the standard deviations (i.e., 5^{2} = 25 and 10^{2} = 100, respectively) and finding their ratio (i.e., 100/25 = 4). In this case, the ratio is less than 10:1; thus, the assumption is not violated.

The observations are independent.

That is, each participant provides one, and only one, observation (i.e., data or response).

A factorial between-groups analysis of variance (also known as factorial ANOVA) is an extension of the one-way between-groups ANOVA to a study with more than one IV using a continuous DV.

For example, consider a study to evaluate for differences in heart rate measured by beats per minute (bpm) between men and women following either a 25-mg dose of pseudoephedrine or placebo. In the literature, this may be described as a 2 × 2 factorial design indicating two IVs (i.e., gender and treatment) each with two levels (i.e., male versus female; pseudoephedrine versus placebo). This type of design produces two main effects—one for gender and one for treatment—and an interaction effect between gender and treatment. Thus, three separate *F* tests are provided, one for each effect, with statistical significance determined separately for each effect.

It is extremely important to note that if the interaction effect is statistically significant, the results of the main effects cannot be interpreted directly, as the IVs are dependent on each other. From the example, a statistically significant interaction effect indicates treatment effects differ depending on the gender of the participant. That is, pseudoephedrine had a different effect for males than it did for females. However, if the interaction effect is not significant, main effects can and should be interpreted. When interpreting the main effect of an IV, the levels of the other IV are averaged or marginalized. That is, interpreting the main effect of gender is done irrespective of whether the participants received pseudoephedrine or placebo. Likewise, interpreting the main effect of treatment is done irrespective of the participant’s gender.

Similar to one-way between-groups ANOVA, following a statistically significant main effect or interaction, *post hoc* tests may be required to identify where statistically significant differences occurred. There are a number of *post hoc* tests available depending on whether the interaction or main effects are statistically significant including the Tukey and Scheffé tests.^{35} Each *post hoc* test adjusts alpha more or less conservatively to reduce potential Type I errors. *Post hoc* tests for factorial ANOVA used in the literature are often termed simple comparisons, simple contrasts, simple main effects, or interaction contrasts. While each uses a slightly different procedure, they are used to accomplish the same goal—identify group differences.

In the biomedical sciences, the results of a nonsignificant interaction effect for a 2 × 2 factorial between-groups ANOVA with no assumption violations are presented as follows:

*Results of a 2 (gender; male versus female) × 2 (treatment; pseudoephedrine versus placebo) factorial between-groups ANOVA indicated a nonsignificant interaction effect between gender and treatment (p > 0.05). However, the main effect for gender was statistically significant (F*_{1,26} *= 21.36, p < 0.05), with males having significantly higher heart rates than females (91.6 bpm versus 84.3 bpm, respectively). Further, the main effect of treatment was also statistically significant (F*_{1,26} *= 15.24, p < 0.05), with pseudoephedrine resulting in a significantly higher heart rate compared to placebo (70.3 bpm versus 65.2 bpm, respectively)*.

The results of a 2 × 2 factorial between-groups ANOVA with a statistically significant interaction and no assumption violations are presented as follows:

*Results of a 2 (gender; male versus female) × 2 (treatment; pseudoephedrine versus placebo) factorial between-groups ANOVA indicated a statistically significant interaction effect between gender and treatment (F*_{1,26} *= 15.42, p < 0.05). Simple main effects were assessed to identify at which treatment level gender differed. Results indicated pseudoephedrine increased heart rate significantly higher for males compared to females (90.5 bpm versus 82.4 bpm, respectively). No statistically significant gender difference in heart rate was indicated for the placebo group.*

The assumptions of factorial between-groups ANOVA include:

The DV is measured on an interval or ratio scale.

The sampling distribution of means for the DV within each level of the IV is normal.

This can be ensured by applying the central limit theorem.

The levels of the IVs are mutually exclusive.

That is, each individual can fall into one, and only one, category.

Homogeneity of variance is ensured.

That is, the variance within each group is similar. A crude indicator of a violation of this assumption (i.e., heterogeneity) is the ratio of the largest variance to smallest variance being greater than 10:1.^{22}

For example, most studies do not provide the variance for each variable; however, the standard deviation is reported consistently. Remember, variance is simply the standard deviation squared. Thus, consider two variables with standard deviations of 5 and 10. The homogeneity of variance assumption can be tested by squaring the standard deviations (i.e., 5^{2} = 25 and 10^{2} = 100, respectively) and finding their ratio (i.e., 100/25 = 4). In this case, the ratio is less than 10:1; thus, the assumption is not violated.

The observations are independent.

That is, each participant provides one, and only one, observation (i.e., data or response).

Analysis of covariance (ANCOVA) is an extension of both one-way between-groups ANOVA and factorial between-groups ANOVA. ANCOVA evaluates main effects and interactions using a continuous DV after statistically adjusting for one or more continuous confounding variables. That is, ANCOVA adjusts all group means to create the situation as if all participants scored identically on the covariate.^{22}

For example, consider a study comparing atenolol to placebo (IV) and assessing their effects on systolic blood pressure (DV). The researchers note, however, that previous research has shown systolic blood pressure and BMI to be highly correlated.^{36} Thus, the analysis will include BMI as a covariate assessing the effect of atenolol on systolic blood pressure over and above the effect of BMI on systolic blood pressure. If the atenolol group has greater BMI values compared to the placebo group, ANCOVA will adjust the systolic blood pressure within both groups to account for this initial difference in BMI.

In ANCOVA, covariates are continuous, measured before the DV, and correlated with the DV. It should be noted that ANCOVA is closely related to linear regression. Thus, although not completely necessary, it may be useful to revisit this section after reading the sections on Simple and Multivariable Linear Regression presented later in the chapter. In ANCOVA, group means are statistically adjusted by the magnitude of the association (i.e., slope) between the DV and covariate.^{37} That is, the greater the association, the more useful the covariate and better the adjustment. Thus, the goal of the covariate is to reduce error variance thereby increasing the statistical power of the test. From the example, mean systolic blood pressure for the atenolol and placebo groups are adjusted by the association between BMI and systolic blood pressure. Because previous research has shown the association between systolic blood pressure and BMI to be considerable, the statistical power of this test will undoubtedly be increased.

When presenting the results of ANCOVA, researchers should provide adjusted means; that is, the mean of the DV at each level of the IV after adjusting for the covariate. Published research that does not present adjusted means should be viewed with caution. Further, the effect of the covariate must also be presented which provides information regarding the effectiveness of the covariate in adjusting group means. Finally, it should be noted that ANCOVA is more suited for experimental design in which participants are randomized to groups, as opposed to nonexperimental designs without randomization. Remember, ANCOVA is used to adjust group means as if all participants had identical covariate values. However, in nonexperimental research, important covariates may have been missed and causality is difficult to infer—a characteristic intrinsic to all nonexperimental work. Thus, the limitations may be significant when applying ANCOVA to nonexperimental designs and results must be viewed cautiously.^{9}

In the biomedical sciences, the results of a statistically significant ANCOVA with no assumption violations are presented as follows:

*Results of a one-way ANCOVA indicated a statistically significant group difference in systolic blood pressure after adjusting for BMI (F*_{1,17} *= 7.98, p < 0.05), with patients receiving atenolol having significantly lower systolic blood pressure compared to placebo (adjusted means = 118 mm Hg versus 141 mm Hg, respectively). The relationship between systolic blood pressure and BMI was also statistically significant after adjusting for group (F*_{1,17} *= 39.85, p < 0.05) with a pooled within-group correlation of 0.61.*
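The adjustment ANCOVA applies can be illustrated directly: each group mean on the DV is shifted by the pooled within-group slope times that group's departure from the grand covariate mean. The (BMI, systolic BP) pairs below are hypothetical toy data, not the study values above:

```python
from statistics import mean

def ancova_adjusted_means(groups):
    """Adjusted mean per group: ybar_j - b_w * (xbar_j - grand xbar),
    where b_w is the pooled within-group slope of the DV on the covariate.
    Each group is a list of (covariate, dv) pairs."""
    sxy = sxx = 0.0
    for g in groups:
        xbar, ybar = mean(x for x, _ in g), mean(y for _, y in g)
        sxy += sum((x - xbar) * (y - ybar) for x, y in g)
        sxx += sum((x - xbar) ** 2 for x, _ in g)
    b_w = sxy / sxx                                   # pooled within-group slope
    grand_x = mean(x for g in groups for x, _ in g)
    return [mean(y for _, y in g) - b_w * (mean(x for x, _ in g) - grand_x)
            for g in groups]

# Hypothetical (BMI, systolic BP) pairs for atenolol vs. placebo
atenolol = [(31, 124), (33, 128), (29, 118), (35, 132)]
placebo  = [(27, 138), (29, 144), (25, 134), (31, 148)]
print([round(m, 1) for m in ancova_adjusted_means([atenolol, placebo])])
```

In this toy data the atenolol group has the higher mean BMI, so its mean blood pressure is adjusted downward while the placebo group's is adjusted upward, widening the group difference as if both groups had the same BMI.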

The assumptions of ANCOVA include:

The DV is measured on an interval or ratio scale.

The sampling distribution of means for the DV and covariate(s) within each level of the IV is normal.

This can be ensured by applying the central limit theorem.

The levels of the IV are mutually exclusive.

That is, each individual can fall into one, and only one, category.

Homogeneity of variance is ensured.^{22}

For example, most studies do not provide the variance for each variable; however, the standard deviation is reported consistently. Remember, variance is simply the standard deviation squared. Thus, consider two variables with standard deviations of 5 and 10. The homogeneity of variance assumption can be tested by squaring the standard deviations (i.e., 5^{2} = 25 and 10^{2} = 100, respectively) and finding their ratio (i.e., 100/25 = 4). In this case, the ratio is less than 10:1; thus, the assumption is not violated.

Homogeneity of regression is ensured.

This assumption requires the association between the DV and covariate to be the same within each level of the IV. A violation of this assumption renders ANCOVA inappropriate, and the authors should use linear regression instead. However, violation is difficult to detect from the literature, as most authors fail to provide the appropriate information in the narrative. Thus, when reading a journal article employing ANCOVA, if the author fails to indicate whether this assumption was tested, results must be viewed with caution.

The covariate(s) is measured reliably and without error.

The observations are independent.

That is, each participant provides one, and only one, observation (i.e., data or response).

The Mann-Whitney test is the nonparametric alternative to the independent-samples t-test and is one of the most powerful nonparametric tests.^{27} It is used when the distributional assumptions for the parametric test are violated or when the DV is measured on an ordinal scale. The Mann-Whitney test is based on ranked data. That is, instead of using the actual values of the DV, as an independent-samples t-test does, each participant’s DV value is ranked with the highest value receiving the highest rank and the lowest value receiving the lowest rank. The ranks within each group are then summed and the test assesses whether the difference in ranked sums between groups is statistically significant.

For example, consider a performance improvement study assessing gender differences in patient satisfaction with hospital stay following total hip replacement surgery. The measurement instrument uses a Likert-type scale with four possible responses anchored from Strongly Disagree to Strongly Agree. A statistically significant Mann-Whitney test indicates gender differences in patient satisfaction, with the group with the highest ranked sums indicating higher satisfaction. The results of the Mann-Whitney test are presented as follows:

*The results of a Mann-Whitney test indicate a statistically significant gender difference in patient satisfaction following total hip replacement surgery (z = 2.65, p < 0.05), with males indicating higher satisfaction scores compared to females.*
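The rank-sum logic can be sketched in pure Python, using midranks for ties as is standard with Likert-type data. The satisfaction scores are hypothetical, and the z value quoted above would additionally require the normal approximation to U:

```python
def mann_whitney_u(x, y):
    """Mann-Whitney U from rank sums, using midranks for tied values."""
    pooled = sorted(x + y)
    ranks = {}
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j] == pooled[i]:
            j += 1
        ranks[pooled[i]] = (i + 1 + j) / 2   # average rank for the tied run
        i = j
    r_x = sum(ranks[v] for v in x)           # rank sum for the first group
    u_x = r_x - len(x) * (len(x) + 1) / 2
    return min(u_x, len(x) * len(y) - u_x)   # report the smaller U

# Hypothetical satisfaction scores (1-4 Likert) for males vs. females
males   = [4, 3, 4, 4, 2, 3]
females = [2, 3, 1, 2, 3, 2]
print(mann_whitney_u(males, females))
```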

The assumptions of the Mann-Whitney test include:

The DV is measured on an ordinal, interval, or ratio scale.

The IV is dichotomous.

Note that continuous data can be categorized into a dichotomous variable; however, information will be lost.

The levels of the IV are mutually exclusive.

That is, each individual can fall into one, and only one category.

The observations are independent.

That is, each participant provides one, and only one, observation (i.e., data or response).

The median test is used to assess whether two mutually exclusive groups have different medians. There is no parametric alternative to the median test; however, the nonparametric Mann-Whitney test can be used as an adequate alternative. The test calculates the grand median across all observations combined and then classifies the data within each group as either above or below this overall median. Further, because the test is based on the median, it can be used appropriately for skewed distributions or data containing outliers.

For example, consider a study evaluating gender differences in childhood autism as measured by the Childhood Autism Spectrum Test (CAST).^{37} The CAST measures difficulties and preferences in social and communication skills using the total score of a 37-item questionnaire, with lower scores indicating fewer symptoms. Because autism is a relatively rare disorder, the distribution of CAST scores is expected to have severe positive skewness due to outliers. That is, most children will score low, while a few autistic children will have high scores. The median test was used to determine whether statistically significant gender differences existed in CAST scores. The results of a statistically significant median test are presented as follows:

*The results of the median test indicated gender differences in CAST scores (p < 0.05), with boys having a significantly higher median score compared to girls (median = 5 versus median = 4, respectively)*.

Assumptions of the median test include:

The DV is measured on an ordinal, interval, or ratio scale.

Sample sizes are sufficiently large.

If sample sizes are small, say less than 5 in each group, Fisher’s exact test should be used instead. From the example, this means using a 2 (Group; male versus female) × 2 (Median; above versus below) contingency table.

The observations are independent.

That is, each participant provides one, and only one, observation (i.e., data or response).
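SciPy implements this procedure (Mood's median test) directly. A sketch, assuming hypothetical, positively skewed questionnaire totals (illustrative values only, not CAST data from the cited study):

```python
from scipy import stats

# Hypothetical positively skewed questionnaire totals for two groups;
# illustrative values only, not CAST data from the cited study.
boys = [5, 6, 4, 7, 5, 20, 5, 6, 8, 5]
girls = [4, 3, 4, 5, 3, 4, 2, 4, 15, 3]

# median_test pools both groups, finds the grand median, and builds a
# 2 x 2 table of above/below counts for a chi-square test
stat, p_value, grand_median, table = stats.median_test(boys, girls)
print(f"grand median = {grand_median}, p = {p_value:.4f}")
```

The returned contingency table corresponds to the 2 (group) × 2 (above versus below the grand median) layout described for Fisher's exact test above.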

The Kruskal-Wallis one-way ANOVA by ranks (or simply, the Kruskal-Wallis test) is the nonparametric alternative to the one-way between-groups ANOVA. The test is an extension of the Mann-Whitney test to assess group differences between three or more mutually exclusive groups. The Kruskal-Wallis test is typically used when distributional assumptions are violated or when the DV is measured on an ordinal scale. Further, the Kruskal-Wallis test is based on rank sums similar to the Mann-Whitney test. The DV scores are ranked from highest to lowest and summed, and the statistical significance of group differences is evaluated based on these rank sums.

For example, consider a study evaluating regional differences in whether volunteer preceptors believe they have adequate time available to dedicate to their experiential pharmacy students.^{38} In this study, the DV was measured on an ordinal 4-point Likert-type scale; thus, the Kruskal-Wallis test was used in lieu of one-way between-groups ANOVA. The results of the statistically significant Kruskal-Wallis test are presented as follows:

*Results of the Kruskal-Wallis test indicated regional differences regarding whether volunteer preceptors believe they have adequate time to dedicate to experiential students (*$\chi^2_6 = 33.07$*, p < 0.05).*

Similar to a one-way between-groups ANOVA, the Kruskal-Wallis test is an omnibus test. That is, the test will determine whether an overall statistically significant difference exists between groups, but will not indicate specifically which groups differed statistically. Thus, *post hoc* tests are required. In this situation, the Mann-Whitney test is used to compare all two-group combinations. From the example, *post hoc* tests would include West versus Midwest, West versus South, West versus Northeast, and so on for a total of six *post hoc* tests. The results of the Kruskal-Wallis test including Mann-Whitney *post hoc* tests are presented as follows:

*Results of the Kruskal-Wallis test indicated regional differences regarding whether volunteer preceptors believe they have adequate time to dedicate to experiential students (*$\chi^2_6 = 33.07$*, p < 0.05). Post hoc Mann-Whitney tests indicated preceptors in the West disagreed more compared to preceptors located in the Midwest (p < 0.05) and agreed less with preceptors in the South (p < 0.05). No other statistically significant group differences were indicated*.

The assumptions of the Kruskal-Wallis test include:

The DV is measured on an ordinal, interval, or ratio scale.

The IV is categorical.

Note that continuous data can be categorized; however, information will be lost.

The levels of the IV are mutually exclusive.

That is, each individual can fall into one, and only one category.

Each group has approximately the same distribution.

Although the Kruskal-Wallis test does not assume data are distributed normally, if the distribution for one level of the IV is skewed negatively and the other levels are skewed positively, the results produced by the test may be inaccurate.

The data do not include a large number of ties.

Tied values are given average ranks. Typically, if less than 25% of the data are ties, the test is unaffected.^{27}

The observations are independent.

That is, each participant provides one, and only one, observation (i.e., data or response).
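The omnibus-plus-*post hoc* workflow can be sketched with SciPy, assuming hypothetical 4-point Likert responses for three regions (illustrative values only; the cited study used more regions):

```python
from scipy import stats

# Hypothetical 4-point Likert responses for preceptors in three regions;
# illustrative values only (the cited study used more regions).
west = [1, 2, 1, 2, 2, 1, 2, 3]
midwest = [2, 3, 3, 2, 3, 4, 3, 3]
south = [3, 4, 4, 3, 4, 4, 3, 4]

# Omnibus Kruskal-Wallis test across the three groups
h_stat, p_value = stats.kruskal(west, midwest, south)
print(f"H = {h_stat:.2f}, p = {p_value:.4f}")

# If the omnibus test is significant, Mann-Whitney post hoc tests
# compare each two-group combination
if p_value < 0.05:
    pairs = [("West vs Midwest", west, midwest),
             ("West vs South", west, south),
             ("Midwest vs South", midwest, south)]
    for label, a, b in pairs:
        u, p = stats.mannwhitneyu(a, b, alternative="two-sided")
        print(f"{label}: U = {u}, p = {p:.4f}")
```

With three groups there are three pairwise *post hoc* tests; with the four regions of the example there would be six.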

The paired-samples t-test (also known as matched t-test or nested t-test) is used when one group of participants is measured twice or when two groups of participants are matched on specific characteristics. In both cases, the assumption of independence, or mutually exclusive groups, is violated. This statistical test is only appropriate for a continuous DV.

When one group of participants is measured twice, it is known as a repeated measures design. Repeatedly measuring participants is a valid method for reducing error and increasing statistical power, which requires fewer participants. The simplest repeated measures design is termed a pretest-posttest design. For example, consider measuring the therapeutic knowledge of 20 fourth-year pharmacy (P4) students prior to clinical rotations (i.e., pretest) and following rotations (i.e., posttest) to assess for increases in therapeutic knowledge. Therapeutic knowledge was measured using a discriminating 20-question test. The paired-samples t-test assesses for a statistically significant change in correct responses from pretest to posttest.

When two groups of participants are matched on specific characteristics, it is called a matched design. For example, when studying the effects of a new statin medication on hyperlipidemia, researchers would identify a group of patients to receive the statin and then identify a matched control group by matching individuals based on age, race, gender, BMI, and years with diagnosis. Note that the matched control group does not receive any medication. Matching participants serves the same purpose as repeated measures—reduce error variance—but is often more difficult because as the number of matching criteria increases the probability of finding a suitable match decreases.

Results of a statistically significant paired-samples t-test with no assumption violations using the pretest-posttest design example above is presented as follows:

*The results of a paired-samples t-test indicated a statistically significant difference in therapeutic knowledge between pretest and posttest scores (t*_{19} *= 3.25, p < 0.05). Therapeutic knowledge increased significantly following clinical rotations (mean = 10.4 correct responses at pretest versus a mean of 16.5 at posttest)*.

The assumptions of the paired-samples t-test include:

The DV is measured on an interval or ratio scale.

The two DV measurements are associated.

The sampling distribution of means for both DV measurements is normal.

This can be ensured by applying the central limit theorem.

Homogeneity of variance for both DV measurements is ensured.^{22}

For example, most studies do not provide the variance for each variable; however, the standard deviation is reported consistently. Remember, variance is simply the standard deviation squared. Thus, consider two variables with standard deviations of 5 and 10. The homogeneity of variance assumption can be tested by squaring the standard deviations (i.e., 5^{2} = 25 and 10^{2} = 100, respectively) and finding their ratio (i.e., 100/25 = 4). In this case, the ratio is less than 10:1; thus, the assumption is not violated.
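A sketch of the pretest-posttest analysis with SciPy, assuming hypothetical scores for 10 students (illustrative values, not the chapter's data):

```python
from scipy import stats

# Hypothetical pretest and posttest scores (correct responses out of 20)
# for 10 students; illustrative values, not the chapter's data.
pretest = [10, 12, 9, 11, 10, 13, 8, 10, 11, 9]
posttest = [16, 17, 15, 18, 15, 19, 14, 16, 17, 15]

# Paired-samples t-test: each student's two scores are linked,
# so the test operates on the within-student differences
t_stat, p_value = stats.ttest_rel(pretest, posttest)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```

With n pairs, the test carries n − 1 degrees of freedom (here 9; 19 in the 20-student example above).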

A one-way repeated measures ANOVA (aka, repeated measures ANOVA) is an extension of the paired-samples t-test to situations where the continuous DV is measured three or more times. Again, this can occur when the same participants are measured repeatedly or when three or more matched groups are measured once. A repeated measures ANOVA is used to indicate whether statistically significant change occurred between the repeated measurements.

For example, reconsider the pretest-posttest design described in the Paired-Samples t-Test section earlier in the chapter. Briefly, a researcher is interested in testing whether therapeutic knowledge of 20 pharmacy students in their last year of college changes before and after clinical rotations. To be applicable to repeated measures ANOVA, students would be tested on a third occasion 6 months after posttest to assess knowledge retention. That is, the design measures therapeutic knowledge at pretest, posttest, and 6-month follow-up. The repeated measures ANOVA is then used to test whether a statistically significant change occurred between the repeated measurements.

It should be noted that some researchers prefer to use repeated measures ANOVA over the paired-samples t-test when participants are measured only twice. This is appropriate, as repeated measures ANOVA can be used in any situation where a paired-samples t-test is appropriate, and the results will be identical. However, the test statistic from the repeated measures ANOVA will be an *F* value instead of the *t* value produced by the paired-samples t-test. This is a nonissue, though, as the *F* value in this situation is simply *t*^{2}.

In most cases, repeated measures ANOVA has more statistical power than a paired-samples t-test. This has been alluded to in the Analysis of Clinical Trials and Paired-Samples t-test sections earlier in the chapter. In general, increasing the number of repeated measures further reduces error, which allows for more precise measurement and decreases the overall probability of committing a Type II error (i.e., increases statistical power). With that said, increasing the number of repeated measurements has diminishing returns in statistical power. That is, for most studies, statistical power will increase drastically by adding a few additional repeated measurements, but the magnitude of this increase weakens rapidly between four and six measurements, with little to no increase in statistical power beyond the seventh measurement.^{39} Finally, if a study has more than 10 repeated measurements, a time series analysis may be more appropriate than repeated measures ANOVA.

In addition, a brief discussion of the key assumption of repeated measures ANOVA is useful, as a basic understanding of this assumption will assist in determining whether the test statistics produced from the analysis are correct. This key assumption, known as sphericity, states that the variances of the differences between repeated measurements are equal. For example, consider a study with three repeated measurements. For the sphericity assumption to be satisfied, the variance of the difference between the first and second measurements must be similar to the variance of the difference between the first and third, and between the second and third. This assumption tends to be restrictive, as measurements closer in time tend to have less variable differences compared to measurements further apart in time. Sphericity is a testable assumption using Mauchly’s test, and a violation can severely bias the statistical inference. Therefore, when reading a journal article, if the authors fail to provide information regarding the assurance or violation of the sphericity assumption, results and interpretations must be viewed with caution.

Briefly reconsider the example provided above, where 20 pharmacy students in their final year have therapeutic knowledge measured before clinical rotations (i.e., pretest), once immediately after rotations (i.e., posttest), and at a 6-month follow-up. That is, therapeutic knowledge is measured on three separate occasions. A statistically significant repeated measures ANOVA with no assumption violations will be presented as follows:

*The results of a one-way repeated measures ANOVA indicated a statistically significant difference in therapeutic knowledge between pretest, posttest, and 6-month follow-up (F*_{2,38} *= 9.87, p < 0.05)*.

With more than two repeated measurements, the one-way repeated measures ANOVA is an omnibus test. That is, the *F* test will identify whether a statistically significant difference exists between repeated measures, but will not indicate specifically which repeated measurements differ. Thus, *post hoc* tests, known as pairwise comparisons, are required. Similar to other analyses requiring *post hoc* tests, there are numerous adjusted pairwise comparisons available. Each type of pairwise comparison adjusts alpha differently, with some being more conservative than others. It may be simpler to think of these comparisons as a series of paired-samples t-tests with adjusted alpha values. That is, adjusted paired-samples t-tests comparing the first and second repeated measurements, the first and third, the second and third, and so on. The additional information required to present results of a statistically significant one-way repeated measures ANOVA with no assumption violations is presented as follows:

*The results of a one-way repeated measures ANOVA indicated a statistically significant difference in therapeutic knowledge between pretest, posttest, and 6-month follow-up (F*_{2,38} *= 9.87, p < 0.05). Results of the pairwise comparisons indicated a statistically significant increase in therapeutic knowledge from pretest to posttest (mean = 5.50 versus 15.90, respectively, p < 0.05). Further, no statistically significant difference was indicated from posttest to 6-month follow-up (mean = 15.90 versus 15.50, respectively) indicating therapeutic knowledge was retained for 6 months following clinical rotations*.

The assumptions of the one-way repeated measures ANOVA include:

The DV is measured on an interval or ratio scale.

All DV measurements are associated.

The sampling distribution of means for all DV measurements is normal.

This can be ensured by applying the central limit theorem.

Homogeneity of variance for all DV measurements is ensured.^{22}

For example, most studies do not provide the variance for each variable; however, the standard deviation is reported consistently. Remember, variance is simply the standard deviation squared. Thus, consider two variables with standard deviations of 5 and 10. The homogeneity of variance assumption can be tested by squaring the standard deviations (i.e., 5^{2} = 25 and 10^{2} = 100, respectively) and finding their ratio (i.e., 100/25 = 4). In this case, the ratio is less than 10:1; thus, the assumption is not violated.

Sphericity is ensured for designs with three or more repeated measurements.

This is a complex assumption discussed above. In general, sphericity is violated when the variance of the differences between measurements are not similar.
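One way to run this analysis is with statsmodels' `AnovaRM`, which expects long-format data (one row per subject per measurement). A sketch under that assumption, with illustrative scores for six hypothetical students measured on three occasions:

```python
import pandas as pd
from statsmodels.stats.anova import AnovaRM

# Hypothetical long-format data: 6 students each measured at pretest,
# posttest, and 6-month follow-up; all values are illustrative only.
data = pd.DataFrame({
    "student": [1, 2, 3, 4, 5, 6] * 3,
    "time": ["pre"] * 6 + ["post"] * 6 + ["follow"] * 6,
    "score": [5, 6, 4, 7, 5, 6,        # pretest
              15, 16, 14, 17, 15, 16,  # posttest
              15, 15, 14, 16, 15, 16], # 6-month follow-up
})

# One-way repeated measures ANOVA: 'subject' identifies the repeated unit
result = AnovaRM(data, depvar="score", subject="student",
                 within=["time"]).fit()
print(result.anova_table)
```

Note that `AnovaRM` does not test or correct for sphericity; that assumption must be checked separately (e.g., with Mauchly's test).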

A mixed between-within analysis of variance (also known as factorial ANOVA with repeated measures or split-plot ANOVA) is a combination of factorial between-groups ANOVA and repeated measures ANOVA. The mixed terminology highlights this combination, and indicates that the design considers two or more levels of the IV when a continuous DV is measured repeatedly. It is important to note that a mixed between-within ANOVA is qualitatively different from a mixed-effects analysis involving random effects (see Table 8–4). The simplest case is a 2 × 2 pretest-posttest design, using two mutually exclusive treatment groups measured on two separate occasions. The primary advantage of this analysis is that it allows researchers to assess the interaction effect (whether the two groups changed differently over time) in addition to the between-subjects main effect (the overall group difference irrespective of measurement occasion) and the within-subjects main effect (the overall change over time irrespective of group).

As an example, consider a study examining the effectiveness of a relatively new FDA-approved tricyclic antidepressant (TCA) compared to amitriptyline over a 12-week study period. The researcher hypothesizes that the new TCA is more effective than amitriptyline in reducing symptoms of clinical depression. Prior to initiating treatment, 20 patients with diagnosed clinical depression are measured on the Beck Depression Inventory II (BDI-II).^{40} Following this pretest or baseline measurement, each patient is randomized to receive one of two treatment options, the new TCA or amitriptyline, with 10 patients in each group. Patients then initiate the prescribed medication therapy and at the end of the 12-week study period BDI-II scores are measured again. A mixed between-within ANOVA provides researchers with separate *F* tests for the interaction effect, the between-groups main effect, and the within-groups main effect, each evaluated with its own degrees of freedom. Within this example, the primary effect of interest is the interaction effect, evaluating whether BDI-II scores changed differently within the new TCA group compared to the amitriptyline group from pretest to posttest.

Similar to the factorial ANOVA discussed above, only if the interaction effect is nonsignificant can the researcher evaluate the statistical significance of the overall group mean difference (the between-groups main effect) and the overall change in BDI-II scores (the within-groups main effect). That is, a statistically significant interaction effect indicates that the change in BDI-II scores from pretest to posttest differed between the group receiving the new TCA and the group receiving amitriptyline. Stated another way, the reduction in symptoms from pretest to posttest was dependent on whether the patient received the new TCA or amitriptyline. In this example, the researcher’s hypothesis would be supported by a statistically significant interaction effect; that is, the new TCA was more effective at reducing the symptoms associated with clinical depression compared to amitriptyline.

Although the example above was for a 2 (group: new TCA versus amitriptyline) × 2 (measurement: pretest versus posttest) design, a mixed between-within ANOVA can be used for a design with any number of IVs with any number of levels or repeated measures. This is often seen in the literature. For example, consider the pretest-posttest study evaluating the effectiveness of three treatment groups (e.g., new TCA, amitriptyline, and placebo) in reducing symptoms of clinical depression. This would be considered a 3 × 2 design. Or, consider the same study evaluating for additional gender differences within these three treatments. This would be considered a 3 × 2 × 2 design. It must be noted that as the number and levels of the IVs increase so does the complexity of interpreting results. Thus, extreme care must be taken when interpreting results and implementing suggestions supported by these types of designs. Consultation with an individual well versed in research methodology and statistical analysis is advised prior to implementing findings into an evidence-based practice.

Because a mixed between-within ANOVA is an extension of factorial between-groups ANOVA and repeated measures ANOVA, *post hoc* tests or pairwise comparisons may be required for any IV with three or more levels. That is, when there are three or more levels of the between-groups IV (e.g., new TCA, amitriptyline, and placebo), the between-groups main effect is an omnibus test. A statistically significant between-groups main effect indicates that a statistically significant difference in BDI-II scores exists, but does not indicate specifically which groups differ significantly. Further, with three or more repeated measures, a statistically significant within-groups main effect indicates a difference between repeated measures, but fails to indicate specifically which measurements differ significantly. The *post hoc* tests and pairwise comparisons for factorial between-groups ANOVA and repeated measures ANOVA, respectively, are also appropriate for a mixed between-within ANOVA.

The results of a 2 × 2 mixed between-within ANOVA with no assumption violations, for a nonsignificant interaction effect and statistically significant between- and within-groups main effects, are presented as follows:

*The results of a 2 (group: new TCA versus amitriptyline) × 2 (measurements: pretest versus posttest) mixed between-within ANOVA failed to indicate a statistically significant interaction (F*_{1,18} *= 1.215, p > 0.05). However, both main effects were statistically significant. Overall, patients receiving the new TCA had lower BDI-II scores compared to patients receiving amitriptyline (F*_{1,18} *= 27.97, p < 0.05; mean = 29.41 versus 47.50, respectively). Further, an overall decrease in depressive symptoms was indicated from pretest to posttest (F*_{1,18} *= 24.74, p < 0.05; mean = 49.86 versus 35.72, respectively)*.

With a statistically significant interaction effect, results of the mixed between-within ANOVA with no assumption violations are presented as follows:

*The results of a 2 (group: new TCA versus amitriptyline) × 2 (measurements: pretest versus posttest) mixed between-within ANOVA indicated a statistically significant interaction effect (F*_{1,18} *= 10.37, p < 0.05). Simple main effects were assessed to identify statistically significant treatment differences at pretest and posttest individually. Results indicated no statistically significant difference between the new TCA and amitriptyline at pretest (mean = 48.53 versus 50.23, respectively). However, at posttest, a statistically significant difference was indicated, with the new TCA having significantly lower BDI-II scores compared to amitriptyline (mean = 24.63 versus 47.53, respectively)*.

The assumptions of a mixed between-within ANOVA include:

The DV is measured on an interval or ratio scale.

All DV measurements are associated.

The sampling distribution of means at each level of the IV(s), collapsed across the repeated DV measurements, is normal.

This can be ensured by applying the central limit theorem.

Homogeneity of variance for all DV measurements within each level of the IV(s) is ensured.^{22}

For example, most studies do not provide the variance for each variable; however, the standard deviation is reported consistently. Remember, variance is simply the standard deviation squared. Thus, consider two variables with standard deviations of 5 and 10. The homogeneity of variance assumption can be tested by squaring the standard deviations (i.e., 5^{2} = 25 and 10^{2} = 100, respectively) and finding their ratio (i.e., 100/25 = 4). In this case, the ratio is less than 10:1; thus, the assumption is not violated.

Sphericity is ensured for designs with three or more repeated measurements.

This is a complex assumption discussed briefly above for one-way repeated measures ANOVA. In general, sphericity is violated when the variance of the differences between measurements are not similar.
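For the simplest 2 × 2 pretest-posttest case, the interaction *F* test is mathematically equivalent to an independent-samples t-test on each patient's change score (with *F* = *t*^{2}). A sketch of that equivalence, using hypothetical BDI-II values (illustrative only, not the study's data):

```python
from scipy import stats

# Hypothetical pretest/posttest BDI-II scores for two groups of 10
# patients; all values are illustrative, not the study's data.
new_tca_pre = [48, 50, 47, 52, 49, 51, 46, 50, 48, 49]
new_tca_post = [25, 27, 22, 28, 24, 26, 21, 25, 23, 25]
amitrip_pre = [50, 49, 51, 48, 52, 50, 49, 51, 50, 52]
amitrip_post = [47, 46, 49, 45, 50, 48, 46, 49, 47, 50]

# Change score (posttest - pretest) for each patient
tca_change = [post - pre for pre, post in zip(new_tca_pre, new_tca_post)]
ami_change = [post - pre for pre, post in zip(amitrip_pre, amitrip_post)]

# An independent-samples t-test on the change scores tests the
# group-by-time interaction; the mixed ANOVA's F equals t squared
t_stat, p_value = stats.ttest_ind(tca_change, ami_change)
print(f"t = {t_stat:.2f} (F = {t_stat**2:.2f}), p = {p_value:.4f}")
```

Designs with more groups or more measurement occasions require full mixed-ANOVA software rather than this shortcut.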

The Wilcoxon signed-rank test (also known as signed-rank test) is the nonparametric alternative to a paired-samples t-test. The test is used typically to assess for differences between two repeated measurements or two matched groups when distributional assumptions are violated or when the DV is measured on an ordinal scale. The signed-rank test is based on ranked difference scores (e.g., the difference between pretest and posttest): differences are ranked by absolute magnitude, with the largest difference receiving the highest rank, and the sign of each difference is then attached to its rank.

For example, consider a single-group pretest-posttest study evaluating the secondary effect of weight loss in pounds while on exenatide therapy in a sample of 20 Type 2 DM patients. Prior to initiating exenatide therapy, all patients are weighed (i.e., pretest). At the end of a 1-year study period, patients are weighed again (i.e., posttest). For this study, the DV has numerous outliers; thus, the signed-rank test is used in lieu of the paired-samples t-test to assess for a change in patient weight from pretest to posttest. The result of a statistically significant signed-rank test is presented as follows:

*The results of the signed-rank tests indicated a statistically significant decrease in body weight from pretest to posttest (z = 2.32, p < 0.05)*.

The assumptions of the signed-rank test include:

The DV measured repeatedly on an ordinal, interval, or ratio scale.

The two DV measurements are associated.

The two measurements come from populations with the same median.

There should not be a large number of difference scores equal to zero.

That is, the number of participants having no change (i.e., difference score of zero) should be low.
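The test is available in SciPy as `wilcoxon`, which takes the two paired measurements and works on their differences. A sketch with hypothetical body weights, including an outlier (illustrative values only):

```python
from scipy import stats

# Hypothetical body weights (lb) for 10 patients at pretest and
# posttest, including a heavy outlier; illustrative values only.
pretest = [210, 195, 250, 230, 188, 340, 205, 215, 199, 260]
posttest = [198, 190, 230, 221, 186, 300, 200, 204, 195, 245]

# Wilcoxon signed-rank test on the paired differences
w_stat, p_value = stats.wilcoxon(pretest, posttest)
print(f"W = {w_stat}, p = {p_value:.4f}")
```

Because the test ranks differences rather than using raw values, the outlier's large weight change does not dominate the result.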

The Friedman two-way ANOVA by ranks test (also known as Friedman’s test) is the nonparametric alternative to the one-way repeated measures ANOVA and is an extension of the signed-rank test to a situation with three or more repeated measurements. Similar to the other nonparametric tests, it is most often used when distributional assumptions are violated or the DV is measured on an ordinal scale. The Friedman test is based on ranked data, with higher scores receiving higher ranks, and is used to assess for statistically significant differences between repeated measurements.

For example, reconsider the exenatide example described in the Wilcoxon Signed-Ranks section above. Briefly, the example consisted of a single-group pretest-posttest study evaluating the secondary effect of weight loss in pounds while on exenatide therapy in a sample of 20 Type 2 DM patients. To extend this design to be applicable to Friedman’s test, consider a study where patients are weighed prior to initiating exenatide therapy, 6 months after initiation, and 1 year after initiation. Thus, each patient is weighed on three occasions. Because the distribution of weight has numerous outliers, Friedman’s test is employed. The results of a statistically significant Friedman’s test are presented as follows:

*The results of Friedman’s test indicated statistically significant differences in body weight between the three repeated measurements (*$\chi^2_2 = 15.21$*, p < 0.05)*.

Similar to the one-way repeated measures ANOVA, Friedman’s test is an omnibus test. That is, it assesses whether a statistically significant difference exists between the repeated measurements but does not indicate specifically which measurements differ significantly. Thus, a series of *post hoc* tests are required. For the example above, three signed-rank tests are required to test for differences between measurements: pretest versus 6 months, pretest versus 1 year, and 6 months versus 1 year. The results of a statistically significant Friedman’s test including the additional *post hoc* tests are presented as follows:

*The results of Friedman’s test indicated statistically significant differences between the three repeated measurements (*$\chi^2_2 = 15.21$*, p < 0.05). Post hoc signed-rank tests indicated a statistically significant decrease in body weight from pretest to 6-month follow-up (z = 2.65, p < 0.05), with no statistically significant difference between the 6-month and 1-year follow-up (z = 0.51, p > 0.05). Thus, results suggest weight loss occurred rapidly, within 6 months of initiating exenatide therapy, and was sustained through 1 year of therapy.*

The assumptions of the Friedman test include:

The DV is measured repeatedly on an ordinal, interval, or ratio scale.

All DV measurements are associated.

The DV measurements come from populations with the same median.
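SciPy provides Friedman's test as `friedmanchisquare`, one argument per repeated measurement. A sketch with hypothetical weights at three occasions (illustrative values only):

```python
from scipy import stats

# Hypothetical body weights (lb) for 8 patients at baseline, 6 months,
# and 1 year of therapy; illustrative values only.
baseline = [240, 210, 265, 228, 300, 215, 250, 236]
six_month = [222, 200, 245, 215, 270, 205, 232, 220]
one_year = [220, 199, 246, 214, 268, 203, 230, 221]

# Friedman test: ranks each patient's three weights within-patient,
# then compares the summed ranks across the three occasions
chi2, p_value = stats.friedmanchisquare(baseline, six_month, one_year)
print(f"chi-square = {chi2:.2f}, p = {p_value:.4f}")
```

A significant result would then be followed by pairwise signed-rank tests, as described above.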

The sign test is another nonparametric alternative to the paired-samples t-test. Similar to the signed-rank test, the sign test is used typically when distributional assumptions are violated or when the DV is measured on an ordinal scale. The sign test is used to assess for differences between two repeated measurements or two matched groups. It must be noted that the sign test is typically less powerful than the signed-rank test as the signed-rank test uses more information from the data to calculate the test statistic. Nevertheless, the sign test is presented here because it is seen in the literature; however, in most situations, the signed-rank test should have been used.

For example, consider a pretest-posttest study to evaluate change in BMI following a physical activity intervention in a sample of 20 third grade students. In this study, BMI was measured at the beginning of the school year (i.e., pretest) and again after the school year was complete (i.e., posttest). The sign test assesses whether statistically significant change occurred between the two measurements. The results of a statistically significant sign test are presented below. Notice that no test statistic is presented, only a *p* value.

*The results of the sign test indicated a statistically significant decrease in BMI from pretest to posttest (p < 0.05)*.

The assumptions of the sign test include:

The DV is measured repeatedly on an ordinal, interval, or ratio scale.

The two DV measurements are associated.
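SciPy has no dedicated sign test function, but the test reduces to a binomial test on the direction of the nonzero paired differences. A sketch under that reduction, with hypothetical BMI values (illustrative only):

```python
from scipy import stats

# Hypothetical BMI values for 12 students at pretest and posttest;
# illustrative values only.
pretest = [22.1, 24.5, 19.8, 26.0, 23.3, 21.7,
           25.2, 20.9, 23.8, 22.6, 24.1, 21.4]
posttest = [21.5, 23.8, 19.9, 25.1, 22.8, 21.0,
            24.6, 20.5, 23.0, 22.0, 23.5, 21.0]

# Drop zero differences, then count decreases in BMI
diffs = [post - pre for pre, post in zip(pretest, posttest) if post != pre]
n_decreases = sum(d < 0 for d in diffs)

# Under the null hypothesis, decreases occur with probability 0.5
result = stats.binomtest(n_decreases, n=len(diffs), p=0.5)
print(f"p = {result.pvalue:.4f}")
```

Note that only the sign of each difference is used, which is why the sign test discards more information, and therefore has less power, than the signed-rank test.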

The McNemar test of change (also known as McNemar’s test) is an extension of the chi-square and Fisher’s exact test when participants are measured on two separate occasions and assesses the statistical significance of observed changes between the two repeated measurements. McNemar’s test is only applicable to a DV measured on a nominal scale; however, continuous variables can be artificially dichotomized, with the understanding that information will be lost.^{27,41}

McNemar’s test is often used for pretest-posttest studies to assess change following an intervention or treatment. For example, consider a study to determine the effectiveness of an influenza vaccination across two consecutive flu seasons. At the beginning of the first flu season, 20 participants are randomized to receive either vaccination or placebo with ten in each group. At the end of the first flu season, participants are asked whether they were diagnosed with the flu or not (i.e., yes versus no). Then, at the beginning of the second flu season, participants who received the vaccination originally will receive placebo and those who received placebo originally will receive the vaccination. At the end of the second flu season, participants are asked again whether they were diagnosed with flu. McNemar’s test is used in this study to statistically test whether the flu vaccination was effective, where participants diagnosed with flu while taking placebo should not have developed flu with the vaccination. That is, the participant’s outcome changed depending on the treatment received.

A critically important caveat is that McNemar’s test only considers participants who changed between the two repeated measurements, with participants who did not change removed from analysis. Thus, if a researcher believes that change will be rare, the McNemar test may be inappropriate because the statistical power of this test may be extremely reduced due to the sample size decrease from removing participants who did not change.

Based on the example above, the results of a statistically significant McNemar test are presented below. Notice that no test statistic is provided, only the *p* value.

*The results of a statistically significant McNemar’s test indicated a statistically significant change between treatment and placebo (p < 0.05), with the vaccination significantly reducing influenza diagnoses compared to placebo*.

The assumptions of the McNemar test include:

The DV is measured repeatedly on a dichotomous scale.

The two DV measurements are associated.
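Because only the discordant pairs enter the calculation, the McNemar statistic is simple to compute by hand. The sketch below uses the common continuity-corrected form of the statistic; the discordant counts are invented for illustration, loosely patterned on the vaccination example.

```python
import math

def mcnemar_test(b, c):
    """McNemar chi-square for paired dichotomous data.
    b, c = counts of discordant pairs (participants who changed in each
    direction); concordant (unchanged) participants do not enter the test."""
    if b + c == 0:
        raise ValueError("no discordant pairs; the test is undefined")
    chi2 = (abs(b - c) - 1) ** 2 / (b + c)  # with continuity correction
    # Tail probability of a chi-square with 1 df: P(X > x) = erfc(sqrt(x/2))
    p = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p

# Hypothetical counts: 9 participants had flu on placebo but not on the
# vaccine, 1 had flu on the vaccine but not on placebo.
chi2, p = mcnemar_test(9, 1)
```

Note how the ten unchanged participants (had flu both seasons or neither season) would contribute nothing to the statistic, which is exactly why rare change can gut the test's power.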

The Cochran Q test (also known as Cochran’s Q) is an extension of McNemar’s test to situations where participants are measured repeatedly on three or more separate occasions.^{27} Similar to McNemar’s test, the DV must be measured repeatedly on a nominal scale.

In the biomedical literature, Cochran’s Q is often used to assess stability of a treatment over time or to compare the effectiveness of several treatments. For example, consider a study evaluating the effectiveness of the combination treatment sildenafil and psychotherapy in reducing symptoms of erectile dysfunction (ED).^{42} Eight patients with psychogenic ED attended weekly psychotherapy sessions and ingested 50 mg of sildenafil citrate orally as needed over a 6-month period. Symptoms of ED were assessed at baseline, 6 months (i.e., end of treatment), and at a 3-month posttreatment follow-up. For this study, Cochran’s Q was used to evaluate a change in the stage of remission for patients dichotomized into ED versus no ED. A statistically significant finding indicated change from baseline and the possibility of sustaining effects at posttreatment follow-up. That is, all patients were diagnosed with ED at baseline, thus a statistically significant change indicates patients experienced remission of ED symptoms at some point.

In the literature, the results of the Cochran Q test may be presented with a χ^{2} statistic and a *p* value; however, in other studies, the results may only present a *p* value. Although failing to include the test statistic provides less information, it does not necessarily damage the integrity of results. The result of a statistically significant Cochran’s Q for the example above is presented as follows:

*The results of Cochran’s Q indicated a statistically significant change in psychogenic ED symptoms from baseline (p < 0.05), suggesting the combination of psychotherapy and 50 mg sildenafil is effective in reducing ED symptoms*.

Because Cochran’s Q is used to assess change over three or more repeated measures, it is an omnibus test. That is, the test will determine whether a statistically significant change occurred between the repeated measurements, but will not indicate where the change occurred. Thus, *post hoc* tests are required. For Cochran’s Q, a series of McNemar tests comparing each repeated measurement serve as *post hoc* tests. In this example, three separate McNemar tests are required comparing baseline versus 6 months, baseline versus posttreatment follow-up, and 6 months versus posttreatment follow-up. The results of a statistically significant Cochran’s Q including results of *post hoc* McNemar tests are presented as follows:

*The results of the Cochran Q test indicated a statistically significant change in psychogenic ED symptoms from baseline (p < 0.05). Post hoc McNemar tests indicated statistically significant changes from baseline at 6 months as well as at posttreatment follow-up (all p < 0.05). These results suggest the combination of psychotherapy and sildenafil is effective in reducing ED symptoms and that this effect persisted up to 3 months following treatment completion.*

The assumptions of the Cochran Q test include:

The DV is measured repeatedly on a dichotomous scale.

All DV measurements are associated.
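The Q statistic itself is computed from the row and column totals of the 0/1 data. A minimal sketch, using an invented 8-patient dataset patterned after the ED example (1 = symptoms present, columns = baseline, 6 months, follow-up):

```python
import math

def cochrans_q(data):
    """Cochran's Q for k repeated dichotomous (0/1) measurements.
    data: one list per participant, each of length k.
    Q = (k-1) * [k * sum(Cj^2) - T^2] / (k*T - sum(Ri^2)),
    where Cj are column totals, Ri row totals, T the grand total."""
    k = len(data[0])
    col = [sum(row[j] for row in data) for j in range(k)]
    row_tot = [sum(row) for row in data]
    T = sum(row_tot)
    q = (k - 1) * (k * sum(c * c for c in col) - T * T) / (
        k * T - sum(r * r for r in row_tot))
    return q, k - 1  # statistic and degrees of freedom

# Hypothetical remission data for 8 patients (invented for illustration):
data = [
    [1, 0, 0], [1, 0, 0], [1, 0, 0], [1, 1, 0],
    [1, 0, 0], [1, 0, 1], [1, 0, 0], [1, 1, 1],
]
q, df = cochrans_q(data)
# For df = 2 the chi-square tail probability simplifies to exp(-q/2):
p = math.exp(-q / 2)
```

The statistic is then compared to a chi-square distribution with k − 1 degrees of freedom; a significant result would be followed by the pairwise McNemar *post hoc* tests described below.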

When exploring the association or relationship between two or more variables, two specific types of analyses are employed—correlation or regression. These analyses are applied to determine the magnitude and direction of an association or relationship. Correlation analysis indicates the co-relationship of two variables. That is, correlation describes how one variable changes in relation to another. It is critically important to note that correlation does not imply causation.

For example, consider the positive association between creatinine and BUN. In most cases, as creatinine increases so does BUN, but increasing creatinine does not cause BUN to increase or vice versa. Instead, the cause of the BUN and creatinine increase might be due to renal failure.

Regression analysis is a type of correlational analysis used to predict the value of one variable from the value of another variable. In this type of analysis, researchers attempt to determine the amount of variance in the DV that is explained by the IVs or covariates. Note that regression analysis can also permit multiple IVs and/or covariates measured on any scale. With the inclusion of multiple IVs or covariates this type of analysis is often referred to as multivariable regression analysis. While the definition of an IV and covariate is not always concrete, it is easier to think of an IV as the primary variable of interest and a covariate as a variable correlated with the DV, but not of specific interest.

For example, consider a multivariable analysis assessing the effect statin use (IV) had on all-cause mortality due to heart failure (DV) during a 5-year study period after statistically controlling for age, gender, race, comorbid conditions, and concurrent medications (covariates). The covariates may explain the reason a patient died, but are not of specific research interest.

This section begins by discussing bivariate, or two variable, techniques followed by multivariable techniques. The discussion in this section will progress in a similar fashion to the statistical tests already presented where parametric tests will be discussed first, followed by the nonparametric alternatives.

Pearson’s product-moment correlation (also known as Pearson’s correlation or Pearson’s *r*) is one of the most commonly used correlation measures. It measures the direction and strength of a linear relationship between two continuous variables. Pearson’s *r* ranges from −1 to +1, with an *r* of 0 indicating no relationship. That is, the correlation is stronger as it approaches −1 or 1. A positive correlation (e.g., 0.30 or 0.99) indicates that as the values of one variable increase so do the values of the other variable, while a negative correlation (e.g., −0.50 or −0.80) indicates that as the values of one variable increase, the values of the other variable decrease. Pearson’s *r* tests whether the correlation of two variables is different from 0 (i.e., no relationship); thus, a statistically significant *r* indicates the slope of the linear relationship is not horizontal.

For example, research has shown a moderate, but statistically significant, positive correlation (*r* = 0.25) between weight in kilograms and platelet count in men aged 20 to 55.^{43} Thus, in men, as weight increases, so does platelet count. However, the magnitude of this increase varies from person to person. That is, for any individual man, a 1-kg increase in body weight may indicate an increase in platelet count that is different from the platelet count increase for another man. It should be noted the value of *r* is dimensionless because it is based on standardized scores. That is, measuring weight in pounds or kilograms in the example above will not change the value of the correlation. Finally, be aware that *r* is substantially affected by outliers, so researchers must identify and remove them from analysis or use the nonparametric Spearman’s rho discussed below.

The relationship between two variables can be assessed visually by plotting the data on a scatterplot. In fact, this practice is highly recommended.^{5} The magnitude of the correlation is directly related to the strength of the linear relationship. Take a moment to consider Figures 8–13 and 8–14. In Figure 8–13, the value of Pearson’s *r* is approximately 1 indicating a near perfect relationship between the two variables. Notice the dots on the scatterplot lay near the best fit line and that the slope of this line is fairly steep. The positive correlation coefficient indicates that as the values of continuous variable 1 increase so do the values of continuous variable 2. In Figure 8–14, Pearson’s *r* is approximately 0. Here, notice the dots are scattered all over the plot with no real direction, and the best fit line is almost perfectly horizontal, which indicates no relationship between the two continuous variables.

To highlight the substantial effect outliers have on Pearson’s *r*, take a moment to draw an outlier on Figure 8–13, say, a value of 45 for continuous variable 1 and a value of 25 for continuous variable 2. Visualize what effect this outlier has on the previously strong positive correlation and how it would pull the best fit line toward horizontal significantly weakening the correlation.

The result of a statistically significant Pearson’s product-moment correlation with no assumption violations using the example above is presented as follows:

*The results of Pearson’s product-moment correlation indicated a statistically significant moderate positive relationship between body weight in kilograms and platelet count (r*_{81} *= 0.252, p < 0.05), with body weight increasing in concordance with platelet count*.

Assumptions of Pearson’s product-moment correlation include:

Both variables are measured on an interval or ratio scale.

There are no outliers.

The relationship between the variables is linear.

**Homoscedasticity** is ensured. The assumption states that the variability around the best fit line of the linear relationship is the same for all data. A violation of this assumption can be seen within a scatterplot. For example, consider a scatterplot where the lower values for a variable fall near the best fit line and higher values for this same variable fall far from the best fit line. In this situation, the variability around the line is not constant. The tenability of this assumption is difficult to ascertain in a journal article; thus, ensure the authors noted that it was tested.

The observations are independent.

That is, each participant provides one, and only one, observation (i.e., data or response) for each variable.
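Pearson’s *r* can be computed directly from the paired data. The sketch below (with invented data) also illustrates the point above that *r* is dimensionless: rescaling one variable, such as converting kilograms to pounds, leaves *r* unchanged.

```python
import math

def pearson_r(x, y):
    """Pearson product-moment correlation between two paired samples."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))  # co-deviation
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Invented paired data: as x rises, y tends to rise, so r is positive.
x = [1, 2, 3, 4, 5, 6]
y = [2, 1, 4, 3, 6, 5]
r = pearson_r(x, y)

# Rescaling x (e.g., kg -> lb) does not change the correlation:
r_scaled = pearson_r([a * 2.2046 for a in x], y)
```

Because the deviations in the numerator and denominator rescale identically, any linear change of units cancels out, which is exactly why weight in pounds or kilograms yields the same *r*.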

As stated in the previous section, the magnitude or size of the correlation is directly related to the strength of the linear relationship. The pattern of this linear relationship is typically indicated by the regression line, which is another name for the best fit line presented in Figures 8–13 and 8–14. The regression line is a best fit line describing how the continuous DV changes as values of the IV change. Similar to Pearson’s *r*, the statistical test in simple linear regression is whether the slope of the regression line is statistically different from zero, or, stated another way, whether the regression line has no slope or is horizontal. Simple linear regression, however, takes Pearson’s *r* one step further, where the regression line is used to predict values of the DV for a given value of the IV.^{21} This is incredibly useful to evidence-based practitioners looking to implement findings into their practice.

The algebraic linear regression equation is: *ŷ* = *a* + *bx*. Here, *a* is the intercept of the regression line with the *y*-axis, *b* is the slope of the regression line, *x* is the value of the IV, and *ŷ* (pronounced *y*-hat) is the predicted value of the DV. The intercept is interpreted as the predicted value of the DV when the value of the IV is zero. The slope is interpreted as the overall change in the DV for a one-unit increase in the IV. Further, it is important to note that the value of Pearson’s correlation between the DV and IV is incorporated into the mathematical equation for the slope.

As an example, consider the association between systolic blood pressure measured in millimeters of mercury (mm Hg) and height in centimeters for 100 children aged 5 to 7 years.^{44} The results of this study indicated a positive Pearson correlation between height and systolic blood pressure of 0.33. Moving beyond basic correlation, the researchers used a simple linear regression analysis to predict a child’s systolic blood pressure from their height. Results indicated an intercept value of 46.28 mm Hg and a slope of 0.48. Using these values, the linear regression equation would be: *ŷ* = 46.28 + (0.48∗Height). Using the interpretation of the intercept and slope provided above, the intercept value of 46.28 is the predicted systolic blood pressure for a child 0 cm tall, whereas the slope indicates that a 1-cm increase in height increases systolic blood pressure 0.48 mm Hg. From this interpretation, it is obvious that a child with a height of 0 cm is impossible. This is a prime example of the awareness readers must have when interpreting the intercept. That is, unless it makes theoretical sense to have a meaningful zero point for the IV, the interpretation of the intercept is never useful. This situation, however, should not suggest the intercept is meaningless to prediction. Instead, the intercept is simply a starting point for predicting the outcome. For example, say we want to predict systolic blood pressure for a child that is 115 cm tall; the linear regression equation becomes: *ŷ* = 46.28 + (0.48∗115), which equals 101.48 mm Hg.

It is important to note that the predicted values of the DV are rarely identical to the actual observed values. That is, a child 115 cm tall in this sample may actually have a systolic blood pressure of 105 mm Hg, but have a predicted value of 101.48 mm Hg. The difference between the actual and predicted scores is referred to as a **residual value** or residual error. For the example child above, the residual value is 3.52 mm Hg (i.e., 105 − 101.48). Residual values are always calculated for all participants included in the regression analysis, and one of the assumptions of linear regression is that the residual values follow a normal distribution; this is where the assumption of normality originates for all parametric statistical tests. The tenability of this assumption is a key indicator of the reliability of the results. Therefore, if a researcher fails to describe the distribution of residuals alongside their results, the study should be read and interpreted with caution.
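The fitting, prediction, and residual arithmetic can be sketched in a few lines. The fitting function below implements ordinary least squares, while the prediction reuses the chapter's reported coefficients (intercept 46.28, slope 0.48); the observed value of 105 mm Hg is the chapter's example child.

```python
def fit_line(x, y):
    """Ordinary least-squares fit: returns (intercept, slope) for
    y-hat = a + b*x, minimizing the sum of squared residuals."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / \
        sum((xi - mx) ** 2 for xi in x)
    a = my - b * mx
    return a, b

def predict(a, b, x):
    """Predicted DV value for a given IV value."""
    return a + b * x

# Prediction for a 115-cm child using the reported coefficients:
y_hat = predict(46.28, 0.48, 115)   # 101.48 mm Hg
# Residual for a child whose observed systolic BP is 105 mm Hg:
residual = 105 - y_hat              # 3.52 mm Hg
```

Collecting the residuals for every participant and checking their distribution for normality is how the assumption discussed above would be verified in practice.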

In addition, the interpretation of slope used in the above example is only appropriate for IVs measured on a continuous scale. If instead the IV is categorical, interpretation is slightly different. For example, consider replacing height in the example above with the dichotomous IV gender. In this situation, the researcher must specify which level of gender will serve as the reference or comparison category that is coded 0 for analysis. That is, specify which level of the IV the calculated slope represents. Note that every published study should indicate which group served as the reference category, and if the authors fail to provide a reference category, interpretation becomes impossible. Continuing, if females are specified as the reference category, the slope for gender provides the overall difference in predicted systolic blood pressure for males compared to females. Interpretation follows this logic. For example, using an example similar to the above, say a simple linear regression analysis indicated the intercept was 115.54 and the slope for gender was 15.20 with females considered the reference category. The new linear regression equation would be: *ŷ* = 115.54 + 15.20∗Male. Thus, the intercept value of 115.54 now represents the average systolic blood pressure for a woman (i.e., when Male = 0) and the slope indicates that the predicted value of systolic blood pressure will be 15.20 mm Hg higher for a man (i.e., when Male = 1) compared to a woman. Admittedly, these interpretations can be confusing, but understanding this concept is critically important to proper interpretation of study results.
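For a dummy-coded IV, the slope is simply the difference between the two groups' predicted values, which a two-line check makes concrete using the coefficients above (female as the reference category coded 0):

```python
def predict_sbp(male):
    """y-hat = 115.54 + 15.20 * Male, with female (Male = 0) as reference."""
    return 115.54 + 15.20 * male

# Predicted values for each group; their difference equals the slope.
diff = predict_sbp(1) - predict_sbp(0)   # 15.20 mm Hg higher for males
```

Setting the dummy to 0 recovers the intercept (the reference group's predicted value), and setting it to 1 adds exactly one slope's worth of change, which is why the slope reads as a group difference.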

The primary test used in simple linear regression is an omnibus between-groups ANOVA. It may seem esoteric, but linear regression and ANOVA are mathematically equivalent. The omnibus ANOVA provides an *F* test indicating whether the IV explains a statistically significant amount of variance in the DV. Stated another way, the omnibus test determines whether the IV reliably predicts the DV. Only if the ANOVA is statistically significant is the slope of the individual IV interpreted. Most research studies will provide the results of the ANOVA prior to presenting the slope of the IV. Further, the ANOVA results presented in simple linear regression will be presented identically to the one-way between-groups ANOVA examples discussed above. The statistical significance of the IV will most often be presented with the regression slope and potentially an associated *t* value. The slope is critical to proper interpretation of any regression analysis; thus, if the slope is not presented in the narrative portion or in a table, complete interpretation of the regression analysis is impossible and the study is essentially useless.

Finally, the amount of variance in the DV explained by the IV must also be considered. That is, how much of the reason why a participant has a particular value on the DV is attributable to their IV value. As a side note, the word “explained” should not and does not imply causality, as causality in correlational studies is extremely difficult to determine. The amount of variance explained in simple linear regression is quantified by an effect size estimate termed the coefficient of determination. This coefficient is calculated by squaring Pearson’s *r* between the IV and DV (i.e., *r*^{2}). In the literature it will often be referred to simply as *r*^{2}, or identically as *R*^{2}. Note that how this value is referred to in a study is dependent on the researcher, but regardless of whether *r*^{2} or *R*^{2} is reported, their values will be identical and, thus, interpreted identically. The coefficient will often be presented as a proportion, ranging from 0 to 1, with higher values indicating more reliable prediction. From the example above, remember the correlation between a child’s height and systolic blood pressure was 0.33. Thus, approximately 0.11 (i.e., 0.33^{2}) of the variance in a child’s measured systolic blood pressure can be explained by the child’s height. Said another way, approximately 11% of the reason a child has a particular systolic blood pressure value is explained by his or her height.
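The arithmetic behind that 11% figure is worth making explicit: squaring the correlation gives the proportion of variance explained.

```python
r = 0.33                        # reported correlation: height vs. systolic BP
r_squared = r ** 2              # coefficient of determination, ~0.1089
percent_explained = 100 * r_squared   # ~11% of variance explained by height
```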

In the literature, the result of a statistically significant simple linear regression analysis with no assumption violations will be presented as follows:

*The results of a simple linear regression analysis indicated a child’s height significantly predicts systolic blood pressure (F*_{1,98} *= 12.03, p < 0.05, r*^{2} *= 0.11), with a 1-cm increase in height resulting in a 0.48 mm Hg increase in systolic blood pressure*.

The assumptions of simple linear regression include:

The DV is measured on an interval or ratio scale.

There are no outliers.

The relationship between the DV and IV is linear.

Homoscedasticity is ensured.

The assumption states that the variability around the regression line is the same for all data. A violation of this assumption can be seen within a scatterplot. For example, consider a scatterplot where the lower values for a variable fall near the regression line and higher values for this same variable fall far from the regression line. In this situation, the variability around the regression line is not constant. The tenability of this assumption is difficult to ascertain in a journal article; thus, ensure the authors noted that it was tested.

Residuals are distributed normally.

Residuals are the difference between the actual and predicted values. Authors will need to state that they tested this assumption.

The observations are independent.

That is, each participant provides one, and only one, observation (i.e., data or response) for each variable.

Multivariable linear regression (also known as multiple linear regression) is an extension of simple linear regression for designs with one continuous DV and multiple IVs or covariates measured on any scale. Remember, an IV is defined as an explanatory variable of specific research interest, while a covariate is a nuisance variable that is significantly associated with the DV but not of specific research interest. That is, covariates are typically included because they are related to the DV or because previous research has indicated they are important. In multiple linear regression, IVs can be any combination of continuous or discrete variables (e.g., height, gender). Further, this analysis is often a better option than simple linear regression because the inclusion of additional IVs often explains a higher percentage of variance in the DV. That is, multiple linear regression produces higher *R*^{2} values.

For example, previous research has shown a statistically significant negative Pearson’s correlation between serum 25-hydroxyvitamin D (25OHD; ng/mL) and serum parathyroid hormone (PTH; pg/mL) of −0.28.^{45} Based on this correlation, the percentage of variance in serum PTH explained by serum 25OHD is 8% (i.e., −0.28^{2}). That is, 8% of the reason a participant has a predicted serum PTH value is due to their serum 25OHD level. In an effort to increase this percentage of variance explained, a new study is designed to determine the effect serum 25OHD has on serum PTH after statistically adjusting for age, BMI, total calcium intake, and serum creatinine. Thus, a multiple linear regression analysis will be used to determine whether there is an effect of serum 25OHD on serum PTH over and above the effect of the covariates. That is, multiple linear regression assesses the unique correlation between 25OHD and serum PTH after removing the effects already accounted for by age, BMI, total calcium intake, and serum creatinine.

The percentage of variance explained in multiple linear regression is always referred to as *R*^{2}, where the *R* indicates the multiple correlation. That is, *R* is the multivariate extension of Pearson’s *r* and is defined as the combined or total correlation between all IVs and the DV. Unlike Pearson’s *r*, however, *R* ranges from 0 to 1, with 0 indicating no relationship. Thus, as *R* approaches 1, the association between the IVs and DV becomes stronger. In addition, most studies will also present an adjusted *R*^{2} value, sometimes labeled *R*^{2}_{adj} in the literature. Adjusted *R*^{2} is interpreted exactly the same as *R*^{2}, but it is adjusted for the sample size used in the study. When reading a study, comparing *R*^{2} and adjusted *R*^{2} is incredibly useful to interpretation, as large differences between the two indicate significant issues with the analysis, such as inadequate sample size, which essentially render the regression model useless and not generalizable to the population. Finally, it should be noted a multiple linear regression model will never explain 100% of the variance in the DV. However, do not disregard studies reporting low values of *R*^{2} because the definition of what constitutes a large *R*^{2} value varies by research arena. That is, lower *R*^{2} values are expected when using human participants because measurement error is usually high. For example, consider a study using participant self-reported daily calorie intake. Large *R*^{2} values are expected for bench research studies because in well-conducted bench research measurement error is typically not an issue. For example, think about a biomedical research study using analytic chemistry.
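The standard adjustment formula shows why the gap between *R*^{2} and adjusted *R*^{2} widens as sample size shrinks relative to the number of predictors. A sketch with invented values:

```python
def adjusted_r2(r2, n, k):
    """Adjusted R^2 for a model with k predictors fit to n observations:
    1 - (1 - R^2) * (n - 1) / (n - k - 1)."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Hypothetical model: R^2 = 0.20 with 5 predictors.
large_n = adjusted_r2(0.20, 478, 5)   # large sample: barely changes
small_n = adjusted_r2(0.20, 12, 5)    # tiny sample: shrinks sharply
```

With 478 observations the adjustment is trivial (about 0.19), but with only 12 observations and 5 predictors the adjusted value collapses below zero, flagging exactly the inadequate-sample-size problem described above.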

In general, the results of a multiple linear regression model are interpreted in an almost identical fashion to simple linear regression. As a result, please review the section on simple linear regression if necessary. Similar to simple linear regression, multiple regression produces a regression equation allowing for prediction of DV values based on the *y*-intercept and the slope of the regression line for each IV or covariate. Briefly, the intercept is the predicted value of the DV when all values of the IVs and covariates are 0, while the slope quantifies the change in the predicted DV with a one-unit increase in the IV or covariates. Again, the regression equation is incredibly useful to evidence-based practitioners looking to implement findings into their practice. For example, consider the multiple regression example above, where serum PTH was predicted from 25OHD and a set of covariates. Based on the regression equation from this study, the practitioner can provide the patient with empirical evidence regarding which variables (i.e., age, BMI, total calcium intake, serum creatinine, and serum 25OHD) to increase or decrease, if possible, in an effort to optimize serum PTH levels and increase bone production.

The overall test of the multiple linear regression model is an omnibus between-groups ANOVA, which indicates at least one of the IVs or covariates significantly predicts the DV. However, this omnibus test fails to indicate which IVs or covariates significantly predict the DV. Thus, the statistical test for each IV or covariate is considered. Each of the statistical tests for the IVs and covariates can be considered similar to a *post hoc* test; however, unlike the *post hoc* tests discussed for ANOVA-type models above, alpha remains unadjusted. In most studies, the results of the individual IVs or covariates are presented as regression slope or *t* values. Note that regardless of which result an author presents, the *p* values will be identical. Further, interpretation of the slopes for the individual coefficients is also slightly different compared to simple linear regression. In multiple linear regression, interpretation of a particular IV or covariate is statistically adjusted for all other IVs and covariates in the regression model similar to ANCOVA.

An example may help clarify this information. Consider a situation where the result of a statistically significant multiple linear regression analysis based on the example above indicates 25OHD has a statistically significant slope of −1.5 pg/mL. Because the analysis is multivariable, this slope must be interpreted considering all covariates included in the model. Thus, the slope of −1.5 pg/mL indicates that after adjusting for age, BMI, total calcium intake, and serum creatinine, a 1-ng/mL increase in 25OHD decreases predicted serum PTH 1.5 pg/mL.

Based on the example, the results of a statistically significant multiple linear regression analysis with no assumption violations are presented below. Notice the effects of statistically significant covariates (i.e., BMI and serum creatinine) are also described; however, authors will vary on which covariates, if any, they choose to interpret.

*The results of a multivariable linear regression analysis indicated age, BMI, total calcium intake, serum creatinine, and 25OHD significantly predicted serum PTH (F*_{5,472} *= 21.82, p < 0.05, adjusted R*^{2} *= 0.18). After adjusting for covariates, 25OHD significantly predicted serum PTH (slope =*−*1.5, p < 0.05). That is, with all else held constant, a 1-ng/mL increase in serum 25OHD resulted in a 1.5-pg/mL decrease in serum PTH. Regarding the individual covariates, after adjustment, increases in BMI and serum creatinine (slope = 0.75 and 2.12, respectively, both p < 0.05) resulted in higher serum PTH levels. Finally, after adjustment, age and total calcium intake were not associated with serum PTH*.
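The "all else held constant" interpretation of the 25OHD slope can be verified numerically. In the sketch below, only the slopes −1.5 (25OHD), 0.75 (BMI), and 2.12 (creatinine) come from the results above; the intercept, the remaining slopes, and the patient values are invented for illustration.

```python
# Hypothetical fitted model: y-hat = intercept + sum(slope * predictor).
# Only the 25ohd, bmi, and creatinine slopes echo the example results;
# everything else is made up.
coef = {"intercept": 40.0, "age": 0.10, "bmi": 0.75,
        "calcium": -0.002, "creatinine": 2.12, "25ohd": -1.5}

def predict_pth(x):
    """Predicted serum PTH from the linear combination of predictors."""
    return coef["intercept"] + sum(coef[k] * v for k, v in x.items())

patient = {"age": 60, "bmi": 28, "calcium": 1000,
           "creatinine": 1.1, "25ohd": 20}
p1 = predict_pth(patient)
# Raise 25OHD by exactly 1 ng/mL, holding every other predictor constant:
p2 = predict_pth({**patient, "25ohd": 21})
```

The difference p1 − p2 is exactly 1.5 pg/mL regardless of the other predictor values, which is precisely what the adjusted slope promises.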

The assumptions of multiple linear regression include:

The DV is measured on an interval or ratio scale.

Absence of multicollinearity is ensured.

That is, no Pearson’s *r* between any IVs and covariates should be greater than 0.90, as correlations this high indicate the variables are redundant. In other words, high correlations indicate the variables may be measuring the same construct. Including redundant variables will significantly bias results. Most published studies provide Pearson correlations between the DV, IVs, and covariates; thus, a violation of this assumption is easy to identify.

There are no outliers.

The relationship between the DV and IVs and between the DV and covariates is linear.

Homoscedasticity is ensured.

The assumption states that the variability around the regression line is the same for all data. A violation of this assumption can be seen within a scatterplot. For example, consider a scatterplot where the lower values for a variable fall near the regression line and higher values for this same variable fall far from the regression line. In this situation, the variability around the regression line is not constant. The tenability of this assumption is difficult to ascertain in a journal article; thus, ensure the authors noted that it was tested.

Residuals are distributed normally.

Residuals are the difference between the actual and predicted values.

The observations are independent.

That is, each participant provides one, and only one, observation (i.e., data or response) for each variable.

The Spearman rank-order correlation coefficient (*r*_{s}), also known as Spearman’s rho (ρ), is the nonparametric alternative to Pearson’s *r*. This correlation is used when two continuous variables have outliers, when the variables are measured on an ordinal scale, or when the relationship is nonlinear. Similar to the other nonparametric statistical tests discussed previously, this correlation is based on rank ordered data as opposed to the actual values. The value of *r*_{s} ranges between −1 and 1, with 0 indicating no association. Thus, as *r*_{s} approaches −1 or 1, the association between the two variables becomes stronger. A statistically significant *r*_{s} indicates that the association is significantly different from 0.

As an example, consider a study assessing the association between triglyceride content and lag time in LDL oxidation in a sample of 18 renal transplant patients.^{46} Spearman’s rank-order correlation was used in this study because outliers were identified in the sample, and because the sample was small, removing outliers was not a viable option. The study found a statistically significant negative *r*_{s} of −0.502, suggesting that as triglyceride content increased, the lag time in LDL oxidation decreased.

Based on the example above, the result of a statistically significant Spearman rank-order correlation coefficient is presented as follows:

*The Spearman rank-order correlation analysis was employed in lieu of Pearson’s correlation due to the presence of outliers and the small sample size. Results indicated a statistically significant negative association between triglyceride content and lag time in LDL oxidation (r*_{s} *=*−*0.502, p < 0.05), which suggests increases in triglyceride content translate into decreases in the lag time of LDL oxidation*.

The assumptions of the Spearman rank-order correlation coefficient include:

The two variables are measured on an ordinal, interval, or ratio scale.

The relationship between the two rank-ordered variables is linear.

Although the relationship between the two variables based on their actual values may be nonlinear, the relationship based on rank-ordered data must be linear. This assumption typically cannot be tested by what the authors provide in the narrative. Thus, when authors fail to indicate whether the assumption was tested results should be viewed with caution.

The observations are independent.
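Because Spearman's rho depends only on ranks, an extreme outlier that preserves the ordering of the data does not change it. The sketch below uses the classic rank-difference formula, which is valid when there are no ties; all data are invented.

```python
def spearman_rho(x, y):
    """Spearman's rho via the rank-difference formula,
    rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1)),
    where d is the difference in ranks for each pair. No ties allowed."""
    n = len(x)
    rank = lambda v: [sorted(v).index(a) + 1 for a in v]  # 1-based ranks
    rx, ry = rank(x), rank(y)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

# An extreme outlier (1000) that keeps the ordering intact leaves rho at 1,
# because only the ranks enter the calculation:
rho = spearman_rho([1, 2, 3, 4, 1000], [2, 4, 5, 8, 9])
```

This is exactly why Spearman's rho suits the renal transplant example above: the outliers influence Pearson's *r* heavily but barely perturb the ranks.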

The interpretation of a logistic regression analysis is similar to linear regression; thus, a basic understanding of the interpretation of simple and multiple linear regression is extremely useful. Please revisit the sections discussing simple and multivariable linear regression, as much of the material discussed here is simply an extension of the material described in detail previously.

Logistic regression is used when the DV is measured on a dichotomous scale and the relationship between the DV and IV is nonlinear. This analysis is ubiquitous in the biomedical sciences. It is important to note that in any logistic regression analysis, the measurement scale of the DV is always considered unordered. That is, the dichotomous DV is always considered to be measured on a nominal scale. While the logistic regression analyses discussed here are only for a dichotomous DV, an extension of logistic regression is available for a categorical DV with three or more categories. This analysis is termed multinomial logistic regression, with the definition of multinomial being multiple nominal categories. Interpretation of results from this analysis is similar to the analyses discussed in this section and interested readers are encouraged to consider the suggested readings at the end of the chapter.

As an example of a design requiring a simple logistic regression analysis, consider a 5-year study designed to assess the effect that the duration of statin use, measured as percentage of time on any statin during the study period, has on all-cause mortality in a sample of Veterans Administration patients previously experiencing congestive heart failure. Note that the DV is dichotomous (i.e., dead versus alive). Further, a simple logistic regression analysis can always be extended to a multivariable analysis by including additional IVs and covariates in an effort to explain more of the reason why patients experienced the outcome of interest. For example, consider a multivariable extension to the study above where the effect that duration of statin use has on all-cause mortality is assessed after controlling for age, race, gender, concurrent medications, and comorbid conditions.

For all logistic regression models, researchers must choose a reference category for the DV. When identified, the reference category is used as a comparison group for the primary outcome of interest. In most situations, the reference category is the category determined by the researcher to be of less specific interest. For example, consider all-cause mortality, a dichotomous DV (i.e., dead versus alive). Most studies are interested in the individuals who died; essentially the researcher wants to identify the primary reasons for death. Thus, with patients alive at the end of the study period considered the reference category, all regression slopes are calculated for patients who died compared to patients who lived. Thus, the first step in properly interpreting the results of logistic regression analysis is to identify the primary outcome of interest and the reference category within the DV. It should be noted that most authors will not explicitly identify the primary outcome of interest or the reference category; however, this information can be obtained easily as all results and interpretations are typically written in relation to the primary outcome of interest.

Similar to linear regression, a logistic regression analysis provides a regression equation that can be used to predict the probability of experiencing the primary outcome of interest. Briefly, the regression equation contains a *y*-intercept and slope values for all IVs included in the analysis. This equation is interpreted slightly differently from linear regression because the association between the DV and IV is nonlinear. That is, because probability is bounded between 0 and 1, the slope has to essentially shut off at these bounds. However, the usefulness of the equation is the same. That is, the equation can be used to assist evidence-based practitioners in instructing patients regarding what changes need to be made to optimize or prevent a specific outcome.
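The bounded, nonlinear relationship described above can be illustrated with the inverse-logit transform: the regression equation produces a log-odds value, which is then converted to a probability that can never escape the 0–1 range. The intercept and slope below are hypothetical values for the statin example, not estimates from a real study.

```python
import math

def predicted_probability(intercept, slopes, values):
    """Probability of the outcome from a logistic regression equation.

    The linear predictor (intercept plus the sum of slope * value terms)
    is a log-odds; the inverse-logit transform bounds the result in (0, 1).
    """
    log_odds = intercept + sum(b * x for b, x in zip(slopes, values))
    return 1 / (1 + math.exp(-log_odds))

# Hypothetical coefficients: intercept 1.5, slope -0.25 per 1% increase
# in duration of statin use
p_low  = predicted_probability(1.5, [-0.25], [10])  # on a statin 10% of the time
p_high = predicted_probability(1.5, [-0.25], [90])  # on a statin 90% of the time
print(p_low, p_high)  # probability of dying falls as statin use increases
```

Note how the same one-unit change in the IV produces a large probability change near the middle of the curve but almost none near the bounds, which is the "shutting off" behavior the text describes.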

The primary statistical test in logistic regression determines whether the logistic model including all IVs or covariates better predicts the probability of experiencing the outcome of interest compared to the model with no IVs or covariates. That is, a logistic regression analysis determines whether the IVs significantly predict the primary outcome of interest. Similar to linear regression, this overall test is an omnibus chi-square test. If this omnibus chi-square test is statistically significant, the statistical significance of each IV or covariate is assessed and interpreted. These tests of individual predictors can be thought of as *post hoc* tests, and typically no adjustment is made to alpha. When interpreting the results for individual predictors, authors typically provide two values, the slope and odds ratio.

The slope is interpreted similar to linear regression; however, slopes in logistic regression indicate changes in the **log-odds** (also known as **logits**) of experiencing the outcome of interest. That is, a one-unit increase in the IV indicates a change in the predicted log-odds of experiencing the primary outcome of interest. While a full description of log-odds or logits is beyond the scope of this chapter, they can be thought of simply as a linear transformation of probability. That is, after transformation, log-odds or logits are linear; thus, their values can be interpreted similarly to slopes in linear regression. For example, reconsider the 5-year study assessing the effect that duration of statin use, measured as percentage of time on any statin during the study period, has on all-cause mortality. Say the slope for statin use is −0.25. With death considered the primary outcome of interest and alive serving as the reference category, this slope suggests that a 1% increase in statin use during the study period resulted in a 0.25 unit decrease in the log-odds of dying. Note this was a decrease because the slope was negative. Based on this interpretation, a common question is, what does a 0.25 unit decrease in log-odds mean? While the slope is integral in producing the regression equation, the interpretation in log-odds can be fairly convoluted. Thus, logistic regression provides an alternative value that some individuals find easier to interpret—the odds ratio.

The odds ratio produced by a logistic regression analysis is calculated and interpreted similarly to the odds ratios discussed in the Epidemiological Statistics section earlier in the chapter. Briefly, odds ratios range from 0 to infinity, with 1 indicating no association. Therefore, an odds ratio above 1 indicates an increase in the odds of experiencing the primary outcome of interest, whereas an odds ratio below 1 indicates a decrease in the odds of experiencing the primary outcome of interest. For example, reconsider the example above with a slope of −0.25. The associated odds ratio for this slope is 0.78. Because the odds ratio is below 1, a 1% increase in statin use is associated with a 22% (i.e., 1 − 0.78 = 0.22) decrease in the odds of dying during the study period.

It is important to be aware that authors will vary the information they present in journal articles, as one article may only provide slopes as log-odds and another article may provide odds ratios. This is not an issue, however, because there is a direct mathematical relationship between slopes and odds ratios. That is, the slope is the natural log of the odds ratio (i.e., ln 0.78 = −0.25) and the odds ratio is simply the exponentiated slope (i.e., e^{−0.25} = 0.78). It should become clear that this mathematical relationship is where the definition of log-odds originates; they are literally the log of the odds. Given this relationship, if an author provides log-odds, the odds ratio can be easily calculated to ease interpretation. Also, note that regardless of which value the authors present, the associated *p* value will be identical. That is, a statistically significant slope will have a statistically significant odds ratio, and vice versa.

Finally, similar to linear regression, the primary reason a researcher includes additional IVs and covariates in a multivariable logistic regression model is to increase the amount of variance explained in the primary outcome of interest. That is, multivariable logistic regression models aim to better identify the reason why participants experienced the outcome of interest. However, unlike *R*^{2} in linear regression, there is no accepted measure for quantifying explained variance in logistic regression. While a detailed description regarding why is beyond the scope of this chapter, it has to do with the fixed variance of the logistic distribution used to model the residual values. However, be aware that several pseudo-*R*^{2} values may be presented in the literature, with the most common including the Nagelkerke *R*^{2} and the Cox and Snell *R*^{2}. These pseudo-*R*^{2} values are used to approximate *R*^{2} from linear regression and are interpreted in similar fashion. For example, reconsider the 5-year study assessing the effect of statin use on all-cause mortality. Say the Nagelkerke *R*^{2} value from the logistic regression analysis was 18%. This value indicates that 18% of the reason why patients died was explained by their statin use. Again, it is important to note that these pseudo-*R*^{2} values will never be near 100%, and the definition of a large or small pseudo-*R*^{2} value is determined by the specific research arena, as discussed in the section on multivariable linear regression.

Based on the example described above, the result of a simple logistic regression analysis is presented as follows:

*The results of a simple logistic regression analysis indicated a statistically significant association between duration of statin use and all-cause mortality (*$\chi^2_1 = 13.65$*, p < 0.05), where a 1% increase in duration of statin use resulted in a 22% decrease in the odds of dying during the study period.*

An example of a multivariable logistic regression analysis, with the addition of age, race, gender, concurrent medications, and comorbid conditions as covariates is presented below. Notice in this example, the researcher is not interested in the individual effects of the covariates, as they are not interpreted.

*The results of a multivariable logistic regression analysis indicated a statistically significant overall association between the variables as a set and all-cause mortality (*$\chi^2_1 = 156.02$*, p < 0.05). After controlling for age, race, gender, concurrent medications, and comorbid conditions, duration of statin use significantly predicted all-cause mortality (OR = 0.62, p < 0.05). Thus, holding all variables constant, a 1% increase in statin use resulted in a 38% decrease in the odds of dying during the study period.*

The assumptions of logistic regression include:

The DV is discrete with mutually exclusive categories.

The sample size is large.

A very rudimentary rule is to have at least 50 participants per variable included in the model. A large sample is required so that the parameters (e.g., slopes, standard errors) are estimated accurately.^{47}

Using this rule, a model with 10 IVs or covariates requires a minimum of 500 participants (i.e., 10 × 50 = 500).
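As a sketch, the rule of thumb reduces to simple multiplication; the function below is illustrative only and the 50-per-variable figure is the rudimentary rule quoted in the text, not a universal standard:

```python
def minimum_sample_size(n_predictors, per_variable=50):
    """Rudimentary rule of thumb: at least 50 participants per model variable."""
    return n_predictors * per_variable

print(minimum_sample_size(10))  # → 500 participants for 10 IVs or covariates
```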

Adequacy of expected frequencies is ensured.

This assumption applies only to categorical IVs and is the same assumption as the chi-square test. That is, no more than 20% of cells can have expected frequencies less than 5. This is a difficult assumption to verify, outside of calculating the expected frequencies by hand. Thus, if an article fails to indicate this assumption was tested, view results with caution. However, most studies using logistic regression will have large sample sizes and this assumption is rarely violated.

Linearity in the logit is ensured.

Remember, logit and log-odds are synonyms. This is a convoluted assumption, and difficult to explain without getting into mathematical detail. However, the logit is defined as the linear transformation of probability. This assumption is tested by determining whether the relationship between continuous IVs or covariates and the DV is linear. This is a key assumption, and a violation severely biases results. Thus, if authors do not mention the assumption was ensured in the methods or results sections, view results with caution.

Absence of multicollinearity is ensured.

That is, no Pearson’s *r* between any continuous IVs and covariates should be greater than 0.90, as correlations this high indicate the variables are redundant. In other words, high correlations indicate the variables may be measuring the same facet. Including redundant variables will significantly bias results.
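A minimal screen for this assumption simply checks every pair of continuous predictors against the 0.90 cutoff. The covariate data below are hypothetical, and `age_months` is deliberately constructed to be redundant with `age`:

```python
from itertools import combinations

def pearson(x, y):
    """Pearson's r computed from first principles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def redundant_pairs(predictors, cutoff=0.90):
    """Flag pairs of continuous predictors whose |r| exceeds the cutoff."""
    return [(a, b) for (a, xs), (b, ys) in combinations(predictors.items(), 2)
            if abs(pearson(xs, ys)) > cutoff]

# Hypothetical continuous covariates
data = {
    "age":        [40, 52, 61, 47, 70, 55],
    "age_months": [480, 624, 732, 564, 840, 660],  # age * 12: perfectly redundant
    "bmi":        [22, 31, 27, 25, 30, 24],
}
print(redundant_pairs(data))  # only the age / age_months pair is flagged
```

In practice one member of each flagged pair would be dropped before fitting the model.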

There are no outliers.

The observations are independent.

Survival analysis (or failure analysis) consists of three of the most commonly used statistical techniques in the biomedical sciences—life tables, the Kaplan-Meier method, and Cox proportional-hazards model or Cox regression. Survival analysis is concerned with time-to-event data; that is, the time to experience an outcome of interest. For example, consider a study designed to examine whether a new hormone therapy, in comparison to chemotherapy, prolongs remission in women previously diagnosed with breast cancer over a 20-week study period.

In survival analysis, participants who do not experience the outcome of interest are considered to survive, while those who experience the event are considered to fail. Although this is fairly grim terminology, survival and failure do not necessarily imply living or dying. For example, in the example above failure was defined as breast cancer recurrence, not death. In survival analysis, the outcome will typically be dichotomous, patients do not need to enter the study at the same time because enrollment can be continuous, and patients are followed until the study ends, which is known as end of follow-up.

Patients who do not experience the event by the end of follow-up, who are lost to follow-up, or who drop out of the study are termed censored. The key advantage survival analysis has over the analyses presented above, particularly logistic regression, is it can handle censored data, which is essentially incomplete data. That is, censored data are considered incomplete because the researcher does not know when or if participants experienced the event. Survival analysis can handle censored data because the DV is time, which is a continuous variable allowed to vary for each participant. Thus, as long as a participant has a survival time indicated, they are included in analysis. The key point here is that survival time for a censored participant is the time until they were censored, whether that is end of follow-up or whether they left the study for reasons other than suffering the event. Thus, all participants who entered the study are included in analysis regardless of whether they experienced the event or are censored because the analysis only considers the time they were in the study.

Life tables present time-to-event data in table format. That is, a life table tabulates the time that has elapsed until an event is experienced. Life tables are used to indicate the proportion of individuals surviving or not experiencing the event based on fixed or varying time intervals.

For example, reconsider the example above evaluating the effectiveness of the new hormone therapy in preventing recurrence of breast cancer. A life table allows the researcher to tabulate the cumulative proportion of women who do not have a recurrence of breast cancer at any interval, whether it is 6 months, 1 year, or 3 years.

As stated in the introduction to this section, life tables can be based on fixed or varying time intervals. Life tables based on fixed time intervals have a significant weakness, as they do not consider the exact time within the specified interval when the patient experienced the event. Thus, depending on the length of the time interval, a great amount of information could be lost regarding the exact time the event was experienced. That is, the longer the time interval, the less precise a researcher can be regarding the exact moment the event occurred. Thankfully, better methods have emerged, particularly the Kaplan-Meier method.

The Kaplan-Meier method is the most widely used estimator of survival time in the biomedical sciences.^{48} The Kaplan-Meier method is an extension of the life table using varying time intervals. Using this method, the cumulative proportion surviving is recalculated every time an event occurs.^{49}

As an example, instead of assessing the total number of women who have a breast cancer recurrence at fixed intervals of 6 months, the Kaplan-Meier method recalculates the proportion of women surviving every time a woman in the study has a recurrence. In most studies, Kaplan-Meier data are presented as a graph of cumulative survival over the study period known as a Kaplan-Meier curve. The Kaplan-Meier curve presents either a survival or a hazard function, where the survival function is the cumulative frequency of participants not experiencing the event, whereas the hazard function is the cumulative frequency of participants experiencing the event. A Kaplan-Meier curve is provided in the vast majority of published literature using survival analysis and will appear similar to the survival curve presented in Figure 8–15. Note that for ease of interpretation the curve in Figure 8–15 only presents data for the sample of women initiating the new hormone therapy and does not include data for women initiating chemotherapy. Notice the *y*-axis indicates the cumulative proportion of women surviving, while the *x*-axis indicates the total number of weeks of the study. The solid line represents the survival function, that is, the cumulative proportion of women not experiencing the event at any given time. The survival function steps down each time a woman experiences the event. For example, by week 5, two women have experienced the event, indicated by the two steps in the survival function. Further, notice the vertical dashes throughout the survival function. These dashes indicate individual women who were censored, that is, women who dropped out of the study for reasons other than experiencing the event, such as side effects, or who were simply lost to follow-up by moving out of the area. Few studies provide explicit information regarding censored participants on the Kaplan-Meier curve because most survival analyses involve large samples.
Thus, the dashes in Figure 8–15 are usually omitted from the curve; however, frequency counts of censored participants are always provided within the narrative or in table format.
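The product-limit calculation the Kaplan-Meier method performs can be sketched in a few lines: every time an event occurs, the cumulative survival is multiplied by the fraction of the current risk set that survived, while censored participants simply leave the risk set. The follow-up times below are hypothetical.

```python
def kaplan_meier(times, events):
    """Product-limit (Kaplan-Meier) estimate of the survival function.

    times  -- follow-up time for each participant (e.g., weeks)
    events -- 1 if the event occurred at that time, 0 if censored
    Returns (event time, cumulative survival) pairs; survival is
    recalculated only when an event occurs.
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    survival, curve = 1.0, []
    for t, occurred in data:
        if occurred:  # an event: the curve steps down
            survival *= (n_at_risk - 1) / n_at_risk
            curve.append((t, round(survival, 3)))
        n_at_risk -= 1  # events and censored cases both leave the risk set
    return curve

# Hypothetical follow-up for six women (weeks; event 0 = censored)
times  = [2, 3, 5, 8, 11, 14]
events = [1, 0, 1, 1, 0, 1]
print(kaplan_meier(times, events))
# → [(2, 0.833), (5, 0.625), (8, 0.417), (14, 0.0)]
```

Notice how the censored women at weeks 3 and 11 contribute their time at risk without forcing a step in the curve, which is precisely the advantage over analyses that must discard incomplete data.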

The Kaplan-Meier curve can also be presented for multiple groups. For example, consider the overall survival of women initiating the new hormone therapy compared to women initiating chemotherapy. Figure 8–16 presents a Kaplan-Meier curve where the survival functions for both treatment groups are presented simultaneously. Interpretation of these survival functions is identical to the methods described above for Figure 8–15. However, with two or more treatment groups, a statistical test must be conducted to assess for a statistically significant group difference in survival rate. The most common test in the biomedical literature is the log-rank test (also known as the Mantel-Cox test).^{49} A statistically significant log-rank test indicates there is a significant difference in survival rate between the groups. However, it is important to note that a Kaplan-Meier curve and associated log-rank test have no way of indicating why the breast cancer recurred beyond the possibility of the therapy being ineffective.

Based on the survival functions presented in Figure 8–16, the results of a statistically significant log-rank test are presented as follows:

*Based on Kaplan-Meier curves, the results of the log-rank test indicated a statistically significant difference in duration of breast cancer remission between the new hormone therapy and chemotherapy groups (*$\chi^2_1 = 5.68$*, p < 0.05), with women initiating the new hormone therapy experiencing significantly longer breast cancer remission*.

It should be noted that when more than two treatment groups are being compared the log-rank test is an omnibus test. That is, with three or more groups, a statistically significant log-rank test will indicate that a statistically significant difference exists between groups, but will not indicate specifically which groups differ. Thus, *post hoc* log-rank tests are required to determine where a significant difference in survival occurred.

For example, consider the addition of another treatment group to the breast cancer example, say, women who do not want to initiate any therapy. Following a statistically significant omnibus log-rank test, three *post hoc* log-rank tests would be required to determine whether statistically significant differences in survival rate occurred between hormone therapy versus chemotherapy, hormone therapy versus no therapy, and chemotherapy versus no therapy.
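The set of required *post hoc* comparisons is simply every pair of groups, which is why three groups generate three tests:

```python
from itertools import combinations

groups = ["hormone therapy", "chemotherapy", "no therapy"]

# Each pair of groups requires its own post hoc log-rank test
pairs = list(combinations(groups, 2))
print(len(pairs), pairs)
# → 3 pairwise tests: hormone vs. chemo, hormone vs. none, chemo vs. none
```

More generally, k groups require k × (k − 1) / 2 pairwise tests, so the number of comparisons grows quickly as groups are added.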

The results of statistically significant *post hoc* tests are presented as follows:

*Based on Kaplan-Meier curves, the results of the log-rank test indicated a statistically significant difference in breast cancer recurrence between the three treatment groups (*$\chi^2_2 = 9.76$*, p < 0.05). Post hoc log-rank tests indicated women initiating the new hormone therapy experienced significantly longer breast cancer remission compared to women receiving chemotherapy or women choosing not to receive therapy (both p < 0.05). No statistically significant difference was indicated between women initiating chemotherapy and women receiving no therapy*.

Although the Kaplan-Meier method is effective in assessing for overall differences in survival, the analysis is unable to identify the association between covariates and survival. That is, the Kaplan-Meier method cannot identify whether the IV significantly predicts survival, and for many studies prediction is a far more important consideration. Thus, a form of regression analysis is required. The Cox proportional hazards model (also known as Cox regression) is a semiparametric method used to predict the time to experience an event of interest. Note that the DV is time-to-event, the event is usually a dichotomous variable, and the analysis is considered semiparametric because it includes both parametric and nonparametric components. In addition, all predictor variables in a Cox regression are termed covariates. That is, in the literature, authors will not identify a distinction between IVs and covariates. Finally, it should also be noted that most published studies progress from Kaplan-Meier curves and log-rank tests to Cox regression analysis. That is, the Kaplan-Meier curve will first present the survival functions for the covariate of interest as well as associated log-rank tests and then authors will present the results of a Cox regression assessing for the relationship between covariates and the event of interest.

The interpretation of Cox regression can be considered a combination of linear and logistic regression; however, the primary difference is that in Cox regression results are considered time dependent. That is, Cox regression is concerned with the time-dependent risk of experiencing the event instead of the overall occurrence of events as in logistic regression. Remember, the DV is time. For example, consider a study designed to assess the effect duration of statin use, measured as percentage of time on any statin during the study period, has on all-cause mortality in a sample of Veterans Administration patients previously experiencing congestive heart failure. If the researchers were interested in the effect duration of statin use had on prolonging the time until death during the study period, a Cox regression analysis is the analysis of choice. Again, the primary consideration in Cox regression is time-to-event, not the overall probability of the event as in logistic regression.

Similar to linear and logistic regression, Cox regression can be simple or multivariable. For example, in the example above a simple Cox regression analysis was required because only duration of statin use was used to predict the risk of death. However, if the study was extended to statistically control for age, gender, race, comorbid conditions, and concurrent medications, a multivariable Cox regression is appropriate. That is, a multivariable model assesses the effect duration of statin use had on the risk of death over and above the effects of the other covariates.

Similar to logistic regression analysis, within the DV the researcher must choose the primary event of interest and the associated reference category. When identified, the reference category is used as a comparison group for the event of interest. In most situations, the reference category is typically the category determined by the researcher to be of less specific interest. For example, consider all-cause mortality, a dichotomous outcome variable, in which most studies are interested in the individuals who died. Thus, patients who lived are usually considered the reference category and all regression slopes are calculated for patients who died compared to patients who lived. Therefore, the first step in properly interpreting the results of a Cox regression analysis is to identify the event and the reference category. If an author does not explicitly state the reference category, this information can be obtained easily as all results and interpretations are typically written in relation to the primary event of interest.

It is critically important to note that in the literature authors will report using one of two different Cox regression analyses—with or without time varying covariates. The decision of which model to use is based on whether the proportionality of hazards assumption was violated. This assumption typically applies to all categorical covariates and states that, although events can begin to occur at any time during the study period, when events do begin, the rate at which events occur between levels of a categorical covariate must remain constant over time. That is, when events begin to occur, the survival functions for the groups must be the same or roughly parallel.

For example, reconsider Figure 8–16. Here, the proportionality of hazards assumption is not violated because events occur at approximately the same rate in each group. Notice women initiating the new hormone therapy did not begin experiencing breast cancer recurrence until week 4, as indicated by no steps in the curve until week 4, whereas women in the chemotherapy group began experiencing recurrence at week 1, as steps occurred immediately. However, when events began to occur in either group, they occurred at approximately the same rate. That is, the slopes of the survival functions are roughly parallel. Thus, the proportionality of hazards assumption is not violated and the treatment group covariate is assumed to have constant survival rates over time. More specifically, in this situation a Cox regression without time varying covariates is appropriate.

By contrast, a violation of the proportionality assumption is provided in Figure 8–17. Notice that in this figure, events began occurring at roughly around weeks 3 and 4 within both treatment groups. However, the survival functions are drastically different, and in fact intersect twice. In general, any time survival functions intersect, the proportionality of hazards assumption can be considered violated, because the rate of survival within each group varies across time. In this situation, Cox regression with time varying covariates would be required.

In conclusion, when reading a journal article, take careful notice of the Kaplan-Meier curves presented prior to the Cox regression analysis. Clear violation of this assumption is apparent any time survival functions intersect. If a violation is observed, identify whether the appropriate Cox regression analysis was employed. That is, if a violation is indicated, Cox regression with time-varying covariates must be used. If Cox regression without time varying covariates was used in the presence of a violation, results and interpretations are extremely misleading.

Similar to the other forms of regression discussed, Cox regression allows researchers to produce a regression equation useful in determining the overall risk score for a patient based on specific characteristics. The primary difference in Cox regression, however, is that there is no *y*-intercept; the baseline hazard function plays that role. Thus, overall risk is calculated simply by using the slopes for the covariates. Even without the intercept, however, this regression equation remains useful for evidence-based practitioners when consulting their patients on changes required to decrease the risk of experiencing an unfavorable event.
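A sketch of how a risk score is formed from the covariate slopes alone: the exponentiated sum of slope × covariate terms gives the patient's hazard relative to the unspecified baseline. The slope values below are hypothetical.

```python
import math

def relative_hazard(slopes, values):
    """Hazard relative to the baseline: exp(sum of slope * covariate terms).

    Cox regression has no y-intercept; the unspecified baseline hazard
    plays that role, so only the covariate terms enter the score.
    """
    return math.exp(sum(b * x for b, x in zip(slopes, values)))

# Hypothetical slopes: -0.65 for the new hormone therapy (vs. chemotherapy),
# 0.03 per year of age
hormone_60 = relative_hazard([-0.65, 0.03], [1, 60])  # hormone therapy, age 60
chemo_60   = relative_hazard([-0.65, 0.03], [0, 60])  # chemotherapy, age 60
print(round(hormone_60, 2), round(chemo_60, 2))
print(round(hormone_60 / chemo_60, 2))  # their ratio recovers the hazard ratio
```

Dividing the two scores cancels every shared covariate term, which is why the ratio of risk scores for two otherwise identical patients equals the hazard ratio for the covariate on which they differ.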

Identical to logistic regression, the overall statistical test in Cox regression is whether the model including the covariates predicts the time elapsed prior to experiencing the event significantly better than the model with no covariates. That is, a Cox regression analysis determines whether the covariates significantly predict the time elapsed prior to experiencing the primary event of interest. This overall test is an omnibus chi-square test. Only if this omnibus chi-square test is statistically significant does the researcher evaluate the statistical significance of each covariate. The tests of individual predictors are usually more important and these will be presented in all studies using a Cox regression model. Tests of individual predictors will be presented as regression slopes and/or hazard ratios.

In Cox regression, slopes are interpreted similar to logistic regression. However, for this analysis, a one-unit increase in a covariate results in an increase or decrease in the log-hazard or log-risk of experiencing the event. While a full description of log-risk is beyond the scope of this chapter, they can be thought of simply as a linear transformation of probability of experiencing the event. That is, after transformation, log-risks are linear and their values can be interpreted similarly to slopes in linear regression. For example, say the slope for women initiating the new hormone therapy was −0.65. That is, women initiating chemotherapy served as the reference or comparison group. Thus, the slope represents a 0.65 unit decrease in the log-risk of breast cancer recurrence for the new hormone therapy compared to chemotherapy. A 0.65 unit decrease in log-hazard is difficult to explain and beyond the scope of this chapter; thus, the presentation and interpretation of hazard ratios are often a useful alternative.

The hazard ratio produced by a Cox regression analysis is calculated and interpreted similarly to the relative risk discussed in the Epidemiological Statistics section above. However, it is important to note that a hazard ratio is not identical to relative risk. Briefly, hazard ratios range from 0 to infinity, with 1 indicating no association. Therefore, a hazard ratio above 1 indicates an increase in the risk of experiencing the event, whereas a hazard ratio below 1 indicates a decrease in the risk of experiencing the event. For example, reconsider the example above where the slope for the new hormone therapy was −0.65. The associated hazard ratio for this slope is 0.52. Because the hazard ratio is below 1, initiating the new hormone therapy resulted in a 48% (i.e., 1 − 0.52 = 0.48) decrease in the risk of breast cancer recurrence compared to chemotherapy.

It is important to consider that authors may vary the information they present in journal articles, as one study may only provide slopes and another study may only provide hazard ratios. This is not an issue because there is a direct mathematical relationship between slopes and hazard ratios. That is, the slope is simply the natural log of the hazard ratio (i.e., ln 0.52 = −0.65), whereas the hazard ratio is simply the exponentiated slope (i.e., e^{−0.65} = 0.52). It should become clear that this mathematical relationship is where the definition of log-risk originates; they are literally the log of the risk. Thus, if an author only provides slopes, the hazard ratio can be easily calculated to ease interpretation. Also, note that regardless of which value the authors present, the associated *p* value will be identical. That is, a statistically significant slope will have a statistically significant hazard ratio, and vice versa.
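The hazard-ratio round trip mirrors the odds-ratio calculation exactly, so a reader can move between the two reported forms with one call in either direction:

```python
import math

slope = -0.65                   # log-hazard change for hormone therapy vs. chemotherapy
hazard_ratio = math.exp(slope)  # the hazard ratio is the exponentiated slope

print(round(hazard_ratio, 2))      # → 0.52
print(round(math.log(0.52), 2))    # → -0.65 (the slope is the log of the hazard)
print(round(1 - hazard_ratio, 2))  # → 0.48, i.e., a 48% decrease in risk
```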

Finally, no pseudo-*R*^{2} exists for Cox regression to indicate proportion of variance explained. Although several have been suggested, they are not interpreted as the percentage of variance explained, and these values do not indicate how much of the overall reason why a participant experienced the event is accounted for by the covariates.^{46} Thus, when reading a journal article, do not be discouraged by authors failing to provide this information.

Based on the breast cancer example above, a statistically significant multivariable Cox regression analysis is presented below. Note that this Results section will typically be presented in addition to the results of the Kaplan-Meier analysis above.

*No violation of the proportionality of hazards assumption was indicated; thus, a Cox regression without time-dependent covariates was conducted to assess the effectiveness of a new hormone therapy compared to chemotherapy in preventing breast cancer recurrence after adjusting for age, concurrent medications, and comorbid conditions. Results indicated the covariates, as a set, significantly predicted time to breast cancer recurrence (*$\chi_{8}^{2} = 63.12$*, p < 0.05). Holding age, concurrent medications, and comorbid conditions constant, women initiating the new hormone therapy experienced a 48% decrease in the risk of breast cancer recurrence compared to women initiating chemotherapy*.
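A reported omnibus statistic like the one above can be sanity-checked by hand. For even degrees of freedom, the chi-square upper-tail probability has a closed form (a finite Poisson sum), so the *p* value for a statistic of 63.12 on 8 degrees of freedom can be computed without a statistics library. This is a generic check of the reported significance, not part of the study's own analysis:

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability P(X > x) for a chi-square variable with EVEN df,
    using the closed form sf = exp(-x/2) * sum_{i=0}^{df/2 - 1} (x/2)^i / i!."""
    assert df % 2 == 0, "this closed form applies to even df only"
    half = x / 2.0
    term, total = 1.0, 1.0
    for i in range(1, df // 2):
        term *= half / i  # builds (x/2)^i / i! incrementally
        total += term
    return math.exp(-half) * total

# Reported test statistic from the example Results section: chi-square(8) = 63.12
p_value = chi2_sf(63.12, 8)
print(p_value)  # far smaller than 0.05, consistent with the reported significance
```

The resulting *p* value is many orders of magnitude below 0.05, consistent with the significance reported in the Results paragraph.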

The assumptions of Cox regression include:

Time is measured on an interval or ratio scale.

Sample size must be large.

A very rudimentary rule is to have at least 50 participants per variable included in the model.^{45} A large sample is required so that the parameters (e.g., slopes, standard errors) are estimated accurately. Using this rule, a model with 10 IVs and covariates requires a minimum of 500 participants (i.e., 10 × 50 = 500).

Proportionality of hazards is ensured.

The survival functions for all categorical covariates must be similar.

No differences between withdrawn and remaining cases exist.

Because Cox regression can handle censored data, participants who are lost to follow-up must not differ from those whose outcome is known. That is, participants who dropped out of the study must not differ systematically from those who completed it. For example, this assumption would be violated if women dropped out of the new hormone therapy group because they experienced unbearable side effects that did not occur in the chemotherapy group.

Absence of multicollinearity is ensured.

That is, no Pearson’s *r* between any continuous covariates should be greater than 0.90, as correlations this high indicate the variables are redundant and may be measuring the same facet. Including redundant variables will significantly bias results.

There are no outliers.

Observations are independent.
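The multicollinearity screen described in the assumptions above (no Pearson's *r* above 0.90 between continuous covariates) is straightforward to automate. A minimal sketch with NumPy, using made-up covariate values; the covariate names and data are purely illustrative, not drawn from any study:

```python
import numpy as np

# Hypothetical covariate data for 10 participants (illustrative values only)
age = np.array([55, 60, 47, 52, 68, 71, 49, 58, 63, 66], dtype=float)
# Nearly collinear with age by construction, to trigger the screen
years_since_dx = age - 50 + np.array([0.1, -0.2, 0.0, 0.3, -0.1, 0.2, 0.0, -0.3, 0.1, 0.0])
num_meds = np.array([2, 5, 1, 6, 4, 3, 2, 6, 5, 1], dtype=float)

covariates = {"age": age, "years_since_dx": years_since_dx, "num_meds": num_meds}

def redundant_pairs(covs, threshold=0.90):
    """Flag covariate pairs whose Pearson's r exceeds the threshold in magnitude."""
    names = list(covs)
    flagged = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            r = np.corrcoef(covs[names[i]], covs[names[j]])[0, 1]
            if abs(r) > threshold:
                flagged.append((names[i], names[j], round(float(r), 3)))
    return flagged

print(redundant_pairs(covariates))  # flags the age / years_since_dx pair
```

Here only the deliberately collinear pair is flagged; in practice, a flagged pair suggests dropping one of the two covariates before fitting the model.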

A thorough understanding of statistical methods is integral to effectively evaluating medical literature. Being cognizant of the effect study design has on results, interpretation, and generalization is incredibly important to implementing evidence-based practice. Statistical analyses are simply a piece of the puzzle when evaluating literature, since the study design and research question determine the appropriate analyses. It must be noted that publication alone does not make a study a quality study. Further, all research has flaws, some trivial, others significant. A reader’s task is to determine whether these flaws prevent the research from being credible.^{50} Thus, when reviewing an empirical study, the following steps must be considered carefully:

Thoughtfully consider the study design (see Chapters 4 and 5). This includes, but is not limited to, the theory, specific research question(s), randomization, sample characteristics, data collection methods, variables, and outcomes. A poor design will lead to inaccurate or biased estimates, leading to an inferior study.

Evaluate the statistical test. Is it appropriate for the research question? Did the author test for assumption violations? If no assumption tests are stated, can violations be determined from the descriptive statistics provided? Was the test interpreted properly? Were effect size (i.e., clinical significance) measures provided?

Evaluate the discussion section. Are the results interpreted within the context of the sample and population? Are generalizations accurate? What were the limitations? How does the study lend itself to future research?

Finally, to reiterate, this chapter is by no means exhaustive of all statistical tests, nor does it provide a complete overview of statistical tests. Interested readers are encouraged to consult any of the suggested readings below for a more thorough treatment of the topics discussed.


The suggested readings below are for interested readers to gain further insight into some of the topics covered in this chapter. Note that most of the information provided in the first half of this chapter can be obtained in any introductory statistics textbook.

Epidemiological Statistics

Clinical Significance and Effect Size

Study Design and Randomized Controlled Trials

Nonparametric Statistical Tests

ANOVA Designs

Linear and Logistic Regression, Survival Analysis

Clustering and Nesting