Detecting statistical errors in veterinary research

Susan Shott, PhD, Statistical Communications, PO Box 671, Harvard, IL 60033.

Why It Matters

Statistics can make or break the validity of a study. If the wrong statistical methods are used, the results can be misleading at best and nonsense at worst. Unfortunately, readers cannot assume that reports in veterinary journals are based on accurate statistics. Peer review is no guarantee of statistical or scientific accuracy. One critic of the peer review process noted that “there are so many recent reports of failures of the peer-review system that the difficulty is to select the most instructive.”1 Most of the reports published in veterinary journals are not reviewed by statisticians, and veterinary reviewers cannot always determine whether appropriate statistics were used. For these reasons, veterinarians need to critically evaluate the statistics in the reports they read.

This may appear to be an impossible (and repellent) task. However, many statistical issues are much simpler than they appear. A reader who knows how to apply a few basic statistical concepts can detect most of the major statistical errors in veterinary reports. Four common statistical errors that invalidate studies will be discussed here:

  • Application of statistical methods that require normally distributed data to nonnormal data.

  • Analysis of nonindependent data as if they were independent.

  • Treatment of incomplete follow-up as equivalent to complete follow-up.

  • Misinterpretation of results from small samples.

Misanalysis of Nonnormal Data

Many statistical methods require that data have statistically normal distributions. Using these methods to analyze extremely nonnormal data is a serious and common mistake. Normal distributions are defined by a specific mathematical formula (the Gaussian function), which will not be described here. These distributions can be recognized by their shape, which resembles a bell (Figure 1). This is why normal distributions are also called bell-shaped curves.

Figure 1—Examples of normal data distributions.

Citation: Journal of the American Veterinary Medical Association 238, 3; 10.2460/javma.238.3.305

Normal distributions are continuous, which means that there are no gaps between numbers. For example, sheep weight is continuous, since a continuum of weights between any 2 possible weights can also occur. The number of kittens in a litter is not continuous, since it takes only the values 1, 2, 3, and so on. Numbers like 6.5 or 8.1 are not possible in such situations, so there are gaps. Data with gaps are called discrete. Discrete data cannot be normally distributed, whereas continuous data may or may not be normally distributed.

Although the term normality suggests that normal distributions are the norm, they are quite atypical in veterinary medicine. To add to the confusion, statistical normality has nothing to do with veterinary normality. For example, WBC counts of clinically normal calves can have nonnormal distributions.

Veterinary data typically have skewed distributions, in which most of the data are bunched up at one end with a tail at the other end. The direction of the skewness is determined by the tail. If the tail is on the right, the distribution is right skewed (Figure 2). If the tail is on the left, the distribution is left skewed (Figure 3).
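
The direction of skewness can also be checked numerically. The following is a minimal sketch, assuming a simple moment-based skewness coefficient (positive for a right tail, negative for a left tail); the function name and the scores are hypothetical and are not taken from any study.

```python
import statistics

def sample_skewness(values):
    """Moment-based skewness: mean cubed deviation divided by the SD cubed."""
    n = len(values)
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)  # population SD (divides by n)
    return sum((x - mean) ** 3 for x in values) / (n * sd ** 3)

# Hypothetical scores: most values bunched at the low end, with a tail on the right.
right_skewed = [0, 0, 0, 1, 1, 1, 2, 2, 3, 9]
# Mirror image: most values bunched at the high end, with a tail on the left.
left_skewed = [9 - x for x in right_skewed]

print(sample_skewness(right_skewed))  # positive: right skewed
print(sample_skewness(left_skewed))   # negative: left skewed
```

A coefficient near zero indicates a roughly symmetric distribution, but it does not by itself show that the data are normal.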

Figure 2—Examples of right-skewed data distributions.


Figure 3—Examples of left-skewed data distributions.


Right-skewed data are far more common in veterinary medicine than are left-skewed data. For example, a graph of degenerative joint disease scores for 91 dogs after surgery for patellar luxation has an extremely non-normal distribution that is right skewed (Figure 4).2 Many other measurements, such as liver enzyme activities, blood hormone concentrations, pain scores, and age at first total hip joint replacement, also have skewed distributions. It is a serious mistake to use statistical methods that require normality to analyze such data.

Figure 4—Degenerative joint disease scores for 91 dogs after surgery for patellar luxation (adapted from Linney WR, Hammer DL, Shott S. Surgical treatment of medial patellar luxation without femoral trochlear groove deepening in 91 dogs. J Am Vet Med Assoc 2011;238:in press).


Statistical tests that require normality include t tests, ANOVA, Pearson correlation, and least squares regression. These tests are examples of parametric statistical methods, which require data with a specific type of distribution (usually normal). When these tests are used to analyze extremely nonnormal data, the researcher might as well get his or her statistical results from a random number generator.

Nonnormally distributed data should be analyzed with nonparametric statistical methods, which do not require normal data or any other specific type of distribution. Nonparametric statistical tests include the Mann-Whitney test, Kruskal-Wallis test, Friedman test, and Spearman correlation. These methods do not assume that the data are not normal. They simply do not require normality. Although they have somewhat less statistical power than parametric methods when data are normally distributed, they usually work quite well.
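
To illustrate why rank-based methods do not depend on the shape of the distribution, the sketch below computes the U statistic that underlies the Mann-Whitney test. It is an illustration only: the function name and the data are hypothetical, and a P value would still have to come from statistical tables or software.

```python
def mann_whitney_u(group_a, group_b):
    """U statistic for group_a: number of (a, b) pairs with a > b, ties counted as 1/2."""
    u = 0.0
    for a in group_a:
        for b in group_b:
            if a > b:
                u += 1.0
            elif a == b:
                u += 0.5
    return u

# Hypothetical measurements for 2 groups of 5 animals.
group_a = [3, 5, 8, 12, 40]
group_b = [1, 2, 4, 6, 7]
# Same data, but the largest value in group A is made far more extreme.
group_a_extreme = [3, 5, 8, 12, 4000]

print(mann_whitney_u(group_a, group_b))          # 20.0
print(mann_whitney_u(group_a_extreme, group_b))  # 20.0
```

The statistic is unchanged by the extreme value because only the ordering of the observations matters, whereas the mean of group A changes drastically.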

In some circumstances, nonnormal data can be turned into normal data by transforming them. This involves taking square roots or logarithms or applying some other mathematical function. The transformed data will sometimes have a normal distribution and can then be analyzed with statistical methods that require normality. However, transformed data have the drawback of unfamiliar units (eg, log kilograms instead of kilograms) that many readers will be uncomfortable with.
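
As a rough sketch of how a transformation can remove skewness, the example below applies a base-10 logarithm to hypothetical enzyme activities that double from one animal to the next. The data, the function name, and the choice of skewness coefficient are assumptions made for illustration.

```python
import math
import statistics

def sample_skewness(values):
    """Moment-based skewness: mean cubed deviation divided by the SD cubed."""
    n = len(values)
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum((x - mean) ** 3 for x in values) / (n * sd ** 3)

# Hypothetical enzyme activities (U/L); each value is double the previous one.
activities = [10, 20, 40, 80, 160, 320, 640]
log_activities = [math.log10(x) for x in activities]

print(sample_skewness(activities))      # clearly positive (right skewed)
print(sample_skewness(log_activities))  # essentially zero (symmetric)

# Back-transforming the mean of the logs returns a value in the original units
# (the geometric mean), which partly addresses the unfamiliar-units drawback.
geometric_mean = 10 ** statistics.fmean(log_activities)
print(geometric_mean)
```

Symmetry after transformation does not guarantee normality, so the transformed data should still be evaluated before a parametric method is applied.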

A common research myth is that the central limit theorem implies that statistical methods that require normality can be used to analyze nonnormal data as long as the sample sizes are at least 30. The central limit theorem states that the distribution of the mean will, under certain conditions, resemble a normal distribution as the sample size increases. If the distribution of the data is extremely nonnormal, however, very large sample sizes are needed for this to happen. Samples of 30 or even 100 subjects are often too small for parametric tests to produce valid results when applied to markedly nonnormal data.
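
A small simulation can make this point concrete. The sketch below is one possible check, not a definitive demonstration: it assumes a log-normal distribution (sigma = 2) as a stand-in for markedly nonnormal data, uses samples of 30, and counts how often a nominal 95% confidence interval based on the t distribution actually contains the true mean. All names and settings are hypothetical.

```python
import math
import random
import statistics

N = 30               # sample size targeted by the "n >= 30" myth
T_CRIT_DF29 = 2.045  # two-sided 95% critical value of Student's t with 29 df

def t_interval_coverage(draw, true_mean, trials=2000, seed=1):
    """Fraction of nominal 95% t intervals (samples of size N) containing true_mean."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        sample = [draw(rng) for _ in range(N)]
        mean = statistics.fmean(sample)
        sem = statistics.stdev(sample) / math.sqrt(N)
        if abs(mean - true_mean) <= T_CRIT_DF29 * sem:
            hits += 1
    return hits / trials

# Normally distributed data: coverage should be close to the nominal 95%.
normal_cov = t_interval_coverage(lambda rng: rng.gauss(0.0, 1.0), true_mean=0.0)

# Assumed markedly nonnormal data: log-normal with sigma = 2,
# whose true mean is exp(sigma**2 / 2) = exp(2).
skewed_cov = t_interval_coverage(
    lambda rng: rng.lognormvariate(0.0, 2.0), true_mean=math.exp(2.0)
)

print(normal_cov, skewed_cov)
```

Under these assumptions, the interval for the skewed data contains the true mean far less often than 95% of the time, even though each sample has 30 observations.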

Another common research myth is that parametric statistical methods are always more powerful than nonparametric methods. According to this myth, researchers increase their chances of finding differences or relationships if they use statistical methods that require normality to analyze nonnormal data. However, parametric methods are more powerful than nonparametric methods only when data fit their assumptions. When their assumptions are not met, parametric methods often miss differences or relationships that would be detected by nonparametric methods.

A veterinary report with statistical methods that require normality should state the means by which the data were evaluated for normality. If nothing is said about this, there is good reason to suspect that the statistical methods were incorrectly applied to nonnormal data. The results of such analyses should be considered suspect at best.

Misanalysis of Nonindependent Data

Statistical methods make an important distinction between independent and nonindependent data. Data are independent when the value of one data point gives no information about any other data point. For example, the serum alkaline phosphatase (ALP) activity of a pig is independent of the ALP activity of an unrelated pig. The value for one pig tells us nothing about the value for the other pig. However, if we measure a pig's ALP activity today and then measure it again tomorrow, the value today gives us some idea of what its ALP activity will be tomorrow. These 2 ALP measurements are not independent.

In general, repeated measurements obtained over time on the same animal are not independent. Nonindependent data are also obtained when we collect data on multiple body parts of the same animal (eg, hooves, ears, kidneys, and eyes), multiple specimens from the same animal, or multiple estrous cycles, pregnancies, or parturitions of the same animal. In addition, data from genetically related animals are usually not independent.

In veterinary research, nonindependent data are sometimes incorrectly treated as if they were independent. This serious mistake invalidates the statistical results because the distributions of the statistics are not what they should be. In addition, the sample size is inflated, making it easier to find differences or relationships that are not real. For example, suppose a study involves analysis of 413 race results of 72 horses as if they were independent. It is obviously incorrect to assume that multiple races run by the same horse are independent of each other. Additional error exists because the sample size is inflated from 72 to 413. The true sample size is the number of horses, not the number of races.
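
The difference between the number of measurements and the number of independent units can be sketched as follows. Averaging per animal is only one simple option, assumed here for illustration; it discards within-animal information, and the records and function name are hypothetical.

```python
from collections import defaultdict
import statistics

def per_animal_means(records):
    """Collapse repeated measurements to one mean value per animal."""
    by_animal = defaultdict(list)
    for animal_id, value in records:
        by_animal[animal_id].append(value)
    return {animal: statistics.fmean(vals) for animal, vals in by_animal.items()}

# Hypothetical (horse, finishing time in seconds) records: 7 races run by 3 horses.
races = [
    ("A", 71.2), ("A", 70.8), ("A", 71.0),
    ("B", 74.1), ("B", 73.9),
    ("C", 72.5), ("C", 72.7),
]

horse_means = per_animal_means(races)
print(len(races), "races, but only", len(horse_means), "independent units (horses)")
```

Any test applied to the per-horse values then uses a sample size of 3, not 7.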

Nonindependent data can be part of a rational study design. These studies can yield valuable information when the data are analyzed correctly with statistical methods designed to handle nonindependent measurements. For example, suppose a veterinarian wants to compare 2 types of intraocular lenses for dogs. For each dog, the researcher can randomly assign one eye to one of the lens types and the other eye to the other lens type. The resulting data will be paired because each dog provides a set of nonindependent measurements. Statistical methods for repeated measurements must be used to analyze such data. Nonparametric statistical procedures for repeated measurements include the Friedman test and the McNemar test. Parametric statistical procedures for repeated measurements include the paired t test and repeated-measures ANOVA.
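
For a paired design with a yes-or-no outcome, the McNemar test uses only the discordant pairs. The sketch below assumes the exact (binomial) form of the test and a hypothetical outcome, a postoperative complication recorded for each eye; the counts and the function name are invented for illustration.

```python
from math import comb

def mcnemar_exact_p(b, c):
    """Two-sided exact McNemar P value from the two discordant-pair counts."""
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of a difference
    k = min(b, c)
    # Probability of a split at least this uneven if each discordant pair
    # were equally likely to favor either lens type.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Hypothetical counts of dogs in which only one eye had a complication.
only_lens_a = 9  # complication in the lens A eye only
only_lens_b = 1  # complication in the lens B eye only

print(mcnemar_exact_p(only_lens_a, only_lens_b))  # 0.021484375
```

Dogs with the same outcome in both eyes do not enter the calculation, which is how the test respects the pairing within each dog.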

When nonindependent data are described in a veterinary report, readers should check to ensure that the animal, and not other elements (eg, body part, specimen, or breed), is the unit of analysis. The data should be collected in a way that allows analysis with statistical methods for nonindependent data, and such methods should be used. For example, the intraocular lens study described previously pairs the 2 lens types for each dog. All dogs have both lens types, so statistical methods for paired data can be used. If the eyes were assigned to lens types without ensuring this pairing, some dogs would have the same lens type in both eyes. The data would be partly paired and partly unpaired. They could not be analyzed with statistical methods for paired data, nor could they be analyzed with statistical methods for independent data.

Misanalysis of Censored Data

Treating incomplete follow-up as equivalent to complete follow-up is another serious error in veterinary research. In many studies that involve following animals over time, not all of the animals complete the study or reach the endpoint of interest. For example, suppose a veterinarian investigates survival in cats after surgery for mast cell tumors. The primary endpoint is survival time: the number of months from surgery until death. A survival time is an example of a waiting time, the time from a starting point to the occurrence of an event of interest. Survival times are waiting times because we are waiting for death. The number of weeks from deworming a ferret to reinfection with worms is also a waiting time, as is the number of days from treating a cow for mastitis to recurrence of mastitis.

If a cat in the mast cell tumor study dies 4 months after surgery, its data are clear-cut. The survival time is 4 months. However, suppose another cat is still alive 6 months after surgery when the study ends, and another cat was still alive but lost to follow-up 7 months after surgery when the owners moved. The data for these cats are problematic. We know that the first cat survived at least 6 months and the second cat survived at least 7 months, but that is all we know.

The survival times for these cats are censored. A waiting time is censored when we know that it is larger than some number but we do not know its exact value. Censored data are not the same as missing data. We have no information at all about missing data. For example, if a rabbit's breed was not recorded in its chart, we do not know anything about its breed. Censored data are incomplete rather than missing because we have partial information.

When waiting times are censored, special methods in a branch of statistics called survival analysis must be used to analyze the data. These methods include Kaplan-Meier curves, the log-rank test, and Cox proportional hazards regression. They are used generally for waiting times, not just survival times. Lumping together censored waiting times as if they were equivalent and applying statistical methods that cannot handle censored data are serious errors that invalidate the results.

In the mast cell tumor study, for example, it would be a mistake to calculate the 1-year survival rate by dividing the number of cats known to have died within a year after surgery by the total number of cats in the study. Some of the cats do not have 1 year of follow-up, and this calculation assumes that all of these cats were still alive 1 year after surgery. This assumption is not reasonable because some of these cats may have died after their last follow-up and before 1 year after surgery.
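
The following sketch contrasts that naive calculation with a Kaplan-Meier estimate. It is a minimal implementation under stated assumptions: the follow-up times are hypothetical, and the function names are chosen for illustration only.

```python
def kaplan_meier(times, died):
    """Return the Kaplan-Meier curve as a list of (time, survival) steps."""
    data = sorted(zip(times, died))
    at_risk = len(data)
    survival = 1.0
    steps = []
    i = 0
    while i < len(data):
        t = data[i][0]
        same_time = [d for tt, d in data if tt == t]
        deaths = sum(same_time)
        if deaths:
            survival *= (at_risk - deaths) / at_risk
            steps.append((t, survival))
        at_risk -= len(same_time)  # deaths and censored animals leave the risk set
        i += len(same_time)
    return steps

def survival_at(steps, t):
    """Estimated probability of surviving beyond time t."""
    estimate = 1.0
    for step_time, s in steps:
        if step_time <= t:
            estimate = s
    return estimate

# Hypothetical follow-up (months) for 8 cats; False marks a censored time.
months = [2, 4, 6, 7, 10, 14, 15, 18]
died = [True, True, False, False, True, False, True, False]

km_1yr = survival_at(kaplan_meier(months, died), 12)
# Naive rate: treats the cats censored at 6 and 7 months as alive at 1 year.
naive_1yr = 1 - sum(1 for m, d in zip(months, died) if d and m <= 12) / len(months)

print(km_1yr, naive_1yr)  # approximately 0.5625 versus 0.625
```

With these hypothetical data, the naive calculation overstates 1-year survival because it counts the 2 cats censored before 1 year as survivors.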

When a veterinary report concerns waiting times, readers should expect censored data unless the report clearly states otherwise. When a study with censored data does not involve survival analysis methods, the results are invalid.

Misinterpretation of Results from Small Samples

Because veterinary research can be expensive and funding is often limited, many veterinary studies involve small numbers of subjects (ie, small samples). Such studies typically have low power, which means that they have only a small chance of detecting differences or relationships. When these studies fail to reveal differences or relationships, the results must be interpreted with caution. Failure to find a difference or relationship does not imply that it does not exist.

For example, suppose a veterinarian looks for a difference between African Grey parrots and cockatiels with respect to sociability. The sample sizes are small, and the study has only a 30% chance of finding a difference. If the veterinarian reports that no difference was found, one cannot conclude that there is no difference. Even if a difference does exist, the study had a 70% chance of missing it.

It is dangerous to rely on small sample sizes to detect adverse events because they have little chance of revealing events that are infrequent but common enough to warrant concern. For example, suppose a new heartworm preventative causes sudden death in 1 in 1,000 dogs that are given the drug. There is little chance (only 2%) that at least 1 dog in a sample of 20 dogs will suddenly die when given the drug. Even with a sample of 200 dogs, the chance that at least 1 dog will have sudden death is only 18%. This is 1 reason that adverse effects of medication are often detected after a drug has been marketed and widely prescribed. Uncommon adverse events become evident only with widespread clinical use of the drug.
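
The percentages above follow from a short calculation, sketched below under the assumption that each dog's outcome is independent of the others. The function name is hypothetical.

```python
def prob_at_least_one(event_rate, n):
    """Chance that at least 1 of n independent animals has the event."""
    return 1.0 - (1.0 - event_rate) ** n

# Sudden death in 1 of 1,000 treated dogs, as in the example above.
for sample_size in (20, 200):
    print(sample_size, round(prob_at_least_one(0.001, sample_size), 2))
# prints:
# 20 0.02
# 200 0.18
```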

Small sample sizes have yet another hazard: they are more likely than large samples to yield biased results. This problem occurs because a few atypical animals will have a much greater impact on results for small samples than those for larger samples. For example, suppose a sample of 6 pygmy goats includes 1 goat with undiagnosed orthopedic abnormalities. That goat's abnormal gait measurements will have a substantial impact on the gait statistics for this sample. The gait measurements from the other 5 healthy goats will not be enough to offset the 1 goat's abnormal measurements. If the abnormal goat is part of a sample of 60 goats, however, its gait measurements will have little impact on the sample's gait statistics. The measurements from the 59 healthy goats should outweigh the 1 goat's abnormal measurements.
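
The arithmetic behind the goat example can be sketched as follows. The gait values are hypothetical and are chosen only to show how the influence of 1 atypical animal shrinks as the sample grows.

```python
import statistics

def mean_with_one_outlier(n, typical, outlier):
    """Mean of a sample of n animals in which 1 animal has an atypical value."""
    values = [typical] * (n - 1) + [outlier]
    return statistics.fmean(values)

# Hypothetical gait measurement: 10.0 for healthy goats, 22.0 for the abnormal goat.
print(mean_with_one_outlier(6, 10.0, 22.0))   # 12.0 (20% above the healthy value)
print(mean_with_one_outlier(60, 10.0, 22.0))  # 10.2 (2% above the healthy value)
```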

Identification of Statistical Errors

Although other types of statistical errors appear in the veterinary literature, the mistakes described here are the most common and easily identified. The following guidelines may help readers detect statistical errors in veterinary reports:

  • If the statistics seem to violate common sense, there is a good chance that they are wrong.

  • If the statistics look incorrect, check to determine whether one of the mistakes described here has been committed.

  • If the statistics are suspect but you cannot identify the error, consult a statistician.

Readers cannot prevent statistical errors from appearing in the veterinary literature, but they can identify many of these mistakes. By doing so, they can protect themselves, their practices, and their patients from the harm that may result when invalid study results are accepted and applied.

References

  1. Haack S. Peer review and publication: lessons for lawyers. Stetson Law Rev 2007;36:789–819.

  2. Linney WR, Hammer DL, Shott S. Surgical treatment of medial patellar luxation without femoral trochlear groove deepening in 91 dogs. J Am Vet Med Assoc 2011;238:in press.
