Why It Matters
Comparison of percentages is one of the most common types of comparisons in the veterinary literature. For example, a study1 of horses with abdominal pain found gastric ulceration in 68.4% of horses with duodenitis–proximal jejunitis, compared with 14.3% of horses with large colon volvulus. Another study2 found that the percentage of positive results for detecting Brucella canis in canine semen was 34.6% for a PCR assay, compared with 9.6% for the rapid slide agglutination test with 2-mercaptoethanol.
Appropriate statistical methods for comparing percentages are determined by the type of data. For example, methods used to compare percentages from independent samples are quite different from methods used to compare percentages from paired samples. If the wrong method is used, the results can be extremely misleading.
Statistical methods for the following comparisons will be discussed here:
• Comparison of sample percentages with hypothesized population percentages.
• Comparison of independent percentages.
• Comparison of paired percentages.
• Comparison of percentages from 3 or more samples of repeated measurements.
Because computer software can be used to perform these comparisons, the associated calculations will not be reviewed here. Instead, the focus will be on determining which methods are appropriate and interpreting their results.
Comparison of Sample Percentages With Hypothesized Population Percentages
A study3 of uterine torsion in 55 dairy cows revealed the following seasonal distribution for torsion: spring, 20.0%; summer, 36.4%; fall, 16.4%; and winter, 27.3%. Do these data provide evidence that cases of uterine torsion are not evenly distributed throughout the seasons? To determine this requires testing the null hypothesis that the population percentage of uterine torsion cases in each season is 25%—in other words, that uterine torsion cases are evenly distributed throughout the seasons. Our alternative hypothesis would be that the population percentage of uterine torsion cases is not 25% for at least 1 season—in other words, that uterine torsion cases are not evenly distributed throughout the seasons. The value 25% is the hypothesized population percentage for each season, with the population comprising all cows with uterine torsion similar to those in the study. In general, hypothesized population percentages are numbers that the researcher specifies to determine whether the data provide evidence against these percentages.
The next step is to choose a significance level (0.05 is common) and perform a statistical test of the null hypothesis. To test hypothesized population percentages, the χ2 test of hypothesized percentages is used. This test is the same χ2 test that many university students encounter in genetics classes when experimenting with fruit flies. It is also called the χ2 test of goodness of fit. If the χ2 test yields a significant result, the final step will be to obtain confidence intervals for the population percentages. If the conclusion is that not all of the population percentages are equal to the hypothesized percentages, we usually want to estimate them.
For the uterine torsion data, the P value obtained was 0.16 (calculated with the aid of a computer). Because 0.16 is greater than our significance level of 0.05, we cannot reject the hypothesis that the population percentage of uterine torsion cases in each season is 25%. On the other hand, we cannot claim that uterine torsion cases are evenly distributed throughout the seasons. We can only conclude that the data do not provide evidence against this hypothesis. Because the null hypothesis cannot be rejected, there is no need to calculate confidence intervals for the population percentages.
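The test described above can be sketched in a few lines of Python. This is a minimal illustration, not the software used in the study. The counts (11, 20, 9, and 15) are an assumption: they are back-calculated from the reported percentages and the sample size of 55. The function names are hypothetical.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer df."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum of (x/2)^k / k! for k = 0 .. df/2 - 1
        term, total = 1.0, 1.0
        for k in range(1, df // 2):
            term *= half / k
            total += term
        return math.exp(-half) * total
    # Odd df: start from the df = 1 tail and add one term per 2 extra df
    total = math.erfc(math.sqrt(half))
    term = math.sqrt(half) / math.gamma(1.5)
    for k in range(1, (df - 1) // 2 + 1):
        total += math.exp(-half) * term
        term *= half / (k + 0.5)
    return total

def goodness_of_fit(observed, hypothesized_props):
    """Chi-square test of hypothesized percentages (goodness of fit)."""
    n = sum(observed)
    expected = [n * p for p in hypothesized_props]
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    df = len(observed) - 1
    return stat, df, chi2_sf(stat, df), expected

# Assumed counts for spring, summer, fall, winter (from 55 cows)
torsion_counts = [11, 20, 9, 15]
stat, df, p_value, expected = goodness_of_fit(torsion_counts, [0.25] * 4)
print(f"chi-square = {stat:.2f}, df = {df}, P = {p_value:.2f}")
```

Under these assumed counts, the P value rounds to the 0.16 reported for the study.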
The χ2 test of hypothesized percentages and confidence intervals for population percentages are based on the following assumptions:
• Random sampling. A random sample is preferable, but the χ2 test of hypothesized percentages and confidence intervals for population percentages may be used when a sample is not randomly selected. The sample must not be biased, however. A biased sample is a sample that does not accurately reflect the population.
• Noncensored observations. None of the observations can be based on censored data. A data point is censored when it is known that the value is larger (or smaller) than some number, but the exact value is not known. For example, the survival time for a dog that is still alive 3 years after cancer surgery is censored. The survival time is at least 3 years, but the exact survival time is unknown.
• Independent observations. All of the observations must be independent of each other.
• Sufficiently large sample. The χ2 test of hypothesized percentages and confidence intervals for population percentages are based on approximations that work best when a sample is sufficiently large. For the χ2 test of hypothesized percentages, this assumption is a requirement for the expected frequencies, which are the frequencies that we expect to observe if the null hypothesis is true. No expected frequency should be < 1, and no more than 20% of the expected frequencies should be < 5. For confidence intervals for percentages, this assumption is a requirement for the sample percentage. The closer the sample percentage is to 0% or 100%, the larger the sample required. A statistician should be consulted when there is uncertainty that the sample is large enough.
The aforementioned assumptions, with the exception of random sampling, appeared to be met for the uterine torsion data. None of the observations were based on censored data. All of the observations appeared to be independent, since none of the cows had > 1 uterine torsion, and knowing the season for one cow's uterine torsion tells us nothing about the season for another cow's uterine torsion. The expected frequencies (calculated with the aid of a computer) were all > 5.
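The sample-size rule for expected frequencies is easy to check by computer. The sketch below encodes the rule stated above (no expected frequency < 1, and no more than 20% of them < 5). The function name is hypothetical, and the rule is applied to the expected frequencies implied by 55 cows and a hypothesized 25% per season.

```python
def expected_frequencies_ok(expected):
    """Return True when the expected frequencies satisfy the size rule."""
    any_below_1 = any(e < 1 for e in expected)
    share_below_5 = sum(e < 5 for e in expected) / len(expected)
    return (not any_below_1) and share_below_5 <= 0.20

# 55 cows, 25% hypothesized for each of 4 seasons
torsion_expected = [55 * 0.25] * 4
print(torsion_expected, expected_frequencies_ok(torsion_expected))
```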
Comparison of Independent Percentages
In veterinary research, one of the most frequently tested null hypotheses is the hypothesis that 2 or more population percentages are the same. If 2 percentages are compared, the alternative hypothesis states that the 2 population percentages are different. If 3 or more percentages are compared, the alternative hypothesis states that at least one population percentage differs from another population percentage. The null hypothesis is false even if only 2 of the population percentages are different, so we cannot conclude that all of them are different when we reject the null hypothesis. To determine which percentages are different, we need to carry out additional statistical tests to compare the percentages 2 at a time. Comparisons of groups 2 at a time are called pairwise comparisons.
In a study4 of anti–Neospora caninum antibodies in dogs, antibodies were detected in 21.7% of 92 dogs from a rural area and 10.7% of 300 dogs from an urban area. Is the difference between these percentages significant? In other words, do the data provide evidence that the rural and urban population percentages of dogs with anti–Neospora caninum antibodies are different? Here, the 2 populations consist of all rural dogs similar to those in the study and all urban dogs similar to those in the study. The null hypothesis states that the rural population percentage of dogs with these antibodies is the same as the urban population percentage of dogs with these antibodies. The alternative hypothesis states that these 2 population percentages are different.
To compare 2 or more percentages from independent samples, we use the χ2 test of association, which is also called the χ2 test of independence. (By sample, we mean a group of animals for which data are collected and not blood, fecal, or other types of samples or specimens.) If the test result is significant, confidence intervals can be obtained for differences between the population percentages. If we conclude that the population percentages are different, then we need to estimate how different they are because many statistically significant differences are not large enough to have any practical or clinical value.
When a 0.05 significance level was used to test the hypothesis of equal rural and urban population percentages of dogs with anti–Neospora caninum antibodies, a P value of 0.006 was obtained. Because 0.006 is less than 0.05, the null hypothesis that the rural and urban population percentages are the same was rejected. The data provide evidence that these percentages are different.
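As a rough illustration of this test, the sketch below computes the χ2 statistic for a table of counts and, for the 2-group case, its P value. The counts are an assumption: 20 of 92 rural dogs and 32 of 300 urban dogs are back-calculated from the reported percentages. The function name is hypothetical, and no continuity correction is applied.

```python
import math

def chi2_association(table):
    """Chi-square test of association for a table of counts (rows = groups)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    # Expected frequency for each cell if the null hypothesis is true
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    stat = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, df, expected

# Assumed counts: [with antibodies, without antibodies]
dogs = [[20, 72],    # rural, n = 92
        [32, 268]]   # urban, n = 300
stat, df, expected = chi2_association(dogs)

# With df = 1 the upper-tail probability has a simple closed form
p_value = math.erfc(math.sqrt(stat / 2.0))
print(f"chi-square = {stat:.2f}, df = {df}, P = {p_value:.3f}")
```

With these assumed counts, the P value rounds to the reported 0.006, and every expected frequency exceeds 5.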
Because it was concluded that the population percentages are different, a confidence interval was needed for the difference between these percentages to get an idea of how large that difference is. A computer calculation indicated that the 95% confidence interval for the difference between the rural population percentage with antibodies and the urban population percentage with antibodies was 1.9 to 20.1. Thus, we can be 95% sure that the difference between the 2 populations is between 1.9 and 20.1 percentage points.
What does this tell us? First, this confidence interval does not contain 0, so we are 95% sure that the difference between the population percentages is not 0. It was already decided, on the basis of the χ2 test results, that the difference is not 0, and the confidence interval is consistent with that decision. Second, this confidence interval gives us an estimate of how large the difference between the population percentages is. We are 95% sure that the rural population percentage with antibodies is between 1.9 and 20.1 percentage points higher than the urban population percentage with antibodies. The decision about whether this difference is clinically important is a veterinary issue and not a statistical one.
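The report does not say how the interval was calculated. One common choice, used here purely as an assumption, is the large-sample (Wald) interval for a difference between 2 independent percentages. The counts are again back-calculated from the reported percentages, and the function name is hypothetical.

```python
import math
from statistics import NormalDist

def wald_ci_difference(x1, n1, x2, n2, level=0.95):
    """Large-sample CI for the difference p1 - p2, in percentage points."""
    p1, p2 = x1 / n1, x2 / n2
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    diff = p1 - p2
    return 100 * (diff - z * se), 100 * (diff + z * se)

# Assumed counts: 20 of 92 rural dogs, 32 of 300 urban dogs
lower, upper = wald_ci_difference(20, 92, 32, 300)
print(f"95% CI: {lower:.1f} to {upper:.1f} percentage points")
```

With these assumed counts the interval is about 1.9 to 20.2 percentage points, close to the reported 1.9 to 20.1. The small discrepancy in the upper limit is what we would expect from rounding of the input percentages.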
The χ2 test of association and confidence intervals for differences between independent population percentages are based on the same assumptions as for the χ2 test of hypothesized percentages and associated confidence intervals: random sampling, noncensored observations, independent observations, and sufficiently large samples.
With the exception of random sampling, all of these assumptions appeared to be met for the canine antibody data. None of the observations were based on censored data. Knowing whether one dog has antibodies tells us nothing about whether another dog has antibodies, so all of the observations were independent. All of the expected frequencies (calculated by use of a computer) were > 5, and the sample percentages were not close to 0% or 100%.
When the χ2 test of association is used to compare 2 independent percentages, an adjustment called a continuity correction is sometimes used to produce a different P value. This adjustment is inclined to produce P values that are too large, and its use is not recommended.5 Additional information about the continuity correction can be found elsewhere.6
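The effect of the continuity correction can be seen by computing the test both ways. The sketch below uses the usual shortcut formula for a 2 × 2 table and the same assumed antibody counts (20 of 92 rural dogs and 32 of 300 urban dogs). The function name and the `correction` parameter are hypothetical.

```python
import math

def chi2_2x2(a, b, c, d, correction=False):
    """Chi-square test for the 2 x 2 table [[a, b], [c, d]]; returns (stat, P)."""
    n = a + b + c + d
    numerator = abs(a * d - b * c)
    if correction:
        # Continuity correction: shrink the numerator by n / 2 (not below 0)
        numerator = max(0.0, numerator - n / 2.0)
    stat = n * numerator ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    return stat, math.erfc(math.sqrt(stat / 2.0))  # df = 1

stat_plain, p_plain = chi2_2x2(20, 72, 32, 268)
stat_corr, p_corr = chi2_2x2(20, 72, 32, 268, correction=True)
print(f"without correction: P = {p_plain:.4f}")
print(f"with correction:    P = {p_corr:.4f}")
```

For these counts the corrected P value is larger than the uncorrected one, which is the direction of the bias described above.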
Suppose we want to compare 2 independent percentages, and all of the assumptions needed for the χ2 test of association are met except one: some of the expected frequencies are too small. In this case, we can use another test of the hypothesis that the population percentages are the same. This test, called the Fisher exact test, relies on the same random sampling, independence, and noncensored data assumptions needed for the χ2 test of association. Unlike the χ2 test of association, the Fisher exact test has no restrictions on the size of the expected frequencies. Unfortunately, the calculations used to obtain P values for the Fisher exact test are based on a very small subset of all possible samples, making it a suboptimal statistic,7 but it is sometimes the only statistical test available. Because it is not optimal, the Fisher exact test should not be used when the expected frequencies are large enough for the χ2 test of association. When > 2 independent population percentages are compared, an extended version of the Fisher exact test can be used.
If the Fisher exact test leads us to reject the hypothesis that the population percentages are the same, it is usually desirable to obtain a confidence interval to estimate the difference between these percentages. This can be complicated when the expected frequencies are too small for the χ2 test of association, and thus a statistician should be consulted in such circumstances.
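For readers curious about what the computer does, a minimal sketch of the Fisher exact test for a 2 × 2 table follows. The data are entirely hypothetical (2 groups of 4 animals). Several conventions exist for the 2-sided P value; the one used here, which sums the probabilities of all tables no more probable than the observed table, is an assumption.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact P value for the 2 x 2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    n = row1 + row2
    denom = comb(n, col1)

    def prob(x):
        # Probability of x in the top-left cell with all margins held fixed
        return comb(row1, x) * comb(row2, col1 - x) / denom

    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    p_observed = prob(a)
    # Small tolerance so tables tied with the observed one are included
    return sum(
        prob(x)
        for x in range(lowest, highest + 1)
        if prob(x) <= p_observed * (1 + 1e-9)
    )

# Hypothetical data: 3 of 4 positive in one group, 1 of 4 in the other
p_value = fisher_exact_two_sided(3, 1, 1, 3)
print(f"P = {p_value:.3f}")
```

In this hypothetical table every expected frequency is 2, which is too small for the χ2 test of association.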
Comparison of Paired Percentages
Not all comparisons of percentages are based on independent samples. Veterinary research commonly involves before-versus-after studies, in which measurements obtained before treatment are compared with measurements obtained after treatment. The result is 2 nonindependent samples of observations: the before-treatment measurements and the after-treatment measurements. When such designs are used appropriately, the variability of the results is sometimes greatly reduced.
Nonindependent samples result when 2 measurements of the same characteristic or parameter are obtained from the same subjects, or when each subject is deliberately paired with another subject on the basis of some important characteristic. A lamb may be paired with its twin, a veterinarian may be paired with her partner, or a horse may be matched with another horse with respect to age, sex, disease, or other characteristics.
When the pairing is successful, in the sense that paired subjects respond in a similar way, the measurements for those paired subjects will not be independent. Knowing how one member of a pair responds will provide some information about how the other member of the pair responds. Use of successfully paired subjects reduces variability, since each subject has a similar control subject for comparison. Samples obtained by pairing subjects or measuring something twice are called paired samples.
Paired samples are quite useful for comparing percentages and other descriptive statistics when analyzed correctly. A common mistake is the use of procedures for comparing independent samples to analyze paired samples. Because this error can produce highly misleading results, it is important to be able to distinguish between paired and independent samples.
How can we assess whether 2 samples are paired? This is entirely a matter of study design. When given 2 sets of numbers but no information about the study design, there is no way to determine whether the observations were paired. If pairing was done or repeated measurements were obtained, any thorough report will indicate this. If no basis for pairing is described, it is usually safe to assume that subjects were not paired.
Suppose we want to compare 2 percentages from paired samples. The null hypothesis is still that the 2 population percentages are the same, and the alternative hypothesis is still that the 2 population percentages are different. However, the χ2 test of association or Fisher exact test can no longer be used because paired samples are not independent. The misuse of these tests to compare percentages from paired samples is a common error in the veterinary literature.
When samples are paired, the appropriate test to assess the hypothesis that 2 population percentages are the same is the McNemar test. This test is based on the following assumptions:
• Random sampling. Random samples are not essential as long as the samples are not biased.
• Noncensored observations. None of the observations can be based on censored data.
• Paired samples. The McNemar test can only be used when paired samples are obtained.
• Independent observations in each sample. All of the observations in the first sample must be independent, and all of the observations in the second sample must be independent. Observations from different samples are not necessarily independent, since the samples are paired.
If we reject the hypothesis that paired population percentages are the same, it is usually desirable to obtain an estimate of the degree of difference between the percentages. Obtaining a confidence interval for the difference between 2 paired population percentages requires methods for paired data.8 The methods used to obtain confidence intervals for differences between independent population percentages are not appropriate and will yield incorrect results.
In a study9 of intradermal and serum ELISA testing for allergens in atopic dogs, results of intradermal testing suggested 47.0% of 265 dogs were allergic to Tyrophagus sp (storage mites), and 51.0% of the same dogs had positive ELISA results for antibodies against the organism. Can a McNemar test be used to test the hypothesis that the population percentage of dogs with positive results for Tyrophagus allergy by intradermal testing is the same as the population percentage with positive results by ELISA testing? Here, the population consists of all dogs that are similar to those in the study. The alternative hypothesis states that the 2 population percentages are different.
As always, the assumptions needed for the intended statistical test need to be checked. The samples were not randomly selected, but this does not rule out use of the McNemar test. None of the observations were based on censored data. The samples were clearly paired, with one sample consisting of intradermal test results for 265 dogs and the other sample consisting of ELISA test results for the same dogs. All of the intradermal test results were independent, given that the result for one dog tells us nothing about the result for another dog. For the same reason, all of the ELISA test results were independent. The McNemar test can be used to test the null hypothesis that the population percentage of dogs with positive results for intradermal testing is the same as the population percentage with positive results for the ELISA method.
The P value for the test exceeded 0.05. With the significance level set at 0.05, the hypothesis that the population percentages are the same cannot be rejected. This does not mean that we can conclude that the population percentages are the same. We can only say that the data do not provide evidence that these percentages are different. Because the null hypothesis cannot be rejected, we will not obtain a confidence interval for the difference between the population percentages.
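The McNemar test uses only the discordant pairs, that is, the subjects whose 2 results disagree. The discordant counts for the allergy study are not given here, so the counts in the sketch below are hypothetical and serve only to show the calculation. The version shown is the large-sample form without a continuity correction, which is an assumption, and the function name is hypothetical.

```python
import math

def mcnemar_test(b, c):
    """McNemar test from discordant pair counts; returns (statistic, P).

    b: pairs positive on the first measurement only
    c: pairs positive on the second measurement only
    """
    if b + c == 0:
        raise ValueError("no discordant pairs; the test cannot be computed")
    stat = (b - c) ** 2 / (b + c)
    p_value = math.erfc(math.sqrt(stat / 2.0))  # chi-square with df = 1
    return stat, p_value

# Hypothetical discordant counts, not taken from the study
stat, p_value = mcnemar_test(30, 41)
print(f"chi-square = {stat:.2f}, P = {p_value:.2f}")
```

Pairs with the same result on both measurements do not enter the calculation at all, which is why tests for independent samples give different, and incorrect, answers for paired data.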
Comparison of Percentages From 3 or More Samples of Repeated Measurements
When a comparison of percentages obtained from 3 or more samples of repeated measurements is desired, the Cochran Q test can be used to test the hypothesis that all of the population percentages are the same. The alternative hypothesis states that at least one population percentage differs from another population percentage. Because samples of repeated measurements are not independent, neither the χ2 test of association nor the extended Fisher exact test can be used.
The Cochran Q test is based on the same assumptions of random sampling and noncensored observations as the other tests. The differences are as follows:
• Repeated measurements. The Cochran Q test can only be used when samples of repeated measurements are obtained.
• Independent observations in each sample. All of the observations in each sample must be independent. Observations from different samples are not necessarily independent, since the samples consist of repeated measurements.
When the results of the Cochran Q test are significant, we cannot conclude that all of the population percentages are different, only that at least 2 of them are. The next step is to carry out McNemar tests to compare the samples 2 at a time to determine which population percentages are different. Significant results of McNemar tests should be followed up by calculation of confidence intervals for differences between paired population percentages so the degree of difference in the population percentages can be determined.
A study10 of detomidine for treating horses with colic reported the following percentages of 155 horses with abnormal kicking behavior before and after receiving detomidine for colic: before treatment, 60.0%; 15 minutes after treatment, 4.5%; 30 minutes after treatment, 5.2%; 45 minutes after treatment, 5.2%; and 60 minutes after treatment, 8.4%.
Suppose we would like to test the hypothesis that the population percentages for the 5 time points are the same. Can the Cochran Q test be used to test this hypothesis? Here, the population consists of all horses with colic that are similar to those in the study. The alternative hypothesis states that at least one population percentage differs from another population percentage.
Again, the assumptions for the test should be checked. Although the samples were not randomly selected, this is not essential. None of the observations were based on censored data. Because the 5 percentages were obtained for the same group of horses, they are based on 5 samples of repeated measurements. In each sample, one horse's kicking behavior tells us nothing about another horse's kicking behavior, so all of the observations in each sample were independent. We can use the Cochran Q test to test the hypothesis that the 5 population percentages of horses with abnormal kicking behavior are the same.
A P value < 0.001 was obtained when the Cochran Q test was performed in this study.10 Even if the predefined level of significance was 0.01 rather than 0.05, the obtained P value would still be significant. Therefore, the hypothesis that all 5 population percentages are the same was rejected. The data provide evidence that at least one population percentage differs from another population percentage. The next step would be to carry out McNemar tests to compare the percentages 2 at a time to determine which population percentages are different. For time points with a significant difference between percentages, a confidence interval should be obtained to estimate the difference between the population percentages.
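The Cochran Q test requires the individual results for each subject at each time point, which are not reproduced here. The sketch below therefore uses a small, entirely hypothetical data set (6 subjects, 3 time points; 1 = behavior present, 0 = absent) only to show how the statistic is formed. The function names are hypothetical.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer df."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        term, total = 1.0, 1.0
        for k in range(1, df // 2):
            term *= half / k
            total += term
        return math.exp(-half) * total
    total = math.erfc(math.sqrt(half))
    term = math.sqrt(half) / math.gamma(1.5)
    for k in range(1, (df - 1) // 2 + 1):
        total += math.exp(-half) * term
        term *= half / (k + 0.5)
    return total

def cochran_q(rows):
    """Cochran Q test; rows = one list of 0/1 results per subject."""
    k = len(rows[0])                        # number of repeated measurements
    col_totals = [sum(col) for col in zip(*rows)]
    row_totals = [sum(row) for row in rows]
    n = sum(row_totals)                     # total number of positive results
    denom = k * n - sum(r * r for r in row_totals)
    if denom == 0:
        raise ValueError("every subject has identical results at all times")
    q = (k - 1) * (k * sum(c * c for c in col_totals) - n * n) / denom
    return q, k - 1, chi2_sf(q, k - 1)

# Hypothetical data: each row is one subject at 3 time points
results = [[1, 1, 0], [1, 0, 0], [1, 0, 0],
           [1, 1, 0], [1, 0, 1], [0, 0, 0]]
q, df, p_value = cochran_q(results)
print(f"Q = {q:.2f}, df = {df}, P = {p_value:.3f}")
```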
Choosing the Right Method
Selecting the correct method for comparing percentages depends on the type of data. Samples of repeated measurements, including paired samples, require methods that are entirely different from the methods used for independent samples. Because repeated measurements are commonly misanalyzed in veterinary studies, readers should be particularly cautious when evaluating reports of studies involving percentages based on repeated measurements.
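The choices discussed in this article can be summarized as a small decision helper. This is only a sketch of the logic described above; the function name and its parameters are hypothetical, and it does not replace checking the assumptions of each test.

```python
def choose_test(n_samples, repeated, expected_ok=True, hypothesized=False):
    """Suggest a test for comparing percentages, following this article.

    n_samples:    number of samples (groups or time points) compared
    repeated:     True for paired samples or repeated measurements
    expected_ok:  True when expected frequencies are large enough
    hypothesized: True when one sample is compared with stated percentages
    """
    if hypothesized:
        return "chi-square test of hypothesized percentages"
    if n_samples < 2:
        raise ValueError("at least 2 samples are needed for a comparison")
    if repeated:
        return "McNemar test" if n_samples == 2 else "Cochran Q test"
    if expected_ok:
        return "chi-square test of association"
    return "Fisher exact test" if n_samples == 2 else "extended Fisher exact test"

print(choose_test(4, repeated=False, hypothesized=True))  # seasons vs 25% each
print(choose_test(2, repeated=False))                     # rural vs urban dogs
print(choose_test(2, repeated=True))                      # 2 tests, same dogs
print(choose_test(5, repeated=True))                      # 5 time points
```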
References
1. Dukti SA, Perkins S, Murphy J, et al. Prevalence of gastric squamous ulceration in horses with abdominal pain. Equine Vet J 2006; 38:347–349.
2. Keid LB, Soares RM, Vasconcellos SA, et al. A polymerase chain reaction for the detection of Brucella canis in semen of naturally infected dogs. Theriogenology 2007; 67:1203–1210.
3. Aubry P, Warnick LD, DesCôteaux L, et al. A study of 55 field cases of uterine torsion in dairy cattle. Can Vet J 2008; 49:366–372.
4. Fernandes BC, Gennari SM, Souza SL, et al. Prevalence of anti-Neospora caninum antibodies in dogs from urban, periurban and rural areas of the city of Uberlândia, Minas Gerais—Brazil. Vet Parasitol 2004; 123:33–40.
6. Schork MA, Remington RD. Statistics with applications to the biological and health sciences. 3rd ed. Upper Saddle River, NJ: Prentice-Hall, 2000.
7. Upton GJG. A comparison of alternative tests for the 2×2 comparative trial. J R Stat Soc Ser A 1982; 145:86–105.
8. Bland M. An introduction to medical statistics. 3rd ed. Oxford, England: Oxford University Press, 2000.
9. Foster AP, Littlewood JD, Webb P, et al. Comparison of intradermal and serum testing for allergen-specific IgE using a FcϵRIα-based assay in atopic dogs in the UK. Vet Immunol Immunopathol 2003; 93:51–60.
10. Jöchle W. Dose selection for detomidine as a sedative and analgesic in horses with colic from controlled and open clinical studies. J Equine Vet Sci 1990; 10:6–11.