Machine learning predicts selected cat diseases using insurance data amid challenges in interpretability

Barr N. Hadar, DVM, PhD, Department of Population Medicine, Ontario Veterinary College, University of Guelph, Guelph, ON, Canada. https://orcid.org/0000-0003-1017-7531

Zvonimir Poljak, DVM, PhD, Department of Population Medicine, Ontario Veterinary College, University of Guelph, Guelph, ON, Canada. https://orcid.org/0000-0001-8621-3661

Brenda Bonnett, DVM, PhD, B. Bonnett Consulting, Georgian Bluffs, ON, Canada

Jason Coe, DVM, PhD, Department of Population Medicine, Ontario Veterinary College, University of Guelph, Guelph, ON, Canada. https://orcid.org/0000-0002-0811-5051

Elizabeth A. Stone, DVM, MS, MPP, DACVS, Department of Clinical Studies, Ontario Veterinary College, University of Guelph, Guelph, ON, Canada

Theresa M. Bernardo, DVM, MSc, Department of Population Medicine, Ontario Veterinary College, University of Guelph, Guelph, ON, Canada. https://orcid.org/0000-0002-1330-3284

Abstract

Objective

To develop models for prediction of the onset of specific diseases in cats using pet insurance data and to evaluate their predictive performance.

Methods

Agria Pet Insurance data from almost 550,000 cats (2011 to 2016) were analyzed and used to train predictive models for periodontal disease and skin tumors using breed, sex, and insurance claim history. Random downsampling and 1:1 matching by age, insurance duration, and time at risk balanced the dataset. Variables were then further processed, with random forest and conditional logistic regression used for analysis. Model accuracy was assessed through leave-one-out cross-validation, while variable importance plots, partial dependence plots, and coefficients were used for model interpretation.

Results

Model accuracy ranged from 81.9% to 88.2% (P < .01, baseline 50%). Key predictors included prior insurance claims for “digestive,” “whole body symptom,” “skin,” and “injury conditions,” which may be nonspecific and predictive of various diseases. Maine Coon, Siamese, and Burmese cats were associated with periodontal disease–positive predictions, while domestic cats were linked with negative predictions. For skin tumors, Norwegian Forest Cats, Devon Rex and Sphynx cats, and Maine Coon cats were associated with positive predictions, whereas Birman and domestic cats were linked with negative predictions.

Conclusions

This study presents a method of machine learning predictive analysis on pet insurance data, although more comprehensive medical information and approaches accounting for data characteristics may be necessary to develop clearer predictors.

Clinical Relevance

To prevent or detect these conditions early, veterinarians can use the breed risk results to guide clients, especially those with high-risk breeds, by offering early advice on lifestyle and monitoring.

With a growing interest in preventive care and early disease detection, comprehensive data sources like pet insurance information can be used to understand health trends and risk factors within pet populations. The integration of AI in veterinary medicine is revolutionizing the field, particularly in how data are analyzed. The AI-driven predictive models can accurately identify patterns in large complex datasets and predict health outcomes,1 enabling earlier interventions and improved animal health management.

Insurance data can be an excellent data source and has been validated for use in research.2,3 Over a decade ago, publications4,5 on insured cats in Sweden provided some of the earliest analyses of insurance data to determine mortality, incidence of disease, and use of antimicrobials in the insured pet population. Sweden has the highest rates of pet insurance worldwide, with an estimated 36% of its cat population having veterinary care insurance in 2017. Approximately two-thirds of these insured cats were covered by Agria Pet Insurance, accounting for nearly 24% of the Swedish cat population. Today, it is estimated that over 50% of Swedish cats are covered by pet insurance.6 Agria's insurance databases have been the source of numerous peer-reviewed epidemiological studies7–12 on companion animals. Adoption of pet insurance throughout the world is leading to more research insights, such as the recent white papers shared by Nationwide.13 Estimated pet insurance adoption rates by 1 source14 for 2023/2024 include Sweden (approx 80%), UK (approx 30%), Italy (approx 29%), Japan (approx 15%), US (approx 2%), and Canada (approx 3%). Utilizing preexisting insurance databases provides insight into the population at risk, including diseased and healthy animals. Enhancing its value further, the substantial scale of insurance databases allows for stratification by various variables and assessments based on high statistical power. Swedish veterinarians use a well-established diagnostic coding system to provide diagnoses for insurance claims.15 Evaluation of Agria Pet Insurance data against practice records has revealed excellent representation of practice records for cat/dog, sex, breed, and 85% of diagnoses, and fair representation for birth dates.16 More recent studies17,18 in dogs reported excellent and acceptable representation for epilepsy and atopic dermatitis. Clinics with electronic medical records tended to have better representation, likely signifying improved consistency today, as the adoption of digital records increases and better use of structured data is established.

In the age of data-driven decision-making, predictive healthcare using AI solutions like machine learning has emerged as a powerful tool with transformative potential.19,20 Machine learning techniques, such as random forest (RF),21 are gaining traction as potent alternatives to traditional parametric statistical methods.22 These techniques are highlighted for their powerful ability to handle large datasets with complex interactions and nonlinear relationships, without the need for extensive data preparation or compliance with model assumptions. Random forest offers a distinct advantage in its ensemble method, which combines multiple classification trees to improve predictive accuracy and to better control for model overfitting. Despite these theoretical advantages and the availability of data, the evaluation of the predictive ability of models based on insurance data remains an unexplored area with few contributions.

The objective of this study was to develop models for the prediction of the onset of specific diseases in cats using pet insurance data and to evaluate their predictive performance. This study also explored both the predictive potential and the clinical relevance of disease predictors, using various data processing approaches and data analysis techniques.

Methods

Original data

This retrospective longitudinal study was conducted utilizing veterinary care insurance data from Agria Pet Insurance in Sweden, from January 1, 2011, to December 31, 2016. The data included the following variables: cat identification, breed, sex, age at the time of veterinary care claim, age at insurance enrollment and/or discontinuation, as well as the diagnostic code(s).

Breeds were categorized into 41 breed codes, with closely related breeds merged for analysis by Agria, such as Cornish and German Rex into “Cornish/German Rex,” Devon Rex and Sphynx into “Devon Rex/Sphynx,” and others. Siamese-related breeds were grouped under “Siamese,” while the Siberian and Neva Masquerade cats were combined as “Siberian.” “Domestic” cats encompassed domestic shorthair and longhair varieties, while an “other” breed category included diverse breeds, resulting in 18 breed groups for analysis. Cats were classified as male or female; neuter status information was unavailable. Cats could be enrolled in a veterinary care insurance policy at any age. Age at diagnosis was determined based on the claim date. If multiple receipts were submitted from different veterinary appointments on the same date, they were typically coded as separate claims, each with 1 or more diagnoses per appointment.

Attending veterinarians employed a standardized hierarchical diagnostic registry to allocate diagnostic codes, encompassing both specific and broader categories. The initial diagnosis was subsequently categorized into a specific diagnostic class and further grouped into 3 hierarchical levels of increasingly general diagnostic categories. For instance, the initial diagnosis of “acute external otitis” would be categorized under the specific diagnostic classification “otitis externa” and encompassed within the highest-level diagnostic category “ear.” The 24 highest-level diagnostic categories included were the following: “behavior,” “blood vascular,” “claw,” “digestive,” “ear,” “endocrine,” “eyes,” “heart,” “immunological,” “infection,” “injury,” “locomotor,” “neoplasia,” “neurological,” “operation complication,” “repro female,” “repro male,” “respiratory lower,” “respiratory thoracic,” “respiratory upper,” “skin,” “whole body symptom,” “urinary lower,” and “urinary upper.” The following 6 diagnostic categories were excluded: “normal variation,” “exam prophylaxis,” “weak/sick offspring,” “dead/euthanized,” and 2 blank ones. The diagnostic category of “whole body symptom” encompassed general descriptions and clinical signs found within specific-level diagnoses, such as “sick,” “symptom of tiredness,” “body fever,” “anorexia,” and “emaciation.” Claims were registered when the total cost of veterinary appointments over a 125-day period exceeded the insurance deductible. Both the deductible amount and the maximum annual reimbursement were determined by the owner during enrollment (information not included in the dataset). Claims pertaining to preexisting conditions before insurance enrollment were not eligible for coverage by the insurance company. Information about preexisting conditions was not included in the provided dataset and, therefore, was not accounted for in the analysis. Insurance policies were renewed on an annual basis, and cats were under continuous observation until they died, withdrew from the insurance program, or reached the conclusion of the observation period (December 31, 2016). Any cats lacking or having uncertain information related to birth year, breed, sex, age, or enrollment date were excluded from the analysis.

Approval was obtained from the University of Guelph's Research Ethics Board (REB No. 20-05-029) for the use of the data for research purposes.

Outcomes of interest

Predictive models targeted 2 distinct diagnoses, periodontal disease and skin tumor, treating them as binary outcomes (disease positive and disease negative). For periodontal disease, 4 specific-level diagnoses within the digestive diagnostic category were combined to determine outcome status: gingivitis/stomatitis, periodontal disease, calculus of teeth, and root tip abscess. Similarly, for skin tumors, 4 specific-level diagnoses within the neoplasia diagnostic category were merged: melanoma/melanocytic skin, squamous cell carcinoma, mast cell tumor, and unspecified skin tumor. Cats that had submitted a claim for any of these diagnoses were labeled as disease positive. Disease-positive cats were excluded if they had < 2 years of insurance coverage before the first periodontal disease or skin tumor claim, resulting in a 2- to 5-year observation period before a positive diagnosis. Requiring at least a 2-year observation period before diagnosis allowed sufficient time to capture relevant health data and reduced the risk of missing early claims that could signal disease onset. Predictor variables included breed, sex, and the number of previous claims per general diagnostic category. Previous claims were included if they occurred before the date of first diagnosis for disease-positive cats and before a specific cutoff date for disease-negative cats. Consequently, the diagnoses that determined disease status were not included as predictors. The specific cutoff dates for disease-negative cats varied depending on the data processing and predictive model used, as detailed in subsequent sections.
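
The outcome definition above can be sketched as follows. This is a minimal illustration in Python, not the authors' code (the analysis was run in R). The function name `label_cat`, the tuple layout of a claim, and the use of fractional years as the time unit are assumptions made for the example.

```python
# Specific-level diagnoses combined into the periodontal disease outcome,
# as listed in the text. The exact string labels are assumed for illustration.
PERIODONTAL_DIAGNOSES = {
    "gingivitis/stomatitis",
    "periodontal disease",
    "calculus of teeth",
    "root tip abscess",
}
MIN_YEARS_BEFORE_DIAGNOSIS = 2.0


def label_cat(enrolled_at, claims, outcome_diagnoses=PERIODONTAL_DIAGNOSES):
    """Return 'positive', 'negative', or 'excluded' for one cat.

    enrolled_at: time of insurance enrollment in years (assumed unit).
    claims: list of (time_in_years, diagnosis) tuples (assumed layout).
    """
    outcome_times = [t for t, dx in claims if dx in outcome_diagnoses]
    if not outcome_times:
        return "negative"
    first_diagnosis = min(outcome_times)
    # Disease-positive cats need at least 2 years of coverage before
    # their first outcome claim; otherwise they are dropped.
    if first_diagnosis - enrolled_at < MIN_YEARS_BEFORE_DIAGNOSIS:
        return "excluded"
    return "positive"
```

Only claims dated before the first outcome claim (or before the cutoff date for disease-negative cats) would then be counted as predictors.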

Data processing and analysis

Data processing, analysis, and model workflow are provided (Figure 1). Initially, random 1:1 downsampling was employed to address the substantial class imbalance between the disease-negative majority class (316,134 to 316,409 cats; 99.68% to 99.77%) and the disease-positive minority class (735 to 1,010; 0.23% to 0.32%). This balanced the dataset (50% prevalence), preventing bias toward the majority class. In this data processing approach, the cutoff dates for previous claims of disease-negative cats were set to the end of the observation period. To account for possible variations in time at risk between disease-positive and disease-negative cats, the number of previous claims was divided by the observation length (total years the cat was insured). Previous claims were treated as continuous variables, and a standard RF classifier was used for analysis.
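
A minimal sketch of the random 1:1 downsampling and the claims-per-year scaling described above. The paper reports using the R package caret for downsampling; the Python helpers below (`downsample_one_to_one`, `claims_per_year`) are hypothetical stand-ins that only illustrate the idea.

```python
import random


def downsample_one_to_one(positives, negatives, seed=0):
    """Randomly keep as many negatives as there are positives."""
    rng = random.Random(seed)
    sampled_negatives = rng.sample(negatives, k=len(positives))
    return positives + sampled_negatives


def claims_per_year(n_claims, years_insured):
    """Scale a claim count by observation length (total years insured)."""
    if years_insured <= 0:
        raise ValueError("years_insured must be positive")
    return n_claims / years_insured


# Toy data: 5 disease-positive cats and 1,000 disease-negative cats.
positives = [{"id": i, "label": 1} for i in range(5)]
negatives = [{"id": 100 + i, "label": 0} for i in range(1000)]
balanced = downsample_one_to_one(positives, negatives)
prevalence = sum(cat["label"] for cat in balanced) / len(balanced)
```

After downsampling, the toy dataset has equal numbers of positive and negative cats, so a model that always predicts one class reaches only 50% accuracy, the baseline used in the paper.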

Figure 1
Figure 1

Data processing, analysis, and model workflow for predictive modeling on cat insurance data from Sweden (2011 to 2016).

Citation: American Journal of Veterinary Research 2025; 10.2460/ajvr.24.09.0282

To address potential confounders such as age and observation duration, and to account for time at risk using a design rather than an analytical approach, matched 1:1 downsampling was then implemented to create a new data subset. A matched set of disease-negative cats (controls) was created based on the age at the start of observation (months of age at the start of insurance) and total observation duration (months of insurance coverage) of each disease-positive cat (case). A single control was then randomly selected from the potential matches to pair with each case. For cases, the time from the observation start to the diagnosis date (first claim of periodontal disease or skin tumor) was calculated, and the same time at risk was applied to their matched controls to establish a cutoff date for previous claims. This ensured matching on age, total observation duration, and time at risk. To observe subtle expected differences among discrete levels of individual predictors, separate datasets treating the number of previous claims as continuous and categorical variables were established. Random forest analysis was performed on these case-control matched datasets, referred to as matched forest (MF).23
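
The matching step can be sketched as below. This is an illustration under stated assumptions rather than the authors' implementation: the record fields (`entry_age_months`, `duration_months`, `months_to_diagnosis`) are hypothetical names, exact matching on months is assumed, and controls are sampled without replacement, a detail the paper does not state.

```python
import random
from collections import defaultdict


def match_controls(cases, controls, seed=0):
    """Pair each case with one random control sharing entry age and duration.

    Returns (pairs, unmatched_cases).
    """
    rng = random.Random(seed)
    pool = defaultdict(list)
    for ctrl in controls:
        pool[(ctrl["entry_age_months"], ctrl["duration_months"])].append(ctrl)

    pairs, unmatched = [], []
    for case in cases:
        candidates = pool[(case["entry_age_months"], case["duration_months"])]
        if not candidates:
            unmatched.append(case)
            continue
        # Remove the chosen control so it is used at most once (assumption).
        ctrl = candidates.pop(rng.randrange(len(candidates)))
        # The control inherits the case's time at risk as its claim cutoff.
        pairs.append({
            "case": case,
            "control": ctrl,
            "cutoff_months": case["months_to_diagnosis"],
        })
    return pairs, unmatched


cases = [
    {"id": 1, "entry_age_months": 12, "duration_months": 48, "months_to_diagnosis": 30},
    {"id": 2, "entry_age_months": 3, "duration_months": 60, "months_to_diagnosis": 40},
]
controls = [
    {"id": 10, "entry_age_months": 12, "duration_months": 48},
    {"id": 11, "entry_age_months": 12, "duration_months": 48},
]
pairs, unmatched = match_controls(cases, controls)
```

In this toy example, case 2 has no control with the same entry age and duration, so it ends up in `unmatched`, mirroring the handful of cases the paper excluded for lack of a matched control.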

Since RF techniques do not inherently account for the paired nature of case-control data, within-pair data of the numbers of previous claims were then centered by subtracting the mean of each pair from each value. This approach accounted for potential differences in paired values. Previous claims were treated as continuous variables, and RF analysis was performed. This centered MF (CMF) model was developed to highlight differences more effectively and better identify all important factors.24
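
Within-pair centering is a one-line transformation. The helper below is a hypothetical Python illustration of the step described above, applied to the claim counts of one matched pair.

```python
def center_within_pair(case_value, control_value):
    """Subtract the pair mean from each member's value."""
    pair_mean = (case_value + control_value) / 2
    return case_value - pair_mean, control_value - pair_mean


# A case with 3 previous claims paired with a control with 1 previous claim.
centered_case, centered_control = center_within_pair(3, 1)
```

For 1:1 pairs, the centered values are always plus and minus half the within-pair difference, so pairs with identical claim counts contribute zeros and carry no signal for that predictor.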

To further explore clinical relevance and the biological plausibility of predictors, conditional logistic regression (CLR) was subsequently employed. This traditional method has long been the standard for analyzing and explaining matched case-control data,25,26 particularly due to its ability to adjust for confounding variables and its robustness in dealing with matched pairs. The same matched case-control datasets with continuous and categorical previous claims were used, and both multivariable and univariable CLR models were applied.
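
For 1:1 matched pairs, the conditional likelihood contribution of each pair reduces to 1 / (1 + exp(-β d)), where d is the case value minus the control value, so the model is a logistic regression on within-pair differences with no intercept. The sketch below fits a single-predictor version with Newton-Raphson. It is a teaching illustration in plain Python, not the `survival` package routine used in the paper, and the function name is hypothetical.

```python
import math


def fit_clr_one_predictor(differences, iterations=25):
    """Fit a 1:1 conditional logistic model with one predictor.

    differences: case value minus control value, one entry per matched pair.
    Returns (coefficient, standard_error).
    """
    beta = 0.0
    information = 0.0
    for _ in range(iterations):
        gradient = 0.0
        information = 0.0
        for d in differences:
            p = 1.0 / (1.0 + math.exp(-beta * d))
            gradient += d * (1.0 - p)
            information += d * d * p * (1.0 - p)
        if information == 0.0:
            raise ValueError("no informative (discordant) pairs")
        beta += gradient / information  # Newton-Raphson update
    # Information from the final pass; at convergence the update is negligible.
    return beta, 1.0 / math.sqrt(information)


# Toy binary exposure: 30 pairs where only the case is exposed (d = +1),
# 10 pairs where only the control is exposed (d = -1), 60 concordant pairs.
differences = [1] * 30 + [-1] * 10 + [0] * 60
beta, se = fit_clr_one_predictor(differences)
odds_ratio = math.exp(beta)
```

With a binary exposure, the estimate equals the log of the ratio of discordant pairs (here 30/10), and concordant pairs do not affect it. If every discordant pair points the same way, the estimate does not converge, and the coefficient and standard error can become extremely large.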

Ultimately, the performance of 6 models with different data processing methods was assessed: RF with the number of previous claims as a continuous variable, MF with previous claims as a continuous variable, MF with previous claims as a categorical variable, CMF with previous claims as a continuous variable, CLR with previous claims as a continuous variable (CLR-con), and CLR with previous claims as a categorical variable (CLR-cat).

Predictive analysis

Various predictive models were developed and compared, including 4 based on RF and 2 on CLR. The probability threshold for classifying positive/negative cases was set at the default of 0.5. The tuning parameters for the RF models were also set to standard default levels, including the number of trees (500) and the number of variables randomly sampled at each tree split (square root of the total number of predictor variables). Model performance was analyzed using leave-one-out cross-validation (LOOCV), generating a variety of model metrics. While CLR is primarily used for explanatory analysis, LOOCV was also applied to the multivariable CLR models to get a sense of its predictive ability as well.
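
The evaluation loop can be sketched as follows. The LOOCV routine is generic; the classifier plugged into it is a deliberately simple placeholder (not a random forest), and the toy data are invented. The default `mtry` shown is the floor of the square root of the number of predictors, which is the classification default in R's randomForest; the value of 26 predictors (breed, sex, and 24 diagnostic categories) is an assumption used only to show how a value of 5 can arise.

```python
import math


def default_mtry(n_predictors):
    """Classification default in R's randomForest: floor(sqrt(p))."""
    return math.floor(math.sqrt(n_predictors))


def loocv_accuracy(rows, labels, fit, predict_proba, threshold=0.5):
    """Leave-one-out cross-validation accuracy for a binary classifier."""
    correct = 0
    for i in range(len(rows)):
        train_rows = rows[:i] + rows[i + 1:]
        train_labels = labels[:i] + labels[i + 1:]
        model = fit(train_rows, train_labels)
        predicted = 1 if predict_proba(model, rows[i]) >= threshold else 0
        correct += int(predicted == labels[i])
    return correct / len(rows)


# Placeholder model: predicts positive when a claim count is closer to the
# positive-class mean than to the negative-class mean.
def fit_means(rows, labels):
    pos = [r for r, y in zip(rows, labels) if y == 1]
    neg = [r for r, y in zip(rows, labels) if y == 0]
    return sum(pos) / len(pos), sum(neg) / len(neg)


def proba_means(model, row):
    pos_mean, neg_mean = model
    return 1.0 if abs(row - pos_mean) < abs(row - neg_mean) else 0.0


# Invented, perfectly separable toy data: previous claim counts and labels.
claims = [0, 0, 1, 0, 2, 3, 4, 2]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
accuracy = loocv_accuracy(claims, labels, fit_means, proba_means)
```

Each observation is predicted by a model that never saw it, so the number of model fits equals the number of observations.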

Interpretability

Two key methods were utilized to explore the importance of individual variables or features for prediction in this study: multiway variable importance plots and partial dependence plots. Multiway importance plots provide a visual representation of the importance of different predictors in a model, helping to explain which variables are most influential in predicting the outcome. Three key model metrics were chosen for these plots: mean decrease in the Gini index of impurity (ie, increase of node purity), root node frequency (ie, number of trees in which the predictor is used for splitting the root node; dividing the whole sample into subsets), and accuracy decrease (ie, mean decrease of prediction accuracy after the predictor is permuted). Multiway importance plots were created using all predictor variables for both periodontal disease and skin tumors. Partial dependence plots offer a graphical depiction of the relationship between a predictor variable and the outcome of interest, while averaging out the effects of all other variables. Partial dependence plots were created to illustrate the effects of top influential predictors on the predicted affinity for periodontal disease–positive and skin tumor–positive designation. These relationships were then qualitatively compared to the statistically significant coefficients of the CLR models noting the direction of associations.
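
The two ideas behind these plots can be sketched in a few lines. The code below is a simplified illustration, not what randomForest or randomForestExplainer do internally: the random forest accuracy decrease is computed on out-of-bag samples per tree, whereas this sketch permutes a feature on the same data used for scoring. `toy_predict` is a hypothetical stand-in for a fitted model.

```python
import random


def partial_dependence(predict, rows, feature, grid):
    """Average prediction when `feature` is forced to each grid value."""
    curve = []
    for value in grid:
        modified = [dict(row, **{feature: value}) for row in rows]
        curve.append(sum(predict(r) for r in modified) / len(rows))
    return curve


def permutation_importance(predict, rows, labels, feature, seed=0):
    """Drop in accuracy after shuffling one feature across rows."""
    def accuracy(data):
        hits = sum((predict(r) >= 0.5) == bool(y) for r, y in zip(data, labels))
        return hits / len(labels)

    rng = random.Random(seed)
    shuffled = [r[feature] for r in rows]
    rng.shuffle(shuffled)
    permuted = [dict(r, **{feature: v}) for r, v in zip(rows, shuffled)]
    return accuracy(rows) - accuracy(permuted)


# Hypothetical predicted probability: driven mainly by any digestive claim.
def toy_predict(row):
    return 0.2 + 0.6 * (row["digestive"] > 0) + 0.1 * (row["sex"] == "M")


rows = [
    {"digestive": 0, "sex": "F"},
    {"digestive": 1, "sex": "F"},
    {"digestive": 0, "sex": "M"},
    {"digestive": 2, "sex": "M"},
]
labels = [0, 1, 0, 1]
pd_curve = partial_dependence(toy_predict, rows, "digestive", [0, 1, 2])
sex_importance = permutation_importance(toy_predict, rows, labels, "sex")
```

In this toy setup the partial dependence curve jumps between 0 and 1 claim and is flat afterward, and permuting sex leaves accuracy unchanged because sex never moves a prediction across the 0.5 threshold.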

Data processing and analysis were conducted in R, version 4.3.0, and RStudio, version 2023.06.0+421 (The R Foundation). Downsampling was performed using the R package caret, version 6.0.94. Models were developed and analyzed using the R packages randomForest, version 4.7.1.1; caret, version 6.0.94; and survival, version 3.5.5, and core functions in the base R environment. Plots were generated with the R packages randomForestExplainer, version 0.10.1; randomForest, version 4.7.1.1; and ggplot2, version 3.4.2.

Results

Almost 550,000 cats were insured during the observation period, contributing to 1,617,931 cat years at risk (CYAR). Domestic cats contributed 1,270,134 CYAR (78.5% of total CYAR), and purebred cats contributed 347,797 CYAR (21.5%). Male cats contributed 830,054.7 CYAR (51.3%) and female cats 787,876.3 CYAR (48.7%). Male and female percentages were similar among domestics (51.2%/48.8%, respectively) and purebreds (51.8%/48.2%, respectively). Descriptive statistics of this dataset are further explained in a previous study12 published by the current authors.

Periodontal disease

Model accuracy was comparable across all models. Six cases did not have matched controls and were excluded from MF, CMF, and CLR models. Model metrics are provided (Table 1). Previous digestive and whole-body symptom claims were the most influential variables across all models and were followed by skin and injury claims, and breed ranked fifth in all MF models (Figure 2). Breed was less influential in the nonmatched RF model than in the RF models that used the matched approach. When examined through partial dependence plots, cats with zero digestive claims were less likely to be periodontal disease positive compared to those with any number of prior digestive claims. Random forest predictive affinity for periodontal disease–positive designation peaked at 1 claim and leveled off slightly lower (Figure 3). The CLR models were consistent with these findings, exhibiting a similar positive association of periodontal disease as digestive claims increased (Table 2; Supplementary Tables S1–S3). Similar results were also seen with previous whole-body symptom claims (Supplementary Figure S1). Partial dependence plots and CLR analysis revealed that Somali, Birman, and Russian Blue cats are least likely to be periodontal disease positive, while Siamese, Burmese, and Maine Coon cats show the highest likelihood (Figure 4). Complete summaries of the multivariable CLR-con model, univariable CLR-con models, and univariable CLR-cat models for previous digestive and whole-body symptom claims are provided.

Table 1

Model statistics of various predictive models for periodontal disease on cat insurance data from Sweden (2011 to 2016).

| Statistic | Random forest–continuous claims | Matched forest–continuous claims | Matched forest–categorical claims | Centered matched forest–continuous claims | Conditional logistic regression–continuous claims (multivariable) | Conditional logistic regression–categorical claims (multivariable) |
|---|---|---|---|---|---|---|
| Data set | Random 1:1 downsampling with claims/time at risk | Matched by duration and entry age | Matched by duration and entry age | Matched by duration and entry age | Matched by duration and entry age | Matched by duration and entry age |
| n | 1,470 | 1,458* | 1,458* | 1,458* | 1,458* | 1,458* |
| Balanced accuracy (95% CI) | 88.16% (86.40–89.77) | 83.47% (81.46–85.34) | 83.94% (81.96–85.80) | 85.19% (83.26–86.97) | 86.35% (84.48–88.07) | 84.09% (82.11–85.93) |
| κ | 76.33% | 66.94% | 67.90% | 70.37% | 72.70% | 68.18% |
| Sensitivity | 96.19% | 93.96% | 95.06% | 85.60% | 86.42% | 84.22% |
| Specificity | 80.14% | 72.98% | 72.84% | 84.77% | 86.28% | 83.95% |
| Positive predictive value | 82.88% | 77.66% | 77.78% | 84.90% | 86.30% | 83.99% |
| Negative predictive value | 95.46% | 92.36% | 93.65% | 85.48% | 86.40% | 84.18% |
| F1 | 89.04% | 85.04% | 85.56% | 85.25% | 86.36% | 84.11% |
| Prevalence | 50.00% | 50.00% | 50.00% | 50.00% | 50.00% | 50.00% |
| P value | < .01 | < .01 | < .01 | < .01 | < .01 | < .01 |

Tuning parameter “mtry” for random forest models was held constant at a value of 5 variables at each split, based on the default mtry = sqrt(p), where p is the number of predictor variables. Random forest number of trees = 500. Conditional logistic regression iteration maximum = 1,000.

*6 cases did not have matched controls.
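
The statistics in Table 1 are linked through the confusion matrix, which the short Python sketch below makes explicit. The counts used in the example are not reported in the paper; they are back-calculated here from the rounded percentages for the random forest model, assuming 735 cats per class (n = 1,470 at 50% prevalence), and should be read as an illustration only.

```python
def binary_metrics(tp, fn, tn, fp):
    """Common classification metrics from confusion-matrix counts."""
    n = tp + fn + tn + fp
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / n
    # Cohen's kappa: observed agreement corrected for chance agreement.
    expected = ((tp + fn) * (tp + fp) + (tn + fp) * (tn + fn)) / n**2
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "balanced_accuracy": (sensitivity + specificity) / 2,
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "f1": 2 * tp / (2 * tp + fp + fn),
        "kappa": (accuracy - expected) / (1 - expected),
    }


# Inferred (not reported) counts for the random forest column of Table 1.
metrics = binary_metrics(tp=707, fn=28, tn=589, fp=146)
percentages = {name: round(100 * value, 2) for name, value in metrics.items()}
```

With a perfectly balanced dataset, chance agreement is 50%, so κ equals twice the accuracy minus 1, which is why the κ column tracks the balanced accuracy column so closely.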

Figure 2

Random forest multiway variable importance plots for periodontal disease on cat insurance data from Sweden (2011 to 2016). Mean decrease in Gini impurity (more pure nodes) is on the x-axis and root node frequency (initial split in decision tree) is on the y-axis. Point size indicates the decrease in model accuracy if that variable is permuted. Key predictors are highlighted, have larger points, and are in the top right of the graph.

Figure 3

Random forest partial dependence plots of digestive claims for periodontal disease on cat insurance data from Sweden (2011 to 2016).

Table 2

Conditional logistic regression models of digestive claims for periodontal disease on cat insurance data from Sweden (2011 to 2016).

| Variable | Coefficient (SE) | OR (95% CI) | P value |
|---|---|---|---|
| Digestive–univariable continuous† | 1.06 (0.13) | 2.88 (2.23–3.74) | < .001*** |
| Digestive–multivariable continuous† | 1.00 (0.17) | 2.72 (1.94–3.8) | < .001*** |
| Digestive–univariable 1 claim‡ | 1.79 (0.21) | 5.98 (3.99–8.94) | < .001*** |
| Digestive–univariable 2 claims‡ | 1.52 (0.42) | 4.58 (2.01–10.45) | < .001*** |
| Digestive–univariable 3 claims‡ | 0.91 (0.55) | 2.49 (0.84–7.37) | .099 |
| Digestive–univariable 4 claims‡ | 2.81 (1.06) | 16.66 (2.08–133.51) | .008** |
| Digestive–univariable 5 claims‡ | 18.20 (4,484.13) | 80,429,754.83 (0.00–infinity) | .997 |

Table includes up to 5 claims for univariable models, as additional claims showed no significant difference from the baseline.

†Conditional logistic regression model–continuous claim numbers.

‡Conditional logistic regression model–categorical claim numbers; reference value is 0 claims.

**P ≤ .01.

***P ≤ .001.
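
The odds ratios and confidence limits in Table 2 follow from the coefficients and standard errors by exponentiation: OR = exp(β), with a Wald 95% CI of exp(β ± 1.96 × SE). The sketch below checks this for the univariable continuous row. The function name is hypothetical, and small differences from the table are expected because the published coefficient and SE are rounded to 2 decimals.

```python
import math


def odds_ratio_with_ci(coefficient, standard_error, z=1.96):
    """Odds ratio and Wald confidence limits from a log-odds coefficient."""
    return (
        math.exp(coefficient),
        math.exp(coefficient - z * standard_error),
        math.exp(coefficient + z * standard_error),
    )


# Digestive, univariable continuous: coefficient 1.06, SE 0.13 (Table 2).
odds_ratio, lower, upper = odds_ratio_with_ci(1.06, 0.13)
```

The same arithmetic shows why the 5-claims row is uninformative: with a standard error in the thousands, the exponentiated limits span effectively zero to infinity.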

Figure 4

Random forest partial dependence plots of breed for periodontal disease on cat insurance data from Sweden (2011 to 2016).

Skin tumor

The skin tumor models had similar results to periodontal disease. Three cases did not have matched controls and were excluded from MF, CMF, and CLR models. Model metrics are provided (Supplementary Table S4). Previous skin and injury claims were the most influential variables across all MF models and were followed by digestive and whole-body symptom claims, and nonskin neoplasia ranked fifth across all models (Supplementary Figure S2). Breed was ranked approximately eighth in all models and much lower in the nonmatched RF model. When examined through partial dependence plots, cats with zero skin claims were much less likely to be skin tumor positive compared to those with any skin claims. Random forest predictive affinity for skin tumor–positive designation also peaked at 1 digestive claim and leveled off slightly lower (Supplementary Figure S3). The CLR models were consistent with these findings, exhibiting a similar positive association of skin tumors as digestive claims increased (Supplementary Tables S5–S7). Similar results were seen with whole-body symptom claims (Supplementary Figure S4) and other diagnostic categories. Partial dependence plots and CLR analysis revealed that Somali and Birman cats have a lower likelihood of skin tumor positivity, with Devon Rex/Sphynx cats, Norwegian Forest Cats, and Maine Coon cats showing higher odds (Supplementary Figure S5). Complete summaries of the multivariable CLR-con model, univariable CLR-con models, and univariable CLR-cat models for previous digestive and whole-body symptom claims are provided.

Discussion

In this study, we aimed to explore the predictive potential of pet insurance data through various data processing and analysis techniques. The RF models revealed good predictive accuracy and a substantial improvement over the baseline accuracy. Partial dependence plots and CLR results revealed a relatively low risk for periodontal disease with zero previous claims and a dramatic increase in risk with at least 1 claim for all categories of previous claims that were identified as important according to multiple categories. A similar pattern was revealed for predictors of skin tumors identified as the most important ones. The association between the number of previous claims and periodontal disease may be due to the various disorders that have been shown to be associated with oral lesions, including digestive disorders (eg, vomiting, nutrition), upper urinary disorders (eg, renal disease), endocrine disorders (eg, diabetes mellitus), infections (eg, FeLV, FIV, feline calicivirus), and immunological disorders (eg, erythema multiforme, eosinophilic granuloma complex, pemphigus vulgaris).27–30 Digestive claims had the highest influence on periodontal disease, as did skin/injury claims on skin tumors. This may be due to the inclusion of related conditions within their respective diagnostic categories. For instance, “difficulty chewing” and “bleeding from the mouth/throat” are diagnoses within the digestive category that could precede periodontal disease. However, these diagnoses, with relatively few claims, were not excluded from the predictor set, as their impact is likely minimal. Diagnosis misclassification may have also occurred. For example, a skin tumor could initially be misclassified as an injury or a different skin lesion, particularly in early diagnostic stages.

While some correlations have potential biological explanations, the strong predictive power of previous claims in certain diagnostic categories lacks a clear biological basis. Therefore, the consistent pattern seen across claim categories and for different disease outcomes may suggest that it could be influenced by the nature of the data. The most influential diagnostic claims also corresponded to the categories with the highest number of claims.12 This pattern may indicate that cats with a single condition are more prone to others or it may also be due to detection bias. Cats with a higher number of claims may visit the veterinarian more frequently, thereby increasing the likelihood of disease detection. Many owners may not have noticed their cat had periodontal disease or a skin tumor or did not think they were important enough for a veterinary visit. Consequently, these conditions may have only been identified when cats presented for other health concerns, and a claim could have then been submitted after further investigation. This could suggest that previous claims may not actually predict the occurrence of periodontal disease and skin tumors but rather the likelihood of detecting them. Previous claims may be nonspecific and predictive of various diseases, especially when they encompass a composite of different diagnoses and clinical issues. An interesting avenue for further research would be to explore whether most insurance claims are from cats with many claims across multiple diagnoses or from cats with few claims in 1 or 2 diagnoses.

Breed emerged as a significant predictor, identifying several breeds as either high risk or low risk. Domestic cats were associated with periodontal disease–negative predictions, whereas Maine Coon, Siamese, and Burmese cats were linked with positive predictions. In the case of skin tumors, Norwegian Forest Cats, Devon Rex/Sphynx cats, and Maine Coon cats were linked with positive predictions, while Birman and domestic cats were associated with negative predictions. Periodontal disease and skin tumors, the most common form of feline neoplasia, are common causes of morbidity in cats.12,31,32 To help prevent or catch these conditions early, veterinarians can use breed risk information to better inform and guide their clients, particularly those with high-risk breeds, by initiating early-life conversations and providing tailored advice on specific lifestyle and monitoring recommendations for their pets. For instance, knowing that Maine Coon and Siamese cats are at higher risk for periodontal disease, veterinarians can emphasize the importance of regular dental check-ups and proactive oral care for these breeds. Similarly, for breeds predisposed to skin tumors, such as Norwegian Forest Cats and Devon Rex/Sphynx cats, veterinarians can advise on regular skin examinations and early detection strategies. By communicating these breed-specific risks and preventive measures, veterinarians can foster a proactive approach to pet health, helping pet owners understand the importance of regular veterinary visits and early interventions.

The underrepresentation of rare breeds, combined with breed groupings, may have limited the ability to fully capture breed-related disease risks. While matching on age was crucial for interpreting breed associations, it is worth noting that age is likely an important predictor variable itself. Although investigating age effects was beyond the scope of this study, understanding how age influences disease predictions would be valuable for clinicians in identifying when disease risk begins to significantly increase. While LOOCV can be prone to overfitting, particularly with complex models or smaller datasets, our use of an ensemble method like RF on a moderately sized dataset likely helped mitigate this risk. It is also important to consider the quality of the insurance data used in this study and the inherent limitations related to all secondary data sources. The accuracy of critical information, like diagnosis, may vary due to differences in clinical expertise, examination routines, diagnostic methods, and medical record documentation among veterinarians. Datasets may also suffer from inconsistencies in terminology. While Agria's standardized and structured dataset shows promise as a reliable and validated source for research,2,3,7–12,15–18 further validation of the accuracy, completeness, and representativeness of the recorded data would be ideal. An attempt was made to indirectly assess the validity of the results from this study by comparing outcome trends across various diagnoses. However, it remains unclear whether the results genuinely represent biological associations or predominantly reveal data characteristics. This may be partially due to low detection sensitivity, leading to an underrepresentation of disease-positive cats. Minor or inexpensive problems may not have been recorded due to deductibles. However, an expert on Agria Pet Insurance indicates this likely had minimal impact since deductibles are generally low.
Additionally, while maximum annual reimbursement amounts may vary across policies, they are usually high and rarely reached, and the vast majority of policyholders pick a “standard” amount; these limits should thus have a limited effect on claim submissions. Elective dental procedures and preexisting conditions, occurring more commonly, were not covered by veterinary care insurance. Additionally, the dataset only includes a 6-year observation period; claims and diagnoses made before this period are unknown. Therefore, while it is reasonable to assume that disease-positive cats (cases) are likely correctly identified, disease-negative cats (controls) may be incorrectly classified (false negatives). This could bias model evaluation, mislead predictor importance (most likely causing some influential variables to be missed), and ultimately impact the outcomes and interpretability of the results. The observed association between previous claims and disease status may be due to detection bias (also known as surveillance or monitoring bias). This bias suggests that cats with prior claims receive closer monitoring, increasing the likelihood of detecting other conditions. To address this, cases and controls could be matched or stratified by veterinary care intensity to ensure comparable monitoring levels, or veterinary care utilization could be incorporated into statistical models. Although variation in policies could also cause detection bias, because differences in coverage may lead to differences in claim submissions, the only areas of variation during the study period were deductibles and maximum annual reimbursements. Had this information been available, it could also have been accounted for through matching, stratification, or modeling. Nevertheless, these factors likely had a minimal effect as described earlier.
However, potential owner preferences to not pursue veterinary care or insurance claim submissions could impact the data, leading to underrepresentation of issues in certain patients. Furthermore, the level of detail used for predictors was limited by model complexity and data availability. A similar study1 that used insurance claims data to predict health outcomes in dogs demonstrated that feature richness had a significant impact on model performance. Using more granular diagnoses instead of general disease categories might have produced more disease-specific predictors. Including more comprehensive medical information from health records, such as diagnostic results and treatments (both therapeutic and preventive), may be valuable for better identifying predictors and understanding their relationship to specific disease outcomes.
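The suggestion above to match cases and controls on veterinary care intensity could look roughly like the following sketch. It is an illustration under stated assumptions, not the matching procedure used in this study (which matched on age, insurance duration, and time at risk). The record fields (`id`, `is_case`, `visits`), the stratum cut points, and the function names are all hypothetical.

```python
import random
from collections import defaultdict


def visit_stratum(visits):
    """Bin a prior-visit count into a coarse care-intensity stratum (assumed cut points)."""
    if visits == 0:
        return "0"
    if visits <= 2:
        return "1-2"
    if visits <= 5:
        return "3-5"
    return "6+"


def match_on_care_intensity(cats, seed=0):
    """Pair each case with one randomly chosen control from the same stratum.

    `cats` is a list of dicts with hypothetical keys 'id', 'is_case', 'visits'.
    Returns (case_id, control_id) pairs; cases with no available control are dropped.
    """
    rng = random.Random(seed)
    controls_by_stratum = defaultdict(list)
    for cat in cats:
        if not cat["is_case"]:
            controls_by_stratum[visit_stratum(cat["visits"])].append(cat["id"])
    # Shuffle each pool so that the control taken for a case is a random draw.
    for pool in controls_by_stratum.values():
        rng.shuffle(pool)
    pairs = []
    for cat in cats:
        if cat["is_case"]:
            pool = controls_by_stratum[visit_stratum(cat["visits"])]
            if pool:
                pairs.append((cat["id"], pool.pop()))  # each control is used once
    return pairs


# Toy records, invented for illustration only.
cats = [
    {"id": "A", "is_case": True, "visits": 1},
    {"id": "B", "is_case": False, "visits": 2},
    {"id": "C", "is_case": True, "visits": 7},
    {"id": "D", "is_case": False, "visits": 0},
    {"id": "E", "is_case": False, "visits": 9},
    {"id": "F", "is_case": True, "visits": 4},
]
pairs = match_on_care_intensity(cats)
print(pairs)  # [('A', 'B'), ('C', 'E')]
```

In this toy example, case F has no control in its stratum and is dropped, which shows the usual trade-off of adding a matching variable: closer comparability of monitoring levels at the cost of fewer matched pairs.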

There is an increase in the availability of diverse health-related datasets, such as pet insurance records, electronic medical records, health device metrics, and electronic health records. Efforts to enhance data accessibility and to combine unique datasets reflect a broader trend to promote the creation and utilization of “big data.”33,34 As the availability of big data sources increases, a key challenge is translating the abundance of data into meaningful information to transition from simply possessing big data toward creating valuable “smart data.”34 Artificial intelligence methods like machine learning inherently offer a versatile and powerful approach for analyzing epidemiological patterns in large complex high-dimensional datasets. However, the enthusiasm for leveraging it sometimes overlooks the critical step of data validation, which can increase the likelihood of finding spurious associations or generating results of unknown validity. It can be tempting to proceed with modeling on big datasets driven by technological capabilities. This study underscores the necessity for cautious interpretation of results, even when models perform well on large standardized well-structured data sources. While model accuracy was relatively good in this study, further investigation is needed to explore the clinical relevance of the predictors. Greater predictive potential can be achieved by adding rich predictor variables, integrating additional models, and employing various data analysis strategies. These predictive insights could enable veterinarians to offer pet owners specific preventive health advice by identifying risk factors for conditions, facilitating preventative measures and early detection, and delivering personalized treatment plans. Effective communication is crucial in translating predictive modeling into actionable strategies for pet owners. 
Developing comprehensive health reports and adaptive recommendation systems can enhance the impact of these models through clear and concise communication of AI-driven recommendations. With vast amounts of health data becoming available, veterinarians will need to assist pet owners in understanding its significance and guide them in choosing the best proactive health management strategies. This may enhance the veterinarian-client relationship, empower pet owners to actively manage their pets’ health, and improve compliance with preventive care protocols, leading to healthier pets and more efficient use of healthcare resources.

Supplementary Materials

Supplementary materials are posted online at the journal website: avmajournals.avma.org.

Acknowledgments

The authors thank Agria Pet Insurance for access to the database.

Disclosures

The authors have nothing to disclose. No AI-assisted technologies were used in the composition of this manuscript.

Funding

The authors thank the Morris Animal Foundation for funding this study (grant ID: D21FE-501). Dr. Hadar (PhD program) was also supported by the IDEXX Chair in Emerging Technologies and Preventive Healthcare, University of Guelph Faculty Research Funds, and the International Graduate Tuition Scholarship at the University of Guelph. Funding organizations did not have any involvement in the study design, data analysis and interpretation, writing, or publication of the manuscript.

References

  • 1.

    Debes C, Wowra J, Manzoor S, Ruple A. Predicting health outcomes in dogs using insurance claims data. Sci Rep. 2023;13(1):9122. doi:10.1038/s41598-023-36023-5

  • 2.

    Egenvall A, Nødtvedt A, Penell J, Gunnarsson L, Bonnett BN. Insurance data for research in companion animals: benefits and limitations. Acta Vet Scand. 2009;51(1):42. doi:10.1186/1751-0147-51-42

  • 3.

    Hardefeldt LY, Selinger J, Stevenson MA, et al. Population wide assessment of antimicrobial use in dogs and cats using a novel data source – a cohort study using pet insurance data. Vet Microbiol. 2018;225:34–39. doi:10.1016/j.vetmic.2018.09.010

  • 4.

    Egenvall A, Nødtvedt A, Häggström J, Ström Holst B, Möller L, Bonnett B. Mortality of life-insured Swedish cats during 1999–2006: age, breed, sex, and diagnosis. J Vet Intern Med. 2009;23(6):1175–1183. doi:10.1111/j.1939-1676.2009.0396.x

  • 5.

    Egenvall A, Bonnett BN, Häggström J, Holst BS, Möller L, Nødtvedt A. Morbidity of insured Swedish cats during 1999–2006 by age, breed, sex, and diagnosis. J Feline Med Surg. 2010;12(12):948–959. doi:10.1016/j.jfms.2010.08.008

  • 6.

    Sweden Pet Insurance Market is expected to attain a Market size of SEK 8900 Mn growing at a CAGR of ∼8% (2022P-2027F) owing to increasing pet adoption, rising awareness and high veterinary costs: Ken Research. PR Newswire. Accessed May 27, 2024. https://www.prnewswire.com/news-releases/sweden-pet-insurance-market-is-expected-to-attain-a-market-size-of-sek-8900-mn-growing-at-a-cagr-of-8-2022p-2027f-owing-to-increasing-pet-adoption-rising-awareness-and-high-veterinary-costs-ken-research-301741790.html

  • 7.

    Öhlund M, Fall T, Ström Holst B, Hansson-Hamlin H, Bonnett B, Egenvall A. Incidence of diabetes mellitus in insured Swedish cats in relation to age, breed and sex. J Vet Intern Med. 2015;29(5):1342–1347. doi:10.1111/jvim.13584

  • 8.

    Bergström A, Nødtvedt A, Lagerstedt AS, Egenvall A. Incidence and breed predilection for dystocia and risk factors for cesarean section in a Swedish population of insured dogs. Vet Surg. 2006;35(8):786–791. doi:10.1111/j.1532-950X.2006.00223.x

  • 9.

    Engdahl K, Hanson J, Bergström A, Bonnett B, Höglund O, Emanuelson U. The epidemiology of stifle joint disease in an insured Swedish dog population. Vet Rec. 2021;189(3):e197. doi:10.1002/vetr.197

  • 10.

    Hanson JM, Tengvall K, Bonnett BN, Hedhammar Å. Naturally occurring adrenocortical insufficiency – an epidemiological study based on a Swedish-insured dog population of 525,028 dogs. J Vet Intern Med. 2016;30(1):76–84. doi:10.1111/jvim.13815

  • 11.

    Nødtvedt A, Guitian J, Egenvall A, Emanuelson U, Pfeiffer DU. The spatial distribution of atopic dermatitis cases in a population of insured Swedish dogs. Prev Vet Med. 2007;78(3):210–222. doi:10.1016/j.prevetmed.2006.10.007

  • 12.

    Hadar BN, Bonnett BN, Poljak Z, Bernardo TM. Morbidity of insured Swedish cats between 2011 and 2016: comparing disease risk in domestic crosses and purebreds. Vet Rec. 2023;192(12):e2778. doi:10.1002/vetr.2778

  • 13.

    Veterinary research & analytics information. Nationwide veterinary analytics. Nationwide. Accessed December 15, 2023. https://www.petinsurance.com/veterinarians/research/

  • 14.

    Furry finances: global pet insurance adoption rates revealed. The Insurance Emporium. February 28, 2024. Accessed May 27, 2024. https://www.theinsuranceemporium.co.uk/blog/global-pet-insurance-adoption-rates/

  • 15.

    Olson P, Kängström L. Svenska djursjukhusföreningen (Swedish Animal Hospital Organization). Diagnosregister för häst, hund och katt (Diagnostic registry for the horse, the dog and the cat). Taberg, Stockholm: Tabergs tryckeri; 1993.

  • 16.

    Egenvall A, Bonnett BN, Olson P, Hedhammar Å. Validation of computerized Swedish dog and cat insurance data against veterinary practice records. Prev Vet Med. 1998;36(1):51–65. doi:10.1016/S0167-5877(98)00073-7

  • 17.

    Nødtvedt A, Bergvall K, Emanuelson U, Egenvall A. Canine atopic dermatitis: validation of recorded diagnosis against practice records in 335 insured Swedish dogs. Acta Vet Scand. 2006;48(1):8. doi:10.1186/1751-0147-48-8

  • 18.

    Heske L, Berendt M, Jäderlund KH, Egenvall A, Nødtvedt A. Validation of the diagnosis canine epilepsy in a Swedish animal insurance database against practice records. Prev Vet Med. 2014;114(3):145–150. doi:10.1016/j.prevetmed.2014.03.003

  • 19.

    Javaid M, Haleem A, Pratap Singh R, Suman R, Rab S. Significance of machine learning in healthcare: features, pillars and applications. Int J Intell Netw. 2022;3:58–73. doi:10.1016/j.ijin.2022.05.002

  • 20.

    Haug CJ, Drazen JM. Artificial intelligence and machine learning in clinical medicine, 2023. N Engl J Med. 2023;388(13):1201–1208. doi:10.1056/NEJMra2302038

  • 21.

    Breiman L. Random forests. Mach Learn. 2001;45(1):5–32. doi:10.1023/A:1010933404324

  • 22.

    Couronné R, Probst P, Boulesteix AL. Random forest versus logistic regression: a large-scale benchmark experiment. BMC Bioinformatics. 2018;19(1):270. doi:10.1186/s12859-018-2264-5

  • 23.

    Shomal Zadeh N, Lin S, Runger GC. Matched forest: supervised learning for high-dimensional matched case-control studies. Bioinformatics. 2020;36(5):1570–1576. doi:10.1093/bioinformatics/btz785

  • 24.

    Stanfill B, Reehl S, Bramer L, et al. Extending classification algorithms to case-control studies. Biomed Eng Comput Biol. 2019;10:1179597219858954. doi:10.1177/1179597219858954

  • 25.

    Conway A, Rolley JX, Fulbrook P, Page K, Thompson DR. Improving statistical analysis of matched case-control studies. Res Nurs Health. 2013;36(3):320–324. doi:10.1002/nur.21536

  • 26.

    Hosmer DW. Applied Logistic Regression. Wiley; 1989.

  • 27.

    Gawor J. Systemic diseases influencing oral health and conditions. In: The Veterinary Dental Patient. John Wiley & Sons Ltd; 2021:219–231. doi:10.1002/9781118974674.ch16

  • 28.

    Dokuzeylul B, Kayar A, Or ME. Prevalence of systemic disorders in cats with oral lesions. Vet Med. 2016;61(4):219–223. doi:10.17221/8823-VETMED

  • 29.

    Soltero-Rivera M, Goldschmidt S, Arzi B. Feline chronic gingivostomatitis current concepts in clinical management. J Feline Med Surg. 2023;25(8):1098612X231186834. doi:10.1177/1098612X231186834

  • 30.

    Reiter AM, Mendoza KA. Feline odontoclastic resorptive lesions: an unsolved enigma in veterinary dentistry. Vet Clin North Am Small Anim Pract. 2002;32(4):791–837. doi:10.1016/S0195-5616(02)00027-X

  • 31.

    O’Neill DG, Blenkarn A, Brodbelt DC, Church DB, Freeman A. Periodontal disease in cats under primary veterinary care in the UK: frequency and risk factors. J Feline Med Surg. 2023;25(3):1098612X231158154.

  • 32.

    Ho NT, Smith KC, Dobromylskyj MJ. Retrospective study of more than 9000 feline cutaneous tumours in the UK: 2006–2013. J Feline Med Surg. 2018;20(2):128–134. doi:10.1177/1098612X17699477

  • 33.

    Andreu-Perez J, Poon CCY, Merrifield RD, Wong STC, Yang GZ. Big data for health. IEEE J Biomed Health Inform. 2015;19(4):1193–1208. doi:10.1109/JBHI.2015.2450362

  • 34.

    VanderWaal K, Morrison RB, Neuhauser C, Vilalta C, Perez AM. Translating big data into smart data for veterinary epidemiology. Front Vet Sci. 2017;4:110. doi:10.3389/fvets.2017.00110
