Machine Learning: An Orthopaedic Crash Course

By: Ian M. Al’Khafaji, M.D., FAANA and Pei Wei Chi, Ph.D.


The inception of artificial intelligence (AI) can be traced back to the mid-20th century, when the concept of developing machines capable of emulating human-like intelligence ignited the imagination of scientists and researchers. The term "artificial intelligence" was formally coined in 1956 at the Dartmouth workshop, a pivotal moment in the history of AI. Visionaries laid the foundational theories: Alan Turing proposed the notion of a "universal machine" capable of undertaking any intellectual task a human can achieve, and John McCarthy organized the Dartmouth workshop itself. Over subsequent decades, AI research traversed diverse phases, encompassing symbolic AI with its focus on rule-based systems; the emergence of neural networks inspired by the human brain; and the advent of machine learning (ML) techniques that have paved the path for contemporary AI applications.


Advancements in computational power and the development of novel algorithms propelled AI's mastery over tasks such as natural language processing, chess playing and image recognition. This progression heralded the present-day AI surge, which now penetrates diverse realms of our lives, including transportation, entertainment and health care. Understanding the current AI landscape in orthopaedics is crucial not only for harnessing its potential, but also for acknowledging its inherent limitations. Such comprehension equips us to refine and optimize AI's role in augmenting patient care, transcending its existing capacities.2


AI can be categorically delineated into subsets, grounded in their distinctive functionalities and capabilities. In the context of orthopaedic surgery, we will focus on the subsets most pertinent to the field. ML is a discipline that harnesses algorithms to enable systems to learn from extensive datasets, progressively enhancing their performance. "Training" the ML program is contingent upon an extensive dataset, achieved through supervised learning involving labeled data or unsupervised learning utilizing unlabeled data. A standard approach involves allocating 70-80% of the dataset for training, while the remaining 20-30% serves as a test set to gauge model performance.1
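The train/test partition described above can be sketched in a few lines of Python. This is a minimal illustration, not a clinical pipeline; the 100 "patient records" and the 80/20 ratio are hypothetical choices.

```python
import random

def train_test_split(records, test_fraction=0.2, seed=42):
    """Randomly partition a dataset into a training set and a held-out test set."""
    shuffled = records[:]                    # copy so the original list is untouched
    random.Random(seed).shuffle(shuffled)    # seeded shuffle for reproducibility
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]   # (train, test)

# 100 hypothetical patient records, split 80% training / 20% test
patients = list(range(100))
train, test = train_test_split(patients, test_fraction=0.2)
```

The seeded random shuffle matters: splitting without shuffling can leak systematic ordering (e.g., by enrollment date) into the partition.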


Supervised learning involves training a model on a labeled dataset, where each data point is associated with a corresponding annotation or set of attributes. The objective is to learn the mapping between inputs and outputs that yields accurate predictions or classifications on new, unseen data.1 For instance, a data point described by features such as "gender" or "sporting activity" might bear the label "successful return to sport." Subsequently, when a new case is presented, the model can predict the likelihood of "successful return to sport."
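As a toy illustration of supervised learning (not a clinical model), a one-nearest-neighbor classifier simply predicts the label of the most similar labeled case. The features (age, weekly training hours) and outcomes below are invented for illustration.

```python
def predict_1nn(labeled_data, new_point):
    """Predict the label of the nearest labeled training example (1-nearest-neighbor)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    nearest_features, nearest_label = min(labeled_data,
                                          key=lambda ex: sq_dist(ex[0], new_point))
    return nearest_label

# Hypothetical labeled dataset: (age, weekly training hours) -> outcome label
labeled = [
    ((19, 10), "successful return to sport"),
    ((22, 8),  "successful return to sport"),
    ((45, 1),  "did not return to sport"),
    ((52, 2),  "did not return to sport"),
]

# A new, unseen case is classified by its closest labeled neighbor
prediction = predict_1nn(labeled, (21, 9))
```

Real models (logistic regression, gradient-boosted trees, neural networks) learn richer mappings, but the supervised principle is the same: labeled examples in, predicted labels out.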


In contrast, unsupervised learning deals with unlabeled datasets that lack informative annotations. The goal of unsupervised learning is to unveil distinct clusters within an extensive database or reduce its dimensionality to reveal intricate patterns. For instance, unsupervised learning can be effectively used to uncover potential unknown subgroups by aggregating data points with similar attributes within the dataset.1
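Clustering is the canonical unsupervised technique. A minimal k-means sketch in Python, run on hypothetical (age, pain score) feature vectors, groups similar patients together without ever seeing a label:

```python
def k_means(points, k=2, iters=10):
    """Minimal k-means: assign each point to its nearest centroid, then update centroids."""
    centroids = [tuple(p) for p in points[:k]]        # naive init: first k points
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # assign the point to the centroid with the smallest squared distance
            idx = min(range(k),
                      key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])))
            clusters[idx].append(p)
        # recompute each centroid as the mean of its assigned points
        centroids = [tuple(sum(col) / len(col) for col in zip(*cl)) if cl else centroids[i]
                     for i, cl in enumerate(clusters)]
    return clusters, centroids

# Hypothetical unlabeled patient vectors (age, pain score): two latent subgroups
data = [(20, 2), (22, 3), (21, 2), (60, 8), (62, 7), (59, 9)]
clusters, centroids = k_means(data, k=2)
```

No outcome labels are supplied; the two recovered subgroups (younger/low-pain vs. older/high-pain) emerge purely from similarity in the data, which is exactly the subgroup-discovery use case described above.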


A fundamental tenet of any ML model's success or failure is the dataset it learns from. Emphasis must be placed not only on the quantity but also on the quality of data. Heterogeneous data that mirror the general population yield superior model performance, while inaccurate or incomplete data detrimentally affect model accuracy. The most time-consuming aspect of creating an ML model is curating the data to eliminate redundancy and fill in missing features.1,3
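Two of the curation steps just mentioned, removing duplicate records and filling in missing features, can be sketched in plain Python. The registry rows are hypothetical, and mode imputation is only one of several possible strategies (one might instead drop incomplete rows or use model-based imputation).

```python
from collections import Counter

# Hypothetical raw registry rows: one duplicate, one missing "graft" field (None)
raw = [
    {"id": 1, "age": 24, "graft": "hamstring"},
    {"id": 1, "age": 24, "graft": "hamstring"},   # exact duplicate entry
    {"id": 2, "age": 31, "graft": None},          # missing feature
    {"id": 3, "age": 19, "graft": "patellar"},
]

# Step 1: remove duplicates, keeping the first record per patient id
seen, deduped = set(), []
for row in raw:
    if row["id"] not in seen:
        seen.add(row["id"])
        deduped.append(row)

# Step 2: fill missing categorical values with the most common observed value
mode_graft = Counter(r["graft"] for r in deduped
                     if r["graft"] is not None).most_common(1)[0][0]
for row in deduped:
    if row["graft"] is None:
        row["graft"] = mode_graft
```

Even this toy example shows why curation dominates project timelines: every field needs its own deduplication key and imputation rule, and poor choices propagate directly into model error.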


Traditional statistical power assessment does not apply to ML; no fixed data quantity guarantees model validity or accuracy. A model's fit lies on a continuum between "underfitting" and "overfitting," a balance described by the bias-variance tradeoff. Underfitting results from insufficient data points, yielding poor model performance. Overfitting occurs when a model with near-perfect accuracy on the training set falters when evaluated on the test data or externally validated. Performance metrics such as the Brier score and F1 score objectively measure ML model performance after testing but are outside the scope of this article.1,3
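Although a full treatment of these metrics is beyond this article's scope, the arithmetic behind them is simple. The sketch below, using made-up predictions, computes the Brier score (mean squared error between predicted probabilities and observed 0/1 outcomes; lower is better) and the F1 score (harmonic mean of precision and recall; higher is better).

```python
def brier_score(probs, outcomes):
    """Mean squared difference between predicted probabilities and 0/1 outcomes."""
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)

def f1_score(predicted, actual):
    """Harmonic mean of precision and recall for binary 0/1 labels."""
    tp = sum(1 for p, a in zip(predicted, actual) if p == 1 and a == 1)
    fp = sum(1 for p, a in zip(predicted, actual) if p == 1 and a == 0)
    fn = sum(1 for p, a in zip(predicted, actual) if p == 0 and a == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

# Hypothetical test-set predictions for four patients
probs = [0.9, 0.2, 0.7, 0.4]        # predicted probability of the outcome
outcomes = [1, 0, 1, 1]             # observed outcomes

bs = brier_score(probs, outcomes)                              # 0.125
f1 = f1_score([1 if p >= 0.5 else 0 for p in probs], outcomes)  # 0.8
```

A perfectly calibrated model would score 0 on the Brier score and 1 on F1; comparing these metrics between the training and test sets is one practical way to detect overfitting.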


Model validation is an essential component of developing ML models. Many published orthopaedic ML models perform poorly during external validation due to variations in data collection, patient characteristics and omitted predictors.1,3 For instance, the Norwegian anterior cruciate ligament (ACL) registry stands as one of the most extensive databases in orthopaedics, hosting a multitude of ML models dedicated to predicting outcomes of ACL reconstructions.5,6 Regrettably, some of these models do not have strong external validation against the Danish ACL registry.4 Consequently, the universal applicability of these tools warrants a cautious reevaluation, considering the intricate interplay of factors that shape their efficacy and generalizability.
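The gap between internal and external performance can be illustrated with a deliberately simple, entirely hypothetical decision rule evaluated on two cohorts (all numbers invented; real validations compare calibrated models across full registries).

```python
def accuracy(model, cohort):
    """Fraction of cases where the model's prediction matches the observed outcome."""
    return sum(model(x) == y for x, y in cohort) / len(cohort)

# Hypothetical rule "learned" in one registry: patients under 25 are predicted high-risk (1)
model = lambda age: 1 if age < 25 else 0

# (age, observed outcome) pairs
internal = [(20, 1), (23, 1), (40, 0), (55, 0)]            # development registry's test set
external = [(20, 1), (23, 0), (24, 0), (40, 0), (30, 1)]   # different registry, different case mix

internal_acc = accuracy(model, internal)   # 1.0 — looks excellent where it was built
external_acc = accuracy(model, external)   # 0.4 — degrades on the external population
```

The rule fits its home cohort perfectly yet fails abroad, mirroring the registry-to-registry degradation described above: patient mix and predictor availability differ, so internal performance alone says little about generalizability.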


The AI analysis of extensive health care datasets holds profound potential for improving patient care. The AI revolution is a permanent fixture, demanding enhanced AI literacy from the orthopaedic community. Furthermore, optimizing ML demands better data management and collaboration, aimed at improving algorithmic performance and facilitating external validation.



  1. Pruneski, J.A., Williams, R.J., Nwachukwu, B.U. et al. “The Development and Deployment of Machine Learning Models.” Knee Surgery, Sports Traumatology, Arthroscopy. 2022;30:3917–3923.
  2. Cote, M.P., Lubowitz, J.H., Brand, J.C., Rossi, M.J. “Artificial Intelligence, Machine Learning and Medicine: A Little Background Goes a Long Way Toward Understanding.” Arthroscopy. 2021;37(6):1699-1702.
  3. Ramkumar, P.M., Pang, M., Polisetty, T., Helm, J.M., Karnuta, J.M. “Meaningless Applications and Misguided Methodologies in Artificial Intelligence-Related Orthopaedic Research Propagates Hype Over Hope.” Arthroscopy. 2022;38(9):2761-2766.
  4. Martin, R.K., Wastvedt, S., Pareek, A., Persson, A., Visnes, H., Fenstad, A.M., Moatshe, G., Wolfson, J., Lind, M., Engebretsen, L. “Machine Learning Algorithm to Predict Anterior Cruciate Ligament Revision Demonstrates External Validity.” [Published online January 1, 2022]. Knee Surgery, Sports Traumatology, Arthroscopy. 2022;30(2):368-375.
  5. Martin, R.K., Wastvedt, S., Pareek, A., Persson, A., Visnes, H., Fenstad, A.M., Moatshe, G., Wolfson, J., Engebretsen, L. “Predicting Subjective Failure of ACL Reconstruction: A Machine Learning Analysis of the Norwegian Knee Ligament Register and Patient-Reported Outcomes.” [Published online January 11, 2022]. The Journal of ISAKOS. 2022;7(3):1-9.
  6. Martin, R.K., Wastvedt, S., Pareek, A., Persson, A., Visnes, H., Fenstad, A.M., Moatshe, G., Wolfson, J., Engebretsen, L. “Predicting Anterior Cruciate Ligament Reconstruction Revision: A Machine Learning Analysis Utilizing the Norwegian Knee Ligament Register.” The Journal of Bone and Joint Surgery – American Volume. 2022;104(2):145-153.