
AI Education Must Address Dataset Bias: MIT Expert Urges Curriculum Reform
In an era where artificial intelligence (AI) is increasingly integrated into critical sectors like healthcare, a significant gap persists in AI education. Many courses, while adept at teaching the deployment of AI models for diagnosing diseases and determining treatments, often neglect a crucial element: training students to identify biases in the data used to develop these models.
Leo Anthony Celi, a senior research scientist at MIT’s Institute for Medical Engineering and Science, a physician at Beth Israel Deaconess Medical Center, and an associate professor at Harvard Medical School, highlights these shortcomings in a new paper. Celi advocates for curriculum reform to equip students with the skills to critically evaluate data before incorporating it into AI models. The urgency stems from the well-documented issue of models trained primarily on data from specific demographics, such as white males, performing poorly when applied to broader populations.
In a recent interview, Celi addressed key questions about the sources of bias in datasets and strategies for educators to mitigate these issues.
Q: How does bias get into these datasets, and how can these shortcomings be addressed?
A: Celi explains that biases in data are inevitably reflected in the models built from that data. He points to historical examples, such as pulse oximeters that overestimate oxygen levels in people of color due to inadequate representation in clinical trials. He emphasizes that medical devices are often optimized for healthy young males, neglecting the diverse population that ultimately uses them. He is also critical of using electronic health record systems as the primary building blocks for AI, arguing they were not designed for machine learning and can perpetuate existing biases.
Celi suggests exploring transformer models of numeric electronic health record data to model the underlying relationships between laboratory tests, vital signs, and treatments, which can mitigate the effect of missing data and provider implicit biases.
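To make this concrete, here is a minimal sketch, in PyTorch, of what a transformer over numeric EHR sequences might look like. The feature set, dimensions, and missingness handling are illustrative assumptions for this article, not a description of any specific model Celi references.

```python
# Minimal sketch (the feature count, dimensions, and masking strategy are
# illustrative assumptions, not from any model Celi describes). A transformer
# encoder over sequences of numeric EHR measurements (labs, vital signs),
# with missingness represented explicitly rather than silently imputed.
import torch
import torch.nn as nn

class EHRTransformer(nn.Module):
    def __init__(self, n_features: int = 16, d_model: int = 64,
                 n_heads: int = 4, n_layers: int = 2, max_len: int = 128):
        super().__init__()
        # Project each timestep's numeric measurements into the model dimension.
        self.input_proj = nn.Linear(n_features, d_model)
        # A learned embedding of the per-feature missingness flags, so the
        # model can treat "never measured" as information in its own right.
        self.missing_embed = nn.Linear(n_features, d_model, bias=False)
        self.pos_embed = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                           dim_feedforward=4 * d_model,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, 1)  # e.g., one risk score per stay

    def forward(self, values: torch.Tensor, missing_mask: torch.Tensor):
        # values:       (batch, time, n_features), missing entries zero-filled
        # missing_mask: (batch, time, n_features), 1.0 where a value is missing
        _, t, _ = values.shape
        pos = torch.arange(t, device=values.device)
        x = (self.input_proj(values)
             + self.missing_embed(missing_mask)
             + self.pos_embed(pos))
        h = self.encoder(x)                # attention relates labs, vitals, time
        return self.head(h.mean(dim=1))    # pool over time, predict one score

# Toy usage: 8 ICU stays, 24 hourly timesteps, 16 lab/vital channels.
vals = torch.randn(8, 24, 16)
mask = (torch.rand(8, 24, 16) < 0.3).float()  # ~30% of values missing
vals = vals * (1 - mask)                       # zero-fill the missing entries
model = EHRTransformer()
print(model(vals, mask).shape)  # torch.Size([8, 1])
```

The design choice worth noting is that missingness is encoded as an explicit input rather than quietly imputed, so the model can learn when the absence of a measurement is itself informative, which is often the case in clinical data.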
Q: Why is it important for courses in AI to cover the sources of potential bias? What did you find when you analyzed such courses’ content?
A: Reflecting on MIT’s AI course, which began in 2016, Celi notes the realization that students were being incentivized to build models that performed well on statistical measures while overlooking flaws in the underlying data. An analysis of online AI courses bore this out: of 11 courses reviewed, only five included sections on bias in datasets, and only two offered any significant discussion of the topic.
While acknowledging the value of these courses for self-study, Celi stresses the importance of equipping learners with the agency to work critically with AI. He hopes his paper will draw attention to this critical gap in current AI education.
Q: What kind of content should course developers be incorporating?
A: Celi advocates for a comprehensive checklist of questions to guide students in understanding the origins of their data. This includes identifying the observers and data collectors and understanding the landscape of the institutions involved. For example, when using ICU data, students should consider who gets admitted to the ICU and who doesn’t, as this introduces sampling selection bias (illustrated in the sketch below). Celi suggests that at least 50% of course content should focus on understanding the data, arguing that modeling becomes straightforward once the data is thoroughly understood.
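The ICU admission example lends itself to a short simulation. In the sketch below, all variables and thresholds are invented for illustration; it shows how measuring an association only among admitted patients, where admission depends on both illness severity and access to care, distorts what would be seen in the full population.

```python
# Minimal sketch of sampling selection bias (all numbers invented for
# illustration). In the full population, a biomarker tracks illness severity;
# measuring that link only among ICU-admitted patients, where admission
# depends on both severity and access to care, gives a different answer.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

severity = rng.normal(size=n)                     # true illness severity
access = rng.normal(size=n)                       # access to care (insurance, distance)
biomarker = severity + rng.normal(scale=0.5, size=n)

# Admission depends on severity AND access: sicker patients and
# better-connected patients are both more likely to reach the ICU.
admitted = (severity + access + rng.normal(size=n)) > 1.5

def corr(x, y):
    return np.corrcoef(x, y)[0, 1]

print(f"population corr(biomarker, severity): {corr(biomarker, severity):.2f}")
print(f"ICU-only corr(biomarker, severity):   "
      f"{corr(biomarker[admitted], severity[admitted]):.2f}")
# The ICU-only estimate is attenuated: selecting on admission restricts the
# range of severity seen in the data, so a model trained only on ICU records
# learns a weaker biomarker-severity relationship than actually exists.
```

Because admission is driven by severity and access together, the admitted sample covers a restricted range of severity, and the biomarker-severity correlation measured there is weaker than the population value: exactly the kind of sampling artifact Celi wants students to look for before they start modeling.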
He also highlights the MIT Critical Data consortium’s datathons, which bring together diverse groups of healthcare workers and data scientists to analyze health and disease in local contexts. These events foster critical thinking by combining different backgrounds and generations, enabling participants to understand the data they are working with.
Celi urges students not to build models without understanding the data’s origins, the patients included, and the accuracy of measurement devices across different individuals. He encourages the use of local datasets to ensure relevance, even if they reveal data quality issues. Acknowledging and addressing these issues is crucial for improving data collection practices and building reliable AI models.
Ultimately, Celi hopes to inspire a shift in perspective among AI practitioners, emphasizing both AI’s immense potential and the significant risk of harm if its development is not approached with critical awareness.