Healthcare AI faces an existential crisis. With no accountability mechanism for population-scale clinical data, models trained at one location will not generalize to another. This means lower efficacy (value capture), more reinventing the wheel (increased costs), and worse patient care (the whole point).

Why is data quality so hard in healthcare? It is always difficult to influence upstream data producers and suppliers, but now imagine that those producers can dodge the data quality issue by appealing to the private, proprietary, and complex nature of the data (that "you'd probably need an MD/PhD to understand").

Thank you to AE Lewis, Nicole Weiskopf, Zachary B Abrams, Randi Foraker, Albert Lai, Philip Payne, and Aditi Gupta for your thorough review highlighting this problem space.

In their words: "Conclusion: Guidelines are needed for EHR data quality assessment to improve the efficiency, transparency, comparability, and interoperability of data quality assessment. These guidelines must be both scalable and flexible. Automation could be helpful in generalizing this process." Electronic health record data quality assessment and tools: a systematic review: https://lnkd.in/gh9iyFh9

In my words: there are many tailwinds at our back, from HL7 FHIR to the 21st Century Cures Act, but we need to plug this data quality gap. The time to act is now. Who's with me?
Challenges in Real-World Data Collection
Explore top LinkedIn content from expert professionals.
Summary
Collecting real-world data is a critical step in developing AI and data-driven models, but it comes with significant challenges. These challenges often stem from issues like data quality, bias, and lack of standardization, all of which directly impact model performance and reliability.
- Focus on data quality: Ensure data is clean, accurate, and representative by performing extensive quality checks, addressing missing values, and eliminating outliers or biases before using it for analysis or training (see the sketch after this list).
- Understand your data sources: Clearly document the origins, permissions, and characteristics of your data to avoid legal or ethical challenges. This transparency also helps with troubleshooting and improving processes.
- Balance data diversity: Pay attention to data representation across different subpopulations or scenarios to avoid biased models and ensure fair and accurate outcomes.
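As a starting point for the quality-check bullet above, here is a minimal sketch of an automated first pass over a tabular dataset: it reports per-column missingness, duplicate rows, and simple IQR-based outlier counts. It assumes pandas and hypothetical column names; real projects will need domain-specific rules on top.

```python
import pandas as pd

def basic_quality_report(df: pd.DataFrame, numeric_cols: list) -> pd.DataFrame:
    """Summarize missingness, duplicate rows, and simple outlier counts."""
    n = len(df)
    print(f"Duplicate rows: {df.duplicated().sum()} of {n}")
    rows = []
    for col in df.columns:
        outliers = 0
        if col in numeric_cols:
            q1, q3 = df[col].quantile([0.25, 0.75])
            iqr = q3 - q1
            mask = (df[col] < q1 - 1.5 * iqr) | (df[col] > q3 + 1.5 * iqr)
            outliers = int(mask.sum())
        rows.append({
            "column": col,
            "pct_missing": df[col].isna().sum() / n,
            "iqr_outliers": outliers,
        })
    return pd.DataFrame(rows)

# Hypothetical usage:
# df = pd.read_csv("patients.csv")
# print(basic_quality_report(df, numeric_cols=["age", "glucose"]))
```

A report like this is only a screening step; deciding which flags are real problems still requires domain knowledge.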
-
10 of the most-cited datasets contain a substantial number of errors. And yes, that includes datasets like ImageNet, MNIST, CIFAR-10, and QuickDraw, which have become the definitive test sets for computer vision models.

Some context: a few years ago, three MIT graduate students published a study that found ImageNet had a 5.8% error rate in its labels. QuickDraw had an even higher error rate: 10.1%.

Why should we care?

1. We have an inflated sense of the performance of AI models that are tested against these datasets. Even if models achieve high performance on those test sets, there's a limit to how much those test sets reflect what really matters: performance in real-world situations.
2. AI models trained using these datasets are starting off on the wrong foot. Models are only as good as the data they learn from, and if they're consistently trained on incorrectly labeled information, systematic errors can be introduced.
3. Through a combination of 1 and 2, trust in these AI models is vulnerable to being eroded. Stakeholders expect AI systems to perform accurately and dependably. When the underlying data is flawed and those expectations aren't met, we start to see growing mistrust in AI.

So, what can we learn from this? If 10 of the most-cited datasets contain so many errors, we should assume the same of our own data unless proven otherwise. We need to get serious about fixing — and building trust in — our data, starting with improving our data hygiene. That might mean implementing rigorous validation protocols, standardizing data collection procedures, continuously monitoring for data integrity, or a combination of tactics (depending on your organization's needs). But if we get it right, we're not just improving our data; we're setting up our future AI models to be dependable and accurate.

#dataengineering #dataquality #datahygiene #generativeai #ai
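In the same spirit as that study (which used confident-learning techniques), here is a minimal sketch of one way to surface suspect labels in your own data: flag examples whose out-of-fold predicted probability for their assigned label is low, then send them to human review. The classifier, threshold, and variable names are illustrative, not the study's actual method.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict

def flag_suspect_labels(X, y, threshold=0.2):
    """Return indices of examples whose assigned label looks unlikely.

    Assumes y is integer-encoded (0..K-1). Out-of-fold probabilities are
    used so the model cannot simply memorize its own (possibly wrong) labels.
    """
    clf = RandomForestClassifier(n_estimators=200, random_state=0)
    probs = cross_val_predict(clf, X, y, cv=5, method="predict_proba")
    prob_of_given_label = probs[np.arange(len(y)), y]
    return np.where(prob_of_given_label < threshold)[0]

# Hypothetical usage: route these indices to a relabeling queue,
# rather than correcting them automatically.
# suspects = flag_suspect_labels(X_train, y_train)
```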
-
For companies interested in adopting AI technologies, the number one thing standing in their way is quality data. Having a clear understanding of what types of data are flowing, what types of data have been categorized, and what data is updated and accurate is essential.

Evaluate your data supply chain:
- Is your data trustworthy?
- Where is it coming from? Did you acquire it from third parties or is it in-house?
- If you did acquire it from third parties, are you allowed to use it? How much of it are you allowed to use? Does the third party even have the right to use it?

It's trickier than it sounds, and the quality control element can be daunting to many organizations. Some companies aren't even clear about where their data lives, let alone how to turn those insights to their advantage. The best advice I can give is to not launch a huge undertaking right away: get proof of concept first. Start with one or two smaller applications, make progress with those, and see what value you gain. Then move on to something larger.

Sometimes you just don't know how much sensitive data is leaving the organization, especially if you're invoking external APIs. If one algorithm is off, biased, or pulling the wrong data, whatever prompts you're inputting won't be accurately executed. A single kink in the system could ignite a massive fire. Remember what happened with one company's famous credit card fail? Women were approved for smaller lines of credit due to a gender bias that engineers couldn't identify. No one could prove what attributes were coming from where, and which were being used to train models. The fact that they hadn't accounted for gender bias at all made it even worse. An algorithm can end up biased even if it never sees such attributes directly, because it may be using inputs that happen to correlate with gender, race, or other identifiers, and there are many out there.

Using your own insights can help avert some of these challenges, particularly in fields like medicine where patients consent to offering their data for specific use cases. When you're dealing with third-party data, you have to be a lot more careful. You generally have to pay for it, and sensitivity controls are less specified. We haven't yet reached a point where we 100% know how these models are actually behaving, and it's an emerging problem. But it begins and ends with quality control over your data.
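One way to catch the proxy-bias problem described above is to measure how strongly each candidate input tracks a sensitive attribute before training. A minimal sketch follows; the column names ("gender", "income", etc.) are hypothetical, and simple correlation is only a first screen, not a full fairness audit.

```python
import pandas as pd

def proxy_correlation_report(df: pd.DataFrame, sensitive_col: str, feature_cols: list) -> pd.DataFrame:
    """Rank features by how strongly they track a sensitive attribute.

    Even when the attribute itself is excluded from training, strongly
    correlated features can act as proxies and reintroduce bias.
    """
    # Encode the (assumed binary) sensitive attribute as 0/1.
    sensitive = pd.get_dummies(df[sensitive_col], drop_first=True).iloc[:, 0].astype(float)
    rows = []
    for col in feature_cols:
        corr = pd.to_numeric(df[col], errors="coerce").corr(sensitive)
        rows.append({"feature": col, "corr_with_sensitive": corr})
    return pd.DataFrame(rows).sort_values("corr_with_sensitive", key=abs, ascending=False)

# Hypothetical usage on a credit dataset:
# report = proxy_correlation_report(df, "gender", ["income", "spend_category", "credit_limit"])
```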
-
Opening the floodgates to more data isn't a surefire recipe for success in AI projects. In some cases, access to data and models at scale only makes it easier to amplify harmful biases. This recent MIT study examines the consequences of not having the right data. Here's why it matters:

🚨 Quantity doesn't equal quality. Imagine having data from 1,000 patients, but only 10 are women over 70. This imbalance skews a model's reliability across different demographics.

🔍 The study highlights 'subpopulation shifts,' where machine learning models perform inconsistently for different demographic groups. In simpler terms, the same model could be accurate for one group but faulty for another.

⚖️ It's not only about how accurate a model is overall, but also how it performs within these subpopulations. The disparity can be life-altering, particularly in sectors like healthcare where the stakes are high.

💡 The illusion of data availability can be misleading. The focus should be on accurate, verifiable, and representative samples, especially when lives are on the line.

#AI #Healthcare #DataQuality #Equity 📊🌐
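A minimal sketch of how to make subpopulation shift visible in practice: report accuracy and sample size per demographic slice instead of a single overall number. It assumes a pandas DataFrame holding true labels, predictions, and a group column; the column names are hypothetical.

```python
import pandas as pd
from sklearn.metrics import accuracy_score

def subgroup_accuracy(df: pd.DataFrame, y_true: str, y_pred: str, group_col: str) -> pd.DataFrame:
    """Accuracy and support per subgroup, so an overall metric cannot
    hide a failing slice (e.g., only 10 women over 70 in 1,000 patients)."""
    rows = []
    for group, chunk in df.groupby(group_col):
        rows.append({
            "group": group,
            "n": len(chunk),
            "accuracy": accuracy_score(chunk[y_true], chunk[y_pred]),
        })
    return pd.DataFrame(rows).sort_values("accuracy")

# Hypothetical usage:
# print(subgroup_accuracy(results, "label", "prediction", "demographic_group"))
```

Small "n" for a subgroup is itself a warning: the accuracy estimate for that slice may be too noisy to trust.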
-
The biggest nightmare for a data team working on AI models isn't the lack of data—it's bad data.

Take Zendesk data as an example. Many data teams connect Zendesk data to their data warehouse without considering the quality of that data. Here's what we've noticed:
- 20 to 40% of Zendesk data is not useful for analytics. It often includes automated notifications that serve no analytical purpose.
- In most cases, agents manually tag this data. This means you're relying on someone accurately selecting one category out of 200 from a list that hasn't been updated in two years.
- When a new product issue needs to be tracked and you need historical analysis, it's impossible. You're stuck with a snapshot in time, without the capability to update past data based on new insights.

All this is because Zendesk is designed with agents in mind, not data analysis. Bad Zendesk data can quickly turn the potential of leveraging AI from a dream into a nightmare.
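A minimal sketch of the kind of filtering this implies: drop obviously non-analytical tickets before they land in the warehouse. The field names, sender list, and rules below are hypothetical placeholders and would need to match your actual Zendesk export schema.

```python
# Hypothetical automated senders whose tickets carry no analytical signal.
AUTOMATED_SENDERS = {"noreply@company.example", "alerts@company.example"}

def is_analytically_useful(ticket: dict) -> bool:
    """Return False for tickets that would only add noise to analytics."""
    if ticket.get("requester_email") in AUTOMATED_SENDERS:
        return False  # automated notification
    if not ticket.get("description", "").strip():
        return False  # machine-generated or empty body
    if not ticket.get("tags"):
        return False  # untagged tickets cannot be categorized reliably
    return True

def filter_tickets(tickets: list) -> list:
    return [t for t in tickets if is_analytically_useful(t)]
```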
-
A Single AI Model = 389 QA Checks + 5 Team Members

The real data challenges of building Health AI 👇

👀 Ever seen one glucose lab logged 38 ways? That's one of the many data challenges Mark Sendak and his Duke team of five (clinicians, data scientists, project lead) had to tackle in building a pediatric sepsis model. Their paper lays out the challenges: collapsing 108 lab analytes, running 389 quality checks, and building 181 clean features. This hasn't changed much since 2022. (Link in comments.) arXiv:2208.02670

💥 Imagine scaling that across 1,000+ data points in your health system. Hours of rule-based transforms, clinician adjudication, and domain expertise power every AI model. Algorithms fail without it.

🎯 The data struggle in AI is real.
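To make the "one glucose logged 38 ways" problem concrete, here is a minimal sketch of rule-based analyte harmonization: mapping many local lab names to a single canonical concept before feature engineering. The patterns below are illustrative, not the Duke team's actual rules.

```python
import re
from typing import Optional

# Illustrative regex rules per canonical analyte; real mappings are built
# and adjudicated with clinicians.
CANONICAL_ANALYTES = {
    "glucose": [r"^glucose", r"glu.*(serum|plasma|poc)", r"whole blood glucose"],
    "lactate": [r"^lactate", r"lactic acid"],
}

def canonical_analyte(raw_name: str) -> Optional[str]:
    """Map a raw lab name to a canonical analyte, or None for clinician review."""
    name = raw_name.strip().lower()
    for canonical, patterns in CANONICAL_ANALYTES.items():
        if any(re.search(p, name) for p in patterns):
            return canonical
    return None

# Hypothetical usage:
# canonical_analyte("Glucose POC")     -> "glucose"
# canonical_analyte("GLUCOSE, SERUM")  -> "glucose"
```

Unmapped names going to human review, rather than being silently dropped, is a large part of where those hundreds of QA checks come from.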
-
You will be using vector databases more in the future as data scientists. As use cases evolve in the era of LLMs and GenAI, it's likely. Checking data quality is important now; it will be even more important with vector databases.

𝐂𝐡𝐚𝐥𝐥𝐞𝐧𝐠𝐞𝐬:
🔺 Error Tracing: Vector embeddings can hide data quality issues, slowing root cause analysis. Quality problems can lead to flawed vectors - you may need to trace back to the non-vector source.
🔺 Feature Quality: Compromised quality in vector embeddings may affect model performance, especially if we assume the data in the vector db is correct.
🔺 Data Lineage: Correcting data and model issues becomes complex without clear traceability or documentation. The more features and the more complex the model, the more agonizing the struggle.

𝐒𝐭𝐫𝐚𝐭𝐞𝐠𝐢𝐞𝐬 𝐟𝐨𝐫 𝐃𝐚𝐭𝐚 𝐒𝐜𝐢𝐞𝐧𝐭𝐢𝐬𝐭𝐬:
💠 Quality Checks Before Insertion: Rigorously check data quality before inserting into vector databases.
💠 Maintain Good Documentation: Keep clear documentation that notes the data sources, data users, and downstream dependencies. I learned the hard way when a missing data source wreaked havoc on an important project.
💠 Be Aware of Bias: Understand and mitigate it. Poor data quality may complicate features in vector DBs or inject unexpected bias into models.
💠 Plan for Iterative Development: Prepare for iterative development to address challenges or inconsistencies. Log issues and plan for data downtime, especially if you're not used to vector dbs. Consider regular audits and collaboration between teams to improve quality.

The unique challenges of vector databases require extra attention to quality and ingestion. While vector databases hold both promise and challenges, try to resolve data quality issues as close to the source as possible, or at least before loading data into a vector database.

#datalife360 #datastrategy #ai #datascience
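A minimal sketch of the "quality checks before insertion" strategy above: validate and deduplicate records before any vectors are written, since problems are much harder to trace once they are embedded. The schema, expected dimension, and checks are hypothetical placeholders, independent of any particular vector database client.

```python
import hashlib
import math

EXPECTED_DIM = 768          # hypothetical embedding dimension
_seen_hashes = set()        # simple in-process dedup; use persistent storage in practice

def validate_record(text: str, embedding: list, source: str) -> list:
    """Return a list of problems; an empty list means the record is safe to insert."""
    problems = []
    if not text or not text.strip():
        problems.append("empty text")
    if not source:
        problems.append("missing source (breaks lineage and error tracing)")
    if len(embedding) != EXPECTED_DIM:
        problems.append(f"embedding dim {len(embedding)} != {EXPECTED_DIM}")
    if any(math.isnan(x) for x in embedding):
        problems.append("NaN in embedding")
    digest = hashlib.sha256(text.strip().lower().encode()).hexdigest()
    if digest in _seen_hashes:
        problems.append("duplicate text")
    _seen_hashes.add(digest)
    return problems
```

Logging which source produced each rejected record is what later makes root cause analysis possible.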
-
LLMs are ushering in a new era of AI, no doubt about it. And while the volume and velocity of innovation are astounding, I feel that we are forgetting the importance of the quality of the data that powers it. There is definitely a lot of talk about what data is used to train the massive LLMs such as OpenAI's, and there is a lot of talk about leveraging your own data through fine-tuning and RAG. I also see increased attention on ops, whether it is LLMOps, MLOps, or DataOps, all of which is great for keeping your system and data running. What I see getting far less attention is managing your data: ensuring it is of high quality and that it is available when and where you need it. We all know about garbage in, garbage out -- if you do not give your system good data, you will not get good results.

I believe that this new era of AI means that data engineering and data infrastructure will become key. There are numerous challenges to getting your system into production from a data perspective. Here are some key areas that I have seen causing challenges:

1. Data: The data used in development is often not representative of what is seen in production. This means the data cleaning and transforms may miss important aspects of production data. This in turn degrades model performance, because the models were not trained and tested appropriately. Often new data sources are introduced in development that may not be available in production, and they need to be identified early.
2. Pipelines: Moving data/ETL pipelines from development to staging to production environments. Either the environments (libraries, versions, tools) have incompatibilities or the functions written in development were not tested in the other environments. This means broken pipelines or functions that need rewriting.
3. Scaling: Although your pipelines and systems worked fine in development, even with some stress testing, once you get to the production environment and do integration testing, you realize that the system is not scaling the way you expected and is not meeting the SLAs. This is true even for offline pipelines.

Having the right infrastructure, platforms, and teams in place to facilitate rapid innovation with seamless lifting to production is key to staying competitive. This is the one thing I see again and again being a large risk factor for many companies. What do you all think? Are there other key areas you believe are crucial to pay attention to in order to achieve efficient ways to get LLM and ML innovations into production?
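A minimal sketch of one way to act on point 1 above: compare each feature's distribution in the development sample against a recent production sample before trusting dev-time cleaning and transforms. It uses a two-sample Kolmogorov-Smirnov test from SciPy; the significance threshold and column names are illustrative.

```python
from scipy.stats import ks_2samp

def drifted_features(dev_df, prod_df, numeric_cols, alpha=0.05):
    """Return (column, KS statistic, p-value) for features whose dev and
    production distributions differ significantly."""
    drifted = []
    for col in numeric_cols:
        dev_vals = dev_df[col].dropna().to_numpy()
        prod_vals = prod_df[col].dropna().to_numpy()
        stat, p_value = ks_2samp(dev_vals, prod_vals)
        if p_value < alpha:
            drifted.append((col, stat, p_value))
    return sorted(drifted, key=lambda t: t[1], reverse=True)

# Hypothetical usage:
# for col, stat, p in drifted_features(dev, prod, ["prompt_tokens", "latency_ms"]):
#     print(f"{col}: KS={stat:.3f}, p={p:.4f} -- revisit cleaning and transforms")
```

Run on a schedule, the same check doubles as a monitor for new or silently changed data sources in production.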
-
When you start a new machine learning project, do you immediately dive into coding up a new model? I hope not. Understanding the data is a crucial step in starting any ML project, because the quality and relevance of the data significantly impact the performance of the ML model.

For some projects, data may already be collected. For others, the data collection process must first be defined and executed. Your literature review may help guide what type of data you should collect and how much data you might need for your project. Once data is collected, it will likely need to be annotated – also a task that can be informed by your literature review.
- What type of annotations are needed? Pixel-, patch-, and image-level are the most common.
- What tools have been used to assist with annotation? Can annotations come from some other modality? Perhaps from molecular analysis of a biological sample or an existing set of annotations like Open Street Map for satellite imagery.
- How subjective are your annotations? Researching or running your own experiment to assess interobserver agreement can reveal the extent of this challenge.

You also need to understand the quality of your data. This includes checking for missing values, outliers, and inconsistencies in the data. These could include tissue preparation artifacts, imaging defects like noise or blurriness, or other out-of-domain scenarios. By identifying data quality issues, you can preprocess and clean the data appropriately and plan for any challenges that you cannot eliminate upfront. Data preprocessing may include normalization, scaling, or other transformations. For large images, it typically includes tiling into small patches. The data and annotations must be stored in a format that is efficient for model training.

Understanding the data also helps you identify any biases that can affect the model's performance and reliability. Biases may be due to a lack of training data for a particular subgroup, a spurious correlation, batch effects from technical variations like processing differences at different labs or geographic variations, or even samples labeled by different annotators.

For most applications, domain experts should be consulted in learning about the data:
- How was the data collected?
- What does it represent?
- What features do experts look at in studying the data?
- What variations are present or might be expected in real-world use?
- What artifacts or quality issues might be present that could confuse a model?

Some of these aspects can be quite nuanced and not obvious to someone untrained in a particular field. This critical step of understanding the data helps to assess quality and relevance, identify and address data bias, and determine the appropriate preprocessing techniques.

#MachineLearning #DeepLearning #ComputerVision
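For the interobserver-agreement question raised above, a common way to quantify label subjectivity is Cohen's kappa on samples annotated by two people. A minimal sketch follows; the labels are illustrative placeholders.

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical annotations from two pathologists on the same six patches.
annotator_a = ["tumor", "benign", "tumor", "tumor", "benign", "tumor"]
annotator_b = ["tumor", "tumor", "tumor", "benign", "benign", "tumor"]

kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Cohen's kappa: {kappa:.2f}")
# Rough reading: values above ~0.8 indicate strong agreement, 0.6-0.8 substantial;
# lower values suggest label subjectivity is itself a modeling challenge to plan for.
```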
-
Most manufacturers exploring AI hit the same wall: data. You can estimate the cost of GPUs. You can scope how long it'll take to write and debug your models. But the data? That's the unpredictable part... and often the real bottleneck.

I recently joined Manufacturing Tomorrow, a podcast from The Ohio State University, to discuss what it really takes to operationalize computer vision in industrial environments. From defect detection to safety monitoring, the challenges are almost always about data. We dig into topics like:
🔍 How rare events (e.g. hairline cracks or safety violations) distort your training distribution
📸 The data challenge in defect detection, where failures might occur once every 10,000 items, making real-world examples hard to come by
⚙️ Why plug-and-play AI remains elusive without robust scenario and failure mode analysis
🛠️ How teams are using tools like Voxel51 to iteratively curate, annotate, and evaluate their visual data

If you're building vision systems on the factory floor, definitely give this podcast a listen: https://lnkd.in/efq75RnP

Interested in learning more about how Voxel51 can help you scale? https://lnkd.in/ewszrbcp
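As a concrete illustration of the rare-event point above (not taken from the podcast itself), here is a minimal sketch of what a 1-in-10,000 defect rate does to class weighting when training a classifier with scikit-learn's balanced weights. The counts are hypothetical.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical inspection data: 10,000 items, exactly 1 with a hairline crack.
labels = np.array([0] * 9999 + [1] * 1)   # 0 = good part, 1 = defect

weights = compute_class_weight(class_weight="balanced", classes=np.array([0, 1]), y=labels)
print(dict(zip([0, 1], weights)))
# -> roughly {0: 0.5, 1: 5000.0}: the lone defect has to count thousands of
# times more than a good part, which is why teams also rely on oversampling,
# synthetic defects, and targeted data collection rather than weighting alone.
```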