Exercise 4
Looking Critically at Your Dataset
Every dataset has limits, from who is included (or excluded) to how groups are defined and labeled. These choices are never neutral: they shape how research can be interpreted and who may be made visible or invisible.


Step 1: Read
Read one or two light readings
Step 2: Discuss
Apply discussion questions to a real or hypothetical research project
Step 3: Reflect
Use a worksheet to reflect on how this exercise applies to your work, and note key takeaways
Step 1.
Read
Read as a group.
Note: web versions include links to additional content.
Reading
Read the following reading as a group. You might take turns reading aloud or spend a few minutes reading quietly.
Step 2.
Discuss
Discuss a real or imaginary project
Discussion Prompts
Use the prompts below to guide your group's conversation.
You can focus on a real research project or make one up for this exercise.
Getting Oriented
- Let's begin by getting clear on some high-level details to make the next questions easier.
- Dataset name
- Number of people included
- Types of data collected
- Any obvious limitations or biases
- For virtual discussions: link to the dataset descriptor (if available)
Consider arranging this information in a table, like the example below. This format can also be useful for sharing in a public forum:
| Dataset Name | Number of people included at the time of your research | Types of data | Key limitations or biases | Link to dataset descriptor |
|---|---|---|---|---|
| All of Us | 800,000 | Medical records, biosamples, genetic, wearable, omic | People living in US only | researchallofus.org/data-tools/data-snapshots/ |
Limitations of the Dataset
The data that are available strongly shape what research is undertaken.
- How does the structure of the dataset(s) you will use in your research impact the focus of your research and/or the types of analysis you are able to do?
- Have others raised concerns about this dataset? Do you have concerns about its integrity (for example, accuracy, completeness, or reliability)?
- As you review the dataset's community labels or descriptors, can you imagine if and how any of them might be offensive to the communities they attempt to describe?
- Are there known biases embedded in its collection (e.g., "race correction" algorithms, IQ measurements, or clinical tools that misperform across populations)?
- How will you account for them?
- Describe any ongoing discourse or advocacy related to the categories used in this dataset (e.g., recent NASEM recommendations, OMB census revisions).
Consent and Participation
- What metadata is available about the dataset(s) (e.g. data nutrition labels, datasheets)?
- Can you access the dataset's informed consent document or a summary (e.g., in the documentation, metadata, or descriptors)?
- Did participants consent on their own behalf, or was surrogate consent used (for example, for cognitively impaired participants or minors)?
- Or is there a mix of self-consented and surrogate-consented participants?
- Were participants compensated for providing their data? If so, how?
- Does the dataset include requirements for monetary or non-monetary benefit-sharing back to participants or communities?
- Was any data collected during patient care (e.g., EHR)?
- If yes, how might social dynamics have influenced what was recorded?
For example, a trans patient may be seeking access to gender-affirming care; a patient who uses drugs or alcohol may be embarrassed to report this to their provider; a minor patient may not report certain behaviors in the presence of their parents.
Transparency
- Transparency is a core value of this work. We strongly encourage you to share your reflections and dataset descriptions in places where communities and other stakeholders can see them.
This might include:
- A project website or institutional transparency page
- Community newsletters or plain-language project updates
- Social media
Further Reading (Optional)
If you'd like to explore further, here are some external resources.
Many of these are linked directly in the relevant questions above.
Dataset Biases & Categorization
Referenced in Limitations of the Dataset
- Hidden in Plain Sight – Reconsidering the Use of Race Correction in Clinical Algorithms (on "race correction" algorithms)
- The Eugenic Origins of IQ Testing: Implications for Post-Atkins Litigation (on IQ measurements)
- Racial Disparity in Oxygen Saturation Measurements by Pulse Oximetry: Evidence and Implications (on clinical tools that misperform across populations)
Population Descriptors & Standards
Referenced in Limitations of the Dataset
- Using Population Descriptors in Genetics and Genomics Research (on recent NASEM recommendations)
- Initial Proposals For Updating OMB’s Race and Ethnicity Statistical Standards (on OMB census revisions)
Dataset Documentation & Transparency
Referenced in Consent and Participation
- The Data Nutrition Project (on data nutrition labels)
- Datasheets for Datasets (on datasheets)
Benefit-Sharing
More information on benefit-sharing
- Benefit-sharing – Nagoya Protocol HuB (on monetary or non-monetary benefit-sharing – Referenced in Consent and Participation)
- Benefit-Sharing by Design: A Call to Action for Human Genomics Research
Data Management Plans
Practical guides for developing data management plans
Step 3.
Reflect
Document your takeaways
Note on versions:
Reflection Worksheets
Take a few minutes to reflect on this exercise using the worksheet below. Choose the version that best matches your role, or share one worksheet as a group. Jot down any insights, questions, or takeaways.
Next Steps
You've completed this exercise. Great work!



