Population Descriptors in Big Data Research
The importance of curation and choice
By the CHIRON Project Team | Published December 4, 2025
Curating and analyzing data are not neutral tasks. They involve making choices about how groups are lumped together, split apart, or excluded entirely. These choices affect how people like researchers, clinicians, policy makers, and patients make sense of and use the research.
Standardization Efforts
Because data curation and analysis choices can have such large effects, many scholars have pushed for more standardization for these tasks. Examples of resources to help with this include:
- The 2023 report from the National Academies of Sciences, Engineering, and Medicine (NASEM), “Using Population Descriptors in Genetics and Genomics Research,” offers useful advice even for researchers outside of genetics and genomics. It includes a short set of recommendations. Examples include:
  - Researchers should avoid using the term “Caucasian” because it is rooted in white supremacy.
  - Researchers should document and share how they chose the population descriptors they use. They should also reflect on, document, and share their reasons for lumping (combining) population descriptors together.
- The PhenX Toolkit is a collection of approved methods for measuring people’s traits (phenotypes) and the things they are exposed to (exposures). For example, it includes a protocol for measuring social vulnerability using variables in existing data. This helps keep methods consistent across studies, making it easier to compare and replicate results. It may be useful as you explore and use population descriptors in your work. It was created by scholars from many fields, including CHIRON academic workgroup member Maile Tauali’i.
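For researchers who work with tabular data, one way to act on the NASEM recommendation to document lumping decisions is to keep each decision and its reason together in the analysis code. The following is a minimal sketch in Python. The category names, the `LUMPING_DECISIONS` table, and the `lump` function are all hypothetical and are not a recommended grouping scheme.

```python
# Hypothetical illustration: every name and category below is an assumption,
# not a recommended grouping scheme.
LUMPING_DECISIONS = {
    # source descriptor: (analysis group, documented reason)
    "Samoan": ("Pacific Islander", "reported separately from Asian groups"),
    "Tongan": ("Pacific Islander", "reported separately from Asian groups"),
    "Korean": ("Asian", "too few records to report as its own group"),
}


def lump(descriptor):
    """Return (analysis group, reason) for a source descriptor.

    Descriptors without a documented decision are flagged rather than
    silently dropped or merged.
    """
    return LUMPING_DECISIONS.get(descriptor, ("UNMAPPED", "no documented decision"))


for descriptor in ["Samoan", "Korean", "Hmong"]:
    group, reason = lump(descriptor)
    print(f"{descriptor} -> {group} ({reason})")
```

Keeping the reason next to the mapping means the table itself can be shared with the results, and any descriptor that has not been considered yet shows up as `UNMAPPED` instead of disappearing.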
Community Advocacy
Advice on population descriptors often comes from communities speaking up for themselves when the usual grouping schemas don’t meet their needs. For example:
- Many Pacific Islanders reject the common grouping “Asian American and Pacific Islander” (AAPI). When researchers lump them together with the much larger “Asian American” group, their data is often not visible in the results. When governments then use this research to decide how to allocate resources, they overlook the needs of Pacific Islanders.
- Some groups also critique the label “Asian American” itself because it lumps many diverse groups into one. Since 1997, the U.S. Census has listed “Asian” and “Native Hawaiian or other Pacific Islander” as separate options. People filling out the Census can also choose a specific subgroup.1
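The way lumping can hide a smaller group can be seen with a little arithmetic. The sketch below uses invented counts (they are not real data) and hypothetical variable names to compare a combined “AAPI” rate with the rates for each group on its own.

```python
# Hypothetical counts, invented for illustration only (not real data).
groups = {
    "Asian American": {"n": 950, "cases": 95},
    "Pacific Islander": {"n": 50, "cases": 15},
}


def rate(cases, n):
    """Share of people in a group who have the outcome."""
    return cases / n


# Rate when both groups are lumped into one "AAPI" category.
combined_n = sum(g["n"] for g in groups.values())
combined_cases = sum(g["cases"] for g in groups.values())
combined_rate = rate(combined_cases, combined_n)

# Rate for each group reported on its own.
by_group = {name: rate(g["cases"], g["n"]) for name, g in groups.items()}

print(f"AAPI combined: {combined_rate:.0%}")
for name, r in by_group.items():
    print(f"{name}: {r:.0%}")
```

With these made-up numbers the combined rate is 11%, close to the 10% for the larger group, while the 30% rate for the smaller group does not appear anywhere in the lumped result.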
Changes to the U.S. Census
As in the example above, the U.S. Census has often been a site for debate about race and ethnicity categories. Two changes set to take place in 2030 include:
- Adding a “Middle Eastern or North African” (MENA) category for race and ethnicity. Until now, these groups have had no option to identify themselves other than “White.” As with Pacific Islanders in the AAPI grouping, this made their data invisible. This change comes after extensive advocacy from Arab Americans.
- Asking about race and ethnicity in a single multiple-choice question instead of two separate ones. This change was made so that people who identify as Hispanic or Latino can select that option on its own. Previously, many had to choose “Some Other Race” in the separate race question. However, some worry this change will make it harder to see the data of Latinos who choose more than one response, like Afro-Latinos.
Sex and Gender Categories
Sometimes changes that are meant to provide more accuracy do the opposite instead. In 2022, NASEM released guidance on asking about sex and gender in surveys. They recommended this two-question approach (direct excerpt):
Q1: What sex were you assigned at birth, on your original birth certificate?
- Female
- Male
- (Don’t know)
- (Prefer not to answer)
Q2: What is your current gender? [Mark only one]
- Female
- Male
- Transgender
- [If respondent is American Indian or Alaska Native] Two-Spirit
- I use a different term: [free text]
- (Don’t know)
- (Prefer not to answer)
While these questions are meant to help researchers collect data on transgender participants, they fall short. Critics point out that, for example, a trans woman would be forced to choose between “Transgender” and “Female.” They also note that including “transgender” as an option and not “cisgender” presents the cisgender experience as “normal.”
The Sexual and Gender Minority Interest Group at the National Cancer Institute (NCI) published a set of revisions to NASEM’s guidance. Their advice includes:
Other critics of these NASEM guidelines highlight other suggestions in their paper “Queering genomics: How cisnormativity undermines genomic science”:
Even though this advice was written for people who run surveys, other types of researchers can also learn from it. Repository researchers also make choices about categorizing sex and gender data that can impact the accuracy of research.
Why we can’t just give the “right answers”
While it would be nice if we could simply share the best practices for using population descriptors, we cannot. This is because:
- The “best” way will be different depending on the project. What works for one project might not work for another.
- As shown by these examples, this is an ongoing conversation. A simple guide would likely become outdated quickly.
Because of this, researchers must think carefully about what is right for their projects. CHIRON Exercise 1: Representing Groups Thoughtfully helps researchers make mindful decisions about these aspects of their research.
Sources and Further Reading
1. Revisions to the Standards for the Classification of Federal Data on Race and Ethnicity. The White House. Accessed November 11, 2025. https://obamawhitehouse.archives.gov/node/15626
11. U.S. Census Bureau. What Updates to OMB’s Race/Ethnicity Standards Mean for the Census Bureau. Census.gov. Accessed November 11, 2025. https://www.census.gov/newsroom/blogs/random-samplings/2024/04/updates-race-ethnicity-standards.html