How a Popular Medical Device Encodes Racial Bias
Pulse oximeters give biased results for people with darker skin. The consequences could be serious.
August 5, 2020
Editors’ Note: On December 17, 2020, the New England Journal of Medicine published a research letter, “Racial Bias in Pulse Oximetry Measurement,” prompted by some of the issues explored in this essay. (Read a New York Times story about the study here.) On January 28, 2021, U.S. Senators Elizabeth Warren, Cory Booker, and Ron Wyden wrote a letter to Janet Woodcock, acting commissioner of the U.S. Food and Drug Administration, urging the agency to address these concerns.
COVID-19 care has brought the pulse oximeter into many American homes. This compact medical device, costing as little as $20, clips onto a fingertip and helps gauge how much oxygen is making it to the blood. When COVID-19 fevers moved through my household earlier this year, everything suddenly revolved around the number on its tiny screen, which reports oxygen saturation as a percentage. Normal readings are in the range of 95 to 100 percent; my husband could only sleep if I stayed up to make sure his readings didn’t plummet into the 70s again. Our doctor said to go back to the hospital if the device’s reading dropped to 92 and stayed there, but most nights it hovered along that edge. I began to wonder exactly what this object was telling us.
To picture what’s happening inside a pulse ox—as health care providers call it—start by thinking about what’s happening inside your body. Blood saturated with oxygen is bright crimson thanks to iron-containing hemoglobin, which picks up the gas molecules from your lungs to deliver them to your organs. In the absence of oxygen, the same hemoglobin dims to a cold purple-red. The oximeter detects this chromatic chemistry by shining two lights—one infrared, one red—through your finger and sensing how much comes through on the other side. Oxygen-saturated hemoglobin absorbs more infrared light and also allows more red light to pass through than its deoxygenated counterpart. Adjusting for certain technicalities using your pulse, the device reads out the color of your blood several times a second.
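As a rough illustration of the two-light comparison described above, the sketch below computes the "ratio of ratios" that pulse oximeters are generally described as using, then maps it to a saturation percentage with a linear approximation often quoted in textbooks. The function names and the constants 110 and 25 are assumptions brought in for illustration and do not come from this essay; commercial devices rely on proprietary calibration curves fitted empirically to readings from volunteers.

```python
def ratio_of_ratios(red_ac, red_dc, ir_ac, ir_dc):
    """Compare the pulsing (AC) and steady (DC) light signals at each wavelength.

    The AC part rises and falls with each heartbeat, so it reflects arterial
    blood; dividing by the DC part normalizes for the steady absorption of the
    finger. This is the "adjusting using your pulse" step.
    """
    if red_dc <= 0 or ir_dc <= 0 or ir_ac <= 0:
        raise ValueError("DC signals and the infrared AC signal must be positive")
    return (red_ac / red_dc) / (ir_ac / ir_dc)


def estimate_spo2(r):
    """Hypothetical linear calibration curve (textbook approximation only).

    A lower ratio means relatively less red light is being absorbed, which
    corresponds to more oxygenated hemoglobin. The result is clamped to 0-100.
    """
    spo2 = 110.0 - 25.0 * r
    return max(0.0, min(100.0, spo2))


# Made-up sensor values, chosen only to show the calculation.
r = ratio_of_ratios(red_ac=1, red_dc=100, ir_ac=2, ir_dc=100)
print(r, estimate_spo2(r))
```

The point of the sketch is the last step: the mapping from ratio to percentage is fitted to measurements from real people, so the makeup of the calibration group shapes what the device reports for everyone else.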
To “see” your blood, though, the light must pass through your skin. This should give us pause, since a range of technologies based on color sensing are known to reproduce racial bias. Photographic film calibrated for white skin, for example, often created distorted images of nonwhite people until its built-in assumptions started to be acknowledged and reworked in the 1970s; traces of racial biases remain in photography still today. Similar disparities have surfaced around several health devices, including Fitbits. How had designers managed to avoid such problems in the case of the oximeter, I wondered? As I dug deeper, I couldn’t find any record that the problem ever was fully fixed. Most oximeters on the market today were initially calibrated primarily for light skin, and they still often reproduce subtle errors for nonwhite people.
In medical and technology communities there is a perception that this bias isn’t a big deal. To understand why, I reached out to manufacturers, doctors, researchers, and government regulators to ask for any updates to these previously documented issues. Many responded along these lines: “The errors haven’t really been dealt with, but here’s why it doesn’t matter.” Others thought the stories that get told about the harmlessness of racial disparities reveal the very opposite: unequal standards that have become normalized. It all matters—the errors, the history that produced them, the future they’re being built into, and the justifications about racism they reveal in U.S. science and medicine.
• • •
In 2005 a team of physicians studied oximetry’s racial bias in critical detail. The group often works at the famous mountaintop Hypoxia Lab, founded at the University of California, San Francisco (UCSF) by John Severinghaus, inventor of blood gas analysis, who did foundational work in medical devices for anesthesiology. “In our eighteen years of testing pulse oximeter accuracy,” the team noted in their article, “the majority of subjects have been light skinned. . . . Most pulse oximeters have probably been calibrated using light-skinned individuals, with the assumption that skin pigment does not matter.”
But after the team heard about a range of “unacceptable errors in pulse oximetry” among Black wearers, the UCSF study was “specifically designed to determine whether errors at low [arterial oxygen saturation] correlate with skin color.” Since errors don’t tend to show up at healthy oxygen levels, a special protocol is necessary to check accuracy at lower oxygen, which better simulates an actual health crisis. The doctors collected readings from a range of people using several pulse ox models, then checked their readings against a different kind of test based on arterial blood gas, the “gold standard” test for oxygen levels. (The latter measure is more invasive, requiring blood from an artery, which is why the pulse ox is often used as a proxy in hospitals.)
Crosschecking these two measures over 1,067 data points, the team found a clear pattern of errors. For nonwhite people the machines mostly tended to overestimate saturation levels by several points. The study only included participants who identified as Black or white, but the authors noted that degrees of errors have also been observed among Latinx, Indigenous, and many other nonwhite people. The team’s follow-up study, published in 2007, focused on safety errors for people with “intermediate” skin tones and included a larger group of women. This more detailed data again found a clear pattern: pulse ox “bias was generally the greatest in dark-skinned subjects, intermediate for intermediate skin tones, and least for lightly pigmented individuals.” Racial errors grew significant at lower oxygen levels, starting around 90 and growing widest in the 70s.
In principle, the implications can be troubling. The night we first got a pulse ox, my husband woke up with his oxygen at 77. In their studies of that low saturation range, the UCSF doctors noticed “a bias of up to 8 percent . . . in individuals with darkly pigmented skin,” errors that “may be quite significant under some circumstances.” Thus, for a nonwhite person, a reading of 77 like my husband’s could hide a true saturation as low as 69—even greater immediate danger. But EMTs or intake nurses might not be able to detect those discrepancies during triage. The number appears objective and race-neutral.
Indeed, while the oximeter is a key tool for some patients in deciding when to go to the hospital, it’s also what they use at the hospital. Clinical guidance about giving oxygen tends to be loosely keyed to a certain threshold of oxygen saturation; protocols recommend particular interventions at 88, 90, and 92 percent, for example. Racial errors in these higher saturation ranges tend to be narrower disparities of one to four percentage points, but they still can mislead if they go undetected. In particular situations, another study notes, errors of that margin “may severely affect the treatment decisions in borderline cases.”
This might seem like a fine point, but medicine is made of fine points that turn into ordinary decisions. Using the UCSF data, one company’s illustrations demonstrate the skin color variability of three brands of pulse oximeters (Nonin, Nellcor, and Masimo) for one of the most common clinical decision points: a reading of 88 percent. Pulse ox readings can also be affected by conditions such as anemia, jaundice, poor circulation, and nail polish. Physicians in a clinic may not distinguish errors stemming from an underlying condition and those caused by the device’s bias on darker skin. The UCSF lab data are revealing on this point. The study participants were “healthy, nonsmoking” Black and white young people in their twenties and thirties, mostly UCSF medical students, none of whom “had lung disease, obesity, or cardiovascular problems.” This pool of participants allowed the researchers to isolate skin color calibration errors alone, eliminating misreadings due to underlying comorbidities.
Image courtesy of Nonin Medical, illustrating the findings of Feiner et al. (2007). Nonin Technical Bulletin, September 2008.
Most hospital protocols now recommend starting oxygen at 90. Below that threshold damage to vital organs such as the heart, brain, lungs, and kidneys becomes an immediate danger. In a mixed general population, a true blood oxygen saturation of 88 percent would, on average, produce a pulse ox reading of 89 to 90 using the most common meter in hospitals. In that case, guidelines would correctly suggest going on oxygen. But Black patients, equally in crisis at 88, would get an average reading of 91—just above the intervention threshold.
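The arithmetic in this scenario can be made explicit with a minimal sketch. The function name and the fixed offsets are hypothetical; the offsets are simply the average figures quoted above (a true 88 reading as up to 90 in a mixed population and as 91 for Black patients), not a model of any particular device.

```python
OXYGEN_THRESHOLD = 90  # displayed reading at or below which oxygen is started


def recommends_oxygen(displayed_reading, threshold=OXYGEN_THRESHOLD):
    """Return True when the number on the screen falls at or below the threshold."""
    return displayed_reading <= threshold


true_saturation = 88  # the same underlying crisis for both patients

# Average overestimates in percentage points, taken from the paragraph above.
average_overestimate = {
    "mixed general population": 2,
    "Black patients": 3,
}

for group, offset in average_overestimate.items():
    displayed = true_saturation + offset
    print(f"{group}: reads {displayed}, oxygen recommended: {recommends_oxygen(displayed)}")
```

A one-point difference in the displayed number is enough to flip the outcome, even though the true saturation is identical in both cases.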
Physicians disagree on the clinical significance of these discrepancies. Do slight racial errors really matter in practice? Like any vital sign, pulse ox readings are one among many factors considered when making a critical care decision. Most caregivers I spoke to noted that a nurse or doctor on careful watch, drawing on a range of other information, would use their training to pick out patterns and place numbers in broader context alongside a patient’s perceived sense of distress. One critical care specialist told me she felt that the errors found by the UCSF studies would not change the care that patients with darker skin receive where she worked. I could imagine how that may be true in particular cases such as hers, but no one had collected reassuring evidence about the topic at her hospital—much less nationally or globally—so I found myself staring at the disquieting graphs of the only systematic data available as she told anecdotes about how she would contextualize such readings. I hung up the phone feeling unsettled by her words: there was “usually” no way this could matter, she said. Her insights helped me formulate a more elusive question: What about those moments that fall outside usually?
In my own experience this spring, the hospital’s pulse ox gave a reading of 91 exactly as I arrived at the ER with trouble breathing. I was told that around 90 might mean I needed oxygen, while 91 meant wait and see. This seemed to be the rule of thumb in use, though it did not appear hard and fast. I did not receive crosschecks such as an arterial blood gas test. Such procedures are much more common in critical care units, but 95 percent of people coping with coronavirus today never end up there. The ER nursing team around me seemed to be looking at the pulse ox numbers very closely. They were wary about the “happy hypoxia” associated with COVID-19. Before long my oxygen came up a few points and I was sent home, still with difficulty breathing, now with instructions to keep isolating and buy a pulse ox. I am white, and these calls worked out. But a Black person with the same pulse ox reading at intake could have been at or below the threshold to get oxygen. How would anyone have known for sure?
These concerns don’t end with clinical practice, either. Medicare reimbursement also uses pulse ox measures as key thresholds, with much less nuance than a nurse or doctor. At a reading of 88 or 89, Medicare will reimburse for oxygen at home, but at 90 it won’t. In effect, this means people with darker skin may have to be sicker in order to qualify for the same treatment as people with white skin. This could lead to delays in recovery, worse outcomes, and a greater likelihood of future comorbidities as patients wait for the meter to catch up to bodily realities.
• • •
Some caregivers I spoke with sounded exhausted by fielding questions about pulse ox biases. They were beleaguered, no doubt, by a thousand other COVID-19 contingencies and more obvious manifestations of inequities. Even if they had never noticed glitches, it could be painful to wonder. Others I spoke to argued that any racial discrepancies at all were simply unacceptable. When people rely on devices for a snapshot, just as with Kodak film, shouldn’t everyone’s picture be equally clear? Anything less widens room for mistakes that may amplify existing inequalities. It creates a situation where hospital care teams need to work around the subtle racial biases of their tools.
“How is racism operating here?” The physician, epidemiologist, and civil rights activist Camara Phyllis Jones urges health practitioners to ask this question throughout their work. In the case of pulse oximetry, errors of slight degrees mean a lot more than they otherwise might because of the larger patterns of inadvertent racism in hospitals they plug into. Nonwhite patients are already more likely to have identical signs classified as less urgent by physicians, as decades of research documenting unintentional medical racism shows. Measurement errors falsely indicating that hospitalized patients are safer than they are could further contribute to suboptimal care. As caregivers argue, “Any decision making rooted in implicit bias is detrimental” when “an incorrect assumption could literally mean the difference between life and death.”
Amid problems with unreliable testing for COVID-19, for example, some patients of color report being dismissed from the ER by doctors attributing their difficulty breathing to anxiety. In fact, in the name of combatting known treatment disparities in ERs, the Association of American Medical Colleges suggests hospitals “remove as much individual discretion as possible,” instead seeking “objective measures” to help doctors overcome “implicit biases that providers don’t even know they have.” In reality, the policy could further amplify the problem in cases where seemingly objective measures like pulse ox readings themselves display hidden racial bias. What happens when efforts to overcome physician bias rely on devices that are also biased?
On top of this, pulse ox data is a key vital sign being fed into the algorithms that increasingly guide hospital decisions. As reported in Nature and Science, many algorithms already suggest inadequate care along patterned racial divides: patients of color have to be sicker, on average, in order to receive the same interventions as white patients. They are less likely to be promptly identified for ICU admission, even with otherwise identical profiles. Yet algorithmic tools such as the EPIC “Deterioration Index” can only aspire to be as good as the instruments feeding data into them. With pulse ox disparities, what are machines learning from these distorted inputs? The proprietary EPIC Early Warning equation incorporates the Rothman Index, and half of the eight cut-off numbers for oxygen saturation built into that measure are in the range for racial errors. Like the problems magnified by “the coded gaze” of algorithms elsewhere, even small racial disparities could amplify unequal outputs.
Beyond the pulse ox alone, this also matters for other wearable chromatic devices and the algorithms they feed. Pretending that they are colorblind can further amplify how “Racism, Not Genetics, Explains Why Black Americans Are Dying of COVID-19.” I called my colleague from MIT’s Little Devices Lab, Jose Gomez-Marquez, whose research involves prying open devices to understand their inner workings. He always knows the latest med-tech rumors, and I wanted to ask if there was some inside story about recalibrating oximeters more recently. Had there been some quiet racial justice work that already made corrections for its biased design?
None that he’d heard of, Jose said. Oximeters predated much of the current DIY digital medical technology scene, developed across Europe, North America, and Japan decades ago. Among makers today, the device is often considered simple to the point of being child’s play, in comparison to the cutting-edge spaces where most groups compete for prestigious breakthroughs and lucrative markets.
For devices shaped by “discriminatory design,” as sociologist of science and technology Ruha Benjamin calls it, inequalities that are not intentional can still produce patterned exclusions and unequal rates of survival. The UCSF doctors who documented these disparities suggested “built-in user-optional adjustments” be designed into future models. But more than a decade later, I couldn’t find any examples on the market. The doctors also concluded that, at bare minimum, “warning labels should be provided to users, possibly with suggested correction factors.” I checked the box my pulse ox came in, but it only had fine print about inaccuracies linked to dark nail polish.
When I reached out to the team behind those breakthrough UCSF studies fifteen years ago, Professors Philip Bickler and John Feiner, they confirmed that they had not yet seen evidence of the change they hoped for around this issue. Bickler—now chief of neuroanesthesia, UCSF professor, and collaborating director of the Hypoxia Lab—said that as far as he was aware, “Manufacturers, as a group, have not responded at all adequately to this problem.” He notes that he views the current state of oximetry as a “great example of a bias in medical technology that disenfranchises a huge percentage of the earth’s population,” which especially worries him with “COVID-19 disproportionately affecting Black and Latinx populations.”
One pulse ox manufacturer, Nonin, sought to address race-based errors in their devices back in the 2000s. A page of their website explains their work so far in comparison to their larger competitors. Several other companies in the original study also graciously replied to my questions, but none provided data showing the problem has been fixed. I combed through published studies they pointed to for context. The most widely quoted was a study from 2017, which several companies presented as a bright spot showing that oximetry readings were not racially biased among thirty-five infants. (Other studies have shown that babies’ low melanin production and the much thinner microstructure of newborn skin leave them less susceptible to chromatic measures’ racial bias.) This is reassuring news for infant ICUs but it does not tell us the device errors have been fixed for others: the study itself notes standing disparities for adults.
One of the largest manufacturers said they had reassuring internal data for one specific line of models, but that response left me wondering about the many other models they sell to hospitals today. Companies should create a public-facing record and global historical memory of any such corrective work that has already happened, behind the scenes of our health systems’ privatized patchworks, to let us all know clearly where things stand. After all, these are not new questions: while COVID-19 gives new emphasis to the pulse ox, the device has long been crucial for treating respiratory conditions with their own histories of chronic racial biases in diagnosis and care.
At present, there seems to be little consensus among doctors, too, about what to make of the available studies, including those cited in 2019 textbooks on the need to correct for devices’ racial errors. One such study still being reprinted from 1990 recounts data showing the pulse ox target used for white patients on ventilators, 92, often resulted in hypoxia for Black patients; for this patient group, a pulse ox reading of 95 corresponded to an arterial blood gas reading of 92. Yet several doctors I checked with said they never learned this, even back in 1990. Should health care providers be aware of these significant errors, or are textbooks teaching doctors outdated corrections that could also potentially do harm by leading to confusion or wrong adjustments? Companies should be transparent, assessing and clarifying any margins of racial bias on their websites, because getting this wrong in either direction could amplify racial care disparities.
• • •
Until then, the pulse ox could be read as a case study of systemic racism in miniature, a nexus where, as anthropologists note, black boxes and public secrets often go hand in hand. Since the original UCSF study ended with a call to action, it is disturbing to track its afterlife in the medical literature and within the contours of the present pandemic. Later studies citing the UCSF work often imply the bodies of nonwhite people are to blame for making the device malfunction. Most recently, one 2020 study attributed race-based pulse ox errors to “co-morbidities upon which the device is used.” But the participants in the UCSF study had no underlying medical conditions; they were healthy young Black people.
In the 1990s the Food and Drug Administration (FDA) stopped allowing all-white male study samples. But mostly white study samples are still the norm; current guidelines suggest including at least two people with “darkly pigmented” skin in a group otherwise 85 percent white. Yet this can still obscure errors due to racial bias, by allowing those few participants’ data points to be cast as outer clusters in white-centric safety standards. As scholar of institutional cultures Sara Ahmed explains, this type of structure for “being included” still reproduces and recasts the norms of an unmarked white center, “against which others appear as points of deviation.”
One early literature review commenting on pulse ox racial bias, for example, highlighted several studies showing “significantly more signal quality problems” for Black patients. It also covered one study that did not find any bias—but the reviewers noted that the last study only included four Black patients in a group of twenty-one subjects, so “the population size was probably too small to show up minor differences in pulse oximetry performance.” That study, critiqued as inconclusive to assess bias because it had under-sampled people of color back in 1991, included the exact ratio of nonwhite participants that the FDA guidelines still recommend including today.
The UCSF studies provided an illuminating alternative model to correct such issues: by collecting data for equal-sized subgroups, they broke the numbers down to check whether it was equally safe for each group. This showed something the FDA study designs had worryingly missed: the most common oximeters in U.S. hospitals at the time did not meet FDA thresholds of safety for people with darker skin. When those data points get blended into mostly white statistics, the data may look fine. In this, the pulse ox is also a microcosm for the problems facing our democracy. Equal safety does not mean majority-fits-all.
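A small sketch shows how pooling can hide a subgroup problem. Every number here is invented for illustration: the twenty hypothetical participants, their errors, and the tolerance of 3 points are assumptions, not data from the UCSF studies and not the actual FDA criterion.

```python
from statistics import mean

# Each row: (skin tone group, pulse ox reading minus arterial blood gas value).
# Hypothetical sample that is 85 percent light-skinned, with made-up errors.
errors = [("light", 1.0)] * 17 + [("dark", 4.0)] * 3

TOLERANCE = 3.0  # assumed acceptable average overestimate, in percentage points


def mean_bias(rows):
    """Average error across every participant, ignoring group membership."""
    return mean([error for _, error in rows])


def bias_by_group(rows):
    """Average error computed separately for each group."""
    grouped = {}
    for group, error in rows:
        grouped.setdefault(group, []).append(error)
    return {group: mean(values) for group, values in grouped.items()}


pooled = mean_bias(errors)
by_group = bias_by_group(errors)

print(f"pooled bias: {pooled:.2f} (within tolerance: {pooled <= TOLERANCE})")
for group, bias in by_group.items():
    print(f"{group} bias: {bias:.2f} (within tolerance: {bias <= TOLERANCE})")
```

In this toy sample the pooled average passes the assumed tolerance while the smaller subgroup fails it, which is the pattern that only a group-by-group breakdown can reveal.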
These devices’ subtle inequities are also haunted by much deeper histories of racism in science and medicine. During the time when corporations rose from plantations, machines to measure breathing were designed to quantify—and justify—racial hierarchies. These orders were built on the idea that darker skin color itself was a comorbidity. Medical doctors of the era argued that violent regimes of Black enslavement and Indigenous dispossession were not unjust because they held important health benefits for the supposedly inherently dysfunctional biologies of nonwhite people. Certain devices to measure breathing became part of larger machines to keep people in place, as historian Lundy Braun shows in her work on this medical legacy. This is part of larger patterns that scholars such as Dorothy Roberts and Anne Fausto-Sterling show get continuously encoded into medical school curricula and scientific health research taken to be cutting-edge. Even today, in many clinics, the spirometer often has a “race button” as a legacy of this disturbing history.
Oximeters, by contrast, were first conceived to monitor and protect the breathing capacities of those with privileged mobility. It is no coincidence that novelist Esi Edugyan imagined freedom’s trajectory as a hot air balloon ride over a sugar plantation: in fact, the idea for oximetry began at that height. Hot air balloon experiments in the 1800s led to the development of blood oxygen saturation measures after scientist-adventurers became paralyzed while airborne, as made famous across Europe and the Americas by scientist Paul Bert’s studies of the Zenith (though the pulse oximeter as known today wouldn’t be realized until decades later, by Takuo Aoyagi). Now crucial to the practice of anesthesiology, the device was initially most popular among those able to reach high altitude: pilots, astronauts, mountaineers. Oximetry’s origins came from the sciences of safety for white flight, and pulse oximeters still protect people unevenly against a virus that causes difficulty breathing, in ways that some experts liken to falling oxygen at high altitude.
There is no reason to build these disparities into the next generation of technologies. Yet that is exactly what will happen if we don’t take active steps to remove existing racial biases. The pulse ox’s unequal metrics are one among countless converging factors that stack the deck against nonwhite people facing systemic inequities. Yet there will never be one single reset button for history, activists remind us; the hard work ahead is tackling each facet of such inequity as it comes into view. Rather than normalized inequalities, the pulse ox could become a case study in everyday repair work, as Toni Morrison calls it—small, material, mundane practices in the direction of justice. In the face of vastly unfinished racial reckoning and historical repair, it matters all the more to do the work of investing in the small chances for concrete action right now in our hands.
Engineers at MIT, for example, say adding adjustable LED lights to pulse oximeters could enable devices to set individualized baselines for each wearer, tailoring accuracy and fostering equitable safety. The technical capacity already exists. Funding from the National Institutes of Health could help fast-track long overdue corrections as part of a broad consortium coming together to fix this problem, to share progress so far and resources moving forward. With COVID-19 death tolls already over 160,000 in the United States alone and rising daily, the pulse ox is a vital tool for survival. It should not work least accurately for those whose health is most in danger.
These patterned errors are disturbingly symbolic traces of whose safety our institutions and technologies were built for, leaving people of color to hope that less than equal will be good enough. Truly rethinking collective safety and justice means teaching the next generation—and trying to learn ourselves—how to build worlds that don’t normalize any margin of error that would disproportionately obfuscate patients’ vital signs based on the color of their skin. Each moment until this work exists, oximeters remain another disturbing materialization of how white supremacy has been built into our systems and infrastructures of perception—even programmed into the very machines we rely on to quantify danger when someone can’t breathe.