AI and Skin of Color: Hidden Biases Raise Questions

The call for artificial intelligence (AI) to transform medicine and enhance physicians’ ability to improve diagnosis, treatment, and patient care has never been so clear. Within dermatology, there has been great interest in utilizing these tools but not without acknowledging that much work still needs to be done.

When companies deploy AI programs, such as those for facial recognition, they are using machine learning, a subfield of artificial intelligence that enables computers to learn from accumulated data. Over time, programmers can fine-tune the training model, or machine learning algorithm. In theory, the more data, the better the program. However, the quality of the data also plays a significant role and cannot be ignored.

Facial Recognition

For years, evidence has accumulated that face-analysis algorithms are less accurate for darker skin types. A 2018 study found that facial analysis services from Microsoft, Face++, and IBM disproportionately misclassified the gender of darker-skinned individuals and female faces. Joy Buolamwini, lead author of the study from the MIT Media Lab and founder of the Algorithmic Justice League, cautioned that facial recognition technology is increasingly being used by law enforcement agencies and in public arenas without being tested for accuracy, with widespread implications for employment and imprisonment.¹

In 2022, in accordance with its new Responsible AI Standard, Microsoft announced that it would remove from its Azure AI services certain facial analysis tools that infer attributes such as gender, age, emotional state, hair, and makeup. After a 10-year run, Facebook, now Meta, ended its facial recognition feature, which recognized friends in photos and suggested them for “tagging.”

Since then, several major corporations, including Google and Meta, have adopted measures to test the efficacy of their AI software. Google introduced the Monk Skin Tone (MST) Scale, developed in collaboration with Harvard sociologist Ellis Monk to improve on the Fitzpatrick skin type scale by increasing the number of skin tones from 6 to 10. Google has implemented the MST Scale in its products, classifying search results according to the scale. Nevertheless, there have been reports of photos being mislabeled in ways that reflect stereotypes.

Addressing Skin Tone Bias

This year, Sony published research suggesting that there are additional layers of bias related to skin color within computer vision datasets and models.² The study found that current skin color scales influence the way AI classifies people and their emotions. When skin is lighter or redder, AI is more likely to consider the individual to be smiling; when it is darker or yellower, AI is less likely to do so. Bias, then, exists not only for skin tone but also for skin hue. In response, Sony has introduced the “Hue Angle,” a new classification system that quantifies the extent to which datasets are skewed toward light-red skin color and under-represent dark-yellow skin color. This tool may become one of many used to mitigate skin tone bias in AI models.
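To make the idea of a hue angle concrete, the following is a minimal Python sketch, not Sony's implementation. It assumes skin color is expressed in the CIELAB color space, where L* is perceptual lightness and the hue angle is the arctangent of b* over a*. The function names and the cutoff values (60 for lightness, 55 degrees for hue) are illustrative assumptions and should be checked against the published study before any real use.

```python
import math


def srgb_to_lab(r, g, b):
    """Convert an 8-bit sRGB color to CIELAB using the D65 white point."""
    def to_linear(c):
        c = c / 255.0
        return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4

    rl, gl, bl = to_linear(r), to_linear(g), to_linear(b)

    # Linear sRGB to CIE XYZ (standard D65 matrix).
    x = 0.4124564 * rl + 0.3575761 * gl + 0.1804375 * bl
    y = 0.2126729 * rl + 0.7151522 * gl + 0.0721750 * bl
    z = 0.0193339 * rl + 0.1191920 * gl + 0.9503041 * bl

    def f(t):
        delta = 6 / 29
        return t ** (1 / 3) if t > delta ** 3 else t / (3 * delta ** 2) + 4 / 29

    fx, fy, fz = f(x / 0.95047), f(y / 1.0), f(z / 1.08883)
    return 116 * fy - 16, 500 * (fx - fy), 200 * (fy - fz)


def hue_angle_degrees(a_star, b_star):
    """Hue angle in degrees: arctan(b*/a*), mapped to the range [0, 360)."""
    return math.degrees(math.atan2(b_star, a_star)) % 360


def classify_skin_color(lightness, hue, lightness_cut=60.0, hue_cut=55.0):
    """Place a color in one of four groups. The cutoffs are assumptions."""
    tone = "light" if lightness > lightness_cut else "dark"
    shade = "yellow" if hue > hue_cut else "red"
    return f"{tone}-{shade}"
```

Counting how many images in a dataset fall into each of the four groups would give a rough picture of whether it leans toward light-red skin color, which is the kind of skew the study describes.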

While there have been calls for stakeholders to collaborate on identifying bias in facial recognition, skin image analysis, and deepfake detection, and on building open-source datasets, some groups, such as the Electronic Frontier Foundation (EFF), vehemently criticize AI-powered facial recognition altogether, invoking the Orwellian feeling that “Big Brother is watching you.”

Quality In, Quality Out: Why Diversity Training of AI Is Critical

If trained on data not representative of the real world, an AI algorithm will underperform and potentially perpetuate biases.

By Daniel Schlessinger, MD

Although it has existed in some form for decades, artificial intelligence (AI) has only recently entered the general lexicon with the release of public-facing algorithms such as DALL-E and ChatGPT. In medicine, and particularly in fields with an emphasis on visual recognition (such as dermatology, radiology, and pathology), AI algorithms in controlled environments have been shown to be effective at completing a variety of tasks, ranging from diagnosing pigmented skin lesions from dermatoscopic images¹ to evaluating histopathologic images and grading severity of alopecia.² While these achievements within dermatology have been impressive, the way they have handled skin-of-color dermatology and various forms of bias has not always been careful or consistent.

AI algorithms are subject to bias in multiple forms. Just as a child forms impressions of the world based on heuristics and prior exposure, an AI algorithm is shaped by the data it has seen; it does not exist in a vacuum. With a lack of skin-of-color representation in dermatology textbooks, diagnosing dermatologic conditions in patients with diverse skin tones is already challenging for many dermatologists;³⁻⁵ the introduction of AI to clinical practice has the potential to either ease or exacerbate this problem.

If trained on data not representative of the real world, an AI algorithm will underperform and potentially perpetuate biases. A skin cancer detection model trained only on images of skin cancer in lighter-skinned individuals will generally misdiagnose future cases in patients with darker skin types, which has been borne out in multiple studies.⁶,⁷ Notably, perpetuation of racial bias is not only an issue for computer vision algorithms; it has also recently been shown to exist in large language models.⁸
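One way to surface this kind of underperformance is to report a model's sensitivity separately for each skin type group rather than as a single overall figure. The sketch below is illustrative only: the function name, the record layout, and the group labels are assumptions, not part of any cited study.

```python
from collections import defaultdict


def sensitivity_by_group(records):
    """Compute sensitivity (true positive rate) per skin type group.

    Each record is (group, true_label, predicted_label), where a label
    of 1 means malignant and 0 means benign. Groups with no malignant
    cases are omitted because sensitivity is undefined for them.
    """
    true_pos = defaultdict(int)
    false_neg = defaultdict(int)
    for group, truth, predicted in records:
        if truth == 1:
            if predicted == 1:
                true_pos[group] += 1
            else:
                false_neg[group] += 1

    groups = set(true_pos) | set(false_neg)
    return {g: true_pos[g] / (true_pos[g] + false_neg[g]) for g in groups}


# Hypothetical toy data, invented purely to show the calculation.
toy_records = [
    ("I-II", 1, 1), ("I-II", 1, 1), ("I-II", 1, 1), ("I-II", 1, 0),
    ("V-VI", 1, 1), ("V-VI", 1, 0),
    ("V-VI", 0, 0),
]
results = sensitivity_by_group(toy_records)
```

A large gap between groups in a report like this would suggest that the training data did not represent all skin types adequately.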

Not all algorithms are created equal, and there will likely be an increasing push by regulatory and governing bodies to develop AI standards of equity, among other things. As a member of the Standards Working Group of the AAD’s Augmented Intelligence Committee, I have had a chance to participate in discussions about what exactly these quality standards should be. Aside from promoting the accuracy of the information provided, we are also focused on data transparency, clarity in training and test data, and the safety of the AI solution. These standards are meant to ensure, for example, that AI models disclose data on the diversity of their training data. A majority of currently published dermatology algorithms do not disclose this data, and those that do often do not include any patients of Fitzpatrick type V or VI.⁹,¹⁰
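As a rough illustration of what such a disclosure could look like, the sketch below summarizes the Fitzpatrick skin type labels attached to a training set and flags any types with no images at all. The function name and the use of `None` for an unlabeled image are assumptions made for this example, not a proposed standard.

```python
from collections import Counter

FITZPATRICK_TYPES = ("I", "II", "III", "IV", "V", "VI")


def skin_type_report(labels):
    """Summarize Fitzpatrick type representation in a list of image labels.

    Returns the share of labeled images per type, the types with zero
    images, and the number of images with no skin type recorded (None).
    """
    counts = Counter(labels)
    labeled_total = sum(counts[t] for t in FITZPATRICK_TYPES)
    shares = {
        t: (counts[t] / labeled_total if labeled_total else 0.0)
        for t in FITZPATRICK_TYPES
    }
    missing = [t for t in FITZPATRICK_TYPES if counts[t] == 0]
    return {"shares": shares, "missing": missing, "unlabeled": counts[None]}


# Hypothetical label list, invented purely to show the report.
toy_labels = ["I"] * 2 + ["II"] * 5 + ["III"] * 2 + ["IV"] + [None] * 3
report = skin_type_report(toy_labels)
```

In this toy case the report lists types V and VI as missing, the kind of gap the reviews cited above found in published datasets.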

As with almost any medical intervention, something that can help may also cause harm. Especially when considering already marginalized groups, it is essential that we build guardrails around AI not only to avoid worsening existing inequities, but also, we hope, to ameliorate healthcare disparities. With thought and care, our specialty can help lead the way in responsible AI development.

Disclosures: Co-founder, FixMySkin Healing Balms; Shareholder, Appiell.

Daniel Schlessinger, MD, FAAD, is currently completing a fellowship in Mohs and cosmetic surgery at Northwestern University. He serves on the American Academy of Dermatology’s Augmented Intelligence Task Force and Clinical Guidelines Committee.

1. Esteva A, Kuprel B, Novoa RA, et al. Dermatologist-level classification of skin cancer with deep neural networks. Nature. Feb 2 2017;542(7639):115-118. doi:10.1038/nature21056

2. Lee S, Lee JW, Choe SJ, et al. Clinically applicable deep learning framework for measurement of the extent of hair loss in patients with alopecia areata. JAMA Dermatol. Sep 1 2020;156(9):1018-1020. doi:10.1001/jamadermatol.2020.2188

3. Adelekun A, Onyekaba G, Lipoff JB. Skin color in dermatology textbooks: An updated evaluation and analysis. J Am Acad Dermatol. 2021;84(1):194-196. doi:10.1016/j.jaad.2020.04.084

4. Harp T, Militello M, McCarver V, et al. Further analysis of skin of color representation in dermatology textbooks used by residents. J Am Acad Dermatol. 2022;87(1):e39-e41. doi:10.1016/j.jaad.2022.02.069

5. Syder NC, Omar D, McKenzie S, Brown-Korsah JB, Taylor SC, Elbuluk N. Gaps in medical education curricula on skin of color in medical school, residency, and beyond: Part 1. J Am Acad Dermatol. 2023;89(5):885-892. doi:10.1016/j.jaad.2022.03.053

6. Adamson AS, Smith A. Machine learning and health care disparities in dermatology. JAMA Dermatol. Aug 1 2018;doi:10.1001/jamadermatol.2018.2348

7. Han SS, Kim MS, Lim W, Park GH, Park I, Chang SE. Classification of the Clinical Images for Benign and Malignant Cutaneous Tumors Using a Deep Learning Algorithm. J Investig Dermatol. Jul 2018;138(7):1529-1538. doi:10.1016/j.jid.2018.01.028

8. Omiye JA, Lester JC, Spichak S, Rotemberg V, Daneshjou R. Large language models propagate race-based medicine. NPJ Digit Med. Oct 20 2023;6(1):195. doi:10.1038/s41746-023-00939-z

9. Guo LN, Lee MS, Kassamali B, Mita C, Nambudiri VE. Bias in, bias out: Underreporting and underrepresentation of diverse skin types in machine learning research for skin cancer detection: A scoping review. J Am Acad Dermatol. Jul 2022;87(1):157-159. doi:10.1016/j.jaad.2021.06.884

10. Daneshjou R, Smith MP, Sun MD, Rotemberg V, Zou J. Lack of transparency and potential bias in artificial intelligence data sets and algorithms: A scoping review. JAMA Dermatol. Nov 1 2021;157(11):1362-1369. doi:10.1001/jamadermatol.2021.3129

Physicians, Patients, and Consumers

Within dermatology and aesthetics, there has been a rapid rise in the number of tools and applications targeted to both practitioners and patients. Whereas dermatologists once had the option to take standardized before-and-after photos of patients, they can now show patients during cosmetic consultations what they would look like after rejuvenation with filler or toxin. Patients may be shown simulations of aging and see what they would look like at certain time points in their lives. For the general public, there has been a flood of virtual “try on” and skincare analysis applications put forth by skincare, haircare, and makeup companies. Consumers can use these applications to try to match themselves to the correct foundation shade, figure out which areas of their face are oily or dry, and continue to obsess about the number of pores they have.

However, we do not know where the data used to build these algorithmic models come from, nor how the models are validated. Does the model account for how photographic light falls on lighter versus darker skin types and whether that would make one appear oilier, for example? Does it have enough representation of certain racial and ethnic groups? For example, the data used for machine learning to diagnose skin cancers and other dermatological conditions rely primarily on fair-skinned populations across the US, Europe, and Australia. How do these models work in diverse populations, or for diseases in which individuals of certain backgrounds make up a smaller proportion of cases?

We also make assumptions about the aging process when we simulate how facial features change over time, and about aesthetic ideals; implicitly, we assume that everyone desires full lips.

When populations are relatively homogeneous, these algorithmic models may have more validity. However, as the diversity and complexity of the US population increase, it is outdated to limit the selection to White, Black, Hispanic, or Asian. Patients will soon recognize that these tools yield generalizations or reinforce existing stereotypes rather than delivering the personalization that they truly desire.

Thus, while technology can be deployed with benign intentions to make our lives easier, we must be cognizant of the deleterious impact that it may have and the unforeseen biases that exist. Keeping these things in mind, we can harness the power of AI to achieve the outcomes we desire.

1. Buolamwini J, Gebru T. Gender shades: Intersectional accuracy disparities in commercial gender classification. Proc Mach Learn Res. 2018;81:77-91.

2. Thong W, Joniak P, Xiang J. Beyond skin tone: A multidimensional measure of apparent skin color. arXiv. 2023. doi:10.48550/arXiv.2309.05148
