Artificial intelligence (AI) has the potential to revolutionize industries and transform society. However, the lack of training data for marginalized groups is a significant obstacle to developing ethical, inclusive, and equitable AI. To build fair AI systems, we must prioritize diverse training datasets. It is not in the stars to hold our destiny but in ourselves.
Training data is the information an AI system uses to learn patterns and make predictions. If a training dataset contains little data about a particular group, the system’s predictions for that group will tend to be less accurate.
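To make that mechanism concrete, here is a minimal synthetic sketch in Python (using NumPy and scikit-learn). Everything in it is invented for illustration: two hypothetical groups, A and B, follow different feature-label relationships, and group B supplies only a small fraction of the training data, so the model mostly learns group A’s pattern.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

def make_group(n, weights):
    """Sample features and noisy labels from a group-specific linear rule."""
    X = rng.normal(size=(n, 3))
    y = (X @ weights + rng.normal(scale=0.5, size=n) > 0).astype(int)
    return X, y

# Hypothetical groups: A dominates the data, B is underrepresented
# and follows a different feature-label relationship.
Xa, ya = make_group(5000, np.array([1.0, 0.5, -0.5]))
Xb, yb = make_group(250, np.array([-0.5, 1.0, 0.8]))

X = np.vstack([Xa, Xb])
y = np.concatenate([ya, yb])
group = np.array(["A"] * len(ya) + ["B"] * len(yb))

X_tr, X_te, y_tr, y_te, g_tr, g_te = train_test_split(
    X, y, group, test_size=0.3, random_state=0, stratify=group)

model = LogisticRegression().fit(X_tr, y_tr)

# Overall accuracy looks fine, but the per-group breakdown does not.
for g in ("A", "B"):
    mask = g_te == g
    print(f"group {g}: n_test={mask.sum():4d}  "
          f"accuracy={model.score(X_te[mask], y_te[mask]):.2f}")
```

In this toy setup, accuracy for group A is high while accuracy for group B hovers near or below chance: the model simply never saw enough of group B’s data to learn its pattern.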
Training Data Gaps and Opportunities.
We are training AI systems on data that inevitably reflects the past. Algorithms that absorb historical biases can reproduce social inequality and discriminate against specific groups.
Examples of AI ethics dilemmas include tools that perpetuate housing discrimination in tenant screening and mortgage qualification; Amazon’s AI recruiting tool, shut down after a year of use when it was found to discriminate against women; and algorithms that assign ‘risk scores’ predicting which patients are likely to develop diseases such as skin cancer, and that discriminate against minority ethnic groups.
Sophisticated models can detect signs of depression and anxiety by analyzing our voices, and clinical studies have corroborated the technology. The problem? One startup building such models lacks training data for Black and Hispanic speakers.
If most training data underrepresents marginalized groups, the resulting systems are more likely to make decisions that exclude diverse communities.
For example, MIT researchers found that an AI system predicting mortality risk from chest X-rays was less accurate for Black patients than for white patients: the system had been trained on data from three hospitals serving predominantly white populations.
Conversely, when a diverse dataset is available, it unleashes the technology’s potential, as shown by AI systems that improve breast cancer screening for Black women.
These examples illustrate the importance of training data for equitable AI. Our training data must reflect the diversity and complexity of our society.
Creating diverse datasets improves AI for all.
Adding data from marginalized groups to AI training sets improves test performance for both minority and majority groups. This is documented in a paper (currently under peer review) by Chari et al., which also proposes a way to decide which minority groups to include based on how much information they share with other groups.
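Chari et al.’s specific selection method is not reproduced here. As a generic, hypothetical illustration of the underlying point, the sketch below compares a model trained without an underrepresented group’s data to one trained with it, on invented synthetic data where each group follows its own rule.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)

def make_group(n, weights, group_id):
    """Sample features/labels from a group-specific rule, tagging the group."""
    X = rng.normal(size=(n, 3))
    y = (X @ weights > 0).astype(int)
    X = np.hstack([X, np.full((n, 1), group_id)])  # group-indicator column
    return X, y

w_a = np.array([1.0, 0.5, -0.5])   # hypothetical rule for group A
w_b = np.array([-0.5, 1.0, 0.8])   # hypothetical rule for group B

Xa, ya = make_group(4000, w_a, 0)
Xb, yb = make_group(400, w_b, 1)
Xa_te, ya_te = make_group(1000, w_a, 0)
Xb_te, yb_te = make_group(1000, w_b, 1)

# Model trained on group A only vs. trained on both groups.
m_excl = RandomForestClassifier(random_state=0).fit(Xa, ya)
m_incl = RandomForestClassifier(random_state=0).fit(
    np.vstack([Xa, Xb]), np.concatenate([ya, yb]))

for name, m in [("A only", m_excl), ("A + B", m_incl)]:
    print(f"trained on {name}: "
          f"acc(A)={m.score(Xa_te, ya_te):.2f}  "
          f"acc(B)={m.score(Xb_te, yb_te):.2f}")
```

In this toy setup, adding group B’s data sharply improves accuracy for group B while leaving group A’s accuracy essentially unchanged, echoing the broader point that inclusion need not come at the majority’s expense.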
The inclusion of minority groups is an ethical imperative and crucial to improving AI performance and reducing bias.
The asymmetrical battle between profits and ethics.
We may face an asymmetrical battle. History teaches us that in the tech race, pursuing profits at all costs produced social media networks with devastating effects on society, and we chose to ignore the warning signs.
While the rise of ChatGPT and AI-driven vertical search engines promises to revolutionize how we search and interact with the internet, we must consider the potential pitfalls of this transformative technology. Investors, donors, and brands all have a role to play in addressing this ethical dilemma.
An opportunity for catalytic investing and grants.
Catalytic investing and grants can help address these challenges.
In healthcare and technology, we have seen examples of catalytic investors and grantors deploying capital to address specific market dynamics.
Researchers at Stanford’s HAI (Institute for Human-Centered Artificial Intelligence) have proposed using advance market commitments (AMCs) for training data. An AMC is a contract that guarantees a market for a product or service before it is developed or delivered, incentivizing private innovation and investment in areas with high social value but low commercial demand.
Impact investment and catalytic grants operate similarly. They deploy capital to overcome the initial barriers of low commercial demand or high initial entry costs.
For example, an entrepreneur may be unable to afford to build a dataset representing a particular marginalized group. A catalytic grant or investment can unlock the opportunity by funding that dataset, which many entrepreneurs pursuing inclusive AI solutions can then use.
By catalyzing innovation in areas with high social value but low commercial demand, impact capital can help to create new markets and opportunities for growth.
Training data is not just a technical issue. It is a social and ethical issue that affects how we use AI to shape our future.
It is not in the stars to hold our destiny but in ourselves. Our responsibility is to address the root causes that prevent the creation of effective, fair, and inclusive AI solutions that reflect the values of a just and equitable society.