Algorithmic decision-making: Traps for the unwary
A day in the algorithmic life
Dressed smartly in a new suit, queuing up to buy your morning coffee at the AI-enabled coffee booth, you notice that you’ve been charged 30 pence more than your more casually attired fellow commuters. The differential pricing is the latest innovation from the ‘Equitable Coffee’ company, which uses AI vision and big data to lower prices for less well-off consumers. At lunchtime, chatting with your neighbour, you discover that you seem to be paying almost twice as much for cosmetics deliveries from the same online company. Later in the day, at last some good news: your car insurance company appears to have picked up on your tweets about the spiralling cost of premiums, and is now offering you a 20 percent discount to renew as a ‘valued customer’. No such luck, however, for your technophobe father, whose car insurance continues to soar, despite years of safe driving.
New data, new differences
Most of us would probably balk at such overt discrimination, but in fact businesses have always segmented and differentiated consumers to some degree. No one really questions half-price discounts for the elderly on Wednesdays, or new joiner discounts for credit cards or mobile phone services.
Yet such differentiation has previously been limited by technical or informational constraints: the cost of changing menus, or limited insight into consumers. And consumers are generally aware, at least subconsciously, that it is happening. What is changing now is the role of algorithms and big data, which are making it possible to filter, sift and segment individuals in ways never thought possible—for example, by the location and appearance of your house, your use of technology, your social media networks, your tone of voice or emotional reactions at a particular time of day. And, in many cases, the existence and basis of such algorithmic decision-making are not apparent to users.
More and more, algorithmic decision-making is going to play an important part in our lives, in deciding who gets access to public services, who gets a job and who doesn’t, and who pays what prices. AI systems could increase access for some while reducing it for others. Advances in voice-recognition technologies, for example, could play an important role in early-stage diagnosis of mental illness and detecting the onset of Alzheimer’s disease. However, they could also be used to exploit consumers’ emotional vulnerability in purchasing decisions.
Regulators taking notice
Regulators, competition authorities and equality watchdogs are starting to take notice. The UK Competition and Markets Authority (CMA) recently published research on the competition effects of algorithms and announced the launch of a new Digital Markets Unit to oversee a fresh digital regulatory regime promised by government. The European Commission recently published the Artificial Intelligence Act, a proposal for a regulation laying down harmonised rules on artificial intelligence, and the EU Vice President for Values and Transparency has warned of the dangers of ‘copying and pasting’ everyday racial and other biases into algorithms, with an AI white paper scheduled for publication in 2021. At the behest of the UK government, the Centre for Data Ethics and Innovation has recently reviewed the implications of algorithmic bias in recruitment, policing, financial services and local government.
What does this article cover?
This article investigates the role of algorithmic decision-making in two major areas of use: (i) healthcare screening; and (ii) price and service differentiation. The central message is that, used correctly, algorithmic decision-making can be a powerful tool for widening access. However, this can often come at the cost of new biases and exclusions, which are frequently unintentional and not apparent to algorithm designers. Our 'traps for the unwary' checklist proposes some considerations. Given the mounting interest from regulators and the very real potential for reputational damage, businesses need to fully understand the potential pitfalls and risks that such algorithms can entail.
(I) Algorithmic decision-making in healthcare
Early-stage diagnosis of many diseases—ranging from diabetes to Alzheimer’s to pancreatic cancer—could save many lives and dramatically cut the costs of treatment. Yet early-stage screening has traditionally been expensive to operate because of its labour-intensive nature. Moreover, certain parts of the population can be hard to reach for such testing.
AI technology offers the prospect of much faster screening and detection, drawing on both clinical and non-clinical data. A deep learning system developed by Google’s DeepMind in collaboration with Moorfields Eye Hospital, for example, scanned retinal images to predict the onset of an extreme form of macular degeneration, the most common cause of vision loss among the elderly. The algorithm was able to outperform 5 out of 6 human physicians in predicting the onset of macular degeneration within 6 months.
More tantalising, however, is the prospect of ‘advance warning’ AI technology that can alert us to problems before we are even aware of them: your mobile phone or smartwatch becomes your instant doctor. Data from consumer wearables such as smartwatches can be analysed to spot physical ailments or diseases. One recent study reported in Nature showed that smartwatch readings of biological data could identify coronavirus infections in 63% of cases, before the symptoms became evident.
Especially promising is the use of AI speech-based systems in mental health. An algorithm developed by researchers at the University of Colorado Boulder is able to detect signs of depression, schizophrenia and other mental-health problems with greater accuracy than psychiatrists can. Individuals answer questions on a mobile app, with the recordings then analysed by an algorithm to spot tell-tale signs of mental illness such as changes in frequency and tone of language. Such developments offer the eventual prospect of low-cost diagnosis of mental illness through Alexa or other voice technology systems.
Yet there is concern that poorly calibrated algorithms could restrict access to healthcare on the basis of sex, race, or other characteristics. One major problem is incomplete health data. Certain racial or socioeconomic groups may be under-represented in data because of previous difficulties in accessing treatment or because their medical records are fragmented across different institutions. Another is known as ‘the healthy volunteer effect’. In one study the UK Biobank (a database with genomic data on approximately 500,000 people in the UK) was found not to be representative of the general population, as those who volunteer were typically healthier, less likely to be obese, and better off than the average. More generally, different ethnic groups and genders will often have different genetic markers for diseases (sickle cell disease is more closely linked to people of African or Mediterranean heritage, for example), severely affecting the accuracy of algorithms trained predominantly on those of European ancestry.
This raises the question of what can be considered fair or unbiased outcomes in algorithmic decision-making in relation to different groups of the population. Unfortunately, there is no single measure of unfairness, and different measures will often conflict with each other. However, economics can help shed light on the various measures and the trade-offs and costs involved, leading to better understanding of risks of alternative decisions.
In the case of healthcare screening, one measure of fairness is predictive accuracy or predictive parity—does the algorithm correctly predict outcomes or risks for those flagged by algorithms as high-risk? Another is the risk of false negatives—incorrectly identifying an ill individual as being healthy. And then there is the risk of false positives—identifying someone as ill when in fact they are healthy.
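The three measures above can be written compactly in terms of screening outcomes. The notation is an assumption introduced here rather than taken from the article: TP and FP are the numbers of flagged people who are truly ill and truly healthy, FN is the number of ill people who are not flagged, and TN is the number of healthy people who are not flagged.

```latex
\[
\mathrm{PPV} = \frac{TP}{TP + FP}, \qquad
\mathrm{FPR} = \frac{FP}{FP + TN}, \qquad
\mathrm{FNR} = \frac{FN}{FN + TP}
\]
```

Here PPV (positive predictive value) corresponds to predictive accuracy, FPR to the false positive rate and FNR to the false negative rate.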
Illustration of different algorithmic biases in healthcare screening
A healthcare organisation uses a machine-learning algorithm to screen patients for early-stage diabetes. The algorithm’s success rate in detecting diabetes from samples is 66%, roughly equivalent to that of human doctors, but results are delivered faster and at a lower cost. However, some have claimed that the algorithm is unfair among different racial groups. The algorithm designers respond by arguing that the prediction rates of diabetes are identical for each group.
Here we can utilise the statistical approach set out by Chouldechova (2017), in her analysis of algorithmic fairness in the context of recidivism prediction in the US. Assume there are two groups of people to be tested, one consisting of 12 white people and the other of 16 black people. An asterisk denotes a true (positive) case of diabetes. The bold letters are those cases identified (flagged) as high risk by the algorithm. As can be seen from the visual, the algorithm flags some cases (bold W and bold Bs) that are not positive cases of diabetes in reality.
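[Figure: grid of Ws and Bs, with an asterisk marking each true case of diabetes and bold marking each case flagged by the algorithm]
- Group W: 12 people, 3 true cases of diabetes, 3 flagged as high risk (2 of them true cases)
- Group B: 16 people, 7 true cases of diabetes, 6 flagged as high risk (4 of them true cases)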
The first question is how well the algorithm performs in detecting true positive cases of diabetes among those whom it has flagged as high risk, which we call predictive accuracy. In Group W, the algorithm identifies 3 people as having diabetes (bold), of whom 2 actually have the disease (bold asterisk). In Group B, the algorithm identifies 6 people (bold) as having diabetes, of whom 4 actually have the disease (bold asterisk). The predictive accuracy is therefore 2 out of 3 (about 0.67) in each group, and the algorithm achieves predictive parity.
However, the two groups differ in their error rates. In Group W the algorithm incorrectly marks 1 out of 9 people who do not have diabetes as high risk (a false positive rate of 11%), while Group B has a false positive rate of 2 out of 9 (22%). In this example a false positive might mean a worrying time for the patient, with follow-up tests needed for a more accurate diagnosis.
More alarming, however, is the difference in rates of false negatives, that is, failing to detect the disease at all. For Group W, the algorithm’s rate of false negatives is 1 out of 3, or 0.33. For Group B, the rate of false negatives is 3 out of 7, or 0.43. Put differently, the algorithm is almost 29% more likely to miss a true case of diabetes in Group B than in Group W. The implication is that such cases may go undetected for much longer, with more serious long-term complications.
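The rates quoted for the two groups can be reproduced from four counts per group. The sketch below is illustrative only; the function and variable names are assumptions, and the counts are those given in the illustration.

```python
def screening_rates(n_people, n_cases, n_flagged, n_flagged_cases):
    """Return predictive accuracy (PPV), false positive rate and false negative rate."""
    true_pos = n_flagged_cases
    false_pos = n_flagged - n_flagged_cases
    false_neg = n_cases - n_flagged_cases
    n_healthy = n_people - n_cases
    return {
        "ppv": true_pos / n_flagged,    # share of flagged people who are truly ill
        "fpr": false_pos / n_healthy,   # share of healthy people who are flagged
        "fnr": false_neg / n_cases,     # share of ill people who are missed
    }


# Counts taken from the illustration.
group_w = screening_rates(n_people=12, n_cases=3, n_flagged=3, n_flagged_cases=2)
group_b = screening_rates(n_people=16, n_cases=7, n_flagged=6, n_flagged_cases=4)

for name, rates in (("W", group_w), ("B", group_b)):
    print(name, {key: round(value, 3) for key, value in rates.items()})

# Relative increase in the chance of missing a true case in Group B versus Group W.
relative_miss = group_b["fnr"] / group_w["fnr"] - 1
print(round(relative_miss, 3))
```

Predictive accuracy comes out equal for the two groups, while both error rates are higher for Group B.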
In fact, in any situation where the prevalence of the disease varies between groups in the population, it is statistically impossible for an imperfect algorithm to achieve predictive parity and identical false positive and false negative rates between groups. In this case the prevalence of diabetes is 3 out of 12 in Group W (0.25) and 7 out of 16 (0.44) in Group B.
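The impossibility result follows from an identity set out in Chouldechova (2017) that ties the three measures to prevalence. In the notation assumed here, p is the prevalence of the disease in a group, PPV is predictive accuracy, and FPR and FNR are the false positive and false negative rates.

```latex
\[
\mathrm{FPR} = \frac{p}{1 - p}\cdot\frac{1 - \mathrm{PPV}}{\mathrm{PPV}}\cdot\bigl(1 - \mathrm{FNR}\bigr)
\]
\[
\text{Group W: } \frac{1/4}{3/4}\cdot\frac{1/3}{2/3}\cdot\frac{2}{3} = \frac{1}{9},
\qquad
\text{Group B: } \frac{7/16}{9/16}\cdot\frac{1/3}{2/3}\cdot\frac{4}{7} = \frac{2}{9}
\]
```

If PPV is the same in both groups but p differs, the right-hand side can only match across groups if FNR differs. Equalising FNR instead forces FPR apart.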
Moreover, the greater the difference in prevalence, the bigger the disparity in false negative rates. For example, holding everything else constant, an increase in the prevalence of diabetes in Group B to 0.56 (9 out of 16) would raise the rate of false negatives to 56%: in other words, 56% of true positive cases of the disease would be overlooked for treatment. The implication: biases must be defined and measured in a systematic way, so that decision makers can understand the true risks and trade-offs involved in algorithmic decision-making (see checklist section on ‘Identify and measure biases’).
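A short sketch of this sensitivity follows. It assumes that "holding everything else constant" means the algorithm's flags in Group B do not change (6 people flagged, 4 of them true cases) while the number of true cases in the group rises. That reading is an assumption, chosen because it reproduces the 56% figure.

```python
def false_negative_rate(n_cases, n_flagged_cases):
    """Share of true cases that the algorithm fails to flag."""
    return (n_cases - n_flagged_cases) / n_cases


GROUP_B_SIZE = 16
FLAGGED_CASES = 4  # true cases among the 6 people flagged, held fixed (assumption)

for n_cases in (7, 8, 9):
    prevalence = n_cases / GROUP_B_SIZE
    fnr = false_negative_rate(n_cases, FLAGGED_CASES)
    print(f"cases={n_cases}  prevalence={prevalence:.4f}  false negative rate={fnr:.4f}")
```

With 9 true cases out of 16, 5 of the 9 go undetected, which is the 56% quoted above.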
(II) Algorithmic price and quality discrimination
Discounts for frequent users or bulk buyers. No-claims discounts for safe drivers. Home insurance rates that vary by locality.
While price discrimination often gets a bad press, it has been commonly applied in many retail and consumer markets, and can often be justified by differences in risk, consumer preferences or cost factors. However, big data and algorithmic pricing now make it possible to achieve much more finely grained differentiation in consumer preferences, often on the basis of factors that are hard to observe or discern. This sets up a potential conflict between increasing personalisation and access, on one hand, and risk of unintended bias or discrimination (which is especially concerning if discrimination occurs against vulnerable customers) on the other.
While algorithmic price discrimination is controversial, it can in some cases improve consumer welfare through better targeting of services or by making otherwise uneconomic services feasible to provide. Here economic analysis can illustrate the trade-off between economic efficiency, on one hand, and conceptions of fairness on the other.
Illustration 1: Algorithmic price discrimination
FashionPlate (a fictional company) is looking to provide its haute couture advisory and delivery service in Rustica, a semi-rural area in the south west of the UK. It knows from retail surveys and census data that there are broadly three groups of customers willing to pay different prices for its products (i.e. they have different willingness-to-pay or WTP). However, it has no way of identifying who these customers are.
Demand schedule for FashionPlate
- Group 1: 200 customers willing to pay £9 each = total revenue of £1800
- Group 2: 300 customers willing to pay £3 each = total revenue of £900
- Group 3: 500 customers willing to pay £1 each = total revenue of £500
There are substantial fixed costs of £1700 per month in running the distribution network. But once the distribution network is in place, the cost of serving any incremental consumer is low, approximately £1.
Without the ability to identify the willingness to pay of individual customers, an operator would likely charge a single price of £9, generating revenue of £1800 from Group 1. However, this would be less than total costs of £1900 (£200 variable and £1700 fixed), so it would be unprofitable to provide the service.
By contrast, if the company charges all customers the lowest price of £1, it will make a significant loss on the service, as total costs (£1700 fixed + £1000 variable = £2700) exceed total revenues of £1000.
In fact, in this example, there is no single price that will give FashionPlate a profit and allow the service to be provided.
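The claim that no single price is profitable can be checked by evaluating each candidate price. The sketch below uses the figures from the demand schedule; the function and variable names are assumptions. Any price strictly between two willingness-to-pay levels could be raised to the next level without losing a customer, so only the three levels need checking.

```python
# Demand schedule from the illustration: (number of customers, willingness to pay in GBP).
GROUPS = [(200, 9), (300, 3), (500, 1)]
FIXED_COST = 1700    # GBP per month
MARGINAL_COST = 1    # GBP per customer served


def single_price_profit(price):
    """Profit when everyone is charged `price`; only customers with WTP >= price buy."""
    buyers = sum(n for n, wtp in GROUPS if wtp >= price)
    revenue = price * buyers
    cost = FIXED_COST + MARGINAL_COST * buyers
    return revenue - cost


profits = {wtp: single_price_profit(wtp) for _, wtp in GROUPS}
print(profits)
```

Every entry is negative: the loss is £100 at a price of £9, £700 at £3 and £1700 at £1.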
To pursue its mission of bringing fashion to the masses, the company decides to deploy an algorithm which uses geographic location, area-based survey data and social media feeds to provide a more detailed view of individuals in the area according to the maximum price they would be willing to pay for this kind of service. By charging different prices to different groups according to their willingness to pay, the company is able to generate enough revenue to cover total costs and make a small profit. Under the algorithmically determined pricing scheme, total revenues of £3200 (£1800 + £900 + £500) now exceed total costs of £2700, making it profitable for FashionPlate to provide its services in the area. While access to the service has been enabled, however, some groups are paying much more than others.
More problematically, a consumer organisation decides to investigate FashionPlate’s pricing using surveys and discovers that, on average, female users are paying £3.64 for the service, compared with £2.76 paid by male subscribers (see the demand schedule with gender breakdown below). While FashionPlate’s algorithm is ostensibly blind to gender (in fact, it doesn’t even collect such data), it turns out that there is a strong indirect correlation between willingness to pay (and hence FashionPlate’s prices) and gender, a protected characteristic under the UK Equality Act 2010.
Demand schedule with gender breakdown
- Group 1: 200 customers willing to pay £9 each = total revenue of £1800 (Gender split: 60% female; 40% male)
- Group 2: 300 customers willing to pay £3 each = total revenue of £900 (Gender split: 60% female; 40% male)
- Group 3: 500 customers willing to pay £1 each = total revenue of £500 (Gender split: 40% female; 60% male)
Figure 1: Algorithmic price discrimination Source: Frontier Economics
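The average prices by gender follow directly from the demand schedule with gender breakdown. The sketch below is a minimal reconstruction, assuming each customer is charged exactly their group's willingness to pay; the names are illustrative.

```python
# (customers, price charged in GBP, share of the group that is female), from the demand schedule.
GROUPS = [(200, 9, 0.6), (300, 3, 0.6), (500, 1, 0.4)]
FIXED_COST = 1700
MARGINAL_COST = 1


def average_price(groups, female):
    """Customer-weighted average price paid by female (or male) customers."""
    paid = 0.0
    heads = 0.0
    for n, price, female_share in groups:
        share = female_share if female else 1 - female_share
        paid += n * share * price
        heads += n * share
    return paid / heads


revenue = sum(n * price for n, price, _ in GROUPS)
cost = FIXED_COST + MARGINAL_COST * sum(n for n, _, _ in GROUPS)
female_avg = average_price(GROUPS, female=True)
male_avg = average_price(GROUPS, female=False)
print(revenue, cost, revenue - cost)
print(round(female_avg, 2), round(male_avg, 2))
```

The price gap arises even though gender never enters the pricing rule, because women make up a larger share of the two higher-paying groups.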
Re-calibrating pricing to mitigate bias
In this case integrating data on gender into the algorithm can help identify and correct such unintended distributional effects. Such data may suggest adjustments to pricing that can remove the distortion in average prices between groups. For example, by lowering prices for females in the highest WTP group from £9 to £5.33, the average price for females drops to £2.76, equalising the average prices between males and females. While this change means that there is some differentiation in prices within the highest willingness to pay group, it addresses the average bias on the observable and protected characteristic, namely gender. Under this stratified pricing scheme, FashionPlate’s total revenues are lowered to £2760, but are still sufficient to cover costs and make a small profit.
In this case, integrating data on the protected characteristic and changing the pricing scheme can overcome the problem of average bias while increasing access to the service.
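The recalibrated price can be derived by solving for the female price in the highest willingness-to-pay group that brings the female average down to the male average. The sketch below assumes all other prices stay as before; the names are illustrative.

```python
# Customers by (WTP group, gender), derived from the demand schedule with gender breakdown.
CUSTOMERS = {
    ("high", "F"): 120, ("high", "M"): 80,
    ("mid", "F"): 180, ("mid", "M"): 120,
    ("low", "F"): 200, ("low", "M"): 300,
}
PRICES = {"high": 9.0, "mid": 3.0, "low": 1.0}
TOTAL_COST = 2700.0  # GBP 1700 fixed plus GBP 1 for each of 1000 customers

male_paid = sum(n * PRICES[grp] for (grp, g), n in CUSTOMERS.items() if g == "M")
male_heads = sum(n for (_, g), n in CUSTOMERS.items() if g == "M")
male_avg = male_paid / male_heads

female_heads = sum(n for (_, g), n in CUSTOMERS.items() if g == "F")
female_other_paid = sum(
    n * PRICES[grp] for (grp, g), n in CUSTOMERS.items() if g == "F" and grp != "high"
)

# Price for women in the high group that makes the female average equal the male average.
adjusted_high_price = (male_avg * female_heads - female_other_paid) / CUSTOMERS[("high", "F")]

total_revenue = male_paid + male_avg * female_heads
profit = total_revenue - TOTAL_COST
print(round(adjusted_high_price, 2), round(total_revenue, 2), round(profit, 2))
```

Other adjustments could equalise the averages too (for example, spreading the reduction across the two higher groups); this one simply mirrors the adjustment described above.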
Illustration 2: Algorithmic quality differentiation
Algorithms can also be used to differentiate consumers according to their willingness to pay for (or accept) varying levels of quality provision.
Consider the example of UtilityCo, a fictional utility provider that operates in a sparsely populated region. The region is covered by a universal service regulation that requires all customers to be offered a single tariff of £1 for the service in question regardless of location. UtilityCo has the freedom not to operate in the area, but, if it does so, it must abide by the universal service obligation.
UtilityCo has the technical ability to offer three different quality levels for its service, with different costs associated with serving one additional customer (marginal cost) where the lower the quality level, the lower the marginal cost.
Numeric example: Marginal costs for UtilityCo are as follows:
- ‘Premier’ quality service: cost of serving one additional customer (marginal cost) = £5
- ‘Standard’ quality service: marginal cost = £1
- ‘Basic’ quality service: marginal cost = £0.5
Fixed costs are £200
Willingness-to-pay for these quality levels varies as follows:
- Group 1: 200 customers willing to pay £9 each for the premier service, not willing to accept basic quality
- Group 2: 300 customers willing to pay £2 each for the standard service, not willing to accept basic quality
- Group 3: 500 customers willing to pay £1 each for the basic service
From surveys it knows that Groups 1 and 2 are not willing to accept a basic quality service. It also knows there are broad variations in willingness-to-pay for service quality (as shown in the box), although it isn’t able to observe these differences at the individual household level.
Given its universal service obligation, in this situation UtilityCo generates total revenue of £1000 by providing services at the standard level but incurs costs of £1200 (£1000 variable and £200 fixed), making a loss on the service overall.
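Note that, if UtilityCo decided to provide only a basic service, Groups 1 and 2 would not accept it. UtilityCo would then serve Group 3 alone, earning revenue of £500 against costs of £450 (£250 variable and £200 fixed), and half of the region’s customers would be left without a service they are willing to accept.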
Figure 2: UtilityCo’s revenues and costs when required to charge a single price to all customers (standard quality) Source: Frontier Economics
UtilityCo now decides to engage a geographic algorithm that can predict willingness-to-pay for different levels of service quality at the household level.
Equipped with the ability to identify customers according to their quality preferences, UtilityCo can now vary quality provision to make the service profitable to provide. It continues to offer a single price of £1 to all customers, generating revenues of £1000 as before. For Group 3, however, it provides basic quality, making a saving of £250 on its variable costs (500 * £0.5). Total revenues of £1000 now exceed total costs of £950, yielding a small profit for the service overall.
However, the regulator receives complaints from some households in Group 3 who are now experiencing lower service levels, a particular concern given that these are more likely to be low-income or vulnerable households. In response, the regulator decides to introduce a minimum quality level. However, lacking information on costs and demand, the regulator sets the quality floor too high, at premier level, with the result that UtilityCo now makes significant losses (total revenue of £1000 against total costs of £5200) and decides to withdraw from the region. In fact, even setting a quality floor at standard level would put UtilityCo back in the same loss-making position as before.
An alternative is for UtilityCo and the regulator to share information on the workings of the algorithm and adjust prices to remove the adverse quality impact on Group 3. For example, the regulator could allow UtilityCo to charge Group 1 customers £9 (their maximum willingness-to-pay) for provision of a premier-service level, generating an extra £1600 in revenue from this group against £800 in extra variable costs. The net gain could then be used to cross-subsidise the extra cost of providing standard service to Group 3 (£250) while allowing UtilityCo to increase its profits overall. While this represents a small departure from the universal-service pricing policy, it might be justified in terms of the overall welfare impact (both Groups 1 and 3 get a better level of quality, while Group 2 stays the same). This example illustrates the importance of close collaboration between regulators and service providers in algorithmic design and usage.
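The four regulatory scenarios in this illustration can be compared side by side. The sketch below uses the marginal costs, fixed costs and group sizes given above, and assumes, as the illustration does, that every customer takes up the offer made to their group. The scenario names are labels introduced here.

```python
MARGINAL_COST = {"premier": 5.0, "standard": 1.0, "basic": 0.5}  # GBP per customer
FIXED_COST = 200.0
GROUP_SIZE = {1: 200, 2: 300, 3: 500}


def profit(plan):
    """Profit for a plan mapping each group to (quality level, price charged).

    Assumes every customer in each group takes up the offer.
    """
    revenue = sum(GROUP_SIZE[g] * price for g, (_, price) in plan.items())
    variable = sum(GROUP_SIZE[g] * MARGINAL_COST[q] for g, (q, _) in plan.items())
    return revenue - FIXED_COST - variable


SCENARIOS = {
    "standard_for_all": {1: ("standard", 1), 2: ("standard", 1), 3: ("standard", 1)},
    "basic_for_group_3": {1: ("standard", 1), 2: ("standard", 1), 3: ("basic", 1)},
    "premier_floor": {1: ("premier", 1), 2: ("premier", 1), 3: ("premier", 1)},
    "premier_at_9_for_group_1": {1: ("premier", 9), 2: ("standard", 1), 3: ("standard", 1)},
}

results = {name: profit(plan) for name, plan in SCENARIOS.items()}
for name, value in results.items():
    print(f"{name}: {value:+.0f}")
```

Only the algorithmic quality split and the negotiated premier tariff leave UtilityCo in profit, and only the latter does so without lowering quality for Group 3.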
(III) Traps for the unwary: A checklist
So how can businesses harness algorithms to increase access and personalisation while minimising the risks of regulatory missteps and reputational damage when things go wrong? While much will depend on individual context, we propose below a checklist of six areas that firms, algorithm designers, and regulators should consider when evaluating the fairness of systems.
I. Check for indirect discrimination
In the UK, discrimination is outlawed under the Equality Act 2010 for nine protected characteristics, including sex, race, age and disability. When facing accusations of bias, algorithm designers will usually point out that their algorithms do not include data on protected characteristics, and indeed they may not even collect such data. However, discrimination can still occur indirectly, when an important variable in the algorithm is highly correlated with the characteristic in question. For example, an algorithm that gives discounts to customers who it predicts are likely to switch mobile-phone network on the basis of social media use might well end up discriminating between younger and older consumers. An algorithm that predicts insurance risks by locality might inadvertently correlate with the distribution of racial groups.
Indirect correlation could still be construed as unfair or reputationally damaging—for example an algorithm that seems to target or exclude lower-income consumers or consumers with impulsive buying behaviour.
As we showed in the FashionPlate illustration, it will often be possible in such cases to design a pricing scheme that is uncorrelated with the protected characteristic in question, mitigating the bias while still allowing sufficient variation across individuals to improve personalisation or access outcomes.
II. Consider the counterfactual
Economic analysis will increasingly be critical in the fight for good algorithmic design. Modelling and cost-benefit analysis can be deployed to highlight the benefits and costs of alternative algorithm designs, as well as simulate the impact of these algorithms on different groups of the population.
Economic modelling will sometimes show that a counterfactual world with no algorithmic decision-making can lead to worse outcomes: for example, reduced rates of disease detection overall or increased morbidity, all at a higher cost in terms of human resources. In many instances the counterfactual (decision-making by humans) may be subject to greater biases than decision-making by machines. Scenarios and simulations can help decision makers map a comprehensive picture of the benefits and costs of alternative courses of action.
Economic analysis can also help in good regulatory design, as regulators are forced to balance sometimes competing objectives of economic efficiency and fairness. As shown previously in Illustration 2, economics can inform how information sharing influences regulatory design, and contributes to improving outcomes for businesses, regulators, and consumers.
III. Identify and measure biases
As shown in Section I, there is no single measure of fairness or bias. Addressing one type of bias might introduce another type of bias. Moreover, context always matters. Some mistakes are more costly or consequential than others, but it requires human judgement to assess these. A false positive in an algorithmic test for lung cancer might mean a worrying period followed by further tests; a false negative could be a death sentence. Economic analysis can help identify these different biases and the interdependencies between them. Beyond the quantification of prediction rates and errors, economic analysis brings many tools to help measure the costs and benefits of these risks: for example, valuation of health outcomes through quality-adjusted life years (QALYs), shadow prices for lost output, discounting models and sensitivity analysis.
IV. Look out for imported bias
Human biases can be imported to algorithms in various ways. For example, recruitment algorithms are trained on data for previous successful and unsuccessful hires, which will likely be influenced by the current profile of the workforce and attitudes to ‘good fit’ within the organisation. But bias may also be imported from the users of a particular platform or algorithm. One experimental study of Airbnb by researchers at Harvard Business School in 2015 found that ‘applications from guests with distinctively African-American names were 16% less likely to be accepted relative to identical guests with distinctively white names.’ Another study in 2015 found that Asian hosts on the platform earned 20% less than white hosts. Airbnb has since made strenuous efforts to combat bias, expelling racially biased hosts, removing racial identifiers, and working with civil rights organisations to research the causes of bias.
V. Check for skewed incentives
Economic analysis can also be deployed to understand the incentive structures implicit in an algorithm and whether these are contributing to unintentionally skewed outcomes. Algorithms, like humans, respond to incentives, but these can often be poorly understood or designed, leading to unintentional outcomes.
In particular, problems can occur when algorithms are designed to optimise costs or revenue and these vary between groups. In one famous case, an algorithm designed to attract more women into technology jobs was found to be predominantly attracting male applicants, largely because the advertising costs of reaching female audiences are generally much higher than for male audiences.
Another example was the decision of the New York Insurance regulator to open an investigation into allegations of racial bias in United Healthcare’s Optum algorithm, which is used to identify patients with chronic diseases who could benefit from additional home-care support and pharmaceutical services. A research study cited in Science argued that the algorithm prioritised healthier white patients over sicker black patients, largely because it ranked patients by medical costs and these were generally higher for white patients with similar conditions. In this case a relatively simple fix—refocusing the algorithm on health outcomes—was available.
VI. Scrutinise training data
Training data is rarely representative of the population at large. A recruitment dataset is representative only of the people who managed to be interviewed and selected by the organisation. A health database is representative only of the people who fall ill and manage to get treatment. Structural inequalities in society mean that many people are ‘missing’ in training data of various kinds. Economic analysis can highlight structural imbalances in training data (by comparing against population-level data) and devise appropriate weighting schemes to address in-built biases.
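One simple weighting scheme of this kind is post-stratification: each record is weighted by its group's population share divided by its share of the training sample. The sketch below is illustrative only. The group labels, sample and population shares are hypothetical, and real applications would typically weight on several characteristics at once.

```python
from collections import Counter


def poststratification_weights(sample_groups, population_shares):
    """Weight per group = population share / sample share."""
    counts = Counter(sample_groups)
    total = len(sample_groups)
    return {g: population_shares[g] / (counts[g] / total) for g in counts}


# Hypothetical training sample: group "B" is 20% of records but 40% of the population.
sample = ["A"] * 80 + ["B"] * 20
population = {"A": 0.6, "B": 0.4}

weights = poststratification_weights(sample, population)

# Check that the weighted sample now matches the population shares.
weighted = Counter()
for g in sample:
    weighted[g] += weights[g]
weighted_total = sum(weighted.values())
weighted_shares = {g: weighted[g] / weighted_total for g in weighted}
print(weights)
print(weighted_shares)
```

Weighting can only rebalance groups that appear in the data; a group that is entirely missing from the sample cannot be recovered this way.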
Algorithmic decision-making, backed by vast amounts of digital data, is increasingly a fact of life in public and private services, influencing decisions both large and small. Greater use of economic tools can help regulators and businesses avoid unintentional biases and distortions, while also creating better outcomes for business and society.