[Image: extreme closeup of a professional microphone. Caption: Microphones are how our machines listen to us.]

We’re outsourcing ever more of our decision making to algorithms, partly as a matter of convenience, and partly because algorithms are ostensibly free of some of the biases that humans suffer from. Ostensibly. As it turns out, algorithms that are trained on data that’s already subject to human biases can readily recapitulate them, as we’ve seen in places like the banking and judicial systems. Other algorithms have just turned out to be not especially good.

Now, researchers at Stanford have identified another area with potential issues: the speech-recognition algorithms that do everything from basic transcription to letting our phones fulfill our requests. These algorithms seem to have more issues with the speech patterns used by African Americans, although there’s a chance that geography plays a part, too.

A non-comedy of errors

Voice-recognition systems have become so central to modern technology that most of the large companies in the space have developed their own. For the study, the research team tested systems from Amazon, Apple, Google, IBM, and Microsoft. While some of these systems are sold as services to other businesses, the ones from Apple and Google are as close as your phone. Their growing role in daily life makes their failures intensely frustrating, so the researchers decided to have a look at whether those failures display any sort of bias.

To do so, the researchers obtained large collections of spoken words. Two of these were each dominated by a single group: African Americans from a community in North Carolina, and whites in Northern California. The remaining samples came from mixed communities: Rochester, New York; Sacramento, California; and Washington, DC. These recordings were run through each of the five systems, and accuracy was determined by comparing the results to transcripts produced by humans.

Based on a score called word error rate (which counts inserted and missing words as well as misinterpreted ones), all of the systems did reasonably well, scoring below 0.5. (Apple's system was the worst and Microsoft's the best by this measure.) In every case, though, recordings of African American speakers produced worse word error rates than those from recordings of white speakers; in general, the error rate nearly doubled.
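To make the metric concrete: word error rate can be computed as the word-level edit distance between a human reference transcript and the machine's output, divided by the length of the reference. The sketch below is a minimal illustration, not the paper's exact scoring pipeline, and the phrases in it are invented.

```python
# Minimal sketch of word error rate (WER):
# WER = (substitutions + insertions + deletions) / words in the reference,
# computed with a standard word-level edit-distance table.

def word_error_rate(reference: str, hypothesis: str) -> float:
    ref = reference.split()
    hyp = hypothesis.split()
    # dp[i][j] = minimum edits to turn the first i reference words
    # into the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i          # deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j          # insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            if ref[i - 1] == hyp[j - 1]:
                dp[i][j] = dp[i - 1][j - 1]
            else:
                dp[i][j] = 1 + min(
                    dp[i - 1][j - 1],  # substitution
                    dp[i - 1][j],      # deletion
                    dp[i][j - 1],      # insertion
                )
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

# A human transcript vs. a hypothetical machine transcript (both invented):
reference = "turn the lights off in the kitchen"
hypothesis = "turn the light off in kitchen"
print(round(word_error_rate(reference, hypothesis), 2))  # 0.29: one substitution plus one deletion over 7 words
```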

The effect was more pronounced among African American men. White men and women had error rates that were statistically indistinguishable, at 0.21 and 0.17, respectively. The rate for African American women averaged 0.30, while for African American men it rose to 0.41.

How important are these differences? The authors suggest it depends on how you define usability: above a certain error rate, it becomes more annoying to fix an automated transcript than to produce one yourself, or your phone ends up doing the wrong thing more often than you're willing to tolerate. To get at this, the authors checked how often individual audio snippets exceeded a conservative word error rate threshold of 0.5. They found that over 20 percent of the phrases spoken by African Americans would fail this standard; fewer than 2 percent of those spoken by whites would.
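The usability check itself is simple enough to sketch: given a list of per-snippet word error rates, count the share that crosses the 0.5 line. The scores below are made up; only the threshold comes from the paper.

```python
# Share of snippets whose word error rate exceeds a conservative 0.5 threshold.
# The per-snippet WER values are hypothetical; the paper reports >20% of
# snippets from African American speakers and <2% from white speakers
# crossing this line.

def share_unusable(wers, threshold=0.5):
    """Fraction of snippets with WER above the threshold."""
    return sum(1 for w in wers if w > threshold) / len(wers)

snippet_wers = [0.12, 0.65, 0.31, 0.08, 0.55, 0.20]   # invented per-snippet scores
print(f"{share_unusable(snippet_wers):.0%} of snippets exceed WER 0.5")  # 33%
```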

Listen up

So what’s going on? There may be a bit of a geographical issue. California speakers are often considered accent-free from an American perspective, and the two samples from that state had very low error rates. Rochester had a rate similar to California’s, while the District of Columbia’s was closer to that of the rural North Carolina town. If there is a geographic influence, though, we’re going to need a much larger sample to separate it out.

After that, the researchers analyzed the language usage itself. Since they didn’t have access to the algorithms behind the commercial systems, they turned to some open source packages that perform similar functions. They measured the software’s handle on language use via a figure called perplexity, which reflects how accurately a model predicts the next word in a sentence (lower values mean better predictions). And, by this measure, the systems were better at handling the usage of African American speakers. What’s going on?
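As a rough illustration of the measure, perplexity is the exponentiated average negative log-probability a language model assigns to the words it actually sees. The sketch below uses invented probabilities rather than anything from the study.

```python
# Sketch of perplexity: lower values mean the model finds the text
# easier to predict. The probabilities are invented for illustration.
import math

def perplexity(next_word_probs):
    """next_word_probs: probability the model assigned to each actual next word."""
    n = len(next_word_probs)
    return math.exp(-sum(math.log(p) for p in next_word_probs) / n)

# Probabilities a hypothetical model assigned to the words it actually saw:
print(round(perplexity([0.2, 0.1, 0.4, 0.25]), 1))  # ~4.7
```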

The researchers found two conflicting tendencies. African Americans would, on average, draw on a smaller overall vocabulary than their white counterparts. But their phrasing turned out to be more complicated: in many cases, they dropped words from their sentences when their listeners could easily infer them.

Finally, there’s the matter of how attuned the commercial systems are to African American voices. To explore this, the researchers searched through the transcripts for cases where African American and white speakers used the exact same phrases. When those were run through the systems, the word error rate for African American speakers was still higher than for whites, suggesting that the systems’ handling of the voices themselves, and not just vocabulary or phrasing, contributed to the overall performance gap.
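A hypothetical sketch of that matched-phrase comparison, not the paper's exact procedure: keep only phrases that show up in both groups' recordings, then compare each group's average error rate on that shared set. Every record below is invented.

```python
# Toy matched-phrase comparison: (phrase, speaker group, measured WER).
from collections import defaultdict
from statistics import mean

records = [
    ("what time is it", "aa", 0.25), ("what time is it", "white", 0.10),
    ("call my sister",  "aa", 0.45), ("call my sister",  "white", 0.20),
    ("play some music", "aa", 0.05),  # no white match; dropped from comparison
]

by_phrase = defaultdict(lambda: defaultdict(list))
for phrase, group, wer in records:
    by_phrase[phrase][group].append(wer)

# Keep only phrases spoken by both groups, then average each group's WER.
shared = {p: g for p, g in by_phrase.items() if {"aa", "white"} <= g.keys()}
for group in ("aa", "white"):
    scores = [w for g in shared.values() for w in g[group]]
    print(group, round(mean(scores), 2))
# aa 0.35, white 0.15 on this toy data: a gap remains even on identical wording
```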

An effective voice-recognition system has to combine a number of factors—actual word recognition, language usage, and likely meanings—in order to successfully recognize sentences or predict ensuing words. Existing commercial systems appear to fall a bit short of that when it comes to some populations. These systems weren’t set up to be biased; it’s likely that they were simply trained on a subset of the diversity of accents and usages present in the United States. But, as we become ever more reliant on these systems, making them less frustrating for all their users should be a priority.

PNAS, 2020. DOI: 10.1073/pnas.1915768117