Uninvited listeners in your speakers

Voice assistants are programmed to react to “Alexa”, “Hey Siri” and “OK Google”, but they also respond to a large number of other words

July 21, 2020

It is likely that networked loudspeakers with voice assistants listen in on what users are saying far more frequently than they should. This is indicated by studies by a team at the Ruhr-Universität Bochum and the Bochum Max Planck Institute for Cyber Security and Privacy Protection. The researchers identified numerous English, German and Chinese words that unintentionally activate the voice assistants. In this way, parts of very private conversations may end up in the hands of the system manufacturers.

“OK, cool”, “on Sunday” or “daiquiri”: as researchers led by Dorothea Kolossa and Thorsten Holz, both professors at the Ruhr-Universität Bochum, have discovered, these and more than 1,000 other words and word combinations erroneously activate the networked loudspeakers from Google and Amazon, as well as Apple’s Siri. In English, depending on how the words are pronounced, Alexa understands “unacceptable” and “election” as a command to start listening, while Siri starts up when it hears “a city”.

If the systems incorrectly think that they are being addressed, they record a brief sequence of what has been said and transfer the data to the manufacturer in question, sometimes without the users noticing. Employees at the manufacturers then transcribe these mistakenly captured audio segments and check them for phonetic sequences that inadvertently start their systems. The aim is to make speech recognition more reliable.

A test of all the major manufacturers

"A letter" or "Alexa"? Amazon's voice assistant Alexa can be accidentally triggered by phonetic sequences from TV series, transmitting recordings to the cloud. The Bochum team has demonstrated this, amongst others, with episodes of Game of Thrones.

Activation test for Alexa

"A letter" or "Alexa"? Amazon's voice assistant Alexa can be accidentally triggered by phonetic sequences from TV series, transmitting recordings to the cloud. The Bochum team has demonstrated this, amongst others, with episodes of Game of Thrones.
https://www.youtube.com/watch?v=Bn0VKL-lTVY

The IT experts, including Maximilian Golla, who now conducts research at the Max Planck Institute for Cyber Security and Privacy Protection, tested the networked loudspeakers and integrated voice assistants produced by Amazon, Apple, Google, Microsoft and Deutsche Telekom, as well as three Chinese models by Xiaomi, Baidu and Tencent. They played them hours of German, English and Chinese audio material, including several episodes of “Game of Thrones”, “Modern Family” and the German crime series “Tatort”, as well as news programmes. They also played professional audio datasets of the kind used for training voice assistants.

Before conducting the tests, they fitted all the networked loudspeakers with a light-sensitive diode that registered when the activity display of the voice assistant lit up, the visual indication that the device had switched to active mode. The experimental setup also registered when a voice assistant transmitted data to an external recipient. Every time one of the devices switched to active mode, the researchers noted which audio sequence had triggered the response. They later assessed manually which terms had activated the voice assistants.
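The logging principle can be illustrated with a small sketch. The functions below (read_led_sensor, bytes_sent_to_cloud, current_audio_position) are hypothetical stand-ins for the light sensor and network monitoring described above, not the Bochum team's actual code; they only show how an accidental activation could be matched to the audio that was playing at the time.

```python
# Illustrative sketch only: all hardware and network interfaces are
# hypothetical placeholders, not the researchers' implementation.
import time
from datetime import datetime

def read_led_sensor() -> bool:
    """Hypothetical: True while the diode sees the activity display lit."""
    return False  # stub for illustration

def bytes_sent_to_cloud() -> int:
    """Hypothetical: cumulative bytes the speaker has sent to external hosts."""
    return 0  # stub for illustration

def current_audio_position() -> str:
    """Hypothetical: identifies the episode and timestamp currently playing."""
    return "episode@00:00:00"  # stub for illustration

def monitor(poll_interval: float = 0.1):
    """Log every moment the speaker wakes up or transmits data."""
    activations = []
    last_traffic = bytes_sent_to_cloud()
    while True:
        led_on = read_led_sensor()
        traffic = bytes_sent_to_cloud()
        if led_on or traffic > last_traffic:
            # Record which part of the played audio coincided with the trigger,
            # so the phrase can be identified manually afterwards.
            activations.append({
                "time": datetime.now().isoformat(),
                "audio_position": current_audio_position(),
                "led": led_on,
                "uploaded_bytes": traffic - last_traffic,
            })
        last_traffic = traffic
        time.sleep(poll_interval)
```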

In order to understand what turns certain terms into unintentional activation words, or “trigger words” as the researchers call them, they dissected the words into their smallest possible sound units and identified those units that the voice assistants frequently misinterpreted. On the basis of these findings, they constructed new activation words that also trigger the voice assistants.
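As a rough illustration of this kind of analysis, the following toy sketch compares the sound units of known accidental triggers with those of the real wake word and counts which units they share. The miniature phoneme dictionary and the trigger list are invented examples, not the researchers' data or method; a real analysis would use a full pronunciation lexicon and could then recombine the most frequently confused units into new candidate trigger words.

```python
# Toy sketch: which sound units of accidental triggers overlap with the wake word?
from collections import Counter

# Hand-written ARPAbet-like entries for illustration only; a real analysis
# would draw on a complete pronunciation dictionary such as CMUdict.
PHONEMES = {
    "alexa":    ["AH", "L", "EH", "K", "S", "AH"],
    "a letter": ["AH", "L", "EH", "T", "ER"],
    "election": ["IH", "L", "EH", "K", "SH", "AH", "N"],
}

WAKE_WORD = "alexa"

def confused_units(trigger: str, wake_word: str = WAKE_WORD) -> Counter:
    """Count sound units of the accidental trigger that also occur in the wake word."""
    shared = Counter()
    wake_units = set(PHONEMES[wake_word])
    for unit in PHONEMES[trigger]:
        if unit in wake_units:
            shared[unit] += 1
    return shared

# Aggregate over the observed accidental triggers to see which units are
# most often "close enough" for the wake-word detector.
observed = ["a letter", "election"]
totals = sum((confused_units(t) for t in observed), Counter())
print(totals.most_common())  # e.g. [('AH', 2), ('L', 2), ('EH', 2), ('K', 1)]
```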

“A balancing act between data protection and technical optimization”

“The devices are programmed with a certain degree of flexibility, so that they are able to understand the people speaking to them. They therefore tend to be activated too often rather than too rarely,” explains Dorothea Kolossa. The researchers studied in greater detail how the systems evaluate speech signals and how they handle unintentional trigger words. Their findings show that the devices typically follow a two-stage process. First, the device analyses locally whether the speech it has heard contains a trigger word. If the device thinks it has heard a trigger word, it uploads the current conversation to the manufacturer’s cloud for further analysis with more computing power. If the cloud analysis identifies the term as a false trigger, the voice assistant stays silent, and its activity display merely lights up briefly. Even then, however, several seconds of audio recording can end up in the hands of the manufacturers, who analyse them with the aim of avoiding unwanted activations by the term in question in the future.
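The two-stage pattern can be summarised in a short sketch. The scoring functions, threshold and clip handling below are hypothetical placeholders rather than any manufacturer's implementation; the point is only that audio is uploaded before the stronger second check has decided whether the activation was genuine.

```python
# Minimal sketch of the two-stage process described above; all models and
# thresholds are invented placeholders, not any vendor's code.

LOCAL_THRESHOLD = 0.5   # deliberately permissive: better to trigger too often

def local_wake_word_score(audio_chunk: bytes) -> float:
    """Stage 1 (on the device): cheap acoustic match against the wake word."""
    return 0.0  # placeholder model

def cloud_verification(audio_clip: bytes) -> bool:
    """Stage 2 (manufacturer's cloud): larger model re-checks the clip."""
    return False  # placeholder model

def start_listening_for_command():
    print("assistant activated")

def handle_audio(audio_chunk: bytes, recent_audio: bytes):
    # Stage 1: the device itself decides whether this *might* be the wake word.
    if local_wake_word_score(audio_chunk) < LOCAL_THRESHOLD:
        return  # nothing leaves the device

    # Stage 2: a few seconds of audio are uploaded for the stronger check.
    # Even if the cloud decides it was a false trigger, this clip has already
    # been transmitted; that is the privacy issue the article describes.
    if cloud_verification(recent_audio):
        start_listening_for_command()
    else:
        pass  # activity display lights up briefly, assistant stays silent
```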

“From a privacy point of view, this is of course worrying, since in some cases, very private conversations could end up being heard by outsiders,” says Thorsten Holz. “However, in terms of engineering, the procedure makes sense, since the systems can only be improved when data of this nature is used. The manufacturers need to strike a balance between data protection and technical optimization.”

RUB/PH
