Why so Siri-ous? Striving to create natural voice agents

Will 'robot' voices one day be indistinguishable from a human being? Photo: istock.

While voice user interface agents like Siri and Alexa are now commonplace, their designers are still striving to make their conversation ever more natural.

But what does ‘natural’ mean for a human-agent conversation? A new study by UBC computer scientists investigates what designers mean by ‘naturalness’ and whether Siri will one day be indistinguishable from a human being.

Lead author and computer science alumna Yelim Kim, and co-author Dr. Dongwook Yoon, an assistant professor with UBC Computer Science and a member of the UBC Language Sciences Initiative, discuss their findings below.

How are voice user interface agents (VUIs) designed to better talk with humans?

YK: Using our voice to communicate is an innate human ability, so voice user interface designers aim to provide a natural conversation experience to users. Our study found 12 ways voice designers characterize naturalness and classified them into three categories: 'Core', 'Social' and 'Transactional'. Some of these elements include human-like aspects, such as conveying appropriate patterns of stress, pauses or intonations that show meaning ('Core').

Designers also identified some ‘beyond-human’ characteristics as important in designing a natural voice agent, such as completing tasks and accessing information much faster than a human. So, what is ‘natural’ in a VUI is contextual.

DY: Designers also want agents to converse with users in a socially appropriate manner ('Social'), for example, using a serious tone of voice when delivering negative news (e.g., traffic is bad). Transactional elements help a user get what they need done, including being proactive by leading a conversation about a given task and providing helpful suggestions to the user.

What are some challenges that designers face when striving for naturalness?

DY: Our study revealed seven major challenges. The primary goal of task-oriented applications is to help users with their tasks efficiently. However, designers find that when they add characteristics of social conversation to enhance naturalness, such as expressing sympathy and maintaining an intriguing persona, the dialogues get longer, which conflicts with being efficient.

Another challenge was making the agent’s voice more expressive than a monotonous 'robot' voice. The current tool to achieve expressivity, Speech Synthesis Markup Language (SSML), has limited support for changing the sound of the voice agents, and is overly time-consuming to use. We concluded that there is a need for more detailed design guidelines and innovative language tool support to solve these challenges.
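
For readers unfamiliar with Speech Synthesis Markup Language, the sketch below shows roughly what this hand-tagging can look like. It is an illustration only, not taken from the study: the function name `serious_ssml` and its default values are hypothetical, and how a given speech engine actually renders these tags will vary. It wraps a sentence in the standard `<prosody>` and `<break>` elements to request a slower, lower-pitched delivery, such as one might want for bad traffic news.

```python
from xml.sax.saxutils import escape


def serious_ssml(text: str, pause_ms: int = 400) -> str:
    """Wrap text in SSML requesting a slower, lower-pitched delivery.

    Hypothetical helper for illustration; the pause length and prosody
    values are assumptions, not recommendations from the study.
    """
    # Escape &, < and > so the spoken text cannot break the markup.
    safe = escape(text)
    return (
        "<speak>"
        f'<break time="{pause_ms}ms"/>'
        f'<prosody rate="slow" pitch="low">{safe}</prosody>'
        "</speak>"
    )


if __name__ == "__main__":
    print(serious_ssml("Traffic is bad on your route & delays are expected."))
```

Even this small example hints at the workload the designers describe: every change of tone has to be tagged by hand, sentence by sentence.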

Some other major challenges include:

  • That writing for spoken language is difficult, with written text often sounding less natural, or containing too much information for a spoken conversation
  • That handling varied or unexpected user inputs and conversational context is difficult
  • That existing VUI guidelines lack concrete recommendations on how to design for ‘naturalness’

One day will we be unable to tell if we're talking to a human or a robot?

YK: At the 2018 Google I/O conference, Google showcased its voice assistant, Duplex, calling a hair salon and successfully making an appointment. It was a demo, but because Duplex talked to staff so naturally, it was almost indistinguishable from a human, and some people expressed their concerns through popular media about the potential risk of this new technology.

Given the current direction of the industry, I think there will be a day when it's hard to distinguish between voice agents and humans. Voice agents require less cognitive load and are relatively easy to use, so they are very helpful for multi-tasking and for people who find it hard to learn new technologies. However, there are also risks that we should be prepared for, including the potential for abuse and deception. Being transparent about what information gets collected and how it is managed is very important, and users should have control over the information they share.


Chris Balma
balma@science.ubc.ca
604.822.5082
c 604-202-5047