Why so Siri-ous? Striving to create natural voice agents
April 13, 2021
While voice user interface agents like Siri and Alexa are now commonplace, their designers are still striving to make their conversation ever more natural.
But what does ‘natural’ mean for a human-agent conversation? A new study by UBC computer scientists investigates what designers mean by ‘naturalness’ and whether Siri will one day be indistinguishable from a human being.
Lead author and computer science alumna Yelim Kim and co-author Dr. Dongwook Yoon, an assistant professor with UBC Computer Science and a member of the UBC Language Sciences Initiative, discuss their findings below.
YK: Using our voice to communicate is an innate human ability, so voice user interface designers aim to provide a natural conversation experience to users. Our study found 12 ways voice designers characterize naturalness and classified them into three categories: 'Core', 'Social' and 'Transactional'. Some of these elements are human-like, such as conveying appropriate patterns of stress, pauses or intonation that show meaning ('Core').
Designers also identified some ‘beyond-human’ characteristics as important in designing a natural voice agent, such as completing tasks and accessing information much faster than a human. So what counts as ‘natural’ in a voice user interface (VUI) is contextual.
DY: Designers also want agents to converse with users in a socially appropriate manner ('Social'), for example, using a serious tone of voice when delivering negative news (e.g., traffic is bad). 'Transactional' elements help users get done what they need to, such as being proactive by leading the conversation about a given task and providing helpful suggestions.
DY: Our study revealed seven major challenges. The primary goal of task-oriented applications is to help users with their tasks efficiently. However, designers find that adding characteristics of social conversation to enhance naturalness, such as expressing sympathy and maintaining an intriguing persona, makes the dialogues longer, which conflicts with being efficient.
Another challenge was making the agent’s voice more expressive than a monotonous 'robot' voice. The current tool for achieving expressivity, Speech Synthesis Markup Language, has limited support for changing the sound of the voice agent and is overly time-consuming to use. We concluded that there is a need for more detailed design guidelines and innovative language tool support to solve these challenges.
Some other major challenges include:
YK: At the 2018 Google I/O conference, Google showcased its voice assistant, Duplex, calling a hair salon and successfully making an appointment. It was only a demo, but Duplex talked to staff so naturally that it was almost indistinguishable from a human, and some people expressed concerns in popular media about the potential risks of this new technology.
Given the current direction of the industry, I think there will be a day when it's hard to distinguish between voice agents and humans. Voice agents require less cognitive load and are relatively easy to use, so they are very helpful for multi-tasking and for people who find it hard to learn new technologies. However, there are also risks that we should be prepared for, including the potential for abuse and deception. Being transparent about what information gets collected and how it is managed is very important, and users should have control over the information they share.
We honour xʷməθkʷəy̓əm (Musqueam) on whose ancestral, unceded territory UBC Vancouver is situated. UBC Science is committed to building meaningful relationships with Indigenous peoples so we can advance Reconciliation and ensure traditional ways of knowing enrich our teaching and research.
Learn more: Musqueam First Nation