Why so Siri-ous? Striving to create natural voice agents

April 13, 2021

Will 'robot' voices one day be indistinguishable from a human being? Photo: iStock.

While voice user interface agents like Siri and Alexa are now commonplace, their designers are still striving to make their conversation ever more natural.

But what does ‘natural’ mean for a human-agent conversation? A new study by UBC computer scientists investigates what designers mean by ‘naturalness’ and whether Siri will one day be indistinguishable from a human being.

Lead author and computer science alumna Yelim Kim, and co-author Dr. Dongwook Yoon, an assistant professor with UBC Computer Science and a member of the UBC Language Sciences Initiative, discuss their findings below.

How are voice user interface agents (VUIs) designed to better talk with humans?

YK: Using our voice to communicate is an innate human ability, so voice user interface designers aim to provide a natural conversation experience to users. Our study found 12 ways voice designers characterize naturalness and classified them into three categories: 'Core', 'Social' and 'Transactional'. Some of these elements include human-like aspects, such as conveying appropriate patterns of stress, pauses or intonation that show meaning ('Core').

Designers also identified some ‘beyond-human’ characteristics as important in designing a natural voice agent, such as completing tasks and accessing information much faster than a human. So, what is ‘natural’ in a VUI is contextual.

DY: Designers also want agents to converse with users in a socially appropriate manner ('Social'), for example, using a serious tone of voice when delivering negative news (e.g., traffic is bad). Transactional elements help a user get done what they need to, including being proactive by leading a conversation about a given task and providing helpful suggestions to the user.

What are some challenges that designers face when striving for naturalness?

DY: Our study revealed seven major challenges. The primary goal of task-oriented applications is to help users with their tasks efficiently. However, designers find that when they add characteristics of social conversation to enhance naturalness, such as expressing sympathy and maintaining an intriguing persona, the dialogues get longer, which conflicts with being efficient.

Another challenge was making the agent’s voice more expressive than a monotonous 'robot' voice. The current tool for achieving expressivity, Speech Synthesis Markup Language, has limited support for changing the sound of voice agents and is overly time-consuming to use. We concluded that there is a need for more detailed design guidelines and innovative language tool support to solve these challenges.

Some other major challenges include:

  • That writing for spoken language is difficult, with written text often sounding less natural, or containing too much information for a spoken conversation
  • That handling varied or unexpected user inputs and conversational context is difficult
  • That existing VUI guidelines lack concrete recommendations on how to design for ‘naturalness’

One day will we be unable to tell if we're talking to a human or a robot?

YK: At the 2018 Google I/O conference, Google showcased its voice assistant, Duplex, calling a hair salon and successfully making an appointment. It was a demo, but because Duplex talked to staff so naturally, it was almost indistinguishable from a human, and some people expressed their concerns through popular media about the potential risk of this new technology.

Given the current direction of the industry, I think there will be a day when it's hard to distinguish between voice agents and humans. Voice agents require less cognitive load and are relatively easy to use, so they are very helpful for multi-tasking and for people who find it hard to learn new technologies. However, there are also risks that we should be prepared for, including the potential for abuse and deception. Being transparent about what information gets collected and how it is managed is very important, and users should have control over the information they share.


For more information, contact…

Chris Balma

balma@science.ubc.ca 604-822-5082
