Lessons so far from building Voice Apps

So you hear those special five words in a meeting, “We need a voice skill”, and it brings back memories of the “We need an app” and “We need a website” statements of years past. Of course, the statement is true: most businesses do, or shortly will, need to define their voice strategy. But on their first attempt at voice, most companies get wrong both what to build and how to build it.

Photo by rawpixel on Unsplash

Here are a few lessons learnt from building voice experiences over the last 18 months that may help.

All interaction is conversational

You walk into a shop and ask the butcher for some meat; in the pub, you ask the bar person for a beer. Though these are clearly conversational, they are no different from visiting a website. When a website loads, it is metaphorically saying, “Hey, here are all the things you can do, which one do you want?”. You as the user then reply, “Thanks, I would like to login”. On a laptop, you action this by moving the mouse to the login button and clicking. On mobile, you tap the login button on the screen with your finger. In a voice world, you say “Login” and your voice fingerprint is used to authenticate you.

by Ben White on Unsplash

So remember, all interaction is conversational. You can get started by having a conversation with someone and writing down each interaction; that gives you the skeleton of your voice experience.

Devices, devices, devices

If you are coming to voice from a web or mobile world, screen size was your arch-enemy. In a voice world, it is currently devices. There are so many devices, and every category is being “voice enabled”. Here are a few to bear in mind:

· Smart speaker

· Fridge/Washing machine

· TV

· Earphones

· Lightbulbs

· Car

· Phone

· Drive-thru

· Laptop

· Tablet

· Wristwatch

· Phone call

Knowing which device(s) your users will use enables you to shape the experience, much like how developing for mobile is different from developing for TV. Building for voice on a watch is different to building for a speaker in your kitchen.

by Crew on Unsplash

A phone is usually close to you and your voice. A smart speaker is generally in the corner of the room. The TV is in an open shared space with lots of sounds. Earphones provide intimacy for personal information, and a desk speaker at near keyboard height can work in conjunction with other devices on your desk.


Where is the person when they interact with your voice experience? In the kitchen, their hands could be holding food. At the dinner table, background music may be playing low. In the workplace, expect multiple and possibly unknown voices to interact with it. In a car, the focus should be on the road, so make interactions short and concise.

Photo by Sadık Kuzu on Unsplash

The time of day also plays a part. Saying good morning and good evening is lovely, but what if it is 2 am? Should your voice experience speak at a lower volume? If music was playing loud 10 minutes before, maybe not. But if music was last playing 8 hours earlier and a child has interacted since then, you now have more context about the environment. No parent wants a voice shouting at 2 am on a school night.
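To make the 2 am example concrete, here is a minimal sketch of how that decision could look in Python. Everything in it is an assumption for illustration: the `RoomContext` fields, the quiet hours of 22:00 to 06:00 and the 15-minute "recently loud" window are hypothetical and are not part of any voice platform's API.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class RoomContext:
    """Hypothetical snapshot of what the device has recently observed."""
    now: datetime
    last_loud_music: Optional[datetime] = None  # when loud music last played
    child_heard_recently: bool = False          # a child has interacted since


def choose_volume(ctx: RoomContext) -> str:
    """Return "quiet" or "normal" for the next spoken response."""
    # Assumed quiet hours: 22:00 up to, but not including, 06:00.
    is_night = ctx.now.hour >= 22 or ctx.now.hour < 6
    if not is_night:
        return "normal"
    # Loud music within the last 15 minutes suggests people are still up.
    recently_loud = (
        ctx.last_loud_music is not None
        and ctx.now - ctx.last_loud_music <= timedelta(minutes=15)
    )
    if recently_loud and not ctx.child_heard_recently:
        return "normal"
    # Otherwise assume the household may be asleep and keep it down.
    return "quiet"
```

The point of the sketch is that the clock alone is not enough; the recent history of the room changes the answer.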


When using the web or mobile, your eyes are looking at something as you engage with it. With voice, there can be an experience with no visual interface at all, giving the user only audible cues to interact with. These cues can be words, sentences or chimes, plus light effects if a visible device is present.

As our attention span has decreased with the mobile revolution, it will reduce further when there isn’t even a screen to look at. This drives the importance of verbal confirmations. If a user is making a finalising action, such as a payment or transferring an item, you may want to confirm: “Just to confirm, I should book two tickets to The Greatest Showman on Tuesday?”.
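The confirmation step can be sketched in a few lines of Python. This is an illustration only: the action names in `FINALISING_ACTIONS`, the accepted words in `YES_WORDS` and the helper functions are all hypothetical, and a real voice platform would give you its own way to capture a yes or no.

```python
# Assumed names for actions that are hard to undo.
FINALISING_ACTIONS = {"payment", "transfer", "booking"}

# Assumed set of replies that count as a "yes".
YES_WORDS = {"yes", "yeah", "yep", "confirm", "go ahead"}


def needs_confirmation(action: str) -> bool:
    """Only finalising actions are worth an extra round trip."""
    return action in FINALISING_ACTIONS


def confirmation_prompt(summary: str) -> str:
    """Read the action back to the user in one short sentence."""
    return f"Just to confirm, I should {summary}?"


def is_yes(reply: str) -> bool:
    """Treat a small set of spoken replies as agreement."""
    return reply.strip().lower().rstrip(".!") in YES_WORDS
```

For example, `confirmation_prompt("book two tickets to The Greatest Showman on Tuesday")` gives back the sentence quoted above, and the booking only goes ahead if `is_yes` accepts the reply.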


Forgiveness: when a website or app goes down, or a subset of functionality stops working, you can see it; there is ample visual space to inform the user of an issue. With voice, you get a sentence, or even just a chime or light. Detailed messaging is therefore massively important to tell the user what action to take.

Photo by Andres Urena on Unsplash

We have seen how frustrated people get when services like Alexa don’t recognise what they say. If your voice experience isn’t working, or is unable to understand what the user is saying, you need to give the user an excellent error response that reassures them and tells them whether to try again now or later.
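One way to think about such error responses is as a small ladder: be gentle on the first miss, offer concrete examples on the second, and bow out gracefully after that. The sketch below assumes this three-step ladder and uses made-up wording and a hypothetical `service_available` flag; it is not taken from any particular platform.

```python
def error_response(failed_attempts: int, service_available: bool = True) -> str:
    """Pick a spoken error message that tells the user what to do next."""
    if not service_available:
        # The problem is on our side, so say so and suggest coming back later.
        return "Sorry, I can't do that right now. Please try again in a few minutes."
    if failed_attempts <= 1:
        # First miss: keep it short and invite a retry.
        return "Sorry, I didn't catch that. Could you say it again?"
    if failed_attempts == 2:
        # Second miss: give examples of things that will work.
        return "I'm still not getting it. You can say 'book tickets' or 'check my booking'."
    # Third miss or more: stop asking and end politely.
    return "Sorry, I'm having trouble understanding. Please try again later."
```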


Testing voice interactions becomes essential to make sure you retain the user’s attention. We are in a time of limited attention, where even a 5-word sentence is too long and you need to optimise down to 3 words (proven from experience).

With voice, the most significant part of testing is understanding what your users will say and how they will say it. You can then build up the utterances they will interact with. These can be gathered from surveys, user interviews and beta testing, as well as from learnings once your voice experience goes live.
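As a rough sketch of what building up utterances can mean in practice, the Python below keeps a list of sample phrases per intent and records anything it fails to match, so those phrases can be reviewed and added later. The intent names, the sample phrases and the exact-match lookup are assumptions for illustration; real voice platforms use trained language models rather than exact matching.

```python
import re
from typing import List, Optional

# Hypothetical intents and the phrases gathered for them so far.
SAMPLE_UTTERANCES = {
    "login": ["login", "log in", "log me in", "sign me in"],
    "book_tickets": ["book tickets", "get me tickets", "i want to book"],
}

# Phrases nobody planned for; review these to grow the lists above.
unmatched: List[str] = []


def normalise(text: str) -> str:
    """Lowercase, drop punctuation and collapse repeated spaces."""
    cleaned = re.sub(r"[^a-z0-9 ]", "", text.lower())
    return " ".join(cleaned.split())


def match_intent(spoken: str) -> Optional[str]:
    """Return the intent whose samples contain the phrase, if any."""
    cleaned = normalise(spoken)
    for intent, samples in SAMPLE_UTTERANCES.items():
        if cleaned in samples:
            return intent
    unmatched.append(spoken)
    return None
```

The `unmatched` list is the useful part: it is the live equivalent of a survey, telling you what people actually said.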


Once your voice experience is out in the wild, you need to tune it, just like we have done for years with websites and apps. A/B testing is crucial, both for what your experience says and for the range of expected responses it gets back. With voice, you should gather a beta user group as quickly as possible to gain valuable insight, and continue to iterate to improve your voice experience.
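A/B testing what your experience says can start very simply. The sketch below assumes two made-up welcome prompts and a hypothetical user ID string, and uses a hash of that ID so the same user always hears the same variant. It is one common way to split users, not the only one.

```python
import hashlib

# Two hypothetical wordings of the same welcome prompt.
PROMPT_VARIANTS = {
    "A": "What would you like to do?",
    "B": "How can I help?",
}


def assign_variant(user_id: str) -> str:
    """Deterministically place a user in group A or B."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    # The first byte of the hash is even for roughly half of all users.
    return "A" if digest[0] % 2 == 0 else "B"


def welcome_prompt(user_id: str) -> str:
    """Return the welcome wording for this user's group."""
    return PROMPT_VARIANTS[assign_variant(user_id)]
```

Log which variant each user heard alongside what they said next, and you can compare how each wording performs.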

Like any problem solved with technology, voice is another channel, with its own challenges and rewards that need to be taken into consideration, as web and mobile had before it. Just remember we have been conversing through voice for thousands of years, so this should be nicer in the long term than rounded corners on a screen.

by "My Life Through A Lens" on Unsplash

Have you built for voice? What lessons have you learnt so far?
