Moving from speech recognition to voice recognition

Conversational Design has (almost) arrived. Voice commands as an input method are everywhere. People use them to control their smartphones. The Amazon Echo is a success. Speaking offers a speed of task completion that beats typing.

And right now it is easily undone by a toddler.

That’s my two-year-old daughter, Avery. She is a typical toddler: boisterous and full of energy. When I attempt to use voice commands – to draft a text message, search for a Doc McStuffins YouTube video through Apple TV, whatever – she invariably talks over me. Loudly. And then whatever device I’m using fails miserably.

Speech recognition software has gotten much better recently, and so we’re starting to realize the benefits of speaking to our devices. We can all speak faster than we can type. This cuts how long it takes us to complete a task. But for this to work, we need a quiet space.

I’ve been thinking a lot lately about how it will look (er, I mean sound) when we are all talking to our devices. It will get quite loud. The workplace will need to change. We will need more privacy to work - to speak with and to our computers. This may or may not be practical (office space being expensive). The open floor plans of most offices will have to be rejiggered to accommodate conversational interfaces.

Out of the office - commuting, traveling, at home, wherever - I see two continuing problems with interacting with our devices. The first is societal. We frown on people talking in public, whether it be on the train, in an elevator, or waiting in line. We consider it rude when the individual vocally intrudes on the group.

Working on this post while commuting to work. Yes, I get to ride a ferry every day.

The second continuing problem is technology. Our devices need to identify not just the words being spoken, but who is speaking them. Until they can lock in on the user's voice to the exclusion of all other voices, the interface will continue to fail us. It needs to understand that I'm issuing commands and not be distracted by Avery practicing the Happy Birthday song.

That should be the goal. Getting devices to go from speech recognition to actual voice recognition. When that happens, maybe we'll have something.