A few stray observations on voice assistants

To keep the title of this piece short, I have used the somewhat generic term ‘voice assistants’ instead of something like speech recognition and intelligent personal assistant applications. Now that I’ve hopefully made that clear, here are a few thoughts I’ve been having on the subject, in no particular order of importance.

Whenever I talk with people who are excited by voice assistants and the underlying technology, they like to include references to past science fiction series and films that ‘pioneered’ voice-based human-computer interaction. I’m more and more of the opinion that those series and films have been a bad influence on tech people, that they gave them the wrong idea about what the future of computing should be about. On several occasions, I’ve been baffled by how few of my tech enthusiast interlocutors recognise that in series like Star Trek or Space: 1999, the voice-based interaction with computers is essentially a dramatic device, a quick and effective way to deliver information to the viewers. Instead of having boring close-ups of displays where you see queries typed by a Starfleet officer and the computer’s responses, it’s easier to conceive of the computer as another character (a very erudite one) who can be queried on the spot and whose response is equally fast. Sometimes this is taken even further by the introduction of an android, a computer with a human shape.

In other cases, presenting a voice-based human-computer interaction is a sci-fi trope to convey a general idea of technological advancement, especially when combined with the absence (or very reduced footprint) of physical tech gadgets/devices. Voice assistants might be ‘the future’, but their current state is little past the mimicking stage of this fictional interaction. That is, we’re aping what we saw in those sci-fi shows, but we’re still at a stage where the form is nice, yet the substance is lacking. There’s little depth beyond the surface. Speech recognition is passable, but reliability is still poor, and the scope of actionable tasks is still limited. It’s like playing one of the early text adventure games, where the parser can’t interpret commands that are more sophisticated than GO EAST, TAKE LAMP, OPEN DOOR, etc. Sure, “we’ll get there” someday, but I can’t shake the feeling that it’s not worth the amount of energy Silicon Valley is pouring into this, and the amount of data we’re feeding to machines to improve Artificial Erudition (I’m still not ready to call it Artificial Intelligence, sorry).

I have this theory about the current limited usefulness of voice assistants, and their relatively slow rate of improvement. I tentatively call this theory ‘the Google Glass fallacy’. It has been pointed out how Google Glass has turned out to be a failed attempt as a general-purpose device aimed at the general public, but a more successful one in limited, specialised applications and environments. I believe voice assistants have started off on the wrong foot: as I wrote on Twitter yesterday, I think that if voice assistants had been originally designed with people with disabilities as their first and sole target audience (instead of lazy tech dudes), and then gradually extended to everyone else, today they’d be a bit better.

Joe Cieplinski’s 3‑tweet response to that really resonated with me, because it touches on one aspect that inspired my observation in the first place. Here’s his response (emphasis mine):

I think you may be on to something there. Another problem that enthusiasts who think “voice will one day replace your screen” never consider is that those with hearing difficulties would be locked out entirely. I’ve always felt that voice will find its place, but never be the “only” way to interface with computers. Even the folks who wrote Star Trek knew that. It’s also worth noting that there are two different things at play here. Voice recognition, and then artificial intelligence. There’s no reason the two have to be permanently linked. We could just as easily type to Siri or Alexa. Or show it images to interpret.

[Link to tweet 1 | Link to tweet 2 | Link to tweet 3]

I think that in the creation and initial development of these voice assistants, not enough thought has been given to the ‘assistive’ part, because the design mainly referenced able-bodied people. Simplifying, there’s a big difference between developing a tool that makes your life-as-an-able-bodied-person easier (read: spoiled) and developing a tool that makes the life of a disabled person more tolerable. Your able-bodied person’s ‘friction’ is bullshit compared to the real friction of a person with any disability. A useful virtual assistant is one that, first and foremost, addresses a few crucial types of impairments. Design with that in mind, give precedence to solving problems related to the interaction between a person with impairments and the assistant, develop against those, test against those, then worry about perfectly healthy twenty-somethings who are too inconvenienced to manually select the music they want to play.

When my dad was still around, every now and then I used to tease him into discussing tech-related topics. He was an extraordinarily intellectually curious person, always willing to learn new things, and often approaching questions with great common sense. We only had the chance to talk about voice assistants once, briefly. I was explaining to him the technology and the current capabilities of Siri, Alexa, Cortana, Google Assistant and the like. So, what do you think? – I remember asking.

He fell silent for a bit, then he said: These things can be really useful to people who, for one reason or another, need assistance in their lives. I mean, real assistance: because they’re blind, or can’t move, or are simply too busy to use their hands. For someone like me they’re mostly useless. You know me, I’m quicker if I just use my hands.

– I use Siri to set a timer when I cook. Sometimes for a reminder. Nothing else.

– Simple things, he nodded. – I probably wouldn’t even use it while cooking. I’d just clean my hands, take the phone, set the timer myself.

– Yeah, there are times when I still have to do that anyway. Siri doesn’t always understand what I say.

He slowly shook his head, then said: – Reliability must be put first with these assistants. They ought to understand you at once, and if they don’t, they ought to allow you to correct them as quickly as possible. Otherwise they’re just like that subordinate at the office who is supposed to help you do the work, but he doesn’t understand or misunderstands what you want him to do, and you end up doing more work to fix the misunderstandings.

– Yes, that’s Siri right there.

We laughed, then he observed: Now, if this Siri misunderstands you, you are absolutely able to take matters into your hands, and you just do the thing. You just open the app for the weather forecast, or you set the timer yourself, or you type your internet search. You do that in no time. Now imagine those who truly need this kind of technology in their lives, they are already frustrated enough by their condition. When these assistants fail, it’s even worse. They need them to work. If tech companies want to help these people, they have to work hard at this stuff, or just drop it. Half good doesn’t work here.

– Or, you know, at least recognise your limits and rethink the project. Make something that’s really good at one thing…

– Yes, like something that’s really good at reading things for blind people. You develop one piece, then maybe another company develops something that’s really good at taking your dictation — but really good, something that gets you even if you stutter, or have a lisp… Then maybe one day these two companies collaborate, put the pieces together, and make something very good at more than one thing… and everybody wins.

– Instead, everyone wants to compete, each company trying to offer a finished product that does many things out of the box, and they’re all more or less mediocre.

I miss these chats with my dad.

I wrote this in ‘Siri’s fuzziness and friction’ (October 2015). Nothing has improved on this front, and this kind of criticism can be extended to other assistants:

And indeed, Siri is the kind of interface where, when everything works, there’s a complete lack of friction. But when it does not work, the amount of friction involved rapidly increases: you have to repeat or rephrase the whole request (sometimes more than once), or take the device and correct the written transcription. Both actions are tedious — and defeat the purpose. It’s like having a flesh-and-bone assistant with hearing problems. Furthermore, whatever you do to correct Siri, you’re never quite sure whether your correcting action will have an impact on similar interactions in the future (it doesn’t seem to have one, from my experience). Then, there’s always what I usually consider the crux of the matter when interacting with Siri: the moment my voice request is misunderstood, it’s typically faster for me to carry out the action myself via the device’s Multi-touch interface, rather than repeat or rephrase the request and hope for the best.

[…]

Siri’s scope is still rather limited. What is the reward for my continued use of this technology despite its immaturity? That sometime in the future it’ll be able to properly write a text message or a reminder? Time is too precious a resource for me to keep trying to have Siri understand simple requests. Not only does the friction in interacting with this particular fuzzy interface have to disappear, but the scope, applications and usefulness of Siri must expand as well — it has to offer enough flexibility and reliability to engage the user. It has to offer more, to provide an advantage over performing the same tasks manually. Otherwise, I think it’s difficult to expect users to invest time and energy in something that still feels non-essential.

And this other article, ‘Siri, wake up’, is five years old. Five years. It has aged rather well, which is not a compliment to Siri. In this MacRumors article from September 2017, among other things, there’s this:

Siri is powered by deep learning and AI, technology that has much improved her speech recognition capabilities. According to Wired, Siri’s raw voice recognition capabilities are now able to correctly identify 95 percent of users’ speech, on par with rivals like Alexa and Cortana.

Not my experience at all. I’ve tried Siri, Google Assistant, and Cortana, interacting in English (which is not my first language, yet I believe my pronunciation to be fairly decent) and performing the same requests. Google Assistant and Cortana both performed better and more consistently on this front. Cortana (on Windows Phone 8.1, Windows 10 Mobile, and iOS) even understood me when I whispered to it in the dead of night. There’s more: even Dragon Dictation under iOS 4.2.1 on my old iPhone 3G was able to correctly understand 99.8% of the text for a short email I was preparing to send.

[…] Joswiak says Apple’s aim from the beginning has been to make Siri a “get‑s**t‑done” machine. “We didn’t engineer this thing to be Trivial Pursuit!” he told Wired. Apple wants Siri to serve as an automated friend that can help people do more.

Maybe it should have been engineered to be (also) Trivial Pursuit. At least it would be good at the Artificial Erudition part of this whole machine learning thing. At this stage, Siri is indeed an automated friend, but a quirky, unhelpful one. Apple is in a difficult position here, because they have decided to integrate this half-baked assistant into too many parts of their ecosystem to just pull out now. And the pace of improvement of this automated friend is frustratingly slow.

Speaking of quirky, unhelpful automated friends: apparently, Alexa laughs at you, unprompted. Nick Heer comments:

But why is this possible at all? Is there some sort of hidden maniacal laughter mode? Is that something people would ever want to trigger intentionally, let alone have the device invoke accidentally? Is this a prank? And could you trust Amazon’s virtual assistant to not do anything like this again?

Today it’s an unprompted laugh, tomorrow may be something else, equally unexpected, but perhaps not as innocuous. If I have to renounce a big slice of privacy for these ever-listening devices (I’m not putting any of them in my home, by the way), is it too much to ask for some useful assistance in return?

This has just popped into my head as I was about to publish the article: how nice and useful it would be to have the ability to define ‘shortcuts’ with these assistants, so that common, repetitive tasks can be carried out with shortened queries. Stupid example: you often make pizza, or heat a pre-packaged meal in the oven, and you always set a timer for the same amount of time. You could ask the assistant to Define task. The assistant would respond: Name of the task?, and you’d say Pizza time; then the assistant would ask: What do you want me to do for ‘Pizza time’?, and you’d reply: Set a timer for 25 minutes. Once confirmed, every time you say Hey [Assistant], let’s do ‘Pizza time’, the assistant carries out the pre-recorded task.
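
To make the exchange a little more concrete, here is a minimal sketch of it in Python. Everything in it is hypothetical: the ShortcutAssistant class, its hear method and the exact phrases it listens for are made up for illustration and are not any real assistant's API. It works on typed text rather than speech, assumes the wake word has already been stripped, skips the confirmation step, keeps the shortcuts in memory only, and understands exactly one kind of command (a timer).

```python
import re


class ShortcutAssistant:
    """Toy, text-only assistant that records and replays task shortcuts."""

    def __init__(self):
        self.shortcuts = {}       # normalized task name -> stored command text
        self.state = "idle"       # "idle", "awaiting_name" or "awaiting_action"
        self.pending_name = None  # task name captured while defining a shortcut

    @staticmethod
    def _normalize(text):
        # Lowercase and keep only letters, digits and spaces, so that
        # "Pizza time" and "'pizza time'!" refer to the same shortcut.
        return re.sub(r"[^a-z0-9 ]", "", text.lower()).strip()

    def _run_command(self, command):
        # Stand-in for the assistant's real capabilities: timers only.
        match = re.fullmatch(r"set a timer for (\d+) minutes?",
                             self._normalize(command))
        if match:
            return f"Timer set for {match.group(1)} minutes."
        return "Sorry, I can't do that yet."

    def hear(self, utterance):
        """Process one utterance and return the assistant's reply."""
        text = self._normalize(utterance)

        if self.state == "awaiting_name":
            self.pending_name = text
            self.state = "awaiting_action"
            return f"What do you want me to do for '{text}'?"

        if self.state == "awaiting_action":
            # Store the command text as spoken; it is interpreted on replay.
            self.shortcuts[self.pending_name] = utterance
            self.state = "idle"
            return f"Saved '{self.pending_name}'."

        if text == "define task":
            self.state = "awaiting_name"
            return "Name of the task?"

        if text.startswith("lets do "):
            # The trigger phrase tells the parser to expect a shortcut name.
            name = text[len("lets do "):]
            if name in self.shortcuts:
                return self._run_command(self.shortcuts[name])
            return f"I don't know a task called '{name}'."

        return self._run_command(utterance)


if __name__ == "__main__":
    assistant = ShortcutAssistant()
    for line in ["Define task", "Pizza time", "Set a timer for 25 minutes",
                 "Let's do 'Pizza time'"]:
        print(">", line)
        print(assistant.hear(line))
```

The small state variable is what allows an exchange longer than a single challenge/response: the same words (Pizza time) mean different things depending on what the assistant has just asked.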

Yes, yes, the idea needs refinement (I just thought of it, bear with me), and it would also need a step further towards sophistication: a more context-aware assistant, the ability to perform a longer exchange than a simple challenge/response, the ability to store tasks in a database on the device and/or the cloud, and a better parser, so that when a particular phrase is invoked, it triggers the assistant to expect a task shortcut. But we’ll have to make them stop laughing at us, first.

Riccardo Mori

Writer & Translator
