This is a topic I have been thinking about for some time, mostly because of the nature of Yakki and what we are trying to do. Talking about Yakki the other day, I got the question: "Why are we asking the user to ostensibly change their behavior?" A very good question. What's in it for them? Why is this important, and where does it lead in the future?

Not Every Interface Needs a Voice

The convenience of dictating instead of typing is clear, though not universally applicable. In some situations, writing can be more efficient or suitable for the task.

Take the example of an elevator. You could say which floor you want to go to, and the elevator would "listen" and take you there. Or you could press the button for that floor, which is far faster and more efficient.

When Dictation Shines

In other scenarios, dictating is far more efficient. Writing an email is a good example: speaking it is probably much faster than typing it. But dictation comes with its own set of challenges.

Your brain processes information differently when you think about what you are writing than when you dictate it. Is there a one-to-one equivalence between writing and dictating, where dictating is 3.75 times faster and needs no review pass? I don't think so, at least not initially.

Typing is a skill that we have developed over years of practice and study, while dictating is not as familiar. We use our voice for very specific cases, and there is little overlap in how the brain operates in the two modes. But to be fair, writing has never been a completely linear process either (for most of us mortals).

Changing How You Think

In my case, I'm very accustomed to writing, and my thought process when writing is much more thorough than when I'm dictating. To make dictation as effective as my writing, I would need to essentially change the way I think.

Some scientific publications back that up (meaning that my perception is not totally wrong). The paper "Speech Recognition Technology and Students With Writing Difficulties: Improving Fluency" implies that dictation doesn't just speed up input; it changes how the brain composes text.

By removing the manual task of typing, "working memory" is freed up for higher-level ideation, leading to longer and linguistically richer drafts. Comparing the two, the paper concluded that dictated texts were substantially longer and "richer," containing more complex arguments and explanative discourse.

The Transcription Bottleneck

The theory is that typing creates a "transcription bottleneck." When you remove the bottleneck via voice, you achieve a "brain dump" of higher quality ideas. These ideas may require cleaning and tweaking, but their overall depth is richer.

So in most scenarios dictating has a significant advantage, with a lot of potential upside in speed and in how fluid you can be compared with your usual writing process.

The Hybrid Workflow

My way of working essentially matches the findings of another paper, "Revisions in written composition: Introducing speech-to-text to children with reading and writing difficulties", which studies revision patterns among speech-to-text (STT) users.

One of the key findings is that users typically adopt a "hybrid" workflow—dictating bursts of ideas and then switching to the keyboard for the precision work of editing.

The Real Numbers

At the end of the day, the equivalence is not 3.75 times, because there are also a lot of corrections, and moments where I have to go back, rethink what I was trying to say, and reformulate it in a better way.

The most striking quantitative data comes from the medical field, where efficient documentation is a critical issue. These studies are currently the most cited in professional circles because of their economic impact.

One paper, "A multi-country study comparing typed to automatic speech recognition-based medical documentation speeds", studied the work of thousands of physicians and reports a number for the "error-adjusted speed". Even after adjusting for the time required to fix errors in the transcript, dictation was still 2.5x faster than typing.
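
To make the gap between the two numbers concrete, here is a minimal back-of-the-envelope sketch. It assumes a simple model that is mine, not the paper's: total dictation time is the raw speaking time plus a correction pass expressed as a fraction of that speaking time. The function names are hypothetical, and pairing the 3.75x raw figure with the 2.5x error-adjusted figure is only an illustration, since the two numbers do not necessarily come from the same measurement.

```python
def error_adjusted_speedup(raw_speedup: float, correction_overhead: float) -> float:
    """Speedup over typing once correction time is counted.

    raw_speedup: how many times faster raw dictation is than typing.
    correction_overhead: correction time as a fraction of raw dictation
        time (0.5 means fixing errors takes half as long as speaking).
    """
    if raw_speedup <= 0:
        raise ValueError("raw_speedup must be positive")
    if correction_overhead < 0:
        raise ValueError("correction_overhead cannot be negative")
    return raw_speedup / (1.0 + correction_overhead)


def implied_correction_overhead(raw_speedup: float, adjusted_speedup: float) -> float:
    """Correction overhead that would turn raw_speedup into adjusted_speedup."""
    if raw_speedup <= 0 or adjusted_speedup <= 0:
        raise ValueError("speedups must be positive")
    return raw_speedup / adjusted_speedup - 1.0


if __name__ == "__main__":
    overhead = implied_correction_overhead(3.75, 2.5)
    print(f"Implied correction overhead: {overhead:.0%}")
    print(f"Adjusted speedup: {error_adjusted_speedup(3.75, overhead):.2f}x")
```

Under this model, going from 3.75x to 2.5x corresponds to spending about half as much time correcting as dictating. That is a lot of editing, and dictation still comes out well ahead of typing.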