It appears on everything from presentation slides to thinkpieces: the ultimate justification for prioritising voice search; the ultimate proof that soon, we’ll be living in a much more vocal world.
The thing about statistics is that so often, as they’re cited and re-cited, their context is lost. And sometimes, that context can be very important to understanding what a statistic actually means.
In my recent article on the future of voice search: 2020 and beyond, I devoted a decent chunk of time to deconstructing the “50% by 2020” statistic and running the numbers to gauge how realistic it is. I pointed out that the original prediction, while popularly attributed to comScore (and for the life of me, I can’t seem to find where that came from), was in fact made by then-Chief Scientist at Baidu, Andrew Ng, in an interview with Fast Company. It was also meant to encompass image search as well as voice.
However, for the sake of keeping things concise (ish), I didn’t delve as deeply as I could have done into the original context of Andrew Ng’s voice search prediction and why it makes so little sense as a benchmark for where voice search is headed in countries like the UK and the US.
But the statistic is widespread enough, and I see few enough people drawing attention to this (no-one, really) that I thought I would devote a follow-up post to explaining why we need to stop blindly repeating the “50% by 2020” stat.
Ng’s prediction was about voice search in China
As I mentioned, at the time of Ng’s interview with Fast Company (which he gave in September 2014), he held the position of Chief Scientist at Baidu. This is important, as the bulk of the interview was devoted to talking about Baidu’s prowess with deep learning, image recognition and speech recognition.
Baidu is a Chinese search engine catering to Chinese users, and though Ng previously worked at Google and has a global perspective on the search market, he was referring specifically to Baidu when he made his prediction. Here is the full quotation from the article:
At the moment, around 10% of Baidu search queries are done by voice, with a much smaller percentage carried out using images. If progress continues at its current rate, however, Ng forecasts that “in five years’ time at least 50% of all searches are going to be either through images or speech.”
This forecast was very specific to where Baidu, and China, were with regard to voice search in late 2014. At that time, voice search already made up 10% of – I presume – all searches on Baidu. This is more impressive when you know that Baidu only launched voice search in 2012 [Chinese-language source], meaning that it jumped to 10% of search volume in just two years.
Google, by contrast, has offered mobile voice search since late 2008, rolling it out to all “Google Experience” Android phones with the Android 1.1 update in 2009. (Google even beat Baidu to the punch with Chinese voice search, releasing voice search in Mandarin for the Nokia S60 in November 2009). Yet it took Google until 2016 to announce that voice search made up 20% of all search queries carried out via the Google mobile app and on Android.
In other words, voice search in China took off far, far faster than voice search in the west – and Ng’s prediction was made based on that rapid growth.
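To put rough numbers on that comparison, here is a minimal back-of-the-envelope sketch in Python. It uses only the figures quoted above; the function names are my own, and it assumes each engine started from zero at launch. Bear in mind that Google's 20% covers only its mobile app and Android, so this is not a like-for-like comparison.

```python
# Figures quoted above; shares are fractions of search volume.
BAIDU_SHARE_2014 = 0.10   # "around 10% of Baidu search queries are done by voice"
NG_FORECAST_SHARE = 0.50  # "at least 50%" (images and speech combined)
FORECAST_YEARS = 5        # "in five years' time"


def implied_annual_growth(start_share, end_share, years):
    """Compound yearly growth rate needed to move from start_share to end_share."""
    return (end_share / start_share) ** (1 / years) - 1


def average_points_per_year(share, years):
    """Average percentage points of share gained per year (assumes a start from zero)."""
    return share * 100 / years


ng_rate = implied_annual_growth(BAIDU_SHARE_2014, NG_FORECAST_SHARE, FORECAST_YEARS)
baidu_pace = average_points_per_year(0.10, 2014 - 2012)   # launch 2012, 10% in 2014
google_pace = average_points_per_year(0.20, 2016 - 2008)  # launch 2008, 20% in 2016

print(f"Growth implied by Ng's forecast: {ng_rate:.0%} per year")
print(f"Baidu average pace: {baidu_pace:.1f} points per year")
print(f"Google average pace: {google_pace:.1f} points per year")
```

On these assumptions, Ng's forecast implies voice and image share growing by roughly 38% a year, every year, for five years, while Baidu's early pace of about 5 points a year was around double Google's 2.5.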
How voice in China differs from the west
Why was voice search in China growing so rapidly? Fast Company’s feature offers some context: the rapid adoption of voice search resulted from a combination of factors, including a massive population of internet users who were coming online for the very first time, and specifically, connecting to the internet via mobile. Not having been brought up on text-based search on desktop computers, China’s internet users found it natural to start searching via voice instead.
“The primary catalyst for a step-change in how search works today is the rise of smartphones and tablets, which are taking away more and more market share from traditional PCs. This is particularly evident in countries like Baidu’s birthplace China, where many users are connecting to the Internet for the first time–primarily by way of mobile devices. Of the 632 million Internet users in China as of June this year, 83% accessed the web with a mobile phone, according to figures from China Internet Network Information Center.
“Most of these users haven’t organically learned how to use text-based search as it’s evolved from Ask Jeeves to DuckDuckGo over the past several years. That presents an opportunity to re-think basic assumptions about search, and it extends beyond developing markets.”
Voice is also in many ways a quicker and more convenient method of inputting Chinese characters than typing. An article by J. Walter Thompson Intelligence on ‘China’s voice tech tipping point’ posits that “fully functional voice technology would find an instant market” in China, before adding that “Spoken Chinese, however, has proven difficult for computers to decipher.”
Baidu is on top of this challenge as well, though, and has focused heavily on improving its speech recognition accuracy: in 2017, as Google celebrated reaching the coveted 95% accuracy threshold for voice recognition (putting it on a par with human listeners), Baidu was at 97% – and is gunning for 99%.
As speech-recognition accuracy goes from 95% to 99%, we'll go from barely using it to using all the time! https://t.co/TfjqJLDTPJ
— Andrew Ng (@AndrewYNg) December 16, 2016
While Google has undoubtedly made great strides with improving its voice experience in recent years, Baidu’s pursuit of near-perfect voice accuracy has been relentless – because Baidu knows that to succeed in China, voice is paramount.
Is Baidu on track to fulfil Ng’s prediction?
Now that we’ve established the context of Andrew Ng’s voice search prediction – that it was made relatively soon after the launch of voice search in China, in the midst of a rapid surge in adoption – the obvious question is: will it come true for Baidu?
Ng’s prediction hinges on a big “if”: “If progress continues at its current rate”. By “progress”, Ng might have been referring to pure adoption of voice search in China, but he was likely also talking about progress with Baidu’s voice technology as a whole: speech processing, deep learning and artificial intelligence.
Statistics about voice search in China are thin on the ground, but there has been no shortage of announcements regarding Baidu’s voice capabilities. Baidu launched its first voice-activated smart speaker, the Raven H, in 2017, quickly following it up with two more speakers at different price points several months later. It has its own conversational AI assistant, DuerOS, which reached an install base of 100 million devices earlier this month.
And perhaps most impressively – or alarmingly – of all, Baidu has revealed that its Deep Voice text-to-speech system can imitate a person’s voice accurately using just three seconds of sample data. All of this stems from Baidu’s extensive research and investment into voice recognition and AI.
Ng predicted on Twitter that once voice recognition technology reaches the 99% accuracy threshold, “we’ll go from barely using it to using [it] all the time”. It’s not clear whether he was referring specifically to China or to the whole world, but by all accounts Baidu is significantly closer than Google to reaching the 99% mark – it could even reach it by 2020.
Once it does, maybe we’ll hear an announcement that voice queries have reached 50% of search volume on Baidu. Or maybe not. The other important thing to note about Andrew Ng’s original prediction is that it was made in 2014, when the world of voice was very different.
In 2014, the Amazon Echo had yet to launch, and search seemed like the most interesting thing you could do with voice. Now, priorities have shifted. There’s a reason that Google hasn’t released an updated voice search statistic since 2016, and that Baidu’s voice announcements have all been about smart speakers and AI: the focus, even in China, is not on voice search any more.
Only the marketing industry is still stuck on this statistic, and it’s starting to look increasingly ridiculous with every repetition.
So can we please, as an industry, move on?