For all the attention lavished on Siri, the often-clever voice-driven virtual assistant on Apple’s iPhone, Google’s mobile search app has lately impressed a lot more people. That’s partly thanks to Google Now, the virtual assistant built into that app, which some observers think is more useful than Siri.
But the success of Google’s mobile search stems at least as much from a big improvement over the past year in Google’s speech recognition efforts. That’s the result of research by legendary Google Fellow Jeff Dean and others in applying a fast-emerging branch of artificial intelligence called deep learning to recognizing speech in all its ambiguity and in noisy environments. Replacing part of Google’s speech recognition system last July with one based on deep learning cut error rates by 25% in one fell swoop.
As I wrote in a recent article on deep learning neural networks, the technology tries to emulate the way layers of neurons in the human neocortex recognize patterns and ultimately engage in what we call thinking. Improvements in mathematical formulas coupled with the rise of powerful networks of computers are helping machines get noticeably closer to humans in their ability to recognize speech and images.
Making the most of Google’s vast network of computers has been Dean’s specialty since he joined Google an almost inconceivable 14 years ago, when the company employed only 20 people. He helped create a programming tool called MapReduce that allowed software developers to process massive amounts of data across many computers, as well as BigTable, a distributed storage system that can handle millions of gigabytes of data (known in technical terms as “bazillions”). Although conceptual breakthroughs in neural networks have a huge role in deep learning’s success, sheer computer power is what has made deep learning practical in a Big Data world.
Dean’s extreme geekitude showed in a recent interview, when he gamely tried to help me understand how deep learning works, in much more detail than most of you will ever want to know. Nonetheless, I’ll warn you that some of this edited interview still gets pretty deep, as it were. Even more than the work of Ray Kurzweil, who joined Google recently to improve the ability of computers to understand natural language, Dean’s work is focused on more basic advances in how to use smart computer and network design to make AI more effective, not on the application to advertising.
Still, Google voice search seems certain to change the way most people find things, including products. So it won’t hurt for marketers and users alike to understand a bit more about how this technology will transform marketing, which after all boils down to how to connect people with products and services they’re looking for. Here’s a deeply edited version of our conversation:
Q: What’s “deep” about deep learning?
A: “Deep” typically refers to the fact that you have many layers of neurons in neural networks. It’s been very hard to train networks with many layers. In the last five years, people have come up with techniques that allow training of networks with more layers than, say, three. So in a sense it’s trying to model how human neurons respond to stimuli.
We’re trying to model not at the detailed molecular level, but abstractly we understand there are these lower-level neurons that construct very primitive features, and as you go higher up in the network, it’s learning more and more complicated features.
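For readers who think in code, here is a minimal sketch of what “many layers” means in practice. It is not Google’s system: the layer sizes, the random weights, and the helper names `init_layers` and `forward` are all assumptions made for illustration. Each layer simply computes new features from the output of the layer below it.

```python
import numpy as np

def relu(x):
    # Simple nonlinearity: keep positive values, zero out the rest.
    return np.maximum(0.0, x)

def init_layers(sizes, seed=0):
    # One (weights, bias) pair per layer; the sizes are illustrative assumptions.
    rng = np.random.default_rng(seed)
    return [(rng.normal(0.0, 0.1, size=(n_in, n_out)), np.zeros(n_out))
            for n_in, n_out in zip(sizes[:-1], sizes[1:])]

def forward(x, layers):
    # Feed the input upward; each layer sees only the features
    # produced by the layer directly below it.
    activations = [x]
    for W, b in layers:
        activations.append(relu(activations[-1] @ W + b))
    return activations

# A "deep" stack: four layers of weights, i.e. more than three.
layer_sizes = [64, 32, 16, 8, 4]
layers = init_layers(layer_sizes)

# Five made-up input vectors standing in for raw pixels or audio features.
batch = np.random.default_rng(1).normal(size=(5, 64))
activations = forward(batch, layers)

for depth, a in enumerate(activations):
    print(f"layer {depth}: {a.shape[1]} features per example")
```

With random weights the features are meaningless; training is what turns the lower layers into detectors of primitive patterns and the higher layers into detectors of more complicated ones.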
Q: What has happened in the last five years to make deep learning a more widely used technique?
A: In the last few years, people have figured out how to do layer-by-layer pre-training [of the neural network]. So you can train much deeper networks than was possible before. The second thing is the use of unsupervised training, so you can actually feed it any image you have, even if you don’t know what’s in it. That really expands the set of data you can consider because now, it’s any image you get your hands on, not just one where you have a true label of what that image is [such as an image you know is a cheetah]. The third thing is just more computational power. …
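Dean doesn’t spell out the mechanics here, so the following is only a rough sketch of one common form of layer-by-layer pre-training on unlabeled data. It assumes each layer is a small autoencoder (a network trained to reconstruct its own input); other building blocks have been used for the same purpose. The data, the sizes, and the function names `train_autoencoder` and `pretrain_stack` are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_autoencoder(data, n_hidden, epochs=200, lr=0.3, seed=0):
    """Train one tied-weight sigmoid autoencoder with plain gradient descent.
    No labels are used: the target is the input itself."""
    rng = np.random.default_rng(seed)
    n, n_visible = data.shape
    W = rng.normal(0.0, 0.1, size=(n_visible, n_hidden))
    b_h = np.zeros(n_hidden)
    b_v = np.zeros(n_visible)
    losses = []
    for _ in range(epochs):
        h = sigmoid(data @ W + b_h)        # encode
        r = sigmoid(h @ W.T + b_v)         # decode (reconstruct the input)
        err = r - data
        losses.append(float(np.mean(err ** 2)))
        # Gradients of the squared reconstruction error (up to a constant factor).
        d_r = err * r * (1.0 - r)
        d_h = (d_r @ W) * h * (1.0 - h)
        W -= lr * (data.T @ d_h + d_r.T @ h) / n
        b_h -= lr * d_h.mean(axis=0)
        b_v -= lr * d_r.mean(axis=0)
    return W, b_h, losses

def pretrain_stack(data, hidden_sizes, **kwargs):
    """Greedy pre-training: train layer 1 on the raw data, then train
    layer 2 on layer 1's features, and so on up the stack."""
    layers, history = [], []
    inputs = data
    for n_hidden in hidden_sizes:
        W, b_h, losses = train_autoencoder(inputs, n_hidden, **kwargs)
        layers.append((W, b_h))
        history.append(losses)
        inputs = sigmoid(inputs @ W + b_h)  # features fed to the next layer
    return layers, history

# Made-up unlabeled data: noisy copies of three hidden prototype patterns.
rng = np.random.default_rng(42)
prototypes = (rng.random((3, 20)) > 0.5).astype(float)
data = np.clip(prototypes[rng.integers(0, 3, size=60)]
               + 0.05 * rng.normal(size=(60, 20)), 0.0, 1.0)

layers, history = pretrain_stack(data, hidden_sizes=[10, 5])
for i, losses in enumerate(history, start=1):
    print(f"layer {i}: reconstruction error {losses[0]:.4f} -> {losses[-1]:.4f}")
```

In a typical setup, the pre-trained weights would then serve as the starting point for ordinary supervised training on whatever labeled examples are available.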