More than 13,000 artificial intelligence mavens flocked to Vancouver this week for the world’s leading academic AI conference, NeurIPS. The venue included a maze of colorful corporate booths aiming to lure recruits for projects like software that plays doctor. Google handed out free luggage scales and socks depicting the colorful bikes employees ride on its campus while IBM offered hats emblazoned with “I ❤️A👁.”
Tuesday night, Google and Uber hosted well-lubricated, over-subscribed parties. At a bleary 8:30 the next morning, one of Google’s top researchers gave a keynote with a sobering message about AI’s future.
Blaise Aguera y Arcas praised the revolutionary technique known as deep learning that has seen teams like his get phones to recognize faces and voices. He also lamented the limitations of that technology, which involves designing software called artificial neural networks that can get better at a specific task by experience or seeing labeled examples of correct answers.
“We’re kind of like the dog who caught the car,” Aguera y Arcas said. Deep learning has rapidly knocked down some longstanding challenges in AI—but doesn’t immediately seem well suited to many that remain. Problems that involve reasoning or social intelligence, such as weighing up a potential hire in the way a human would, are still out of reach, he said. “All of the models that we have learned how to train are about passing a test or winning a game with a score [but] so many things that intelligences do aren’t covered by that rubric at all,” he said.
Hours later, one of the three researchers seen as the godfathers of deep learning also pointed to the limitations of the technology he had helped bring into the world. Yoshua Bengio, director of Mila, an AI institute in Montreal, recently shared the highest prize in computing with two other researchers for starting the deep learning revolution. But he noted that the technique yields highly specialized results; a system trained to show superhuman performance at one videogame is incapable of playing any other. “We have machines that learn in a very narrow way,” Bengio said. “They need much more data to learn a task than human examples of intelligence, and they still make stupid mistakes.”
Bengio and Aguera y Arcas both urged NeurIPS attendees to think more about the biological roots of natural intelligence. Aguera y Arcas showed results from experiments in which simulated bacteria adapted to seek food and communicate through a form of artificial evolution. Bengio discussed early work on making deep learning systems flexible enough to handle situations very different from those they were trained on, and made an analogy to how humans can handle new scenarios like driving in a different city or country.
In an earlier deep learning article, we talked about how inference workloads—the use of already-trained neural networks to analyze data—can run on fairly cheap hardware, but running the training workload that the neural network “learns” on is orders of magnitude more expensive.
In particular, the more potential inputs you have to an algorithm, the more out of control your scaling problem gets when analyzing its problem space. This is where MACH, a research project authored by Rice University’s Tharun Medini and Anshumali Shrivastava, comes in. MACH is an acronym for Merged Average Classifiers via Hashing, and according to lead researcher Shrivastava, “[its] training times are about 7-10 times faster, and… memory footprints are 2-4 times smaller” than those of previous large-scale deep learning techniques.
In describing the scale of extreme classification problems, Medini refers to online shopping search queries, noting that “there are easily more than 100 million products online.” This is, if anything, conservative—one data company claimed Amazon US alone sold 606 million separate products, with the entire company offering more than three billion products worldwide. Another company reckons the US product count at 353 million. Medini continues, “a neural network that takes search input and predicts from 100 million outputs, or products, will typically end up with about 2,000 parameters per product. So you multiply those, and the final layer of the neural network is 200 billion parameters … [and] I’m talking about a very, very dead simple neural network model.”
At this scale, a supercomputer would likely need terabytes of working memory just to store the model. The memory problem gets even worse when you bring GPUs into the picture. GPUs can process neural network workloads orders of magnitude faster than general purpose CPUs can, but each GPU has a relatively small amount of RAM—even the most expensive Nvidia Tesla GPUs only have 32GB of RAM. Medini says, “training such a model is prohibitive due to massive inter-GPU communication.”
Instead of training on the entire 100 million outcomes—product purchases, in this example—Mach divides them into three “buckets,” each containing 33.3 million randomly selected outcomes. Now, MACH creates another “world,” and in that world, the 100 million outcomes are again randomly sorted into three buckets. Crucially, the random sorting is separate in World One and World Two—they each have the same 100 million outcomes, but their random distribution into buckets is different for each world.
With each world instantiated, a search is fed to both a “world one” classifier and a “world two” classifier, with only three possible outcomes apiece. “What is this person thinking about?” asks Shrivastava. “The most probable class is something that is common between these two buckets.”
At this point, there are nine possible outcomes—three buckets in World One times three buckets in World Two. But MACH only needed to create six classes—World One’s three buckets plus World Two’s three buckets—to model that nine-outcome search space. This advantage improves as more “worlds” are created; a three-world approach produces 27 outcomes from only nine created classes, a four-world setup gives 81 outcomes from 12 classes, and so forth. “I am paying a cost linearly, and I am getting an exponential improvement,” Shrivastava says.
Better yet, MACH lends itself better to distributed computing on smaller individual instances. The worlds “don’t even have to talk to one another,” Medini says. “In principle, you could train each [world] on a single GPU, which is something you could never do with a non-independent approach.” In the real world, the researchers applied MACH to a 49 million product Amazon training database, randomly sorting it into 10,000 buckets in each of 32 separate worlds. That reduced the required parameters in the model more than an order of magnitude—and according to Medini, training the model required both less time and less memory than some of the best reported training times on models with comparable parameters.
Of course, this wouldn’t be an Ars article on deep learning if we didn’t close it out with a cynical reminder about unintended consequences. The unspoken reality is that the neural network isn’t actually learning to show shoppers what they asked for. Instead, it’s learning how to turn queries into purchases. The neural network doesn’t know or care what the human was actually searching for; it just has an idea what that human is most likely to buy—and without sufficient oversight, systems trained to increase outcome probabilities this way can end up suggesting baby products to women who’ve suffered miscarriages, or worse.