These are troubled times for artificial intelligence developers: the field of machine learning, which relies on users’ personal information for training, has never held such promise, yet data regulation has never been stricter, nor public scrutiny of digital privacy more intense.
The 2018 Cambridge Analytica scandal was a watershed moment: personal data from 87 million Facebook users were covertly used for political campaigning. This event, along with frequent news of security breaches in social networks, operating systems and cloud servers, has eroded public trust. Earlier this year, Google admitted that its employees listen to recordings of conversations between users and the company’s smart speaker. Technologists are on a quest for privacy-protecting artificial intelligence, which has led to the proposal of new techniques like federated learning. Will they work?
A truly ‘smart’ device is not merely equipped with instructions to carry out certain tasks; it is programmed to learn how to solve them. Machine learning in fact enables a computer to make decisions without knowing why it makes them: the software simply guesses the most likely correct answer based on its observation of hundreds, thousands or millions of similar scenarios. Predictably, the ‘wisest’ algorithms have been trained on the largest datasets, normally composed of decisions taken by humans, who are gifted with intuition and ingenuity. But there is another relevant variable besides the number of analysed examples: context is crucial, too.
“We could train a language model for Gboard—Android smartphones’ predictive keyboard—on, say, Wikipedia data, but it would be terrible, because people don’t type text messages anything like they write Wikipedia articles,” says Brendan McMahan, a senior researcher at Google AI, the company’s artificial intelligence branch. Due to this phenomenon, machine learning software has become hungry for data from the very users it purports to serve.
Until recently, consumers would blithely volunteer such information. That was before the Cambridge Analytica scandal, before Mark Zuckerberg’s hearing in the United States Congress on digital privacy and, in Europe, before the General Data Protection Regulation (GDPR). Today, most consumers have realised that free-of-charge digital services often trade covertly in the currency of personal data. Take FaceApp, software designed to digitally age smartphone portraits for some harmless fun. In July, the program sparked worldwide controversy after allegations were made that it stole all personal images from users’ phones. This turned out to be false, but media coverage of FaceApp’s vague privacy policies—similar to those of popular sites like Twitter—was enough to shine a light on the unsolved issue of digital privacy.
Cryptography and federated learning
How is one to train a machine learning model in the age of privacy? One option is to compute only with encrypted data, using a technique called homomorphic encryption. This enables a central server to collect and process personal data in a format that is unreadable to anyone supervising the model. However, cryptography is “incredibly slow for complex functions” like artificial intelligence training, according to digital privacy expert Andrew Trask of the University of Oxford (UK).
McMahan and his team at Google AI have proposed a decentralised system they call federated learning, which trains algorithms directly on users’ devices. “It’s not a silver bullet. It doesn’t apply to all problems, let alone solve all problems”, he cautioned at a recent global AI summit. However, their solution is a viable and attractive option which has proven its worth since it was rolled out in 2016.
Federated learning can train an artificial intelligence while dispensing with the need to gather private information on a centralised server. In fact, users’ data never leave their device at all—instead, the program comes to them. The Google AI team has spent years improving its predictive keyboard like this. First, users receive the latest version of Gboard as a software update. The model trains on each smartphone while private text messages are typed out. The result of every training session is then condensed into a small update which is sent back to Google’s central server, leaving no trace of the analysed texts. Only then does the model aggregate all the lessons it has learned to produce an upgraded program.
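The round-trip described above can be sketched in a few lines of code. The following toy federated-averaging loop is purely illustrative: the single-parameter linear model, the data and the learning rate are all invented for this example, and real systems train far larger models. The key property it demonstrates is that each device computes an update on its own data, and only that update, never the data itself, travels to the server to be averaged.

```python
def local_update(w, private_data, lr=0.01):
    """One gradient step of the model y = w*x on a device's private (x, y) pairs."""
    grad = sum(2 * (w * x - y) * x for x, y in private_data) / len(private_data)
    return w - lr * grad  # only this single number leaves the device

def federated_round(w, devices):
    """The server averages the devices' updated weights; raw data never moves."""
    updates = [local_update(w, data) for data in devices]
    return sum(updates) / len(updates)

# Three devices, each holding a few private samples of a relationship y ≈ 2x
devices = [
    [(1.0, 2.0), (2.0, 4.1)],
    [(3.0, 5.9), (1.5, 3.0)],
    [(2.5, 5.1)],
]

w = 0.0
for _ in range(200):
    w = federated_round(w, devices)
print(round(w, 2))  # converges near 2.0, the fit across all devices' data
```

The server ends up with a model fitted to everyone’s data while having seen nothing but anonymised weight updates.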
The company claims these download, training and upload sessions don’t drain the battery or slow down the phone, because they only happen on “eligible” devices, which must be idle, charging and on WiFi. It remains to be seen, however, whether this process can be managed on the growing number of connected devices, most of which have less computing power than a smartphone.
Beyond cell phones
Every day, so-called ‘smart’ objects join the Internet of Things: watches, speakers, household appliances and even clothes. On top of such consumer gadgets, the developing field of neurotechnology is already producing brain-machine interfaces which connect to the human nervous system. These devices gather personal data—thoughts, encoded as brainwaves—and send those to a central server. Could federated learning protect such sensitive physiological information?
“With the small, deployable processors that are being developed, I wouldn’t say hardware is really the bottleneck anymore. It’s challenging, but certainly not a blocker,” says Trask, who in 2017 founded OpenMined, a project to promote decentralised machine learning. “Far more challenging and harder to upgrade is the network infrastructure required to send models to remote locations,” he warns. According to Trask, the advent of 5G is “the thing to pay attention to” in this field.
However, federated learning alone does not guarantee anonymity, according to Morten Dahl, an expert in privacy-protecting machine learning at Dropout Labs (Paris, France). In order to mask who trained which version of the model (out of all the ones aggregated by the central server), an additional privacy mechanism called secure multi-party computation is required. This encryption technique is already used, for instance, by banks when putting together financial data from different sources, or at research hospitals to aggregate patients’ medical histories. Google AI employs it in the final step of its federated training for Gboard.
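One common way to realise this kind of masking, sketched below as an illustrative assumption rather than Google’s actual protocol, is pairwise additive masking: each pair of clients agrees on a random mask that one adds to its update and the other subtracts, so every individual update the server receives looks random, while the masks cancel exactly in the sum.

```python
import random

def mask_updates(updates, seed=0):
    """Scramble each client's update with pairwise masks that cancel in the sum."""
    rng = random.Random(seed)  # stands in for masks the client pairs agree on
    masked = list(updates)
    n = len(masked)
    for i in range(n):
        for j in range(i + 1, n):
            m = rng.uniform(-100, 100)  # mask shared by clients i and j
            masked[i] += m              # client i adds the mask
            masked[j] -= m              # client j subtracts the same mask
    return masked

updates = [0.8, 1.2, 0.5]        # private per-device model updates
masked = mask_updates(updates)   # the server only ever sees these
print(round(sum(masked), 6), round(sum(updates), 6))  # both sums are 2.5
```

The server learns the aggregate it needs for the next model version, but no longer knows which device contributed which update.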
A final layer of security called differential privacy guarantees that the original data can’t be reverse engineered from the final trained model. Both Dahl and Trask agree that these three techniques together—federated learning, secure computation and differential privacy—will be at the heart of artificial intelligence development. “Federated learning will dominate the market first because it’s simple,” says Trask. “Together with the other two methods, it is the fastest, most secure and most widely deployable option on the market.”

Bruno Martín