These days, companies are using more and more of our data to improve their products and services. And it makes a lot of sense. If you think about it, it’s better to measure what your users like than to guess and build products that no one wants to use. However, this is also very dangerous. It undermines our privacy, because the collected data can be quite sensitive, causing harm if it were to leak. So companies love data to improve their products, but we as users want to protect our privacy. These contradicting needs can be satisfied with a technique called differential privacy. It allows companies to collect information about their users without compromising the privacy of an individual.

But let’s first take a look at why we would go through all this trouble. Companies can just take our data, remove our names, and call it a day, right? Well, not quite. First of all, this anonymization process usually happens on the servers of the companies that collect your data, so you have to trust them to really remove the identifiable records. And secondly, how anonymous is anonymized data really?

In 2006, Netflix started a competition called the Netflix Prize. Competing teams had to create an algorithm that could predict how someone would rate a movie. To help with this challenge, Netflix provided a dataset containing over 100 million ratings submitted by over 480 thousand users for more than 17 thousand movies. Netflix, of course, anonymized this dataset by removing the names of users and by replacing some ratings with fake and random ratings. Even though that sounds pretty anonymous, it actually wasn’t. Two computer scientists from the University of Texas published a paper in 2008 that said that they had successfully identified people from this dataset by combining it with data from IMDb. These types of attacks are called linkage attacks, and they happen when pieces of seemingly anonymous data can be combined to reveal real identities.
Another, more creepy example would be the case of the governor of Massachusetts. In the mid-1990s, the state’s Group Insurance Commission decided to publish the hospital visits of state employees. They anonymized this data by removing names, addresses, and other fields that could identify people. However, computer scientist Latanya Sweeney decided to show how easy it was to reverse this. She combined the published health records with voter registration records and simply reduced the list. There was only one person in the medical data that lived in the same zip code, had the same gender, and the same date of birth as the governor, thus exposing his medical records. In a later paper, she noted that eighty-seven percent of all Americans can be identified with only three pieces of information: zip code, birthday, and gender. So much for anonymity. Clearly, this technique isn’t enough to protect our privacy.

Differential privacy, on the other hand, neutralizes these types of attacks. To explain how it works, let’s assume that we want to get a view on how many people do something embarrassing like, for example, picking their nose. To do that, we set up a service with the question “Do you pick your nose?” and with yes and no buttons below it. We collect all these answers on a server somewhere, but instead of sending the real answers, we’re going to introduce some noise. Let’s say that Bob is a nose picker and that he clicks on the yes button. Before we send his response to the server, our differential privacy algorithm will flip a coin. If it’s heads, the algorithm sends Bob’s real answer to our server. If it’s tails, the algorithm flips a second coin and sends yes if it’s tails or no if it’s heads.

Back on our server, we see the data coming in, but because of the added noise, we can’t really trust individual records. Our record for Bob might say that he’s a nose picker, but someone who isn’t a nose picker still has a one in four chance of being recorded as a yes, simply as the effect of the coin toss that the algorithm performed. This is plausible deniability. You can’t be sure of people’s answers, so you can’t judge them on it. This is particularly interesting if you’re collecting data about illegal behavior such as drug use, for instance. Now, because you know how the noise is distributed, you can compensate for it and end up with a fairly accurate view on how many people are actually nose pickers.

Now, of course, the coin toss algorithm is just an example and a bit too simple. Real-world algorithms use the Laplace distribution to spread data over a larger range and increase the level of anonymity. In the paper The Algorithmic Foundations of Differential Privacy, it is noted that differential privacy promises that the outcome of a survey will stay essentially the same whether or not you participate in it. Therefore, you don’t have any reason not to participate in the survey. You don’t have to fear that your data, in this case your nose-picking habits, will be exposed.

All right, so now we know what differential privacy is and how it works, but let’s take a look at who is already using it. Apple and Google are two of the biggest companies who are currently using it. Apple started rolling out differential privacy in iOS 10 and macOS Sierra. They use it to collect data on what websites are using a lot of power, what images are used in a certain context, and what words people are typing that aren’t in the keyboard’s dictionary. Apple’s implementation of differential privacy is documented, but not open source. Google, on the other hand, has been developing an open-source library for this. They use it in Chrome to do studies on browser malware and in Maps to collect data about traffic in large cities.
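To make the coin toss scheme from the nose-picking example a bit more concrete, here is a minimal Python sketch of it. It is only an illustration under some assumptions: the function names `randomized_response` and `estimate_true_rate`, the simulated population of 100,000 people, and the 30 percent true rate are all made up for this example and don’t come from any real library or product. The correction step relies on the fact that, with two fair coins, the chance of a yes being reported is one half times the true rate plus one quarter.

```python
import random
from typing import Sequence


def randomized_response(truth: bool, rng: random.Random) -> bool:
    """Report an answer using the two-coin scheme (hypothetical helper)."""
    if rng.random() < 0.5:
        # First coin is heads: send the real answer.
        return truth
    # First coin is tails: a second fair coin picks yes or no at random.
    return rng.random() < 0.5


def estimate_true_rate(reports: Sequence[bool]) -> float:
    """Compensate for the noise (hypothetical helper).

    P(report yes) = 0.5 * p + 0.25, so p = 2 * observed - 0.5.
    The result is clamped to the valid range [0, 1].
    """
    observed = sum(reports) / len(reports)
    return min(1.0, max(0.0, 2 * observed - 0.5))


if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed so the run is repeatable
    true_rate = 0.30         # assumed share of nose pickers, for illustration
    population = [rng.random() < true_rate for _ in range(100_000)]
    reports = [randomized_response(answer, rng) for answer in population]

    print(f"observed yes rate:   {sum(reports) / len(reports):.3f}")
    print(f"estimated true rate: {estimate_true_rate(reports):.3f}")
```

No single entry in `reports` can be trusted, but the estimate over the whole group should land close to the assumed true rate. If you rerun it with only a few dozen people instead of 100,000, the estimate tends to swing much more, because the noise no longer averages out.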
But overall, there aren’t many companies who have adopted differential privacy, and those who have only use it for a small percentage of their data collection. So why is that? Well, for starters, differential privacy is only usable for large datasets. Because of the injected noise, using it on a tiny dataset will likely result in inaccurate data. And then there is also the complexity of implementing it. It’s a lot more difficult to implement differential privacy compared to just reporting the real data of users and anonymizing it in the old-fashioned way.

So the bottom line is that differential privacy can help companies to learn more about a group of users without compromising the privacy of an individual within that group. Adoption, however, is still limited, but it’s clear that there is an increasing need for ways to collect data about people without compromising their privacy.

That’s it for this video. If you want to learn more, head over to the Simply Explained playlist to watch more videos. And as always, thank you very much for watching.