This was triggered by one of those “Top ten concerts” memes that went around Facebook, where you post your 9 favourite concerts and a 10th that is “a lie” – i.e. one you haven’t been to. There was a similar idea with a twist asking people: “What’s something you’ve done that you’re reasonably confident you’re the only person on my friends list who has done it?”
On the face of it the “Top ten concerts” post doesn’t seem to be revealing anything overly personal, but it got me pondering: what might a Machine Learning algorithm infer from these kinds of top 10 lists? It is rich static data, and there are no complex feedback loops, so it is ideal territory for Machine Learning. Look on this as a thought experiment to understand what an apparently innocent list might disclose.
Machine Learning is one of the most common forms of what is popularly called Artificial Intelligence (AI), though asking if Machine Learning is AI is a good way to start an argument with a specialist in the area (and that is before you start talking about the difference between Machine Learning and Deep Learning). All of that is a topic for another day – AI pedants might want to look away right now…
A raw list of artists isn’t much for an algorithm to get its teeth into, so we need to enrich it with some additional data. For each artist it is fairly straightforward to obtain genre, countries where the artist had hits, the lyrical content, the era they were actively touring, and also grab any “related artists.” We can get all of this from public data sources (by way of example, this Quora question lists many Music APIs). This all helps with getting ready to build connections, and train the classifiers in the Machine Learning algorithms so that we can start making some inferences.
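To make the enrichment step concrete, here is a minimal sketch in Python. The `ARTIST_METADATA` table, the artist names and the `enrich` function are all made up for illustration; in practice the metadata would come from one of the public music APIs, and the feature set would be far richer than genre, era and country.

```python
from collections import Counter

# Hypothetical metadata table, standing in for a lookup against a public music API.
ARTIST_METADATA = {
    "Artist A": {"genre": "rock", "era": "1980s", "countries": ["UK", "US"]},
    "Artist B": {"genre": "rock", "era": "1990s", "countries": ["UK"]},
    "Artist C": {"genre": "pop", "era": "2010s", "countries": ["US"]},
}


def enrich(artists):
    """Turn a raw list of artist names into counted features for a classifier."""
    features = Counter()
    for name in artists:
        meta = ARTIST_METADATA.get(name)
        if meta is None:
            # No metadata found: record the gap rather than guessing.
            features["unknown_artist"] += 1
            continue
        features["genre=" + meta["genre"]] += 1
        features["era=" + meta["era"]] += 1
        for country in meta["countries"]:
            features["country=" + country] += 1
    return features


features = enrich(["Artist A", "Artist B", "Nobody"])
print(dict(features))
```

Counting features like this turns each top 10 list into a small profile (how much rock, how much 1980s, how much UK) that can be compared across users.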
Now those top 10 lists are starting to give us some interesting information. Obviously we can see what genre of music you like (which, as it turns out, is a good indication of your political leanings and many other things). We also get an indication of your geographic location (based on where the artists have charted and toured), an indication of your gender, and an indicator of likely age. Some of these things we may already have from your Facebook profile, but it is always useful to have data from multiple sources, especially if you are someone who has set up your Facebook profile to be ‘mysterious’. Oh, and don’t worry if you took your kids to a boyband concert and you added that to your list: the algorithms can filter that noise out, and may even be able to pick out your children’s ages too.
Now, about that lie on the list – the 10th artist. Initially I thought this might ruin things, but actually it just introduces a small error into the data (10%), which isn’t destructive. It turns out that people apply their own social filters to the bands they choose to list, even for that tenth one. Usually the ‘lie’ is picked up quite easily as it fails to match on genre, era or geography, or it just simply doesn’t fit into the pattern of other lists in the dataset. The lie is a signal too and, with a bit of extra time and investigation, I suspect someone’s ‘lie’ item might actually tell you more about them than the real items do.
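One simple way to picture how the ‘lie’ gets picked up is as an outlier check: score each artist by how much its tags overlap with everyone else on the list, and flag the one that fits least. The sketch below assumes each artist has already been reduced to a set of tags (genre, era, country); the band names, tags and the `likely_lie` function are hypothetical, and a real system would lean on far more data than one list.

```python
def jaccard(a, b):
    """Overlap between two tag sets: 0.0 (nothing shared) to 1.0 (identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def likely_lie(tagged):
    """Return the artist whose tags are, on average, least like the others'.

    tagged maps artist name -> set of tags.
    """
    if len(tagged) < 2:
        raise ValueError("need at least two artists to compare")

    def mean_similarity(name):
        others = [tags for other, tags in tagged.items() if other != name]
        return sum(jaccard(tagged[name], tags) for tags in others) / len(others)

    return min(tagged, key=mean_similarity)


# Made-up list: three similar rock acts and one that does not fit the pattern.
concerts = {
    "Band 1": {"rock", "1980s", "UK"},
    "Band 2": {"rock", "1990s", "UK"},
    "Band 3": {"rock", "1980s", "US"},
    "Band 4": {"opera", "1950s", "IT"},
}
print(likely_lie(concerts))
```

Of course the odd one out is only a guess at the lie. It could equally be the one concert someone was dragged along to, which is a signal of a different kind.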
We also have some extra data we could add for training the algorithms: The data from users with open privacy settings on Facebook. This could be used to train the system and then be applied to the data from the more private users we are looking to make predictions from, who have posted answers as public status updates. The patterns learnt can be used to predict the gaps for users who have stricter privacy settings.
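As a toy illustration of learning from open profiles and filling gaps for private ones, the sketch below lets each artist ‘vote’ for the age bands of the public users who listed it. The names, age bands and the `train` and `predict` functions are invented for the example; a real classifier would be more sophisticated, but the flow of information is the same.

```python
from collections import Counter, defaultdict


def train(public_users):
    """Learn, for each artist, how often each label appears among open profiles.

    public_users is a list of (artists, label) pairs.
    """
    votes = defaultdict(Counter)
    for artists, label in public_users:
        for artist in artists:
            votes[artist][label] += 1
    return votes


def predict(votes, artists):
    """Guess the label for a private user by summing the votes of their artists."""
    tally = Counter()
    for artist in artists:
        tally.update(votes.get(artist, Counter()))
    if not tally:
        return None  # none of the artists were seen in training
    return tally.most_common(1)[0][0]


# Made-up training data: users with open profiles who state an age band.
public_users = [
    (["Artist A", "Artist B"], "40-49"),
    (["Artist A", "Artist C"], "40-49"),
    (["Artist D", "Artist E"], "18-29"),
    (["Artist D", "Artist B"], "18-29"),
]
votes = train(public_users)

# A private user who has only posted their list as a public status update.
print(predict(votes, ["Artist A", "Artist C", "Artist E"]))
```

The private user never disclosed an age band; it is inferred entirely from what other people chose to share.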
At this point, it is worth clearing up a misunderstanding about predictive analytics: There is a dangerous line in moving from “general” predictions to predictions about a specific individual, and I will come back to that in a moment. The likelihood that the top 10 list will give us an exact prediction of age, gender, location and political leaning for a specific user is fairly low. However, the predictive power across a group of users with a specific type of top 10 list becomes quite high. More than that, we can analyse the lists and pick out artist names that have particularly strong predictive power. We don’t need 100% accurate answers for things like targeting ads – If we can build something good enough to get a 5-10 percent increase in yield we have our payoff.
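A little arithmetic shows why a prediction that is often wrong about individuals can still pay off across a group. All the numbers below are illustrative assumptions, not measurements: half the audience is in the segment an advert suits, people in the segment respond at 3% and everyone else at 1%, and the classifier is right about only 60% of the individuals it picks.

```python
def response_rate(share_in_segment, rate_in_segment, rate_outside):
    """Expected response rate when a given share of the viewers are in the target segment."""
    return share_in_segment * rate_in_segment + (1 - share_in_segment) * rate_outside


# Illustrative assumptions only.
RATE_IN_SEGMENT = 0.03   # response rate for people the advert suits
RATE_OUTSIDE = 0.01      # response rate for everyone else
SEGMENT_SHARE = 0.5      # share of the whole audience in the segment
PRECISION = 0.6          # share of targeted users who really are in the segment

untargeted = response_rate(SEGMENT_SHARE, RATE_IN_SEGMENT, RATE_OUTSIDE)
targeted = response_rate(PRECISION, RATE_IN_SEGMENT, RATE_OUTSIDE)
lift = targeted / untargeted - 1

print(f"untargeted: {untargeted:.3%}, targeted: {targeted:.3%}, lift: {lift:.0%}")
```

Under these assumptions the classifier is wrong about four in every ten individuals it selects, yet the yield still rises by about 10 percent. That is good enough for marketing, and nowhere near good enough for a decision about a specific person.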
How good your predictions are depends on how good your dataset is, and that includes how large it is, and what the quality of the data is. Facebook obviously has access to the full dataset, regardless of privacy settings, though you would expect (hope?) it would comply with privacy rules in each country. What I have described here is a minor set of predictions, but according to a document obtained by The Australian, Facebook have been telling advertisers that they can detect when teens are “stressed”, “defeated”, “overwhelmed”, “anxious”, “nervous”, “stupid”, “silly”, “useless” and a “failure”. In other words, Facebook is able to tell advertisers when teens are at their most vulnerable. This isn’t new news, but I am still not sure many users realise quite how much they reveal in what they write on Facebook. That case is particularly sensitive since most countries have greater restrictions on how data about children can be collected and used. If this is being done for teens, then you can reasonably assume that adult data is being collected and used to at least this degree.
I used the simple example of the top 10 list, but we obviously share much more than that online. The issue isn’t just with Facebook (many App creators have large treasure troves of data, and could even use app permissions to trawl through libraries to make marketing decisions). Even things like the timings of our posts can reveal something about us (especially if you are the President of the United States).
While the use of algorithms to predict individual behaviour is being accelerated by advances in computing and software, it is far too easy to get ahead of the underlying science. Moving from generalised group predictions to individual diagnostics crosses between very different areas of data science. To move between the two, we need to switch from looking at correlations to understanding the underlying causes. That is something that no algorithm today can easily give us.
AI is notoriously a black box, and while developments are changing that, even if we understand how the AI is making its predictions, that doesn’t mean that we understand how what it is observing is working. For simple, generalised predictions used in things like marketing, that isn’t too much of an issue; we don’t need to know how things work, we just need them to appear to work (though hold that thought). However, for specific predictions that have life-changing consequences for the individuals, we need to do much better than that. There was the case of a Wisconsin man who was sentenced based on software predictions last year, a case more recently covered in The New York Times. The “Sent to Prison by a Software Program’s Secret Algorithms” headline is a slight piece of misdirection: he wasn’t convicted by the software, but the judge made the sentencing decision based on its guidance that he was a “high risk” to the community. This sort of individual determination by algorithm is highly problematic: is the software distinguishing between someone who is a high risk, and someone who looks like a person who is a high risk? These are two very, very different things, which can be hard to distinguish. If I copy the shopping habits of a mass murderer, I am clearly not suddenly more likely to commit a murder. However, to an algorithm looking at recent data, I look no different, and could be flagged as a high risk. This “aping” of behaviour also has a mirror in how the algorithms learn.
Since algorithms pick up the signals from human behaviours, they also copy any errors or biases embedded in those behaviours. If we train the algorithms on data from a racist and unjust system, the algorithms learn to ape that racism, and to embed those inequalities. In the human world, there are no “perfect systems” to build models from.
The real problem with using Machine Learning to make predictions about human behaviours or beliefs is that humans don’t exist in stasis. We are constantly evolving and changing, and we are often wrong, pending correction. More than that, we evolve and change in tightly coupled, interdependent ways. We react to social trends, advertisers and government policies in a coordinated dance. Recently there has been much investigation here in the UK into the rapid drop-off in teenage pregnancies. One of the most strongly indicated behavioural drivers for teenagers not getting pregnant is, wait for it, teenagers are no longer getting pregnant. As it is no longer seen as a social norm in the segments where it was most common, attitudes and behaviours have changed. Certainly something triggered that change (separating correlations from causality is proving very hard, but changing attitudes to education appears to be a key one), but once a tipping point was reached, the norm shifted rapidly and we now have the lowest conception rates for under 18s since 1969. That sort of Gordian Knot (https://en.wikipedia.org/wiki/Gordian_Knot) is almost impossible for today’s AI to wrap its head around. If indeed it has a head, but embodiment and AI is another kettle of fish.
So, today, we find ourselves in a strange hinterland, where AI can be applied to our digital detritus to reveal more about us than we might know about ourselves. It can do so with such apparent reliability and certitude that its findings can appear as solid as court judgements, or at least be used in them. However, in reality they are still only reliable enough to provide the kind of finger-in-the-air guidance that might be of use to marketeers and the curious. The fact is that as soon as they are applied to real-world situations they change behaviours so much that they are no longer reliable. This makes high-risk predictions highly questionable: If criminals find that drivers of a particular type of car are more likely to get shorter sentences, they can start driving that kind of car. The consequences of using machine learning for predictions in human systems are highly problematic for that reason, if no other. At another level, if we start to use AI to jail (or free) our criminals, the dataset that is “the free population” changes, and our models are no longer valid. If AI predicts what you might reveal about your secrets, or how you might lie, then they simply aren’t secrets anymore, and perhaps the lie isn’t even a lie.
Thankfully we have things like data protection laws to control how our data is used, and these will become significantly stricter as the European GDPR legislation comes into force. But that is just for now. Where will the temptation of easy answers lead us?