JPM Kübler

Socially (IR)Responsible Algorithms: How the internet can betray our privacy - Seminar Wrap-Up

Social media has changed the way we interact and communicate. It provides us with great opportunities to meet with friends and colleagues all over the world. It delivers interesting information on a daily basis and lets us continuously discover new things. It keeps us up to date and helps us navigate a rocky and often overwhelmingly complex world. By supplying us with the knowledge we need and the bonds we seek with our peers, social media has become a vital part of our daily life.

And yet, because our digital fingerprint within the social media realm may tell complete strangers more about us than we are willing to share with the public, it bears the potential to betray us horribly. In 2013, Kosinski et al. published a widely noticed study in the Proceedings of the National Academy of Sciences that demonstrated how individual likes of Facebook fan pages can be used to predict personal traits such as age, gender, political and sexual orientation, eating and drinking habits, or heritage and racial background. While the authors intended to warn the public about the possible side effects of the happy social media universe, dark forces profited from these insights and started to collect information on what people liked on Facebook. Aleksandr Kogan’s app “This Is Your Digital Life” used a loophole in Facebook’s API and crawled information about following behavior from more than 80 million Facebook user profiles, in many cases without the specific consent of the profile owners, because the app accessed not only the information of the app user but also that of all of the user’s friends. Kogan then shared this data with Cambridge Analytica, which claims to have used it for various political campaigns in the context of the Brexit referendum, the 2016 Republican primaries and the subsequent 2016 US presidential election. While few hard facts are known about what Cambridge Analytica could achieve with the data, the company’s CEO Alexander Nix explained in various keynotes that Cambridge Analytica similarly used the data to predict personal traits and then used these predictions to target users with specifically designed political advertisements.

In the aftermath of the 2016 US presidential election and its mostly unexpected outcome, Cambridge Analytica’s activities were put into the spotlight of public attention. While the company has since ceased operations, the heat on its stakeholders and on Facebook increased. Five years after the initial scandal, public awareness of the possibility to predict personal traits from a social media user’s footprint has culminated in heavy media coverage and multiple widely acclaimed documentaries such as “The Great Hack” or “The Social Dilemma”.

Despite the large public attention to the possible misuse of social media data, social media engagement continues to increase. While Facebook usage declines, younger target groups have switched their attention to other platforms such as Instagram or TikTok. Many users believe that the changes in structure and communication style make these platforms less vulnerable to information betrayal. And even though communication has shifted from text towards images and videos, both popular platforms still require users to follow accounts to receive content and information.

What many users seem to ignore or not realize, however, is that information about who is following an account is still publicly observable. This implies that one may again collect user followership information and pair it with personal traits to build a prediction algorithm that forecasts a user’s personal traits based on the accounts that user follows on a platform. In other words: What Kosinski et al. showed in 2013 may still be very feasible in today’s new social media world.

Therefore, we decided to use one of our very own research seminars at the MCM to understand how much personal information about a user can be predicted from his or her social media usage. To do so, we first replicated the study by Kosinski et al. (2013) in the context of Instagram. While Kosinski et al. could rely on a large sample of 40,000 participants, we had to restrict ourselves to a much smaller sample. So, we conducted a survey with approx. 2,000 Instagram users in which we asked participants to indicate which popular accounts they followed on Instagram. Users could choose between 200 accounts. Furthermore, we asked participants to answer a survey that measured, amongst other factors, personality traits (e.g. the five-factor OCEAN model), sexual orientation, gender, age, drug usage, political preferences, race and location within Germany. Following Kosinski et al. (2013), we then predicted the traits from the information on which accounts participants followed on Instagram. Relying on holdout sample validations, we can show that, even though our sample is much smaller than that of the initial study, we predict major personal traits similarly well. Our replication thus already shows that once you have enough personal information and pair it with social media data, predictions become easy. In other words, with enough survey data, we could also deliver reliable predictions for social media users who did not participate in our study.
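
To make the idea of predicting a trait from followed accounts and checking it on a holdout sample more concrete, here is a minimal sketch in Python. It is not the seminar's actual model, which is not specified here. It assumes a plain logistic regression on a binary user-by-account follow matrix, uses purely synthetic data in place of the survey, and all names and numbers (simulate_users, the 30 accounts, the 600/200 split) are hypothetical.

```python
import math
import random


def simulate_users(n_users, n_accounts, rng):
    """Create synthetic users: a 0/1 follow vector and a 0/1 trait per user."""
    rows, labels = [], []
    for _ in range(n_users):
        trait = 1 if rng.random() < 0.5 else 0
        # Assumption: the first 5 accounts are more popular among users with the trait.
        probs = [0.2 + (0.35 * trait if j < 5 else 0.0) for j in range(n_accounts)]
        rows.append([1 if rng.random() < p else 0 for p in probs])
        labels.append(trait)
    return rows, labels


def predict_proba(weights, bias, follows):
    """Logistic model: estimated probability that a user has the trait."""
    z = bias + sum(w for w, f in zip(weights, follows) if f)
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(rows, labels, epochs=150, lr=1.0):
    """Fit logistic regression by plain batch gradient descent."""
    n, d = len(rows), len(rows[0])
    weights, bias = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for follows, label in zip(rows, labels):
            err = predict_proba(weights, bias, follows) - label
            grad_b += err
            for j, f in enumerate(follows):
                if f:
                    grad_w[j] += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def auc(scores, labels):
    """Chance that a random user with the trait is scored above one without it."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


rng = random.Random(0)
rows, labels = simulate_users(800, 30, rng)
# Holdout validation: fit on the first 600 users, evaluate on 200 users never seen in training.
train_rows, test_rows = rows[:600], rows[600:]
train_labels, test_labels = labels[:600], labels[600:]
weights, bias = fit_logistic(train_rows, train_labels)
test_scores = [predict_proba(weights, bias, r) for r in test_rows]
holdout_auc = auc(test_scores, test_labels)
print(f"holdout AUC: {holdout_auc:.2f}")
```

An AUC of 0.5 corresponds to guessing, so any value clearly above it on the holdout users means the followed accounts carry information about the trait. The point of the holdout split is that the score is computed only on users the model never saw during fitting.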

The video below gives a great summary of the survey work and the prediction accuracy obtained by our students. 

One may now claim that social media data only becomes dangerous once one has a large enough training data set with enough personal information. Or in other words: If you don’t have enough survey data, you cannot predict anything. This made us question whether you really need survey data to obtain enough personal information to feed the follower prediction model. So we started looking for alternative sources of personal information that we could use to train our algorithm. Surprisingly, we found that many Instagram users happily share such information with the public. Not only may specific hashtags or types of posts allow you to predict someone’s preferences; many users also provide more sensitive and concrete information within their profiles. Consequently, we crawled Instagram user bios and looked for sensitive information. We were astonished to find that many users happily share where they live, their birth year or age, gender, main interests, sexual orientation and sometimes even their drug habits right in their Instagram bio. By crawling more than 200,000 user profiles, we built a similarly large training set and combined the bio information with information about which public or well-known accounts these people followed on Instagram. Again, we replicated the Kosinski et al. (2013) approach and developed a predictive model. The holdout sample validation indicated that the predictive power of these models was comparable to the findings of our replication study, showing that training the algorithm with publicly available information instead of survey data delivers similar results.
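
How self-disclosed bio information can replace survey answers as training labels can be sketched as follows. This is only an illustration under stated assumptions: the two regular expressions, the example bios, the account names and the reference year 2021 are all invented here, and real bios are far messier and multilingual. The crawling step itself is left out.

```python
import re

# Hypothetical patterns for self-disclosed age information in a bio.
AGE_RE = re.compile(r"\b(\d{2})\s*(?:yo|y/o|years old)\b", re.IGNORECASE)
BIRTH_YEAR_RE = re.compile(r"\bborn(?: in)?\s*((?:19|20)\d{2})\b", re.IGNORECASE)


def extract_labels(bio, reference_year=2021):
    """Return the attributes a user discloses in the bio, or an empty dict."""
    labels = {}
    age_match = AGE_RE.search(bio)
    if age_match:
        labels["age"] = int(age_match.group(1))
    else:
        year_match = BIRTH_YEAR_RE.search(bio)
        if year_match:
            # Assumption: approximate the age from the birth year.
            labels["age"] = reference_year - int(year_match.group(1))
    return labels


# Invented example profiles; "follows" stands in for the crawled followership data.
profiles = [
    {"bio": "Berlin | 24 yo | coffee & climbing", "follows": ["account_a", "account_b"]},
    {"bio": "born 1998, dog person", "follows": ["account_c"]},
    {"bio": "just here for the memes", "follows": ["account_a", "account_c"]},
]

# Only users who disclose something become training examples; the remaining
# users are the ones a trained model could later be applied to.
training_rows = []
for profile in profiles:
    labels = extract_labels(profile["bio"])
    if labels:
        training_rows.append((profile["follows"], labels))

print(training_rows)
```

The third profile discloses nothing and therefore yields no training example, yet its list of followed accounts has the same form as the others. That is what makes such a user predictable by a model trained on the first two.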

To see what we could predict and how we adapted our approach to the new data sources within Instagram, check the two following videos.

All participants were shocked to see that, even though public awareness of social media’s potential for information betrayal is high, people do not seem to understand how easily critical information can be acquired and used to deliver valid and reliable predictions of someone’s private traits.

A key issue here is that one no longer needs any survey-based information for training. Instead, training data may be obtained directly from privacy-insensitive users and then used to predict the personal traits of people who do not share such information themselves, but who become predictable through what they like and follow on social media.