Soroush Vosoughi from MIT’s Laboratory for Social Machines and Harvard’s Berkman Klein Center talks with us about his research into how false news spreads differently than true news on Twitter. His article “The spread of true and false news online”, co-authored with Deb Roy and Sinan Aral, was published in the journal Science on March 9, 2018.
Thanks to @jjaron for some clarifying comments to my first diagram. Here's a second one that fills in some more detail. The outer circles point to areas where we need further research! pic.twitter.com/KhJth1zFnp
— Deb Roy (@dkroy) March 15, 2018
- “Fake news spreads faster than true news on Twitter—thanks to people, not bots” in Science Magazine
- “Cover stories: Visualizing the spread of true and false news on social media” in Science Magazine
- “Answers to Some Frequently Asked Questions About ‘The Spread of True and False News Online’” by Soroush Vosoughi, Deb Roy, Sinan Aral (March 28th, 2018)
- “Parallel narratives: Clinton and Trump supporters really don’t listen to each other on Twitter” (Vice News article mentioned in episode)
▲ Selection bias: political junkies
▲ Negation and linguistic features
▲ Why Twitter “likes” weren’t analyzed
▲ Opportunities for and requirements of further experimental research
How to Cite
Leigh, D., & Watkins, R. (2018, April 2). How Misinformation Spreads Online – Soroush Vosoughi (Version 1). figshare. https://doi.org/10.6084/m9.figshare.6081206.v1
Hosts / Producers
Doug Leigh & Ryan Watkins
What’s The Angle? by Shane Ivers
Hi, I’m Soroush Vosoughi. I’m a postdoc at the MIT Media Lab; also a fellow at Harvard’s Berkman Klein Center. I’m an MIT lifer; I’ve been at MIT for many, many years. I came to Boston to study at MIT in 2004 as an undergrad, and I ended up getting my bachelor’s there and then my master’s and then my PhD, and then I decided to stay a couple years for a postdoc. And now actually I’m on the job market this year, so I’m going around giving job talks; hopefully finding a position for next fall.
Q1: I was a second year PhD student in 2013 when the Boston Marathon bombings happened. At that point I was still exploring research ideas for my PhD thesis, but one kind of area that I was getting closer to making my PhD topic was on creating computational models of language learning. I’ve always been interested in linguistics and natural language processing from a computational point of view, so I’ve been interested in, for example, coming up with computational models of how children learn language.
And I did a little work on that for my master’s thesis, and so I wanted to kind of dig deeper on that topic for my PhD, but then the bombings happened in April 2013. And as you remember, probably, MIT was at the center of some of the stuff that happened. There was an MIT police officer that got shot and killed on campus at MIT, Officer Sean Collier. And there was a manhunt; Cambridge went on lockdown. There was a lot of crazy stuff happening that week, from Monday when the bombings happened until Thursday, I believe, when the suspects were apprehended … or Friday morning I think. In that week Twitter and other social media platforms like Facebook – and Reddit – became my main source of information, and not just for me but for most of my friends who were locked down in MIT grad dorms. They went on social media to figure out what the hell is happening. And it was great because you would get real time updates from people on the ground; you know like, they could say something like “oh I just saw, you know, a shootout here, and here’s a picture” or “I just heard that there was another bomb there.” And so you’d get real time updates on what’s happening, but what I found, like a couple of days after things settled down and people kind of looked back at what was happening, we realized that a lot of the stuff we were reading on social media was not true. Some of it was harmless in a way but some of it could have been really damaging, and some of it was damaging. And so after I realized, you know, basically that half of – or a good chunk of what I was being exposed to on Twitter was false … and other social media platforms … I decided to change my topic for my dissertation and use my skills – so my background’s in natural language processing and machine learning, applied to big data – and so I wanted to use these skills that I had to study basically the spread of rumors on social media.
And to be more specific, I was interested in creating a tool – engineering a tool – that would detect stories that spread around real world events, like the Boston Marathon bombings, so like a real time or near real time rumor detection system, and then I wanted my system to be able to verify these stories as fast as it could.
Q2: Well, the size is pretty straightforward, that’s just the number of retweets. But the depth is how many hops on the followership graph that thing spreads, and so if person A tweets something and person B retweets person A – because person B is following person A – and if person C retweets person B because person C is following person B – and so on and so forth, depth captures that: how many hops it goes, right? And it’s basically word of mouth, like I had this from a friend of a friend of a friend: so that’s depth three, right? Then the other measure, structural virality, is somewhat related to depth: it captures the general structure of the cascade. And so – it’s a little technical – but I can tell you the technical definition and also what it actually means non-technically. The technical definition is the average shortest distance between all pairs of nodes in a tree; that’s the virality. But what it actually means is it captures whether something – whether a tree, like a cascade, a diffusion tree or a diffusion cascade – has, for example, a broadcast shape or a kinda viral peer to peer shape. So it really is good at distinguishing between broadcast diffusion and peer to peer diffusion. And so, if you have … if there’s an account that has a million followers and they tweet something and a million of the followers – all million – retweet that, that has a very broadcast shape, right? It’s just one person and then boom. Whereas let’s say my account only has a hundred followers: I tweet something, but then because it’s something so interesting, you know, it basically goes viral, not because I have a lot of followers but because a friend of mine retweets, another friend retweets, and then it just becomes … it goes really deep and it reaches different areas of the followership net, ah graph, and that has a very different shape. That has a peer to peer diffusion shape.
And so a cascade with a high structural virality score is more like the second example I gave: the more broadcast-like the diffusion is, the lower the virality of that is.
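[Editor’s illustration] The three cascade measures described in this answer can be sketched in a few lines of Python. This is a minimal sketch and not the code used in the study: the function names and the child-to-parent dictionary representation of a cascade are assumptions made for illustration. Size follows the description above (number of retweets), depth is the longest chain of hops from the original tweet, and structural virality is the average shortest-path distance over all pairs of nodes in the tree.

```python
from collections import defaultdict, deque


def bfs_distances(adjacency, start):
    """Return the hop distance from `start` to every reachable node."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency.get(node, ()):
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


def cascade_metrics(root, parent_of):
    """Compute (size, depth, structural_virality) for one retweet cascade.

    `root` is the original tweeter; `parent_of` maps each retweeter to the
    account they retweeted (a hypothetical representation for this sketch).
    """
    # Build an undirected adjacency map of the cascade tree.
    adjacency = defaultdict(set)
    for child, parent in parent_of.items():
        adjacency[child].add(parent)
        adjacency[parent].add(child)
    nodes = set(adjacency) | {root}

    size = len(parent_of)  # number of retweets, as described in the answer
    depth = max(bfs_distances(adjacency, root).values())

    # Average shortest distance over all ordered pairs of distinct nodes
    # (the same value as the average over unordered pairs).
    n = len(nodes)
    if n < 2:
        return size, depth, 0.0
    total = sum(sum(bfs_distances(adjacency, node).values()) for node in nodes)
    return size, depth, total / (n * (n - 1))


# Broadcast shape: four followers all retweet the original account directly.
broadcast = {"b": "a", "c": "a", "d": "a", "e": "a"}
# Peer-to-peer shape: each retweet comes from the previous retweeter.
chain = {"b": "a", "c": "b", "d": "c", "e": "d"}

print(cascade_metrics("a", broadcast))
print(cascade_metrics("a", chain))
```

Both toy cascades have the same size (four retweets), but the chain is deeper and has the higher structural virality, which matches the broadcast versus peer-to-peer contrast drawn above.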
Q3: One hypothesis could be that false news spreads faster than true news because people who spread false news have more followers or anything like that. Or like they’re more engaged on Twitter or something like that. And so we looked at various account characteristics: things like the number of followers, number of followees, how old their accounts are, how engaged they are on Twitter, and things like that – and whether they’re verified or not – and we compared the accounts that spread false news with the accounts that spread true news – we compared characteristics – and what we found is that the difference doesn’t explain it. In fact, the opposite of what you’d expect is true, which is that accounts that spread true news are actually more likely to be verified and have a slightly higher number of followers. So it’s not that these characteristics explain why false news spreads faster than true news; that was the basic finding. One thing I wanna make clear is that we did look at bots, but one thing we didn’t look at – and again we didn’t look at it, so we don’t know whether it is true or not, but it could potentially be an explanation of why false news spreads faster than true – is, you know, for example troll accounts, right? Or any kind of cyborg accounts … these are accounts that are not bots, but there’s like a person who’s controlling a few accounts using tools that let them amplify their message, right? And/or like people gaming Twitter’s algorithms, like, so that certain things will be trending or something like that. So these are things we didn’t look at, and they could very well explain – like we don’t know, so anything we say is speculative, so we don’t know whether it would explain the behavior we are seeing or not.
If it is true that they are responsible for this, it’s very sophisticated behavior for them to be … and I mean like having bots is not super sophisticated, but like a lot of these other things are sophisticated behavior … and if it is the case that this is the reason why, you know, we’re seeing what we’re seeing, then it shows like a really coordinated, sophisticated effort behind it. And again, we didn’t look at this, so I don’t know if it’s true one way or the other, so anything we say is speculative.
Q4: The best way to do this would be to survey people, actually. To ask them, “Oh, you saw this thing; how do you feel about that?” [unintelligible] or like any kind of questions you could ask about people’s behavior on false or true news; that would be the best way to address this question. But obviously we didn’t do that for this study, and for many different reasons it’s not easy to run these surveys on Twitter. And so we decided to look at, actually, what people are saying in response to these articles as a proxy for their reaction to these stories and … One way you could do this is to run this through sentiment analysis to get kind of positive/negative, but we wanted to get more specific emotions than just positive and negative, and so we used the basic emotions – like there are eight basic emotions that capture most of people’s emotions, so like everything you can think of would fall under one of these eight emotions. And these are things like fear, disgust, anticipation and you know … joy and sadness are also part of it. So, anyways, there are eight emotions. So basically, the same way people create sentiment classifiers, we did that for these eight emotions, so just extended it slightly to cover more emotions to get a better sense of exactly – not whether some response is negative or positive but what is it exactly that makes it negative – is that fear, is it disgust, you know … things like that. And so that was our attempt to understand people’s reasoning – or people’s reaction – to these news articles without actually having to create surveys.
And so, yeah, the surprise finding didn’t really surprise us that much, mostly because our previous analysis showed that false news is apparently more novel than true news, and obviously you would expect people to be more surprised when they see something novel. The disgust was interesting because we didn’t expect that, but it kind of makes sense. There’s been other research, you know, in the field of communication that shows that people are more likely to share something – share a news piece – that has negative content than positive content. They didn’t look at social media; they looked, I believe, just at like the New York Times … like people sharing articles, like either with their family members or in an email or something … and it seems like people just like to share things with a negative … a negative feel to it more than they like to share positive stories. And so that kinda made us think that this disgust might be one reason – or these negative feelings people have towards false stories – might be one reason why they’re more likely to be sharing; because of this other research that kinda shows that these kinds of negative emotions do have an effect on people’s sharing behaviors. So, yeah, it was an interesting finding – we didn’t expect it – but it was interesting.
Q5: This is outside the scope of the study and so, again, this is me speculating, but my hypothesis is that there are three kinds of people who share false news. One, they do it maliciously, on purpose. The second type, they do it because, you know, they like juicy gossip, interesting things. And they don’t care whether it is true or not – so it’s not that they are sharing something false on purpose; they just wanna share something interesting. And the third type, they do it out of, you know, I’m gonna call it ignorance, but what I mean is that they just don’t have the time or, you know, they don’t look at the source that closely, and they share something and it might end up being false. And I do believe that these three groups of people basically account for – like, if you do a division I believe there’s enough people in each category, and it will be interesting – for further study – to divide them and then look at these three different groups of people, and I bet their behavior is very different. And the way that you need to intervene to stop the spread of misinformation is very different for each group. So, behavioral intervention methods could potentially apply to groups number two and three, but for the first you need a very different kind of system. You need to detect these malicious agents – and, you know, I don’t know what you would do – but at least you know the approach you take with these people is gonna be very different. I think the three different types of people … like the third type probably will not be as interested in engaging really around the conversation and the replies, but the first two might be. And so again, something we didn’t look at, but it will be really interesting for future research to look at that.
Q6: The coauthors in my study are my postdoc adviser, Professor Deb Roy, and Professor Sinan Aral, who was on my PhD committee. And so after my PhD defense we [unintelligible] rumors, as I mentioned – we had discussions, so he decided to collaborate with me on this project as well.
Our laboratory – the Laboratory for Social Machines, which is headed by Professor Deb Roy … we had a special relationship with Twitter when it comes to data access. Not that we have … so we have access to data that everyone else could get access to, it’s just that it costs a lot of money to get access to that kind of data. And so, you can think of it as kind of a sponsorship: so Twitter is sponsoring us by giving us free access to their data and – I mean definitely having Deb as someone involved with Twitter really helped get that, you know, get that sponsorship … Twitter also sponsors our lab with money as well so, not just data access but also financial gifts from Twitter, and so that relationship definitely helped us a lot to get access to this kind of data.
So, Twitter sponsors our lab, but they don’t tell us what to work on. So, they give us access to data … and so this was something that I decided to work on just because of my PhD thesis, and they weren’t kind of either for it or against it: they supported any sort of research that came out of our lab. So, it wasn’t like Twitter said “look at this, or don’t look at this”; so, it was just something we did.
Nowadays, most of the interesting data lives, you know, in corporations like Google, Microsoft, Facebook, and especially Facebook and Twitter … and so any kind of research you want to do, you have to, you know, work with these corporations nowadays to get access to interesting data. And certain platforms like Twitter are easy – much easier – to work with because this stuff is public, whereas with Facebook there are a lot more privacy concerns around data.
Q7: You know, I think this kind of research needs to be – if we don’t do anything about this, the problem’s gonna persist, and I’d rather the research on detecting and studying false news and rumors, I’d rather that research be done in the open. Where people can actually see the results. And maybe disagree with it, and they can build on top of it … even if that means that others would get access to, you know, something that would let them game the system. But you know then you push back and you do something better. So, yes, it is kind of – you are creating an arms race, but I think that kind of transparency is the best way forward for this kind of work.
My other authors would probably disagree, or maybe they would agree, I don’t know … but my personal opinion is that I don’t think the platforms should do anything that would even closely … could closely resemble censorship. So I don’t think they should, for example, remove news that they think is fake or anything like that; I don’t think they should be doing that. But what they could be doing is they could be coming … like, providing these information quality metrics. So, they let you see whatever you’re seeing – but give you a sense of, for example, “this source that you’re reading from,” you know, “according to Snopes, 80 percent of what they said in the past was false,” and then you decide whether you want to retweet that. I think providing more information about what people are reading and the sources of these things is really helpful, and something that the platforms could do, but I think actually censoring information and not letting people see certain things, that’s something I don’t agree with.
Q8: One thing I am planning on doing is taking the dataset that we’ve collected and thinking really – from an engineering point of view – thinking about, all right: here we have this larger dataset across many different topics. Which is … could we create engineering tools that could detect these false stories before they’re actually verified, or before they’re shown to be true or false by these fact-checking organizations? I think that would be a very interesting extension.
And, the most – from my point of view – the extension to this work that I’m most interested in is actually looking at behavioral intervention experiments. So I’m less interested now in detecting false news and rumors, but more interested in figuring out whether there are certain ways you could intervene to dampen the spread of rumors and false news. By “intervene” I don’t mean like the platforms intervening, or there being a policy, you know, that would censor people. I mean algorithmic interventions. So maybe having tools that detect false news spreading and figure out who is on the path, you know, to be exposed, and maybe let them know the truth before they’re exposed to the falsehood, or something like that. Or maybe come up with an information, like, quality metric that lets people know when they’re reading something how good the quality of that information is, and maybe that’ll help them think twice before they retweet something that has a low quality rating. So these kinds of intervention methods are what I’m really interested in as an extension of this study.
Q9: We had this study, basically, about the 2016 presidential election, and using our access to the Twitter data we looked at people in our dataset who had commented about the election, and at the end of the election when everything was over we mapped them, basically, on Twitter. And we had some way of determining whether those people were supporters of Trump or Clinton or Sanders and whatnot. And so when we mapped these people we saw these really clear echo chambers … so, for example there was a cocoon with very few connections – like a tightly connected cocoon with very few connections outside of the cocoon. And interestingly enough, we looked at journalists on Twitter, and we saw that mostly they’re outside of these cocoons. And so there was kind of a communication gap between that cocoon and the rest of Twitter including journalists. So we did look at cocoons and filter bubbles through that lens, through the lens of the 2016 election.
One thing that we would be really interested to do in the future is to combine that research on – and by the way that was published in Vice News if you’re interested in reading that. It’s called “Journalists and Trump voters live in separate online bubbles.” So that work is still under review. We’re gonna publish it in an academic journal, but because it was so interesting, we wanted to kind of have it out there as a news article. And, yeah, so we’re really interested in the future in maybe combining some of the work on false news and our work on mapping these tribal networks and these echo chambers, but we haven’t done that kind of intersection yet.
Q10: In terms of the general news, you know, as you said, it’s garnered so much attention that I don’t even know like who else is … like a lot of these articles, they didn’t even talk to us; they just read the paper and they wrote an article. Or they’re quoting from some other article or something like that. And so, yeah, it is really interesting actually to look at the … look at the coverage. And we’re actually joking that, you know, the title of the paper is “The spread of true and false news online” and we’re joking about doing a follow-on project called “The spread of ‘The spread of true and false news online’ online,” basically. And so I think, yeah, there’s been some misunderstandings in the media, but I think overall – at least in the academic circles, you know – the message has been received, and I think the reception has been mostly positive. And there are already talks of collaborations and kind of moving forward with the behavioral intervention methods that I described. And so I think that the coverage, even though it had some negatives, I think overall has been good to raise awareness and also has created a lot of opportunities for collaboration, which I think is very valuable.
Q11: I think they definitely have a lot to offer, but I’m really, really worried about the ethical implications of using an algorithm for everything, mostly because nowadays most algorithms are black boxes, and you don’t really know why an algorithm made a decision. And so I think any time an algorithm is used in a case like this it should be really easy to ask the algorithm why it made a decision. You can’t just have an algorithm as a black box saying “oh, this is right, this is wrong” or “this guy’s guilty, this guy’s not” without us being able to understand why it made that decision. And you know that’s a very active area of machine learning, which is, you have these, for example, neural network – deep learning system – that does something: can you get it to explain to you why it did what it did? That’s an active area of machine learning which kind of connects to this whole thing, which is getting systems to explain their decision making process. And so I think, yes, algorithms have a lot to offer but they have to be very transparent and very easy to interpret. As you know, algorithms are never 100 percent accurate, and an algorithm’s performance really relies on the training data that was used to train it. And so, if there was any bias in that training data, the outcome of the algorithm, the predictions of the algorithm, might be biased. And so, I think yes, algorithms are useful, but there needs to be more transparency and better ways to understand exactly what is going on inside these algorithms.
Q12: There’s a whole movement of what is called “AI in governance,” and I think it’s become somewhat inevitable that in the next few decades there’s going to be more and more machine learning and AI use. Even now as we speak, for example, in certain states in the US the sentences given out by the courts are informed by machine learning algorithms. And so there are going to be more and more of these algorithms being used in different areas of governance. And so there’s this whole new movement emerging which is basically a coalition of machine learning scientists and computer scientists, basically, and policy and law, and you know, public service – to try to really – for both sides to understand each other. You know, as these things start running society, we don’t want them to be these black boxes where no one knows exactly how they work or why they’re making these decisions. And so we want transparency, and we want both sides of this equation – the policy makers and the computer scientists – to be aware of each other, and so this whole movement of AI in governance is something that is really shaping up to be a big movement in computer sciences and social sciences and something people might be interested in learning more about.