Detecting Fake News with the Help of an Algorithm

Researchers at the University of Michigan have recently developed an algorithm that can identify fake news stories better than humans. The algorithm uses linguistic clues to differentiate between factual and inaccurate stories.

The algorithm could be used by major news aggregators and social media sites like Google News and Facebook to spot and combat misinformation.

Fighting fake news

After the 2016 election, “fake news” became a political buzzword as many speculated that fabricated news stories spread on Facebook influenced the results of the election.

News stories reporting false or questionable information have proliferated on social media sites in the past few years. These stories, created either as click-bait or with the intention to sway public opinion, have triggered controversy in politics and caused major problems for social media platforms struggling to regulate the massive amount of data and news stories circulated on their websites.

Since the 2016 election, Facebook has taken measures to fight the distribution of false content on its platform. It has tried banning prominent users, such as right-wing commentator Alex Jones, working with third-party fact-checkers, and allowing users to flag inaccurate stories.

These efforts have had limited success so far, creating a need for new strategies. An algorithm that can automatically and accurately identify fake news stories would be an appealing tool.

“There has been a significant effort lately in the research community to address this problem,” said Rada Mihalcea, a professor of electrical engineering and computer science at U-M and the lead researcher on the project.

“However, most of the work, including recent challenges around fake news, has been focused on understanding stance and on claim and fact verification.”

“From what I know, this is the first system that addresses the automatic identification of fake news stories in their entirety, and as they typically appear online,” she continued.

Other research has been more limited, aiming to identify clickbait or to learn the distinction between satirical and real news, according to Mihalcea.

Currently, news aggregators and social media sites rely primarily on human fact-checkers to catch fake news, which takes time. With the overwhelming influx of news stories shared online, most fake stories are never caught, and those that are caught have often already been read by enough people to have made an impact.

Automatic verification could help news aggregators and social media sites find fake news stories earlier, and perhaps more accurately, than human regulators.

Mihalcea said that her team’s algorithm could be used by both users and social media sites to flag stories and distinguish between trustworthy and untrustworthy media. It has already identified fake news stories with a 76 percent success rate, which still leaves a significant margin of error but is higher than the human success rate of 70 percent.

How does it work?

The new algorithm takes an unusual approach to identifying fake news stories. It uses linguistic analysis, which means that it examines quantifiable characteristics of each article’s writing style and content, from its grammatical structure to its use of punctuation and the complexity of its language.

“We started by collecting a dataset of news — both fake and legitimate — which can be used to learn the characteristics of fake news,” said Mihalcea. “We represent the data using a number of features — sequences of words, punctuation, word categories, syntactic relations, and others.

“For instance, one such feature could be a number reflecting the number of times we see the word ‘story,’ another could be the number of times we see words in a subject-verb relation, and so on. These representations are then fed into the learning algorithm, which eventually decides how to weight them for the final classification.”
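As a rough illustration of the kind of representation Mihalcea describes, the Python sketch below turns an article into a few numeric features and combines them with weights. It is not the team's code: the function names, the particular features, and the weights are hypothetical, and in a real system the weights would be learned from labeled stories rather than set by hand.

```python
import re
import string
from collections import Counter


def extract_features(text):
    """Turn raw article text into a small dictionary of numeric features.

    The feature set is illustrative only; the real system reportedly also
    uses word sequences, word categories, and syntactic relations.
    """
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    n_words = max(len(words), 1)  # avoid division by zero on empty text
    return {
        # Number of times the word "story" appears.
        "count_story": float(counts["story"]),
        # Total number of punctuation characters.
        "punctuation": float(sum(ch in string.punctuation for ch in text)),
        # Exclamation marks counted separately.
        "exclamations": float(text.count("!")),
        # Crude proxy for language complexity.
        "avg_word_length": sum(len(w) for w in words) / n_words,
    }


def weighted_score(features, weights, bias=0.0):
    """Combine features linearly; features without a weight contribute 0."""
    return bias + sum(weights.get(name, 0.0) * value
                      for name, value in features.items())


if __name__ == "__main__":
    sample = "This story is a story! Really?"
    feats = extract_features(sample)
    # Hand-picked, hypothetical weights, used only to show the mechanics.
    print(feats)
    print(weighted_score(feats, {"exclamations": 0.5, "count_story": -0.1}))
```

A learning algorithm would replace the hand-picked weights with values fitted to the training stories, which is the step the quote refers to as deciding "how to weight them."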

Mihalcea explained that these clues are often different from those that humans look for. For example, the algorithm identifies certain keywords that signal accuracy or inaccuracy, which humans might not instinctively notice.

“In this and other research we have done on deception, we have found for instance that the use of the word ‘I’ is associated with truth,” she said. “It is easy for an algorithm to count the number of times ‘I’ is said, and find the difference.

“People however do not do such counting naturally, and while it may be easy, it would distract them from the actual understanding of the text.”
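The counting Mihalcea mentions is simple to automate. The sketch below, in Python, computes how often the pronoun "I" appears relative to the length of a text. The function name and the tokenization rule are assumptions made for illustration; contractions such as "I'm" are deliberately not counted, which is a simplification.

```python
import re


def first_person_rate(text):
    """Return the fraction of word tokens that are exactly the pronoun "I".

    Tokens are runs of letters and apostrophes, so "I'm" is one token and
    is not counted here (a simplification for this sketch).
    """
    tokens = re.findall(r"[A-Za-z']+", text)
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token == "I")
    return hits / len(tokens)


if __name__ == "__main__":
    print(first_person_rate("I think I saw it"))
```

On its own, such a rate says little about any single article. It becomes useful only as one input among many whose weights are learned from labeled examples.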

Training an algorithm to detect deception requires identifying a large set of linguistic clues drawn from a significant sample of fake news stories. This presents a challenge, as fake news stories appear and disappear quickly, come in many genres, and can often be confused with satire.

The team avoided this problem by creating their own fake news stories. They hired outside writers to take real news stories and rework them into fakes. The writers were recruited through the crowdsourcing marketplace Amazon Mechanical Turk.

Mihalcea noted that this process is consistent with how fake news stories are typically created in the real world.

By the end of the process, the team had a set of 500 real and fake news stories to feed to the algorithm. After the algorithm performed a linguistic analysis on these items, the researchers tested it with real and fake news stories pulled from the internet.

The algorithm can currently identify fraudulent stories at a 76 percent rate, which is good, but there is room for improvement.

Mihalcea noted that there is evidence that feeding the algorithm more data may make the algorithm more effective. They plotted the performance of the algorithm as a function of the amount of data fed into it, creating a “learning curve,” which allowed them to see if the algorithm stops learning after a certain amount of data.

“What we observed is that more data is likely to bring increase in performance, so a natural next step would be to collect more news stories, both fake and legitimate, as a way of improving the algorithm effectiveness,” she said.
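The learning-curve idea can be sketched in a few lines of Python: train on progressively larger slices of the data and record accuracy on a fixed test set each time. Everything below is a toy under stated assumptions. The one-feature threshold classifier stands in for the real model, the data is synthetic and randomly generated, and none of the numbers it prints relate to the team's results.

```python
import random


def fit_threshold(examples):
    """Toy classifier: midpoint between the two class means of one feature.

    Each example is (feature_value, label), with label 1 = fake, 0 = real.
    """
    fake = [x for x, label in examples if label == 1]
    real = [x for x, label in examples if label == 0]
    if not fake or not real:
        return 0.0  # fallback when one class is missing
    return (sum(fake) / len(fake) + sum(real) / len(real)) / 2


def accuracy(threshold, examples):
    """Fraction of examples where 'value above threshold' matches 'fake'."""
    correct = sum((x > threshold) == (label == 1) for x, label in examples)
    return correct / len(examples)


def learning_curve(train, test, sizes):
    """Train on the first n examples for each n and score on the test set."""
    return [(n, accuracy(fit_threshold(train[:n]), test)) for n in sizes]


def make_synthetic(n, rng):
    """Made-up data: fake items are drawn around 1.0, real items around 0.0."""
    data = []
    for i in range(n):
        label = i % 2
        data.append((rng.gauss(1.0 if label else 0.0, 1.0), label))
    return data


rng = random.Random(0)  # fixed seed so runs are repeatable
train = make_synthetic(400, rng)
test = make_synthetic(200, rng)
curve = learning_curve(train, test, [10, 50, 100, 200, 400])

if __name__ == "__main__":
    for n, acc in curve:
        print(n, round(acc, 3))
```

If accuracy is still climbing at the largest training size, more data is likely to help. If the curve has flattened, gains are more likely to come from better features or a different model.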

The team’s work in developing the algorithm comes at a pivotal point in political and media history. Nearly half of Americans now primarily get their news online, and over two-thirds say they get at least some of their news through social media.

But the internet remains a largely unregulated source of information.

“The web — including social media — plays a huge role in today’s society, as it is a major source of information that people use to make decisions,” Mihalcea said.

“Consider for instance recent political events, or the discussions around vaccination, and so forth. In this environment, where everyone can put ‘news’ out there, it is important for people to have a means to distinguish between what’s trustworthy and what’s not.”