Social media flu tracker gets local filter
- By Mike Cipriano
- Mar 26, 2014
A group of researchers from Johns Hopkins and George Washington universities has developed a method of influenza detection using data from social media filtered by location.
Researchers David A. Broniatowski, Mark Dredze and Michael Paul have created an algorithm that distinguishes relevant tweets from insignificant or unrelated online chatter. In the past, tweets that did not pertain to actual infection have reduced the accuracy of social media surveillance by masking signs of influenza’s prevalence.
According to the authors, their approach has been tested at both the national and municipal levels, and their results were evaluated by a municipal health agency.
The data was collected using free tools provided by Twitter to access streams of public data. The researchers compiled data from two such streams. The first was a general stream, representing a 1 percent uniform sample of all Twitter messages. The second was a customized health stream, which collected tweets containing any of 269 health-related keywords provided by the researchers.
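The keyword-based health stream can be pictured with a minimal sketch. The keyword set below is hypothetical and chosen only for illustration, since the researchers' 269 terms are not listed here, and the function names are likewise assumptions rather than part of any Twitter tool.

```python
import re

# Hypothetical keywords for illustration only; the researchers' actual
# list of 269 terms is not reproduced in this article.
HEALTH_KEYWORDS = {"flu", "influenza", "fever", "cough", "sore throat"}


def matches_health_keywords(text, keywords=HEALTH_KEYWORDS):
    """Return True if the tweet contains any keyword as a whole word or phrase."""
    # Lowercase, drop punctuation, and pad with spaces so that "flu"
    # does not match inside a longer word such as "influence".
    normalized = " ".join(re.findall(r"[a-z0-9']+", text.lower()))
    padded = f" {normalized} "
    return any(f" {keyword} " in padded for keyword in keywords)


def health_stream(tweets, keywords=HEALTH_KEYWORDS):
    """Keep only the tweets that mention at least one health keyword."""
    return [tweet for tweet in tweets if matches_health_keywords(tweet, keywords)]


tweets = [
    "I think I have the flu",
    "Bieber fever tickets on sale",
    "Lovely influence on me",
    "Sore throat and a cough today",
]
for tweet in health_stream(tweets):
    print(tweet)
```

In this toy run the "Bieber fever" tweet passes the keyword match even though it has nothing to do with illness, which is the kind of unrelated chatter a later relevance filter has to remove.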
One of the algorithm’s most important components is its geolocation feature, which identifies where each tweet is coming from. Unlike researchers in past studies, the team also used its recently developed system, called Carmen, to focus on a specific area: New York City.
Carmen was able to identify the locations of 22 percent of the collected tweets. It also draws on users’ profile information, sorting recognizable place names such as “New York, NY” and “NYC” from nonsensical entries such as “Candy Land.” Of the 56,000 tweets evaluated, the researchers concluded that the New York location filter was 61 percent accurate to within 50 miles of the city.
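Profile-based location matching of this kind might look roughly like the sketch below. This is not Carmen's actual code or API. The alias list, the function names and the "New York City" label are assumptions made for illustration, and a real system would draw on a far larger set of place names.

```python
import re

# Assumed aliases for illustration; "new york city" is an added guess,
# while "New York, NY" and "NYC" come from the article.
NYC_ALIASES = {"new york ny", "nyc", "new york city"}


def normalize(location):
    """Lowercase a free-text profile location and strip punctuation."""
    return " ".join(re.findall(r"[a-z]+", location.lower()))


def resolve_profile_location(profile_location):
    """Return "New York City" if the profile text names the city, else None."""
    normalized = normalize(profile_location)
    if not normalized:
        return None
    padded = f" {normalized} "
    if any(f" {alias} " in padded for alias in NYC_ALIASES):
        return "New York City"
    # Nonsensical or unrecognized entries stay unresolved.
    return None


for profile in ["New York, NY", "NYC", "Candy Land", ""]:
    print(repr(profile), "->", resolve_profile_location(profile))
```

Entries that cannot be resolved are simply left without a location, which is one reason only a fraction of tweets end up geolocated.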
The researchers also used binary-classification models as part of data filtering. One filter combined keyword matching with a support vector machine to distinguish tweets that are relevant to health from those that are not.
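To give a sense of what a support vector machine does in this setting, here is a toy linear classifier trained by subgradient descent on the regularized hinge loss. It is a sketch under stated assumptions, not the researchers' model: the four training tweets, the stopword list, the labels and all parameter values are invented for illustration, and the tweets are assumed to have already passed a keyword filter.

```python
import re

# Assumed stopword list for this toy example.
STOPWORDS = {"a", "and", "at", "got", "i", "in", "is", "my", "the", "with"}

# Invented training tweets: +1 means relevant to illness, -1 means not.
TRAINING = [
    ("home sick with the flu", 1),
    ("in bed with flu and a fever", 1),
    ("got my flu shot at the clinic", -1),
    ("bieber fever at the concert", -1),
]


def tokenize(text):
    """Return the set of distinct lowercase words, minus stopwords."""
    return set(re.findall(r"[a-z']+", text.lower())) - STOPWORDS


def train_linear_svm(examples, epochs=50, lr=0.1, lam=0.01):
    """Fit word weights and a bias on the L2-regularized hinge loss."""
    weights, bias = {}, 0.0
    for _ in range(epochs):
        for text, label in examples:
            tokens = tokenize(text)
            score = sum(weights.get(t, 0.0) for t in tokens) + bias
            # L2 regularization: shrink every weight slightly each step.
            for t in weights:
                weights[t] *= 1.0 - lr * lam
            # Hinge loss: update only when the example is inside the margin.
            if label * score < 1.0:
                for t in tokens:
                    weights[t] = weights.get(t, 0.0) + lr * label
                bias += lr * label
    return weights, bias


def is_relevant(text, weights, bias):
    """Classify a tweet; words never seen in training contribute nothing."""
    score = sum(weights.get(t, 0.0) for t in tokenize(text)) + bias
    return score > 0


weights, bias = train_linear_svm(TRAINING)
print(is_relevant("stuck at home sick, flu is awful", weights, bias))
print(is_relevant("flu shot clinic opens today", weights, bias))
```

Both example tweets contain the word "flu", so a keyword match alone cannot separate them; the learned weights on the surrounding words are what push one toward relevant and the other toward irrelevant.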
This new study differs from one previously conducted by researchers at Columbia University and the National Center for Atmospheric Research with Google Flu Trends (GFT). GFT attempted to identify flu outbreaks by tracking people’s Google searches about the infection.
GFT, however, was found to have greatly overestimated the number of flu cases in 2012-13, according to an article published in the journal Science. Northeastern University computer and political scientist David Lazer argued that while one of the biggest problems for GFT was interpreting the social media information, it “still stands as a triumph of big data engineering.”
Researchers also reported in the journal Nature in February 2013 that Google Flu Trends was estimating about twice as many flu cases as reported by the Centers for Disease Control and Prevention.