Using NLP To Uncover Truth in the Age of Fake News and Bots
Cutting through the noise and misinformation shared online is crucial for creating accurate sentiment models.
The modern political landscape is full of division. This is nothing new, and there have always been a number of factors contributing to the cut and thrust of political discourse. But today, political sentiment is influenced by more dynamic and immediate forces that can be used as tools in the information war. Traditional modes of communication, such as print media, political campaigns, and advertisements, are, of course, still prominent, but the modern information landscape contains the added variables of the web and, more significantly, social media.
We are now in an age in which sentiment on any number of topics can be uncovered through the analysis of enormous amounts of data, ranging from the traditional, such as polls, election results, and expert analysis, to alternative data sets, such as social media platforms. To get a true picture of any sentiment, however, we must be confident that the information we analyze is credible, and credibility is becoming increasingly difficult to establish. As a data scientist with extensive experience in building sentiment models using natural language processing (NLP), I’d like to share my experience of uncovering the truth in today’s increasingly challenging information landscape.
Seeking Validation
When it comes to creating a model for sentiment analysis, the value of alternative data sources cannot be overstated. Social media platforms provide a wealth of information that can be analyzed and categorized in real time, whereas traditional sources, such as opinion polls and news reports, offer only snapshots in time.
At the beginning of any sentiment analysis project, it’s essential to decide on the data sources that will feed the model, as well as the methodology for creating the indicator, i.e. ensuring its output reflects validated reality. As with any data science project, this requires a long period of discovery, data wrangling, and validation. It’s also important to ensure all data is anonymized before any analysis is performed, in full compliance with data protection and privacy regulations.
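To make the anonymization step concrete, here is a minimal sketch of pseudonymizing posts before they enter the pipeline. The field names and the salted-hash scheme are illustrative assumptions on my part, not a description of any specific production setup.

```python
import hashlib
import os

# Illustrative only: the field names and salting scheme are assumptions,
# not a prescription from this article.
SALT = os.environ.get("ANON_SALT", "replace-with-a-secret-salt")

def anonymize_post(post: dict) -> dict:
    """Replace direct identifiers with a salted hash before analysis."""
    pseudonym = hashlib.sha256((SALT + str(post["user_id"])).encode()).hexdigest()
    return {
        "author": pseudonym,           # stable pseudonym, not reversible without the salt
        "text": post["text"],          # content needed for sentiment analysis
        "created_at": post["created_at"],
    }

raw = {"user_id": "12345", "text": "Prices are going up again.", "created_at": "2023-01-01"}
print(anonymize_post(raw)["author"][:12])
```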
All projects begin with an idea. Transforming ideas into workable solutions requires collaboration and validation with subject matter experts. Indicators geared towards specific markets, such as real estate, require expertise from those industries to ensure the methodologies behind them are sound. Once the models are run, variations can be queried and adjustments made in line with feedback, improving the performance of the models.
For sentiment analysis that focuses on different countries, it is essential that the language of the target population is fully understood. If we are to decide whether a social media post expresses a negative or positive sentiment, we have to ensure that slang and dialect are taken into account, which can be done at a local, regional, and national level. For these use cases, linguists, native-language experts, and data scientists are all essential.
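By way of illustration, here is a minimal sketch of how slang lookups curated with native-language experts might be folded into preprocessing. The dictionary entries and region codes are invented examples, not terms from any real indicator.

```python
# Hypothetical slang/dialect normalization table, maintained with native-language
# experts; entries and regions here are invented for illustration.
SLANG_LOOKUP = {
    "uk": {"gutted": "disappointed", "chuffed": "pleased"},
    "us": {"salty": "bitter", "stoked": "excited"},
}

def normalize(text: str, region: str) -> str:
    """Replace region-specific slang with canonical terms before sentiment scoring."""
    table = SLANG_LOOKUP.get(region, {})
    return " ".join(table.get(tok.lower(), tok) for tok in text.split())

print(normalize("Absolutely gutted about the result", region="uk"))
```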
Ultimately, our job when analyzing social media posts is to identify credible engagements with a topic, such as elections to political office. When it comes to social media platforms, such as Twitter, Telegram and WeChat, a credible source does not have to be an expert, it just has to be a real person engaging with the topic of discussion—but in the age of the bot, this is where things can become difficult.
Finding Fakes
Increasingly, bots and fake news accounts dedicated to spreading misinformation and disinformation are being used to influence our perception of reality. This is where sentiment indicators that can sift through the noise and deliver true insights become invaluable.
NLP is used for both political and financial indicators, and for both it is essential to filter out bots and fake news accounts. When it comes to political use cases, such as election results, far more users, both real and fake, engage with the topics, which means there is more data to analyze. In my work creating sentiment analysis indicators, I have found that many accounts are bots, which must be removed from the data pipeline that feeds the models.
Through NLP, which harnesses input from subject matter experts and linguists, bots can be detected and discounted from the discourse under analysis, i.e. removed from the indicator. Twitter is, of course, the most popular platform, so I will use it as my example use case.
Deciding which accounts are bots involves a number of stages. First, Twitter supplies metadata on accounts, which offers an initial layer of analysis. After some further validation work, the next layer is where the model must ascribe a sentiment to a Tweet. This requires the creation of a term-document matrix, from which negative, positive, and neutral sentiments can be determined through text analysis. State-of-the-art NLP methods, such as Bidirectional Encoder Representations from Transformers (BERT), can then be used to detect the context, syntax, and semantics in text, enabling greater accuracy when determining the sentiment related to a subject. Again, this is where the earlier work with subject matter experts, who decide which terms are ascribed which values, comes into play.
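As a rough illustration of those first two layers, the sketch below shows a metadata-based heuristic for flagging likely bots, followed by a term-document matrix built with scikit-learn. The thresholds, sample Tweets, and the specific rules are assumptions made purely for illustration, not the values used in any production model.

```python
from datetime import datetime, timezone
from sklearn.feature_extraction.text import CountVectorizer

# Layer 1: crude metadata heuristics. The thresholds are illustrative assumptions.
def looks_like_bot(account: dict) -> bool:
    age_days = (datetime.now(timezone.utc) - account["created_at"]).days
    tweets_per_day = account["statuses_count"] / max(age_days, 1)
    follow_ratio = account["followers_count"] / max(account["friends_count"], 1)
    return (
        age_days < 30                # very new account
        or tweets_per_day > 100      # implausibly high posting rate
        or follow_ratio < 0.01       # follows many accounts, followed by almost none
    )

example_account = {
    "created_at": datetime(2023, 1, 1, tzinfo=timezone.utc),
    "statuses_count": 90_000,
    "followers_count": 3,
    "friends_count": 4_500,
}
print(looks_like_bot(example_account))  # True: follows thousands but has almost no followers

# Layer 2: a term-document matrix over the Tweets that survive filtering.
tweets = [
    "Great turnout at the polls today",
    "This election is rigged, do not trust the results",
    "Counting continues in three districts",
]
vectorizer = CountVectorizer(lowercase=True, stop_words="english")
term_document_matrix = vectorizer.fit_transform(tweets)   # shape: (n_tweets, n_terms)
print(term_document_matrix.shape, vectorizer.get_feature_names_out()[:5])
```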
For an economic indicator, the term “increase production” in a Tweet would be positive when discussing a major exporter of crude oil but negative when relating to crude oil prices. This is why other terms within the same Tweet must also be considered, as well as the relationship between terms and the context in which they are used. Through an analysis of all the sentiments within a Tweet, the model will provide a score that is either positive or negative—with neutral results discounted from the final output.
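To sketch the scoring and aggregation step, the example below uses a Hugging Face sentiment pipeline as a stand-in for a fine-tuned model. The default pipeline is binary (positive/negative), so treating low-confidence predictions as neutral and discounting them is an assumption of this sketch, as is the simple averaging into an indicator reading.

```python
from transformers import pipeline

# Off-the-shelf binary sentiment model as a placeholder; the neutral threshold
# and the averaging scheme below are illustrative assumptions.
classifier = pipeline("sentiment-analysis")

def score_tweets(tweets, neutral_threshold=0.75):
    signed_scores = []
    for result in classifier(tweets):
        if result["score"] < neutral_threshold:
            continue                      # discount neutral / uncertain Tweets
        sign = 1 if result["label"] == "POSITIVE" else -1
        signed_scores.append(sign * result["score"])
    # Aggregate the surviving Tweets into a single indicator reading.
    return sum(signed_scores) / len(signed_scores) if signed_scores else 0.0

tweets = [
    "OPEC members agree to increase production next quarter",
    "Crude prices slide as output rises faster than demand",
]
print(round(score_tweets(tweets), 3))
```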
No Black Boxes
When developing an indicator, it’s essential that the underlying technology, data, and methodology used to construct the model are entirely explainable. Being able to explain every stage of the process, from data collation and validation to processing and fine-tuning, provides confidence to users that the model is not missing key data, was not constructed with bias, and ascribes sentiment in a logical and fair manner.
Ultimately, the end result is an indicator but, as with all machine learning models, the output accounts for only around two per cent of the work that goes into creating it. Showing your workings is not only best practice but also crucial for ensuring continuous improvement and accelerating the development of more compelling solutions.