There are numerous sources of information on the Internet. They vary in bias and reliability, and it is sometimes difficult to know where the truth lies.

The GDELT project follows news from thousands of news sites all over the world in real time. For each article it records the URL, the topic, the publication time, the site and several additional metadata fields. Every 15 minutes the dataset is updated with the latest activity in the world of media.

This rich dataset provides interesting information about the behavior of news sites and the interactions between them. We have started several data science projects to extract valuable information about news from this data. These first analyses raise many questions. What influence do particular sites have over the others? Can we group news sources according to the information they provide? Can we spot abnormal behavior? Can we classify news sources by reliability? How do news sites behave when a controversial piece of information appears? Similarly, for the news items themselves, can we spot particular patterns of appearance that would help us classify a news item without reading its content? Can we detect fake news?

All these questions need an in-depth analysis of the GDELT dataset, using the latest data processing methods and machine learning algorithms.

The student will analyze the data using a wide range of data science techniques. This includes:

  • data cleaning,
  • basic statistical analyses to understand the news environment and how news propagates,
  • unsupervised and/or supervised learning to automatically detect fake or controversial news,
  • visualization methods.
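
As an illustration of the first two steps listed above, here is a minimal sketch using only the Python standard library. The records and their field names (url, site, topic, published) are hypothetical and do not reproduce the actual GDELT schema, and the helper functions (clean, articles_per_site, first_publisher_lags) are invented for this example. The sketch removes duplicate and incomplete records, counts articles per site, and measures how long each site takes to pick up a topic after the first site that covered it.

```python
from collections import Counter, defaultdict
from datetime import datetime

# Hypothetical records: the field names are illustrative, not the GDELT schema.
articles = [
    {"url": "https://a.example/1", "site": "a.example", "topic": "election",
     "published": "2020-01-01T10:00:00"},
    {"url": "https://b.example/7", "site": "b.example", "topic": "election",
     "published": "2020-01-01T10:30:00"},
    {"url": "https://a.example/1", "site": "a.example", "topic": "election",
     "published": "2020-01-01T10:00:00"},  # duplicate URL
    {"url": None, "site": "c.example", "topic": "election",
     "published": "2020-01-01T11:00:00"},  # missing URL
    {"url": "https://b.example/8", "site": "b.example", "topic": "storm",
     "published": "2020-01-02T08:00:00"},
    {"url": "https://a.example/2", "site": "a.example", "topic": "storm",
     "published": "2020-01-02T09:15:00"},
]


def clean(records):
    """Drop records without a URL, drop repeated URLs, parse timestamps."""
    seen = set()
    cleaned = []
    for rec in records:
        url = rec.get("url")
        if not url or url in seen:
            continue
        seen.add(url)
        cleaned.append({**rec, "published": datetime.fromisoformat(rec["published"])})
    return cleaned


def articles_per_site(records):
    """Count how many articles each site published."""
    return Counter(rec["site"] for rec in records)


def first_publisher_lags(records):
    """For each topic, map each site to its delay (in minutes) behind the
    first site that published on that topic."""
    earliest_by_topic = defaultdict(dict)
    for rec in records:
        earliest = earliest_by_topic[rec["topic"]]
        site = rec["site"]
        if site not in earliest or rec["published"] < earliest[site]:
            earliest[site] = rec["published"]
    lags = {}
    for topic, earliest in earliest_by_topic.items():
        t0 = min(earliest.values())
        lags[topic] = {site: (t - t0).total_seconds() / 60
                       for site, t in earliest.items()}
    return lags


cleaned = clean(articles)
counts = articles_per_site(cleaned)
lags = first_publisher_lags(cleaned)
```

A site that consistently shows a delay of zero across many topics would be a candidate "source" in the propagation of news. This is only one simple proxy for influence, and other measures may be more appropriate on the real data.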

The project will be implemented in Python and the student should be familiar with programming.