Supervision: Benjamin Ricaud

Project type: Master thesis

Finished

In this project we want to understand dynamic processes on networks, using Wikipedia as an enormous open-source dataset. Wikipedia provides regular dumps of its pages, from which we plan to build a network of hyperlinks between pages. In addition, the number of visits per page per hour is recorded and available for analysis. We will combine both sources of information to track visitor activity and try to understand how people look for information and explore this source of knowledge.
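To make the idea of combining the two data sources concrete, here is a minimal sketch in Python. It does not parse real Wikipedia dumps: the page names, links and visit counts are made-up toy data, the function names `build_graph` and `neighbourhood_visits` are hypothetical, and treating hyperlinks as undirected is a simplifying assumption.

```python
from collections import defaultdict

# Toy data (assumption): (source page, target page) hyperlinks.
links = [
    ("Olympic Games", "100 metres"),
    ("100 metres", "Usain Bolt"),
    ("Usain Bolt", "Olympic Games"),
    ("Pizza", "Italy"),
]

# Toy data (assumption): hourly visit counts per page, one value per hour.
visits = {
    "Olympic Games": [120, 130, 900, 950],
    "100 metres": [10, 12, 300, 280],
    "Usain Bolt": [40, 35, 700, 650],
    "Pizza": [50, 55, 52, 49],
    "Italy": [80, 82, 79, 85],
}


def build_graph(edges):
    """Return an undirected adjacency map {page: set of neighbours}."""
    graph = defaultdict(set)
    for src, dst in edges:
        graph[src].add(dst)
        graph[dst].add(src)
    return dict(graph)


def neighbourhood_visits(graph, counts, page):
    """Sum, hour by hour, the visits of a page and its direct neighbours."""
    pages = {page} | graph.get(page, set())
    series = [counts[p] for p in pages if p in counts]
    return [sum(hour) for hour in zip(*series)]


graph = build_graph(links)
total = neighbourhood_visits(graph, visits, "Usain Bolt")
print(total)
```

Aggregating visits over a page's neighbourhood is only one possible way to link activity to graph structure; on the real data, the adjacency map would be built from the dumps and the counts from the hourly visit records.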

The project is a collaboration between the LTS2 lab and the Kiwix association. The student may have the opportunity to contribute to the Wikipedia community through the Kiwix project.

The student will adopt a data-science approach to exploring the dataset. He/she will be asked to:

  • apply data processing methods, network theory and unsupervised learning to build automatic tools for extracting information from Wikipedia data. For example: 1) extract the most popular groups of pages in predefined categories and over time, 2) detect popularity patterns, in time and on the network, 3) create an automatic summary of events that triggered an increase in visits to related pages, 4) spot periodic events, 5) compare the popularity of events such as sports events.

  • interpret the results and create visualizations (interactive or not) that give insight into Wikipedia: into the structure of the hyperlink graph (most connected pages, community detection, connected components...) and into visitor activity (most visited pages per month, most visited topics, average number of visits...).

  • create a GitHub repository and make his/her code open source.
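As an illustration of tasks 2) and 3) in the list above, the following Python sketch flags hours with unusually many visits and then groups the flagged pages that are linked to each other. It is a sketch under stated assumptions, not the project's method: the data are invented, the median-based threshold (`factor=3.0`) is an arbitrary choice, and the helper names `spiking_hours` and `connected_components` are hypothetical.

```python
from collections import deque
from statistics import median


def spiking_hours(series, factor=3.0):
    """Return indices of hours whose count exceeds factor x the series median."""
    baseline = median(series)
    return [i for i, value in enumerate(series) if value > factor * baseline]


def connected_components(nodes, edges):
    """Group nodes into connected components, using only edges between them."""
    nodes = set(nodes)
    adjacency = {n: set() for n in nodes}
    for a, b in edges:
        if a in nodes and b in nodes:
            adjacency[a].add(b)
            adjacency[b].add(a)
    seen, components = set(), []
    for start in sorted(nodes):
        if start in seen:
            continue
        # Breadth-first search from each page not yet assigned to a group.
        queue, component = deque([start]), set()
        seen.add(start)
        while queue:
            node = queue.popleft()
            component.add(node)
            for nxt in adjacency[node] - seen:
                seen.add(nxt)
                queue.append(nxt)
        components.append(component)
    return components


# Toy data (assumption): hourly visits and hyperlinks between pages.
visits = {
    "Olympic Games": [100, 110, 105, 900, 950, 120],
    "Usain Bolt": [40, 35, 38, 700, 650, 42],
    "Pizza": [50, 55, 52, 49, 51, 53],
    "Solar eclipse": [20, 22, 400, 21, 19, 20],
}
links = [("Olympic Games", "Usain Bolt"), ("Pizza", "Olympic Games")]

spikes = {page: spiking_hours(series) for page, series in visits.items()}
spiking_pages = [page for page, hours in spikes.items() if hours]
groups = connected_components(spiking_pages, links)
print(spikes)
print(groups)
```

Each resulting group is a candidate "event": a set of linked pages whose visits rose sharply. A median baseline only works when the spike covers less than half of the observed hours, so longer events or periodic patterns would need a different baseline.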

The project is open-ended, with many promising directions to explore. The student will be free to focus on a direction of his/her own choosing.

The student should have experience in programming (Python and/or Scala), data analysis and machine learning. Familiarity with Git and contributions to open-source projects would be an asset.