In this project we want to understand dynamic processes on networks, using Wikipedia as an enormous open dataset. Wikipedia provides regular dumps of its pages, from which we plan to build a network of hyperlinks between pages. In addition, the number of visits per page per hour is recorded and available for analysis. We will combine both sources of information to track visitor activity and try to understand how people look for information and explore this source of knowledge.

The goals of this master project are:

  • to create a set of tools for the extraction of information from the Wikipedia dumps,
  • to analyse the Wikipedia data: the structure of the network of hyperlinks and the activity of visitors on the pages.

The student will adopt a data science approach in exploring the dataset. They will create visualisations (interactive or not) that give insight into Wikipedia: into the structure of the hyperlink graph (most connected pages, community detection, connected components...) and into visitor activity (most visited pages per month, most visited topics, average number of visits...).
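As an illustration of the kind of graph analysis mentioned above, here is a minimal sketch in Python. It assumes the hyperlinks have already been extracted from the dumps as (source page, target page) pairs. The toy edge list and the function names `most_linked` and `weakly_connected_components` are hypothetical and only serve the example; the real tools are to be designed during the project.

```python
from collections import Counter, defaultdict, deque

# Hypothetical toy edge list: (source page, target page).
# Real edges would come from parsing the Wikipedia dumps.
links = [
    ("Graph theory", "Mathematics"),
    ("Network science", "Graph theory"),
    ("Network science", "Mathematics"),
    ("Kiwix", "Wikipedia"),
]


def most_linked(edges, n=3):
    """Return the n pages with the most incoming hyperlinks."""
    in_degree = Counter(target for _, target in edges)
    return in_degree.most_common(n)


def weakly_connected_components(edges):
    """Group pages into components, ignoring the direction of links."""
    neighbours = defaultdict(set)
    for source, target in edges:
        neighbours[source].add(target)
        neighbours[target].add(source)

    seen, components = set(), []
    for start in neighbours:
        if start in seen:
            continue
        # Breadth-first search from an unvisited page.
        component, queue = set(), deque([start])
        seen.add(start)
        while queue:
            page = queue.popleft()
            component.add(page)
            for other in neighbours[page]:
                if other not in seen:
                    seen.add(other)
                    queue.append(other)
        components.append(component)
    return components


if __name__ == "__main__":
    print(most_linked(links, n=2))
    print(weakly_connected_components(links))
```

At the scale of the full Wikipedia graph, a dedicated graph library or a distributed framework would probably be needed; this sketch only shows the idea on a handful of pages.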

In addition, the student may have the opportunity to contribute to the Wikipedia community through the Kiwix project. This project aims at distributing Wikipedia to places with limited internet access. The idea is to select subsets of Wikipedia that can fit on USB sticks or other small devices with reduced storage capacity, so that they can be shared more easily. For example, the Kiwix project publishes a subset of Wikipedia focused on medicine, to be used by physicians in poor countries. The outcome of this master project could help select relevant articles when choosing a subset of Wikipedia pages on a particular topic.

The student should have experience programming in Python and an interest in data science. Familiarity with open-source projects, and prior contributions to them, would be an asset. The student should be highly motivated, as the project will require a lot of work.