Large Scale Graph Mining: Visualization, Exploration, and Analysis

Welcome to the page dedicated to the tutorial "Large Scale Graph Mining: Visualization, Exploration, and Analysis", to be presented at TheWebConf 2021! Here you will find all the information you need to attend the tutorial. This tutorial is organized by Benjamin Ricaud, Nicolas Aspert, and Volodymyr Miz.


What happens inside social networks impacts our everyday life and is of high interest for researchers, data journalists and the general public. These networks, as well as other large online networks of pages or knowledge graphs, contain a rich but overwhelming amount of information. Due to their size and the limited API access, the extraction and analysis of information within these huge networks are challenging. In this hands-on tutorial, we propose an introduction to the data mining of large networks and the analysis of activity inside them.
The tutorial has two parts. The first is an overview of key concepts in (large) graph analysis, an introduction to the main exploration tools in Python and to visualization with Gephi, and a short introduction to machine learning on graphs. It covers a basic set of important tools to start exploring large graphs. During the second part, participants will form teams and focus on a particular large real-world graph, proposed either by the organizers or by the participants themselves. The exploration will be guided, alternating between short presentations of techniques for exploring large networks through APIs and interactions between the organizers and the teams.

Learning objectives

  • Understanding the main concepts for exploring large graphs (first part).
  • Knowing ways to cope with a large amount of graph data (filtering neighbors randomly or making use of node and edge attributes).
  • Advantages, drawbacks, and particularities of graph structures (hubs, small world).
  • How to handle a dynamic graph.
  • Tips for visualizing a network (Gephi).
  • Getting familiar with the sampling and exploration of a large online graph (e.g. a social network) using an API.
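To make the idea of "filtering neighbors randomly" concrete, here is a minimal sketch of a snowball-style sampler written with plain NetworkX. It is not the tutorial's own code: the function name `snowball_sample` and its parameters `max_neighbors` and `max_nodes` are hypothetical, chosen only for illustration, and the toy Barabási-Albert graph stands in for a real network.

```python
import random
from collections import deque

import networkx as nx


def snowball_sample(G, seed_node, max_neighbors=3, max_nodes=50, rng=None):
    """Breadth-first exploration that keeps only a few random neighbors per node.

    Hypothetical helper for illustration; parameter names are assumptions.
    """
    rng = rng or random.Random(0)
    visited = {seed_node}
    queue = deque([seed_node])
    while queue and len(visited) < max_nodes:
        node = queue.popleft()
        # Candidate neighbors that have not been collected yet.
        candidates = [n for n in G.neighbors(node) if n not in visited]
        # Random filtering: follow at most max_neighbors of them.
        rng.shuffle(candidates)
        for neighbor in candidates[:max_neighbors]:
            if len(visited) >= max_nodes:
                break
            visited.add(neighbor)
            queue.append(neighbor)
    # Induced subgraph on the collected nodes.
    return G.subgraph(visited).copy()


# Toy graph with hubs, standing in for a large real-world network.
G = nx.barabasi_albert_graph(n=1000, m=3, seed=1)
sample = snowball_sample(G, seed_node=0, max_neighbors=3, max_nodes=50)
print(sample.number_of_nodes(), sample.number_of_edges())
```

Capping the number of neighbors followed per node keeps hubs from flooding the sample, which is one reason random filtering is useful on graphs with a heavy-tailed degree distribution.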


Program Part I:

  • Introduction, setting up the environment, general presentation, building a graph from data using the Python modules pandas and NetworkX.
  • Standard network properties (small world, hubs, centrality, PageRank, degree distribution), experiments with the Python module NetworkX.
  • Graph visualization with Gephi. Layouts, visualizing node properties with color and size. Communities, centrality, PageRank. Limits of visualization.
  • Principles of graph exploration and sampling. Reducing to a subgraph of interest with graph sampling, experiments on small toy graph models with the Python library Little Ball of Fur (random walks, snowball sampling, Forest Fire, and the more advanced Spikyball).
  • Conclusion and debriefing of Part I. Challenges, problems, data bottlenecks in large graphs and how to overcome them.
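As a rough illustration of the standard network properties listed above, the following sketch computes a few of them with NetworkX on a synthetic graph. The graph model and its parameters are assumptions made for the example and are not taken from the tutorial notebooks.

```python
import networkx as nx

# Toy scale-free graph standing in for a real network (assumed parameters).
G = nx.barabasi_albert_graph(n=300, m=2, seed=42)

# Hubs: the nodes with the highest degree.
degrees = dict(G.degree())
hubs = sorted(degrees, key=degrees.get, reverse=True)[:5]

# Centrality: betweenness is one of several measures NetworkX provides.
betweenness = nx.betweenness_centrality(G)
most_central = max(betweenness, key=betweenness.get)

# Degree distribution: hist[k] is the number of nodes with degree k.
hist = nx.degree_histogram(G)

# Small-world indicators: short average path length and clustering.
avg_path = nx.average_shortest_path_length(G)
clustering = nx.average_clustering(G)

print("hubs:", hubs)
print("most central node:", most_central)
print("average path length:", round(avg_path, 2))
print("average clustering:", round(clustering, 3))
```

PageRank follows the same pattern: `nx.pagerank(G)` returns a dictionary mapping each node to its score, which can be sorted in the same way as the degrees above.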

Program Part II:

  • Some unsupervised and semi-supervised machine learning on graphs: clustering and community detection, label propagation, combining the graph structure with data on nodes (attributed graph). How to apply these methods to large graphs: relation to the graph sampling of Part I.
  • Exploring online data: online graph sampling via an API where access is limited. Examples of Wikipedia and social networks (Reddit Pushshift API or Twitter).
  • Mini project on real and fresh large graph data, using an API and combining what has been learned during the day.
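For a first taste of community detection by label propagation, here is a small sketch using the implementation shipped with NetworkX. The barbell graph is only a toy example picked for illustration; the tutorial itself may rely on other tools or datasets.

```python
import networkx as nx
from networkx.algorithms.community import (
    label_propagation_communities,
    modularity,
)

# Two 10-node cliques joined by a single edge: a toy graph with
# an obvious community structure (assumed example, not tutorial data).
G = nx.barbell_graph(10, 0)

# Each community is returned as a set of nodes.
communities = [set(c) for c in label_propagation_communities(G)]

# Modularity scores how well the partition matches the graph structure.
score = modularity(G, communities)

# Store the community index as a node attribute, e.g. for coloring in Gephi.
for index, community in enumerate(communities):
    for node in community:
        G.nodes[node]["community"] = index

print(len(communities), "communities, modularity", round(score, 3))
```

On a large graph, the same call can be applied to a sampled subgraph rather than to the full network, which is where the sampling techniques of Part I come in.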


The audience should be familiar with coding in Python. Basic knowledge of Git and of API requests is desirable but not mandatory, as is basic knowledge of graphs (nodes, edges, attributes, and properties). A short introduction on how to build a graph from CSV files using the Python modules pandas and NetworkX will be provided during the first hour. For those who cannot run Python on their computer, we will ensure most of the tutorial can be run using the online tool myBinder.
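For readers who want a preview, building a graph from CSV data with pandas and NetworkX might look like the sketch below. The column names (`source`, `target`, `weight`, `name`, `group`) and the inline data are hypothetical; with real files, `pd.read_csv` would be given a file path instead of an in-memory buffer.

```python
import io

import networkx as nx
import pandas as pd

# Hypothetical edge list; with a real file, use pd.read_csv("edges.csv").
edges_csv = io.StringIO(
    "source,target,weight\n"
    "alice,bob,3\n"
    "bob,carol,1\n"
    "carol,alice,2\n"
    "carol,dave,5\n"
)
# Hypothetical node table with one attribute per node.
nodes_csv = io.StringIO(
    "name,group\n"
    "alice,a\n"
    "bob,a\n"
    "carol,b\n"
    "dave,b\n"
)

edges = pd.read_csv(edges_csv)
nodes = pd.read_csv(nodes_csv)

# Each row becomes an edge; the "weight" column becomes an edge attribute.
G = nx.from_pandas_edgelist(
    edges, source="source", target="target", edge_attr="weight"
)

# Attach the node attributes from the second table.
group_by_name = nodes.set_index("name")["group"].to_dict()
nx.set_node_attributes(G, group_by_name, name="group")

print(G.number_of_nodes(), "nodes,", G.number_of_edges(), "edges")
```

A graph built this way can be exported with `nx.write_gexf(G, "graph.gexf")`, a format that Gephi can open.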

Tutorial Material

Jupyter notebooks for the tutorial are available here:

The Gephi and visualization tutorial is available here: