The goal of this project is to build a social network of the characters appearing in a finite corpus of digital documents, and to explore the structure of the resulting network.
As a first step, the student(s) will develop algorithms that retrieve information about the social relationships between entities identified as network nodes (persons, characters, places). This will be done through automated mining of text documents (text files or web pages). In the second step, the student(s) will record every meaningful relation in a network database. This will require developing tools to extract semantic information from the corpus. Finally, the student(s) will present an interactive web-based visualisation of the network. The student(s) are expected to take part in all three steps. After a literature review, they will implement and test different methods, and suggest improvements based on the results.
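As a rough illustration of the first two steps, the sketch below counts how often two known character names appear in the same sentence and treats each count as the weight of an edge. It is only one simple baseline among the methods the literature review may suggest, not the method prescribed for the project. The function name `cooccurrence_edges`, the naive sentence splitter, and the toy text (written for this example, not taken from any corpus) are all assumptions.

```python
import itertools
import re
from collections import Counter


def cooccurrence_edges(text, names):
    """Count how often two known names appear in the same sentence.

    Hypothetical helper: a sentence-level co-occurrence baseline,
    assuming the list of character names is already known.
    """
    edges = Counter()
    # Naive sentence split: break after '.', '!' or '?' followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        # Plain substring matching; sorting makes (a, b) order deterministic.
        present = sorted(n for n in names if n in sentence)
        for a, b in itertools.combinations(present, 2):
            edges[(a, b)] += 1
    return edges


# Toy text written for this example (an assumption, not corpus content).
bio = (
    "Luke Skywalker was trained by Obi-Wan Kenobi. "
    "Obi-Wan Kenobi once fought Darth Vader. "
    "Luke Skywalker later confronted Darth Vader."
)
names = ["Luke Skywalker", "Obi-Wan Kenobi", "Darth Vader"]

edges = cooccurrence_edges(bio, names)
for (a, b), weight in sorted(edges.items()):
    print(f"{a} -- {b} (weight {weight})")
```

The resulting weighted pairs could then be written to whichever network database is chosen for the second step. Co-occurrence alone says nothing about the type of relation (mentor, enemy, relative), which is where the semantic extraction mentioned above would come in.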
One possible corpus is the fictional universe of Star Wars and the relations between its characters. In this case, the Wookieepedia database will be the corpus. There, each character is described by a biography that can serve as a starting point for the data mining. With its thousands of characters, the Star Wars network is a good case study. The project will lead to a procedure for extracting and studying social relationships from any texts and documents. The Lab has related projects under way, such as the Venice Time Machine (analysis of the archives of Venice over several centuries).
Skills required: Strong programming skills (to be discussed): Python or JVM-based languages (parts 1 & 2), Web technologies (part 3). Prior experience with databases (SQL and/or NoSQL); knowledge of graph databases is a plus. Interest in the semantic web and visualisation.