Datasets

Lists

  1. Awesome public datasets
  2. mldata
  3. Quora
  4. Stanford SNAP
  5. State of the art of benchmark datasets
  6. Network repository
  7. openNASA datasets (with PyNASA)
  8. Internet Archive
  9. Google BigQuery Public Dataset
  10. AWS Public Datasets
  11. Multivac platform

Graph models

  1. Erdős-Rényi graphs (1959)
  2. Stochastic blockmodels [Faust-Wasserman ’92]
  3. Lancichinetti-Fortunato-Radicchi (LFR) graphs (2008)
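
To make the first two models concrete, here is a minimal sketch using only the Python standard library. The function names `erdos_renyi` and `stochastic_blockmodel` are hypothetical helpers written for this illustration, and the G(n, p) variant of the Erdős-Rényi model is assumed. LFR graphs need a considerably longer generator and are not sketched here.

```python
import itertools
import random


def erdos_renyi(n, p, seed=None):
    """Sample G(n, p): keep each of the n*(n-1)/2 possible edges with probability p."""
    rng = random.Random(seed)
    return [(u, v) for u, v in itertools.combinations(range(n), 2)
            if rng.random() < p]


def stochastic_blockmodel(sizes, probs, seed=None):
    """Sample an undirected stochastic blockmodel.

    sizes[i] is the number of nodes in block i; probs[i][j] is the edge
    probability between a node of block i and a node of block j.
    Returns (edges, block), where block[v] is the block index of node v.
    """
    rng = random.Random(seed)
    block = [b for b, size in enumerate(sizes) for _ in range(size)]
    edges = [(u, v) for u, v in itertools.combinations(range(len(block)), 2)
             if rng.random() < probs[block[u]][block[v]]]
    return edges, block


# Illustrative parameters only: two dense communities, sparsely connected.
er_edges = erdos_renyi(100, 0.05, seed=0)
sbm_edges, block = stochastic_blockmodel(
    [50, 50], [[0.2, 0.01], [0.01, 0.2]], seed=0)
```

Both samplers loop over all node pairs, so they are only practical for small graphs. For larger ones a library generator is the better choice.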

Constructed graphs (various modalities)

Image

  1. MNIST: 70’000 samples, 784 features, 10 classes
  2. ImageNet
  3. YFCC100M: 100M images from Flickr
  4. NORB
  5. CIFAR: 10 and 100 classes
  6. Species

Music

  1. GTZAN: 1000 samples, 10 classes
  2. Million Song Dataset
  3. FMA

Audio

  • AudioSet

Video

  • YouTube dataset

Text

  1. 20NEWS (sklearn): 18’000 samples, 34’000 words, 20 classes
  2. Reuters (JMLR, NIST, sklearn): 800’000 samples, 2000 most frequent words, 103 (50 or 55) classes [1]
  3. WebKB: 8’000 documents, 20’000 terms, 7 classes
  4. Google Books Ngrams
  5. Matt's curated Wikipedia vocabulary, e.g. for word2vec.
  6. Wikipedia article dump: ~3 billion words.
  7. DBpedia
  8. Enron email corpus

Others

  1. arXiv
    1. graph visualization
    2. tf-idf search
  2. pings: data from wondernetwork
  3. cosmology (graph of galaxies): Cosmic Web, data from Illustris
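
As an illustration of the tf-idf search item under arXiv, here is a small standard-library sketch. It assumes documents are already tokenized into lists of words and uses a smoothed idf, log(N / df) + 1. The weighting scheme, the function names and the toy documents are all assumptions made for this example, not a description of the actual arXiv tool.

```python
import math
from collections import Counter


def tfidf_index(docs):
    """Build one tf-idf vector (dict term -> weight) per tokenized document."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed idf: terms present in every document keep a nonzero weight.
    idf = {t: math.log(n / d) + 1.0 for t, d in df.items()}
    vectors = [{t: c / len(doc) * idf[t] for t, c in Counter(doc).items()}
               for doc in docs]
    return vectors, idf


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def search(query, vectors, idf):
    """Return indices of matching documents, most similar first."""
    counts = Counter(t for t in query if t in idf)
    q_vec = {t: c * idf[t] for t, c in counts.items()}
    scored = [(cosine(q_vec, v), i) for i, v in enumerate(vectors)]
    return [i for s, i in sorted(scored, key=lambda x: (-x[0], x[1])) if s > 0]


# Toy corpus standing in for tokenized abstracts.
docs = [["graph", "signal", "processing"],
        ["deep", "learning", "graph"],
        ["galaxy", "cosmic", "web"]]
vectors, idf = tfidf_index(docs)
hits = search(["graph", "learning"], vectors, idf)
```

Query terms that never occur in the corpus are ignored, and documents with zero similarity are dropped from the result.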

Bioinformatics

  1. Merck / DPP4: 8193 x 2796
  2. Genes ??
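
The datasets listed under constructed graphs are sets of samples rather than graphs, so a graph has to be built from them. The list does not say which construction is meant. One common option, assumed here purely for illustration, is a symmetric k-nearest-neighbour graph on the feature vectors. The helper name `knn_graph` is hypothetical.

```python
import math


def knn_graph(points, k):
    """Return the edges (u, v), u < v, of a symmetrized k-NN graph.

    Each point is linked to its k nearest neighbours in Euclidean distance;
    an edge is kept if either endpoint selects the other.
    """
    edges = set()
    for i, p in enumerate(points):
        nearest = sorted((math.dist(p, q), j)
                         for j, q in enumerate(points) if j != i)
        for _, j in nearest[:k]:
            edges.add((min(i, j), max(i, j)))
    return edges


# Four 2-D points forming two well-separated pairs.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
edges = knn_graph(points, k=1)
```

This brute-force version compares every pair of points, which is fine for a few thousand samples but too slow for something the size of MNIST. An approximate nearest-neighbour library would be the usual substitute there.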

Natural graphs

Real world networks

  1. electrical grid
  2. transportation network
  3. communication networks
  4. topological features: airports, seaports, rivers, lakes, roads, railroads, cities, land borders
  5. Connectomes: open connectome project
  6. Urban Road Network Data

Social networks

  1. Wikipedia: 4M x
  2. Reddit comments
  3. WWW (hyperlinks)
  4. Facebook
  5. LinkedIn
  6. Twitter
    1. TwitterTrails: story trustworthiness analysis
  7. LittleSis: business and government
  8. BlogCatalog: relationships between bloggers

Point clouds

  1. Stanford bunny
  2. Growing plant (data on the NAS)

Brain (fMRI / EEG)

Node classification

Governments

Social media (from Daniel)

Genome

From Pascal

Another good collection of links to network resources (including data repositories) is the Awesome Network Analysis list curated by François Briatte.
If you are looking for network data to use in teaching, I would also recommend having students collect social media data. For graduate students, R packages like twitteR and Rfacebook may be a good way to do this. For undergraduate students, I recommend NodeXL, an intuitive and easy-to-use Excel add-on that can grab data from Facebook, Twitter, YouTube, and other sources.

References

[1] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15:1929–1958, 2014.
