The strength and robustness of a machine learning algorithm often lies in the quality of the dataset used to train it. Therefore, it would suffice to say that to gain true mastery within these fields, it is imperative that a person gains experience over a variety of machine learning problems that deal with a variety of datasets - ranging from image processing to speech recognition.
Here, we have highlighted our favourite picks amongst the open data sets available on the internet. We hope you have a great time exploring and experimenting with these!
1) MNIST Dataset
One of the most popular deep learning datasets out there, MNIST is a dataset of handwritten digits and consists of a training set of more than 60,000 examples, with a test set of 10,000. A part of the much larger NIST library, these examples were re-mixed, with the original samples being normalized to fit into a 28 x 28 pixel bounding box.
Popularly used as a classification problem, researchers have managed to achieve “near-human performance” with this dataset, thus making MNIST one of the standard libraries used for image classification.
2) STL-10 Dataset
The STL-10 dataset is an image recognition dataset with a corpus of 100000 unlabelled images and 500 training images that can be used to develop unsupervised feature learning, deep learning, self-taught learning algorithms. Its large class of unlabelled examples is a great way of training an image recognition model, especially so given the fact that the images are of a high resolution (96 x 96 pixels). A challenging dataset that can help further the advancement in the creation of unsupervised learning methods, it has 10 different classes - airplane, bird, car, cat, deer, dog, horse, monkey, ship, truck.
Another image recognition dataset but targeting fashion products, with a corpus of a training set of 60,000 examples and a test set of 10,000 examples. Each example is a 28x28 grayscale image, associated with a label from 10 classes that range from T-shirt to Bag as well as Dress. The makers of the Fashion - MNIST created the dataset to replace the original MNIST dataset that, according to them, is way too simple to be considered the benchmark dataset for testing machine learning models.
While it retains the 28 x 28 pixel size and greyscale image style of the original dataset, it introduces a complexity in the classes that makes it more suitable for CVs and other models of classification.
4) Yelp Open Dataset
A subset of Yelp’s businesses, reviews, and user data, it consists of millions of user reviews, businesses attributes and over 200,000 pictures from 11 metropolitan areas. Released by Yelp to assist the learning community, this is a common dataset used for NLP challenges all across the world. It comes in a JSO as well as SQL format. It can also be useful for sentiment analysis and graph mining projects.
5) IMDB Reviews
A great dataset for sentiment analysis, the IMDB reviews dataset consists of 25,000 highly polar movie reviews for training and 25,000 for testing with raw texts and an already processed bag of words provided. A great resource for movie buffs, the large volume of examples available makes it a great dataset for binary classification.
Recommendation & Ranking Systems
6) Netflix Prize Dataset
Netflix hosted an open competition to select the best collaborative filtering algorithm to predict user ratings for films, based on user ratings without any other information on the user or films. The training dataset consists of 100,480,507 ratings that 480,189 users gave to 17,770 movies. Each training rating is a quadruplet of the form <user, movie, date of grade, grade>, with the test dataset consisting of 2,817,131 triplets that contain only <user, movie, date of grade>. The winning team for the competition was BellKor's Pragmatic Chaos, which won the prize money of USD $1,000,000 for developing a rating system that bested Netflix’s own algorithm by 10%.
7) Million Song Dataset
A large dataset that’s available on Kaggle, it was created for the Million Song Database Challenge that aims to create the best offline evaluation of a music recommendation system.
The dataset consists of the official indexing of songs, the official ordering of user IDs, the visible half of the listening histories of the 110K evaluation users, the mapping from songs to tracks. The training set consists of the full history of 1 Million users whereas the test data consists of 110K users.
8) Free Music Archive (FMA)
The FMA is an open and easily accessible dataset that can be used for browsing, searching, and organizing large music collections. Useful for music analysis, it includes full-length and HQ audio, pre-computed features, and track and user-level metadata.
The dataset consists of the following files -
- csv: per track metadata such as ID, title, artist, genres, tags and play counts, for all 106,574 tracks.
- csv: all 163 genre IDs with their name and parent (used to infer the genre hierarchy and top-level genres).
- csv: common features extracted.
- csv: audio features provided by Spotify for a subset of 13,129 tracks.
9) Google Audioset
This is a large dataset of 632 audio event classes and a collection of 2,084,320 human-labeled 10-second sound clips drawn from YouTube videos. Termed as a sound and vocabulary dataset, it can be used to create models for audio event detection.
10) Arcade Universe
Arcade Universe is an artificial dataset generator with images containing arcade games sprites such as tetris pentomino/tetromino objects that can be used for image classification purposes.
Our suggestions should keep every deep learning enthusiast busy, but if you’re looking for more, check out our data science bootcamp. It might just be what you are looking for.
Written by Manavika Phukan, a 23-year-old Data Scientist and Co-Founder of Handtribe. She holds Post Graduate degree in Data Science from St. Xavier's College, Mumbai. Loves Python, reading webcomics, and k-dramas.