Machine Learning | Datasets

Here's a list of datasets to play around with. Please help us add to this list by sharing suggestions on our site suggestions forum.

UC Irvine Repository: A searchable repository of close to 200 datasets, browsable by categories like data type, default task, and number of attributes. It covers a diverse range of topics from engineering to life sciences. The repository was founded in 1987 and has since become one of the world's primary sources of machine learning datasets.

CMU StatLib Datasets: Around 100 sets of data from book and article publications.

Toronto Delve Datasets: A smaller repository of roughly 20 datasets covering fairly practical topics like bank-customer decisions and census data. The only drawback is that you need to download extra utility software to manipulate and evaluate the data.

DBPedia: A dataset put together by the community using content from Wikipedia. The data is available as N-Triples and in CSV format, covering domains such as geography, people, companies, music and books. There's also ongoing work to publish and interlink data from the web based on Tim Berners-Lee's Linked Data principles.

Freebase: A huge collection of data covering over 5 million topics, 3,000 types and 30,000 properties. It's an open database built by the community and available for anyone to query. Freebase offers both web services and data dumps to access the information.

Economagic: Over 200,000 government economic data files, each accompanied by charts and Excel files. (courtesy of Brian Donhauser)
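
If you haven't worked with the N-Triples format mentioned in the DBPedia entry above, each line holds one statement: a subject, a predicate and an object, ending with a period. Here's a minimal Python sketch of reading such lines, assuming a simplified subset of the format (IRIs in angle brackets and quoted literals only, no blank nodes or escape decoding). The `parse_triple` helper and the `example.org` sample lines are made up for illustration and are not taken from an actual DBPedia dump.

```python
import re

# Simplified N-Triples pattern (an assumption, not the full grammar):
# subject and predicate are IRIs in angle brackets; the object is either
# an IRI or a quoted literal with an optional language tag or datatype.
TRIPLE_RE = re.compile(
    r'^\s*<(?P<s>[^>]*)>\s+<(?P<p>[^>]*)>\s+'
    r'(?:<(?P<o_iri>[^>]*)>|"(?P<o_lit>(?:[^"\\]|\\.)*)"'
    r'(?:@(?P<lang>[A-Za-z0-9-]+)|\^\^<(?P<dtype>[^>]*)>)?)'
    r'\s*\.\s*$'
)


def parse_triple(line):
    """Return (subject, predicate, object) for one line, or None if it does not match."""
    m = TRIPLE_RE.match(line)
    if m is None:
        return None
    # The object is whichever alternative matched: an IRI or a literal.
    obj = m.group("o_iri") if m.group("o_iri") is not None else m.group("o_lit")
    return m.group("s"), m.group("p"), obj


# Hypothetical sample lines written for illustration only.
sample = [
    '<http://example.org/resource/Toronto> <http://example.org/property/country> <http://example.org/resource/Canada> .',
    '<http://example.org/resource/Toronto> <http://example.org/property/label> "Toronto"@en .',
]

triples = [parse_triple(line) for line in sample]
for triple in triples:
    print(triple)
```

For real dumps, a dedicated RDF parser is the safer choice, since the full format also allows blank nodes and escaped characters that this sketch ignores.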