Datasets

The datasets package provides a common framework for downloading and loading datasets. It is perfect for someone who wishes to experiment with a Learner and wants quick access to many different datasets.

The package includes a module for each currently supported dataset. Each module's docstring gives a reference to the work that produced the dataset or that used this particular version of it.

The package also has a datasets.store module, which implements simple functions to obtain MLProblems from those datasets.

The modules currently included are:

  • datasets.store: provides functions for obtaining MLProblems from the supported datasets.
  • datasets.abalone: Abalone dataset module.
  • datasets.adult: Adult dataset module.
  • datasets.bibtex: Bibtex dataset module.
  • datasets.binarized_mnist: binarized version of MNIST module.
  • datasets.cadata: Cadata dataset module.
  • datasets.cifar10: CIFAR-10 dataset module.
  • datasets.connect4: Connect-4 dataset module.
  • datasets.convex: Convex dataset module.
  • datasets.corel5k: Corel5k dataset module.
  • datasets.corrupted_mnist: Corrupted MNIST dataset module.
  • datasets.corrupted_ocr_letters: Corrupted OCR letters dataset module.
  • datasets.dna: DNA dataset module.
  • datasets.face_completion_lfw: Labeled Faces in the Wild, face completion dataset module.
  • datasets.housing: Housing dataset module.
  • datasets.heart: Heart dataset module.
  • datasets.letor_mq2007: LETOR 4.0 MQ2007 dataset module.
  • datasets.letor_mq2008: LETOR 4.0 MQ2008 dataset module.
  • datasets.majmin: MajMin dataset module.
  • datasets.mediamill: Mediamill dataset module.
  • datasets.medical: Medical dataset module.
  • datasets.mnist: MNIST dataset module.
  • datasets.mnist_basic: MNIST basic dataset module.
  • datasets.mnist_background_images: MNIST background-images dataset module.
  • datasets.mnist_background_random: MNIST background-random dataset module.
  • datasets.mnist_rotated: MNIST rotated dataset module.
  • datasets.mnist_rotated_background_images: MNIST rotated background-images dataset module.
  • datasets.mturk: MTurk dataset module.
  • datasets.mushrooms: Mushrooms dataset module.
  • datasets.newsgroups: 20-newsgroups dataset module.
  • datasets.nips: NIPS dataset module.
  • datasets.occluded_faces_lfw: Labeled Faces in the Wild, occluded faces dataset module.
  • datasets.occluded_mnist: Occluded MNIST dataset module.
  • datasets.ocr_letters: OCR letters dataset module.
  • datasets.rcv1: RCV1 dataset module.
  • datasets.rectangles: Rectangles dataset module.
  • datasets.rectangles_images: Rectangles images dataset module.
  • datasets.sarcos: SARCOS dataset module.
  • datasets.scene: Scene dataset module.
  • datasets.web: Web dataset module.
  • datasets.yahoo_ltrc1: Yahoo! Learning to Rank Challenge, Set 1 dataset module.
  • datasets.yahoo_ltrc2: Yahoo! Learning to Rank Challenge, Set 2 dataset module.
  • datasets.yeast: Yeast dataset module.

Dataset store

The datasets.store module provides a unified interface for downloading datasets and creating MLProblems from those datasets.

It defines the following variables:

  • datasets.store.all_names: set of all dataset names
  • datasets.store.classification_names: set of dataset names for classification
  • datasets.store.regression_names: set of dataset names for regression
  • datasets.store.collaborative_filtering_names: set of dataset names for collaborative filtering
  • datasets.store.distribution_names: set of dataset names for distribution estimation
  • datasets.store.topic_modeling_names: set of dataset names for topic modeling
  • datasets.store.multilabel_names: set of dataset names for multilabel classification
  • datasets.store.multiregression_names: set of dataset names for multidimensional regression
  • datasets.store.ranking_names: set of dataset names for ranking problems
  • datasets.store.nlp_names: set of dataset names for natural language processing
  • datasets.store.image_classification_names: set of dataset names for image classification problems

It also defines the following functions:

  • datasets.store.download: downloads a given dataset
  • datasets.store.get_generic_problem: returns train/valid/test generic MLProblems from some given dataset name
  • datasets.store.get_classification_problem: returns train/valid/test classification MLProblems from some given dataset name
  • datasets.store.get_regression_problem: returns train/valid/test regression MLProblems from some given dataset name
  • datasets.store.get_collaborative_filtering_problem: returns train/valid/test collaborative filtering MLProblems from some given dataset name
  • datasets.store.get_distribution_problem: returns train/valid/test distribution estimation MLProblems from some given dataset name
  • datasets.store.get_topic_modeling_problem: returns train/valid/test topic modeling MLProblems from some given dataset name
  • datasets.store.get_multilabel_problem: returns train/valid/test multilabel classification MLProblems from some given dataset name
  • datasets.store.get_multiregression_problem: returns train/valid/test multidimensional regression MLProblems from some given dataset name
  • datasets.store.get_ranking_problem: returns train/valid/test ranking MLProblems from some given dataset name
  • datasets.store.get_nlp_problem: returns train/valid/test NLP MLProblems from some given dataset name
  • datasets.store.get_image_classification_problem: returns train/valid/test image classification problems from some given dataset name
  • datasets.store.get_k_fold_experiment: returns a list of train/valid/test MLProblems for a k-fold experiment
  • datasets.store.get_semisupervised_experiment: returns new train/valid/test MLProblems corresponding to a semi-supervised learning experiment
  • datasets.store.get_object_recognition_experiment: returns new train/test MLProblems corresponding to an object recognition experiment
datasets.store.download(name, dataset_dir=None)[source]

Downloads dataset name.

name must be one of the supported datasets (see variable all_names of this module).

If environment variable MLPYTHON_DATASET_REPO has been set to a valid directory path, a subdirectory will be created and the dataset will be downloaded there. Alternatively the subdirectory path can be given by the user through option dataset_dir.
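As a rough sketch (assuming the package is installed under the top-level name mlpython, so that this module is importable as mlpython.datasets.store; the dataset name and directory below are illustrative):

    from mlpython.datasets import store as dataset_store  # top-level package name assumed

    # Check that the dataset is supported before downloading it.
    if 'mnist' in dataset_store.all_names:
        # Either set MLPYTHON_DATASET_REPO before launching Python, or pass an
        # explicit target directory through dataset_dir (illustrative path).
        dataset_store.download('mnist', dataset_dir='/tmp/mlpython_datasets/mnist')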

datasets.store.delete(name)[source]

Removes dataset name from the hard drive.

datasets.store.get_classification_problem(name, dataset_dir=None, load_to_memory=True, **kw)[source]

Creates train/valid/test classification MLProblems from dataset name.

name must be one of the supported datasets (see variable classification_names of this module).

Option load_to_memory determines whether the dataset should be loaded into memory or always read from its files.

If environment variable MLPYTHON_DATASET_REPO has been set to a valid directory path, this function will look into its appropriate subdirectory to find the dataset. Alternatively the subdirectory path can be given by the user through option dataset_dir.
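For instance, a minimal sketch, assuming 'mnist' has already been downloaded and that the module is importable as mlpython.datasets.store (the other get_*_problem functions are used the same way):

    from mlpython.datasets import store as dataset_store  # top-level package name assumed

    # Returns a triplet of classification MLProblems.
    trainset, validset, testset = dataset_store.get_classification_problem(
        'mnist', load_to_memory=True)

    # MLProblems are iterable; for classification problems each example is
    # assumed to be an (input, target) pair.
    for example in trainset:
        pass  # feed the example to a learner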

datasets.store.get_generic_problem(name, dataset_dir=None, load_to_memory=True, **kw)[source]

Creates train/valid/test generic MLProblems from dataset name.

name must be one of the supported datasets.

Option load_to_memory determines whether the dataset should be loaded into memory or always read from its files.

If environment variable MLPYTHON_DATASET_REPO has been set to a valid directory path, this function will look into its appropriate subdirectory to find the dataset. Alternatively the subdirectory path can be given by the user through option dataset_dir.

datasets.store.get_regression_problem(name, dataset_dir=None, load_to_memory=True, **kw)[source]

Creates train/valid/test regression MLProblems from dataset name.

name must be one of the supported datasets (see variable regression_names of this module).

Option load_to_memory determines whether the dataset should be loaded into memory or always read from its files.

If environment variable MLPYTHON_DATASET_REPO has been set to a valid directory path, this function will look into its appropriate subdirectory to find the dataset. Alternatively the subdirectory path can be given by the user through option dataset_dir.

datasets.store.get_collaborative_filtering_problem(name, dataset_dir=None, load_to_memory=True, **kw)[source]

Creates train/valid/test collaborative filtering MLProblems from dataset name.

name must be one of the supported datasets (see variable collaborative_filtering_names of this module).

Option load_to_memory determines whether the dataset should be loaded into memory or always read from its files.

If environment variable MLPYTHON_DATASET_REPO has been set to a valid directory path, this function will look into its appropriate subdirectory to find the dataset. Alternatively the subdirectory path can be given by the user through option dataset_dir.

datasets.store.get_distribution_problem(name, dataset_dir=None, load_to_memory=True, **kw)[source]

Creates train/valid/test distribution estimation MLProblems from dataset name.

name must be one of the supported datasets (see variable distribution_names of this module).

Option load_to_memory determines whether the dataset should be loaded into memory or always read from its files.

If environment variable MLPYTHON_DATASET_REPO has been set to a valid directory path, this function will look into its appropriate subdirectory to find the dataset. Alternatively the subdirectory path can be given by the user through option dataset_dir.

datasets.store.get_topic_modeling_problem(name, dataset_dir=None, load_to_memory=True, **kw)[source]

Creates train/valid/test topic modeling MLProblems from dataset name.

name must be one of the supported datasets (see variable topic_modeling_names of this module).

Option load_to_memory determines whether the dataset should be loaded into memory or always read from its files.

If environment variable MLPYTHON_DATASET_REPO has been set to a valid directory path, this function will look into its appropriate subdirectory to find the dataset. Alternatively the subdirectory path can be given by the user through option dataset_dir.

datasets.store.get_multilabel_problem(name, dataset_dir=None, load_to_memory=True, **kw)[source]

Creates train/valid/test multilabel classification MLProblems from dataset name.

name must be one of the supported datasets (see variable multilabel_names of this module).

Option load_to_memory determines whether the dataset should be loaded into memory or always read from its files.

If environment variable MLPYTHON_DATASET_REPO has been set to a valid directory path, this function will look into its appropriate subdirectory to find the dataset. Alternatively the subdirectory path can be given by the user through option dataset_dir.

datasets.store.get_multiregression_problem(name, dataset_dir=None, load_to_memory=True, **kw)[source]

Creates train/valid/test multidimensional regression MLProblems from dataset name.

name must be one of the supported datasets (see variable multiregression_names of this module).

Option load_to_memory determines whether the dataset should be loaded into memory or always read from its files.

If environment variable MLPYTHON_DATASET_REPO has been set to a valid directory path, this function will look into its appropriate subdirectory to find the dataset. Alternatively the subdirectory path can be given by the user through option dataset_dir.

datasets.store.get_ranking_problem(name, dataset_dir=None, load_to_memory=True, **kw)[source]

Creates train/valid/test ranking MLProblems from dataset name.

name must be one of the supported datasets (see variable ranking_names of this module).

Option load_to_memory determines whether the dataset should be loaded into memory or always read from its files.

If environment variable MLPYTHON_DATASET_REPO has been set to a valid directory path, this function will look into its appropriate subdirectory to find the dataset. Alternatively the subdirectory path can be given by the user through option dataset_dir.

datasets.store.get_nlp_problem(name, dataset_dir=None, load_to_memory=True, **kw)[source]

Creates train/valid/test NLP MLProblems from dataset name.

name must be one of the supported datasets (see variable nlp_names of this module).

Option load_to_memory determines whether the dataset should be loaded into memory or always read from its files.

If environment variable MLPYTHON_DATASET_REPO has been set to a valid directory path, this function will look into its appropriate subdirectory to find the dataset. Alternatively the subdirectory path can be given by the user through option dataset_dir.

datasets.store.get_image_classification_problem(name, dataset_dir=None, load_to_memory=True, **kw)[source]

Creates train/valid/test MLProblems for image classification (ClassificationProblems whose inputs are images) from dataset name.

name must be one of the supported datasets (see variable image_classification_names of this module).

Option load_to_memory determines whether the dataset should be loaded into memory or always read from its files.

If environment variable MLPYTHON_DATASET_REPO has been set to a valid directory path, this function will look into its appropriate subdirectory to find the dataset. Alternatively the subdirectory path can be given by the user through option dataset_dir.

datasets.store.get_k_fold_experiment(datasets, k=10, seed=1234)[source]

Creates a k-fold experiment from datasets, a list of MLProblems.

k determines the number of folds, and seed is for the random number generator that will shuffle all the examples before creating the folds.

The output is a list of k triplets (train,valid,test), which determine the experiment to be run for each test fold. valid is also an individual fold and train corresponds to the concatenation of the remaining folds.
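A sketch of running such an experiment (the choice of dataset and of what is done with each fold is illustrative; the top-level package name mlpython is assumed):

    from mlpython.datasets import store as dataset_store  # top-level package name assumed

    trainset, validset, testset = dataset_store.get_classification_problem('mnist')

    # Pool all available examples and rebuild k train/valid/test splits from them.
    folds = dataset_store.get_k_fold_experiment([trainset, validset, testset],
                                                k=5, seed=1234)

    for fold_train, fold_valid, fold_test in folds:
        pass  # train on fold_train, tune on fold_valid, evaluate on fold_test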

datasets.store.get_semisupervised_experiment(trainset, validset, testset, labeled_frac=0.1, label_field=1, seed=1234)[source]

Creates a semi-supervised experiment from training, validation and test MLProblems.

The test set is returned untouched. The training and validation sets are regenerated so that the ratio of labeled validation data to labeled training data is the same as in the original datasets.

labeled_frac is the total fraction of labeled data in the training and validation sets. Only the training set will contain unlabeled data.

label_field is the index for the examples’ label field.

seed is for the random number generator that will select which examples to keep labeled and which to put in the validation set.
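A sketch of setting up such an experiment (the dataset name is illustrative; the top-level package name mlpython is assumed):

    from mlpython.datasets import store as dataset_store  # top-level package name assumed

    trainset, validset, testset = dataset_store.get_classification_problem('mnist')

    # Keep labels for 10% of the training and validation examples; the remaining
    # training examples become unlabeled and the test set is returned untouched.
    semi_train, semi_valid, semi_test = dataset_store.get_semisupervised_experiment(
        trainset, validset, testset, labeled_frac=0.1, label_field=1, seed=1234)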

datasets.store.get_object_recognition_experiment(trainset, validset, testset, n_train_per_class=30, at_most_n_test_per_class=50, seed=1234)[source]

Creates an object recognition experiment from training, validation and test MLProblems.

A single pair of training and test sets is regenerated, with a given number of training examples per class and a maximum number of test examples per class.

Option n_train_per_class is the number of training examples per class.

Option at_most_n_test_per_class is the maximum number of test examples per class.

Option seed is the seed to use for the random number generator that will select which examples to put in the training and test sets.
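A sketch of setting up such an experiment (the 'cifar10' name is used only for illustration; the top-level package name mlpython is assumed):

    from mlpython.datasets import store as dataset_store  # top-level package name assumed

    trainset, validset, testset = dataset_store.get_image_classification_problem('cifar10')

    # Rebuild a single train/test pair with 30 training examples per class and
    # at most 50 test examples per class.
    new_train, new_test = dataset_store.get_object_recognition_experiment(
        trainset, validset, testset, n_train_per_class=30,
        at_most_n_test_per_class=50, seed=1234)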

20-newsgroups

Module datasets.newsgroups gives access to the 20-newsgroups dataset.

The original dataset from http://people.csail.mit.edu/jrennie/20Newsgroups/ has been preprocessed to limit the vocabulary to the 5000 most frequent words. The binary bag-of-words representation has then been computed for each document. The original training set has also been separated into a new training set and a validation set.

Reference:
20 Newsgroups (web page where the original dataset was obtained)
Jason Rennie
Classification using Discriminative Restricted Boltzmann Machines (for the train/valid split and preprocessing)
Larochelle and Bengio
datasets.newsgroups.load(dir_path, load_to_memory=False)[source]

Loads the 20-newsgroups dataset.

The data is given by a dictionary mapping from strings 'train', 'valid' and 'test' to the associated pair of data and metadata.

The inputs have been put in binary format, and the vocabulary has been restricted to 5000 words.

Defined metadata:

  • 'input_size'
  • 'targets'
  • 'length'
datasets.newsgroups.obtain(dir_path)[source]

Downloads the dataset to dir_path.
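Every dataset module follows this same obtain/load pattern. A minimal sketch using 20-newsgroups (the directory path is illustrative and the top-level package name mlpython is assumed):

    from mlpython.datasets import newsgroups  # top-level package name assumed

    dir_path = '/tmp/mlpython_datasets/newsgroups'  # illustrative location
    newsgroups.obtain(dir_path)                     # download the raw files
    data = newsgroups.load(dir_path, load_to_memory=True)

    # 'train', 'valid' and 'test' each map to a (data, metadata) pair.
    train_data, train_metadata = data['train']
    print(train_metadata['input_size'], train_metadata['length'])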

Abalone

Module datasets.abalone gives access to the Abalone dataset.

The Abalone dataset is obtained here: http://www.csie.ntu.edu.tw/%7Ecjlin/libsvmtools/datasets/regression.html#abalone.

datasets.abalone.load(dir_path, load_to_memory=False)[source]

Loads the Abalone dataset.

The data is given by a dictionary mapping from strings 'train', 'valid' and 'test' to the associated pair of data and metadata.

Defined metadata:

  • 'input_size'
  • 'length'
datasets.abalone.obtain(dir_path)[source]

Downloads the dataset to dir_path.

Adult

Module datasets.adult gives access to the Adult dataset.

Reference:
Tractable Multivariate Binary Density Estimation and the Restricted Boltzmann Forest
Larochelle, Bengio and Turian
datasets.adult.load(dir_path, load_to_memory=False)[source]

Loads the Adult dataset.

The data is given by a dictionary mapping from strings 'train', 'valid' and 'test' to the associated pair of data and metadata.

Defined metadata:

  • 'input_size'
  • 'targets'
  • 'length'
datasets.adult.obtain(dir_path)[source]

Downloads the dataset to dir_path.

Bibtex

Module datasets.bibtex gives access to the Bibtex dataset.

Reference:
Random k-labelsets for Multi-Label Classification
Tsoumakas, Katakis and Vlahavas
datasets.bibtex.load(dir_path, load_to_memory=False)[source]

Loads the Bibtex dataset.

The data is given by a dictionary mapping from strings 'train', 'valid' and 'test' to the associated pair of data and metadata.

Defined metadata:

  • 'input_size'
  • 'target_size'
  • 'length'
datasets.bibtex.obtain(dir_path)[source]

Downloads the dataset to dir_path.

Binarized MNIST

Module datasets.binarized_mnist gives access to the binarized version of the MNIST dataset.

References:
On the Quantitative Analysis of Deep Belief Networks
Salakhutdinov and Murray
The MNIST database of handwritten digits
LeCun and Cortes
datasets.binarized_mnist.load(dir_path, load_to_memory=False)[source]

Loads a binarized version of MNIST.

The data is given by a dictionary mapping from strings 'train', 'valid' and 'test' to the associated pair of data and metadata.

Defined metadata:

  • 'input_size'
  • 'length'
datasets.binarized_mnist.obtain(dir_path)[source]

Downloads the dataset to dir_path.

Cadata

Module datasets.cadata gives access to the CAData (California housing prices) dataset.

The CAData dataset is obtained here: http://www.csie.ntu.edu.tw/%7Ecjlin/libsvmtools/datasets/regression.html#cadata.

datasets.cadata.load(dir_path, load_to_memory=False)[source]

Loads the CAData (California housing prices) dataset.

The data is given by a dictionary mapping from strings 'train', 'valid' and 'test' to the associated pair of data and metadata.

Defined metadata:

  • 'input_size'
  • 'length'
datasets.cadata.obtain(dir_path)[source]

Downloads the dataset to dir_path.

CIFAR-10

Module datasets.cifar10 gives access to the CIFAR-10 dataset.

Reference:
Learning multiple layers of features from tiny images
Alex Krizhevsky
datasets.cifar10.load(dir_path, load_to_memory=False, load_as_images=False)[source]

Loads the CIFAR-10 dataset.

The data is given by a dictionary mapping from strings 'train', 'valid' and 'test' to the associated pair of data and metadata.

Defined metadata:

  • 'input_size'
  • 'length'
  • 'targets'
  • 'class_to_id'
datasets.cifar10.obtain(dir_path)[source]

Downloads the dataset to dir_path.

Connect-4

Module datasets.connect4 gives access to the Connect-4 dataset.

Reference:
Tractable Multivariate Binary Density Estimation and the Restricted Boltzmann Forest
Larochelle, Bengio and Turian
datasets.connect4.load(dir_path, load_to_memory=False)[source]

Loads the Connect-4 dataset.

The data is given by a dictionary mapping from strings 'train', 'valid' and 'test' to the associated pair of data and metadata.

Defined metadata:

  • 'input_size'
  • 'targets'
  • 'length'
datasets.connect4.obtain(dir_path)[source]

Downloads the dataset to dir_path.

Convex

Module datasets.convex gives access to the Convex image classification dataset.

Reference:
An Empirical Evaluation of Deep Architectures on Problems with Many Factors of Variation
Larochelle, Erhan, Courville, Bergstra and Bengio
datasets.convex.load(dir_path, load_to_memory=False)[source]

Loads the Convex dataset.

The data is given by a dictionary mapping from strings 'train', 'valid' and 'test' to the associated pair of data and metadata.

Defined metadata:

  • 'input_size'
  • 'targets'
  • 'length'
datasets.convex.obtain(dir_path)[source]

Downloads the dataset to dir_path.

Corel5k

Module datasets.corel5k gives access to the Corel5k dataset.

Reference:
Object recognition as machine translation: Learning a lexicon for a fixed image vocabulary
Duygulu, Barnard, de Freitas, Forsyth
datasets.corel5k.load(dir_path, load_to_memory=False)[source]

Loads the Corel5k dataset.

The data is given by a dictionary mapping from strings 'train', 'valid' and 'test' to the associated pair of data and metadata.

Defined metadata:

  • 'input_size'
  • 'target_size'
  • 'length'
datasets.corel5k.obtain(dir_path)[source]

Downloads the dataset to dir_path.

Corrupted MNIST

Module datasets.corrupted_mnist gives access to the corrupted MNIST dataset.

This is a multi-label classification dataset, where the task is to reconstruct the original binary image of an MNIST digit from a corrupted version of it. The corruption was generated by randomly flipping a subset of the pixels in the original image.

The original dataset from http://yann.lecun.com/exdb/mnist/ has been preprocessed so that the inputs are binary. Corrupted MNIST was generated by Volodymyr Mnih.

References:
The MNIST database of handwritten digits
LeCun and Cortes
datasets.corrupted_mnist.load(dir_path, load_to_memory=False)[source]

Loads the corrupted MNIST dataset.

The data is given by a dictionary mapping from strings 'train', 'valid' and 'test' to the associated pair of data and metadata.

The inputs and targets have been converted to a binary format.

Defined metadata:

  • 'input_size'
  • 'target_size'
  • 'length'
datasets.corrupted_mnist.obtain(dir_path)[source]

Downloads the dataset to dir_path.

Corrupted OCR letters

Module datasets.corrupted_ocr_letters gives access to the corrupted version of the OCR letters dataset.

This is a multilabel classification dataset, with binary targets. The task is to remove noise from images of 4 characters obtained from the OCR letters dataset (see datasets.ocr_letters). The noise includes lines crossing the image and single pixels randomly switched to 1.

datasets.corrupted_ocr_letters.load(dir_path, load_to_memory=False)[source]

Loads the corrupted OCR letters dataset.

The data is given by a dictionary mapping from strings 'train', 'valid' and 'test' to the associated pair of data and metadata.

The inputs and targets are binary.

Defined metadata:

  • 'input_size'
  • 'target_size'
  • 'length'
datasets.corrupted_ocr_letters.obtain(dir_path)[source]

Downloads the dataset to dir_path.

DNA

Module datasets.dna gives access to the DNA dataset.

Reference:
Tractable Multivariate Binary Density Estimation and the Restricted Boltzmann Forest
Larochelle, Bengio and Turian
datasets.dna.load(dir_path, load_to_memory=False)[source]

Loads the DNA dataset.

The data is given by a dictionary mapping from strings 'train', 'valid' and 'test' to the associated pair of data and metadata.

Defined metadata:

  • 'input_size'
  • 'targets'
  • 'length'
datasets.dna.obtain(dir_path)[source]

Downloads the dataset to dir_path.

Heart

Module datasets.heart gives access to the Heart (SPECT) dataset.

The Heart dataset is obtained here: http://archive.ics.uci.edu/ml/machine-learning-databases/spect.

datasets.heart.load(dir_path, load_to_memory=False)[source]

Loads the Heart dataset.

The data is given by a dictionary mapping from strings 'train', 'valid' and 'test' to the associated pair of data and metadata.

Defined metadata:

  • 'input_size'
  • 'targets'
  • 'length'
datasets.heart.obtain(dir_path)[source]

Downloads the dataset to dir_path.

Housing

Module datasets.housing gives access to the Housing dataset.

The Housing dataset is obtained here: http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression.html#housing.

datasets.housing.load(dir_path, load_to_memory=False)[source]

Loads the Housing dataset.

The data is given by a dictionary mapping from strings 'train', 'valid' and 'test' to the associated pair of data and metadata.

Defined metadata:

  • 'input_size'
  • 'length'
datasets.housing.obtain(dir_path)[source]

Downloads the dataset to dir_path.

Labeled Faces in the Wild (face completion)

Module datasets.face_completion_lfw gives access to the Labeled Faces in the Wild, face completion dataset.

This is a multi-dimensional regression dataset, with outputs in [0,1]. The task is to complete the right part of a face given its left part.

The original dataset, Labeled Faces in the Wild, comes from http://vis-www.cs.umass.edu/lfw/. This face completion variant of the dataset was generated by Volodymyr Mnih.

References:
Labeled Faces in the Wild: A Database for Studying Face Recognition in Unconstrained Environments.
Huang, Ramesh, Berg and Learned-Miller
datasets.face_completion_lfw.load(dir_path, load_to_memory=False)[source]

Loads the Labeled Faces in the Wild face completion dataset.

The data is given by a dictionary mapping from strings 'train', 'valid' and 'test' to the associated pair of data and metadata.

The inputs and targets have been converted to be in the [0,1] interval.

Defined metadata:

  • 'input_size'
  • 'target_size'
  • 'length'
datasets.face_completion_lfw.obtain(dir_path)[source]

Downloads the dataset to dir_path.

Labeled Faces in the Wild (occluded faces)

Module datasets.occluded_faces_lfw gives access to the Labeled Faces in the Wild, occluded faces dataset.

This is a multi-dimensional regression dataset, with outputs in [0,1]. The task is to remove occlusions from images of faces. The occlusions were generated by overlapping random characters on the image. The characters were obtained from the OCR letters dataset (see datasets.ocr_letters).

The original dataset, Labeled Faces in the Wild, comes from http://vis-www.cs.umass.edu/lfw/.

References:
Labeled Faces in the Wild: A Database for Studying Face Recognition in Unconstrained Environments.
Huang, Ramesh, Berg and Learned-Miller
datasets.occluded_faces_lfw.load(dir_path, load_to_memory=False)[source]

Loads the Labeled Faces in the Wild occluded faces dataset.

The data is given by a dictionary mapping from strings 'train', 'valid' and 'test' to the associated pair of data and metadata.

The inputs and targets have been converted to be in the [0,1] interval.

Defined metadata:

  • 'input_size'
  • 'target_size'
  • 'length'
datasets.occluded_faces_lfw.obtain(dir_path)[source]

Downloads the dataset to dir_path.

LETOR 4.0 MQ 2007

Module datasets.letor_mq2007 gives access to the LETOR 4.0 MQ2007 dataset, a learning to rank benchmark.

The LETOR 4.0 datasets are obtained here: http://research.microsoft.com/en-us/um/beijing/projects/letor/letor4download.aspx.

IMPORTANT: the evaluation for this benchmark will require the use of the official evaluation script, which can be downloaded at http://research.microsoft.com/en-us/um/beijing/projects/letor/LETOR4.0/Evaluation/Eval-Score-4.0.pl.txt. Alternatively, function letor_evaluation in this module can be used.

datasets.letor_mq2007.load(dir_path, load_to_memory=False, fold=1)[source]

Loads the LETOR 4.0 MQ2007 dataset.

The data is given by a dictionary mapping from strings 'train', 'valid' and 'test' to the associated pair of data and metadata.

This dataset comes with 5 predefined folds, which can be specified with option fold (default = 1).

Defined metadata:

  • 'input_size'
  • 'scores'
  • 'n_queries'
  • 'length'
datasets.letor_mq2007.obtain(dir_path)[source]

Downloads the dataset to dir_path.

datasets.letor_mq2007.letor_evaluation(outputs, evaluation_set, fold=1, dir_path=None)[source]

Returns the lists of precisions and NDCG performance measures, based on some given outputs and the evaluation_set ('train', 'valid' or 'test').

Precisions and NDCG are measured at cutoffs from 1 to 10. The last element of each list is the mean average precision (MAP) and the mean NDCG, respectively.
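A sketch of using it (only the letor_evaluation call and the layout of its return values come from this documentation; the dataset name 'letor_mq2007', the assumption that the two lists are returned as a pair, and the random placeholder scores are illustrative):

    import random
    from mlpython.datasets import letor_mq2007
    from mlpython.datasets import store as dataset_store  # top-level package name assumed

    trainset, validset, testset = dataset_store.get_ranking_problem('letor_mq2007')

    # Placeholder scores, one per validation example; a real ranker would
    # produce these from the inputs (len(validset) assumes MLProblems expose
    # their length).
    outputs = [random.random() for _ in range(len(validset))]

    precisions, ndcgs = letor_mq2007.letor_evaluation(outputs, 'valid', fold=1)

    # Indices 0..9 hold the measures at cutoffs 1 to 10; the last entry of each
    # list is the mean average precision (MAP) and the mean NDCG, respectively.
    print('P@1-10:', precisions[:10], 'MAP:', precisions[-1])
    print('NDCG@1-10:', ndcgs[:10], 'mean NDCG:', ndcgs[-1])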

LETOR 4.0 MQ 2008

Module datasets.letor_mq2008 gives access to the LETOR 4.0 MQ2008 dataset, a learning to rank benchmark.

The LETOR 4.0 datasets are obtained here: http://research.microsoft.com/en-us/um/beijing/projects/letor/letor4download.aspx.

IMPORTANT: the evaluation for this benchmark will require the use of the official evaluation script, which can be downloaded at http://research.microsoft.com/en-us/um/beijing/projects/letor/LETOR4.0/Evaluation/Eval-Score-4.0.pl.txt. Alternatively, function letor_evaluation in this module can be used.

datasets.letor_mq2008.load(dir_path, load_to_memory=False, fold=1)[source]

Loads the LETOR 4.0 MQ2008 dataset.

The data is given by a dictionary mapping from strings 'train', 'valid' and 'test' to the associated pair of data and metadata.

This dataset comes with 5 predefined folds, which can be specified with option fold (default = 1).

Defined metadata:

  • 'input_size'
  • 'scores'
  • 'n_queries'
  • 'length'
datasets.letor_mq2008.obtain(dir_path)[source]

Downloads the dataset to dir_path.

datasets.letor_mq2008.letor_evaluation(outputs, evaluation_set, fold=1, dir_path=None)[source]

Returns the lists of precisions and NDCG performance measures, based on some given outputs and the evaluation_set ('train', 'valid' or 'test').

Precisions and NDCG are measured at cutoffs from 1 to 10. The last element of each list is the mean average precision (MAP) and the mean NDCG, respectively.

MajMin

Module datasets.majmin gives access to the MajMin dataset.

Reference:
A Web-Based Game for Collecting Music Metadata
Mandel and Ellis
datasets.majmin.load(dir_path, load_to_memory=False)[source]

Loads the MajMin dataset.

The data is given by a dictionary mapping from strings 'train', 'valid' and 'test' to the associated pair of data and metadata.

Defined metadata:

  • 'input_size'
  • 'target_size'
  • 'length'
datasets.majmin.obtain(dir_path)[source]

Downloads the dataset to dir_path.

Mediamill

Module datasets.mediamill gives access to the Mediamill dataset.

Reference:
The challenge problem for automated detection of 101 semantic concepts in multimedia
Snoek, Worring, van Gemert, Geusebroek, Smeulders
datasets.mediamill.load(dir_path, load_to_memory=False)[source]

Loads the Mediamill dataset.

The data is given by a dictionary mapping from strings 'train', 'valid' and 'test' to the associated pair of data and metadata.

Defined metadata:

  • 'input_size'
  • 'target_size'
  • 'length'
datasets.mediamill.obtain(dir_path)[source]

Downloads the dataset to dir_path.

Medical

Module datasets.medical gives access to the Medical dataset.

This is a multi-label classification dataset, where medical test reports must be labeled according to 45 disease codes.

Reference:
Random k-labelsets for Multi-Label Classification
Tsoumakas, Katakis and Vlahavas
datasets.medical.load(dir_path, load_to_memory=False)[source]

Loads the Medical dataset.

The data is given by a dictionary mapping from strings 'train', 'valid' and 'test' to the associated pair of data and metadata.

Defined metadata:

  • 'input_size'
  • 'target_size'
  • 'length'
datasets.medical.obtain(dir_path)[source]

Downloads the dataset to dir_path.

MNIST

Module datasets.mnist gives access to the MNIST dataset.

The original dataset from http://yann.lecun.com/exdb/mnist/ has been preprocessed so that the inputs are between 0 and 1. The original training set has also been separated into a new training set and a validation set.

References:
The MNIST database of handwritten digits
LeCun and Cortes
Classification using Discriminative Restricted Boltzmann Machines (for the train/valid split)
Larochelle and Bengio
datasets.mnist.load(dir_path, load_to_memory=False, load_as_images=False)[source]

Loads the MNIST dataset.

The data is given by a dictionary mapping from strings 'train', 'valid' and 'test' to the associated pair of data and metadata.

The inputs have been normalized between 0 and 1.

Defined metadata:

  • 'input_size'
  • 'targets'
  • 'length'
datasets.mnist.obtain(dir_path)[source]

Downloads the dataset to dir_path.

MNIST basic

Module datasets.mnist_basic gives access to the MNIST basic dataset.

Reference:
An Empirical Evaluation of Deep Architectures on Problems with Many Factors of Variation
Larochelle, Erhan, Courville, Bergstra and Bengio
datasets.mnist_basic.load(dir_path, load_to_memory=False)[source]

Loads the MNIST basic dataset.

The data is given by a dictionary mapping from strings 'train', 'valid' and 'test' to the associated pair of data and metadata.

Defined metadata:

  • 'input_size'
  • 'targets'
  • 'length'
datasets.mnist_basic.obtain(dir_path)[source]

Downloads the dataset to dir_path.

MNIST background-images

Module datasets.mnist_background_images gives access to the MNIST background-images dataset.

Reference:
An Empirical Evaluation of Deep Architectures on Problems with Many Factors of Variation
Larochelle, Erhan, Courville, Bergstra and Bengio
datasets.mnist_background_images.load(dir_path, load_to_memory=False)[source]

Loads the MNIST background-images dataset.

The data is given by a dictionary mapping from strings 'train', 'valid' and 'test' to the associated pair of data and metadata.

Defined metadata:

  • 'input_size'
  • 'targets'
  • 'length'
datasets.mnist_background_images.obtain(dir_path)[source]

Downloads the dataset to dir_path.

MNIST background-random

Module datasets.mnist_background_random gives access to the MNIST background-random dataset.

Reference:
An Empirical Evaluation of Deep Architectures on Problems with Many Factors of Variation
Larochelle, Erhan, Courville, Bergstra and Bengio
datasets.mnist_background_random.load(dir_path, load_to_memory=False)[source]

Loads the MNIST background-random dataset.

The data is given by a dictionary mapping from strings 'train', 'valid' and 'test' to the associated pair of data and metadata.

Defined metadata:

  • 'input_size'
  • 'targets'
  • 'length'
datasets.mnist_background_random.obtain(dir_path)[source]

Downloads the dataset to dir_path.

MNIST rotated

Module datasets.mnist_rotated gives access to the MNIST rotated dataset.

Reference:
An Empirical Evaluation of Deep Architectures on Problems with Many Factors of Variation
Larochelle, Erhan, Courville, Bergstra and Bengio
datasets.mnist_rotated.load(dir_path, load_to_memory=False)[source]

Loads the MNIST rotated dataset.

The data is given by a dictionary mapping from strings 'train', 'valid' and 'test' to the associated pair of data and metadata.

Defined metadata:

  • 'input_size'
  • 'targets'
  • 'length'
datasets.mnist_rotated.obtain(dir_path)[source]

Downloads the dataset to dir_path.

MNIST rotated background-images

Module datasets.mnist_rotated_background_images gives access to the MNIST rotated background-images dataset.

Reference:
An Empirical Evaluation of Deep Architectures on Problems with Many Factors of Variation
Larochelle, Erhan, Courville, Bergstra and Bengio
datasets.mnist_rotated_background_images.load(dir_path, load_to_memory=False)[source]

Loads the MNIST rotated background-images dataset.

The data is given by a dictionary mapping from strings 'train', 'valid' and 'test' to the associated pair of data and metadata.

Defined metadata:

  • 'input_size'
  • 'targets'
  • 'length'
datasets.mnist_rotated_background_images.obtain(dir_path)[source]

Downloads the dataset to dir_path.

MTurk

Module datasets.mturk gives access to the MTurk dataset.

Reference:
Learning tags that vary within a song
Mandel, Eck and Bengio
datasets.mturk.load(dir_path, load_to_memory=False)[source]

Loads the MTurk dataset.

The data is given by a dictionary mapping from strings 'train', 'valid' and 'test' to the associated pair of data and metadata.

Defined metadata:

  • 'input_size'
  • 'target_size'
  • 'length'
datasets.mturk.obtain(dir_path)[source]

Downloads the dataset to dir_path.

Mushrooms

Module datasets.mushrooms gives access to the Mushrooms dataset.

Reference:
Tractable Multivariate Binary Density Estimation and the Restricted Boltzmann Forest
Larochelle, Bengio and Turian
datasets.mushrooms.load(dir_path, load_to_memory=False)[source]

Loads the Mushrooms dataset.

The data is given by a dictionary mapping from strings 'train', 'valid' and 'test' to the associated pair of data and metadata.

Defined metadata:

  • 'input_size'
  • 'targets'
  • 'length'
datasets.mushrooms.obtain(dir_path)[source]

Downloads the dataset to dir_path.

NIPS

Module datasets.nips gives access to the NIPS 0-12 dataset.

Reference:
Tractable Multivariate Binary Density Estimation and the Restricted Boltzmann Forest
Larochelle, Bengio and Turian
datasets.nips.load(dir_path, load_to_memory=False)[source]

Loads the NIPS 0-12 dataset.

The data is given by a dictionary mapping from strings 'train', 'valid' and 'test' to the associated pair of data and metadata.

Defined metadata:

  • 'input_size'
  • 'length'
datasets.nips.obtain(dir_path)[source]

Downloads the dataset to dir_path.

Occluded MNIST

Module datasets.occluded_mnist gives access to the occluded MNIST dataset.

This is a multi-label classification dataset, where the task is to reconstruct the original binary image of an MNIST digit from an occluded version of it. The occlusion was performed by zeroing out a random patch in the original image.

The original dataset from http://yann.lecun.com/exdb/mnist/ has been preprocessed so that the inputs are binary. Occluded MNIST was generated by Volodymyr Mnih.

References:
The MNIST database of handwritten digits
LeCun and Cortes
datasets.occluded_mnist.load(dir_path, load_to_memory=False)[source]

Loads the occluded MNIST dataset.

The data is given by a dictionary mapping from strings 'train', 'valid' and 'test' to the associated pair of data and metadata.

The inputs and targets have been converted to a binary format.

Defined metadata:

  • 'input_size'
  • 'target_size'
  • 'length'
datasets.occluded_mnist.obtain(dir_path)[source]

Downloads the dataset to dir_path.

OCR letters

Module datasets.ocr_letters gives access to the OCR letters dataset.

The OCR letters dataset was first obtained here: http://ai.stanford.edu/~btaskar/ocr/letter.data.gz.

Reference:
Tractable Multivariate Binary Density Estimation and the Restricted Boltzmann Forest
Larochelle, Bengio and Turian
datasets.ocr_letters.load(dir_path, load_to_memory=False, load_as_images=False)[source]

Loads the OCR letters dataset.

The data is given by a dictionary mapping from strings 'train', 'valid' and 'test' to the associated pair of data and metadata.

Defined metadata:

  • 'input_size'
  • 'targets'
  • 'length'
datasets.ocr_letters.obtain(dir_path)[source]

Downloads the dataset to dir_path.

Rectangles

Module datasets.rectangles gives access to the Rectangles dataset.

Reference:
An Empirical Evaluation of Deep Architectures on Problems with Many Factors of Variation
Larochelle, Erhan, Courville, Bergstra and Bengio
datasets.rectangles.load(dir_path, load_to_memory=False)[source]

Loads the Rectangles dataset.

The data is given by a dictionary mapping from strings 'train', 'valid' and 'test' to the associated pair of data and metadata.

Defined metadata:

  • 'input_size'
  • 'targets'
  • 'length'
datasets.rectangles.obtain(dir_path)[source]

Downloads the dataset to dir_path.

Rectangles images

Module datasets.rectangles_images gives access to the Rectangles images dataset.

Reference:
An Empirical Evaluation of Deep Architectures on Problems with Many Factors of Variation
Larochelle, Erhan, Courville, Bergstra and Bengio
datasets.rectangles_images.load(dir_path, load_to_memory=False)[source]

Loads the Rectangles images dataset.

The data is given by a dictionary mapping from strings 'train', 'valid' and 'test' to the associated pair of data and metadata.

Defined metadata:

  • 'input_size'
  • 'targets'
  • 'length'
datasets.rectangles_images.obtain(dir_path)[source]

Downloads the dataset to dir_path.

SARCOS

Module datasets.sarcos gives access to the SARCOS dataset.

This is a multi-dimensional regression dataset, with outputs in [0,1]. The task is an inverse dynamics problem for a seven degrees-of-freedom SARCOS anthropomorphic robot arm.

The inputs have varying ranges, so PCA is recommended.
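A minimal sketch of such a preprocessing step with numpy (this is not part of the datasets.sarcos API; it is only one way to follow the recommendation, and the arrays X_train, X_valid are assumed to hold the SARCOS inputs):

    import numpy as np

    def fit_pca(X, n_components):
        """Return the mean and the top principal directions of the rows of X."""
        mean = X.mean(axis=0)
        # The right singular vectors of the centered data are the principal directions.
        _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
        return mean, Vt[:n_components]

    def apply_pca(X, mean, components):
        return (X - mean).dot(components.T)

    # mean, components = fit_pca(X_train, n_components=20)  # fit on training inputs only
    # X_train_pca = apply_pca(X_train, mean, components)    # then project every split
    # X_valid_pca = apply_pca(X_valid, mean, components)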

References:
LWPR: An O(n) Algorithm for Incremental Real Time Learning in High Dimensional Space
Vijayakumar and Schaal

The Gaussian Processes Web Site
datasets.sarcos.load(dir_path, load_to_memory=False)[source]

Loads the SARCOS inverse dynamics dataset.

The data is given by a dictionary mapping from strings 'train', 'valid' and 'test' to the associated pair of data and metadata.

Defined metadata:

  • 'input_size'
  • 'target_size'
  • 'length'
datasets.sarcos.obtain(dir_path)[source]

Downloads the dataset to dir_path.

Web

Module datasets.web gives access to the Web dataset.

Reference:
Tractable Multivariate Binary Density Estimation and the Restricted Boltzmann Forest
Larochelle, Bengio and Turian
datasets.web.load(dir_path, load_to_memory=False)[source]

Loads the Web dataset.

The data is given by a dictionary mapping from strings 'train', 'valid' and 'test' to the associated pair of data and metadata.

Defined metadata:

  • 'input_size'
  • 'targets'
  • 'length'
datasets.web.obtain(dir_path)[source]

Downloads the dataset to dir_path.

Yahoo! Learning to Rank Challenge, Set 1

Module datasets.yahoo_ltrc1 gives access to Set 1 of the Yahoo! Learning to Rank Challenge data. The queries correspond to query IDs, while the inputs already contain query-dependent information.

Reference:
Yahoo! Learning to Rank Challenge Overview
Chapelle and Chang
datasets.yahoo_ltrc1.load(dir_path, load_to_memory=False, home_made_valid_split=False)[source]

Loads the Yahoo! Learning to Rank Challenge, Set 1 data.

The data is given by a dictionary mapping from strings 'train', 'valid' and 'test' to the associated pair of data and metadata.

Option home_made_valid_split determines whether the original training set should be further split into a “home made” train/valid split (default: False). If True, the dictionary mapping will contain 4 keys instead of 3: 'train' (home made training set), 'valid' (home made validation set), 'test' (original validation set) and 'test2' (original test set).

Defined metadata:

  • 'input_size'
  • 'scores'
  • 'n_queries'
  • 'n_pairs'
  • 'length'
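For instance, loading Set 1 with the home made split described above (the directory path is illustrative, the top-level package name mlpython is assumed, and the data must have been obtained beforehand):

    from mlpython.datasets import yahoo_ltrc1  # top-level package name assumed

    data = yahoo_ltrc1.load('/path/to/yahoo_ltrc1', home_made_valid_split=True)

    # With home_made_valid_split=True there are four splits instead of three.
    for split in ('train', 'valid', 'test', 'test2'):
        split_data, split_metadata = data[split]
        print(split, split_metadata['n_queries'], split_metadata['length'])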
datasets.yahoo_ltrc1.obtain(dir_path)[source]
This dataset must be downloaded manually first through the Yahoo! Webscope Program at:
http://webscope.sandbox.yahoo.com/

Then, this function should be called to perform the necessary preprocessing of the data.

Yahoo! Learning to Rank Challenge, Set 2

Module datasets.yahoo_ltrc2 gives access to Set 2 of the Yahoo! Learning to Rank Challenge data. The queries correspond to query IDs, while the inputs already contain query-dependent information.

Reference:
Yahoo! Learning to Rank Challenge Overview
Chapelle and Chang
datasets.yahoo_ltrc2.load(dir_path, load_to_memory=False, home_made_valid_split=False)[source]

Loads the Yahoo! Learning to Rank Challenge, Set 2 data.

The data is given by a dictionary mapping from strings 'train', 'valid' and 'test' to the associated pair of data and metadata.

Option home_made_valid_split determines whether the original training set should be further split into a “home made” train/valid split (default: False). If True, the dictionary mapping will contain 4 keys instead of 3: 'train' (home made training set), 'valid' (home made validation set), 'test' (original validation set) and 'test2' (original test set).

Defined metadata:

  • 'input_size'
  • 'scores'
  • 'n_queries'
  • 'n_pairs'
  • 'length'
datasets.yahoo_ltrc2.obtain(dir_path)[source]
This dataset must be downloaded manually first through the Yahoo! Webscope Program at:
http://webscope.sandbox.yahoo.com/

Then, this function should be called to perform the necessary preprocessing of the data.

Yeast

Module datasets.yeast gives access to the Yeast dataset.

Reference:
A kernel method for multi-labelled classification
Elisseeff and Weston
datasets.yeast.load(dir_path, load_to_memory=False)[source]

Loads the Yeast dataset.

The data is given by a dictionary mapping from strings 'train', 'valid' and 'test' to the associated pair of data and metadata.

Defined metadata:

  • 'input_size'
  • 'target_size'
  • 'length'
datasets.yeast.obtain(dir_path)[source]

Downloads the dataset to dir_path.