IO

Module misc.io includes useful functions for loading and saving datasets, result tables or objects in general.

This module contains the following functions:

  • load_from_file: Loads a dataset from a file without allocating memory for it.
  • load_from_files: Loads a dataset from a list of files without allocating memory for them.
  • ascii_load: Reads an ASCII file and returns its data and metadata.
  • libsvm_load: Reads a LIBSVM file and returns its data and metadata.
  • libsvm_load_line: Converts a line from a LIBSVM file into an example.
  • save: Saves an object into a file.
  • load: Loads an object from a file.
  • gsave: Saves an object into a gzipped file.
  • gload: Loads an object from a gzipped file.
  • nb_lines: Counts the number of lines in a file.

and the following classes:

  • ASCIIResultTable: Object that loads an ASCII table and implements many useful operations.
  • IteratorWithFields: Iterable which separates the rows of a NumPy array into fields.
  • MemoryDataset: Iterable over some data put in memory as a NumPy array.
  • FileDataset: Iterable over a file whose lines are converted into examples.
  • FilesDataset: Iterator over a list of files whose contents are converted into examples.
  • izipIterable: Iterable constructed from itertools.izip.

class misc.io.ASCIIResult(values, fields)[source]

Object representing a line in an ASCIIResultTable.

class misc.io.ASCIIResultTable(file, separator='\t', fields=None)[source]

Object that loads an ASCII table and implements many useful operations.

The first row of the ASCII table’s file is assumed to be a header providing names for each field of the table. The remaining rows correspond to the results. Fields (columns) must be separated by the character separator (default is '\t', a tab).

If the file doesn’t contain a header on its first line, the list of field names can be given explicitly using option fields.

sort(field, numerical=False)[source]

Sorts the rows of the table based on the value of the field at position field. field can also be a string field name. If numerical is True, then the numerical values are used for sorting, otherwise sorting is based on the string value.

filter(filter_func)[source]

Filters the rows of the table by keeping those for which the output of function filter_func is True. This will overwrite any previous filtering function (i.e. filtering functions are not sequentially composed).
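
The sorting and filtering semantics described above can be illustrated without the library. The following is a minimal, self-contained sketch, not the ASCIIResultTable implementation: the class TinyTable and its attributes are hypothetical, and it assumes the filter function receives a row as a list of strings, which the documentation does not specify.

```python
class TinyTable:
    """Hypothetical stand-in that mimics the documented sort/filter semantics."""

    def __init__(self, lines, separator='\t'):
        # The first line is the header; the remaining lines are result rows.
        self.fields = lines[0].split(separator)
        self.rows = [line.split(separator) for line in lines[1:]]
        self.filter_func = None

    def sort(self, field, numerical=False):
        # `field` is either a column position or a field name.
        pos = self.fields.index(field) if isinstance(field, str) else field
        if numerical:
            self.rows.sort(key=lambda row: float(row[pos]))
        else:
            self.rows.sort(key=lambda row: row[pos])

    def filter(self, filter_func):
        # Replaces any previous filter; filters are not composed.
        self.filter_func = filter_func

    def __iter__(self):
        for row in self.rows:
            if self.filter_func is None or self.filter_func(row):
                yield row


table = TinyTable(["model\terror", "a\t10", "b\t9", "c\t100"])

table.sort("error")                        # string comparison of the values
string_order = [row[0] for row in table]

table.sort("error", numerical=True)        # numerical comparison of the values
numeric_order = [row[0] for row in table]

table.filter(lambda row: row[0] != "a")
table.filter(lambda row: float(row[1]) < 50)   # overwrites the previous filter
kept = [row[0] for row in table]
```

With string sorting, '100' is ordered before '9', which is why numerical=True matters for columns such as error rates. After the second call to filter, row "a" is kept again because the first filter is no longer active.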

class misc.io.IteratorWithFields(data, fields)[source]

An iterator over the rows of a NumPy array, which separates each row into fields (segments).

This class helps avoid the creation of a list of arrays. The fields are defined by a list of pairs (beg,end), such that data[:,beg:end] is a field.
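
The (beg,end) convention can be sketched as follows. This is an illustration only: the function iterate_with_fields is hypothetical, and it uses plain Python lists so that it runs without NumPy. With a NumPy array, row[beg:end] would be a view on the original data rather than a copy.

```python
def iterate_with_fields(data, fields):
    """Yield, for each row of `data`, one slice per (beg, end) pair."""
    for row in data:
        yield [row[beg:end] for beg, end in fields]


# Two rows: three input values followed by one target value.
data = [[1.0, 2.0, 3.0, 0],
        [4.0, 5.0, 6.0, 1]]
fields = [(0, 3), (3, 4)]   # columns 0..2 are the input, column 3 is the target

examples = list(iterate_with_fields(data, fields))
```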

class misc.io.MemoryDataset(data, field_shapes, dtypes, length=None)[source]

An iterator over some data, but that puts the content of the data in memory in NumPy arrays.

Option 'field_shapes' is a list of tuples, corresponding to the shape of each field.

Option dtypes determines the type of each field (float, int, etc.).

Optionally, the length of the dataset can also be provided. If not, it will be figured out automatically.

class misc.io.FileDataset(filename, load_line)[source]

An iterator over a dataset file, which converts each line of the file into an example.

The option 'load_line' is a function which, given a string (a line in the file) outputs an example.
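
The role of 'load_line' can be sketched with a small stand-in. The class LineDataset, the load_line function, and the file format below are all hypothetical and only illustrate the idea of converting lines lazily; the actual FileDataset may differ.

```python
import os
import tempfile


class LineDataset:
    """Hypothetical stand-in: reads the file one line at a time on each pass."""

    def __init__(self, filename, load_line):
        self.filename = filename
        self.load_line = load_line

    def __iter__(self):
        # The file is reopened for every pass, so nothing is kept in memory.
        with open(self.filename) as f:
            for line in f:
                yield self.load_line(line)


def load_line(line):
    # Assumed format: space-separated floats, with the target as last token.
    tokens = line.split()
    return [float(t) for t in tokens[:-1]], tokens[-1]


# Write a tiny dataset file to iterate over.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("0.5 1.5 pos\n2.0 3.0 neg\n")
    path = tmp.name

dataset = LineDataset(path, load_line)
first_pass = list(dataset)
second_pass = list(dataset)   # the dataset can be iterated more than once
os.remove(path)
```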

class misc.io.FilesDataset(filenames, load_file)[source]

An iterator over dataset files, which converts each file of the list into an example.

The option 'load_file' is a function which, given a string (the content of a file) outputs an example.

misc.io.load_from_file(filename, load_line=load_line_default)[source]

Loads a dataset from a file, without loading it in memory.

It returns an iterator over the examples from that file. This is based on class FileDataset.

misc.io.load_from_files(filenames, load_file=load_file_default)[source]

Loads a dataset from a list of files, without loading them in memory.

It returns an iterator over the examples from these files. This is based on class FilesDataset.

misc.io.ascii_load(filename, convert_input=float, last_column_is_target=False, convert_target=float)[source]

Reads an ASCII file and returns its data and metadata.

Data can either be a simple NumPy array (matrix), or an iterator over (NumPy array, target) pairs if the last column of the ASCII file is to be considered a target.

Options 'convert_input' and 'convert_target' are functions which must convert an element of the ASCII file from the string format to the desired format.
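
How the two conversion functions apply to one line of the file can be sketched as below. The helper parse_ascii_line is hypothetical and assumes whitespace-separated columns; it returns plain lists instead of NumPy arrays and is not the library's code.

```python
def parse_ascii_line(line, convert_input=float,
                     last_column_is_target=False, convert_target=float):
    """Convert one whitespace-separated line into an input or (input, target)."""
    tokens = line.split()
    if last_column_is_target:
        # Every column but the last is an input; the last one is the target.
        inputs = [convert_input(t) for t in tokens[:-1]]
        return inputs, convert_target(tokens[-1])
    return [convert_input(t) for t in tokens]


plain = parse_ascii_line("1 2.5 3")
pair = parse_ascii_line("1 2.5 3", last_column_is_target=True,
                        convert_target=int)
```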

Defined metadata:

  • 'input_size'

misc.io.libsvm_load_line(line, convert_non_digit_features=float, convert_target=str, sparse=False, input_size=-1, input_type=numpy.float64)[source]

Converts a line (string) of a LIBSVM file into an example.

This function is used by libsvm_load(). If sparse is False, option 'input_size' is used to determine the size of the returned 1D array (it must be big enough to fit all features).

misc.io.libsvm_load(filename, convert_non_digit_features=default_convert_non_digit_features, convert_target=str, sparse=False, input_size=None, compute_targets_metadata=True, input_type=numpy.float64)[source]

Reads a LIBSVM file and returns the list of all examples (data) and metadata information.

In general, each example in the list is a two-item list [input, target], where

  • if sparse is True, input is a pair (values, indices) of two vectors (vector of values and of indices). Indices start at 0;
  • if sparse is False, input is a 1D array such that its elements at the positions given by indices are set to the associated values, and the other elements are 0;
  • target is a string corresponding to the target to predict.
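
The two input representations can be sketched on a single line of text. The function parse_libsvm_line is hypothetical and uses plain lists instead of NumPy arrays. It also assumes the feature indices in the line are used as written, since whether the library shifts indices read from the file is not stated here.

```python
def parse_libsvm_line(line, sparse=False, input_size=0):
    """Convert 'target index:value index:value ...' into [input, target]."""
    tokens = line.split()
    target = tokens[0]                 # the target is kept as a string
    indices, values = [], []
    for token in tokens[1:]:
        feature, value = token.split(":")
        indices.append(int(feature))   # assumption: index used as written
        values.append(float(value))
    if sparse:
        return [(values, indices), target]
    # Dense case: input_size must be large enough to hold every index.
    dense = [0.0] * input_size
    for index, value in zip(indices, values):
        dense[index] = value
    return [dense, target]


line = "+1 0:0.5 3:2.0"
sparse_example = parse_libsvm_line(line, sparse=True)
dense_example = parse_libsvm_line(line, sparse=False, input_size=5)
```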

If a feature:value pair in the file is such that feature is not an integer, value will be converted to the desired format using option convert_non_digit_features. This option must be a callable function taking 2 string arguments, and will be called as follows:

output = convert_non_digit_features(feature_str,value_str)

where feature_str and value_str are feature and value in string format. Its output will be appended to the list of the given example.

The input_size can be given by the user. Otherwise, the function tries to infer it from the file (this won’t work if the file format is sparse and some of the last features are all 0!).

The metadata ‘targets’ (i.e. the set of instantiated targets) will be computed by default, but it can be ignored using option compute_targets_metadata=False.

The type of input (float64, int32, etc.) can also be specified with option input_type (default=float64).

Defined metadata:

  • 'targets' (if compute_targets_metadata is True)
  • 'input_size'

misc.io.save(p, filename)[source]

Pickles object p and saves it to file filename.

misc.io.load(filename)[source]

Loads pickled object in file filename.

misc.io.gsave(p, filename)[source]

Same as save(p, filename), but saves into a gzipped file.

misc.io.gload(filename)[source]

Same as load(filename), but loads from a gzipped file.
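
The four functions above amount to pickling with or without gzip compression. A minimal sketch using only the standard pickle and gzip modules is given below; the sketch_* function names are hypothetical, and the library's own functions may differ in details such as the pickle protocol.

```python
import gzip
import os
import pickle
import tempfile


def sketch_save(obj, filename):
    # Pickle the object into a regular file.
    with open(filename, "wb") as f:
        pickle.dump(obj, f)


def sketch_load(filename):
    with open(filename, "rb") as f:
        return pickle.load(f)


def sketch_gsave(obj, filename):
    # Same as sketch_save, but the file is gzip-compressed.
    with gzip.open(filename, "wb") as f:
        pickle.dump(obj, f)


def sketch_gload(filename):
    with gzip.open(filename, "rb") as f:
        return pickle.load(f)


results = {"model": "baseline", "errors": [0.31, 0.27]}
tmpdir = tempfile.mkdtemp()
plain_path = os.path.join(tmpdir, "results.pkl")
gzip_path = os.path.join(tmpdir, "results.pkl.gz")

sketch_save(results, plain_path)
sketch_gsave(results, gzip_path)
restored_plain = sketch_load(plain_path)
restored_gzip = sketch_gload(gzip_path)
```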