loader

Implements a general-purpose data loader for non-sequential machine learning tasks in Python. The module also provides several common data transformations, e.g., tfidf and whitening.

API

exception loader.Bad_directory_structure_error[source]

Raised when the specified data directory does not contain a subfolder listed in the folders argument to read_data_sets.

class loader.DataSet(features, labels=None, num_examples=None, mix=False)[source]

General data structure for mini-batch gradient descent training involving non-sequential data.

Parameters:
  • features – A dictionary of string label names to data matrices. Matrices may be HotIndex, scipy sparse csr_matrix, or numpy arrays.
  • labels – A dictionary of string label names to data matrices. Matrices may be HotIndex, scipy sparse csr_matrix, or numpy arrays.
  • num_examples – How many data points.
  • mix – Whether or not to shuffle per epoch.

Attributes

Methods

features

A dictionary of feature matrices.

index_in_epoch

The number of datapoints that have been trained on in a particular epoch.

labels

A dictionary of label matrices.

mix_after_epoch(mix)[source]

Whether or not to shuffle after training for an epoch.

Parameters:mix – True or False
next_batch(batch_size)[source]

Return a sub DataSet of the next batch_size examples.

If shuffling is turned off: if batch_size is greater than the number of examples left in the epoch, a DataSet of batch_size examples that wraps back to the beginning is returned.

If shuffling is enabled: if batch_size is greater than the number of examples left in the epoch, the points are shuffled and a DataSet of batch_size examples is returned starting from index 0.

Parameters:batch_size – int
Returns:A DataSet object with the next batch_size examples.
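
The two end-of-epoch strategies described above (wrapping back to the beginning, or reshuffling and restarting from index 0) can be sketched on plain index arrays. This is not the DataSet implementation: next_batch_indices and its reshuffle flag are hypothetical names for this illustration, and the sketch assumes batch_size is no larger than the number of examples.

```python
import numpy as np

def next_batch_indices(start, batch_size, num_examples, order, reshuffle, rng):
    """Hypothetical helper: return (batch_indices, new_start, new_order)."""
    end = start + batch_size
    if end <= num_examples:
        # The batch fits in what is left of the epoch.
        return order[start:end], end, order
    if reshuffle:
        # Draw a new permutation and take the batch starting from index 0.
        order = rng.permutation(num_examples)
        return order[:batch_size], batch_size, order
    # Wrap around: tail of this epoch followed by the head of the next one.
    new_start = end - num_examples
    batch = np.concatenate([order[start:], order[:new_start]])
    return batch, new_start, order

order = np.arange(5)
rng = np.random.RandomState(0)
# Three examples already consumed, four requested, only two left in the epoch.
wrapped, wrapped_start, _ = next_batch_indices(3, 4, 5, order, False, rng)
shuffled, shuffled_start, new_order = next_batch_indices(3, 4, 5, order, True, rng)
```

In the wrapping case the batch holds indices 3, 4, 0, 1 and the next batch starts at index 2. In the reshuffling case the batch is the first four entries of a fresh permutation.
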
num_examples

Number of rows (data points) of the matrices in this DataSet.

show()[source]

Pretty-prints all the data (dimensions, keys, type) in the DataSet object.

showmore()[source]

Print a sample of up to the first twenty rows of the matrices in the DataSet.

class loader.DataSets(datasets_map, mix=False)[source]

A record of DataSet objects with a display function.

Methods

show()[source]

Pretty print data attributes.

showmore()[source]

Pretty print data attributes, and data.

class loader.HotIndex(matrix, dimension=None)[source]

Index vector representation of a one hot matrix. The constructor accepts either a one hot matrix, or a vector of indices together with a dimension.

Attributes

Methods

dim

The feature dimension of the one hot vector represented as indices.

hot()[source]
Returns:A one hot scipy sparse csr_matrix
shape

The shape of the one hot matrix encoded.

vec

The vector of hot indices.

class loader.IndexVector(matrix, dimension=None)[source]

Attributes

Methods

exception loader.Mat_format_error[source]

Raised if the .mat file being read does not contain a variable named data.

exception loader.Sparse_format_error[source]

Raised when reading a plain text file with .sparsetxt extension and there are not three entries per line.

exception loader.Unsupported_format_error[source]

Raised when a file is requested to be loaded or saved without one of the supported file extensions.

loader.center(X, axis=None)[source]
Parameters:X – A matrix to center about the mean (over columns axis=0, over rows axis=1, over all entries axis=None)
Returns:A matrix with entries centered along the specified axis.
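
As a rough illustration of the three axis options, here is a minimal NumPy sketch. It is an assumption about the behavior described above, not the loader.center source; center_sketch is a hypothetical name and only dense arrays are handled.

```python
import numpy as np

def center_sketch(X, axis=None):
    """Subtract the mean taken along `axis` (None uses the mean of all entries)."""
    X = np.asarray(X, dtype=float)
    return X - X.mean(axis=axis, keepdims=True)

X = np.array([[1.0, 2.0],
              [3.0, 6.0]])
col_centered = center_sketch(X, axis=0)  # every column now has mean 0
row_centered = center_sketch(X, axis=1)  # every row now has mean 0
all_centered = center_sketch(X)          # the matrix as a whole has mean 0
```
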
loader.export_data(filename, data)[source]

Decides how to save data by file extension. Raises Unsupported_format_error if extension is not one of the supported extensions (mat, sparse, binary, dense, index). Data contained in .mat files should be saved in a matrix named data.

Parameters:
  • filename – A file of an accepted format representing a matrix.
  • data – A numpy array, scipy sparse matrix, or HotIndex object.
loader.import_data(filename)[source]

Decides how to load data into python matrices by file extension. Raises Unsupported_format_error if extension is not one of the supported extensions (mat, sparse, binary, dense, sparsetxt, densetxt, index).

Parameters:filename – A file of an accepted format representing a matrix.
Returns:A numpy matrix, scipy sparse csr_matrix, or HotIndex.
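
Choosing a reader by file extension might look roughly like the sketch below. It only shows the extension check: UnsupportedFormatError and extension_of are hypothetical stand-ins for this illustration, and the actual readers behind each extension are not reproduced here.

```python
import os

class UnsupportedFormatError(Exception):
    """Stand-in for loader.Unsupported_format_error in this sketch."""

# Extensions listed in the import_data description above.
SUPPORTED = ('mat', 'sparse', 'binary', 'dense', 'sparsetxt', 'densetxt', 'index')

def extension_of(filename):
    """Return the extension without the dot, or raise if it is not supported."""
    ext = os.path.splitext(filename)[1].lstrip('.')
    if ext not in SUPPORTED:
        raise UnsupportedFormatError('unsupported extension: %r' % ext)
    return ext

kind = extension_of('features_ratings.sparse')  # a hypothetical file name
```
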
loader.is_one_hot(A)[source]
Parameters:A – A numpy array or scipy sparse matrix
Returns:True if matrix is a sparse matrix of one hot vectors, False otherwise

Examples

>>> import numpy
>>> from antk.core import loader
>>> x = numpy.eye(3)
>>> loader.is_one_hot(x)
True
>>> x *= 5
>>> loader.is_one_hot(x)
False
>>> x = numpy.array([[1, 0, 0], [1, 0, 0], [1, 0, 0]])
>>> loader.is_one_hot(x)
True
>>> x[0,1] = 2
>>> loader.is_one_hot(x)
False
loader.l1normalize(X, axis=1)[source]

axis=1 normalizes each row of X by norm of said row. \(l1normalize(X)_{ij} = \frac{X_{ij}}{\sum_k |X_{ik}|}\)

axis=0 normalizes each column of X by norm of said column. \(l1normalize(X)_{ij} = \frac{X_{ij}}{\sum_k |X_{kj}|}\)

axis=None normalizes entries of X by norm of X. \(l1normalize(X)_{ij} = \frac{X_{ij}}{\sum_k \sum_p |X_{kp}|}\)

Parameters:
  • X – A scipy sparse csr_matrix or numpy array.
  • axis – The dimension to normalize over.
Returns:

A normalized matrix.
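
The three formulas above can be checked with a short NumPy sketch. This assumes a dense array with no all-zero rows or columns; l1normalize_sketch is a hypothetical name and this is not the library source.

```python
import numpy as np

def l1normalize_sketch(X, axis=1):
    """Divide entries by the sum of absolute values taken along `axis`."""
    X = np.asarray(X, dtype=float)
    norms = np.abs(X).sum(axis=axis, keepdims=True)
    return X / norms

X = np.array([[1.0, -3.0],
              [2.0, 2.0]])
rows = l1normalize_sketch(X, axis=1)      # absolute values in each row sum to 1
cols = l1normalize_sketch(X, axis=0)      # absolute values in each column sum to 1
whole = l1normalize_sketch(X, axis=None)  # absolute values of all entries sum to 1
```
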

loader.l2normalize(X, axis=1)[source]

axis=1 normalizes each row of X by norm of said row. \(l2normalize(X)_{ij} = \frac{X_{ij}}{\sqrt{\sum_k X_{ik}^2}}\)

axis=0 normalizes each column of X by norm of said column. \(l2normalize(X)_{ij} = \frac{X_{ij}}{\sqrt{\sum_k X_{kj}^2}}\)

axis=None normalizes entries of X by norm of X. \(l2normalize(X)_{ij} = \frac{X_{ij}}{\sqrt{\sum_k \sum_p X_{kp}^2}}\)

Parameters:
  • X – A scipy sparse csr_matrix or numpy array.
  • axis – The dimension to normalize over.
Returns:

A normalized matrix.
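
A comparable sketch for the l2 case, under the same assumptions (dense input, no zero norms, hypothetical function name), confirms that the normalized rows, columns, or whole matrix end up with unit Euclidean norm.

```python
import numpy as np

def l2normalize_sketch(X, axis=1):
    """Divide entries by the Euclidean norm taken along `axis`."""
    X = np.asarray(X, dtype=float)
    norms = np.sqrt((X ** 2).sum(axis=axis, keepdims=True))
    return X / norms

X = np.array([[3.0, 4.0],
              [0.0, 2.0]])
rows = l2normalize_sketch(X)              # default axis=1: unit-length rows
cols = l2normalize_sketch(X, axis=0)      # unit-length columns
whole = l2normalize_sketch(X, axis=None)  # unit Frobenius norm
```
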

loader.load(filename)[source]

Calls import_data. Decides how to load data into python matrices by file extension. Raises Unsupported_format_error if extension is not one of the supported extensions (mat, sparse, binary, dense, sparsetxt, densetxt, index).

Parameters:filename – A file of an accepted format representing a matrix.
Returns:A numpy matrix, scipy sparse csr_matrix, or any:HotIndex.
loader.makedirs(datadirectory, sub_directory_list=('train', 'dev', 'test'))[source]
Parameters:
  • datadirectory – Name of the directory you want to create containing the subdirectory folders. If the directory already exists it will be populated with the subdirectory folders.
  • sub_directory_list – The list of subdirectories you want to create
Returns:

None
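
The behavior described above, including populating a directory that already exists, can be approximated with the standard library. makedirs_sketch is a hypothetical name and the directory name mydata is made up for the example.

```python
import os
import tempfile

def makedirs_sketch(datadirectory, sub_directory_list=('train', 'dev', 'test')):
    """Create datadirectory and each listed subdirectory if missing."""
    for sub in sub_directory_list:
        path = os.path.join(datadirectory, sub)
        if not os.path.isdir(path):
            os.makedirs(path)  # also creates datadirectory when needed

root = os.path.join(tempfile.mkdtemp(), 'mydata')
makedirs_sketch(root)
makedirs_sketch(root)  # calling again leaves the existing folders alone
created = sorted(os.listdir(root))
```
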

loader.maxnormalize(X, axis=1)[source]

axis=1 normalizes each row of X by norm of said row. \(maxnormalize(X)_{ij} = \frac{X_{ij}}{max(X_{i:})}\)

axis=0 normalizes each column of X by norm of said column. \(maxnormalize(X)_{ij} = \frac{X_{ij}}{max(X_{:j})}\)

axis=None normalizes entries of X by norm of X. \(maxnormalize(X)_{ij} = \frac{X_{ij}}{max(X)}\)

Parameters:
  • X – A scipy sparse csr_matrix or numpy array.
  • axis – The dimension to normalize over.
Returns:

A normalized matrix.
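
Following the formulas above literally (division by the maximum entry, not the maximum absolute value), a dense NumPy sketch could read as follows. The function name is hypothetical and the maxima are assumed to be nonzero.

```python
import numpy as np

def maxnormalize_sketch(X, axis=1):
    """Divide entries by the maximum entry taken along `axis`."""
    X = np.asarray(X, dtype=float)
    return X / X.max(axis=axis, keepdims=True)

X = np.array([[1.0, 4.0],
              [2.0, 8.0]])
rows = maxnormalize_sketch(X, axis=1)      # largest entry in each row becomes 1
cols = maxnormalize_sketch(X, axis=0)      # largest entry in each column becomes 1
whole = maxnormalize_sketch(X, axis=None)  # largest entry overall becomes 1
```
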

loader.maybe_download(filename, work_directory, source_url)[source]

Download the data from source_url, unless the file is already present in work_directory. From https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/learn/python/learn/datasets/base.py

Parameters:
  • filename – string, name of the file in the directory.
  • work_directory – string, path to working directory.
  • source_url – url to download from if file doesn’t exist.
Returns:

Path to resulting file.
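
The download-unless-present pattern can be sketched with the Python 3 standard library. This is an approximation of the behavior described above rather than the function's source; maybe_download_sketch, data.txt, and the URL are made up for the example, and the demonstration only exercises the branch where the file already exists, so no network request is made.

```python
import os
import tempfile
from urllib.request import urlretrieve

def maybe_download_sketch(filename, work_directory, source_url):
    """Return the local path, downloading from source_url only if missing."""
    if not os.path.exists(work_directory):
        os.makedirs(work_directory)
    filepath = os.path.join(work_directory, filename)
    if not os.path.exists(filepath):
        # Only reached when the file is missing; performs a network request.
        urlretrieve(source_url, filepath)
    return filepath

# The file is created beforehand, so the call returns without downloading.
workdir = tempfile.mkdtemp()
with open(os.path.join(workdir, 'data.txt'), 'w') as handle:
    handle.write('cached')
path = maybe_download_sketch('data.txt', workdir, 'http://example.invalid/data.txt')
```
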

loader.pca_whiten(X)[source]

Returns matrix with PCA whitening transform applied. This transform assumes that data points are rows of matrix.

Parameters:
  • X – Numpy array, scipy sparse matrix
  • axis – Axis to whiten over.
Returns:

A matrix with the PCA whitening transform applied.
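
For readers unfamiliar with PCA whitening, the following is a minimal textbook sketch with data points as rows. It assumes a dense array whose covariance matrix has full rank, and it is not necessarily how pca_whiten is implemented; pca_whiten_sketch is a hypothetical name.

```python
import numpy as np

def pca_whiten_sketch(X):
    """Rotate centered rows onto the principal axes and scale to unit variance."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                 # center each feature (column)
    cov = Xc.T.dot(Xc) / Xc.shape[0]        # feature covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)  # cov is symmetric, so eigh applies
    # Project onto the eigenvectors, then divide each component by its std.
    return Xc.dot(eigvecs) / np.sqrt(eigvals)

rng = np.random.RandomState(0)
mixing = np.array([[2.0, 0.0, 0.0],
                   [1.0, 1.0, 0.0],
                   [0.0, 0.5, 3.0]])
X = rng.randn(200, 3).dot(mixing)           # correlated features
W = pca_whiten_sketch(X)
cov_W = W.T.dot(W) / W.shape[0]             # close to the identity matrix
```

After whitening, the features are uncorrelated and each has unit variance, which is why cov_W is approximately the identity.
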

loader.read_data_sets(directory, folders=('train', 'dev', 'test'), hashlist=(), mix=False)[source]
Parameters:
  • directory – Root directory containing data to load.
  • folders – The subfolders of directory to read data from. By default these are the train, dev, and test folders; to read from others, pass an explicit list.
  • hashlist – If you provide a hashlist, these files and only these files will be added to your DataSet objects. If you do not provide a hashlist, then anything with the privileged prefixes labels_ or features_ will be loaded.
Returns:

A DataSets object.

loader.save(filename, data)[source]

Calls export_data. Decides how to save data by file extension. Raises Unsupported_format_error if extension is not one of the supported extensions (mat, sparse, binary, dense, index). Data contained in .mat files should be saved in a matrix named data.

Parameters:
  • filename – A file of an accepted format representing a matrix.
  • data – A numpy array, scipy sparse matrix, or HotIndex object.
loader.tfidf(X, norm='l2row')[source]
Parameters:
  • X – A document-term matrix.
  • norm – Normalization strategy. ‘l2row’ normalizes the scores of rows by the length of rows after basic tfidf (each document vector is a unit vector). ‘count’ normalizes the scores of rows by the total word count of a document. ‘max’ normalizes the scores of rows by the maximum count for a single word in a document.
Returns:

Returns tfidf of document-term matrix X with optional normalization.
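
The three normalization strategies can be compared in a small dense sketch. The idf variant used here, log(N / df), is an assumption, since the exact weighting used by loader.tfidf is not stated above; tfidf_sketch is a hypothetical name and every term is assumed to occur in at least one document.

```python
import numpy as np

def tfidf_sketch(X, norm='l2row'):
    """Term counts times idf, followed by one of three row normalizations."""
    X = np.asarray(X, dtype=float)
    doc_freq = (X > 0).sum(axis=0)              # documents containing each term
    idf = np.log(float(X.shape[0]) / doc_freq)  # assumed idf variant
    scores = X * idf
    if norm == 'l2row':
        denom = np.sqrt((scores ** 2).sum(axis=1, keepdims=True))
    elif norm == 'count':
        denom = X.sum(axis=1, keepdims=True)    # total word count per document
    elif norm == 'max':
        denom = X.max(axis=1, keepdims=True)    # largest single-word count
    else:
        raise ValueError('unknown norm: %r' % norm)
    return scores / denom

# Rows are documents, columns are terms.
X = np.array([[2.0, 1.0, 0.0],
              [0.0, 1.0, 3.0],
              [1.0, 0.0, 0.0]])
unit_rows = tfidf_sketch(X, norm='l2row')
by_count = tfidf_sketch(X, norm='count')
by_max = tfidf_sketch(X, norm='max')
```
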

loader.toIndex(A)[source]
Parameters:A – A matrix of one hot row vectors.
Returns:The hot indices.

Examples

>>> import numpy
>>> from antk.core import loader
>>> x = numpy.array([[1,0,0], [0,0,1], [1,0,0]])
>>> loader.toIndex(x)
array([0, 2, 0])
loader.toOnehot(X, dim=None)[source]
Parameters:
  • X – Vector of indices or HotIndex object
  • dim – Dimension of indexing
Returns:

A sparse csr_matrix of one hots.

Examples

>>> import numpy
>>> from antk.core import loader
>>> x = numpy.array([0, 1, 2, 3])
>>> loader.toOnehot(x) 
<4x4 sparse matrix of type '<type 'numpy.float64'>'...
>>> loader.toOnehot(x).toarray()
array([[ 1.,  0.,  0.,  0.],
       [ 0.,  1.,  0.,  0.],
       [ 0.,  0.,  1.,  0.],
       [ 0.,  0.,  0.,  1.]])
>>> x = loader.HotIndex(x, dimension=8)
>>> loader.toOnehot(x).toarray()
array([[ 1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  1.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.]])
loader.unit_variance(X, axis=None)[source]
Parameters:X – A matrix to transform to have unit variance (over columns axis=0, over rows axis=1, over all entries axis=None)
Returns:A matrix with unit variance along the specified axis.
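
One way to obtain unit variance is to divide by the standard deviation along the chosen axis, as in the sketch below. It assumes a dense array, a nonzero population standard deviation (ddof=0), and a hypothetical function name; it is not the library source.

```python
import numpy as np

def unit_variance_sketch(X, axis=None):
    """Divide entries by the standard deviation taken along `axis`."""
    X = np.asarray(X, dtype=float)
    return X / X.std(axis=axis, keepdims=True)

X = np.array([[1.0, 10.0],
              [3.0, 30.0]])
cols = unit_variance_sketch(X, axis=0)  # each column has variance 1
rows = unit_variance_sketch(X, axis=1)  # each row has variance 1
whole = unit_variance_sketch(X)         # all entries together have variance 1
```
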
loader.untar(fname)[source]