loader¶

Implements a general-purpose data loader for non-sequential machine learning tasks in Python. Several common data transformations are provided in this module, e.g., tfidf and whitening.

Loader Tutorial

Loading, Saving, and Testing¶

save

Classes¶

DataSet

DataSets

HotIndex

Data Transforms¶

Exceptions¶

Bad_directory_structure_error

Mat_format_error

Sparse_format_error

Unsupported_format_error

API¶

exception loader.Bad_directory_structure_error[source]¶: Raised when the specified data directory does not contain a subfolder named in the folders argument to read_data_sets.

class loader.DataSet(features, labels=None, num_examples=None, mix=False)[source]¶

General data structure for mini-batch gradient descent training involving non-sequential data.

Parameters:
features – A dictionary of string label names to data matrices. Matrices may be `HotIndex`, scipy sparse csr_matrix, or numpy arrays.
labels – A dictionary of string label names to data matrices. Matrices may be `HotIndex`, scipy sparse csr_matrix, or numpy arrays.
num_examples – How many data points.
mix – Whether or not to shuffle per epoch.

features¶: A dictionary of feature matrices.

index_in_epoch¶: The number of datapoints that have been trained on in a particular epoch.

labels¶: A dictionary of label matrices

mix_after_epoch(mix)[source]¶

Whether or not to shuffle after training for an epoch.

Parameters:	mix – True or False

next_batch(batch_size)[source]¶

Return a sub DataSet of next batch-size examples.

If shuffling is enabled: if batch_size is greater than the number of examples left in the epoch, the points are shuffled and a batch_size DataSet is returned starting from index 0.
If shuffling is turned off: if batch_size is greater than the number of examples left in the epoch, a batch_size DataSet wrapping back to the beginning is returned.

Parameters:	batch_size – int
Returns:	A `DataSet` object with the next batch_size examples.
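The two end-of-epoch behaviours described above (wrapping back to the beginning, and reshuffling then restarting at index 0) can be illustrated with plain index arithmetic. This is a minimal sketch, not the DataSet implementation; the helper names `wrap_batch` and `reshuffle_batch` are hypothetical.

```python
import numpy as np

def wrap_batch(start, batch_size, num_examples):
    """Row indices for a batch that wraps back to the beginning of the data."""
    return (start + np.arange(batch_size)) % num_examples

def reshuffle_batch(batch_size, num_examples, rng):
    """Shuffle all row indices, then take the batch starting from index 0."""
    perm = rng.permutation(num_examples)
    return perm[:batch_size]

# 10 examples, 8 already consumed in this epoch, batch of 4 requested.
wrapped = wrap_batch(start=8, batch_size=4, num_examples=10)
print(wrapped)  # [8 9 0 1]

# Same request, but drawing the batch from a fresh permutation instead.
reshuffled = reshuffle_batch(4, 10, np.random.default_rng(0))
print(reshuffled)
```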

num_examples¶: Number of rows (data points) of the matrices in this DataSet.

show()[source]¶: Pretty printing of all the data (dimensions, keys, type) in the DataSet object

showmore()[source]¶: Print a sample of up to the first twenty rows of each matrix in the DataSet.

class loader.DataSets(datasets_map, mix=False)[source]¶

A record of DataSet objects with a display function.

Methods

show()[source]¶: Pretty print data attributes.

showmore()[source]¶: Pretty print data attributes, and data.

class loader.HotIndex(matrix, dimension=None)[source]¶

Index vector representation of a one hot matrix. The constructor accepts either a one hot matrix, or a vector of indices and a dimension.

dim¶: The feature dimension of the one hot vector represented as indices.

hot()[source]¶

Returns:	A one hot scipy sparse csr_matrix

shape¶: The shape of the one hot matrix encoded.

vec¶: The vector of hot indices.
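The relationship between the index vector (vec), the feature dimension (dim), and the encoded one hot matrix can be illustrated with plain NumPy. This is only a sketch of the representation, assuming a dense matrix for readability; it does not use the HotIndex class, and the variable names are illustrative.

```python
import numpy as np

# Hypothetical data: four one hot rows over an 8-dimensional feature space.
vec = np.array([0, 1, 2, 3])   # the hot index of each row
dim = 8                        # width of the encoded one hot matrix

# Expand the index vector into the dense one hot matrix it represents.
one_hot = np.zeros((len(vec), dim))
one_hot[np.arange(len(vec)), vec] = 1.0

# Recover the index vector from the one hot matrix.
recovered = one_hot.argmax(axis=1)

print(one_hot.shape)   # (4, 8)
print(recovered)       # [0 1 2 3]
```

Storing only the indices and the dimension takes one integer per row instead of dim entries per row.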

class loader.IndexVector(matrix, dimension=None)[source]¶

exception loader.Mat_format_error[source]¶: Raised if the .mat file being read does not contain a variable named data.

exception loader.Sparse_format_error[source]¶: Raised when reading a plain text file with .sparsetxt extension and there are not three entries per line.

exception loader.Unsupported_format_error[source]¶: Raised when a file is requested to be loaded or saved without one of the supported file extensions.

loader.center(X, axis=None)[source]¶

Parameters:	X – A matrix to center about its mean (over columns with axis=0, over rows with axis=1, over all entries with axis=None).
Returns:	A matrix with entries centered along the specified axis.
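As an illustration of centering, here is a minimal NumPy sketch for dense arrays. It assumes the NumPy axis convention (axis=0 subtracts each column's mean, axis=1 each row's mean); `center_sketch` is a hypothetical name, not the module's implementation.

```python
import numpy as np

def center_sketch(X, axis=None):
    """Subtract the mean taken along `axis` (None uses the mean of all entries)."""
    X = np.asarray(X, dtype=float)
    return X - X.mean(axis=axis, keepdims=True)

X = np.array([[1.0, 2.0],
              [3.0, 6.0]])

by_column = center_sketch(X, axis=0)   # every column now has mean 0
by_row = center_sketch(X, axis=1)      # every row now has mean 0
overall = center_sketch(X)             # the whole matrix now has mean 0
print(by_column)
```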

loader.export_data(filename, data)[source]¶

Decides how to save data by file extension. Raises Unsupported_format_error if extension is not one of the supported extensions (mat, sparse, binary, dense, index). Data contained in .mat files should be saved in a matrix named data.

Parameters:
filename – A file of an accepted format representing a matrix.
data – A numpy array, scipy sparse matrix, or `HotIndex` object.

loader.import_data(filename)[source]¶

Decides how to load data into python matrices by file extension. Raises Unsupported_format_error if extension is not one of the supported extensions (mat, sparse, binary, dense, sparsetxt, densetxt, index).

Parameters:	filename – A file of an accepted format representing a matrix.
Returns:	A numpy matrix, scipy sparse csr_matrix, or `HotIndex` object.
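Dispatching on the file extension might look roughly like the sketch below. It only shows the extension check; the exception class and function name are hypothetical stand-ins, and it assumes the extension is whatever follows the last dot in the filename.

```python
import os

# Extensions listed above as supported for loading.
SUPPORTED_LOAD_EXTENSIONS = {
    "mat", "sparse", "binary", "dense", "sparsetxt", "densetxt", "index",
}

class UnsupportedFormatError(Exception):
    """Hypothetical stand-in for loader.Unsupported_format_error."""

def extension_of(filename):
    """Return the file extension, or raise if it is not a supported one."""
    ext = os.path.splitext(filename)[1].lstrip(".")
    if ext not in SUPPORTED_LOAD_EXTENSIONS:
        raise UnsupportedFormatError("unsupported extension: %r" % ext)
    return ext

print(extension_of("features_words.sparse"))  # sparse
```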

loader.is_one_hot(A)[source]¶

Parameters:	A – A numpy array or scipy sparse matrix
Returns:	True if A is a matrix of one hot row vectors, False otherwise.

Examples

>>> import numpy
>>> from antk.core import loader
>>> x = numpy.eye(3)
>>> loader.is_one_hot(x)
True
>>> x *= 5
>>> loader.is_one_hot(x)
False
>>> x = numpy.array([[1, 0, 0], [1, 0, 0], [1, 0, 0]])
>>> loader.is_one_hot(x)
True
>>> x[0,1] = 2
>>> loader.is_one_hot(x)
False

loader.l1normalize(X, axis=1)[source]¶

axis=1 normalizes each row of X by norm of said row. \(l1normalize(X)_{ij} = \frac{X_{ij}}{\sum_k |X_{ik}|}\)

axis=0 normalizes each column of X by norm of said column. \(l1normalize(X)_{ij} = \frac{X_{ij}}{\sum_k |X_{kj}|}\)

axis=None normalizes entries of X by norm of X. \(l1normalize(X)_{ij} = \frac{X_{ij}}{\sum_k \sum_p |X_{kp}|}\)

Parameters:
X – A scipy sparse csr_matrix or numpy array.
axis – The dimension to normalize over.
Returns:	A normalized matrix.
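The three formulas above can be checked with a short NumPy sketch for dense arrays. It is not the module's implementation: `l1normalize_sketch` is a hypothetical name, and it assumes no row, column, or matrix has a zero norm.

```python
import numpy as np

def l1normalize_sketch(X, axis=1):
    """Divide entries by the sum of absolute values taken along `axis`."""
    X = np.asarray(X, dtype=float)
    norms = np.abs(X).sum(axis=axis, keepdims=True)
    return X / norms  # assumes every norm is nonzero

X = np.array([[1.0, -3.0],
              [2.0,  2.0]])

rows = l1normalize_sketch(X, axis=1)      # [[0.25, -0.75], [0.5, 0.5]]
cols = l1normalize_sketch(X, axis=0)      # each column's absolute values sum to 1
whole = l1normalize_sketch(X, axis=None)  # all absolute values sum to 1
print(rows)
```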

loader.l2normalize(X, axis=1)[source]¶

axis=1 normalizes each row of X by norm of said row. \(l2normalize(X)_{ij} = \frac{X_{ij}}{\sqrt{\sum_k X_{ ik}^2}}\)

axis=0 normalizes each column of X by norm of said column. \(l2normalize(X)_{ij} = \frac{X_{ij}}{\sqrt{\sum_k X_{kj}^2}}\)

axis=None normalizes entries of X by norm of X. \(l2normalize(X)_{ij} = \frac{X_{ij}}{\sqrt{\sum_k \sum_p X_{kp}^2}}\)

Parameters:
X – A scipy sparse csr_matrix or numpy array.
axis – The dimension to normalize over.
Returns:	A normalized matrix.
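As a worked example of the formulas above, take an illustrative 2×2 matrix (chosen here for easy arithmetic, not taken from the module). With axis=1 the row norms are \(\sqrt{3^2 + 4^2} = 5\) and \(\sqrt{0^2 + 2^2} = 2\); with axis=None the single norm is \(\sqrt{9 + 16 + 0 + 4} = \sqrt{29}\).

```latex
X = \begin{pmatrix} 3 & 4 \\ 0 & 2 \end{pmatrix}
\qquad
l2normalize(X)\big|_{axis=1} = \begin{pmatrix} 3/5 & 4/5 \\ 0 & 1 \end{pmatrix}
\qquad
l2normalize(X)\big|_{axis=None} = \frac{1}{\sqrt{29}} \begin{pmatrix} 3 & 4 \\ 0 & 2 \end{pmatrix}
```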

loader.load(filename)[source]¶

Calls import_data. Decides how to load data into python matrices by file extension. Raises Unsupported_format_error if extension is not one of the supported extensions (mat, sparse, binary, dense, sparsetxt, densetxt, index).

Parameters:	filename – A file of an accepted format representing a matrix.
Returns:	A numpy matrix, scipy sparse csr_matrix, or `HotIndex` object.

loader.makedirs(datadirectory, sub_directory_list=('train', 'dev', 'test'))[source]¶

Parameters:
datadirectory – Name of the directory to create, containing the subdirectory folders. If the directory already exists, it will be populated with the subdirectory folders.
sub_directory_list – The list of subdirectories to create.
Returns:	None
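The behaviour described above can be approximated with the standard library alone. This is a sketch under that assumption, not the module's code; `makedirs_sketch` and the `mydata` directory name are hypothetical.

```python
import os
import tempfile

def makedirs_sketch(datadirectory, sub_directory_list=("train", "dev", "test")):
    """Create datadirectory (if needed) and each listed subdirectory inside it."""
    for sub in sub_directory_list:
        # exist_ok=True lets an existing directory simply be populated.
        os.makedirs(os.path.join(datadirectory, sub), exist_ok=True)

root = os.path.join(tempfile.mkdtemp(), "mydata")
makedirs_sketch(root)
makedirs_sketch(root)  # calling again on an existing directory is harmless
created = sorted(os.listdir(root))
print(created)  # ['dev', 'test', 'train']
```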

loader.maxnormalize(X, axis=1)[source]¶

axis=1 normalizes each row of X by norm of said row. \(maxnormalize(X)_{ij} = \frac{X_{ij}}{max(X_{i:})}\)

axis=0 normalizes each column of X by norm of said column. \(maxnormalize(X)_{ij} = \frac{X_{ij}}{max(X_{ :j})}\)

axis=None normalizes entries of X by norm of X. \(maxnormalize(X)_{ij} = \frac{X_{ij}}{max(X)}\)

Parameters:
X – A scipy sparse csr_matrix or numpy array.
axis – The dimension to normalize over.
Returns:	A normalized matrix.
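A minimal NumPy sketch of the max normalization formulas above, for dense arrays only. It follows the formulas literally (dividing by the maximum entry, not the maximum absolute value) and assumes that maximum is nonzero; `maxnormalize_sketch` is a hypothetical name.

```python
import numpy as np

def maxnormalize_sketch(X, axis=1):
    """Divide entries by the maximum taken along `axis`."""
    X = np.asarray(X, dtype=float)
    return X / X.max(axis=axis, keepdims=True)  # assumes a nonzero maximum

X = np.array([[1.0, 4.0],
              [5.0, 2.0]])

rows = maxnormalize_sketch(X, axis=1)      # [[0.25, 1.0], [1.0, 0.4]]
whole = maxnormalize_sketch(X, axis=None)  # [[0.2, 0.8], [1.0, 0.4]]
print(rows)
```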

loader.maybe_download(filename, work_directory, source_url)[source]¶

Download the data from source url, unless it’s already here. From https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/learn/python/learn/datasets/base.py

Parameters:
filename – string, name of the file in the directory.
work_directory – string, path to working directory.
source_url – url to download from if file doesn’t exist.
Returns:	Path to resulting file.
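The download-unless-cached pattern can be sketched with the standard library as follows. This is an illustration only, not the module's or TensorFlow's code; `maybe_download_sketch`, the file name, and the URL are hypothetical. The usage below exercises only the cached path, so no network access takes place.

```python
import os
import tempfile
import urllib.request

def maybe_download_sketch(filename, work_directory, source_url):
    """Return the local path, downloading from source_url only if the file is missing."""
    os.makedirs(work_directory, exist_ok=True)
    filepath = os.path.join(work_directory, filename)
    if not os.path.exists(filepath):
        urllib.request.urlretrieve(source_url, filepath)
    return filepath

# Pretend the file was downloaded earlier.
workdir = tempfile.mkdtemp()
with open(os.path.join(workdir, "cached.bin"), "wb") as f:
    f.write(b"already here")

# The file exists, so the (placeholder) URL is never contacted.
path = maybe_download_sketch("cached.bin", workdir, "http://example.invalid/cached.bin")
print(os.path.basename(path))  # cached.bin
```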

loader.pca_whiten(X)[source]¶

Returns matrix with PCA whitening transform applied. This transform assumes that data points are rows of matrix.

Parameters:	X – A numpy array or scipy sparse matrix whose rows are data points.
Returns:	The matrix with the PCA whitening transform applied.
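For readers unfamiliar with PCA whitening, the sketch below shows one common formulation for dense arrays: center the columns, rotate onto the eigenvectors of the covariance matrix, and rescale each component to unit variance. It is not the module's implementation; the function name and the small `eps` stabilizer are assumptions.

```python
import numpy as np

def pca_whiten_sketch(X, eps=1e-8):
    """PCA-whiten X, treating rows as data points and columns as features."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                     # center each feature
    cov = Xc.T @ Xc / (Xc.shape[0] - 1)         # sample covariance of the features
    eigvals, eigvecs = np.linalg.eigh(cov)      # eigh: cov is symmetric
    return (Xc @ eigvecs) / np.sqrt(eigvals + eps)

rng = np.random.default_rng(0)
# Correlated 2-D data: 500 points stored as rows.
X = rng.normal(size=(500, 2)) @ np.array([[2.0, 0.0], [1.5, 0.5]])

W = pca_whiten_sketch(X)
cov_W = np.cov(W, rowvar=False)
print(np.round(cov_W, 3))  # approximately the identity matrix
```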

loader.read_data_sets(directory, folders=('train', 'dev', 'test'), hashlist=(), mix=False)[source]¶

Parameters:

directory – Root directory containing data to load.
folders – The subfolders of directory to read data from. By default these are the train, dev, and test folders; to read others, pass an explicit list.
hashlist – If you provide a hashlist, these files and only these files will be added to your DataSet objects. If you do not provide a hashlist, then anything with the privileged prefixes labels_ or features_ will be loaded.

Returns:

A DataSets object.

loader.save(filename, data)[source]¶

Calls `export_data`. Decides how to save data by file extension. Raises Unsupported_format_error if extension is not one of the supported extensions (mat, sparse, binary, dense, index). Data contained in .mat files should be saved in a matrix named data.

Parameters:
filename – A file of an accepted format representing a matrix.
data – A numpy array, scipy sparse matrix, or `HotIndex` object.

loader.tfidf(X, norm='l2row')[source]¶

Parameters:
X – A document-term matrix.
norm – Normalization strategy. ‘l2row’ normalizes the scores of rows by the length of rows after basic tfidf (each document vector is a unit vector); ‘count’ normalizes the scores of rows by the total word count of a document; ‘max’ normalizes the scores of rows by the maximum count for a single word in a document.
Returns:	The tfidf of document-term matrix X with optional normalization.
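The three normalization strategies can be illustrated with a small dense NumPy sketch. The exact weighting used by the module is not stated here, so this assumes raw counts as term frequency and idf = log(N / df); `tfidf_sketch` is a hypothetical name, and every document is assumed to contain at least one word.

```python
import numpy as np

def tfidf_sketch(X, norm="l2row"):
    """tf-idf of a dense document-term count matrix (rows are documents)."""
    X = np.asarray(X, dtype=float)
    n_docs = X.shape[0]
    doc_freq = np.count_nonzero(X, axis=0)           # documents containing each term
    idf = np.log(n_docs / np.maximum(doc_freq, 1))   # assumed idf; guards unseen terms
    scores = X * idf
    if norm == "l2row":
        length = np.sqrt((scores ** 2).sum(axis=1, keepdims=True))
        scores = scores / np.where(length == 0, 1.0, length)
    elif norm == "count":
        scores = scores / X.sum(axis=1, keepdims=True)   # total word count per document
    elif norm == "max":
        scores = scores / X.max(axis=1, keepdims=True)   # largest single-word count
    return scores

# Three documents over a three-term vocabulary; term 1 appears in every document.
X = np.array([[2, 1, 0],
              [0, 1, 3],
              [1, 1, 0]])
scores = tfidf_sketch(X)
print(scores)
```

Under this idf, a term that occurs in every document (the middle column) scores zero, and with ‘l2row’ each document vector has unit length.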

loader.toIndex(A)[source]¶

Parameters:	A – A matrix of one hot row vectors.
Returns:	The hot indices.

Examples

>>> import numpy
>>> from antk.core import loader
>>> x = numpy.array([[1,0,0], [0,0,1], [1,0,0]])
>>> loader.toIndex(x)
array([0, 2, 0])

loader.toOnehot(X, dim=None)[source]¶

Parameters:
X – Vector of indices or `HotIndex` object.
dim – Dimension of indexing.
Returns:	A sparse csr_matrix of one hots.

Examples

>>> import numpy
>>> from antk.core import loader
>>> x = numpy.array([0, 1, 2, 3])
>>> loader.toOnehot(x) 
<4x4 sparse matrix of type '<type 'numpy.float64'>'...
>>> loader.toOnehot(x).toarray()
array([[ 1.,  0.,  0.,  0.],
       [ 0.,  1.,  0.,  0.],
       [ 0.,  0.,  1.,  0.],
       [ 0.,  0.,  0.,  1.]])
>>> x = loader.HotIndex(x, dimension=8)
>>> loader.toOnehot(x).toarray()
array([[ 1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  1.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.]])

loader.unit_variance(X, axis=None)[source]¶

Parameters:	X – A matrix to transform to have unit variance (over columns with axis=0, over rows with axis=1, over all entries with axis=None).
Returns:	A matrix with unit variance along the specified axis.
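A minimal dense NumPy sketch of scaling to unit variance. It assumes the NumPy axis convention and the population standard deviation (NumPy's default, ddof=0), neither of which is confirmed by the text above; `unit_variance_sketch` is a hypothetical name.

```python
import numpy as np

def unit_variance_sketch(X, axis=None):
    """Divide by the standard deviation taken along `axis` (None uses all entries)."""
    X = np.asarray(X, dtype=float)
    return X / X.std(axis=axis, keepdims=True)  # assumes a nonzero deviation

X = np.array([[1.0, 10.0],
              [3.0, 30.0],
              [5.0, 20.0]])

scaled = unit_variance_sketch(X, axis=0)
print(scaled.var(axis=0))  # each column's variance is 1 up to rounding
```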

loader.untar(fname)[source]¶
