loader¶
Implements a general-purpose data loader for non-sequential machine learning tasks in Python. Several common data transformations are provided in this module, e.g., tfidf and whitening.
Exceptions¶
API¶
-
exception
loader.
Bad_directory_structure_error
[source]¶ Raised when a specified data directory does not contain a subfolder named in the folders argument to
read_data_sets
.
-
class
loader.
DataSet
(features, labels=None, num_examples=None, mix=False)[source]¶ General data structure for mini-batch gradient descent training involving non-sequential data.
Parameters: - features – A dictionary of string label names to data matrices. Matrices may be
HotIndex
, scipy sparse csr_matrix, or numpy arrays. - labels – A dictionary of string label names to data matrices. Matrices may be
HotIndex
, scipy sparse csr_matrix, or numpy arrays. - num_examples – How many data points.
- mix – Whether or not to shuffle per epoch.
Returns: Attributes
Methods
-
features
¶ A dictionary of feature matrices.
-
index_in_epoch
¶ The number of datapoints that have been trained on in a particular epoch.
-
labels
¶ A dictionary of label matrices.
-
mix_after_epoch
(mix)[source]¶ Whether or not to shuffle after training for an epoch.
Parameters: mix – True or False
-
next_batch
(batch_size)[source]¶ - Return a sub DataSet of the next batch_size examples.
- If shuffling enabled:
- If batch_size is greater than the number of examples left in the epoch, a DataSet of batch_size examples that wraps back to the beginning will be returned.
- If shuffling turned off:
- If batch_size is greater than the number of examples left in the epoch, points will be shuffled and a DataSet of batch_size examples is returned starting from index 0.
Parameters: batch_size – int Returns: A DataSet
object with the next batch_size examples.
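The wrap-around case described above can be illustrated with a minimal NumPy sketch. The helper `wrap_batch` and its arguments are hypothetical names used only for illustration; they are not part of loader, and the actual DataSet implementation may differ.

```python
import numpy as np

def wrap_batch(data, start, batch_size):
    """Return (batch, new_start), wrapping to the beginning when the epoch runs out."""
    n = data.shape[0]
    # Row indices for the batch, taken modulo the number of examples.
    idx = (start + np.arange(batch_size)) % n
    return data[idx], (start + batch_size) % n

data = np.arange(10).reshape(5, 2)      # 5 examples, 2 features each
batch, start = wrap_batch(data, 3, 4)   # only 2 examples remain in the epoch
```

Here the batch holds examples 3, 4, 0, and 1, and the next batch would start at index 2.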
-
class
loader.
DataSets
(datasets_map, mix=False)[source]¶ A record of DataSet objects with a display function.
Methods
-
class
loader.
HotIndex
(matrix, dimension=None)[source]¶ Index-vector representation of a one-hot matrix. The constructor accepts either a one-hot matrix, or a vector of indices and a dimension.
Attributes
Methods
-
dim
¶ The feature dimension of the one hot vector represented as indices.
-
shape
¶ The shape of the one hot matrix encoded.
-
vec
¶ The vector of hot indices.
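To make the relationship between the three attributes concrete, here is a small NumPy sketch that derives an index vector, a dimension, and a shape from a dense one-hot matrix and then rebuilds the matrix. It does not use the HotIndex class itself; the variable names simply mirror the attribute names above.

```python
import numpy as np

one_hot = np.array([[0, 1, 0, 0],
                    [0, 0, 0, 1],
                    [1, 0, 0, 0]])

vec = one_hot.argmax(axis=1)     # hot index of each row
dim = one_hot.shape[1]           # feature dimension of the one-hot vectors
shape = (vec.shape[0], dim)      # shape of the encoded one-hot matrix

# Rebuild the dense one-hot matrix from the compact representation.
rebuilt = np.zeros(shape, dtype=one_hot.dtype)
rebuilt[np.arange(vec.shape[0]), vec] = 1
```

The compact form stores one integer per row instead of a full row of zeros and a single one.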
-
-
exception
loader.
Mat_format_error
[source]¶ Raised if the .mat file being read does not contain a variable named data.
-
exception
loader.
Sparse_format_error
[source]¶ Raised when reading a plain text file with .sparsetxt extension and there are not three entries per line.
-
exception
loader.
Unsupported_format_error
[source]¶ Raised when a file is requested to be loaded or saved without one of the supported file extensions.
-
loader.
center
(X, axis=None)[source]¶ Parameters: X – A matrix to center about the mean (over columns: axis=0, over rows: axis=1, over all entries: axis=None). Returns: A matrix with entries centered along the specified axis.
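A minimal sketch of centering for dense arrays follows. It assumes the axis argument has the same meaning as in `numpy.mean`; `center_sketch` is a hypothetical stand-in, not the library function, and it does not handle sparse input.

```python
import numpy as np

def center_sketch(X, axis=None):
    """Subtract the mean taken over the given axis (None means all entries)."""
    X = np.asarray(X, dtype=float)
    if axis is None:
        return X - X.mean()
    # keepdims lets the mean broadcast back against X for either axis.
    return X - X.mean(axis=axis, keepdims=True)

X = np.array([[1.0, 2.0], [3.0, 6.0]])
cols = center_sketch(X, axis=0)   # each column now has zero mean
rows = center_sketch(X, axis=1)   # each row now has zero mean
```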
-
loader.
export_data
(filename, data)[source]¶ Decides how to save data by file extension. Raises
Unsupported_format_error
if the extension is not one of the supported extensions (mat, sparse, binary, dense, index). Data contained in .mat files should be saved in a matrix named data. Parameters: - filename – A file of an accepted format representing a matrix.
- data – A numpy array, scipy sparse matrix, or
HotIndex
object.
-
loader.
import_data
(filename)[source]¶ Decides how to load data into python matrices by file extension. Raises
Unsupported_format_error
if the extension is not one of the supported extensions (mat, sparse, binary, dense, sparsetxt, densetxt, index). Parameters: filename – A file of an accepted format representing a matrix. Returns: A numpy matrix, scipy sparse csr_matrix, or HotIndex.
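Dispatching on the file extension could look roughly like the sketch below. `UnsupportedFormat` and `extension_of` are hypothetical names standing in for the library's own exception and internals, which may be organized differently; only the list of extensions is taken from the text above.

```python
import os

SUPPORTED = {"mat", "sparse", "binary", "dense", "sparsetxt", "densetxt", "index"}

class UnsupportedFormat(Exception):
    """Hypothetical stand-in for loader.Unsupported_format_error."""

def extension_of(filename):
    """Return the supported extension of filename, or raise UnsupportedFormat."""
    ext = os.path.splitext(filename)[1].lstrip(".")
    if ext not in SUPPORTED:
        raise UnsupportedFormat("unsupported extension: %r" % ext)
    return ext
```

A loader built this way would look up a reader function keyed by the returned extension.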
-
loader.
is_one_hot
(A)[source]¶ Parameters: A – A numpy array or scipy sparse matrix Returns: True if matrix is a sparse matrix of one hot vectors, False otherwise Examples
>>> import numpy
>>> from antk.core import loader
>>> x = numpy.eye(3)
>>> loader.is_one_hot(x)
True
>>> x *= 5
>>> loader.is_one_hot(x)
False
>>> x = numpy.array([[1, 0, 0], [1, 0, 0], [1, 0, 0]])
>>> loader.is_one_hot(x)
True
>>> x[0,1] = 2
>>> loader.is_one_hot(x)
False
-
loader.
l1normalize
(X, axis=1)[source]¶ axis=1 normalizes each row of X by norm of said row. \(l1normalize(X)_{ij} = \frac{X_{ij}}{\sum_k |X_{ik}|}\)
axis=0 normalizes each column of X by norm of said column. \(l1normalize(X)_{ij} = \frac{X_{ij}}{\sum_k |X_{kj}|}\)
axis=None normalizes entries of X by norm of X. \(l1normalize(X)_{ij} = \frac{X_{ij}}{\sum_k \sum_p |X_{kp}|}\)
Parameters: - X – A scipy sparse csr_matrix or numpy array.
- axis – The dimension to normalize over.
Returns: A normalized matrix.
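The three formulas above can be sketched for dense arrays as follows. `l1normalize_sketch` is a hypothetical re-implementation for illustration; it does not handle sparse matrices or rows and columns whose norm is zero.

```python
import numpy as np

def l1normalize_sketch(X, axis=1):
    """Divide entries of a dense array by the L1 norm taken over `axis`."""
    X = np.asarray(X, dtype=float)
    if axis is None:
        return X / np.abs(X).sum()
    return X / np.abs(X).sum(axis=axis, keepdims=True)

X = np.array([[1.0, -3.0], [2.0, 2.0]])
rows = l1normalize_sketch(X)             # absolute values in each row sum to 1
cols = l1normalize_sketch(X, axis=0)     # absolute values in each column sum to 1
whole = l1normalize_sketch(X, axis=None) # absolute values of all entries sum to 1
```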
-
loader.
l2normalize
(X, axis=1)[source]¶ axis=1 normalizes each row of X by norm of said row. \(l2normalize(X)_{ij} = \frac{X_{ij}}{\sqrt{\sum_k X_{ ik}^2}}\)
axis=0 normalizes each column of X by norm of said column. \(l2normalize(X)_{ij} = \frac{X_{ij}}{\sqrt{\sum_k X_{kj}^2}}\)
axis=None normalizes entries of X by norm of X. \(l2normalize(X)_{ij} = \frac{X_{ij}}{\sqrt{\sum_k \sum_p X_{kp}^2}}\)
Parameters: - X – A scipy sparse csr_matrix or numpy array.
- axis – The dimension to normalize over.
Returns: A normalized matrix.
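For dense arrays the same three cases can be written directly with `numpy.linalg.norm`. This is only an illustration of the formulas above, not the library's code, and it assumes no row or column is entirely zero.

```python
import numpy as np

X = np.array([[3.0, 4.0], [0.0, 2.0]])

rows = X / np.linalg.norm(X, axis=1, keepdims=True)   # axis=1: unit-length rows
cols = X / np.linalg.norm(X, axis=0, keepdims=True)   # axis=0: unit-length columns
whole = X / np.linalg.norm(X)                         # axis=None: Frobenius norm of X
```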
-
loader.
load
(filename)[source]¶ Calls
import_data
. Decides how to load data into Python matrices by file extension. Raises Unsupported_format_error
if the extension is not one of the supported extensions (mat, sparse, binary, dense, sparsetxt, densetxt, index). Parameters: filename – A file of an accepted format representing a matrix. Returns: A numpy matrix, scipy sparse csr_matrix, or HotIndex.
-
loader.
makedirs
(datadirectory, sub_directory_list=('train', 'dev', 'test'))[source]¶ Parameters: - datadirectory – Name of the directory you want to create containing the subdirectory folders. If the directory already exists it will be populated with the subdirectory folders.
- sub_directory_list – The list of subdirectories you want to create
Returns: void
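The described behavior can be approximated with the standard library alone. The sketch below assumes Python 3 (for `exist_ok`); `makedirs_sketch` is a hypothetical stand-in, not the library function.

```python
import os
import tempfile

def makedirs_sketch(datadirectory, sub_directory_list=("train", "dev", "test")):
    """Create datadirectory (if needed) and the requested subfolders inside it."""
    for sub in sub_directory_list:
        # exist_ok=True lets an already existing directory be populated.
        os.makedirs(os.path.join(datadirectory, sub), exist_ok=True)

root = os.path.join(tempfile.mkdtemp(), "mydata")
makedirs_sketch(root)
created = sorted(os.listdir(root))
```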
-
loader.
maxnormalize
(X, axis=1)[source]¶ axis=1 normalizes each row of X by the maximum entry of said row. \(maxnormalize(X)_{ij} = \frac{X_{ij}}{max(X_{i:})}\)
axis=0 normalizes each column of X by the maximum entry of said column. \(maxnormalize(X)_{ij} = \frac{X_{ij}}{max(X_{:j})}\)
axis=None normalizes entries of X by the maximum entry of X. \(maxnormalize(X)_{ij} = \frac{X_{ij}}{max(X)}\)
Parameters: - X – A scipy sparse csr_matrix or numpy array.
- axis – The dimension to normalize over.
Returns: A normalized matrix.
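As a quick illustration of the formulas above on a dense array with positive entries (a sketch, not the library implementation):

```python
import numpy as np

X = np.array([[1.0, 4.0], [5.0, 2.0]])

rows = X / X.max(axis=1, keepdims=True)   # each row divided by its largest entry
cols = X / X.max(axis=0, keepdims=True)   # each column divided by its largest entry
whole = X / X.max()                       # every entry divided by the global maximum
```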
-
loader.
maybe_download
(filename, work_directory, source_url)[source]¶ Download the data from source url, unless it’s already here. From https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/learn/python/learn/datasets/base.py
Parameters: - filename – string, name of the file in the directory.
- work_directory – string, path to working directory.
- source_url – url to download from if file doesn’t exist.
Returns: Path to resulting file.
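A download-unless-present helper can be sketched with the standard library as below. This is an assumption about the general pattern, written for Python 3; `maybe_download_sketch` is a hypothetical name and the URL is a placeholder that is never contacted because the file already exists.

```python
import os
import tempfile
import urllib.request

def maybe_download_sketch(filename, work_directory, source_url):
    """Download source_url into work_directory/filename unless it already exists."""
    os.makedirs(work_directory, exist_ok=True)
    filepath = os.path.join(work_directory, filename)
    if not os.path.exists(filepath):
        # Only reached when the file is missing; requires network access.
        urllib.request.urlretrieve(source_url, filepath)
    return filepath

work = tempfile.mkdtemp()
open(os.path.join(work, "data.mat"), "wb").close()   # pretend it was fetched earlier
path = maybe_download_sketch("data.mat", work, "http://example.invalid/data.mat")
```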
-
loader.
pca_whiten
(X)[source]¶ Returns matrix with PCA whitening transform applied. This transform assumes that data points are rows of matrix.
Parameters: - X – Numpy array, scipy sparse matrix
- axis – Axis to whiten over.
Returns:
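One common way to PCA-whiten data whose points are rows is sketched below for dense arrays. It is an assumption about the general technique, not the library's code; `pca_whiten_sketch` and its `eps` parameter are hypothetical.

```python
import numpy as np

def pca_whiten_sketch(X, eps=1e-12):
    """PCA-whiten the rows of a dense array: decorrelate and scale to unit variance."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                   # center each feature (column)
    cov = Xc.T @ Xc / Xc.shape[0]             # feature covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)    # symmetric eigendecomposition
    # Rotate onto the principal axes, then rescale each axis to unit variance.
    return (Xc @ eigvecs) / np.sqrt(eigvals + eps)

rng = np.random.default_rng(0)
mixing = np.array([[2.0, 0.5, 0.0],
                   [0.0, 1.0, 0.3],
                   [0.0, 0.0, 0.5]])
X = rng.normal(size=(500, 3)) @ mixing        # correlated features
W = pca_whiten_sketch(X)
```

After the transform, the covariance of the whitened data is approximately the identity matrix.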
-
loader.
read_data_sets
(directory, folders=('train', 'dev', 'test'), hashlist=(), mix=False)[source]¶ Parameters: - directory – Root directory containing data to load.
- folders – The subfolders of directory to read data from. By default these are train, dev, and test; to read other subfolders, pass an explicit list.
- hashlist – If you provide a hashlist these files and only these files will be added to your
DataSet
objects. If you do not provide a hashlist, then anything with the privileged prefixes labels_ or features_ will be loaded.
Returns: A
DataSets
object.
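The default file selection (no hashlist) can be pictured with the sketch below, which only lists the files that carry the privileged prefixes. `find_data_files` and the file names are hypothetical, and the real function additionally loads the matrices and raises its own Bad_directory_structure_error rather than `FileNotFoundError`.

```python
import os
import tempfile

def find_data_files(directory, folders=("train", "dev", "test")):
    """Map each folder to the sorted file names with a features_ or labels_ prefix."""
    found = {}
    for folder in folders:
        path = os.path.join(directory, folder)
        if not os.path.isdir(path):
            # Stand-in for the library's directory-structure error.
            raise FileNotFoundError("missing subfolder: %s" % path)
        found[folder] = sorted(
            name for name in os.listdir(path)
            if name.startswith(("features_", "labels_"))
        )
    return found

# Build a throwaway directory tree with made-up file names.
root = tempfile.mkdtemp()
for folder in ("train", "dev", "test"):
    os.makedirs(os.path.join(root, folder))
    for name in ("features_words.sparse", "labels_ratings.mat", "notes.txt"):
        open(os.path.join(root, folder, name), "w").close()

files = find_data_files(root)
```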
-
loader.
save
(filename, data)[source]¶ Calls export_data. Decides how to save data by file extension. Raises
Unsupported_format_error
if the extension is not one of the supported extensions (mat, sparse, binary, dense, index). Data contained in .mat files should be saved in a matrix named data. Parameters: - filename – A file of an accepted format representing a matrix.
- data – A numpy array, scipy sparse matrix, or
HotIndex
object.
-
loader.
tfidf
(X, norm='l2row')[source]¶ Parameters: - X – A document-term matrix.
- norm – Normalization strategy: ‘l2row’ normalizes the scores of rows by the length of rows after basic tfidf (each document vector is a unit vector); ‘count’ normalizes the scores of rows by the total word count of a document; ‘max’ normalizes the scores of rows by the maximum count for a single word in a document.
Returns: The tfidf of document-term matrix X with optional normalization.
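The ‘l2row’ strategy can be sketched for a dense document-term matrix as follows. The idf formula used here, log(N / df), is an assumption (a common textbook variant) and may differ from what the library computes; `tfidf_sketch` is a hypothetical name.

```python
import numpy as np

def tfidf_sketch(X):
    """tf-idf with idf = log(N / df), followed by L2 normalization of each row."""
    X = np.asarray(X, dtype=float)
    n_docs = X.shape[0]
    df = np.count_nonzero(X, axis=0)            # documents containing each term
    idf = np.log(n_docs / np.maximum(df, 1))    # guard against unseen terms
    scores = X * idf
    norms = np.linalg.norm(scores, axis=1, keepdims=True)
    return scores / np.where(norms == 0, 1.0, norms)   # all-zero rows stay zero

X = np.array([[2, 0, 1],
              [0, 3, 1],
              [1, 1, 1]])
T = tfidf_sketch(X)
```

The third term appears in every document, so its idf is zero and its column vanishes; each remaining document vector has unit length.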
-
loader.
toIndex
(A)[source]¶ Parameters: A – A matrix of one hot row vectors. Returns: The hot indices. Examples
>>> import numpy
>>> from antk.core import loader
>>> x = numpy.array([[1,0,0], [0,0,1], [1,0,0]])
>>> loader.toIndex(x)
array([0, 2, 0])
-
loader.
toOnehot
(X, dim=None)[source]¶ Parameters: - X – Vector of indices or
HotIndex
object - dim – Dimension of indexing
Returns: A sparse csr_matrix of one hots.
Examples
>>> import numpy
>>> from antk.core import loader
>>> x = numpy.array([0, 1, 2, 3])
>>> loader.toOnehot(x)
<4x4 sparse matrix of type '<type 'numpy.float64'>'...
>>> loader.toOnehot(x).toarray()
array([[ 1., 0., 0., 0.],
       [ 0., 1., 0., 0.],
       [ 0., 0., 1., 0.],
       [ 0., 0., 0., 1.]])
>>> x = loader.HotIndex(x, dimension=8)
>>> loader.toOnehot(x).toarray()
array([[ 1., 0., 0., 0., 0., 0., 0., 0.],
       [ 0., 1., 0., 0., 0., 0., 0., 0.],
       [ 0., 0., 1., 0., 0., 0., 0., 0.],
       [ 0., 0., 0., 1., 0., 0., 0., 0.]])