*******************
datatool
*******************

Implements a general-purpose data loader for non-sequential machine learning tasks in Python. Given a directory path, the data loader reads data files (in commonly used formats) and returns a Python object containing the data as NumPy matrices, along with supporting functions for manipulating the data. The data loader supports data sets containing multiple feature inputs and multiple target labels.

Directory Structure
===================

.. image:: _static/directory.png

The top-level *directory* can have any name. It must contain three directories named **train**, **dev**, and **test**. These names are fixed.
If the **train**, **dev**, and **test** directories are not present, :any:`Bad_directory_structure_error` is raised during loading. The top-level directory may contain other files besides these three directories. According to the diagram:
	- *N* is the number of feature sets, not the number of elements in a feature vector of a particular feature set.
	- *Q* is the number of label sets, not the number of elements in a label vector of a particular label set.
	- The hash for a matrix in the :any:`DataSet.features` attribute is the part of the file name between **features_** and the file extension (*.ext*).
	- The hash for a matrix in the :any:`DataSet.labels` attribute is the part of the file name between **labels_** and the file extension (*.ext*).
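
The naming rules above can be sketched in a few lines of Python. This is only an illustration, not the loader's actual code: ``missing_splits``, ``matrix_hash``, and the file name ``features_words.densetxt`` are hypothetical names chosen for the example.

```python
import tempfile
from pathlib import Path

# Directory names required below the top-level directory.
REQUIRED_SPLITS = ("train", "dev", "test")


def missing_splits(directory):
    """Return the required split directories that are absent (hypothetical helper)."""
    root = Path(directory)
    return [name for name in REQUIRED_SPLITS if not (root / name).is_dir()]


def matrix_hash(filename):
    """Return (kind, hash) for a data file name, or None if no prefix matches."""
    stem = Path(filename).stem  # file name without its extension
    for kind in ("features", "labels"):
        prefix = kind + "_"
        if stem.startswith(prefix):
            return kind, stem[len(prefix):]
    return None


with tempfile.TemporaryDirectory() as top:
    for split in ("train", "dev"):
        (Path(top) / split).mkdir()
    print(missing_splits(top))  # ['test'], since it was never created

print(matrix_hash("features_words.densetxt"))  # ('features', 'words')
```

A loader following this layout would presumably raise :any:`Bad_directory_structure_error` whenever a check like ``missing_splits`` reports a missing directory.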

Notes
=====
		Rows of feature and label matrices should correspond to individual data points (one data point per row, not per column).
		Every file in the **train** directory should contain the same number of data points, and the same is true for
		the **dev** and **test** directories. The number of data points may of course differ between the **train**, **dev**, and **test** directories.
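
As a rough sketch of the row-count requirement, the check below compares the number of rows of every matrix belonging to one directory. The function ``check_row_counts`` and the dictionary keys are hypothetical and are not part of the loader's documented interface.

```python
import numpy as np


def check_row_counts(matrices):
    """Return the shared row count of all matrices, or raise ValueError.

    `matrices` maps a name to a 2-D array whose rows are data points.
    """
    counts = {name: m.shape[0] for name, m in matrices.items()}
    if len(set(counts.values())) > 1:
        raise ValueError("inconsistent number of data points: %r" % counts)
    return next(iter(counts.values()), 0)


# Illustrative data: 100 data points, two feature sets and one label set.
train = {
    "features_words": np.zeros((100, 50)),
    "features_user": np.zeros((100, 8)),
    "labels_rating": np.zeros((100, 1)),
}
print(check_row_counts(train))  # 100
```

The check would be run once per directory, since only the files inside the same directory need to agree.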

Supported Formats
=================

    **.mat**:
        MATLAB files of matrices made with the MATLAB ``save`` command. A matrix to be read must be saved under the name **data**.

    **.sparsetxt**:
        Plain text files in which each line corresponds to one matrix entry and consists of the values **i j k**, so that a matrix *A* is constructed with :math:`A_{ij} = k`. Tokens must be whitespace delimited.

    **.densetxt**:
        Plain text files with a matrix represented in standard form. Tokens must be whitespace delimited.

    **.sparse**:
        Like :any:`.sparsetxt` files but written in binary (no delimiters) to save disk space and speed up file I/O. The matrix dimensions are stored in the first bytes of the file.

    **.binary**:
        Like :any:`.densetxt` files but written in binary (no delimiters) to save disk space and speed up file I/O. The matrix dimensions are stored in the first bytes of the file.
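
The two plain text formats can be illustrated with a minimal reading sketch. It rests on assumptions the description above does not settle: indices in **.sparsetxt** files are taken to be zero-based, and the matrix shape is inferred from the largest index seen. The function names ``read_sparsetxt`` and ``read_densetxt`` are hypothetical and are not the loader's own.

```python
import io

import numpy as np


def read_sparsetxt(lines):
    """Build a dense array from whitespace-delimited ``i j k`` lines.

    Assumption: indices are zero-based and the shape is inferred
    from the largest row and column index.
    """
    entries = []
    for line in lines:
        tokens = line.split()
        if not tokens:
            continue  # skip blank lines
        entries.append((int(tokens[0]), int(tokens[1]), float(tokens[2])))
    if not entries:
        return np.zeros((0, 0))
    n_rows = max(i for i, _, _ in entries) + 1
    n_cols = max(j for _, j, _ in entries) + 1
    matrix = np.zeros((n_rows, n_cols))
    for i, j, k in entries:
        matrix[i, j] = k  # A_ij = k
    return matrix


def read_densetxt(lines):
    """Read a whitespace-delimited matrix written in standard form."""
    return np.loadtxt(lines, ndmin=2)


sparse = read_sparsetxt(io.StringIO("0 0 1.5\n1 2 4.0\n"))
dense = read_densetxt(io.StringIO("1 2 3\n4 5 6\n"))
print(sparse.shape, dense.shape)  # (2, 3) (2, 3)
```

Both functions accept any iterable of lines, so an open file object can be passed in place of the ``io.StringIO`` buffers used here.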

Possible Extensions
===================
    **Data transformations**
        Mean cancellation, KL expansion, covariance equalization, data whitening,
        and shifting labels to avoid the asymptotes of the logistic function.
    **Feed**
        Advanced shuffling, variable batch size.
    **Data sets**
        Support for sequential and tensor data.

loader
======

.. automodule:: loader
   :members:
   :undoc-members: