{% extends "about/base.html" %} {% load i18n %} {% block title %}{% trans "About" %} :: HDF5{% endblock %} {% block breadcrumbs %}{% trans "About" %} / HDF5{% endblock %} {% block content %}

{% trans "Our HDF5 Format Explained" %}

HDF5 {% trans "is a data model, library, and file format for storing and managing data. It supports an unlimited variety of datatypes, and is designed for flexible and efficient I/O and for high volume and complex data. HDF5 is portable and is extensible, allowing applications to evolve in their use of HDF5. The HDF5 Technology suite includes tools and applications for managing, manipulating, viewing, and analyzing data in the HDF5 format." %}

{% trans "Internally, we use HDF5 because its hierarchical structure allows efficient access to data attributes. It also lets us store data flexibly, which we consider essential given the wide variety of file formats in use across the machine learning community. We store two types of files in HDF5 format: the dataset itself and the split file. Both are described in the sections below." %}

{% trans "Please also have a look at the example page on how you can use the files you download from this site" %}: example page. {% trans "An" %} {% trans "example file" %} {% trans "is available for download as well." %}
{% trans "You can download our converter tool from" %} mloss.org {% trans "and convert data files locally to check that everything is transformed correctly, and perhaps help debug problems." %}

{% trans "HDF5 Attributes and Datasets" %}

{% trans "The basic abstraction is that a dataset is a large collection of objects of the same type. Each object is a fixed-length array of features, which may be of different types. For example, each object can be an array of mixed categorical and numerical data, but all objects in the dataset have the same mix." %}

{% trans "HDF5 attributes on the root level" %}

{% trans "HDF5 group 'data_descr'" %}

{% trans "HDF5 group 'data'" %}

{% trans "EITHER single sparse matrix when all attribute types are the same" %}

{% trans "This represents the Compressed Sparse Column format described at" %} scipy.org {% trans "and" %} Wikipedia.
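
{% trans "As an illustration of the Compressed Sparse Column layout, the following minimal sketch converts a small dense matrix into three arrays. The array names data, indices and indptr follow the attribute names used by scipy; whether the datasets in our files carry exactly these names is an assumption of this example, and the helper functions are hypothetical." %}

```python
def dense_to_csc(rows):
    """Convert a dense matrix (list of equal-length rows) to CSC arrays.

    Returns (data, indices, indptr), named after scipy's csc_matrix
    attributes. This is an illustrative sketch, not the converter's code.
    """
    n_rows = len(rows)
    n_cols = len(rows[0]) if rows else 0
    data, indices, indptr = [], [], [0]
    for col in range(n_cols):
        for row in range(n_rows):
            value = rows[row][col]
            if value != 0:
                data.append(value)    # the nonzero value itself
                indices.append(row)   # the row it lives in
        # indptr[col + 1] marks where this column's entries end in data
        indptr.append(len(data))
    return data, indices, indptr


def csc_column(data, indices, indptr, col):
    """Return the (row, value) pairs stored for one column."""
    start, end = indptr[col], indptr[col + 1]
    return list(zip(indices[start:end], data[start:end]))


matrix = [
    [1, 0, 2],
    [0, 0, 3],
    [4, 0, 0],
]
data, indices, indptr = dense_to_csc(matrix)
```

{% trans "Column j occupies the slice from indptr[j] to indptr[j + 1] of both data and indices, so an empty column simply repeats the previous pointer value." %}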

{% trans "OR single dense matrix when all attribute types are the same" %}

{% trans "OR multiple datasets when variable types are mixed" %}

{% trans "If possible, short names are used as the dataset names for better recognition; otherwise, the variable type (int, double, str) is used. In the latter case, the datasets are also numbered for uniqueness." %}
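
{% trans "The naming rule can be sketched as follows. The function dataset_names is hypothetical, and the exact numbering scheme (a per-type counter starting at 0, appended to the type name) is an assumption made for illustration; the converter may number datasets differently." %}

```python
from collections import Counter


def dataset_names(columns):
    """Choose a dataset name for each column.

    columns is a list of (short_name, type_name) pairs, where short_name
    may be None. Hypothetical sketch: the numbering scheme is assumed.
    """
    counters = Counter()
    names = []
    for short_name, type_name in columns:
        if short_name:
            # A short name is available, so use it directly.
            names.append(short_name)
        else:
            # Fall back to the variable type plus a running number.
            names.append(f"{type_name}{counters[type_name]}")
            counters[type_name] += 1
    return names


example = dataset_names([
    ("age", "int"),
    (None, "double"),
    (None, "double"),
    (None, "str"),
])
```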

{% trans "OPTIONAL vector/matrix with natural labels" %}

{% trans "Natural labels are currently considered for files in LibSVM format." %}

{% trans "Split files and Tasks" %}

{% trans "Note that the distinction between input and output, or label and target, depends on the task, not on the dataset itself. We do, however, have a mechanism in place to create split files automatically when importing datasets from other repositories. These datasets may be defined in the split files:" %}

{% trans "HDF5 attributes on the root level" %}

{% trans "See" %} {% trans "above" %}

{% trans "HDF5 group 'task_descr' (derived from the Task object)" %}

{% trans "HDF5 group 'task'" %}
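
{% trans "To illustrate why inputs and outputs belong to the task rather than to the dataset, the sketch below applies two different tasks to the same rows. The dictionary keys (input, output, train, test) and the function apply_task are hypothetical names chosen for this example; they are not the actual layout of the task group." %}

```python
# One dataset: four objects with three features each.
rows = [
    [5.1, 3.5, 0],
    [4.9, 3.0, 0],
    [6.2, 2.9, 1],
    [5.9, 3.0, 1],
]

# Two hypothetical tasks over the same data. Each names its own
# input columns, output columns, and train/test row indices.
task_a = {"input": [0, 1], "output": [2], "train": [0, 2], "test": [1, 3]}
task_b = {"input": [0, 2], "output": [1], "train": [0, 1, 2], "test": [3]}


def apply_task(rows, task, split):
    """Return (inputs, outputs) for the given split of a task."""
    selected = [rows[i] for i in task[split]]
    inputs = [[row[c] for c in task["input"]] for row in selected]
    outputs = [[row[c] for c in task["output"]] for row in selected]
    return inputs, outputs


train_a = apply_task(rows, task_a, "train")
test_b = apply_task(rows, task_b, "test")
```

{% trans "Here the third column is the target in the first task but an input in the second, while the dataset itself is unchanged." %}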

{% trans "Supported formats" %}

{% trans "The website (and its converter tool) currently supports conversion from and to the following data formats:" %}

{% trans "to" %} HDF5
    {% for f in supported_formats.to %}
  • {{ f }}
    {% endfor %}

{% trans "from" %} HDF5
    {% for f in supported_formats.from %}
  • {{ f }}
    {% endfor %}

{% trans "When uploading data, files may be compressed with gzip or bzip2, or packed as a single file in a zip archive or tarball. They are decompressed automatically after upload." %}
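
{% trans "If you want to reproduce this step locally, the following sketch uses only the Python standard library. The function decompress_upload is hypothetical, and detecting the compression from the leading magic bytes is an assumption of this example, not a description of how the site does it." %}

```python
import bz2
import gzip
import io
import tarfile
import zipfile


def decompress_upload(payload):
    """Return the decompressed content of an uploaded file as bytes.

    Handles gzip, bzip2, and zip archives or tarballs holding exactly
    one file. Anything else is returned unchanged.
    """
    # Stream compression first; the result may itself be a tarball.
    if payload[:2] == b"\x1f\x8b":      # gzip magic number
        payload = gzip.decompress(payload)
    elif payload[:3] == b"BZh":         # bzip2 magic number
        payload = bz2.decompress(payload)

    buffer = io.BytesIO(payload)
    if zipfile.is_zipfile(buffer):
        with zipfile.ZipFile(buffer) as archive:
            names = [n for n in archive.namelist() if not n.endswith("/")]
            if len(names) != 1:
                raise ValueError("archive must contain exactly one file")
            return archive.read(names[0])

    buffer.seek(0)
    try:
        # mode "r:" reads an uncompressed tar stream only
        with tarfile.open(fileobj=buffer, mode="r:") as archive:
            members = [m for m in archive.getmembers() if m.isfile()]
            if len(members) != 1:
                raise ValueError("archive must contain exactly one file")
            return archive.extractfile(members[0]).read()
    except tarfile.ReadError:
        # Not a tarball: treat the payload as the data file itself.
        return payload
```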

{% trans "A Python implementation of the converter is included in the source tarball under utils/hdf5conv (see scripts/hdf5conv.py for an example of how to use it). You can also download an " %}{% trans "example file" %}.


{% endblock %}