Guide to Data Handling

The module inferpy.data.loaders provides the basic functionality for handling data. In particular, all classes for loading data inherit from the class DataLoader, defined in this module.

CSV files

Data can be loaded from CSV files through the class CsvLoader whose objects can be built as follows:

from inferpy.data.loaders import CsvLoader

data_loader = CsvLoader(path="./tests/files/dataxy_0.csv")

where path can be either a string indicating the location of the CSV file or a list of strings (e.g., for datasets distributed across multiple CSV files):

file_list = [f"./tests/files/dataxy_{i}.csv" for i in [0,1]]
data_loader = CsvLoader(path=file_list)

A data loader can be built from CSV files with or without a header. However, in the case of a list of files, the presence of a header and the column names must be consistent across all the files.
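This consistency requirement can be illustrated with a small standalone sketch (a hypothetical helper, not part of InferPy) that compares the header row of every file in a list:

```python
import csv
import os
import tempfile

def headers_consistent(paths):
    """Return True if every CSV file in `paths` has the same first row."""
    first = None
    for p in paths:
        with open(p, newline="") as f:
            row = next(csv.reader(f))
        if first is None:
            first = row
        elif row != first:
            return False
    return True

# Create two small CSV files with matching headers to exercise the check
tmpdir = tempfile.mkdtemp()
paths = []
for i in range(2):
    p = os.path.join(tmpdir, f"dataxy_{i}.csv")
    with open(p, "w", newline="") as f:
        csv.writer(f).writerows([["x", "y"], [0.1, 0.2]])
    paths.append(p)

print(headers_consistent(paths))  # True: both files share the header ["x", "y"]
```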

When loading data from a CSV file, we might need to map the columns in the dataset to another set of variables. This can be done using the input argument var_dict, which is a dictionary whose keys are variable names and whose values are lists of integers indicating the columns (0 stands for the first data column). For example, in a dataset whose column names are "x" and "y", we might be interested in renaming them:

data_loader = CsvLoader(path="./tests/files/dataxy_0.csv", var_dict={"x1":[0], "x2":[1]})

This mapping functionality can also be used for grouping columns into a single variable:

data_loader = CsvLoader(path="./tests/files/dataxy_0.csv", var_dict={"A":[0,1]})
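To make the mapping concrete, the following standalone sketch (a hypothetical reimplementation for illustration, not the actual CsvLoader code) shows how a var_dict both renames and groups data columns into named variables:

```python
import numpy as np

def apply_var_dict(data, var_dict):
    """Map a 2-D array of samples to a {name: columns} dictionary.

    `var_dict` maps variable names to lists of column indices,
    mirroring the CsvLoader argument of the same name."""
    return {name: data[:, cols] for name, cols in var_dict.items()}

data = np.random.rand(1000, 2)                 # two data columns, e.g. "x" and "y"
grouped = apply_var_dict(data, {"A": [0, 1]})  # both columns under one variable
print(grouped["A"].shape)                      # (1000, 2)
```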

Data in memory

Analogously, a data loader can be built from data already loaded into memory, e.g., numpy arrays or pandas columns. To do this, we use the class SampleDictLoader, which can be instantiated as follows.

import numpy as np
from inferpy.data.loaders import SampleDictLoader

samples = {"x": np.random.rand(1000), "y": np.random.rand(1000)}
data_loader = SampleDictLoader(sample_dict=samples)
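A sample dict is simply a mapping from variable names to arrays with one entry per instance, so all arrays must share the same length. A minimal validation sketch (a hypothetical helper, not part of InferPy):

```python
import numpy as np

def check_sample_dict(sample_dict):
    """Return the common number of instances, raising if lengths differ."""
    sizes = {name: len(values) for name, values in sample_dict.items()}
    if len(set(sizes.values())) != 1:
        raise ValueError(f"inconsistent sample sizes: {sizes}")
    return next(iter(sizes.values()))

samples = {"x": np.random.rand(1000), "y": np.random.rand(1000)}
print(check_sample_dict(samples))  # 1000
```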

Properties

From any object of class DataLoader, we can obtain the size (i.e., the number of instances) or the list of variable names:

>>> data_loader.size
1000
>>> data_loader.variables
['x', 'y']

In the case of a CsvLoader, we can determine whether the source files have a header:

>>> data_loader.has_header
True

Extracting data

Data can be extracted as a dictionary of numpy arrays or as a TensorFlow dataset object:

>>> data_loader.to_dict() 
{'x': array([1.54217069e-02, 3.74321848e-02, 1.29080105e-01, ... ,8.44103262e-01]),
 'y': array([1.49197044e-01, 4.19856938e-01, 2.63596605e-01, ... ,1.20826740e-01])}

>>> data_loader.to_tfdataset(batch_size=50)
<DatasetV1Adapter shapes: OrderedDict([(x, (50,)), (y, (50,))]), 
types: OrderedDict([(x, tf.float32), (y, tf.float32)])>
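Conceptually, to_tfdataset splits each variable's array into consecutive fixed-size batches. The idea can be sketched with plain numpy (an illustrative stand-in for the TensorFlow dataset, not InferPy's implementation):

```python
import numpy as np

def batches(sample_dict, batch_size):
    """Yield dictionaries holding consecutive slices of each variable."""
    n = len(next(iter(sample_dict.values())))
    for start in range(0, n, batch_size):
        yield {name: values[start:start + batch_size]
               for name, values in sample_dict.items()}

samples = {"x": np.random.rand(1000), "y": np.random.rand(1000)}
first = next(batches(samples, 50))
print(first["x"].shape)  # (50,)
```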

Usage with probabilistic models

Performing inference in a probabilistic model is the final goal of loading data. Consider the following code for a simple linear regression:

import inferpy as inf
import numpy as np
import tensorflow as tf

@inf.probmodel
def linear_reg(d):
    w0 = inf.Normal(0, 1, name="w0")
    w = inf.Normal(tf.zeros([d,1]), 1, name="w")

    with inf.datamodel():
        x = inf.Normal(tf.ones([d]), 2, name="x")
        y = inf.Normal(w0 + x @ w, 1.0, name="y")


@inf.probmodel
def qmodel(d):
    qw0_loc = inf.Parameter(0., name="qw0_loc")
    qw0_scale = tf.math.softplus(inf.Parameter(1., name="qw0_scale"))
    qw0 = inf.Normal(qw0_loc, qw0_scale, name="w0")

    qw_loc = inf.Parameter(tf.zeros([d,1]), name="qw_loc")
    qw_scale = tf.math.softplus(inf.Parameter(tf.ones([d,1]), name="qw_scale"))
    qw = inf.Normal(qw_loc, qw_scale, name="w")


# create an instance of the model
m = linear_reg(d=1)
vi = inf.inference.VI(qmodel(1), epochs=100)

We have seen so far that, to perform inference, we invoke the method fit, which takes a dictionary of samples as an input parameter:

m.fit(data={"x": np.random.rand(1000,1), "y": np.random.rand(1000,1)}, inference_method=vi)

The data parameter can be replaced by an object of class DataLoader:

data_loader = CsvLoader(path="./tests/files/dataxy_with_header.csv")
m.fit(data=data_loader, inference_method=vi)

Note that the column names must match the variable names in the model. If they differ, or when reading from a file without a header, we can use the mapping functionality:

data_loader = CsvLoader(path="./tests/files/dataxy_no_header.csv", var_dict={"x":[0], "y":[1]})
m.fit(data=data_loader, inference_method=vi)