Guide to Data Handling¶
The module inferpy.data.loaders
provides the basic functionality for handling data. In particular,
all the classes for loading data inherit from the class DataLoader
defined in this module.
CSV files¶
Data can be loaded from CSV files through the class CsvLoader,
whose objects can be built as follows:
from inferpy.data.loaders import CsvLoader
data_loader = CsvLoader(path="./tests/files/dataxy_0.csv")
where path
can be either a string indicating the location of the CSV file or
a list of strings (e.g., for datasets distributed across multiple CSV files):
file_list = [f"./tests/files/dataxy_{i}.csv" for i in [0,1]]
data_loader = CsvLoader(path=file_list)
A data loader can be built from CSV files with or without a header. However, in the case of a list of files, the presence of the header and the column names must be consistent across all the files.
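The consistency requirement above can be illustrated with a simplified sketch. Note that this is not InferPy's actual implementation, and the helper `consistent_headers` is a hypothetical name introduced here for illustration: before combining several CSV sources, we check that they all agree on the header row.

```python
# Simplified sketch (not InferPy's implementation) of the consistency
# check on CSV headers: all sources must share the same first row.
import csv
import io

def consistent_headers(csv_texts):
    """Return True if all CSV sources share the same header row."""
    headers = []
    for text in csv_texts:
        reader = csv.reader(io.StringIO(text))
        headers.append(next(reader))  # first row of each source
    return all(h == headers[0] for h in headers)

file_a = "x,y\n1.0,2.0\n"
file_b = "x,y\n3.0,4.0\n"
file_c = "a,b\n5.0,6.0\n"

print(consistent_headers([file_a, file_b]))  # True: same column names
print(consistent_headers([file_a, file_c]))  # False: column names differ
```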
When loading data from a CSV file, we might need to
map the columns in the dataset to another set of variables. This can be done
using the input argument var_dict,
which is a dictionary where the
keys are the variable names and the values are lists of integers indicating
the columns (0 stands for the first data column). For example, in a dataset whose column names
are "x"
and "y",
we might be interested in renaming them:
data_loader = CsvLoader(path="./tests/files/dataxy_0.csv", var_dict={"x1":[0], "x2":[1]})
This mapping functionality can also be used for grouping columns into a single variable:
data_loader = CsvLoader(path="./tests/files/dataxy_0.csv", var_dict={"A":[0,1]})
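The mapping logic can be sketched in plain Python. This is a conceptual illustration only, not InferPy's code, and the helper `apply_var_dict` is a hypothetical name: each key of the dictionary names a variable, and each value lists the 0-based column indices grouped under it.

```python
# Conceptual sketch (not InferPy's code) of the var_dict mapping:
# each variable collects the listed (0-based) columns from every row.
import csv
import io

def apply_var_dict(csv_text, var_dict):
    rows = list(csv.reader(io.StringIO(csv_text)))[1:]  # skip the header
    data = [[float(v) for v in row] for row in rows]
    return {name: [[row[i] for i in cols] for row in data]
            for name, cols in var_dict.items()}

csv_text = "x,y\n1.0,2.0\n3.0,4.0\n"

# renaming: column 0 becomes "x1", column 1 becomes "x2"
print(apply_var_dict(csv_text, {"x1": [0], "x2": [1]}))
# grouping: columns 0 and 1 are merged into a single variable "A"
print(apply_var_dict(csv_text, {"A": [0, 1]}))
```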
Data in memory¶
Analogously, a data loader can be built from data already loaded into memory, e.g.,
pandas data. To do this, we use the class SampleDictLoader,
which can be instantiated as follows:
import numpy as np
from inferpy.data.loaders import SampleDictLoader
samples = {"x": np.random.rand(1000), "y": np.random.rand(1000)}
data_loader = SampleDictLoader(sample_dict=samples)
Properties¶
From any object of class DataLoader
we can obtain the size (i.e., the number of instances)
or the list of variable names:
>>> data_loader.size
1000
>>> data_loader.variables
['x', 'y']
In the case of a CsvLoader,
we can determine whether or not the source files have
a header:
>>> data_loader.has_header
True
Extracting data¶
Data can be extracted as a dictionary (of NumPy arrays) or as a TensorFlow dataset object:
>>> data_loader.to_dict()
{'x': array([1.54217069e-02, 3.74321848e-02, 1.29080105e-01, ... ,8.44103262e-01]),
'y': array([1.49197044e-01, 4.19856938e-01, 2.63596605e-01, ... ,1.20826740e-01])}
>>> data_loader.to_tfdataset(batch_size=50)
<DatasetV1Adapter shapes: OrderedDict([(x, (50,)), (y, (50,))]),
types: OrderedDict([(x, tf.float32), (y, tf.float32)])>
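The batching behavior of to_tfdataset can be pictured with a rough sketch. This is not the TensorFlow implementation, and `to_batches` is a hypothetical helper introduced here: each variable's samples are split into consecutive batches of the requested size.

```python
# Rough sketch (not TensorFlow's implementation) of batching a sample
# dictionary: split each variable's values into chunks of batch_size.
def to_batches(sample_dict, batch_size):
    size = len(next(iter(sample_dict.values())))  # number of instances
    return [{name: values[i:i + batch_size]
             for name, values in sample_dict.items()}
            for i in range(0, size, batch_size)]

samples = {"x": list(range(100)), "y": list(range(100))}
batches = to_batches(samples, batch_size=50)
print(len(batches))          # 2 batches
print(len(batches[0]["x"]))  # 50 instances per batch
```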
Usage with probabilistic models¶
Performing inference in a probabilistic model is the final goal of loading data. Consider the following code for a simple linear regression:
import numpy as np
import tensorflow as tf
import inferpy as inf

@inf.probmodel
def linear_reg(d):
    w0 = inf.Normal(0, 1, name="w0")
    w = inf.Normal(tf.zeros([d, 1]), 1, name="w")

    with inf.datamodel():
        x = inf.Normal(tf.ones([d]), 2, name="x")
        y = inf.Normal(w0 + x @ w, 1.0, name="y")

@inf.probmodel
def qmodel(d):
    qw0_loc = inf.Parameter(0., name="qw0_loc")
    qw0_scale = tf.math.softplus(inf.Parameter(1., name="qw0_scale"))
    qw0 = inf.Normal(qw0_loc, qw0_scale, name="w0")

    qw_loc = inf.Parameter(tf.zeros([d, 1]), name="qw_loc")
    qw_scale = tf.math.softplus(inf.Parameter(tf.ones([d, 1]), name="qw_scale"))
    qw = inf.Normal(qw_loc, qw_scale, name="w")

# create an instance of the model
m = linear_reg(d=1)
vi = inf.inference.VI(qmodel(1), epochs=100)
So far, we have seen that inference is performed by invoking the method fit,
which takes a dictionary of samples as an input parameter:
m.fit(data={"x": np.random.rand(1000,1), "y": np.random.rand(1000,1)}, inference_method=vi)
The data
parameter can be replaced by an object of class DataLoader:
data_loader = CsvLoader(path="./tests/files/dataxy_with_header.csv")
m.fit(data=data_loader, inference_method=vi)
Note that the column names must be the same as those in the model. If they differ, or when reading from a file without a header, we use the mapping functionality:
data_loader = CsvLoader(path="./tests/files/dataxy_no_header.csv", var_dict={"x":[0], "y":[1]})
m.fit(data=data_loader, inference_method=vi)