Last Updated on November 23, 2022

In machine learning and deep learning problems, a lot of effort goes into preparing the data. Data is usually messy and needs to be preprocessed before it can be used for training a model. If the data is not prepared correctly, the model won’t be able to generalize well.

Some of the common steps required for data preprocessing include:

- Data normalization: This involves rescaling the values in a dataset so that they fall within a common range.
- Data augmentation: This includes generating new samples from existing ones by adding noise or shifts in features to make them more diverse.

Data preparation is a crucial step in any machine learning pipeline. PyTorch comes with modules such as `torchvision`, which provides datasets and dataset classes that make data preparation easier.

In this tutorial we’ll demonstrate how to work with datasets and transforms in PyTorch so that you may create your own custom dataset classes and manipulate the datasets the way you want. In particular, you’ll learn:

- How to create a simple dataset class and apply transforms to it.
- How to build callable transforms and apply them to the dataset object.
- How to compose various transforms on a dataset object.

Note that here you’ll play with simple datasets for general understanding of the concepts while in the next part of this tutorial you’ll get a chance to work with dataset objects for images.

Let’s get started.

This tutorial is in three parts; they are:

- Creating a Simple Dataset Class
- Creating Callable Transforms
- Composing Multiple Transforms for Datasets

Before we create the dataset class, we’ll need to import a few packages.

    import torch
    from torch.utils.data import Dataset

    torch.manual_seed(42)

We import the abstract class `Dataset` from `torch.utils.data`. Our custom dataset subclasses it and overrides the following methods:

- `__len__`, so that `len(dataset)` returns the size of the dataset.
- `__getitem__`, to support indexing, so that `dataset[i]` retrieves the i-th data sample.

Likewise, `torch.manual_seed()` seeds PyTorch’s random number generator so that random operations produce the same values every time the script is run.
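As a small sketch of what seeding does, the snippet below seeds the generator twice with the same value and compares two random draws. The variable names `first_draw` and `second_draw` are only for illustration and are not used elsewhere in this tutorial.

```python
import torch

# Seed the global random number generator, then draw three random numbers.
torch.manual_seed(42)
first_draw = torch.rand(3)

# Re-seeding with the same value restarts the same random sequence.
torch.manual_seed(42)
second_draw = torch.rand(3)

print(torch.equal(first_draw, second_draw))  # True
```

The dataset in this tutorial is built with `torch.eye()`, which is deterministic, so the seed only matters once you add random operations to your own code.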

Now, let’s define the dataset class.

    class SimpleDataset(Dataset):
        # defining values in the constructor
        def __init__(self, data_length = 20, transform = None):
            self.x = 3 * torch.eye(data_length, 2)
            self.y = torch.eye(data_length, 4)
            self.transform = transform
            self.len = data_length

        # Getting the data samples
        def __getitem__(self, idx):
            sample = self.x[idx], self.y[idx]
            if self.transform:
                sample = self.transform(sample)
            return sample

        # Getting data size/length
        def __len__(self):
            return self.len

In the object constructor, we create the features and targets and assign them to the tensors `self.x` and `self.y`. Each tensor holds 20 data samples: `self.x` has shape (20, 2) and `self.y` has shape (20, 4). The attribute `self.len` stores the number of data samples, which is set by the `data_length` parameter. We’ll discuss the transforms later in the tutorial.
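To see what these tensors look like, here is a minimal sketch that builds them outside the class. It assumes a smaller `data_length` of 5 (instead of the default 20) purely to keep the printout short.

```python
import torch

# Assumption: a smaller length than the tutorial's default of 20.
data_length = 5

# torch.eye(n, m) returns an n-by-m matrix with ones on the diagonal
# and zeros everywhere else.
x = 3 * torch.eye(data_length, 2)
y = torch.eye(data_length, 4)

print(x.shape)  # torch.Size([5, 2])
print(y.shape)  # torch.Size([5, 4])
print(x)
```

Because `x` has only two columns, only its first two rows contain a nonzero entry; every later row is all zeros. This is why the samples at index 2 and above have `x` equal to zero.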

A `SimpleDataset` object behaves like a Python sequence, such as a list or a tuple: it supports `len()` and indexing. Now, let’s create a `SimpleDataset` object and look at its total length and the value at index 1.

    dataset = SimpleDataset()
    print("length of the SimpleDataset object: ", len(dataset))
    print("accessing value at index 1 of the simple_dataset object: ", dataset[1])

This prints

    length of the SimpleDataset object: 20
    accessing value at index 1 of the simple_dataset object: (tensor([0., 3.]), tensor([0., 1., 0., 0.]))

As our dataset is iterable, let’s print out the first four elements using a loop:

    for i in range(4):
        x, y = dataset[i]
        print(x, y)

This prints

    tensor([3., 0.]) tensor([1., 0., 0., 0.])
    tensor([0., 3.]) tensor([0., 1., 0., 0.])
    tensor([0., 0.]) tensor([0., 0., 1., 0.])
    tensor([0., 0.]) tensor([0., 0., 0., 1.])
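Since `__getitem__` forwards `idx` straight to the underlying tensors, a slice should work as an index as well. The sketch below repeats the class definition so that it runs on its own; the names `x_first` and `y_first` are illustrative only.

```python
import torch
from torch.utils.data import Dataset

class SimpleDataset(Dataset):
    def __init__(self, data_length=20, transform=None):
        self.x = 3 * torch.eye(data_length, 2)
        self.y = torch.eye(data_length, 4)
        self.transform = transform
        self.len = data_length

    def __getitem__(self, idx):
        sample = self.x[idx], self.y[idx]
        if self.transform:
            sample = self.transform(sample)
        return sample

    def __len__(self):
        return self.len

dataset = SimpleDataset()

# The slice is passed on to tensor indexing, so we get the
# first four samples back as two stacked tensors.
x_first, y_first = dataset[:4]
print(x_first.shape)  # torch.Size([4, 2])
print(y_first.shape)  # torch.Size([4, 4])
```

This works here only because both `self.x` and `self.y` are tensors. A dataset that loads each sample from disk, for example, would need its own slice handling.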

In several cases, you’ll need to create callable transforms in order to normalize or standardize the data. These transforms can then be applied to the tensors. Let’s create a callable transform and apply it to our “simple dataset” object we created earlier in this tutorial.

    # Creating a callable transform class MultDivide
    class MultDivide:
        # Constructor
        def __init__(self, mult_x = 2, divide_y = 3):
            self.mult_x = mult_x
            self.divide_y = divide_y

        # caller
        def __call__(self, sample):
            x = sample[0]
            y = sample[1]
            x = x * self.mult_x
            y = y / self.divide_y
            sample = x, y
            return sample

We have created a simple custom transform, `MultDivide`, that multiplies `x` by `2` and divides `y` by `3`. It has no practical use; it only demonstrates how a callable class can work as a transform for our dataset class. Remember that we declared a parameter `transform = None` in the `SimpleDataset` constructor. Now we can replace that `None` with the custom transform object that we’ve just created.

So, let’s demonstrate how it’s done and call this transform object on our dataset to see how it transforms the first four elements of our dataset.

    # calling the transform object
    mul_div = MultDivide()
    custom_dataset = SimpleDataset(transform = mul_div)

    for i in range(4):
        x, y = dataset[i]
        print('Idx: ', i, 'Original_x: ', x, 'Original_y: ', y)
        x_, y_ = custom_dataset[i]
        print('Idx: ', i, 'Transformed_x:', x_, 'Transformed_y:', y_)

This prints

    Idx: 0 Original_x: tensor([3., 0.]) Original_y: tensor([1., 0., 0., 0.])
    Idx: 0 Transformed_x: tensor([6., 0.]) Transformed_y: tensor([0.3333, 0.0000, 0.0000, 0.0000])
    Idx: 1 Original_x: tensor([0., 3.]) Original_y: tensor([0., 1., 0., 0.])
    Idx: 1 Transformed_x: tensor([0., 6.]) Transformed_y: tensor([0.0000, 0.3333, 0.0000, 0.0000])
    Idx: 2 Original_x: tensor([0., 0.]) Original_y: tensor([0., 0., 1., 0.])
    Idx: 2 Transformed_x: tensor([0., 0.]) Transformed_y: tensor([0.0000, 0.0000, 0.3333, 0.0000])
    Idx: 3 Original_x: tensor([0., 0.]) Original_y: tensor([0., 0., 0., 1.])
    Idx: 3 Transformed_x: tensor([0., 0.]) Transformed_y: tensor([0.0000, 0.0000, 0.0000, 0.3333])

As you can see, the transform has been successfully applied to the first four elements of the dataset.
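The dataset class only ever calls `self.transform(sample)`, so any callable that takes a sample tuple and returns a sample tuple should work, not just a class with `__call__`. Here is a minimal sketch using a plain function. The helper `scale_features` and its `factor` parameter are hypothetical and exist only for this illustration.

```python
import torch
from torch.utils.data import Dataset

class SimpleDataset(Dataset):
    def __init__(self, data_length=20, transform=None):
        self.x = 3 * torch.eye(data_length, 2)
        self.y = torch.eye(data_length, 4)
        self.transform = transform
        self.len = data_length

    def __getitem__(self, idx):
        sample = self.x[idx], self.y[idx]
        if self.transform:
            sample = self.transform(sample)
        return sample

    def __len__(self):
        return self.len

# Hypothetical transform written as a plain function:
# it scales the features and leaves the targets untouched.
def scale_features(sample, factor=10):
    x, y = sample
    return x * factor, y

scaled_dataset = SimpleDataset(transform=scale_features)
x, y = scaled_dataset[0]
print(x.tolist())  # [30.0, 0.0]
print(y.tolist())  # [1.0, 0.0, 0.0, 0.0]
```

A callable class is still the better choice when the transform needs configurable parameters, since the constructor stores them once and `__call__` reuses them for every sample.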

We often want to apply multiple transforms to a dataset in series. This can be done by importing the `Compose` class from the `transforms` module in torchvision. For instance, let’s say we build another transform, `SubtractOne`, and apply it to our dataset in addition to the `MultDivide` transform that we created earlier.

Once applied, the newly created transform will subtract 1 from each element of the dataset.

    from torchvision import transforms

    # Creating subtract_one transform
    class SubtractOne:
        # Constructor
        def __init__(self, number = 1):
            self.number = number

        # caller
        def __call__(self, sample):
            x = sample[0]
            y = sample[1]
            x = x - self.number
            y = y - self.number
            sample = x, y
            return sample

As specified earlier, we’ll now combine both transforms with the `Compose` class.

    # Composing multiple transforms
    mult_transforms = transforms.Compose([MultDivide(), SubtractOne()])

Note that the `MultDivide` transform is applied to the dataset first, and then the `SubtractOne` transform is applied to the transformed elements of the dataset.
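To make the ordering concrete, here is a rough sketch of the idea behind composing transforms. `SimpleCompose` is a hypothetical stand-in written for this illustration, not torchvision’s implementation, and the two helper functions mirror `MultDivide` and `SubtractOne` with their default values.

```python
import torch

class SimpleCompose:
    """Hypothetical stand-in: applies a list of callables in order."""
    def __init__(self, transforms):
        self.transforms = transforms

    def __call__(self, sample):
        # Feed the output of each transform into the next one.
        for t in self.transforms:
            sample = t(sample)
        return sample

def mult_divide(sample):
    x, y = sample
    return x * 2, y / 3

def subtract_one(sample):
    x, y = sample
    return x - 1, y - 1

sample = (torch.tensor([3., 0.]), torch.tensor([1., 0., 0., 0.]))

# Same two transforms, opposite order.
x_a, _ = SimpleCompose([mult_divide, subtract_one])(sample)
x_b, _ = SimpleCompose([subtract_one, mult_divide])(sample)

print(x_a.tolist())  # [5.0, -1.0]  (multiply first, then subtract)
print(x_b.tolist())  # [4.0, -2.0]  (subtract first, then multiply)
```

The two results differ, so the order of the list passed to `Compose` is part of the preprocessing definition and should be chosen deliberately.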

We’ll pass the `Compose` object (which holds the combination of both transforms, i.e., `MultDivide()` and `SubtractOne()`) to our `SimpleDataset` object.

    # Creating a new simple_dataset object with multiple transforms
    new_dataset = SimpleDataset(transform = mult_transforms)

Now that the combination of multiple transforms has been applied to the dataset, let’s print out the first four elements of our transformed dataset.

    for i in range(4):
        x, y = dataset[i]
        print('Idx: ', i, 'Original_x: ', x, 'Original_y: ', y)
        x_, y_ = new_dataset[i]
        print('Idx: ', i, 'Transformed x_:', x_, 'Transformed y_:', y_)
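If you want to check the printed values by hand, the arithmetic for a single sample can be worked out directly. This sketch assumes the default parameters (multiply `x` by 2, divide `y` by 3, then subtract 1 from both) and uses the sample at index 1; the variable names are illustrative.

```python
import torch

# The untransformed sample at index 1 of the dataset.
x1 = torch.tensor([0., 3.])
y1 = torch.tensor([0., 1., 0., 0.])

# MultDivide first, then SubtractOne.
x_expected = x1 * 2 - 1
y_expected = y1 / 3 - 1

print(x_expected.tolist())                         # [-1.0, 5.0]
print([round(v, 4) for v in y_expected.tolist()])  # [-1.0, -0.6667, -1.0, -1.0]
```

The transformed values printed by the loop for index 1 should match these numbers up to floating-point rounding.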

Putting everything together, the complete code is as follows:

    import torch
    from torch.utils.data import Dataset
    from torchvision import transforms

    torch.manual_seed(2)

    class SimpleDataset(Dataset):
        # defining values in the constructor
        def __init__(self, data_length = 20, transform = None):
            self.x = 3 * torch.eye(data_length, 2)
            self.y = torch.eye(data_length, 4)
            self.transform = transform
            self.len = data_length

        # Getting the data samples
        def __getitem__(self, idx):
            sample = self.x[idx], self.y[idx]
            if self.transform:
                sample = self.transform(sample)
            return sample

        # Getting data size/length
        def __len__(self):
            return self.len

    # Creating a callable transform class MultDivide
    class MultDivide:
        # Constructor
        def __init__(self, mult_x = 2, divide_y = 3):
            self.mult_x = mult_x
            self.divide_y = divide_y

        # caller
        def __call__(self, sample):
            x = sample[0]
            y = sample[1]
            x = x * self.mult_x
            y = y / self.divide_y
            sample = x, y
            return sample

    # Creating subtract_one transform
    class SubtractOne:
        # Constructor
        def __init__(self, number = 1):
            self.number = number

        # caller
        def __call__(self, sample):
            x = sample[0]
            y = sample[1]
            x = x - self.number
            y = y - self.number
            sample = x, y
            return sample

    # Composing multiple transforms
    mult_transforms = transforms.Compose([MultDivide(), SubtractOne()])

    # Creating a new simple_dataset object with multiple transforms
    dataset = SimpleDataset()
    new_dataset = SimpleDataset(transform = mult_transforms)

    print("length of the simple_dataset object: ", len(dataset))
    print("accessing value at index 1 of the simple_dataset object: ", dataset[1])

    for i in range(4):
        x, y = dataset[i]
        print('Idx: ', i, 'Original_x: ', x, 'Original_y: ', y)
        x_, y_ = new_dataset[i]
        print('Idx: ', i, 'Transformed x_:', x_, 'Transformed y_:', y_)

In this tutorial, you learned how to create custom datasets and transforms in PyTorch. In particular, you learned:

- How to create a simple dataset class and apply transforms to it.
- How to build callable transforms and apply them to the dataset object.
- How to compose various transforms on a dataset object.