Data in machine learning: collection, types and structure

In machine learning, data is a set of observations or measurements, called a dataset, used to train and test a machine learning model.

Data are crucial because artificial intelligence is built on them, and their quantity and quality drastically affect a model's accuracy.

Data collection

Data collection is the process of gathering data from various sources.

When I start a machine learning project, I usually search on Kaggle, a huge machine-learning community where you can find high-quality datasets on various topics.

Dataset and features

We said that the set of information used to develop our algorithm is called a dataset.
But how is a dataset structured?

It cannot be an unstructured heap of data, otherwise neither a human nor a machine could make sense of it.

The dataset is organized into features (or variables): each feature is a set of observations about the same phenomenon or characteristic.

Sex     Age   Height
Male    28    185 cm
Female  19    169 cm
Male    54    172 cm

In this medical dataset, we have 3 different variables.
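As a minimal sketch, this dataset could be represented in Python as a list of records, where each key is a feature (the key names are illustrative):

```python
# Each row is an observation; each key is a feature (variable).
dataset = [
    {"sex": "Male",   "age": 28, "height_cm": 185},
    {"sex": "Female", "age": 19, "height_cm": 169},
    {"sex": "Male",   "age": 54, "height_cm": 172},
]

# The feature names are the keys shared by every observation.
features = list(dataset[0].keys())
print(features)  # ['sex', 'age', 'height_cm']
```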

Types of features

Looking at the previous example, we can see that the features are of different types.

Numerical features

For height or age, the values are numeric and not limited to a fixed set. Features like this are called numerical features.

Categorical features

Meanwhile, for sex, the feature can take on only a limited, fixed set of values. Features like this are called "categorical features" and the possible values "categories".

There are 2 types of categorical features:

  • Ordinal data have a qualitative order.
    ex. bad, mid, good…
  • Nominal data don’t have a qualitative order.
    ex. red, green, blue…
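Most models expect numbers, so categorical features are usually encoded. As a sketch of the distinction above: ordinal categories can map to ordered integers, while nominal categories are typically one-hot encoded (the mappings here are illustrative):

```python
# Ordinal: the order is meaningful, so map categories to integers.
QUALITY_ORDER = {"bad": 0, "mid": 1, "good": 2}
ratings = ["good", "bad", "mid"]
encoded_ratings = [QUALITY_ORDER[r] for r in ratings]
print(encoded_ratings)  # [2, 0, 1]

# Nominal: no meaningful order, so one-hot encode instead.
COLORS = ["red", "green", "blue"]
def one_hot(value, categories):
    return [1 if value == c else 0 for c in categories]

print(one_hot("green", COLORS))  # [0, 1, 0]
```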

Labeled data vs unlabeled data

Labeled dataset

In a labeled dataset, each input observation is assigned an output (a label).

These outputs are in the target (or output) feature.

We want to build a model that can predict the output value given an input observation. This approach is called supervised learning.
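As a sketch, a labeled dataset can be separated into input features (often called X) and the target feature (often called y); the feature names here are illustrative:

```python
# A labeled dataset: each row pairs input features with a target label.
labeled = [
    {"age": 28, "height_cm": 185, "sex": "Male"},
    {"age": 19, "height_cm": 169, "sex": "Female"},
]
TARGET = "sex"  # the target (output) feature

# Split each observation into inputs (X) and outputs (y).
X = [{k: v for k, v in row.items() if k != TARGET} for row in labeled]
y = [row[TARGET] for row in labeled]
print(y)  # ['Male', 'Female']
```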

Unlabeled dataset

In an unlabeled dataset, there is no output feature.

This means that we can only use an algorithm that finds meaningful patterns within the input data. This approach is called unsupervised learning.
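A classic example of finding patterns without labels is clustering. Here is a minimal 1-D k-means sketch (the data and initialization are illustrative, not a production implementation):

```python
# Unlabeled data: just heights, no target feature.
heights = [150, 152, 155, 180, 182, 185]

def kmeans_1d(values, k=2, iters=10):
    # Naive initialization: start centers at the extremes.
    centers = [min(values), max(values)]
    clusters = []
    for _ in range(iters):
        # Assign each value to its nearest center.
        clusters = [[] for _ in range(k)]
        for v in values:
            i = min(range(k), key=lambda j: abs(v - centers[j]))
            clusters[i].append(v)
        # Move each center to the mean of its cluster.
        centers = [sum(c) / len(c) for c in clusters if c]
    return centers, clusters

centers, clusters = kmeans_1d(heights)
print(sorted(round(c) for c in centers))  # [152, 182]
```

The algorithm discovers two groups (shorter and taller people) even though no labels were given.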

Feature engineering and data split

Before feeding our dataset into our model, we have to follow two steps:

  • Feature engineering, the main topic of this section of the blog.
  • Data split, the process of splitting the dataset into 2 smaller datasets, used for training and validation.