Data in machine learning: collection, types and structure

In machine learning, data can be described as a set of observations or measurements, called a dataset, used to train and test a machine learning model.

Data are crucial because artificial intelligence is built on them, and their quantity and quality drastically affect a model's accuracy.

Data collection

Data collection is the process of gathering observations from one or more sources.

When I start a machine learning project, I usually search on Kaggle, a huge machine-learning community where you can find high-quality datasets on various topics.

Dataset and features

We said that the set of materials useful for developing our algorithm is called a dataset.
But how is this information structured inside it?

It cannot be amassed without a defined structure, otherwise neither a human nor a machine would understand anything about it.

The dataset is organized into features: collections of data that all tell us about the same characteristic or property of something.

For example, in a hospital dataset on patients, some features might be gender, age, height, eye color, and so on.
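To make this concrete, here is a minimal sketch of such a dataset using pandas, where each column is a feature and each row is one observation (one patient). The column names and values are hypothetical, not taken from any real hospital data:

```python
import pandas as pd

# Hypothetical patient records: each column is a feature,
# each row is one observation (a patient).
patients = pd.DataFrame({
    "gender": ["F", "M", "F"],
    "age": [34, 61, 45],
    "height_cm": [165, 180, 158],
    "eye_color": ["brown", "blue", "green"],
})

print(patients.columns.tolist())
# → ['gender', 'age', 'height_cm', 'eye_color']
```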

Types of features

Looking back at the previous example, we can see that the features are of different kinds.

For height or age, the values are numeric and not restricted to a fixed set. Features like this are called numerical features.

Meanwhile, for gender, the feature can take only a limited, fixed set of values. Features like this are called categorical (or nominal) features.
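With pandas, the two kinds of features can be separated by their data type. This is a sketch on hypothetical patient columns, using `select_dtypes` to split numeric columns from the rest:

```python
import pandas as pd

# Hypothetical patient data mixing the two feature types.
df = pd.DataFrame({
    "age": [34, 61, 45],           # numerical
    "height_cm": [165, 180, 158],  # numerical
    "gender": ["F", "M", "F"],     # categorical
})

numerical = df.select_dtypes(include="number").columns.tolist()
categorical = df.select_dtypes(exclude="number").columns.tolist()

print(numerical)    # → ['age', 'height_cm']
print(categorical)  # → ['gender']
```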

Labeled data vs unlabeled data

Labeled dataset

In a labeled dataset, each input example is assigned an output (label).
For example, in the medical records seen earlier, there might have been a feature indicating whether the patient has a certain disease.

The model we would like to build should be able to predict, based on a patient's characteristics (input features), whether or not he or she has a certain disease. This approach is called supervised learning, and it is used for labeled data.
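A minimal supervised-learning sketch with scikit-learn, on invented data: the inputs are hypothetical `[age, height_cm]` pairs, and the label says whether the patient has the disease (1) or not (0). The model learns the input-output relationship from these pairs:

```python
from sklearn.tree import DecisionTreeClassifier

# Hypothetical labeled data: [age, height_cm] -> has_disease (0 or 1).
X = [[34, 165], [61, 180], [45, 158], [70, 172]]
y = [0, 1, 0, 1]  # the labels (output feature)

# Fit a simple classifier on the labeled examples.
model = DecisionTreeClassifier(random_state=0)
model.fit(X, y)

# Predict the label for a new, unseen patient.
print(model.predict([[65, 175]]))
```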

Unlabeled dataset

In an unlabeled dataset, there is no output feature.

This means that a model working on this data is not based on the relationship between the input and output, but only on the input data, trying to extract useful information from them. This approach is called unsupervised learning.
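An unsupervised sketch on invented, unlabeled points: k-means clustering groups the inputs by similarity alone, with no output feature to predict. The data below are hypothetical two-dimensional measurements forming two obvious groups:

```python
from sklearn.cluster import KMeans

# Unlabeled data: only input features, no output feature.
X = [[1.0, 1.1], [0.9, 1.0], [8.0, 8.2], [8.1, 7.9]]

# Group the points into 2 clusters based only on the inputs.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

print(labels)  # the first two points share one cluster, the last two the other
```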

Feature engineering and data split

However, the dataset containing our information is not yet ready. In fact, before feeding it to our model we have to follow two more steps:

  • Feature engineering, the main topic of this section of the blog.
  • Data split, the process of splitting the dataset into two smaller datasets, used for training and validation.
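The data split step can be sketched with scikit-learn's `train_test_split`, here on hypothetical data with ten samples, holding out 20% for validation:

```python
from sklearn.model_selection import train_test_split

# Hypothetical inputs and labels for 10 samples.
X = [[i] for i in range(10)]
y = [0, 1] * 5

# Hold out 20% of the data for validation; the rest is for training.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(len(X_train), len(X_val))  # → 8 2
```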