What is feature engineering? Definition, techniques and importance

What is feature engineering?

Feature engineering is the process of selecting, extracting, and transforming features from raw data to create a new dataset that is useful for building predictive models.

This new dataset is compatible with our algorithm and, thanks to our improvements, provides the model with more valuable information, consequently improving the model’s performance.

Why is feature engineering important?

Data is the core of machine learning. The structure of an artificial intelligence model depends on the data it is trained on.

When machine learning scientists work with a company, the data available to them is rarely perfect: there are null values, data types incompatible with ML models, disorganized features, outliers, and so on.

We must start building our model from the most meaningful and well-organized dataset possible.

Some feature engineering techniques

1. Transform categorical features

Many machine learning algorithms can’t handle categorical features. For example, in a medical regression problem with the feature sex (values male and female), a linear regression algorithm raises an error because it can’t operate on non-numeric values.

To solve this issue we can:

  • Drop the categorical feature.
  • Assign each possible categorical value a number (ordinal encoding).
  • Generate a new feature for each categorical value, assigning a value of 1 or 0 based on the original feature (one-hot encoding).

Which way is the best? Are there other ways to solve this issue? Full article here.
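
Whichever option we choose, the mechanics are simple in practice. Below is a minimal sketch using pandas; the dataset (a made-up sex/age table) and the column names are purely illustrative:

```python
import pandas as pd

# Hypothetical toy dataset; the column names and values are made up.
df = pd.DataFrame({
    "sex": ["male", "female", "female", "male"],
    "age": [34, 29, 51, 42],
})

# Ordinal encoding: assign each possible category an integer.
df["sex_ordinal"] = df["sex"].map({"male": 0, "female": 1})

# One-hot encoding: one indicator (dummy) column per category.
df_onehot = pd.get_dummies(df, columns=["sex"])
print(df_onehot)
```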

2. Handle missing values

Missing values are a big deal in machine learning. When an observation doesn’t have a value for a feature, the cell is filled with “NaN”.

Most machine learning algorithms can’t work with these values. We have to handle them, and there are two main ways:

  • Eliminate the samples with a missing value.
  • Replace the missing values with a substitute value (imputation), e.g. the mean of the feature.

Which way is the best? Are there other ways to solve this issue? Full article here.
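
Both options are one-liners in practice. Here is a minimal sketch using pandas and NumPy; the data is made up for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical data containing missing values (np.nan); values are made up.
df = pd.DataFrame({
    "age": [34, np.nan, 51, 42],
    "income": [40000, 52000, np.nan, 61000],
})

# Option 1: eliminate the samples (rows) that contain a missing value.
df_dropped = df.dropna()

# Option 2: impute, e.g. replace each NaN with the mean of its feature.
df_imputed = df.fillna(df.mean())
```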

3. Normalization

Let’s say that a numerical feature takes values in a range between 350 and 1450. Normalization allows us to rescale these values into the range between 0 and 1.

This technique is useful when we want to compare features measured on different scales. The formula for normalization is:

new_x = (x – min) / (max – min)

Where:

  • new_x is the normalized value
  • x is the original value
  • min is the minimum value of the feature
  • max is the maximum value of the feature
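
As a quick sketch, here is min-max normalization applied with pandas to a made-up feature spanning the 350 to 1450 range mentioned above:

```python
import pandas as pd

# Hypothetical feature with values between 350 and 1450 (made up).
df = pd.DataFrame({"value": [350, 700, 1100, 1450]})

# Min-max normalization: (x - min) / (max - min) maps values into [0, 1].
min_val = df["value"].min()
max_val = df["value"].max()
df["value_normalized"] = (df["value"] - min_val) / (max_val - min_val)
print(df)
```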

4. Standardization

Standardization is a technique that transforms each feature so that its mean is 0 and its standard deviation is 1. Like normalization, standardization helps us compare values across features. The formula for standardization is:

new_x = (x – μ) / σ

Where:

  • new_x is the standardized value
  • x is the original value
  • μ is the mean of the feature
  • σ is the standard deviation of the feature
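
A minimal sketch with pandas, reusing the same made-up feature (note that pandas’ .std() defaults to the sample standard deviation, so ddof=0 is passed here for the population version):

```python
import pandas as pd

# Same hypothetical feature as above (values are made up).
df = pd.DataFrame({"value": [350, 700, 1100, 1450]})

# Standardization: (x - mean) / std centers the feature at mean 0
# and gives it unit standard deviation.
mu = df["value"].mean()
sigma = df["value"].std(ddof=0)  # population standard deviation
df["value_standardized"] = (df["value"] - mu) / sigma
print(df)
```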