Feature Engineering: Transforming Predictive Models
Delve into the transformative power of feature engineering in applied machine learning, and learn how carefully crafted features can elevate your models.
Imagine you’re building a model to predict house prices. Two models are identical in every respect except one: the first uses raw data, while the second leverages thoughtfully engineered features like the age of the house, proximity to schools, and seasonal price trends. Which model do you think performs better? The answer is intuitive: the latter.
Feature engineering is the process of using domain knowledge to create features that make machine learning algorithms work more effectively. It bridges the gap between raw data and the insights needed to drive decision-making. In this article, we’ll explore how feature engineering can significantly impact the performance of your predictive models.
Predictive models are algorithms that forecast future outcomes based on historical data. They leverage techniques such as regression (for predicting continuous outcomes), classification (for categorizing data), clustering (for grouping similar data), time series analysis (for sequential data), and more advanced methods like neural networks, reinforcement learning, and ensemble methods. These models identify patterns in past data to make informed predictions about new or unseen data.
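As a quick, hypothetical illustration (the numbers below are made up purely for demonstration), here is a minimal sketch of a predictive model: a linear regression fitted on a tiny housing dataset and used to predict the price of an unseen house.
import pandas as pd
from sklearn.linear_model import LinearRegression
# Tiny, made-up housing dataset (illustrative values only)
houses = pd.DataFrame({
    'sqft': [850, 1200, 1500, 2000, 2400],
    'bedrooms': [2, 3, 3, 4, 4],
    'price': [150000, 210000, 260000, 340000, 400000]
})
# Fit a simple regression model on historical data
model = LinearRegression()
model.fit(houses[['sqft', 'bedrooms']], houses['price'])
# Predict the price of a new, unseen house
print(model.predict(pd.DataFrame({'sqft': [1800], 'bedrooms': [3]})))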
"Prediction is very difficult, especially if it's about the future."
- Niels Bohr
The Basics of Feature Engineering
What Is Feature Engineering?
At its core, feature engineering involves transforming raw data into meaningful features that better represent the underlying problem to the predictive models. These features help algorithms discern patterns and make accurate predictions.
Raw Data vs. Engineered Features
Raw Data is the original, unprocessed data collected from various sources. It often contains noise and inconsistencies and lacks the structure required for effective modeling.
Engineered Features are derived attributes created by processing raw data. They encapsulate domain-specific knowledge and highlight relevant aspects of the data.
The Feature Engineering Workflow
- Data collection: Gather raw data from various sources.
- Data cleaning: Handle missing values, remove duplicates, and correct inconsistencies.
- Feature creation: Generate new features through transformations, aggregations, or domain-specific computations.
- Feature transformation: Apply scaling, encoding, or normalization techniques.
- Feature selection: Identify and retain the most relevant features for the model.
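As a rough end-to-end illustration of this workflow (the column names and values are made up purely for demonstration), here is a minimal sketch that derives a house-age feature, cleans and scales the data, and keeps only the engineered columns.
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
# 1. Data collection: raw listings (hypothetical values)
raw = pd.DataFrame({
    'year_built': [1995, 2010, None, 1980, 2010],
    'distance_to_school_km': [1.2, 0.5, 3.4, 1.2, 0.5],
    'price': [300000, 420000, 250000, 310000, 420000]
})
# 2. Data cleaning: fill missing values and drop duplicate rows
clean = raw.fillna({'year_built': raw['year_built'].median()}).drop_duplicates().copy()
# 3. Feature creation: derive the age of the house from the year built
clean['house_age'] = 2024 - clean['year_built']  # reference year is illustrative
# 4. Feature transformation: scale numeric features to [0, 1]
scaler = MinMaxScaler()
clean[['house_age_scaled', 'distance_scaled']] = scaler.fit_transform(
    clean[['house_age', 'distance_to_school_km']])
# 5. Feature selection: keep only the engineered, scaled features
features = clean[['house_age_scaled', 'distance_scaled']]
print(features)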
Feature Transformation Techniques
Effective feature transformation can reveal hidden patterns and enhance model performance. Let’s explore some common techniques with practical examples using Python’s pandas and scikit-learn libraries.
1. Normalization and Scaling
Normalization and scaling are crucial techniques in preprocessing numerical data to ensure that features with different units or ranges don’t disproportionately influence the model. Normalization typically rescales values to a specific range, often [0, 1], making all features comparable and minimizing bias caused by large differences in magnitude. Scaling, particularly standardization, adjusts the distribution of values by centering the data around the mean and scaling it based on standard deviation, resulting in a mean of 0 and a standard deviation of 1. This is especially important for models that rely on distance metrics (like KNN or SVM) or gradient-based optimization (like neural networks) to avoid skewed results due to differing ranges in feature values.
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler
# Sample DataFrame
data = {'Age': [25, 32, 47, 51, 62],
        'Income': [50000, 60000, 80000, 90000, 120000]}
df = pd.DataFrame(data)
# Min-Max Scaling
scaler = MinMaxScaler()
df[['Age_scaled', 'Income_scaled']] = scaler.fit_transform(df[['Age', 'Income']])
print(df)
- Output:
Age Income Age_scaled Income_scaled
0 25 50000 0.000000 0.000000
1 32 60000 0.142857 0.142857
2 47 80000 0.428571 0.428571
3 51 90000 0.500000 0.500000
4 62 120000 1.000000 1.000000
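The same columns can instead be standardized with StandardScaler (imported above), which centers each feature at mean 0 with standard deviation 1; a brief continuation of the snippet:
# Standardization (z-score scaling): mean 0, standard deviation 1
std_scaler = StandardScaler()
df[['Age_std', 'Income_std']] = std_scaler.fit_transform(df[['Age', 'Income']])
print(df[['Age', 'Age_std', 'Income', 'Income_std']])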
2. Polynomial Features
Polynomial features expand the input feature set by adding higher-degree terms, such as squares, cubes, or interactions between features. This technique is particularly useful when the relationship between the features and the target variable is non-linear. For instance, in linear regression, adding polynomial terms allows the model to fit more complex curves rather than straight lines, improving the model’s ability to capture intricate patterns in the data. While polynomial features can significantly enhance the model’s performance on non-linear problems, they can also increase the complexity of the model and risk overfitting, so careful use and regularization are often necessary.
from sklearn.preprocessing import PolynomialFeatures
# Original features
X = df[['Age_scaled']]
# Generate polynomial features up to degree 2
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)
# Generated feature names: ['Age_scaled', 'Age_scaled^2']
poly_features = poly.get_feature_names_out(['Age_scaled'])
df['Age_scaled_squared'] = X_poly[:, 1]  # second column is the squared term
print(df)
- Output:
Age Income Age_scaled Income_scaled Age_scaled_squared
0 25 50000 0.000000 0.000000 0.000000
1 32 60000 0.142857 0.142857 0.020408
2 47 80000 0.428571 0.428571 0.183673
3 51 90000 0.500000 0.500000 0.250000
4 62 120000 1.000000 1.000000 1.000000
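PolynomialFeatures can also generate interaction terms between features; here is a short sketch, continuing with the same DataFrame, that adds only the Age-Income interaction:
# Interaction terms only (no squared terms)
interaction = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
X_inter = interaction.fit_transform(df[['Age_scaled', 'Income_scaled']])
# Columns are: Age_scaled, Income_scaled, Age_scaled * Income_scaled
df['Age_x_Income'] = X_inter[:, 2]
print(df[['Age_scaled', 'Income_scaled', 'Age_x_Income']])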
3. Encoding Categorical Variables
In machine learning, most algorithms require numerical input, but real-world datasets often contain categorical variables — variables that represent categories or groups (e.g., color, city names, product type). Encoding categorical variables involves converting these text-based categories into numerical values so that machine learning models can process them. There are various methods for encoding, with two common techniques being one-hot encoding and label encoding. One-hot encoding creates new binary columns for each category, which is useful when categories have no ordinal relationship. Label encoding, on the other hand, assigns a unique integer to each category but may introduce unintended ordinal relationships. Choosing the appropriate encoding method is crucial for improving the performance and accuracy of the model, as poorly encoded categorical variables can negatively impact predictions.
- One-hot encoding:
# Sample DataFrame with categorical feature
data = {'City': ['New York', 'Los Angeles', 'Chicago', 'New York', 'Chicago']}
df = pd.DataFrame(data)
# One-Hot Encoding using pandas
df_encoded = pd.get_dummies(df, columns=['City'])
print(df_encoded)
- Output:
City_Chicago City_Los Angeles City_New York
0 0 0 1
1 0 1 0
2 1 0 0
3 0 0 1
4 1 0 0
- Label Encoding:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df['City_encoded'] = le.fit_transform(df['City'])
print(df)
- Output:
City City_encoded
0 New York 2
1 Los Angeles 1
2 Chicago 0
3 New York 2
4 Chicago 0
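For use inside scikit-learn pipelines, the same one-hot step is usually done with scikit-learn's OneHotEncoder; a minimal sketch (the sparse_output argument assumes scikit-learn 1.2+; older versions use sparse=False):
from sklearn.preprocessing import OneHotEncoder
# One-hot encode the City column with scikit-learn
ohe = OneHotEncoder(sparse_output=False, handle_unknown='ignore')  # use sparse=False on scikit-learn < 1.2
encoded = ohe.fit_transform(df[['City']])
encoded_df = pd.DataFrame(encoded, columns=ohe.get_feature_names_out(['City']))
print(encoded_df)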
4. Log Transformations for Skewed Data
Log transformations are commonly used when dealing with skewed data — data that exhibits a long tail in one direction, either left (negatively skewed) or right (positively skewed). Skewed data can lead to models that perform poorly because they are overly influenced by extreme values. By applying a log transformation, you can compress the range of the data, making it more normally distributed, which helps certain algorithms (like linear regression) perform better. This technique is particularly helpful when dealing with variables like income or sales, where a small number of high values can disproportionately impact the model. It stabilizes variance and reduces the impact of outliers.
import numpy as np
# Sample skewed data
data = {'Sales': [100, 150, 200, 250, 300, 1000]}
df = pd.DataFrame(data)
# Apply log transformation
df['Sales_log'] = np.log(df['Sales'])
print(df)
- Output:
Sales Sales_log
0 100 4.605170
1 150 5.010635
2 200 5.298317
3 250 5.521461
4 300 5.703782
5 1000 6.907755
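When the column can contain zeros (where the logarithm is undefined), a common variant is np.log1p, which computes log(1 + x); a quick sketch:
# log1p handles zero values safely: log(1 + x)
df['Sales_log1p'] = np.log1p(df['Sales'])
print(df)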
Feature Selection
Not all features contribute positively to the model’s performance. Feature selection involves identifying and retaining the most relevant features while discarding the rest. This process can prevent overfitting, reduce complexity, and improve model interpretability.
1. Variance Threshold
A variance threshold is a simple feature selection technique used to remove features with low variability, which typically contribute little to model performance. Features with zero or near-zero variance have nearly identical values across all data points, meaning they provide minimal information and are unlikely to help the model make distinctions between different classes or predict the target variable. By applying a variance threshold, we can filter out these low-variance features, reducing model complexity and potentially improving both training speed and prediction accuracy.
from sklearn.feature_selection import VarianceThreshold
# Sample DataFrame
data = {'Feature1': [0, 0, 0, 0, 0],
        'Feature2': [1, 2, 3, 4, 5],
        'Feature3': [10, 10, 10, 10, 10]}
df = pd.DataFrame(data)
# Apply Variance Threshold
selector = VarianceThreshold(threshold=0.1)
selector.fit(df)
# Get columns to keep
cols = df.columns[selector.get_support()]
df_selected = df[cols]
print(df_selected)
- Output:
Feature2
0 1
1 2
2 3
3 4
4 5
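One caveat: variance depends on each feature's scale, so unscaled features with large ranges can dominate the threshold. A common pattern, sketched below with the same DataFrame, is to scale features to [0, 1] before applying the threshold:
from sklearn.preprocessing import MinMaxScaler
# Scale to [0, 1] first so variances are comparable across features
scaled = pd.DataFrame(MinMaxScaler().fit_transform(df), columns=df.columns)
selector = VarianceThreshold(threshold=0.05)
selector.fit(scaled)
print(df.columns[selector.get_support()].tolist())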
2. Correlation Matrix
A correlation matrix is a table that shows the correlation coefficients between multiple variables in a dataset. Correlation values range from -1 (perfect negative correlation) to +1 (perfect positive correlation), with 0 indicating no correlation. In feature selection, a correlation matrix helps identify highly correlated features, which can introduce redundancy and multicollinearity in models like linear regression. By examining the matrix, you can remove one of the features that are highly correlated (typically above 0.95) to simplify the model without losing much predictive power. This step helps reduce overfitting and improves the interpretability of the model.
import seaborn as sns
import matplotlib.pyplot as plt
# Sample DataFrame
data = {'A': [1, 2, 3, 4, 5],
        'B': [2, 4, 6, 8, 10],
        'C': [5, 3, 6, 9, 12],
        'D': [5, 3, 6, 9, 12]}
df = pd.DataFrame(data)
# Compute the absolute correlation matrix
corr_matrix = df.corr().abs()
# Visualize it as a heatmap
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm')
plt.show()
# Select the upper triangle (avoids checking each pair twice)
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
# Find features with correlation > 0.95
to_drop = [column for column in upper.columns if any(upper[column] > 0.95)]
# Drop features
df_reduced = df.drop(columns=to_drop)
print("Features to drop:", to_drop)
print(df_reduced)
- Output:
Features to drop: ['B', 'D']
A C
0 1 5
1 2 3
2 3 6
3 4 9
4 5 12
3. Recursive Feature Elimination (RFE)
Recursive Feature Elimination (RFE) is a feature selection technique that works by recursively fitting a model and eliminating the least important features based on the model's coefficients or importance scores. It starts with all features and systematically removes the least significant one, retraining the model each time until the desired number of features is reached. RFE is commonly used with linear models, decision trees, or random forests to rank and retain the most relevant features for the problem at hand. This method ensures that the model only uses the most valuable features, which can improve performance, reduce overfitting, and enhance interpretability.
from sklearn.datasets import load_boston
from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import RFE
# Load dataset
# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this snippet requires an older scikit-learn version.
boston = load_boston()
X = pd.DataFrame(boston.data, columns=boston.feature_names)
y = boston.target
# Initialize model
model = LinearRegression()
# Initialize RFE
rfe = RFE(model, n_features_to_select=5)
rfe.fit(X, y)
# Get selected features
selected_features = X.columns[rfe.support_]
print("Selected Features:", selected_features.tolist())
- Output:
Selected Features: ['RM', 'PTRATIO', 'LSTAT', 'DIS', 'NOX']
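On scikit-learn 1.2 and later, where load_boston is no longer available, the same RFE pattern can be applied to another bundled regression dataset such as load_diabetes; a minimal sketch (the selected features will naturally differ from the Boston example above):
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import RFE
# Load a regression dataset that ships with current scikit-learn
diabetes = load_diabetes(as_frame=True)
X, y = diabetes.data, diabetes.target
# Keep the 5 features ranked most important by the linear model
rfe = RFE(LinearRegression(), n_features_to_select=5)
rfe.fit(X, y)
print("Selected Features:", X.columns[rfe.support_].tolist())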
Automation With Feature Engineering Tools
Manually crafting features can be time-consuming, especially with large datasets. Automation tools can expedite this process, though they come with their own set of advantages and limitations.
FeatureTools: Automated Feature Engineering
FeatureTools is an open-source Python library for automated feature engineering.
import pandas as pd
import featuretools as ft
# Sample data
customers = pd.DataFrame({
    'customer_id': [1, 2, 3],
    'join_date': pd.to_datetime(['2023-01-01', '2023-02-15', '2023-03-20'])
})
transactions = pd.DataFrame({
    'transaction_id': [101, 102, 103, 104, 105],
    'customer_id': [1, 2, 1, 3, 2],
    'amount': [250, 450, 300, 150, 500],
    'transaction_date': pd.to_datetime(['2023-01-10', '2023-02-20', '2023-01-15', '2023-03-25', '2023-02-22'])
})
# Create EntitySet
es = ft.EntitySet(id='Customers')
# Add entities
es = es.add_dataframe(dataframe_name='customers',
                      dataframe=customers,
                      index='customer_id',
                      time_index='join_date')
es = es.add_dataframe(dataframe_name='transactions',
                      dataframe=transactions,
                      index='transaction_id',
                      time_index='transaction_date')
# Define relationship
relationship = ft.Relationship(es,
                               parent_dataframe_name='customers',
                               parent_column_name='customer_id',
                               child_dataframe_name='transactions',
                               child_column_name='customer_id')
es = es.add_relationship(relationship)
# Automatically generate features
feature_matrix, feature_defs = ft.dfs(entityset=es,
                                      target_dataframe_name='customers',
                                      agg_primitives=['sum', 'mean', 'count'],
                                      trans_primitives=['year', 'month'])
print(feature_matrix)
- Output:
| customer_id | join_date | transactions.sum(amount) | transactions.mean(amount) | transactions.count(transaction_id) | transactions.year(transaction_date) | transactions.month(transaction_date) |
|---|---|---|---|---|---|---|
| 1 | 2023-01-01 | 550 | 275 | 2 | 2023 | 1 |
| 2 | 2023-02-15 | 950 | 475 | 2 | 2023 | 2 |
| 3 | 2023-03-20 | 150 | 150 | 1 | 2023 | 3 |
Pros
- Efficiency: Rapidly generates a large number of features
- Consistency: Applies standardized transformations
- Scalability: Handles large datasets with ease
Cons
- Overfitting risk: Automated features might introduce noise or redundant information.
- Lack of domain insight: May miss domain-specific nuances that manual feature engineering can capture.
- Computational overhead: Generating numerous features can be resource-intensive.
Conclusion
Feature engineering stands as a cornerstone of effective machine learning. By transforming and selecting the right features, you empower your models to recognize intricate patterns and deliver accurate predictions. Whether through manual ingenuity or automation tools, the essence remains the same: understanding your data and domain is paramount.
As machine learning continues to permeate diverse industries, the ability to craft meaningful features will distinguish adept practitioners from the rest. Embrace feature engineering not just as a task, but as an art that blends data science with domain expertise to sculpt models that truly resonate with real-world complexities.
Happy Feature Engineering!