For many organizations, data is an overwhelming wave of information. It’s a chaotic mess that’s impossible to realize any benefit from. With feature engineering, organizations can make sense of their data and turn it into something beneficial.
The term feature engineering refers to the process of applying domain knowledge to data by generating features that transform the data to make it easier to understand and interpret. It usually occurs after the data gathering and cleaning process and before training machine learning models.
Feature engineering is often part of the ML problem solving workflow:
- Gather data
- Clean it
- Perform feature engineering
- Define the model
- Train the model
- Run tests
- Predict the output
Most of the information used by Artificial Intelligence (AI) is contained in tables. Each row is an observation, and each column is a feature. Unfortunately, the data is often complicated, irrelevant, missing, or duplicated.
Feature engineering provides a process for transforming data into a format that better represents the underlying problem. To do this, it makes the data more digestible, putting data into categories to better reflect a finite set of outcomes or systematically replacing missing values with appropriate estimations.
This process of transforming data with feature engineering is often as much of an art as a science. For example, a business may want to predict instances of fraud. Raw timestamped transactions could be entered into AI software, but the output may not be meaningful or actionable. However, a bit of domain expertise helps the data scientist. The scientist, using their knowledge of retail, creates a new feature that differentiates between the work week and weekends, as there are always spikes in retail activity during the weekend. Once that context is manually established, models are better able to spot anomalies, with fewer false positives. That’s the ‘art’ of feature engineering.
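The weekend feature described above can be sketched in a few lines. This is a minimal illustration using only the Python standard library; the field names `timestamp`, `amount`, and `is_weekend` and the sample transactions are assumptions made for the example, not part of any particular product.

```python
from datetime import datetime

def add_weekend_flag(transactions):
    """Return copies of the transactions with an added 'is_weekend' feature."""
    enriched = []
    for tx in transactions:
        ts = datetime.fromisoformat(tx["timestamp"])
        # weekday() returns 0 for Monday through 6 for Sunday
        enriched.append({**tx, "is_weekend": ts.weekday() >= 5})
    return enriched

# Hypothetical sample data
transactions = [
    {"timestamp": "2024-03-01T14:30:00", "amount": 42.50},   # a Friday
    {"timestamp": "2024-03-02T11:05:00", "amount": 310.00},  # a Saturday
]
flagged = add_weekend_flag(transactions)
```

A model that receives `is_weekend` alongside the raw amount can learn that a large Saturday purchase is less unusual than the same purchase on a weekday.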
Done correctly, feature engineering amplifies the predictive power of Machine Learning (ML) algorithms. It achieves this by fashioning features out of raw data that feed and facilitate the ML process. It can be the differentiator between a good data model and a bad one.
Breaking it down further, the feature engineering part comprises the following steps:
- Brainstorm new, possible features for the model
- Create the features
- Test how efficiently these features work with the model
- Tweak the features, repeat, or go back to the drawing board as needed
- Get the features to work seamlessly with the model
Feature engineering should not be considered a one-time step. It can be used throughout the data science process to either clean data or enhance existing results. Feature engineering is an iterative process that is interwoven between data selection, model evaluation, and re-evaluation. The process continues until the data is in a format that is ingestible by ML models and enables those models to output actionable results.
Examples Of Feature Engineering For Machine Learning
ML algorithms learn solutions to specific problems using the sample data they are presented with. Feature engineering helps an organization arrange the best representation of their sample data to give the model a chance to learn the solution to any specific problem.
In feature engineering, representation and relationships matter, and there are four common engineering strategies:
- Resampling imbalanced data
- Creating new features
- Managing missing values
- Detecting outliers
Resampling Imbalanced Data
In its raw form, data is usually imbalanced. Most of the time this can easily be resolved with validation techniques. But sometimes the imbalance can be large, affecting the outputs. Feature engineering can resolve this by artificially generating samples in the minority groups. These samples can be used to help address variability or uncertainty in the data.
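One simple way to generate extra minority samples is random oversampling, which duplicates existing minority rows until the classes are the same size. The sketch below assumes rows stored as dictionaries with a hypothetical `is_fraud` label. It is the most basic form of resampling; techniques that synthesize new, non-identical samples (such as SMOTE) work differently and are not shown here.

```python
import random
from collections import Counter

def oversample_minority(rows, label_key, seed=0):
    """Randomly duplicate rows of smaller classes until every class
    has as many rows as the largest class."""
    rng = random.Random(seed)
    by_class = {}
    for row in rows:
        by_class.setdefault(row[label_key], []).append(row)
    target = max(len(group) for group in by_class.values())
    balanced = []
    for group in by_class.values():
        balanced.extend(group)
        # Sample with replacement to fill the gap (k is 0 for the largest class)
        extra = rng.choices(group, k=target - len(group))
        balanced.extend(dict(row) for row in extra)
    return balanced

# Hypothetical sample: 8 legitimate transactions and 2 fraudulent ones
rows = [{"amount": 10 * i, "is_fraud": 0} for i in range(1, 9)]
rows += [{"amount": 900, "is_fraud": 1}, {"amount": 1200, "is_fraud": 1}]

balanced = oversample_minority(rows, label_key="is_fraud")
counts = Counter(row["is_fraud"] for row in balanced)
```

Resampling like this is normally applied only to the training data, so that evaluation still reflects the real class balance.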
Creation of New Features
Creating new features can just be restating data in a different format to match the context of the question. For example, a company may have the departure and arrival times for trains and turn them into total travel time. Combining the timestamps into one new feature enables the algorithm to fit the business need and produce more actionable results.
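As a rough sketch of the train example, the two timestamps can be collapsed into a single duration. The field names and sample trips below are hypothetical.

```python
from datetime import datetime

def travel_minutes(departure, arrival):
    """Derive total travel time in minutes from two ISO 8601 timestamps."""
    dep = datetime.fromisoformat(departure)
    arr = datetime.fromisoformat(arrival)
    return (arr - dep).total_seconds() / 60

# Hypothetical sample data, including a trip that crosses midnight
trips = [
    {"departure": "2024-03-01T08:15:00", "arrival": "2024-03-01T10:45:00"},
    {"departure": "2024-03-01T23:30:00", "arrival": "2024-03-02T01:00:00"},
]
for trip in trips:
    trip["travel_minutes"] = travel_minutes(trip["departure"], trip["arrival"])
```

Because the full date is kept in each timestamp, the overnight trip still yields a positive duration.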
Users can also combine two moderately useful features, or two features that are not useful on their own, to create one feature that helps the machine learn better. An example of this is in healthcare, where a variety of risk factors are present but, on their own, don’t indicate a likelihood of a medical event. For example, age, hypertension, and being a smoker individually don’t predict having a stroke, but the three factors together do.
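A combined feature of this kind might be sketched as a single flag that is set only when all three factors are present. The age threshold of 65 and the field names are illustrative assumptions, not clinical guidance.

```python
def combined_risk(patient, age_threshold=65):
    """Return 1 if the patient has all three risk factors at once, else 0.

    The threshold is a hypothetical value chosen for illustration only.
    """
    has_all = (
        patient["age"] >= age_threshold
        and bool(patient["hypertension"])
        and bool(patient["smoker"])
    )
    return int(has_all)

# Hypothetical sample data
patients = [
    {"age": 70, "hypertension": True, "smoker": True},
    {"age": 70, "hypertension": True, "smoker": False},
    {"age": 40, "hypertension": True, "smoker": True},
]
risk_flags = [combined_risk(p) for p in patients]
```

Only the first patient is flagged, since the other two are each missing one of the three factors.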
Feature selection is simply about picking the right independent features that correlate the most with the dependent feature. All these things combine to make the best possible predictive model. Heatmaps, univariate selection, and the ExtraTreesClassifier method are all tried-and-tested methods for identifying the features that are related appropriately.
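A correlation heatmap is essentially a picture of pairwise correlation scores, so the idea behind it can be sketched by ranking features on their absolute Pearson correlation with the target. This is a simplified stand-in written with the standard library, not the ExtraTreesClassifier method, and the column names and values are made up for the example.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / sqrt(var_x * var_y)

def rank_features(columns, target):
    """Sort feature names by absolute correlation with the target, strongest first."""
    scores = {name: abs(pearson(values, target)) for name, values in columns.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical sample data
columns = {
    "seat_number": [12, 3, 27, 8, 15],
    "travel_minutes": [30, 45, 60, 90, 120],
}
ticket_price = [5.0, 7.5, 10.0, 15.0, 20.0]
ranking = rank_features(columns, ticket_price)
```

Here `travel_minutes` ranks first because the price rises in step with it, while `seat_number` shows only a weak relationship. Correlation captures linear relationships only, which is one reason tree-based importance scores are often used alongside it.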
Feature engineering also helps pick which buckets to create so that the machine can accurately map relevant data to the right bucket. This includes weeding out unwanted features and noise, which helps the model function more smoothly.
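Bucketing can be sketched as mapping a continuous value onto a small set of labeled ranges. The bucket edges and labels below are hypothetical; in practice they would come from domain knowledge about the business question.

```python
from bisect import bisect_right

# Hypothetical bucket boundaries: each edge is the lowest age of the next bucket
AGE_EDGES = [18, 35, 50, 65]
AGE_LABELS = ["under_18", "18_34", "35_49", "50_64", "65_plus"]

def age_bucket(age):
    """Map an age to its bucket label."""
    # bisect_right counts how many edges are less than or equal to the age
    return AGE_LABELS[bisect_right(AGE_EDGES, age)]

buckets = [age_bucket(age) for age in [17, 18, 34, 35, 65]]
```

Values that fall exactly on an edge go into the higher bucket, so 18 lands in `18_34` rather than `under_18`.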
Managing Missing Values
Missing values are a frequent problem in data, but there are many methods for adequately resolving them during the data cleansing process.
There are also several advanced engineering techniques that can use existing data to accurately recreate missing values and complete the dataset, ensuring the data is in a form that models can better utilize.
One method is data deletion. With this method, feature engineers can remove samples that have missing values. This works best when only a few samples are incomplete. The more missing values a dataset contains, the more problematic this method becomes.
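A minimal sketch of deletion, assuming missing values are represented as `None` and using hypothetical field names:

```python
def drop_incomplete(rows):
    """Keep only the rows in which every field has a value."""
    return [row for row in rows if all(value is not None for value in row.values())]

# Hypothetical sample data
customers = [
    {"age": 34, "income": 52000},
    {"age": None, "income": 61000},
    {"age": 45, "income": None},
    {"age": 29, "income": 48000},
]
complete = drop_incomplete(customers)
share_kept = len(complete) / len(customers)
```

In this example half of the rows are discarded, which illustrates why deletion becomes costly as the proportion of missing values grows.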
Another technique involves replacing missing data with the mean or median of the variable. While this approach resolves missing data, it can skew the results. If the data has a Gaussian distribution, the missing values could instead be imputed by a model (a model within a model) so that they match the normal distribution.
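Mean and median replacement can be sketched as follows, again assuming `None` marks a missing value. The income figures are invented to show how a single extreme value pulls the two estimates apart.

```python
from statistics import mean, median

def impute(values, strategy="median"):
    """Replace None entries with the mean or median of the observed values."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute a column with no observed values")
    fill = median(observed) if strategy == "median" else mean(observed)
    return [fill if v is None else v for v in values]

# Hypothetical sample data with one missing entry and one very high income
incomes = [32000, 41000, None, 38000, 250000]
with_median = impute(incomes, strategy="median")
with_mean = impute(incomes, strategy="mean")
```

The median fills the gap with 39,500, while the mean fills it with 90,250 because of the single high earner. This is the kind of skew the choice of strategy can introduce.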
These are the two main methods. While there are other methods that can be used to manage missing values, the general approach is to remove data or impute estimated values.
Detecting Outliers
Outlier detection is another process that crosses the cleansing/engineering barrier. In the data cleansing step, AI may simply remove the outliers, suggesting they are errors, or a sample that’s not relevant to the data. However, that’s a blunt tool and could miss essential information.
In data science, data handling and data processing are key factors that influence a model’s performance. A model built without proper data handling might, for example, reach an accuracy of only about 70%. When feature engineering is applied to the same model, the performance can greatly improve.
But a good understanding of the data is still needed for feature engineering as it allows a data scientist to specify thresholds where the data is still logical. For example, a business may have a customer who is 100 years old but definitely not 1,000 years old. A machine may disregard both data points while a data scientist knows the extra zero is likely an input error.
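The age example can be sketched as a domain threshold that keeps plausible values and flags implausible ones for review instead of silently dropping them. The limit of 120 and the "extra zero" heuristic are assumptions made for illustration; a real rule would be set by someone who knows the data.

```python
MAX_PLAUSIBLE_AGE = 120  # hypothetical domain threshold

def check_age(age):
    """Return (status, suggestion) for a recorded integer age.

    'valid' values are kept as they are. 'review' values are outside the
    plausible range; the suggestion is a candidate correction or None.
    """
    if 0 <= age <= MAX_PLAUSIBLE_AGE:
        return ("valid", age)
    # Heuristic: an extra trailing zero is a likely typo, so offer the
    # value without it if that would be plausible
    if age % 10 == 0 and 0 <= age // 10 <= MAX_PLAUSIBLE_AGE:
        return ("review", age // 10)
    return ("review", None)

results = [check_age(age) for age in [100, 1000, 457]]
```

The 100-year-old customer is kept, the 1,000-year-old record is flagged with a suggested correction of 100, and 457 is flagged with no suggestion.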
This part of the feature engineering process can be long and frustrating, and it relies on the skill and domain knowledge of a data scientist. This is why some view feature engineering in ML as nothing less than an art form.
Advantages Of Feature Engineering
As the adage goes, AI and ML models are only as good as the data they receive. Including feature engineering in the modeling process helps ensure that the data models receive has the quality and relevance needed to solve real-world problems. But there are two important things to keep in mind as you proceed:
- Framing The Problem Correctly: Using the right objective measures to estimate the accuracy of the output
- Inter-Dependencies Within The Model: The inherent, underlying structures in the organization’s data. Good structure always provides far better results.
Once these things are considered when selecting or designing features, the advantages of feature engineering include:
- More flexibility and less complexity in models
- Faster processing
- Clear, easy-to-understand models
- Simpler models that are easier to maintain
- A better understanding of the underlying problem
- Better representation of all the available data that is helpful in characterizing the underlying problem
Challenges of Feature Engineering
Data is often unstructured and messy, containing outliers, redundancy, and missing values. Because data comes from multiple sources, redundant and duplicate data are a given. Since data is the starting point for ML, this results in the following challenges for feature engineering:
- Enormous amounts of data from multiple sources that must be cleansed, aggregated, and analyzed
- Data must be organized into a recognizable structure that models and tools can work with
- Business context and processes must be understood to discern patterns and facilitate analysis
- Insights given must be relevant and actionable for the organization
- Data should be presented in a way that’s easy for people to understand, such as dashboards or graphs
- Timeliness can be a problem, with results taking so long to produce that they are no longer applicable
- Processes are labor intensive and often must be completed by a data scientist
The Future Of Feature Engineering
Modern technologies are improving the performance of feature engineering. Deep learning, a subset of ML, is starting to reshape the process. Autoencoders and restricted Boltzmann machines are showing promise, automatically learning abstract feature representations.
The more that computers ‘think’ like humans, the more helpful their feature engineering becomes. Taking heavily manual tasks from data scientists and allocating them to machines removes cost and time constraints. This means that data forms such as images, videos, objects, and speech, which are not easily understood by traditional AI that relies on tables, may be accurately interpreted by machines soon.
New ML models are increasingly offering human-like thought processes, better feature analysis, and higher model accuracy.
But for now, the field is still reliant on data scientists. The best interpretations of data not only require knowledge of data science, but also industry or domain knowledge, making this subset of AI a specialized field. Data interpretation is vital to organizations wanting accurate predictions, and this is the best way to get valid results.
Does Your Organization Need More Accurate Predictions?
Alteryx’s machine learning package offers Deep Feature Synthesis. This helps to create more accurate models by understanding relationships within your data and detecting high-quality features.
These algorithms give a step up for organizations needing accurate models and predictions, allowing for better explanations, decision making, and future plans.