Automated Feature Engineering

Feature engineering, the creation of new input features from raw data, is one of the most important tools for increasing the performance of machine learning algorithms. In practice, it is one of the most effective ways to obtain better results and to build better predictive models faster.

Typically, feature engineering requires professionals with the necessary domain-specific knowledge. For certain domains, however, automated feature engineering works particularly well. One of these is time series data with strong seasonal effects, the most prominent example being sales data.

The framework we selected for automated feature engineering is Prophet, open source software released by Facebook's Core Data Science team. It is a forecasting procedure for time series data, based on an additive regression model in which non-linear trends are fit with yearly, weekly, and daily seasonality, plus holiday effects.

To demonstrate the power of automated feature engineering, we took a deeper dive into the Prophet framework and applied it to a data science challenge, using practical code examples in R. Our focus: how to extract useful features with Prophet. Running the code is computationally expensive in terms of processing power, memory, and time, but the results clearly show it is well worth the effort.

A Brief Introduction to Prophet

As noted above, Prophet is open source software released by Facebook's Core Data Science team. Its forecasting procedure is an additive regression model that works best with time series that have strong seasonal effects and several seasons of historical data. In short, it is a seasonality-aware model designed to deliver more precise forecasts.

Prophet decomposes time series data into three main model components: trend, seasonality, and holidays. They are combined in the following formula:

y(t) = g(t) + s(t) + h(t) + ϵ_t

Here, g(t) is the trend function, which models non-periodic changes in the value of the time series; s(t) is the periodic part, such as weekly and yearly seasonality; and h(t) represents the effects of holidays, which occur at potentially irregular intervals over one or more days. The error term ϵ_t represents any idiosyncratic changes that the model does not capture.
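The additive structure can be illustrated with a minimal Python sketch. The three component functions below are hypothetical stand-ins with made-up numbers, not the functions Prophet actually fits; the point is only that the model value is the plain sum of trend, seasonality, and holiday effect.

```python
import math


def trend(t):
    # Hypothetical linear trend: 100 units at day 0, growing by 0.05 per day.
    return 100.0 + 0.05 * t


def seasonality(t):
    # Hypothetical weekly cycle with a period of 7 days.
    return 10.0 * math.sin(2.0 * math.pi * t / 7.0)


def holiday_effect(t, holiday_days=frozenset({358, 359})):
    # Hypothetical fixed uplift on the day indices listed in holiday_days.
    return 25.0 if t in holiday_days else 0.0


def additive_model(t):
    # y(t) = g(t) + s(t) + h(t); the error term is omitted in this noiseless sketch.
    return trend(t) + seasonality(t) + holiday_effect(t)
```

Because the components are simply added, each one can later be read off separately and used as a feature in its own right.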

Exploratory Data Analysis

Let's start our exploratory journey. To demonstrate, let us take a closer look at two different stores. Store A is open Monday to Saturday only and is closed on state holidays, whereas Store B is also open on Sundays and state holidays.

In the sales history of Store A, we identified a significant spike around Christmas and a trough just before the New Year. To reveal longer-term patterns, we added a moving average, which shows the seasonality more clearly. There is a small peak during spring. Sales drop in summer, then start to rise at the beginning of autumn and keep rising until the New Year.
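The smoothing step can be sketched as follows. This is a minimal illustration in Python rather than the R used for the original analysis, and it assumes a centered window; the article does not state which window type or length was used, and the sales figures are made up.

```python
def moving_average(values, window):
    """Centered moving average; positions without a full window stay None."""
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd number")
    half = window // 2
    smoothed = [None] * len(values)
    for i in range(half, len(values) - half):
        smoothed[i] = sum(values[i - half:i + half + 1]) / window
    return smoothed


daily_sales = [5, 7, 9, 30, 8, 6, 4]  # made-up numbers with a single spike
print(moving_average(daily_sales, window=3))
```

A wider window flattens individual spikes more strongly, which makes slow seasonal movements easier to see at the cost of losing detail at both ends of the series.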

We have established the annual seasonality of sales. What other seasonality might there be? Weekly, perhaps? To check, we first calculate the average sales for each weekday. There is a clear peak in sales on Monday and a slump on Thursday. Since the store is not open on Sunday, it is reasonable to expect more sales on Saturday and Monday. This, in turn, confirms that the store has a weekly seasonality.
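The weekday check amounts to a group-by-and-average. A minimal sketch in Python, assuming the sales history is available as (date, sales) pairs; the function name and the sample figures are hypothetical.

```python
from collections import defaultdict
from datetime import date

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def average_sales_by_weekday(records):
    """records: iterable of (date, sales). Returns {weekday name: mean sales}."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for day, sales in records:
        name = WEEKDAYS[day.weekday()]  # date.weekday(): Monday is 0
        totals[name] += sales
        counts[name] += 1
    return {name: totals[name] / counts[name] for name in totals}


# Made-up figures: two Mondays and one Thursday.
sample = [(date(2015, 6, 1), 100), (date(2015, 6, 8), 120), (date(2015, 6, 4), 60)]
print(average_sales_by_weekday(sample))
```

Weekdays on which the store never opens simply do not appear in the result, so a closed Sunday does not drag any average down.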

What about holiday effects? Do state and school holidays also have an impact on sales? As with weekly seasonality, we can calculate the average sales for holiday and non-holiday periods. Because Store A is not open on state holidays, we use Store B for this evaluation. Without a doubt, state holidays have a huge impact on sales. At Easter, sales increased by almost 50%. The impact of school holidays, however, remains relatively small compared to state holidays.

Similarly, we can treat sales promotion days as a type of holiday too.
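The holiday comparison can be sketched in the same way. The Python function below is a hypothetical helper with invented numbers: it splits (date, sales) pairs by a set of special dates and reports the relative uplift. Promotion days can be passed in as the special dates just as holidays can.

```python
from datetime import date
from statistics import mean


def holiday_uplift(records, special_days):
    """Compare mean sales on special days with mean sales on all other days.

    records: iterable of (date, sales); special_days: set of dates.
    Returns (special mean, regular mean, relative uplift).
    """
    special = [sales for day, sales in records if day in special_days]
    regular = [sales for day, sales in records if day not in special_days]
    if not special or not regular:
        raise ValueError("need at least one special and one regular day")
    special_mean, regular_mean = mean(special), mean(regular)
    return special_mean, regular_mean, special_mean / regular_mean - 1.0


# Made-up figures: one holiday with 150 units, two regular days averaging 100.
sample = [(date(2015, 4, 5), 150), (date(2015, 4, 7), 90), (date(2015, 4, 8), 110)]
print(holiday_uplift(sample, {date(2015, 4, 5)}))
```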

For detailed analysis, including charts to illustrate trends and results, please refer to our Kaggle article.

Feature Engineering with Prophet

Having identified that seasonalities exist and that holidays affect the sales data, we now use Prophet to decompose it.

What can we expect from the decomposition? Let us go back to Store B as an example. We decomposed the time series of Store B into four components. The first component is the trend, which shows how sales have grown and how they are expected to continue. This is useful for sales predictions and business analysis. For example, it is not easy to compare the performance of a company in August and December, because August is usually the slack season, whereas December is the peak season. Comparing the trend instead removes this seasonal distortion: Store B appears to have grown and is now approaching saturation.

The second component captures the holiday and promotion effects. Here, these include regular holidays such as Easter and Christmas, school holidays, and promotional events for all stores. This component allows us to predict the large deviations in sales caused by holidays and promotional events.

Lastly, we have weekly and annual seasonalities. They are periodic, repetitive, and generally encompass regular and predictable patterns in the levels of business activity. If you compare the weekly and annual seasonalities with the average sales for weekdays and months, you will find they are consistent.
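A decomposition along these lines might look as follows with Prophet's Python interface. The original analysis was done in R, so this is a sketch under assumptions, not the code behind the results. The helper names `build_holiday_rows` and `prophet_features` are hypothetical, and pandas and prophet are third-party packages that are imported lazily inside the function.

```python
from datetime import date


def build_holiday_rows(name, dates, lower_window=0, upper_window=0):
    """Rows in the column layout Prophet expects for its holidays data frame."""
    return [
        {"holiday": name, "ds": d,
         "lower_window": lower_window, "upper_window": upper_window}
        for d in dates
    ]


def prophet_features(history_rows, holiday_rows):
    """Fit Prophet on one store and return its fitted components.

    history_rows: list of {"ds": date, "y": sales} for a single store.
    Requires pandas and prophet to be installed.
    """
    import pandas as pd
    from prophet import Prophet

    history = pd.DataFrame(history_rows)
    history["ds"] = pd.to_datetime(history["ds"])
    holidays = pd.DataFrame(holiday_rows)
    holidays["ds"] = pd.to_datetime(holidays["ds"])

    model = Prophet(holidays=holidays,
                    weekly_seasonality=True,
                    yearly_seasonality=True)
    model.fit(history)
    components = model.predict(history[["ds"]])
    # One column per component; each can be joined back as a feature by date.
    return components[["ds", "trend", "holidays", "weekly", "yearly"]]


# Hypothetical holiday entry: Easter Sunday 2015, extended by one day.
easter = build_holiday_rows("easter", [date(2015, 4, 5)], upper_window=1)
```

Fitting one model per store is what makes the procedure computationally expensive, since the cost grows with the number of stores.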

Result Comparison

Now we can use the generated features for prediction and check whether there is any improvement. Our benchmark is a simple regression model with only a few engineered features. Sales are log-transformed. The model uses year, month, and week, with the store ID as a factor, in addition to the average number of customers per month and the average sales for each weekday.
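The calendar part of the benchmark can be sketched in a few lines of Python. Two choices here are assumptions rather than details from the article: the week is taken as the ISO week number, and `log1p` is used for the log transformation so that days with zero sales remain defined.

```python
import math
from datetime import date


def baseline_features(day, sales):
    """Calendar features plus log-transformed sales for one observation."""
    _, iso_week, _ = day.isocalendar()
    return {
        "year": day.year,
        "month": day.month,
        "week": iso_week,
        "log_sales": math.log1p(sales),  # log(1 + sales), defined for sales == 0
    }


row = baseline_features(date(2015, 6, 1), 99)
print(row)
print(math.expm1(row["log_sales"]))  # inverse transform back to the sales scale
```

Predictions made on the log scale have to be transformed back with `expm1` before they are compared with actual sales.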

This basic model achieves a score of 0.11668 on the Public Leaderboard and 0.12561 on the Private Leaderboard, which places it in the top 40%. It serves as the baseline for showing how powerful the Prophet features are. Adding them to the basic model improves the Private and Public scores by 6.57% and 7.29% respectively, which would raise our ranking substantially and push us into the top 10%.

