The Naive model is extremely simple - take the last observed value and use this as the prediction. For such a basic model it can prove to be quite powerful. For example, here in Perth, we can forecast whether it is going to rain or not by using a naive model. If it rained today, then we forecast that it will rain tomorrow, and vice versa. Using this model we end up being correct about 80% of the time.

MON | TUE | WED | THU | FRI | SAT | SUN | FORECAST (MON) |
---|---|---|---|---|---|---|---|
🌧️ | 🌧️ | ☀️ | ☀️ | ☀️ | 🌧️ | 🌧️ | 🌧️ |
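As a rough illustration of how such an accuracy figure can be measured, the sketch below back-tests the naive rule on the single toy week from the table. The variable names are made up for this example, and one week is far too short to reproduce the long-run figure quoted above.

```python
# Toy weather record matching the table: True = rain, False = sun (MON..SUN).
week = [True, True, False, False, False, True, True]

# The naive forecast for the next day is simply the last observation.
forecast_next_monday = week[-1]

# Back-test: predict each day from the day before and count the hits.
predictions = week[:-1]  # forecasts for TUE..SUN
actuals = week[1:]       # what actually happened TUE..SUN
hits = sum(p == a for p, a in zip(predictions, actuals))
accuracy = hits / len(actuals)

print(f"Forecast for Monday: {'rain' if forecast_next_monday else 'sun'}")
print(f"Correct on {hits} of {len(actuals)} days")
```

The naive rule only misses on the days where the weather changes, which is why it does well wherever wet and dry spells tend to last several days.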

First we will explore a naive forecast and then see if we can improve our predictive powers using a seasonal naive forecast.

We'll use the M5 dataset hosted by Kaggle to run through a naive forecast. The M5 dataset contains the sales figures for Walmart for 1913 days and the goal of the Kaggle competition is to accurately forecast the following 28 days of sales. We'll first read in the sales figures using Pandas.

```
import numpy as np
import pandas as pd
sales = pd.read_csv('data/sales_train_validation.csv')
print(sales.shape)
sales.head()
```

The first six columns are identifiers for each product - a unique store-product id, item id, and then which department, category, store and state it belongs to. Following that are the unit sales recorded for 1913 days for each of the 30,490 store-product combinations.

A Naive model takes the last observed value and uses it as the forecast. So to start off with, we'll grab the last column of the sales dataframe.

```
naive = sales.iloc[:,-1]
```

Easy.

We've got the values we're going to use for the forecast, so now we need a metric to gauge how accurate we are. Kaggle uses the WRMSSE for the error measurement, which is a weighted and scaled RMSE. You can find out more about it in the WRMSSE tutorial.
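To give a feel for the "scaled" part of the metric, here is a minimal sketch of an RMSSE for a single series: the forecast's mean squared error divided by the mean squared error of a one-step naive forecast on the training data. The function name `rmsse` and the toy numbers are assumptions for illustration. The competition metric additionally weights and aggregates the series, which is left out here.

```python
import numpy as np

def rmsse(train, actual, forecast):
    """Scaled RMSE for one series (illustrative sketch, not the full WRMSSE)."""
    train = np.asarray(train, dtype=float)
    actual = np.asarray(actual, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    # Scale: mean squared day-to-day change in the training data,
    # i.e. the in-sample error of a one-step naive forecast.
    scale = np.mean(np.diff(train) ** 2)
    # Mean squared error of the forecast over the horizon.
    mse = np.mean((actual - forecast) ** 2)
    return np.sqrt(mse / scale)

# Made-up toy series, purely for illustration.
train = [2, 4, 3, 5, 4]
actual = [5, 3]
forecast = [4, 4]
score = rmsse(train, actual, forecast)
print(round(score, 3))
```

A score below 1 means the forecast's errors are smaller than the typical day-to-day change seen in the training data.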

For now, we'll go along with submitting it to Kaggle. But later we will use the m5-wrmsse package as it will be a lot quicker.

Let's have a peek at the sample submission file that Kaggle has supplied.

```
submission_file = pd.read_csv('data/sample_submission.csv')
submission_file
```

The competition requires that we forecast 28 days of sales. You may notice that the submission file has 60,980 rows - this is twice as many rows as the sales dataframe. This may be confusing if you're starting after the competition has already closed. The competition was conducted in 2 stages - in the first stage only the validation set was available. Later on, the evaluation set also became available.

The top 30,490 rows of the submission file are for the 28-day forecast using the **sales_train_validation.csv** dataset. The validation set contains days 1 to 1913, so the forecast will be for days **1914 to 1941**.

The bottom 30,490 rows of the submission file are for the 28-day forecast using the **sales_train_evaluation.csv** dataset. The evaluation set contains days 1 to 1941, so the forecast will be for days **1942 to 1969**.
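The two halves can also be told apart by the suffix on the `id` column rather than by row position. The sketch below shows the idea on a tiny stand-in frame. The item names are hypothetical, and only the `_validation` / `_evaluation` suffixes and the `F1` column name mirror the real file.

```python
import pandas as pd

# Hypothetical miniature of the submission file (real file: 60,980 rows, F1..F28).
toy_submission = pd.DataFrame({
    "id": [
        "ITEM_A_validation", "ITEM_B_validation",
        "ITEM_A_evaluation", "ITEM_B_evaluation",
    ],
    "F1": [0, 0, 0, 0],
})

# Split on the id suffix instead of relying on row order.
is_validation = toy_submission["id"].str.endswith("_validation")
validation_rows = toy_submission[is_validation]
evaluation_rows = toy_submission[~is_validation]

print(len(validation_rows), len(evaluation_rows))
```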

To get a score on the Public Leaderboard, you only need to fill in the top half of the submission file. The bottom half can be left as zeros for now. So let's make a copy of the submission_file dataframe and fill in the top half with our prediction from the naive model.

```
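submission_naive = submission_file.copy()
# Repeat the last-day sales 28 times, then transpose to shape (30490, 28)
# so each row holds one series' forecast for F1..F28.
submission_naive.iloc[:30490,1:] = np.array([naive]*28).T
submission_naive.to_csv('submission_naive.csv', index=False)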
```
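If the reshaping step feels opaque, the same idea can be checked on a tiny example. This is only a sketch on made-up data: `toy_sales` and its values are hypothetical, and `np.tile` is used here as one of several equivalent ways to repeat the last value across the horizon.

```python
import numpy as np
import pandas as pd

h = 28  # forecasting horizon

# Hypothetical miniature sales table: 3 series, 3 days each.
toy_sales = pd.DataFrame(
    [[0, 2, 1], [3, 3, 4], [0, 0, 0]],
    columns=["d_1", "d_2", "d_3"],
)

# Last observed value for each series.
last_day = toy_sales.iloc[:, -1].to_numpy()

# Repeat each series' last value across the horizon: shape (n_series, h).
forecast = np.tile(last_day[:, None], (1, h))
toy_forecast = pd.DataFrame(forecast, columns=[f"F{i}" for i in range(1, h + 1)])

print(toy_forecast.shape)
print(toy_forecast.iloc[:, :3])
```

Every row of `toy_forecast` is flat: the naive model predicts the same value for all 28 days.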

After you submit submission_naive.csv to Kaggle, you should get a WRMSSE score of 1.46378 on the Public Leaderboard.

We'll see if we can improve on this using a seasonal naive approach.

The naive forecast we created takes the very last day of sales to use as the forecast. We can try to improve on this by taking into account seasonal fluctuations in the data. Now, seasonal doesn't necessarily mean summer, autumn, winter and spring. It can mean a weekly pattern, or a daily pattern. Maybe a store is always busy on Saturday, but quiet on Monday.

Let's see if there's a trend throughout the week. We'll find the total unit sales for each day of our data and then combine it with information in calendar.csv so we know which weekday it corresponds to.

```
calendar = pd.read_csv('data/calendar.csv')
calendar.head()
```

```
total_sales = sales.filter(like='d_', axis=1).sum()
total_sales = pd.DataFrame(total_sales, columns=['total_sales'])
total_sales = total_sales.reset_index().rename(columns={'index': 'd'})
total_sales = calendar[['weekday','month','d']].merge(total_sales, how='left', on='d')
total_sales = total_sales.dropna()
total_sales.head()
```

Let's use Seaborn to do a boxplot, using the weekday as a categorical variable.

```
import seaborn as sns
import matplotlib.pyplot as plt
sns.boxplot(x='weekday', y='total_sales', data=total_sales)
plt.show()
```

We can see there is a definite trend - Saturday and Sunday have significantly higher unit sales on average, which then fall to a low during mid-week.
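To put numbers alongside a plot like this, the per-weekday averages can be computed with a `groupby`. The sketch below uses two weeks of made-up daily totals rather than the M5 data, so the values themselves are purely illustrative.

```python
import pandas as pd

# Hypothetical two weeks of daily totals (not M5 data); 2021-01-04 is a Monday.
toy = pd.DataFrame({
    "date": pd.date_range("2021-01-04", periods=14, freq="D"),
    "total_sales": [10, 9, 8, 9, 11, 15, 16, 11, 9, 8, 10, 12, 16, 15],
})
toy["weekday"] = toy["date"].dt.day_name()

# Average sales per weekday, shown in calendar order.
order = ["Monday", "Tuesday", "Wednesday", "Thursday",
         "Friday", "Saturday", "Sunday"]
weekday_mean = toy.groupby("weekday")["total_sales"].mean().reindex(order)

print(weekday_mean)
```

On the real data, the same call on the `total_sales` dataframe would give one average per weekday to compare against the boxplot.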

Let's also take a look at the unit sales by month.

```
sns.boxplot(x='month', y='total_sales', data=total_sales)
plt.show()
```

There isn't as much of a definitive trend here. Some industries will have big swings throughout the year, for example an air conditioning company will normally see an uptick during the summer and winter months.

We will push through with using a 7-day seasonal cycle. There are a few options. The first involves taking the last 7 days of sales and repeating them 4 times to obtain the 28-day forecast. We'll also try using the last 28 days of sales (roughly the last month), as well as the same 28-day window from 52 weeks earlier (roughly the same month in the previous year).

```
# Forecasting Horizon
h = 28
# Last Week
naive_lw = pd.concat([sales.iloc[:,-7:]]*4, axis=1, ignore_index=True)
# Last Month
naive_lm = sales.iloc[:,-h:]
# Last Year
naive_ly = sales.iloc[:,-13*h:-12*h]
```
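The "last week" variant is a special case of a general seasonal naive rule: repeat the most recent full season until the horizon is filled. A minimal sketch for a single series is shown below. The helper name `seasonal_naive` and the toy history are assumptions for illustration.

```python
import numpy as np

def seasonal_naive(history, season_length, h):
    """Repeat the last `season_length` observations until `h` values are produced."""
    history = np.asarray(history)
    last_season = history[-season_length:]
    reps = -(-h // season_length)  # ceiling division
    return np.tile(last_season, reps)[:h]

# Two made-up weeks of daily sales for one product.
history = [5, 3, 2, 2, 3, 6, 8, 4, 3, 2, 3, 4, 7, 9]

# Weekly seasonality, 10-day horizon: the last week repeats, then is cut to length.
forecast = seasonal_naive(history, season_length=7, h=10)
print(forecast.tolist())
```

With `season_length=1` the function reduces to the plain naive forecast, and with `season_length=7` and `h=28` it matches the four-times-repeated last week for a single row.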

We have created three different 28-day forecasts and can now compare their errors. You can either submit these using Kaggle, or use the m5-wrmsse package outlined in the WRMSSE tutorial.

```
from m5_wrmsse import wrmsse
wrmsse_lw = wrmsse(naive_lw.values)
wrmsse_lm = wrmsse(naive_lm.values)
wrmsse_ly = wrmsse(naive_ly.values)
print(f'WRMSSE (Last Week): {wrmsse_lw:.3f}')
print(f'WRMSSE (Last Month): {wrmsse_lm:.3f}')
print(f'WRMSSE (Last Year): {wrmsse_ly:.3f}')
```

Using the previous 4-week period as the forecast produced the lowest WRMSSE, followed closely by using just the previous week. The error increased dramatically when forecasting with the same month from the previous year - most likely due to product lines changing in that time. A store like Walmart would be continually changing the products it sells, so it will be difficult to make an accurate forecast based on last year's sales figures. However, all three seasonal naive models were an improvement over using just the final day as the forecast (WRMSSE=1.46).

The naive or seasonal naive forecast is often used as a baseline because it is a very simple technique that often produces quite a good result. More complex forecasting techniques can be compared to the baseline to see how much of an improvement they make, but this generally comes with increased computation time. We'll dig deeper into the M5 dataset in future tutorials on the Adaptations of Croston's Method, ADIDA and Intermittent Demand, and Multiple Temporal Aggregation.