This guide will help you save money when fueling your car

11 min read · Mar 31, 2021

Gasoline cost distribution in Brazil (translation, anticlockwise: costs 55%, production, logistics and resale, ethanol, federal taxes, state taxes, taxes 45%)

As hardship persists in these pandemic times, people all over the world have little choice about buying some goods, which economists would call an effect of low price elasticity of demand. Some foods fall into this group, although they have substitutes (meat → chicken, chicken → eggs, and so on). But what about other goods like oil?

So predicting fuel prices in Brazil would surely be valuable for any consumer, and could even help lower food prices, considering that the whole logistics chain in Brazil runs on diesel and gasoline.

1) Project definition

First of all, we have to collect historical prices. Thanks to ANP (Agência Nacional do Petróleo), there’s a good repository. Prices were collected from January 2017 to December 2020 from gas stations all over the country. So we’re dealing with prices that vary in time and space.

The strategy is to first find features other than the historical gasoline prices themselves. So what really drives gasoline prices?

Crude WTI prices: Gasoline is refined from crude oil, so if crude oil rises, gasoline will almost certainly become more expensive too.

US dollar: Brazil is a large oil producer, so when the domestic currency (Brazilian real, R$) is weak against the US dollar, our producers tend to sell on the international market, earning more R$ per gallon. The more crude oil leaves the country, the more expensive crude oil becomes on the domestic market, and the more expensive gasoline becomes as well. So we should also include USD/BRL as one of our features.

Ethanol: Besides Brazilian “flex” cars, whose engines run on gasoline, ethanol, or a mix of both, our gasoline contains a percentage of ethanol set by law. So if the ethanol price rises, gasoline becomes more expensive because our gasoline is about 27% ethanol. In addition, drivers tend to switch from ethanol to gasoline, which also pushes gasoline prices up.

Some other supply and demand figures are also included in this work and are no less important than the previous ones:

Money spent buying anhydrous ethanol.

Money spent buying hydrous ethanol.

Money received selling anhydrous ethanol.

Money received selling hydrous ethanol.

Money spent buying gasoline from the international market.

Money received selling gasoline to the international market.

These matter because if all the ethanol produced in Brazil were sold on the international market, Brazilian consumers would be forced to use gasoline, raising its price. The same happens with gasoline imports and exports, but as a supply shock rather than a demand migration.

Solution structure

The solution is presented as a prediction of the average fuel price for a specific flag in the city where the consumer lives, for example. If a price better than the prediction is available right now, the consumer should buy fuel; if not, the consumer can wait for a better price or buy only the fuel needed at the moment, saving as much money as possible.

In this case the random forest regression algorithm was used because it tends to give high accuracy, which can be checked through cross validation, and because adding more trees reduces the variance of the ensemble instead of increasing overfitting. Since no feature is expected to fall outside the training set’s domain, there’s no need for the random forest to extrapolate, which tree-based models cannot do.

Metrics used and why

With all this in hand, we should choose metrics to evaluate the model.

In this case MSE (mean squared error) is a good choice to train the predictive model because it punishes large errors, and here large errors are totally undesirable given the high uncertainty.

But RMSE (root mean squared error) is chosen to report the results because it has a more interpretable unit (R$ instead of R$²). The expected answer is something like: “This model has an error of R$ 0.80 per liter”, making it clear whether a consumer can rely on it for a buying decision.
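To make the difference between the two metrics concrete, here is a minimal sketch using NumPy. The four prices are made up for illustration and are not taken from the dataset.

```python
import numpy as np

# Made-up prices in R$ per liter, only to illustrate the two metrics.
y_true = np.array([5.00, 5.20, 4.80, 5.50])
y_pred = np.array([5.10, 5.00, 4.90, 5.10])

mse = np.mean((y_true - y_pred) ** 2)   # unit: R$ squared
rmse = np.sqrt(mse)                     # unit: R$, same as the prices

print(f"MSE:  {mse:.4f} R$^2")
print(f"RMSE: {rmse:.4f} R$ per liter")
```

The single large miss (R$ 0.40) dominates the MSE, which is the "punishes large errors" behavior described above.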

Now, having a plan, it’s time to explore the data.

2) Analysis

Of course these data have to be inspected so they fit in the same dataframe in a consistent way. All dates are stored as strings and in different formats. Below are some examples of how the data are presented:

USD/BRL prices, where dates are presented in %b %d,%y format

For that, the closing price (Price column) was used for all exchange data, dropping all other columns except the date.

Fuel prices at Brazilian gas stations, where dates are presented in %d/%m/%y format

The columns mean, respectively: region, state, city, name of the gas station, company code, fuel, collection date, selling price, buying price, unit, flag. The flag is the brand of the gas station, like Texaco or ESSO in the US.
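As a rough sketch of how two such string formats can be brought to a common date type, the standard library's `datetime.strptime` accepts the same format codes. The sample strings below are invented to match the codes quoted above; the real files may differ (four-digit years, for instance, would need `%Y` instead of `%y`).

```python
from datetime import datetime

# Invented sample strings, shaped to match the format codes quoted above.
usd_brl_date = "Mar 31,21"     # %b %d,%y
fuel_date = "31/03/21"         # %d/%m/%y

parsed_usd = datetime.strptime(usd_brl_date, "%b %d,%y")
parsed_fuel = datetime.strptime(fuel_date, "%d/%m/%y")

# Once parsed, both sources share one comparable date type.
print(parsed_usd.date(), parsed_fuel.date())
print(parsed_usd == parsed_fuel)
```

In pandas, `pd.to_datetime(column, format=...)` takes the same format codes and does this for a whole column at once.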

Brazilian derivatives import and export in US$ FOB (Free on Board)

These numbers are reported month by month, so every day in a month gets the same value. The columns are: year, product (we are interested in “GASOLINA A”), name of the trade movement (not important for us), unit, and the months from January to December (with the total at the end).
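One way to give every day the value of its month is to reshape the wide monthly table and merge it on year and month. This is only a sketch: the column names (`year`, `"1"`, `"2"`, `import_usd`) and the numbers are hypothetical and do not reflect the actual ANP file layout.

```python
import pandas as pd

# Hypothetical wide monthly table: one row per year, one column per month.
monthly_wide = pd.DataFrame({"year": [2020], "1": [100.0], "2": [120.0]})

# Reshape to one row per (year, month).
monthly = monthly_wide.melt(id_vars="year", var_name="month", value_name="import_usd")
monthly["month"] = monthly["month"].astype(int)

# Hypothetical daily frame; year and month are derived from the date.
daily = pd.DataFrame({"date": pd.to_datetime(["2020-01-15", "2020-01-16", "2020-02-03"])})
daily["year"] = daily["date"].dt.year
daily["month"] = daily["date"].dt.month

# Left merge: every day inherits the value of its month.
merged = daily.merge(monthly, on=["year", "month"], how="left")
print(merged)
```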

How do the data behave?

Now we’re interested in looking at the behavior of each feature. What is the trend? Is there any seasonality? Do they have negative or NaN values?

At first, we have a small problem with crude prices: there’s a negative value (-37 USD) on April 20th, 2020, so this value was removed. Crude prices range from 10 to 76 dollars, so there is plenty of variability to work with in gasoline price prediction. No seasonality can be plainly seen in the graph, but some trends can help our prediction, such as the whole of 2017, the fall at the end of 2018 and the sharp fall at the beginning of the pandemic last year.

Crude WTI price (there’s a sharp fall in April 2020 showing a negative price)

There are no negative or NaN values in USD/BRL, and it rose steeply last year, also showing large variability. The US dollar rose from R$ 3.05 in January 2017 to R$ 5.89 in the middle of the pandemic in May/June 2020.

Some values are missing in ethanol prices (traded in contracts of 29000 gallons) because ethanol is not necessarily traded every day (maybe due to government measures or Brazilian holidays). The same doesn’t happen with crude oil and USD/BRL because those are traded on international trading floors that are not subject to these issues. Ethanol prices range from R$ 1325 to R$ 2235.
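Removing the negative crude close can be done with a simple boolean filter. In this sketch the column names and the two neighboring prices are invented; only the -37 value comes from the text above.

```python
import pandas as pd

# Hypothetical frame; the prices around the negative close are invented.
crude = pd.DataFrame({
    "date": pd.to_datetime(["2020-04-17", "2020-04-20", "2020-04-21"]),
    "price": [18.0, -37.0, 10.0],
})

# Keep only rows with a strictly positive price.
crude_clean = crude[crude["price"] > 0].reset_index(drop=True)
print(crude_clean)
```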

3) Methodology

Challenges faced

Considering the population concentration in some southeastern states, collection bias is common (few gas stations in some places and few workers in inland Brazil to collect those prices). As we can see in the bar chart below, it wouldn’t be a good idea to try to predict prices in states with only a few records, so we’re going to continue with the top 3 states: SP, RJ and MG, which stand for São Paulo, Rio de Janeiro and Minas Gerais.

Number of records by Brazilian state (SP reaches almost 300 k, followed by MG and RJ with 100 k and almost 100 k respectively)

Some “flags” are also dominant, so we filter the top four: BRANCA, PETROBRAS DISTRIBUIDORA S.A., RAIZEN and IPIRANGA.

Number of prices registered grouped by flag
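Both filters (top states, then top flags) follow the same pattern: count records per category and keep only the most frequent ones. Here is a sketch on toy data; the column names `state` and `flag`, the helper `keep_top` and the rows themselves are assumptions for illustration.

```python
import pandas as pd

# Hypothetical records; column names "state" and "flag" are assumptions.
prices = pd.DataFrame({
    "state": ["SP", "SP", "SP", "SP", "RJ", "RJ", "RJ", "MG", "MG", "AC"],
    "flag": ["BRANCA", "BRANCA", "RAIZEN", "BRANCA", "RAIZEN",
             "BRANCA", "IPIRANGA", "RAIZEN", "BRANCA", "OUTRA"],
})

def keep_top(df, column, n):
    """Keep only rows whose value in `column` is among the n most frequent."""
    top_values = df[column].value_counts().nlargest(n).index
    return df[df[column].isin(top_values)]

filtered = keep_top(prices, "state", 3)    # the article keeps 3 states
filtered = keep_top(filtered, "flag", 2)   # the article keeps 4 flags; 2 fits the toy data
print(filtered)
```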

There’s still a small problem: ANP didn’t collect some prices in the middle of the pandemic (as a safety measure or similar). Average gasoline prices are shown in the plot below.

Gasoline average prices in RJ, SP and MG

Nothing can be done about this problem, so the target will be empty during this period, losing some predictive power (mainly for the trend).

Preprocessing

We have talked before about:

A lot of NaN values in WTI Crude price and USD/BRL. That’s because trading floors don’t operate on weekends.

And there are many more NaN values in ethanol prices because, besides the closed trading floors in the international market, we also have to consider Brazilian holidays. It looks like a lot of gas prices were collected on holidays, which leads to NaN values when merging the data, as shown below for May 1st (Labor Day in Brazil).

Merged dataframe showing a NaN ethanol price on a Brazilian holiday in the first row

Therefore we forward fill all NaN values, because on Saturday and Sunday the crude and dollar prices can be considered the same as Friday’s. The same applies to Brazilian holidays.

Now we have a complete dataframe, but the categorical variables still have to be handled. These are flag (branca, petrobras, and so on…) and city (all the cities in the states filtered before).
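A small sketch of the merge-then-forward-fill step with pandas. All prices and column names here are invented; only the idea (a left merge creates NaN on holidays and weekends, `ffill` carries the last traded price forward) comes from the text.

```python
import pandas as pd

# Invented market prices, available on trading days only (Thursday and Monday).
market = pd.DataFrame({
    "date": pd.to_datetime(["2020-04-30", "2020-05-04"]),
    "ethanol": [1500.0, 1520.0],
})

# Invented pump prices, including May 1st (holiday) and a Saturday.
fuel = pd.DataFrame({
    "date": pd.to_datetime(["2020-04-30", "2020-05-01", "2020-05-02", "2020-05-04"]),
    "gas_price": [4.10, 4.12, 4.11, 4.15],
})

merged = fuel.merge(market, on="date", how="left").sort_values("date")
print(merged["ethanol"].isna().sum())   # NaN rows created by the merge

# Carry the last known market price forward over holidays and weekends.
merged["ethanol"] = merged["ethanol"].ffill()
print(merged)
```

Sorting by date before filling matters: forward fill copies whatever row comes just before, so the rows must be in chronological order.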

The categorical columns are then expanded into indicator (dummy) columns with pd.get_dummies().
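As a small illustration of that encoding step, the toy frame below has one row per price record. The column names `flag`, `city` and `price` and the values are assumptions, not the real dataset.

```python
import pandas as pd

# Toy rows; the column names "flag" and "city" are assumptions.
df = pd.DataFrame({
    "flag": ["BRANCA", "RAIZEN", "IPIRANGA"],
    "city": ["SAO PAULO", "CAMPINAS", "SAO PAULO"],
    "price": [4.10, 4.25, 4.30],
})

# Each category becomes its own indicator column; numeric columns are untouched.
encoded = pd.get_dummies(df, columns=["flag", "city"])
print(encoded.columns.tolist())
```

With all the cities of three states, this produces a very wide dataframe, which is worth keeping in mind for training time.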

Also, it is important to split the date into day, month and year and to create other time variables like week, weekday and week of the year. This is especially important because weekends can be subject to a small demand shock (travel and so on), and the same happens with holidays. Similarly, small demand shocks can also happen at the beginning of the month (the days/weeks when people usually get paid and companies pay their bills).

Split date and other time variables concatenated in the final dataframe

After that, there’s a complete dataframe ready for the implementation step.
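The date split can be sketched with the pandas `.dt` accessor. The column names follow the feature list given later in the article (day, month, year, weekday, week_in_year); the exact definition of the separate "week" variable is not stated, so it is left out here.

```python
import pandas as pd

df = pd.DataFrame({"date": pd.to_datetime(["2020-05-01", "2020-05-02", "2020-12-31"])})

df["day"] = df["date"].dt.day
df["month"] = df["date"].dt.month
df["year"] = df["date"].dt.year
df["weekday"] = df["date"].dt.weekday                               # Monday=0 ... Sunday=6
df["week_in_year"] = df["date"].dt.isocalendar().week.astype(int)   # ISO week number
print(df)
```

Note that `dt.isocalendar()` needs pandas 1.1 or newer, and that ISO week numbering can assign the last days of December to week 53 or to week 1 of the next year.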

4) Implementation

Model and hyperparameters

To solve this problem, the data were split into train and test sets of 0.75 and 0.25 respectively.

The random forest regressor in scikit-learn (default parameters) was chosen, yielding an RMSE of R$ 0.13 and taking a bit more than 30 minutes to train and predict (Core i7 9th Gen with 16 GB DDR4 RAM on Debian 11).

It sounds like a good result, but there are still some issues to take care of:

The max_depth parameter defaults to “None”, so the regressor can grow each tree to its maximum depth, fitting very particular cases and raising a suspicion of overfitting.
It’s better to fit the model with a small grid search (to save processing time), looking for a better answer at least in the neighborhood of the default parameters and including cross validation, which is also very important to prevent overfitting.
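A minimal sketch of this baseline on synthetic data, since the real feature matrix is not reproduced here. The data generator, the `random_state` values and the variable names are assumptions; only the 0.75/0.25 split, the default `RandomForestRegressor` and the RMSE report follow the text.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real feature matrix and target (assumption).
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(400, 3))
y = 4.0 + 1.5 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(0, 0.05, size=400)

# 75% train, 25% test, as in the article.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

model = RandomForestRegressor(random_state=42)   # default hyperparameters
model.fit(X_train, y_train)

rmse = np.sqrt(mean_squared_error(y_test, model.predict(X_test)))
print(f"RMSE: {rmse:.3f}")
```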

Improvements

So a grid search was performed over two hyperparameters: the number of estimators in the regressor (n_estimators) in [60, 80, 120] and the maximum depth of each tree (max_depth) in [40, 50, 60], with 3-fold cross validation and only 1 parallel job (to avoid memory errors and losing work already processed).

The new RMSE (R$ 0.17) is a bit worse than the previous one, but since cross validation was used and a max_depth was imposed on the trees in the forest to prevent overfitting, the default result may have been starting to overfit, so we keep the grid search result as the valid one.
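The search described above might look roughly like this with scikit-learn's `GridSearchCV`. The grid values, `cv=3` and `n_jobs=1` come from the text; the synthetic training data, the `random_state` and the choice of `neg_mean_squared_error` as the scoring function are assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Small synthetic training set so the sketch runs quickly (assumption).
rng = np.random.default_rng(0)
X_train = rng.uniform(0, 1, size=(150, 3))
y_train = 4.0 + 1.5 * X_train[:, 0] + rng.normal(0, 0.05, size=150)

param_grid = {
    "n_estimators": [60, 80, 120],
    "max_depth": [40, 50, 60],
}

search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid,
    scoring="neg_mean_squared_error",   # assumption: MSE as the selection metric
    cv=3,                               # 3-fold cross validation
    n_jobs=1,                           # a single job, as in the article
)
search.fit(X_train, y_train)

# best_score_ is the negated mean MSE across folds, so flip the sign first.
best_rmse = np.sqrt(-search.best_score_)
print(search.best_params_, f"CV RMSE: {best_rmse:.3f}")
```

After fitting, `search.best_estimator_` is the model refit on the whole training set with the winning combination.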

Justification

The final hyperparameters are max_depth = 60 and n_estimators = 120; all the others were kept at their defaults. A max depth of 60 means the model can grow its branches, but not so far that they fit very particular cases. With a high number of estimators, the final prediction is an average over many trees, which traditionally makes the model more robust and didn’t seem to hurt processing time too much.

So, as planned, predicting a price of R$ 5.73 a week from now would mean the consumer can expect values from R$ 5.56 to R$ 5.90 (given the RMSE) and compare that with the prices found today.

All these results seem fully satisfactory at this point, considering the price fluctuation and that the consumer otherwise has no clue about what to expect in the near future.
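The arithmetic behind that range is simply the prediction plus or minus one RMSE. The sketch below uses two hypothetical helpers, `price_band` and `should_buy_now`; the band is a rough guide, not a formal confidence interval.

```python
def price_band(predicted, rmse):
    """Return (low, high) as the predicted price minus/plus one RMSE."""
    return round(predicted - rmse, 2), round(predicted + rmse, 2)

def should_buy_now(todays_price, predicted):
    """Hypothetical rule: buy today if the price is below the predicted mean."""
    return todays_price < predicted

low, high = price_band(5.73, 0.17)
print(f"R$ {low:.2f} to R$ {high:.2f}")
print(should_buy_now(5.60, 5.73))
```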

5) Conclusion

End-to-end summary

In a nutshell...

A bunch of .csv files were downloaded from the ANP open data website in different formats (some separated by commas, some by \t), containing gas prices (and other derivatives) as well as derivatives import/export data.

Crude WTI oil and USD/BRL data were downloaded from the Investing.com website.

Ethanol prices were also downloaded from Investing.com, but using prices from B3 (the Brazilian stock exchange).

All of these data went through preprocessing to clean wrong values and to fill NaN values with adequate ones.

Possible sources of bias were addressed by filtering the data before fitting the model; after that, categorical variables were transformed into dummy variables. Finally, dates were split into new variables for day, week, month, weekday and year.

Random Forest Regressor was used to fit the first version of this work with default parameters, and results were then tested in the neighborhood of those hyperparameters, using cross validation and a limit on the max depth of each tree. This was, by far, the most difficult decision because it’s hard to give up an impressive result for a “worse” one in the belief that it is more consistent.

The grid search results were then kept to prevent overfitting, considering they were also good.

To be done

Now, this MVP is ready to be tested, but what could still be done?

The model is fed with city, product (gasoline), flag, year, month, day, weekday, week_in_year and also crude price and dollar price (in reais). The result is the mean predicted price for that city, that flag, and that day. Anything above it can be considered expensive and anything below it a cheaper choice. In the next versions, a confidence interval is planned to help consumers with this kind of classification.

Other variables have to be considered, like neighborhood. This can only be solved by e-mailing ANP and asking for a complete dataset, which is being done at this time.

And finally, in the next versions it’s planned to try other models, including recurrent neural networks, to compare with the random forest regressor.

The full notebooks and data are in my GitHub repository here