Zero by 2050 OR Limits of a Forest

Alina V
10 min read · Feb 7, 2022
(Courtesy: Shutterstock/lassedesignen)

Phew. This wasn’t an easy one.

As you may have guessed from my previous post, I am still in the Udacity learning programme; actually, I am at the end of it. My very last project is freestyle (which is great for anyone who has a hobby). It means that I can pick any topic, explore it, draw some conclusions and present them to you wrapped in an exciting story.

So I started browsing the datasets and came across one about healthy pregnancies. Since about five of my friends had newborns in the span of the last two months (corona boom), the topic immediately jumped to my attention and I thought I’d do that. Then I couldn’t find it again the next day, but instead found a large dataset with chocolate reviews. “Chocolate!” I thought, “I’ll write about chocolate”.

And just when I was on the borderline between chocolate and pregnancies, I stumbled upon a dataset about energy balance in the EU. And this was it, I couldn’t pass it by. You’d say it’s weird for someone who was almost elbow-deep in chocolate flavours by that point, and I couldn’t agree more, but there it is: my article is about energy balance.

Project Definition

Problem Introduction

To set some context:

I live in Germany and I read the news sometimes. You may have heard that we have a new chancellor and he has a climate agenda. His New Year speech (where it wasn’t about the pandemic and vaccination) was about climate change and the plan to reach net zero emissions by 2050. To me it sounded vaguely unrealistic back then, so my data-obsessed mind wanted to prove him wrong. It sounds like an easy task if you are a beginner data scientist, you have energy consumption data by source from 1990 up to 2019 on your hands (and you are actually required to produce some meaningful analysis to graduate). Right? That’s what I thought too.

It wasn’t so easy (spoiler alert).

Problem statement

I decided to explore and prepare the dataset, identify the sources of energy and check their long-term usage across the last 30 years. Then my plan was to find a model that can predict usage for “green” sources vs. “non-green” sources, both generally and by location, for the next 30 years, and to verify whether the EU goal of going green by 2050 is actually achievable given the current trend.

These are the questions I aimed to answer:

  1. Can we group the energy sources by their climate impact?
  2. Can we observe a trend in the historic data for each of the identified groups?
  3. Can we predict the trend of energy consumption in each of the groups until 2050?
  4. How does the prediction correspond to the goal of climate neutrality in 2050?

Planned Solution

I explored the data and made it clean and ready for analysis, answering questions 1 and 2 along the way. Questions 3 and 4 require some modeling, so I used two different models (Random Forest and Linear Regression with Polynomial Features) to understand whether we can predict from this dataset and whether the 2050 goal is achievable.

Metrics

My definition of success evolved while I was working on my project.

  • Clear trend lines showing where the green sources may (or may not) prevail over other sources. Identifiable year when it happens.
  • Identifiable year when “other” sources reach zero, if they do. If they don’t, year of lowest usage.
  • Measurement of model efficiency. When I started working on Random Forest, I habitually chose the R2 score, because it is a good regression accuracy metric. However, when I turned to train and test sets on Linear Regression with Polynomial Features, the R2 score went negative on test values, because the predictions and the real data points no longer overlapped much (see Fig. 10), due to the inability to predict the next data point exactly, i.e. the absence of impactful variables. The generic trend line remained accurate at the same time. So I picked another metric that lets me compare train and test scores better: MAPE (mean absolute percentage error). This means I started focusing on relative error instead of an exact fit to the test values, which wasn’t possible anyway. I aimed to keep MAPE as low as possible, in most cases below 0.1 (see the sketch right after this list).
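To make the contrast concrete, here is a tiny sketch with made-up numbers: predictions that merely wander around the true values push R2 below zero, while MAPE stays small and comparable across sets.

```python
import numpy as np
from sklearn.metrics import r2_score, mean_absolute_percentage_error

# Hypothetical test values and predictions (in TWh), purely for illustration.
y_test = np.array([118.0, 123.0, 119.0, 121.0])
y_pred = np.array([122.0, 118.5, 123.5, 117.0])

# R2 benchmarks the model against a flat line at the mean of y_test,
# so on a short, noisy test window it easily turns negative.
print(r2_score(y_test, y_pred))                        # about -3.9

# MAPE is a relative, scale-free error, directly comparable
# between train and test sets.
print(mean_absolute_percentage_error(y_test, y_pred))  # about 0.035
```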

Analysis

Data Exploration & Visualisation

My main dataset (135.19 MB) from Kaggle contained the following (Fig. 1):

Fig. 1 Original data

That is: energy usage, energy source, unit of energy, location, year, consumed energy value.

In addition, my data pack contained two dictionaries: one for nrg_bal values, one for siec values.

I checked the value distribution across 1) years and 2) geographies. Overall it was pretty homogeneous.

Fig. 2 Value distribution by year
Fig. 3 Value distribution by geography
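For reference, the whole check is a couple of lines of pandas. A minimal sketch, assuming the columns from Fig. 1; the file name is hypothetical.

```python
import pandas as pd

# Load the energy balance data (hypothetical file name).
df = pd.read_csv('nrg_bal_c.csv')

# Rows per year and per geography; both distributions came out fairly even.
print(df['TIME_PERIOD'].value_counts().sort_index())
print(df['geo'].value_counts())
```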

Data Preprocessing

When I looked at the value counts for the next column, unit, I found that they were suspiciously equal and contained three types of energy measurement:

That’s terajoules, gigawatt hours and kilotonnes of oil equivalent.

This means that, in effect, I had three identical datasets merged into one. I verified this by checking the data distribution for each unit.

Fig. 4 Illustration of geographical data distribution split per “unit” value

I chose gigawatt hours and converted them to terawatt hours, so that I could actually read those numbers. This also made my dataset a third of its original length.
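In code this boils down to a filter and a division. A sketch, assuming 'GWH' is how the dataset spells the gigawatt-hour unit code:

```python
# Keep one of the three identical slices and make the numbers readable.
df = df[df['unit'] == 'GWH'].copy()       # 'GWH' is my assumption for the code
df['OBS_VALUE'] = df['OBS_VALUE'] / 1000  # 1 TWh = 1000 GWh
```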

I also checked duplicates (there were none) and missing values in my data:

nrg_bal         0.000000
siec            0.000000
unit            0.000000
geo             0.000000
TIME_PERIOD     0.000000
OBS_VALUE      33.386142

The 33.4 next to the energy value is the percentage of missing values. I got rid of those rows, because there is no point training an algorithm on non-existent data.
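Both checks and the cleanup are short; roughly:

```python
# Duplicates (there were none) and the share of missing values per column.
print(df.duplicated().sum())
print(df.isnull().mean() * 100)

# Drop the rows where the energy value itself is absent.
df = df.dropna(subset=['OBS_VALUE'])
```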

As my next step I checked which energy sources I have in my data. It was a fine collection of 11 unique items:

I broadly categorised them into three groups (a sketch of the mapping follows the list):

  • Definitely “bad”, climate-damaging (0, 2, 7, 8, 9, 10).
  • Acceptable (1, 3, 4, 6). Gas and nuclear heat are questionable sources, but they are not classified as climate-damaging (surprise!).
  • Climate-friendly (5)
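A minimal sketch of the mapping, assuming the sources come out of the dataframe in the order listed above:

```python
# Map the 11 unique sources to the three groups by position in the list.
sources = list(df['siec'].unique())
groups = {
    'climate-damaging': [sources[i] for i in (0, 2, 7, 8, 9, 10)],
    'acceptable':       [sources[i] for i in (1, 3, 4, 6)],
    'climate-friendly': [sources[5]],
}
source_to_group = {s: g for g, members in groups.items() for s in members}
df['group'] = df['siec'].map(source_to_group)
```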

This is what the historic trend across all geographies looks like:

Fig. 5

The generalised picture gave me the trend I was hoping for. Climate-friendly energy goes up, acceptable energy plateaus, climate-damaging energy goes down. It all seems to be early days though.

Modeling Implementations

Now, to modeling. I did make a bold attempt at creating a Random Forest machine learning algorithm to predict future trends. I failed, because a Random Forest takes the training data very literally and generalises poorly on sequential data (like our timeline, where order matters): its predictions are averages of target values it has already seen, so it cannot extrapolate a trend beyond the training range, especially with no input variables other than years in a sequence.

Fig. 6 MAPE scores on RF

Although the MAPE values were good (Fig. 6), the model was overfitting: the prediction still showed some fluctuation in the data points at the beginning, but then flattened out to no change at all (Fig. 7).

Fig. 7 RF Prediction

Fig. 7 shows the result after I created a “lag” column as part of the X variables, holding the value from the previous year.

“Lag” turned out to be one of the most impactful features (Fig. 8); however, it still did not allow capturing the trend.

Fig. 8
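Roughly what that attempt looked like; a reconstruction, not my notebook’s exact code (the aggregation details and column names are my assumptions):

```python
from sklearn.ensemble import RandomForestRegressor

# Yearly totals per source group, with last year's value as a "lag" feature.
yearly = df.groupby(['group', 'TIME_PERIOD'])['OBS_VALUE'].sum().reset_index()
yearly['lag'] = yearly.groupby('group')['OBS_VALUE'].shift(1)
yearly = yearly.dropna(subset=['lag'])

one_group = yearly[yearly['group'] == 'climate-friendly']
X, y = one_group[['TIME_PERIOD', 'lag']], one_group['OBS_VALUE']

rf = RandomForestRegressor(n_estimators=100, random_state=42).fit(X, y)

# A forest only averages target values it has already seen, so a recursive
# forecast beyond 2019 quickly flattens out, exactly as in Fig. 7.
print(rf.feature_importances_)  # "lag" dominated, as in Fig. 8
```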

So I turned to linear regression with polynomial features. And bingo!

Fig. 9 Testing the approach on Linear Regression with Polynomial Features
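The core of this approach fits in a few lines. A sketch under the same assumptions as before; `one_group` is the per-year series from the forest sketch, and the split parameters are mine:

```python
import numpy as np
import pandas as pd
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

X, y = one_group[['TIME_PERIOD']], one_group['OBS_VALUE']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# A degree-2 polynomial trend; unlike the forest, it extrapolates.
model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
model.fit(X_train, y_train)

# Extend the timeline up to 2050 (as in Fig. 11).
future = pd.DataFrame({'TIME_PERIOD': np.arange(1990, 2051)})
trend = model.predict(future)
```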

This is a trend line fitted through the historic data (Fig. 10). The paler color shows the test data points for each category.

Fig. 10

Quite believable, isn’t it?

Refinement / Hyperparameter tuning

Right, the trend seems to be picked up more or less accurately. It’s clear that we won’t reach zero on non-green energy by 2050, but there is a chance of a substantial decline. Green energy has a clear upward trend, perhaps too steep at degree 2. So I extended my timeline up to 2050, and this is what I got:

Fig. 11

Looks rather positive, doesn’t it? Another, more detailed chart shows how the prediction compares to the input dataset (Fig. 12).

Fig. 12 All sources usage at degree 2 Polynomial Features

Climate-friendly probably won’t go up that fast. If we set the degree to 1 (Fig. 14), the model starts to underfit on the test set, which gives us a more pessimistic prediction; so the truth is somewhere between degrees 1 and 2. The other sources are fine. A more detailed forecast is hardly feasible in the absence of impactful variables.

Fig. 13 Function to predict usage by source category
Fig. 14 Climate-friendly sources usage prediction at degree 1 Polynomial Features

Depending on the observed data distribution, the degree and the train/test size may vary, so I included them as input variables to the function (Fig. 13). That proved a wise move, because these parameters need to be adjusted for each geography.

I added the geographical dimension back in, kept the degree as an input variable, and also enriched the analysis with the point in time when climate-friendly sources come closest to (or overtake) the other sources, and when the other sources reach their lowest point (or zero).
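Roughly how such a function might look; a reconstruction under the same assumptions (the real version, shown in Fig. 13, also reports the crossover year between the green and the other groups):

```python
# Reuses the imports from the previous sketch.
def predict_usage(df, geo, group, degree=2, test_size=0.2, until=2050):
    """Polynomial trend for one source group in one geography, plus the
    year of its predicted minimum. A sketch of the idea, not the exact
    function from the notebook."""
    subset = df[(df['geo'] == geo) & (df['group'] == group)]
    yearly = subset.groupby('TIME_PERIOD')['OBS_VALUE'].sum().reset_index()

    X_train, X_test, y_train, y_test = train_test_split(
        yearly[['TIME_PERIOD']], yearly['OBS_VALUE'],
        test_size=test_size, random_state=42)

    model = make_pipeline(PolynomialFeatures(degree=degree), LinearRegression())
    model.fit(X_train, y_train)

    years = pd.DataFrame({'TIME_PERIOD': np.arange(1990, until + 1)})
    forecast = model.predict(years)
    lowest_year = int(years['TIME_PERIOD'].iloc[forecast.argmin()])
    return forecast, lowest_year
```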

Fig. 15
Fig. 16

Results

In Germany we are working towards a brighter future, no doubt; however, on the current trend that future is not yet to come by 2050, the crossover lies beyond it.

Fig. 17 Germany

In the UK they are moving even faster and may actually reach the goal before 2050. The trend is not quite captured for acceptable sources; by the looks of it, they will probably plateau, so the truth lies between degree 2 and degree 1.

Fig. 18 UK, degree 2 on acceptable sources
Fig. 19 UK, degree 1 on acceptable sources
Fig. 20 Finland
Fig. 21 Albania

Many countries seem to be on track to achieve the set goal. There are countries like Albania (Fig. 21), though, which currently have comparatively low consumption and an upward trend in all groups of sources (climate-damaging being the most prominent). Others, like Finland (Fig. 20), evidently invest in green energy and are closer to the goal than most.

Conclusion

Reflection

Answering the questions from above:

  • Can we group the energy sources by their climate impact?

Yes, we certainly can, as I did during data preprocessing: climate-friendly, acceptable, climate-damaging.

  • Can we observe a trend in the historic data for each of the identified groups?

Yes: climate-damaging has a clear downward trend, acceptable plateaus with a slight decline, climate-friendly goes up.

  • Can we predict the trend of energy consumption in each of the groups until 2050?

Yes, as a generalised trend line, but probably not as a detailed forecast due to many unknown variables impacting the numbers.

  • How does the prediction correspond to the goal of climate neutrality in 2050?

According to our generalised prediction it is realistic for most EU countries.

Obviously, we have some limitations here. As I mentioned, I don’t have access to the variables that determine energy production from the various sources: a colder winter, for example, can push consumption significantly upward, but I would not be able to identify the reason. Also, the population grows and overall consumption changes, so the observed trend in total energy source usage tells us nothing about consumption per household, which may go up or down. Finally, I don’t know what may or may not impact the trend in the future. Assuming everything stays in the same range, the forecast is quite optimistic.

Although I am limited by the approach and the available data, my assumption was that even a simplified trend line would get us nowhere by 2050. The exploration gave me some hope. And why else do we need data science, if not for this hope? :)

Improvements

Possible ways to improve this / go deeper into the problem would be:

  • Find a better dataset with more variables that impact the consumption
  • Group the sources differently
  • Use a less generalised model and additional datasets (e.g. weather, population, etc. for the same years). A custom ensemble model might be an option
  • Retrain the model in each new year
