The aim of this master thesis is to use machine learning models to distinguish owners of heat pumps from non-owners based on load profiles and temperature data only. As with any data mining project, its workflow can be divided into business understanding, data gathering, analysis, modeling, interpretation and deployment. At the time of writing, the models have not yet been deployed. The need for this thesis arises from the growing popularity of heat pumps in the Netherlands and the potential issues this growth poses for the management of low-voltage distribution grids, in particular the rising electricity demand in the heating season. Before such issues can be analyzed, however, the number of heat pump users needs to be determined. This thesis aims to determine the overall number of heat pump users precisely by examining individual load profiles.
The data available for this thesis consists of load profiles of owners and non-owners of heat pumps provided by Alliander; load profiles of London-based non-owners of heat pumps, referred to as baseload profiles; load profiles of heat pumps alone, from sites spread across the UK; and temperature records for De Bilt in the Netherlands, London and Nottingham. This data was cleaned and manipulated before features were extracted from it. In particular, synthetic load profiles of heat pump owners were created by pairing baseload profiles with the heat-pump-only load profiles. Next, all load profiles were normalized in order to diminish the influence of confounding variables (discussed below), and only night hours were kept so as to exclude the effect of PV production. The normalized load profiles were grouped into two sets: Alliander's set, which consisted of the load profiles of heat pump owners and non-owners provided by Alliander, and the simulated set, which consisted of baseload profiles and synthetic load profiles of heat pump owners. It is worth mentioning that a confounding variable was present in Alliander's set, namely the size of the house: the owners of heat pumps all lived in detached houses, whereas the majority of non-owners lived in apartments.
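The construction of a synthetic profile described above can be illustrated with a minimal sketch. It rests on several assumptions that the text does not confirm: "pairing" is taken to mean adding the two profiles reading by reading, normalization is taken to mean dividing by the mean of the combined profile, and night is taken to be the hours 0 to 5. The function name `synthesize_profile` and its parameters are hypothetical.

```python
def synthesize_profile(baseload, pump, hours, night_hours=range(0, 6)):
    """Combine a baseload profile with a heat-pump-only profile.

    baseload, pump: consumption readings taken at the same timestamps.
    hours: hour of day (0-23) of each reading.
    night_hours: hours counted as night (assumed 0-5, not from the thesis).
    """
    if not (len(baseload) == len(pump) == len(hours)):
        raise ValueError("all inputs must have the same length")

    # Assumption: pairing means summing the two loads reading by reading.
    combined = [b + p for b, p in zip(baseload, pump)]

    # Assumption: normalization divides by the mean of the whole profile.
    average = sum(combined) / len(combined)
    normalized = [value / average for value in combined]

    # Keep only night-time readings so that PV production plays no role.
    return [v for v, h in zip(normalized, hours) if h in night_hours]


night_profile = synthesize_profile(
    baseload=[1.0, 1.0, 1.0, 1.0],
    pump=[1.0, 3.0, 0.0, 0.0],
    hours=[0, 3, 12, 18],
)
```

In this toy example only the readings at hours 0 and 3 survive the night filter. Any other normalization or night window could be substituted without changing the overall structure.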
Subsequently, the following four features were extracted from the normalized night-time load profiles in each of the two sets: (1) the average daily electricity consumption in January and December (this period is also referred to as the winter or heating period), (2) the ratio of the average daily electricity consumption in January and December to the average daily electricity consumption in July and August (this period is also referred to as the summer or cooling period), (3) the slope of the curve relating daily electricity consumption (y-axis) to mean daily temperature (x-axis), and lastly (4) the coefficient of determination of that same curve.
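The four features can be sketched as follows. This is an illustration only, assuming that the "curve" is an ordinary least-squares straight line and that each day is given as a (month, mean temperature, consumption) tuple; the function name `extract_features` and the dictionary keys are hypothetical.

```python
from statistics import mean


def extract_features(days):
    """days: list of (month, mean_temperature, daily_consumption) tuples.

    Assumes winter and summer days are both present and that temperature
    and consumption are not constant (otherwise a division by zero occurs).
    """
    winter = [c for m, _, c in days if m in (1, 12)]   # January, December
    summer = [c for m, _, c in days if m in (7, 8)]    # July, August

    temps = [t for _, t, _ in days]
    loads = [c for _, _, c in days]
    t_bar, c_bar = mean(temps), mean(loads)

    # Sums of squares for a least-squares line: consumption vs. temperature.
    sxx = sum((t - t_bar) ** 2 for t in temps)
    syy = sum((c - c_bar) ** 2 for c in loads)
    sxy = sum((t - t_bar) * (c - c_bar) for t, c in zip(temps, loads))

    return {
        "winter_mean": mean(winter),                    # feature (1)
        "winter_summer_ratio": mean(winter) / mean(summer),  # feature (2)
        "slope": sxy / sxx,                             # feature (3)
        "r_squared": sxy ** 2 / (sxx * syy),            # feature (4)
    }


# Toy data in which consumption falls linearly with temperature.
features = extract_features([
    (1, 0.0, 20.0),
    (12, 2.0, 19.0),
    (7, 20.0, 10.0),
    (8, 18.0, 11.0),
])
```

A household with a heat pump would be expected to show a higher winter mean, a higher winter-to-summer ratio, and a steeper negative slope with a higher coefficient of determination than a household without one.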
Three main evaluation criteria were set for the performance of the machine learning models: True Negative Rate, True Positive Rate and Precision. For simplicity, a mean score was used as well, equal to the average of True Positive Rate, True Negative Rate and Precision. The benchmark for all evaluation metrics was set to 90\%. Five models were used to distinguish heat pump owners from non-owners: Logistic Regression, Decision Tree, and Support Vector Machines with Linear, Polynomial and Radial kernels. The evaluation procedure was as follows. First, the hyperparameters of all five models were tuned using 10-fold cross-validation, with both the training and test sets drawn from the features extracted from Alliander's set only. Next, the models with optimal hyperparameters were trained on the features extracted from Alliander's set and tested on the features extracted from the simulated set.
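For reference, the three criteria and the mean score can be computed from predicted and true labels as in the sketch below. The function name `evaluate` and the label encoding (1 for heat pump owner, 0 for non-owner) are assumptions made for illustration.

```python
def evaluate(y_true, y_pred):
    """Return TPR, TNR, Precision and their mean for binary labels.

    Labels: 1 = heat pump owner, 0 = non-owner (assumed encoding).
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # owners found
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # non-owners found
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # type I errors
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # type II errors

    def ratio(numerator, denominator):
        # Undefined metrics (empty denominator) are reported as NaN.
        return numerator / denominator if denominator else float("nan")

    tpr = ratio(tp, tp + fn)        # True Positive Rate
    tnr = ratio(tn, tn + fp)        # True Negative Rate
    precision = ratio(tp, tp + fp)  # Precision
    return {
        "tpr": tpr,
        "tnr": tnr,
        "precision": precision,
        "mean_score": (tpr + tnr + precision) / 3,
    }


scores = evaluate(y_true=[1, 1, 0, 0, 0, 0], y_pred=[1, 0, 1, 0, 0, 0])
```

Many false positives lower Precision and True Negative Rate, while many false negatives lower True Positive Rate, which is why the three criteria are reported together.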
The results show that none of the models reached the benchmark of 90\% on all three of True Positive Rate, True Negative Rate and Precision. In the hyperparameter tuning stage, both True Positive Rate and True Negative Rate came close to 90\%; however, this was achieved at the cost of low Precision, which reached just above 50\%. This was due to the propensity of the models to commit type I errors, that is, false positives. At the evaluation stage, on the other hand, when the features from the simulated set served as the test set, Precision was considerably higher, at approximately 75\%, which came at the cost of a lower True Positive Rate of around 50\%. True Negative Rate, though, did exceed 90\%. These results show a strong tendency to commit type II errors, that is, false negatives. The best performing model, the Support Vector Machine with Radial kernel, achieved a mean score of 75\%.
The divergence between the results of the hyperparameter tuning stage and those of the evaluation stage is caused by different usage patterns of heat pumps among the owners in Alliander's load profiles and those in the synthetic load profiles. In particular, the owners of heat pumps in Alliander's set do use their heat pumps at night, whereas the synthetic users do so to a much smaller extent. As a result, the features extracted from the simulated set of load profiles are less indicative of heat pump ownership than the features extracted from Alliander's load profiles.
This master thesis could be improved in several ways, among them trying out more machine learning models, improving the normalization of load profiles, and acquiring heat-pump-only load profiles whose usage patterns are more similar to those in Alliander's set. Further work can be built upon the results of this thesis. Once heat pump owners have been identified based on load profiles, similar work can be done for identification based on voltage profiles. The advantage of using voltage profiles rather than load profiles is that voltage profiles are less privacy sensitive. Furthermore, identifying the kind of heat pump each user operates (air-to-air, water source or geothermal) might provide further insight. Last but not least, the models developed in this master thesis could be deployed at Alliander and used to investigate heat pump ownership across the entire dataset of 120k load profiles at Alliander's disposal.