 Regular article
 Open Access
Energy consumption prediction using people dynamics derived from cellular network data
EPJ Data Science volume 5, Article number: 13 (2016)
Abstract
Energy efficiency is a key challenge for building sustainable societies. Due to growing populations, increasing incomes and the industrialization of developing countries, the world primary energy consumption is expected to increase annually by 1.6%. This scenario raises issues related to the increasing scarcity of natural resources, the accelerating pollution of the environment, and the looming threat of global climate change.
In this paper we introduce a new approach to predicting next-week energy consumption based on the analysis of human dynamics derived from anonymized and aggregated telecom data, which is processed from GSM network call detail records (CDRs). We introduce an original problem statement, analyze regularities of the source data, provide insight into the original feature extraction method and discuss peculiarities of the regression models applicable to this big data problem.
The proposed solution could serve energy producers and distributors as an essential complement to smart meter data, supporting better decisions in three ways: reducing total primary energy consumption by limiting energy production when no demand is predicted, reducing energy distribution costs through efficient buy-side planning in time, and providing insights for peak load planning in geographic space.
Introduction
Energy efficiency is a key challenge for building sustainable societies. Due to growing populations, increasing incomes and the industrialization of developing countries, the world primary energy consumption is expected to increase annually by 1.6%. This scenario raises issues related to the increasing scarcity of natural resources, the accelerating pollution of the environment, and the looming threat of global climate change.
In order to improve the efficiency of the supply systems and thus to reduce the amount of energy consumption, a critical step is to understand energy needs at relatively high spatial and temporal resolution. An accurate prediction of energy demands could provide useful information to make decisions on energy generation and purchase. Furthermore, an accurate prediction would have a significant impact on preventing overloading and allowing an efficient energy storage. Hence, several computational works have started developing machine learning models to predict the energy consumption of residential and commercial buildings using features such as weather and energy bills [1]. For example, Kolter and Ferreira [2] used monthly electricity and gas bills and buildings’ characteristics to model energy consumption. Other studies investigated the relationship between the human occupancy of buildings and the consumption patterns, using WiFi connections as a proxy for human occupancy [3].
Nowadays, the almost universal adoption of mobile phones is generating an enormous amount of data about human behaviors with a breadth and depth that was previously inconceivable [4]. In 2013, there were 6.8 billion mobile phone subscribers worldwide, with millions of new subscribers every day, and several studies have shown that mobile phone data, specifically the Call Detail Records (CDRs) needed by the mobile phone operators for billing purposes, can be exploited to model individuals’ mobility patterns [5–7] and to map the distribution of the population in space and time [8]. Not surprisingly, a couple of works have proposed to use mobile phone data for the design and the planning of energy systems and infrastructures [9, 10]. However, with the exception of a very recent work [11] using ‘Data for Development’ (D4D) data from Senegal [12], no quantitative studies have investigated the potential of mobile phone data to understand energy consumption.
In the current paper, we propose and evaluate the usage of anonymized and aggregated people dynamics features, derived from the mobile phone network activity, to predict energy consumption. Specifically, we target two different tasks of paramount importance to increase the efficiency of energy producers and distributors and to meet consumers’ peak demands: (i) predicting the daily average energy consumption and (ii) predicting the peak daily energy consumption. It is worth noting that none of the anonymized and aggregated people dynamics features can be traced back to make inferences about individuals, and hence there are minimal, if any, privacy concerns.
To validate our approach we use mobile phone records from a territory in Northern Italy, the province of Trentino. The data, released for the Telecom Italia Big Data Challenge 2014, were collected from November 1, 2013 to December 31, 2013 [13].
Our results prove that people dynamics, extracted from aggregated and anonymized mobile phone data, are good proxies for modeling energy consumption.
Datasets description
In this section we introduce the datasets that have been used to evaluate our approach: (i) an energy consumption dataset and (ii) a mobile phone records dataset. The datasets were collected from November 1, 2013 to December 31, 2013 over a territory of 6,000 square kilometers in Northern Italy, the province of Trento. The datasets contain 50 thousand records for energy consumption and 600 million data records concerning telecommunication events, respectively [13]. The two datasets also have the same spatiotemporal aggregation. The temporal aggregation is in ten-minute intervals, while the spatial one results from partitioning the territory using a regular square grid. Each square of the grid measures approximately 1 square kilometer. In our paper, we refer to this grid as the partitioning grid.
Energy consumption dataset
The energy consumption dataset is provided by the local energy company, SET, that manages almost the entire electrical network over the Trentino territory. SET uses around 180 primary (medium voltage) distribution lines to bring energy from the national grid (high voltage) to Trentino’s consumers. To ensure the privacy of SET’s customers, their locations and the geometry of the 180 primary distribution lines are not explicitly exposed.
Consequently, the Customer site dataset shows the number of customer sites of each power line per grid square, while the Line measurement dataset indicates the amount of energy flowing through the lines at time t. Customer sites provide energy to different types of customers (e.g. houses, condominiums, business activities, industries etc.), which require different amounts of electricity. For privacy reasons this information is hidden, meaning that in the dataset the energy flowing is uniformly distributed among the various types of customers.
In Figure 1 we show the process followed by the organizers of the Telecom Italia Big Data Challenge 2014 to transform the original dataset into the one we had access to. In the first layer there is the exact position of each customer site (e.g. some of them are industries, others are small houses) and the precise geometry of each line. In the second layer we lose the exact geometries of customer sites and power lines. However, this information is summarized in the Customer site dataset, where for each grid square the number of customer sites is recorded along with the information about the power line they are connected to. In the third layer we know how the customer sites of a power line are distributed over the grid and the energy flowing through each power line (from the Line measurement dataset). It is then possible to distribute the energy flowing through a power line p over the grid in order to build a choropleth map of the energy consumption in each partitioning grid square (last layer in Figure 1).
In sum, the structure of the Customer site dataset is the following:

Square id: identification string of a given square of the partitioning grid;

Line id: identification string of the distribution power line, which is grouped with the partitioning grid square;

Number of customer sites: number of customer sites present in a given square of the partitioning grid, connected to the grid powerline (Line id).
Instead, the Line measurement dataset is composed of:

Line id: identification string of the distribution power line;

Timestamp: timestamp relative to the instant when the measurement of the current passing through the power line is done. Date in the format YYYYMMDD HH24:MI;

Value: the ampere value of the current passing through a given powerline (Line id) at a given Timestamp. This quantity is positive if the direction of the current goes from the national grid into the local line, negative otherwise.
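The apportioning step described for Figure 1 can be sketched in a few lines. This is only an illustration under the uniform-distribution assumption stated above: the function name `distribute_line_values`, the tuple layout and the toy identifiers are hypothetical and are not part of the released datasets.

```python
from collections import defaultdict

def distribute_line_values(customer_sites, line_values):
    """Spread each line's measured value over grid squares by customer-site share.

    customer_sites: iterable of (square_id, line_id, n_sites) rows
    line_values:    dict mapping line_id -> measured value at one timestamp
    Returns a dict mapping square_id -> apportioned value.
    """
    # Total number of customer sites attached to each line.
    sites_per_line = defaultdict(int)
    for _square, line, n_sites in customer_sites:
        sites_per_line[line] += n_sites

    # Each square receives a share proportional to its customer sites.
    per_square = defaultdict(float)
    for square, line, n_sites in customer_sites:
        total = sites_per_line[line]
        if total == 0 or line not in line_values:
            continue
        per_square[square] += line_values[line] * n_sites / total
    return dict(per_square)

# Hypothetical toy data: two lines, two grid squares.
sites = [("155", "L1", 3), ("156", "L1", 1), ("156", "L2", 2)]
values = {"L1": 40.0, "L2": 10.0}
square_load = distribute_line_values(sites, values)
```

Repeating the call for every Timestamp would yield one choropleth layer per ten-minute interval.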
Call Detail Records
The Call Detail Records dataset contains anonymized and aggregated incoming and outgoing calls, received and sent SMSs, and Internet connection events, generated from November 1, 2013 to December 31, 2013 by the cellular network of Telecom Italia Mobile, the largest mobile operator in Italy with 34% of the entire market share.
The dataset is composed of three sub-datasets:

(i) the Telecommunications Activity dataset, which provides the activity of Trentino, showing all the mentioned telecommunication events that took place within this area. The data provides information on Telecom Italia’s customers interacting with the network and on other people using it while roaming. For each square of the partitioning grid the dataset provides, every ten minutes, the activity in terms of sent and received SMSs, issued calls, received calls and Internet traffic related events. The information is aggregated using the country code, which has a different meaning for each kind of activity (e.g. the country of the person receiving/sending the message, the country of the person receiving/issuing the call, the country of the person connected to the Internet);

(ii) the Telecommunications - Square to Counties dataset, which provides the level of interaction between each square of the partitioning grid and the national counties. The level of interaction between a square A and a county B is given as a pair of decimal numbers. The first number is proportional to the number of calls issued from the square A to the county B, the second one is proportional to the number of calls from the county B to the square A. The temporal aggregation is done in timeslots of ten minutes;

(iii) the Telecommunications - Square to Square dataset, which provides information regarding the directional interaction strength between each pair of squares of the partitioning grid. The directional interaction strength between the square A and the square B is proportional to the number of calls issued from the square A to the square B. Again, the temporal aggregation is done in timeslots of ten minutes.
Methodology
We formulate the problem of predicting the electric energy consumption of a given geographical area as a nonlinear regression task. More specifically, we deal with two different prediction tasks: (i) average daily energy consumption and (ii) peak daily energy consumption. Each task is solved for the next 7-day interval for each electric line ID. This setting is justified by the economic and managerial value of the expected output: it is easy to plan energy supply for the next week once the predicted energy demand is available.
Electric energy consumption is measured in W ⋅ h (Watt × Hour). In terms of electromagnetism, one Watt is the rate at which work is done when one Ampere (A) of current flows through an electrical potential difference of one Volt (V). Assuming that the electrical potential, measured in V, is standardized in the Trentino province (and is thus equal for all line IDs), and given the same timeframes for the analysis, the electric energy consumption prediction task reduces to predicting the electric current, measured in Ampere, for each time frame and each line ID. The values in Ampere of the current passing through the given power line are given by the electric energy distribution company.
Forecasting model training and prediction are done for daily intervals in a high order Hilbert space, derived from the anonymized and aggregated mobile network activity in Trentino. The features extracted from the source data characterize diversity, regularity and general human dynamics in each small part of the Trentino territory, spatially separated by the square grid.
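The reduction from energy to current can be written out explicitly. The following is a sketch that assumes a constant line voltage \(V\) and a fixed interval length \(\Delta t\), and that ignores the power factor and the three-phase nature of distribution lines; the symbols \(E_{\ell,t}\), \(P_{\ell,t}\) and \(I_{\ell,t}\) (energy, power and current of line \(\ell\) in interval \(t\)) are introduced here for illustration only.

\[
E_{\ell,t} = P_{\ell,t}\,\Delta t = V\,I_{\ell,t}\,\Delta t
\quad\Longrightarrow\quad
E_{\ell,t} \propto I_{\ell,t}.
\]

Under these assumptions, a model that predicts \(I_{\ell,t}\) predicts \(E_{\ell,t}\) up to the constant factor \(V\,\Delta t\).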
In sum, the proposed technical solution includes the following main steps:

1. A highly parallelized feature extraction algorithm, which characterizes diversity, regularity and general human dynamics, derived from telecommunication data and aggregated by the square grid areas, including innovative second-order features in time and frequency domains;

2. A feature selection algorithm (32 features for the final models are selected out of \({>}3{,}000\) features), thus reducing the computational complexity of the model;

3. Nonlinear regression modeling and prediction based on an ensemble of decision trees, which are bootstrapped and aggregated;

4. A model generalization strategy, as opposed to data overfitting, including strict separation of the test set from the training set (the test set is the week following the training set, with the dependent variables shifted 7 days into the future), random splits, bootstrapping and bagging techniques.
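The strict train/test separation with a 7-day shift of the dependent variable can be illustrated as follows. This is one plausible reading of the setup rather than the exact procedure used in the experiments; the helper names `make_supervised_pairs` and `split_by_date` and the toy values are hypothetical.

```python
from datetime import date, timedelta

HORIZON = timedelta(days=7)  # prediction horizon stated in the text

def make_supervised_pairs(features_by_day, target_by_day):
    """Pair the features of day d with the target observed on day d + 7."""
    pairs = []
    for day in sorted(features_by_day):
        future = day + HORIZON
        if future in target_by_day:
            pairs.append((day, features_by_day[day], target_by_day[future]))
    return pairs

def split_by_date(pairs, first_test_day):
    """Chronological split: every test day strictly follows the training days."""
    train = [p for p in pairs if p[0] < first_test_day]
    test = [p for p in pairs if p[0] >= first_test_day]
    return train, test

# Hypothetical toy data: one feature and one target value per day.
start = date(2013, 11, 1)
days = [start + timedelta(days=i) for i in range(28)]
features = {d: [float(i)] for i, d in enumerate(days)}
targets = {d: 100.0 + i for i, d in enumerate(days)}

pairs = make_supervised_pairs(features, targets)
train, test = split_by_date(pairs, first_test_day=start + timedelta(days=14))
```

Splitting by date rather than at random keeps any information from the test week out of the training set.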
In the next subsections we provide further details of the experimental setup we followed (preliminary data analysis, feature extraction, feature selection, and model building).
Preliminary data analysis
As a preliminary analysis we computed a spectrogram of the temporal current line (see Figure 2) in order to visually justify the feature extraction approach described in the next section. In Figure 2, the horizontal axis represents days (the temporal scaling was done starting from the 100 milliseconds initial resolution), while the vertical axis represents frequency. The amplitude of a particular frequency at a given time is given by the intensity and the color of each point in the plot.
As expected, we found that the response variable (the amount of electric current passing a point in an electric circuit per unit of time, for each power line) has a number of cyclic characteristics and trends. In data analysis, a cyclic characteristic of a time series is called seasonality: a property of a signal that experiences regular changes recurring every observed time frame, e.g. daily, weekly, yearly. We found predictable changes in the pattern of the response variable time series that repeat over daily and weekly periods. Interestingly, these temporal regularities were characteristic of different locations in the Province of Trento. Hence, we were able to identify three possible clusters roughly corresponding to (i) the residential areas, (ii) the touristic areas and (iii) the city center areas and/or industrial areas (see Figure 2).
Specifically, we separated the energy consumption signal into three major components (daily seasonality, trend and a remainder component) by applying the seasonal-trend decomposition procedure based on loess [14]. An interesting result for each power consumption cluster type is presented in Figures 3, 4 and 5.
As shown in Figure 3, the typical energy consumption behavior of a residential area shows uneven seasonality on weekly scale, varying seasonality during day and night, variable consumption during the weekdays, low consumption during holidays, and low noise of measurements.
Turning our attention to the typical energy consumption behavior of a touristic area (e.g. Cavalese, a small village and very famous ski resort in Fiemme Valley), we observed uneven seasonality on a weekly scale, varying seasonality during day and night, variable consumption during the weekdays, an upward sloping trend toward holidays, abnormally high load during holidays, and noisy values, which are probable effects of solar energy production. In particular, the significant increase in energy consumption during weekends and holidays is justified in Northern Italy, where a large number of people leave the major cities to reach mountain touristic locations.
Interestingly, no significant differences were found for city center areas and industrial areas. They both show stationary seasonality on weekly scale, stationary seasonality during day/night, stable consumption during the weekdays, low consumption during weekends, and low noise of measurements.
The discovered seasonalities were explicitly coded into the feature space by using a number for the hour of the day and a number for the weekday for each data source being processed.
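A minimal sketch of this encoding, assuming timestamps follow the `YYYYMMDD HH24:MI` layout of the Line measurement dataset; the function name `seasonal_features` and the dictionary keys are illustrative choices, not names from the original pipeline.

```python
from datetime import datetime

def seasonal_features(timestamp):
    """Return the hour of day (0-23) and the weekday (Monday = 0) of a timestamp."""
    ts = datetime.strptime(timestamp, "%Y%m%d %H:%M")
    return {"hour": ts.hour, "weekday": ts.weekday()}

# November 1, 2013 (the first day of the dataset) was a Friday.
feats = seasonal_features("20131101 18:30")
```

Tree-based models can split directly on these integer codes, so no cyclical transformation is strictly required in this setting.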
Feature extraction
To solve the problem of computational complexity due to the huge number of data samples (more than 600 million Call Detail Records) we moved from the time domain of communication patterns to the frequency domain, applying the Fast Fourier Transform algorithm to each group of daily time series. We also found that only a small set of harmonics in the Fourier domain explains the response variable variance for each type of first-order feature space time series, which reduces the computational complexity by several orders of magnitude. The usage of a limited number of harmonics in the Fourier domain is a known method of compression, which is frequently applied in the field of digital signal processing [15]. For example, some lossy image and sound compression methods employ discrete Fourier transforms. In our experiments, we used from 16 to 64 Fourier coefficients, which are enough to represent the temporal properties of the communication data.
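The compression idea can be illustrated on a single day of ten-minute samples (144 values). The sketch below uses a naive \(O(n^{2})\) discrete Fourier transform written with the standard library so that it stays self-contained; a real pipeline would call an FFT routine instead. The function names and the synthetic signal are assumptions made for illustration.

```python
import cmath
import math

def dft(signal):
    """Naive discrete Fourier transform (an FFT gives the same values faster)."""
    n = len(signal)
    return [sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(signal))
            for k in range(n)]

def leading_harmonics(signal, n_keep=16):
    """Keep only the first n_keep Fourier coefficients as compact features."""
    return dft(signal)[:n_keep]

def reconstruct(coeffs, n):
    """Approximate a real signal of length n from its leading coefficients.

    Relies on conjugate symmetry of real signals; requires len(coeffs) <= n // 2.
    """
    out = []
    for t in range(n):
        value = coeffs[0].real
        for k in range(1, len(coeffs)):
            value += 2 * (coeffs[k] * cmath.exp(2j * math.pi * k * t / n)).real
        out.append(value / n)
    return out

# Synthetic daily activity: a constant level plus two low-frequency cycles.
day_signal = [10 + 5 * math.sin(2 * math.pi * t / 144)
              + 2 * math.cos(2 * math.pi * 3 * t / 144) for t in range(144)]
coeffs = leading_harmonics(day_signal, n_keep=16)
approx = reconstruct(coeffs, len(day_signal))
max_error = max(abs(a - b) for a, b in zip(day_signal, approx))
```

Because the synthetic signal contains only harmonics 0, 1 and 3, the 16 retained coefficients reproduce it up to floating-point error; real traffic would be reproduced only approximately.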
Diversity and regularity have been shown to be important in the characterization of different facets of human behavior and, in particular, the concept of entropy has been applied to assess the predictability of mobility [6] and spending patterns [16, 17], the socioeconomic characteristics [18] and the crime levels [19, 20] of cities and some individual traits such as personality [21]. Hence, for each variable from CDRs we computed the mathematical functions, which characterize the distributions and measure the information theoretic and statistical properties of such variables, e.g. mean, median, standard deviation, min and max values and Shannon entropy.
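The distribution descriptors listed above can be sketched as follows. This is an illustrative version, not the original implementation: Shannon entropy is computed here over the empirical frequencies of discrete values (continuous variables would first need binning), and the names `shannon_entropy` and `distribution_features` are hypothetical.

```python
import math
import statistics
from collections import Counter

def shannon_entropy(values):
    """Shannon entropy (in bits) of the empirical distribution of discrete values."""
    counts = Counter(values)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def distribution_features(values):
    """Summary statistics describing one variable within one grid square."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "std": statistics.pstdev(values),  # population standard deviation
        "min": min(values),
        "max": max(values),
        "entropy": shannon_entropy(values),
    }

# Hypothetical counts of calls per ten-minute slot in one grid square.
calls = [0, 0, 1, 1, 2, 2, 3, 3]
call_features = distribution_features(calls)
```

Four equally frequent values give an entropy of 2 bits, the maximum for four outcomes; a square dominated by a single value would score close to 0.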
In order to be able to also account for temporal relationships, the same computations as above were repeated on sliding windows of variable length (1-hour, 4-hour and 1-day), producing second-order features that capture spatiotemporal relationships, thus preserving useful source data properties.
It is worth noticing that in the computation of the distributions’ properties in frequency domain we do not limit the higher-order functions to metrics with an intuitive explanation from physics. For example, the ‘variance of real numbers part of Fourier transform of area codes’ represents, for each spatial square, a measure of diversity of the area codes of telecommunication activity.
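One possible reading of the sliding-window computation is sketched below, assuming ten-minute samples so that the 1-hour, 4-hour and 1-day windows span 6, 24 and 144 samples. The second-order feature is taken to be a statistic of per-window statistics; the function names and feature keys are illustrative, not those of the original code.

```python
import statistics

# Window lengths expressed in ten-minute samples.
SAMPLES_PER_WINDOW = {"1h": 6, "4h": 24, "1d": 144}

def sliding_windows(series, size, step=1):
    """All contiguous windows of the given size, advancing by `step` samples."""
    return [series[i:i + size] for i in range(0, len(series) - size + 1, step)]

def second_order_features(series, stat=statistics.mean):
    """Apply `stat` inside each window, then summarize how it varies across windows."""
    feats = {}
    for name, size in SAMPLES_PER_WINDOW.items():
        per_window = [stat(w) for w in sliding_windows(series, size)]
        if len(per_window) >= 2:
            feats[f"{name}_mean_of_stat"] = statistics.mean(per_window)
            feats[f"{name}_std_of_stat"] = statistics.pstdev(per_window)
    return feats

# Hypothetical two-day series repeating every hour (six ten-minute samples).
series = [float(t % 6) for t in range(288)]
window_feats = second_order_features(series)
```

For this perfectly periodic toy series every 1-hour window has the same mean, so the across-window standard deviation is zero; irregular activity would produce a larger value.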
Feature selection
In order to reduce model complexity and enhance generalization properties by reducing the risk of overfitting [22], a feature selection step was performed before the model building. The feature selection was done on a reduced sample of the training data, which was one week long. The metric used to rank the features was the total decrease in node impurities, which is the impurity measure of a decision tree node derived from the relative entropy metric [23]. This metric was chosen because it outperformed other metrics such as mutual information, information gain, and the chi-square statistic [24, 25]. We reduced the feature space to only 32 dimensions for each of the two models without losing much accuracy. The 32 dimensions were chosen because the addition of other features increased the computational complexity without significantly improving the performance. The final feature sets are provided in Tables 3 and 4. In these tables the mean decrease in node impurity is presented in non-normalized form.
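The ranking idea can be illustrated with a deliberately simplified stand-in. Instead of accumulating impurity decreases over all nodes of all trees, the sketch scores each feature by the largest squared-error reduction achievable with a single threshold split; squared error is assumed as the impurity measure, and all names and data are hypothetical.

```python
def sse(ys):
    """Sum of squared deviations from the mean (node impurity for regression)."""
    if not ys:
        return 0.0
    m = sum(ys) / len(ys)
    return sum((y - m) ** 2 for y in ys)

def best_split_gain(xs, ys):
    """Largest impurity decrease from one threshold split on a single feature."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    parent = sse(ys)
    best = 0.0
    for cut in range(1, len(order)):
        if xs[order[cut - 1]] == xs[order[cut]]:
            continue  # no threshold separates equal feature values
        left = [ys[i] for i in order[:cut]]
        right = [ys[i] for i in order[cut:]]
        best = max(best, parent - sse(left) - sse(right))
    return best

def rank_features(rows, ys, names, top_k):
    """Return the top_k feature names ordered by decreasing impurity decrease."""
    gains = {name: best_split_gain([r[j] for r in rows], ys)
             for j, name in enumerate(names)}
    return sorted(gains, key=gains.get, reverse=True)[:top_k]

# Toy data: the first column separates the targets, the second does not.
rows = [[1, 5], [2, 3], [3, 8], [4, 1], [5, 7], [6, 2]]
ys = [10.0, 10.0, 10.0, 20.0, 20.0, 20.0]
selected = rank_features(rows, ys, ["informative", "noise"], top_k=1)
```

In the setting described above, `top_k` would be 32 and the scores would come from a fitted tree ensemble rather than from single splits.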
Model building
We formulated two separate problems: (1) predicting mean daily consumption and (2) predicting peak daily consumption. To this end, we built two regression models. For each of the regression models and for each sample we have a scalar outcome variable, \(Y \in\mathbb{R}\) and a vector of explanatory variables in the selected feature space \(\vec{X} \in\mathbb{R}^{d}\).
Our goal was to estimate the regression function
\[r(\vec{x}) = \mathbb{E} [Y \mid \vec{X} = \vec{x} ]\]
for any \(\vec{x} \in\mathbb{R}^{d}\), by generating decision trees at random and combining them to form the aggregated regression estimate
\[\bar{r} (\vec{X},\Omega_{n}) = \mathbb{E}_{\Theta} [r_{n}(\vec{X},\Theta,\Omega_{n}) ],\]
where \(\mathbb{E}_{\Theta}\) is the expectation with respect to the random parameter Θ, conditional on the vector \(\vec{X}\) and the data set \(\Omega_{n}\).
A random forest is a collection of tree predictors, such as \(\operatorname{RF} (\vec{x},\vec{T},\Omega_{k} )\), \(k = 1,2,\ldots,K\), where the \(\Omega_{k}\) are random vectors. The random forest prediction for our regression problem is an unweighted average over the forest. The keys to convergence and superior metrics are low correlation and low bias of the model. Hence, in order to keep bias low the trees are grown to their maximum depth. At the same time, to keep correlation low when trees are grown we use randomization: each tree is grown on a bootstrap sample of the training set, and the number of predictors in each tree is much smaller than the total number of variables in the training set. At each node, variables are selected at random out of all variables, and the split is fitted as the best split on this subset of variables.
The choice of Leo Breiman’s Random Forest algorithm [26] is justified because it yields one of the best performances among ensemble models while remaining simple and not dependent on the optimization of multiple hyperparameters, which makes it a good way to demonstrate the properties of the functional relationships we are modeling. The Random Forest approach is also known to obtain excellent performance in terms of accuracy and scalability, owing to its ability to parallelize tree growth, to handle thousands of variables, to remain robust on badly unbalanced data, and to provide internal unbiased estimates of the error as trees are added to the ensemble.
Specifically, Random Forest consists of a collection of randomized primary regression trees \(r_{n}(\vec{x},\Theta_{m},\Omega_{n})\), \(m \ge 1\), where \(\Theta_{1},\Theta_{2},\ldots \) are outputs of a randomizing variable Θ. These random trees are combined to form the aggregated regression estimate \(\bar{r} (\vec{X},\Omega_{n})\). At each node, a coordinate of \(\vec{X} = (X_{1},\ldots,X_{d})\) is selected, with the jth feature having a probability \(p_{nj} \in(0, 1)\) of being selected. At each node, once the coordinate is selected, the split is done at the midpoint of the chosen side. The splits are traversed to the terminal node (leaf), minimizing the mean squared error.
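Thus, our model regression estimate is:
\[\bar{r} (\vec{X},\Omega_{n}) = \mathbb{E}_{\Theta} \biggl[\frac{\sum_{i=1}^{n} Y_{i} \mathbf{1}_{[\vec{X}_{i} \in A_{n}(\vec{X},\Theta)]}}{\sum_{i=1}^{n} \mathbf{1}_{[\vec{X}_{i} \in A_{n}(\vec{X},\Theta)]}} \mathbf{1}_{E_{n}(\vec{X},\Theta)} \biggr],\]
where \((\vec{X}_{i},Y_{i})\), \(i = 1,\ldots,n\), are the training samples in \(\Omega_{n}\), \(A_{n}(\vec{X},\Theta)\) is the terminal node (leaf) cell of the random partition that contains \(\vec{X}\), and \(E_{n}(\vec{X},\Theta)\) is the event that this cell contains at least one training sample.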
For each model we used a random selection of features to split each node, growing binary trees and averaging them [27], an approach whose computational efficiency is widely exploited by the machine learning community. We also took advantage of the well-known performance improvements that are obtained by growing an ensemble of trees and averaging. Random vectors were sampled before the growth of each tree in the ensemble, and a random selection without replacement was performed [28].
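The ingredients described in this section (bootstrap samples, random feature subsets at each node, fully grown trees and unweighted averaging) can be put together in a compact sketch. It is a toy illustration written with the standard library, not the implementation used for the experiments: thresholds are chosen by minimizing squared error, and the names `grow_tree`, `fit_forest`, `predict_forest` and the parameter `n_candidates` are assumptions.

```python
import random

def mean(ys):
    return sum(ys) / len(ys)

def sse(ys):
    m = mean(ys)
    return sum((y - m) ** 2 for y in ys)

def grow_tree(rows, ys, n_candidates, rng):
    """Grow one regression tree; each node considers a random subset of features."""
    if len(set(ys)) == 1:
        return mean(ys)  # pure node: stop and store the leaf value
    best = None
    # Features are drawn without replacement at every node.
    for j in rng.sample(range(len(rows[0])), n_candidates):
        for threshold in sorted({r[j] for r in rows})[1:]:
            left = [i for i, r in enumerate(rows) if r[j] < threshold]
            right = [i for i, r in enumerate(rows) if r[j] >= threshold]
            score = sse([ys[i] for i in left]) + sse([ys[i] for i in right])
            if best is None or score < best[0]:
                best = (score, j, threshold, left, right)
    if best is None:
        return mean(ys)  # candidate features are constant here: make a leaf
    _, j, threshold, left, right = best
    return (j, threshold,
            grow_tree([rows[i] for i in left], [ys[i] for i in left], n_candidates, rng),
            grow_tree([rows[i] for i in right], [ys[i] for i in right], n_candidates, rng))

def predict_tree(tree, x):
    while isinstance(tree, tuple):
        j, threshold, left, right = tree
        tree = left if x[j] < threshold else right
    return tree

def fit_forest(rows, ys, n_trees=25, n_candidates=1, seed=0):
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]  # bootstrap sample
        forest.append(grow_tree([rows[i] for i in idx], [ys[i] for i in idx],
                                n_candidates, rng))
    return forest

def predict_forest(forest, x):
    return mean([predict_tree(t, x) for t in forest])  # unweighted average

# Toy data: the target depends on the first feature; the second is a shuffled index.
rows = [[float(i), float((i * 7) % 30)] for i in range(30)]
ys = [0.0 if i < 15 else 10.0 for i in range(30)]
forest = fit_forest(rows, ys)
low = predict_forest(forest, [2.0, 14.0])
high = predict_forest(forest, [27.0, 9.0])
```

Because every tree is built independently, the loop in `fit_forest` is the natural place to parallelize tree growth.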
Experimental results and discussion
The main outcome of our approach is an ensemble machine learning algorithm that predicts energy consumption as a nonlinear time series regression problem on a daily scale for each electrical line id.
Several metrics to compare our models with baselines are provided in Tables 1 and 2. In particular, we report the Mean Absolute Error (MAE), the Mean Squared Error (MSE), the Root Mean Squared Error (RMSE), the Relative Squared Error (RSE), the Relative Absolute Error (RAE), and \(R^{2}\).
Turning our attention to the MSE, the prediction performance for daily average energy consumption for the next 7-day prediction interval is 2.43 times better than the baseline, MSE = 325.2679 compared to the baseline MSE = 790.6041 (training set arithmetic mean).
Interestingly, the prediction performance for daily peak energy consumption for the next 7-day prediction interval is 59.93 times better than the baseline (MSE = 601.7531 vs baseline MSE = 36,062.7851, which is the training set maximum value). The choice of this baseline is based on the existing practice of energy companies of meeting the maximum energy demand they experienced in the past.
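For reference, the two improvement factors follow directly from the ratio of the baseline MSE to the model MSE:

\[
\frac{790.6041}{325.2679} \approx 2.43,
\qquad
\frac{36{,}062.7851}{601.7531} \approx 59.93.
\]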
As shown in Tables 1 and 2, we got a negative \(R^{2}\) metric for our nonlinear regression problem baseline. Usually, \(R^{2}\) is defined as the proportion of variance explained by the regression model fit. If the fit is actually worse than just fitting a horizontal line, then \(R^{2}\) could be negative.
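The reported metrics, and the way a constant baseline can yield a negative \(R^{2}\), can be reproduced with a short sketch. The definitions of RAE and RSE used here are the common ones (errors relative to those of the test-set mean predictor), which is an assumption about the tables; the function name and the numbers are illustrative.

```python
import math

def regression_metrics(y_true, y_pred):
    """MAE, MSE, RMSE, RAE, RSE and R^2 for paired true and predicted values."""
    n = len(y_true)
    mean_true = sum(y_true) / n
    abs_err = sum(abs(t - p) for t, p in zip(y_true, y_pred))
    sq_err = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    abs_dev = sum(abs(t - mean_true) for t in y_true)
    sq_dev = sum((t - mean_true) ** 2 for t in y_true)
    mse = sq_err / n
    return {
        "MAE": abs_err / n,
        "MSE": mse,
        "RMSE": math.sqrt(mse),
        "RAE": abs_err / abs_dev,
        "RSE": sq_err / sq_dev,
        "R2": 1.0 - sq_err / sq_dev,
    }

# Hypothetical test week whose level differs from the training-set mean of 20.
y_test = [10.0, 12.0, 14.0, 16.0]
baseline_metrics = regression_metrics(y_test, [20.0] * len(y_test))
```

Here the constant prediction is further from the observations than their own mean, so the squared error exceeds the total variance and \(R^{2}\) drops below zero.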
Tables 3 and 4 show the feature space used for the two final prediction models. The most powerful feature for both the prediction tasks is the number of consumers per electric power line, a feature from the energy consumption dataset. This feature provides a static characterization of a specific geographical area. Our machine learning algorithm uses this feature to build different energy consumption models for each range of consumers and power lines on each square of the partitioning grid. Then, the algorithm combines a number of these models into an ensemble model leveraging the decision tree regression properties. In sum, the number of consumers per power line is an efficient way to connect spatiotemporal human dynamics characteristics, detected by telecommunication data, with the static property of the geographic area.
As shown in Tables 3 and 4, the other relevant predictors describe human mobility patterns in a geographical space, which are found to be a good proxy for predicting daily electric energy consumption.
In sum, we found that together with static properties of the places, i.e. number of consumers per grid power line, the spatial and temporal distribution properties of the mobile network data, such as Internet communications activity, are good predictors of near-term energy consumption. Also voice calls and SMS activity add some value to the prediction metrics; specifically, the spectral statistics of these activities and the cross-temporal entropy.
We also found that among second and higher-order statistics, skewness and kurtosis of harmonics in frequency domain and entropy of cross-temporal communication patterns are the best predictors for the maximum energy consumption prediction task, which is in line with the intuition of extreme value theory [29].
The full analysis of best predictors is provided in Table 3 for the average energy consumption prediction task, and in Table 4 for the maximum energy consumption prediction task.
For a commercial application it is possible to improve prediction metrics by creating a separate model for each power line, increasing the feature space during the feature selection process and adding additional information to the feature space, such as the historical energy consumption properties and the weather forecast. These multimodal data sources are outside the scope of this work, but can in fact improve the model metrics. The cost of this improvement is an increase in the computational complexity of each model, which could be efficiently parallelized in the cloud or by an efficient use of high performance computing (HPC) infrastructures, which usually exist in telecommunication and energy companies. All the computations we propose could be done in batch mode and do not require real-time processing.
Implications and limitations
Our results prove that human dynamics, which can be extracted from aggregated and anonymized mobile phone data, are good proxies for modeling energy consumption. This contribution has several practical implications for the energy producers and distributors, the telecom companies and, more in general, for the whole society. For example, our results could help to optimize the economy of energy producers/distributors value chain, also acting as an efficient tool for meeting peak electrical energy demands and creating a new market for telecom data usage. Again, our results could help to reduce the total primary energy consumption and thus its ecological footprint (e.g. climate change).
Among the limitations of our approach, we note that in rural areas, and in areas that lack telecom equipment or have little telecom activity, telecom data cannot serve as a proxy for energy consumption prediction in a power grid. Also, our approach uses data from a single operator, Telecom Italia, and does not characterize the households’ activities. Finally, the introduced models do not account for seasonality on a yearly scale due to the 2-month limitation of our dataset. However, given the good horizontal scaling of the learning algorithm, this latter limitation could be solved by training the model on much more data.
Conclusion
Looking at the amount of electric current passing through a point in an electric circuit per unit of time and for each power line, we found that it has a number of cyclic characteristics and trends. We found predictable changes that repeat over daily and weekly periods. Based on these regularities we separated all power lines into 3 clustered areas: residential, touristic and city center/industrial areas. Then, we hypothesized and proved that cellular communication patterns, which represent human dynamics in space and time, could be a good proxy for energy consumption prediction. To this end, we computed, from the anonymized and aggregated mobile network activity, a number of predictors characterizing diversity, regularity and general mobile network activity in each part of the territory spatially aggregated by the square grid.
The prediction tasks, (i) predicting the daily average energy consumption and (ii) predicting the peak daily energy consumption, are solved for the next 7-day interval for each electric line ID and are formulated as nonlinear regression tasks. We used ensemble learning methods (Random Forest) to solve the optimization problem and avoid overfitting.
To solve the problem of computational complexity due to the huge number of data samples we moved from the time domain to the frequency domain. We also found that only a small set of harmonics in the Fourier domain explains the response variable variance for each type of first-order feature space in time series, which reduces the computational complexity by several orders of magnitude. The state-of-the-art feature selection pipeline that we apply reduces the feature space down to 32 dimensions without losing significant accuracy.
The obtained results prove that human dynamics, extracted from aggregated and anonymized mobile phone data, are good proxies for modeling energy consumption. This contribution could help to optimize the economy of energy producers/distributors value chain and to reduce the total primary energy consumption, meeting the people’s energy needs.
References
 1.
Zhao H, Magoules F (2012) A review on the prediction of building energy consumption. Renew Sustain Energy Rev 16:35863592
 2.
Kolter ZJ, Ferreira J (2011) A largescale study on predicting and contextualizing building energy usage. In: Proceedings of the conference on artificial intelligence (AAAI), special track on computational sustainability and AI
 3.
Martani C, Lee D, Robinson P, Britter R, Ratti C (2012) ENERNET: studying the dynamic relationship between building occupancy and energy consumption. Energy Build 47:584591
 4.
Blondel V, Decuyper A, Krings G (2015) A survey of results on mobile phone datasets analysis. EPJ Data Sci 4:10
 5.
Gonzalez M, Hidalgo C, Barabasi A (2015) Understanding individual human mobility patterns. Nature 453:779782
 6.
Song C, Qu Z, Blumm N, Barabasi AL (2010) Limits of predictability in human mobility. Science 327:10181021
 7.
Kung KS, Greco K, Sobolevsky S, Ratti C (2014) Exploring universal patterns in human home-work commuting from mobile phone data. PLoS ONE 9(6):e96180
 8.
Deville P, Linard C, Martin S, Gilbert M, Stevens FR, Gaughan AE, Blondel VD, Tatem AJ (2014) Dynamic population mapping using mobile phone data. Proc Natl Acad Sci 111(45):15888-15893
 9.
Batty M, Axhausen K, Giannotti F, Pozdnoukhov A, Bazzani A, Wachowicz M, Ouzounis G, Portugali Y (2012) Smart cities of the future. Eur Phys J Spec Top 214(1):481-518
 10.
Keirstead J, Jennings M, Sivakumar A (2012) A review of urban energy system models: approaches, challenges, and opportunities. Renew Sustain Energy Rev 16:3847-3866
 11.
Martinez-Cesena EA, Mancarella P, Ndiaye M, Schläpfer M (2015) Using mobile phone data for electricity infrastructure planning. arXiv:1504.03899
 12.
de Montjoye Y, Smoreda Z, Trinquart R, Ziemlicki C, Blondel VD (2014) D4D-Senegal: the second mobile phone data for development challenge. arXiv:1407.4885
 13.
Barlacchi G, De Nadai M, Larcher R, Casella A, Chitic C, Torrisi G, Antonelli F, Vespignani A, Pentland A, Lepri B (2015) A multi-source dataset of urban life in the city of Milan and the Trentino province. Sci Data 2:150055
 14.
Cleveland RB, Cleveland WS, McRae JE, Terpenning I (1990) STL: a seasonal-trend decomposition procedure based on loess. J Off Stat 6(1):3-73
 15.
Stranneby D, Walker W (2004) Digital signal processing and applications. Elsevier, Amsterdam
 16.
Krumme C, Llorente A, Cebrian M, Pentland A, Moro E (2013) The predictability of consumer visitation patterns. Sci Rep 3:1645
 17.
Singh VK, Freeman L, Lepri B, Pentland AS (2013) Predicting spending behavior using socio-mobile features. In: 2013 international conference on social computing (SocialCom). IEEE Comput. Soc., Los Alamitos, pp 174-179
 18.
Eagle N, Macy M, Claxton R (2010) Network diversity and economic development. Science 328(5981):1029-1031
 19.
Bogomolov A, Lepri B, Staiano J, Oliver N, Pianesi F, Pentland A (2014) Once upon a crime: towards crime prediction from demographics and mobile data. In: Proc. 16th ICMI. ACM, New York, pp 427-434
 20.
Bogomolov A, Lepri B, Staiano J, Letouze E, Oliver N, Pianesi F, Pentland A (2015) Moves on the street: classifying crime hotspots using aggregated anonymized data on people dynamics. Big Data 3(3):148-158
 21.
Montjoye Y, Quoidbach J, Robic F, Pentland A (2013) Predicting personality using novel mobile phone-based metrics. In: Greenberg AM, Kennedy WG, Bos ND (eds) Social computing, behavioral-cultural modeling and prediction. Lecture notes in computer science, vol 7812. Springer, Berlin, pp 48-55
 22.
Guyon I, Elisseeff A (2003) An introduction to variable and feature selection. J Mach Learn Res 3:1157-1182
 23.
Singh SR, Murthy HA, Gonsalves TA (2010) Feature selection for text classification based on Gini coefficient of inequality. J Mach Learn Res 10:76-85
 24.
Raileanu L, Stoffel K (2004) Theoretical comparison between the Gini index and information gain criteria. Ann Math Artif Intell 41(1):77-93
 25.
Tuv E, Borisov A, Runger G, Torkkola K (2009) Feature selection with ensembles, artificial variables, and redundancy elimination. J Mach Learn Res 10:1341-1366
 26.
Breiman L (2001) Random forests. Mach Learn 45(1):5-32
 27.
Breiman L (1999) Random forests - random features. Technical Report 567, Department of Statistics, UC Berkeley
 28.
Breiman L (1996) Bagging predictors. Mach Learn 24(2):123-140
 29.
Pickands J (1975) Statistical inference using extreme order statistics. Ann Stat 3(1):119-131
Acknowledgements
We thank Telecom Italia SpA and SET Distribuzione SpA for providing additional data and for their help in preparing this research.
Additional information
Competing interests
The authors declare that they have no competing interests.
Authors’ contributions
The authors contributed equally to designing and performing the research and to writing the paper.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
Keywords
 energy consumption prediction
 mobile phone data
 human dynamics
 machine learning