The Power of Functional Data Analysis
Unlocking Patterns in Energy Data
Introduction
As utilities and energy companies collect ever more customer data, analyzing and understanding that data becomes increasingly complex. One powerful approach to this challenge is functional data analysis. In this article, we will explore how functional data analysis can be used to describe customer behavior, how that description can feed feature engineering, and how it can ultimately drive better business outcomes.
The objectives of this article are threefold: to provide a gentle, non-technical introduction to functional data analysis, to demonstrate how it can be used to make inferences about customer behavior, and to showcase how those inferences can be leveraged as features in other processes, such as predictive models.
What Is Functional Data Analysis?
I was introduced to functional data analysis in graduate school. One day while studying GAMs (Generalized Additive Models) and splines, my professor suggested that anyone interested in learning more should explore the works of professors Ramsay and Silverman. I was curious and decided to purchase the authors’ books. In doing so I discovered a new world of statistics and methods of analysis that were very applicable to modeling energy data.
But, what is functional data analysis anyway?
According to Wikipedia, functional data analysis, which is often abbreviated as FDA, “is a branch of statistics that analyses data providing information about curves, surfaces or anything else varying over a continuum.”
According to the ever-popular ChatGPT, “FDA is a statistical approach that analyzes data represented as functions over a continuous domain, aiming to extract information by modeling the functional relationships between variables. FDA is useful for analyzing complex data structures subject to noise and measurement error, with applications in fields such as time series analysis, image analysis, and bioinformatics.”
Finally, I define functional data analysis as a non-parametric regression technique used to approximate a curve or function via a linear combination of basis functions. FDA then provides methods to analyze the information contained in those curves or functions.
The figure above is a helpful illustration taken from a presentation I gave at an energy-professionals meetup. The 24 red data points represent the mean consumption by hour for a period of time, say the last two years. The black curve was fitted to the points using functional data analysis.
A neat feature of the fitted curve is that it allows us to make inferences about points in time for which there is no data. For instance, there are only 24 data points, one observation per hour. But what if the consumption behavior between hours is of interest? Fret not: the fitted curve is continuous, so we can estimate consumption at 05:30, 11:15, 21:55, or any point within the range the curve was fitted to, which in this example is hours 0 through 23.
An additional feature that sets FDA apart from other statistical techniques is that the fitted curves are differentiable. This means it’s possible to calculate the first and second derivatives of the curves, a capability with significant implications in applications from engineering to marketing. The importance of derivatives might not be immediately clear, but after reading this article you will understand why they matter and appreciate their potential.
The curve-fitting machinery of FDA is flexible: there are many ways to fit curves to data. There are different basis functions (Fourier, B-splines, wavelets, etc.) and different estimation approaches, such as ordinary least squares or penalized methods. We won’t discuss the technical details of how functional data analysis works, but I’d be remiss not to share some specifics. Throughout this article I used a fully parameterized Fourier basis expansion, fit with a roughness penalty, to smooth the data. The Fourier basis is well suited to periodic data and is easily differentiated.
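To make this concrete, here is a minimal sketch of how such a fit might look with the fda package in R. It assumes `kwh` is a numeric vector of 24 mean hourly readings; the object names, basis size, and penalty value are illustrative rather than the exact settings used for the figures in this article.

library(fda)

hours <- 0:23

# Fourier basis over the day; an odd number of basis functions is typical
day_basis <- create.fourier.basis(rangeval = c(0, 23), nbasis = 9)

# penalize curvature (the second derivative) to control roughness
day_fdPar <- fdPar(day_basis, Lfdobj = int2Lfd(2), lambda = 1e-2)

# fit the curve to the 24 points via penalized smoothing
kwh_fd <- smooth.basis(argvals = hours, y = kwh, fdParobj = day_fdPar)$fd

# the fitted curve is continuous: evaluate it at 05:30, 11:15, 21:55
eval.fd(c(5.5, 11.25, 21 + 55/60), kwh_fd)

# and differentiable: velocity (first) and acceleration (second derivative)
velocity     <- eval.fd(hours, kwh_fd, Lfdobj = 1)
acceleration <- eval.fd(hours, kwh_fd, Lfdobj = 2)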
Another fun tidbit about functional data analysis is that many traditional methods have functional counterparts. For instance, it’s possible to take the mean or median of fitted curves and to construct confidence intervals around them. There is also functional principal component analysis, and even curve clustering techniques to group FDA-fitted curves. In short, FDA is feature rich and offers many avenues of exploration.
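As a hedged sketch of what a few of those functional counterparts look like in the fda package, assuming `daily_fd` is an fd object holding one fitted curve per day (the name is illustrative):

library(fda)

mean_curve <- mean.fd(daily_fd)   # functional mean across the daily curves
sd_curve   <- sd.fd(daily_fd)     # pointwise functional standard deviation

# functional principal component analysis: dominant modes of variation
fpca <- pca.fd(daily_fd, nharm = 3)
fpca$varprop                      # share of variance explained per harmonic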
I’m considering writing a series of posts that explore other FDA topics, such as curve clustering to analyze patterns across months, phase-plane plots to show how a classical engineering technique can be leveraged to analyze consumption behavior, a deeper dive into the mechanics of FDA, or more advanced techniques such as regression analysis and predictive modeling with functional data.
Lastly, for intrigued individuals, the aforementioned presentation can be viewed below or at this link.
The presentation from the video above can be accessed at this link.
Data Discussion
The data for this article represents my household’s energy consumption and consists of 15-minute interval meter readings. The period runs from 2021-01-16 to 2024-09-20, a total of 1,343 days spanning 45 months.
Data Transformations
During my time at a public utility, we employed a common technique to transform the raw meter readings into a more manageable format, called the daily load profile. The daily load profile shows how electricity usage varies throughout the day and it’s a useful technique to summarize consumption.
The daily load profile is versatile, as it can be summarized many different ways. For instance, it’s common to exclude weekends for commercial clients because most commercial entities consume the bulk of their energy Monday through Friday.
For our purposes, consider each day as a separate, distinct load profile. The 15-minute meter readings are summarized by hour because doing so smooths the data and makes it easier to work with; we’ll see below that this smoothing doesn’t have much of a negative effect. Days are then collected by month and year, so each month has at most as many observations as there are days in that month. Sometimes there are outages and no energy is consumed, or the meter can’t upload data to the energy provider; this shows up as missing values in the data. For ease of analysis, any day with a missing value has been excluded.
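A minimal sketch of this transformation, assuming the raw readings live in a data frame `readings` with columns `timestamp` (a POSIXct on the 15-minute grid) and `kwh` (both names are illustrative):

library(dplyr)
library(lubridate)

# roll the 15-minute readings up to hourly kWh
hourly <- readings |>
  mutate(date = as_date(timestamp), hour = hour(timestamp)) |>
  group_by(date, hour) |>
  summarise(kwh = sum(kwh), .groups = "drop")

# drop any day with missing hours or missing values (outages, upload failures)
complete_days <- hourly |>
  group_by(date) |>
  filter(n() == 24, !anyNA(kwh)) |>
  ungroup() |>
  mutate(year = year(date), month = month(date, label = TRUE))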
The hierarchy of nesting days within months allows us to analyze each day separately or to analyze many days at once via a summary of the data, such as the mean of various days (e.g., the mean of all the days in January 2021). Below is an example of a daily load profile from a randomly chosen month.
Fitting the Curves
In this section we’ll discuss different aspects of fitting curves to the data. We won’t cover the underlying machinery of FDA or the technical details of the fitting itself, so we’ll use illustrations to walk through the various steps.
Using FDA we’ll first derive smooth approximations of the data, one curve per day. The figures below illustrate the curve-fitting process. The figure on the left shows the daily load profile for two days, while the figure on the right shows the FDA-fitted curves. Notice that the fitted curves are smooth and capture the essence of the data but don’t necessarily capture its peakedness - you might say the curves aren’t over-fitted.
The next step is to summarize the fitted curves. We will accomplish this using methods from functional data analysis. The idea is similar to summarizing non-functional data. There are two components: a) the group-by clause and b) the summarizing function. In this case, we will summarize the data via the mean and group-by month. The result is one curve per month and we’ll call these summarized curves mean daily load profiles. Below is a plot of the summarized monthly curves.
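A sketch of this summarizing step, assuming `daily_fd` holds the fitted daily curves and `day_month` is a character vector (e.g., "2021-01") recording each curve’s month; both names are illustrative:

library(fda)

# one functional mean per month: the mean daily load profiles
months <- unique(day_month)
monthly_fd <- lapply(months, function(m) mean.fd(daily_fd[which(day_month == m)]))
names(monthly_fd) <- months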
Next, the figure below shows the summarized curve for a randomly chosen month. These summarized hourly curves are the focus of the rest of the article. However, I was curious what effect summarizing the 15-minute interval data by hour would have, so I also fit curves to each day using the 15-minute data and, as before, summarized those fitted curves by month. As expected, smoothing the 15-minute data results in some loss of information; for example, the magnitude of the tallest peak is lessened and shifted. Overall, though, I think the smoothed data fits well enough.
We’ll wrap up this section with a discussion about the procedure outlined above. Instead of fitting curves to each day first, another approach would be to summarize the data first and then fit the curves to the summarized data.
In context, we’d group by month and hour and take the mean of the data before fitting the curve. The resulting curve would be nearly the same as the summarized curves above. The difference is twofold: 1) summarizing first is more computationally efficient, and 2) if only the mean is used to summarize the data, we lose insight into its variability. One could summarize the data by mean and standard deviation, but then the variability would also need to be translated to its functional counterpart, which adds steps. By summarizing the daily fitted curves (as was done above), the variability is implicitly captured, no extra steps needed. And since the curves carry information about both the mean and the variance, it’s straightforward to compute confidence intervals for the fitted curves.
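For illustration, a pointwise 95% confidence band for one month’s mean curve can be sketched directly from that month’s daily curves (again, `daily_fd` and `day_month` are assumptions carried over from the sketches above):

library(fda)

grid <- seq(0, 23, by = 0.25)
jan_mat <- eval.fd(grid, daily_fd[which(day_month == "2021-01")])  # grid x days

center <- rowMeans(jan_mat)                       # pointwise mean
se     <- apply(jan_mat, 1, sd) / sqrt(ncol(jan_mat))

upper <- center + 1.96 * se                       # pointwise upper band
lower <- center - 1.96 * se                       # pointwise lower band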
Regarding computational efficiency, FDA is fast. I’m able to fit curves to the 1,343 days in this analysis and summarize them in minutes using modest resources. To fit the curves I use distributed resources: machine 1 has 16 GB of RAM and 4 cores, and machine 2 has 8 GB of RAM and 8 cores. This works well for the size of my data. However, if I planned to analyze years’ worth of daily/hourly/15-minute energy consumption data for millions of customers, fitting a curve to each day might not be the optimal strategy.
Customer Profiles
On-peak hours are the time of the day when there is the highest demand for electricity. This is usually when people are using the most electricity for tasks such as cooking, cleaning, using electronic devices, and running air conditioning or heating units. The purpose of defining on-peak hours is to help utilities manage the electricity grid more efficiently and ensure a stable supply of electricity during times of high demand.
In this section I’ll illustrate an alternative way to define on-peak hours using functional data analysis. The approach is based on the velocity of consumption: define the peak hours as the times of day when the rate of consumption, the velocity, is highest or lowest.
The velocity-based definition of on-peak hours may at times align with the usual definition. A key difference is that the usual definition only provides a time of day, whereas the velocity-based definition also describes how quickly demand is increasing or decreasing. That is, we know the rate at which energy is being consumed, not just the time of day. This additional information could be useful to utilities and energy companies for demand planning.
Next, let’s discuss the intuition behind the velocity-based definition. Below are two figures: on the left is a mean daily load profile (i.e., a mean fitted curve), and on the right are plots of its first and second derivatives, that is, the velocity and acceleration. You may recall from calculus that the first derivative describes a rate, in this case the rate of energy consumption, and that its extrema correspond to inflection points of the original curve.
A visual inspection of the plots suggests that the velocity reaches its maximum around 4pm and its minimum around 2am. This implies that, for the given month, my household on average started consuming electricity at the fastest pace around 4pm and reduced consumption at the fastest pace around 2am.
I think a little story-telling would be helpful, so let me add context to the plots. Summers in Texas are HOT, but we don’t run the A/C all day. We’re conscious of the environment and our wallets, so the A/C is on a schedule. The system starts to cool the house early, before the scheduled time, to reach a set temperature by the requested time. Then, as the night cools, the system relaxes and stops running so aggressively. Marrying the derivative discussion with the parent kWh plot shows that my household used the most energy in the evening during this month.
Often there are other stories to be told. 2am and 4pm are the global inflection points, but there are other, local inflection points too, and sometimes these are of interest. For example, there might be only a small difference between the global maximum and a local maximum, which can hint at nuances in the data. Imagine a multi-generational household, a household with children of different ages who go to school at different times, or a household where at least one adult works the night shift. These households would likely have local inflection points of interest. Part of my interest in applying FDA to energy consumption data is that it tells a more complete story than the kWh alone.
Let’s wrap this section up with a discussion about derivatives. A visual inspection of the plots is one thing, but we can also use math to find these points and confirm that they really are inflection points. Using FDA we have an estimate of the velocity of consumption, and we can use that fitted curve to find its maximum and minimum values. The argmax and argmin are also of interest because they tell us the time of day. Lastly, the second derivative confirms the findings: where the velocity reaches an extremum, the acceleration is zero, so the consumption curve has an inflection point there. More generally, anywhere the acceleration crosses zero is a candidate inflection point, which is also how local inflection points can be found.
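A sketch of that procedure, assuming `month_fd` is one fitted mean daily load profile (the object name and grid resolution are illustrative):

library(fda)

grid <- seq(0, 23, by = 1/60)                        # one-minute resolution
velocity     <- drop(eval.fd(grid, month_fd, Lfdobj = 1))
acceleration <- drop(eval.fd(grid, month_fd, Lfdobj = 2))

t_max <- grid[which.max(velocity)]   # time of day consumption ramps up fastest
t_min <- grid[which.min(velocity)]   # time of day consumption falls fastest

# at those times the acceleration should be (approximately) zero, confirming
# an inflection point of the consumption curve
acceleration[which.max(velocity)]
acceleration[which.min(velocity)]

# candidate local inflection points: wherever the acceleration changes sign
local_inflections <- grid[which(diff(sign(acceleration)) != 0)]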
Feature Engineering
In this section I’ll illustrate how to create features from the consumption data and the derivatives discussed in the previous section. In essence, I will create profiles of the months and perform an analysis on the data.
I worked on a variety of projects during my tenure at Consumers Energy. For some projects the energy data had to be condensed, often down to a single row/observation per customer. Clustering projects usually require such a structure, so that’s what I’ll focus on.
Feature engineering will focus on two areas: time, and measurements based on consumption. Using my monthly household energy data I will extract the min/max kWh as well as the times of day associated with those values. Additionally, using the derivative curves from the previous section, it’s possible to extract the min/max velocity and the times of day associated with those values. Each variable is numeric. The time fields show military time in fractional units at 30-minute resolution; for example, 15.5 is 3:30 pm. Keeping the fields numeric makes them easier to use downstream.
function(df, date_var, verbose = FALSE) {
  # import
  box::use(
    forcats[as_factor],
    lubridate[day, month, quarter, wday, year]
  )
  # vars
  date_vec <- df[[date_var]]
  # create day variables
  df[["day_name"]] <- as_factor(weekdays(date_vec, abbreviate = TRUE))
  df[["dow_int"]] <- wday(date_vec)
  df[["day_int"]] <- day(date_vec)
  # create month variables
  df[["month_name"]] <- as_factor(months(date_vec, abbreviate = TRUE))
  df[["month_int"]] <- month(date_vec)
  # create quarter variable (integer 1-4, matching the _int naming)
  df[["quarter_int"]] <- quarter(date_vec)
  # create year variable
  df[["year_factor"]] <- as_factor(year(date_vec))
  if (verbose) message("Date variables created")
  # output
  return(df)
}
Engineered Features: n Randomly Chosen Months

| Date | Min kWh | Max kWh | Min kWh Time | Max kWh Time | Min Velocity | Max Velocity | Min Velocity Time | Max Velocity Time |
|---|---|---|---|---|---|---|---|---|
| Mar 2021 | 0.08 | 0.34 | 4.00 | 12.00 | −0.05 | 0.06 | 12.00 | 7.00 |
| Feb 2022 | 0.14 | 0.64 | 1.00 | 9.00 | −0.07 | 0.10 | 12.50 | 6.50 |
| Mar 2022 | 0.07 | 0.44 | 2.00 | 9.00 | −0.06 | 0.09 | 10.00 | 6.50 |
| Jul 2023 | 0.20 | 0.86 | 13.00 | 19.00 | −0.08 | 0.16 | 2.00 | 15.50 |
| Jan 2024 | 0.23 | 0.92 | 24.00 | 9.00 | −0.10 | 0.20 | 10.50 | 6.00 |
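For reference, a minimal sketch of how features like these might be extracted from a fitted monthly curve; the helper name, the grid, and the `monthly_fd` list are illustrative assumptions, not the exact code behind the table above:

library(fda)

profile_features <- function(month_fd, grid = seq(0, 23, by = 0.5)) {
  kwh <- drop(eval.fd(grid, month_fd))               # consumption curve
  vel <- drop(eval.fd(grid, month_fd, Lfdobj = 1))   # velocity curve
  data.frame(
    min_kwh = min(kwh), max_kwh = max(kwh),
    min_kwh_time = grid[which.min(kwh)], max_kwh_time = grid[which.max(kwh)],
    min_vel = min(vel), max_vel = max(vel),
    min_vel_time = grid[which.min(vel)], max_vel_time = grid[which.max(vel)]
  )
}

features <- do.call(rbind, lapply(monthly_fd, profile_features))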
If it were available, I could add other information as well, such as the maximum number of people in the house by month, average temperature by month, etc. There are many possibilities; it all depends on the goal of the analysis. To keep it simple, no additional data was added. This is most similar to a behavioral clustering project.
The goal of the clustering analysis might be to assess which months are similar to each other (i.e., members of the same cluster). Intuitively, we’d expect months in the same season to be similar; for example, June and July might fall into a summer cluster.
Two methods will be used: Multidimensional Scaling (MDS) and Partitioning Around Medoids (PAM). MDS is a dimension reduction technique; it compresses multivariate data into a specified number of dimensions, usually 2, since MDS is often used as a visualization tool. As a side note, there are several ways to compress multivariate data into \(p\) dimensions, such as UMAP or t-SNE. I used MDS because it ships with base R and I didn’t want to install another library.
As for the clustering algorithm, I chose PAM because it’s a more robust version of K-means. I used K = 4 because there are four seasons (although it doesn’t always feel that way in Texas). The data was centered and scaled prior to performing MDS and PAM. I’m still unsure if it makes sense to standardize the time variables, but I didn’t want them to have more influence than the other variables.
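A sketch of the MDS and PAM steps, assuming `features` is the month-by-feature data frame built earlier (names illustrative):

library(cluster)

X   <- scale(features)             # center and scale every feature
mds <- cmdscale(dist(X), k = 2)    # classical MDS down to 2 dimensions

pam_fit <- pam(X, k = 4)           # PAM with K = 4, one cluster per season

# plot the 2-D MDS map, colored by PAM cluster
plot(mds, col = pam_fit$clustering, pch = 19, xlab = "MDS 1", ylab = "MDS 2")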
What do you think? Does the plot make sense? The coldest months appear to be in the top left corner, summer months in the top right corner, spring and fall in the middle, and two outliers perhaps towards the bottom of the graph.
February 2021 was known as Snow-Mageddon in Texas (Texas Parks & Wildlife Magazine 2022): a snow storm crippled North Texas for a few days, perhaps a week, and there were even rolling blackouts. In December 2022 my household hosted family for the holidays, so consumption was greater than usual. Also, among the summer months (top right), July 2023 stands out a bit, which makes sense because it was one of the warmest months on record.
Assessing K = 4 (Seasons)
Before wrapping up this section I wanted to assess how reasonable K = 4 is. First, I looked at an elbow plot. I recently read that the elbow plot is no longer recommended; I’ll explore that another time. Also, I didn’t use the K-means algorithm to cluster the data, I used PAM, so the results may not translate directly. Regardless, a K between 4 and 6 seems to make sense.
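For reference, a rough sketch of the elbow plot: total within-cluster sum of squares across candidate values of K, computed here with k-means on the scaled feature matrix `X` from the sketch above, so it is only an approximate guide for PAM:

set.seed(42)
wss <- sapply(1:10, function(k) kmeans(X, centers = k, nstart = 25)$tot.withinss)
plot(1:10, wss, type = "b", xlab = "K", ylab = "Total within-cluster sum of squares")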
Have you ever heard of a clustergram? See this link for details. In summary, it’s another way to visualize which value K should take. There isn’t a library on CRAN for this that I know of, but someone created a function accessible via GitHub (Galili 2010). Similar to the elbow plot, a value between 3 and 5 makes sense. This function also uses K-means, but it could be extended to other clustering algorithms. It’s recommended to repeat the process several times to account for randomness; I’ve only called the clustergram function once for illustrative purposes.
Curve Clustering
Another application of FDA that’s worth discussing is curve clustering. Like their non-functional cousins, curve clustering techniques have a wide breadth of applications, such as customer segmentation. For our purposes, it might be of interest to see which of the 45 months are similar across time.
In a traditional unsupervised paradigm we might treat the months as observations and the 24 hours as variables, resulting in a 45-row by 24-column matrix/data frame. Under this structure we’re essentially treating the 24 hourly values as unrelated columns, but we know they aren’t independent; the data are temporally, or serially, dependent. One way around this is to apply a dimension reduction technique such as PCA and then cluster the reduced data.
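A sketch of that traditional route, assuming `wide` is the 45-by-24 matrix of monthly mean hourly values (the name is illustrative):

# reduce the 24 correlated hourly columns with PCA, then cluster the scores
pca    <- prcomp(wide, center = TRUE, scale. = TRUE)
scores <- pca$x[, 1:2]          # keep the first two principal components

set.seed(42)
km <- kmeans(scores, centers = 4, nstart = 25)
table(km$cluster)               # cluster sizes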
FDA provides an alternative: curve clustering. In this approach FDA first estimates a continuous curve from the 24 hourly values, and then the curves themselves are the objects passed to the curve clustering algorithm. There are different curve clustering algorithms; the results of some may mirror those of a traditional clustering technique, while others will differ.
Below are results of a traditional clustering process using K-means followed by a curve clustering algorithm.
The Data
Data: n by 24 matrix (first 8 hourly columns shown)

| Date | hour_0 | hour_1 | hour_2 | hour_3 | hour_4 | hour_5 | hour_6 | hour_7 |
|---|---|---|---|---|---|---|---|---|
| 2021-01-01 | 0.18 | 0.21 | 0.18 | 0.22 | 0.19 | 0.21 | 0.22 | 0.26 |
| 2021-02-01 | 0.36 | 0.39 | 0.36 | 0.38 | 0.37 | 0.37 | 0.36 | 0.44 |
| 2021-03-01 | 0.13 | 0.10 | 0.10 | 0.08 | 0.10 | 0.10 | 0.16 | 0.19 |
| 2021-04-01 | 0.10 | 0.10 | 0.11 | 0.09 | 0.08 | 0.10 | 0.17 | 0.17 |
| 2021-05-01 | 0.18 | 0.13 | 0.12 | 0.09 | 0.09 | 0.10 | 0.20 | 0.18 |
K-Means
Curve Clustering via Functional Data Analysis
I used the funFEM function (Bouveyron 2021) to cluster the fitted curves.
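A minimal sketch of the call, assuming the 45 monthly mean curves have been combined into a single fd object, here called `months_fd` (the object name and the range of K are illustrative):

library(funFEM)

fem_fit <- funFEM(months_fd, K = 2:6)   # discriminative functional clustering
fem_fit$K                               # chosen number of clusters
fem_fit$cls                             # cluster assignment for each month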
During the run, funFEM repeatedly emitted errors of the following form:

Error in if (min(colSums(T)) <= 1) stop("One cluster is almost empty!") :
  missing value where TRUE/FALSE needed
Error in .fstep(fd, T, lambda) : One cluster is almost empty!