Locally Adaptive Online
Functional Data Analysis

Valentin Patilea\(^\dagger\)

ENSAI & CREST\(^\dagger\), valentin.patilea@ensai.fr

Jeffrey S. Racine\(^\ddagger\)

McMaster University\(^\ddagger\), racinej@mcmaster.ca

Sunday, August 27, 2023

Slide Pro-Tips

  • Link to slides - jeffreyracine.github.io/laofda (case sensitive, Google Translate)

  • View full screen by pressing the F key (press the Esc key to revert)

  • Access navigation menu by pressing the M key (click X in navigation menu to close)

  • Advance using arrow keys

  • Zoom in by holding down the Alt key in MS Windows, Opt key in macOS or Ctrl key in Linux, and clicking on any screen element (Alt/Opt/Ctrl click again to zoom out)

  • Export to a PDF by pressing the E key (wait a few seconds, then print [or print using system dialog], enable landscape layout, then save as PDF - press the E key to revert)

  • Enable drawing tools - chalk board by pressing the B key (B to revert), notes canvas by pressing the C key (C to revert), press the Del key to erase, press the D key to download drawings

Abstract

One drawback with classical smoothing methods (kernels, splines, wavelets, etc.) is their reliance on assuming the degree of smoothness (and thereby assuming continuous differentiability up to some order) for the underlying object being estimated. However, the underlying object may in fact be irregular (i.e., non-smooth and even perhaps nowhere differentiable) and, as well, the (ir)regularity of the underlying function may vary across its support. Elaborate adaptive methods for curve estimation have been proposed; however, their intrinsic complexity presents a formidable and perhaps even insurmountable barrier to their widespread adoption by practitioners. We contribute to the functional data literature by providing a pointwise MSE-optimal, data-driven, iterative plug-in estimator of “local regularity” and a computationally attractive, recursive, online updating method. In so doing we are able to separate measurement error “noise” from “irregularity” thanks to “replication”, a hallmark of functional data. Our results open the door for the construction of minimax optimal rates, “honest” confidence intervals, and the like, for various quantities of interest.

Outline of Talk

  • Modern data is often functional in nature (e.g., an electrocardiogram (ECG) and many other measures recorded by wearable devices)

  • The analysis of functional data requires nonparametric methods

  • However, nonparametric methods rely on smoothness assumptions (i.e., they require assuming something that is not known)

  • We show how to learn the degree of (non)smoothness in functional data settings and how to separate it from measurement noise (this cannot be done with classical data)

  • This allows us to conduct functional data analysis that is optimal (we do this in a statistical framework)

  • We emphasize online computation (i.e., how to update when new functional data becomes available)

Classical Versus Functional Data

  • A defining feature of classical regression analysis is that

    • sample elements are random pairs, \((y_i,x_i)\)

    • the function of interest \(\mathbb{E}(Y|X=x)\) is non-random

  • A defining feature of functional data analysis is that

    • sample elements are random functions, \(X^{(i)}\)

    • these random functions are themselves the functions of interest

  • The following figure (Figure 1) presents \(N=25\) sample elements (classical left plot, functional right plot)

Classical Versus Functional Data

Figure 1: Classical Versus Functional Sample Elements (\(N=25\))

Functional Data

  • FDA is the statistical analysis of samples of curves (i.e., samples of random variables taking values in spaces of functions)

  • FDA has heterogeneous, longitudinal aspects (“individual trajectories”)

  • Furthermore, we never know the curve values at all points

  • The curves are only available at discrete points (e.g., \((Y^{(i)}_m , T^{(i)}_m) \in\mathbb R \times [0,1]\))

  • The points at which curves are available can differ across curves

  • The curves may be measured with error

  • Consider measurements taken from 1 random curve:

    • Figure 2 is measured without error from an irregular curve

    • Figure 3 is measured with error from a smooth curve

    • Figure 4 displays varying (ir)regularity and measurement noise

FDA Sample Element (1 Random Function)

Figure 2: Irregular Function, Data Measured Without Error

FDA Sample Element (1 Random Function)

Figure 3: Regular Function, Noisy Data

FDA Sample Element (1 Random Function)

Figure 4: Irregular Function, Varying Regularity, Noisy Data

FDA Sample Elements (Random Functions)

Figure 5: Berkeley Growth Study Data

FDA Sample Elements (Random Functions)

Figure 6: Canadian Weather Study Data

Modelling Functional Data

  • Why not use classical linear regression to model each curve?

    • it would be difficult to justify given the nature of the data
  • Why not use existing nonparametric methods to model each curve?

    • it would be hard to justify smoothness assumptions, for one
  • Why not use classical time series or longitudinal/panel analysis?

    • the observed points may not be equally spaced

    • the various curves may not be sampled at the same time points

    • the observed points may not be stationary

  • Ideally, one would keep the observation unit as collected, and model data as realizations in a suitable space of objects (i.e., space of curves)

Project Scope

  • We drill down on unknown regularity of unknown functions

  • We exploit a key feature of FDA, namely, replication

  • We use data-driven local nonparametric kernel methods

  • Our method adapts, pointwise, to

    • the local regularity of the underlying process (i.e., the Hölder exponent \(H_t\) defined shortly) and to measurement noise (i.e., we are able to separate noise from regularity thanks to replication)

    • the purpose(s) of estimation (i.e., \(\mu(t)\) versus \(\Gamma(s,t)\) defined shortly)

  • Our method is computationally tractable:

    • we have processed millions of curves, each containing hundreds of sample points, on a laptop in real-time as new online data arrives

Functional Data Setting

Functional Data

  • Functional data carry information along the curves and among the curves

  • Consider a second-order stochastic process with continuous trajectories, \(X = (X_t : t\in [0,1])\)

  • The mean and covariance functions are \[\begin{equation*} \mu(t) = \mathbb{E}(X_t)\text{ and } \Gamma (s,t) = \mathbb{E}\left\{ [X_s - \mu(s)] [X_t-\mu(t)]\right\},\, s,t\in [0,1] \end{equation*}\]

  • The framework we consider is one where independent sample path realizations \(X^{(i)}\), \(i=1,2\ldots,N\), of \(X\) are measured with error at discrete times

  • The data associated with the \(i\)th sample path \(X^{(i)}\) consists of the pairs \((Y^{(i)}_m , T^{(i)}_m) \in\mathbb R \times [0,1]\) generated as \[\begin{equation*} Y^{(i)}_m = X^{(i)}(T^{(i)}_m) + \varepsilon^{(i)}_m, \qquad 1\leq m \leq M_i \end{equation*}\]

Discrete Sample Points

  • Here, \(M_i\) (the number of discrete sample points for the \(i\)th curve) is an integer which can be non-random and common to several \(X^{(i)}\), or a draw from some positive integer-valued random variable, independent of \(X\)

  • The \(T^{(i)}_m\) are the measurement times for \(X^{(i)}\), which can be non-random, or be randomly drawn from some distribution, independently of \(X\) and the \(M_i\)

  • The case where \(T^{(i)}_m\) are the same for several \(X^{(i)}\), and implicitly the \(M_i\) are the same too, is the so-called “common design” case

  • The case where the \(T^{(i)}_m\) are random is the so-called “independent design” case (our main focus lies here)

Measurement Errors and Design

  • The \(\varepsilon^{(i)}_m\) are measurement errors, and we allow \[\begin{equation*} \varepsilon^{(i)}_m = \sigma(T^{(i)}_m) e^{(i)}_m, \quad 1\leq m \leq M_i \end{equation*}\]

  • The \(e^{(i)}_m\) are independent copies of a centred variable \(e\) with unit variance, and \(\sigma(T^{(i)}_m)\) is some unknown bounded function which accounts for possibly heteroscedastic measurement errors

  • Our approach applies to both independent design and common design cases

  • Relative to the number of curves, \(N\), the number of points per curve, \(M_i\), may be small (“sparse”) or large (“dense”)

Estimation Grid, Batch/Online

  • Let \(\mathcal T_0\subset [0,1]\) be a set of points of interest

  • Typically, \(\mathcal T_0\) is a refined grid of equidistant points

  • We wish to estimate the following functions:

    • \(\mu(\cdot)\), \(\sigma(\cdot)\), \(f_T(\cdot)\) on \(\mathcal T_0\)

    • \(\Gamma(\cdot,\cdot)\) on \(\mathcal T_0 \times \mathcal T_0\)

  • We are also concerned with computational limitations frequently encountered in this framework (we propose a recursive solution via stochastic approximation algorithms)

  • We will be interested in updating estimates as new functional data arises (“online”) as well as performing estimation on an existing set of curves (“batch”)

Replication

  • One key distinguishing feature of FDA is that of “replication” (i.e., common structure among curves)

  • Essentially, there is prior information in the \(N-1\) sample curves that can be exploited to learn about the \(N\)th, which is not available in, say, classical regression analysis

  • This common structure can be exploited for a variety of purposes

  • For instance, it will allow us to obtain estimates of the regularity of the curves that may vary across their domain \(t\in[0,1]\) (i.e., local regularity estimates)

  • This would not be possible in the classical nonparametric setting where we are restricted to a single curve only

Local Regularity

Overview

  • “Smoothness […] such as the existence of continuous second derivatives, is often imposed for regularization and is especially useful if nonparametric smoothing techniques are employed, as is prevalent in FDA” (Wang, Chiou, and Müller 2016)

  • This is problematic since imposing an unknown (and global) degree of smoothness may be incompatible with the underlying stochastic process

  • A key feature of our approach is its data-driven locally adaptive nature

  • We consider a meaningful regularity concept for the data generating process based on probability theory

  • We propose simple estimates for local regularity and link process regularity to sample path regularity

Definition

  • A key element of our approach is “local regularity”, which here is the largest order fractional derivative admitted by the sample paths of \(X\) as measured by the value of \(H_t\), the “local Hölder exponent”, which may vary with \(t\)

  • More precisely, here “local regularity” is the largest value \(H_t\) for which, uniformly with respect to \(u\) and \(v\) in a neighborhood of \(t\), the second order moment of \((X_u-X_v)/|u-v|^{H_t}\) is finite

  • We can then assume \[\begin{equation*} \mathbb{E}\left[(X_u-X_v)^2\right]\approx L_t^2|u-v|^{2H_t} \end{equation*}\] when \(u\) and \(v\) lie in a neighborhood of \(t\)

  • If a function is smooth (i.e., continuously differentiable), then \(H_t=1\), otherwise the function is non-smooth with \(0<H_t<1\)

  • If a function is a constant function then \(L_t=0\), otherwise \(L_t>0\)

Definition

  • Consider the neighborhood \([t-\Delta_*/2, t + \Delta_*/2] \cap [0,1]\) of \(t\), and define \[\begin{align*} \theta(u,v) &= \mathbb{E}\left[ (X_u-X_v)^{2} \right], \quad\text{ hence }\\ \theta(u,v) &\approx L_t^2 |u-v|^{2H_t} \quad \text{if } |u-v| \text{ is small and } u,v \text{ are close to }t \end{align*}\]

  • Letting \(\Delta_* =2^{-1}e^{-\log(\bar M_i)^{1/3}}>0\), \(t_1=t-\Delta_*/2\), \(t_3= t + \Delta_*/2\), and \(t_2=(t_1+t_3)/2\) (the definition of \(t_1\) and \(t_3\) is adjusted in boundary regions), then we show that \[\begin{equation*} H_t \approx \frac{\log(\theta(t_1,t_3)) - \log(\theta(t_1,t_2))}{2\log(2)} \quad \text{if } |t_3-t_1| \text{ is small} \end{equation*}\]

  • Moreover, \[\begin{equation*} L_t \approx \frac{\sqrt{\theta(t_1,t_3)}}{|t_1-t_3|^{H_t} } \quad \text{if } |t_3-t_1| \text{ is small} \end{equation*}\]
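
  • To see where these expressions come from, note that \(|t_3-t_1|=\Delta_*\) and \(|t_2-t_1|=\Delta_*/2\), so under the approximation above \[\begin{equation*} \log\theta(t_1,t_3) - \log\theta(t_1,t_2) \approx \left[\log L_t^2 + 2H_t\log \Delta_*\right] - \left[\log L_t^2 + 2H_t\log (\Delta_*/2)\right] = 2H_t\log 2, \end{equation*}\] which yields the expression for \(H_t\); solving \(\theta(t_1,t_3)\approx L_t^2|t_1-t_3|^{2H_t}\) for \(L_t\) then yields the expression for \(L_t\)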

Estimation

  • The idea is to estimate \(\theta(t_1,t_3)\) and \(\theta(t_1,t_2)\) by averaging vertically over curves

  • Given estimates \(\widehat\theta(t_1,t_3)\) and \(\widehat\theta(t_1,t_2)\), the estimators of \(H_{t}\) and \(L_t\) are given by \[\begin{equation*} \widehat H_t = \frac{\log(\widehat\theta(t_1,t_3)) - \log(\widehat\theta(t_1,t_2))}{2\log(2)},\quad \widehat L_t = \frac{\sqrt{\widehat\theta(t_1,t_3)}}{|t_1-t_3|^{\widehat H_t} } \end{equation*}\]

  • The estimator \(\widehat{\theta}(t_l,t_j)\) is the average of local curve smoothers, i.e., \[\begin{equation*} \widehat{\theta}(t_l,t_j)=\frac{1}{N}\sum_{i = 1}^N\left(\widetilde X^{(i)}(t_l)-\widetilde X^{(i)}(t_j)\right)^2 \end{equation*}\]

  • The smoother \(\widetilde X^{(i)}(t)\) depends on a bandwidth \(h_t\) that, post-iteration, adapts to the local regularity of the underlying process
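
  • A minimal sketch of these two estimators (illustrative names, not the authors' code; `X_tilde` is an \(N\times 3\) matrix whose \(i\)th row holds the smoothed values \(\widetilde X^{(i)}(t_1)\), \(\widetilde X^{(i)}(t_2)\), \(\widetilde X^{(i)}(t_3)\)):

```r
# Minimal sketch: estimate H_t and L_t from smoothed curve values at t1 < t2 < t3,
# with t2 = (t1 + t3)/2. `X_tilde` (illustrative name) is an N x 3 matrix whose
# i-th row holds the smoothed values of curve i at (t1, t2, t3).
estimate_HL <- function(X_tilde, t1, t3) {
  theta_13 <- mean((X_tilde[, 1] - X_tilde[, 3])^2)   # theta-hat(t1, t3)
  theta_12 <- mean((X_tilde[, 1] - X_tilde[, 2])^2)   # theta-hat(t1, t2)
  H_hat <- (log(theta_13) - log(theta_12)) / (2 * log(2))
  L_hat <- sqrt(theta_13) / abs(t1 - t3)^H_hat
  list(H = H_hat, L = L_hat)
}
```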

Nonparametric Smoother

  • To construct \(\widehat H_t\) and \(\widehat L_t\) on the previous slide, we use a generic kernel-based smoother on de-meaned \(Y^{(i)}_m\) given by \[\begin{equation*} \widetilde X^{(i)}(t) = \sum_{m=1}^{M_i} W_{m}^{(i)}(t;h_t) \left(Y^{(i)}_m -\widetilde\mu(T^{(i)}_m)\right),\, \sum_{m=1}^{M_i}W_{m}^{(i)}(t;h_t) =1 \end{equation*}\]

  • The weights \(W_{m}^{(i)}(t;h_t)\) are functions of the elements in \(\mathcal T_{obs}^{(i)} = \left\{ T_1^{(i)},\ldots, T_{M_i}^{(i)}\right\}\), and depend on a bandwidth \(h_t\) which can vary with \(t\)

  • The main case we have in mind is local polynomial smoothing

  • In the case of non-differentiable functions (our main focus lies here), we consider the local constant NW estimator (Nadaraya 1965; Watson 1964)

Nonparametric Smoother

  • The weights of the NW estimator of \(X^{(i)}(t)\) are \[\begin{equation*} W_{m}^{(i)}(t;h_t) = K\left( \frac{T^{(i)}_m-t}{h_t} \right)\left[\sum_{m'=1}^{M_i} K\left( \frac{T^{(i)}_{m'}-t}{h_t} \right)\right]^{-1}, \quad 1\leq m \leq M_i \end{equation*}\]

  • \(K(u)\) is a symmetric, non-negative, bounded kernel with support in \([-1,1]\) (e.g., Epanechnikov 1969)

  • For estimating \(\mu(t)\) and \(\Gamma(s,t)\) we might entertain an indicator associated with the weights \(W_{m}^{(i)}(t;h)\), that is \[\begin{equation*} w^{(i)}(t;h) = 1 \quad \text{if} \quad \sum_{m=1}^{M_i} \mathbf 1 \left\{\left|T^{(i)}_m-t\right|\leq h\right\} >1, \quad \text{and } 0 \text{ otherwise} \end{equation*}\]

  • When \(w^{(i)}(t;h) = 0\) we discard the \(i\)th curve vertically as we lack information local to \(t\) (i.e., if all \(W_{m}^{(i)}(t;h)=0\), where \(0/0\coloneqq0\))
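
  • A minimal sketch of this local constant smoother with the Epanechnikov kernel (illustrative names, not the authors' code):

```r
# Minimal sketch of the local constant (NW) smoother for one curve at a point t
# (illustrative names). Tm, Ym are the observation times and noisy values for
# the curve; h is the local bandwidth.
epanechnikov <- function(u) 0.75 * (1 - u^2) * (abs(u) <= 1)

nw_smooth <- function(t, Tm, Ym, h) {
  k <- epanechnikov((Tm - t) / h)
  if (sum(k) == 0) return(NA_real_)  # no observations near t: discard this curve at t
  sum(k * Ym) / sum(k)               # weights W_m(t; h) sum to one by construction
}
```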

Assumptions

Assumptions

  • The stochastic process \(X\) is a random function taking values in \(L^2 (\mathcal T)\), with \(\mathbb E (\| X\|^2) <\infty\)

  • The process is not degenerate (i.e., its sample paths are not all equal to a single deterministic path)

  • The increments of the process have finite moments of all orders, and their distributions are sub-Gaussian

  • The functions \(X^{(i)}(t)\) may be nowhere differentiable

  • The process \(X\) may be non-stationary with non-stationary increments

  • The measurement errors \(\varepsilon^{(i)}\) may be heteroscedastic

  • The mean function \(\mu(t)\) may be smoother than the \(X^{(i)}(t)\) functions

  • \(0<L_t<\infty\) and \(0<H_t<1\)

Estimation of Local Regularity

Methodology

  • We take the optimal bandwidth expression for \(h_t\) (which depends on \(H_t\) and \(L_t\)) that minimizes pointwise MSE using a general squared bias term (not the usual term one gets assuming twice differentiable curves)

  • Then, given an initial batch of \(N\) curves, we estimate \(H_t\) and \(L_t\) for each \(t\in \mathcal T_0\), which involves an iterative plug-in procedure:

    1. begin with some starting values for the local bandwidths \(h_t\)

    2. construct preliminary estimates of each curve for every \(t\in \mathcal T_0\) using the data pairs \((Y^{(i)}_m , T^{(i)}_m)\) and local bandwidth starting values

    3. use these preliminary curve estimates to get starting values for \(H_t\) and \(L_t\) for every \(t\in \mathcal T_0\), and plug these into the optimal bandwidth expression

    4. repeat 1-3 using the updated plug-in bandwidths; continue iterating \(H_t\), \(L_t\) and \(h_t\) for every \(t\in \mathcal T_0\) until the procedure stabilizes (this occurs quite quickly, typically after 10 or so iterations; a minimal sketch of this loop follows below)
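
  • A minimal sketch of the loop just described, specialized to the Epanechnikov kernel (all names and defaults are illustrative; \(\sigma^2_t\) and \(f_T(t)\) enter as pilot estimates, and \(\Delta\) is the spacing used to form \(t_1<t_2<t_3\)):

```r
# Minimal sketch of the iterative plug-in procedure (illustrative names). Each
# element of `curves` is a list with vectors T (design points) and Y (noisy
# values); `grid` is T_0; `sigma2` and `fT` are pilot estimates of sigma_t^2 and
# f_T(t) on the grid; `Delta` is the spacing used to form t1 < t2 < t3.
epa <- function(u) 0.75 * (1 - u^2) * (abs(u) <= 1)
nw  <- function(t, Tm, Ym, h) {           # local constant smoother, as earlier
  k <- epa((Tm - t) / h)
  if (sum(k) == 0) NA_real_ else sum(k * Ym) / sum(k)
}

plugin_HL <- function(curves, grid, sigma2, fT, Delta, n_iter = 10, h0 = 0.1) {
  Mbar <- mean(vapply(curves, function(cv) length(cv$T), numeric(1)))
  h <- rep(h0, length(grid))              # step 1: bandwidth starting values
  H <- L <- rep(NA_real_, length(grid))
  for (iter in seq_len(n_iter)) {
    for (g in seq_along(grid)) {
      t1 <- max(grid[g] - Delta / 2, 0)   # adjust t1, t3 in boundary regions
      t3 <- min(grid[g] + Delta / 2, 1)
      t2 <- (t1 + t3) / 2
      # step 2: preliminary smooth of every curve at t1, t2, t3 with current h
      X123 <- t(vapply(curves, function(cv)
        c(nw(t1, cv$T, cv$Y, h[g]), nw(t2, cv$T, cv$Y, h[g]), nw(t3, cv$T, cv$Y, h[g])),
        numeric(3)))
      X123 <- X123[stats::complete.cases(X123), , drop = FALSE]
      # step 3: update H_t, L_t and plug them into the MSE-optimal bandwidth
      th13 <- mean((X123[, 1] - X123[, 3])^2)
      th12 <- mean((X123[, 1] - X123[, 2])^2)
      H[g] <- min(max((log(th13) - log(th12)) / (2 * log(2)), 0.05), 1)  # keep H in (0, 1]
      L[g] <- sqrt(th13) / abs(t3 - t1)^H[g]
      C1   <- 3 * L[g]^2 / ((2 * H[g] + 1) * (2 * H[g] + 3))  # int |u|^{2H} K(u) du term
      C2   <- 0.6 * sigma2[g]                                 # int K^2(u) du = 3/5
      h[g] <- (C2 / (2 * H[g] * C1 * fT[g] * Mbar))^(1 / (2 * H[g] + 1))
    }
  }
  list(H = H, L = L, h = h)
}
```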

Details

  • To estimate the \(i\)th curve at a point \(t\) with local Hölder exponent \(H_t\) and local Hölder constant \(L_t\), the MSE-optimal bandwidth \(h^*_{t,HL}\) is \[\begin{equation*} h^*_{t,HL} = \left[ \frac{\sigma_t^2 \int K^2(u)du }{2H_t L_t^2\times \int |u|^{2H_t}|K(u)|du\times f_T(t)}\times \frac{1}{\bar M_i} \right]^{\frac{1}{2H_t+1}} \end{equation*}\]

  • The kernel function \(K(u)\) is provided by the user hence \(\int K^2(u)du\) and \(\int |u|^{2H_t}|K(u)|du\) can be computed given \(H_t\)

  • \(\sigma_t^2\) is estimated using one-half the squared differences of the two closest \(Y^{(i)}\) observations at \(t\), averaged across all curves

  • The design density \(f_T(t)\) is straightforward to estimate

  • We estimate \(H_t\) and \(L_t\) for some batch of \(N\) curves as outlined on the previous slide, then we recursively update them as online data arrives

MfBm Example (independent design)

Figure 7: Multifractional Brownian Motion - True Curves

MfBm Example (independent design)

Figure 8: Multifractional Brownian Motion - Data

MfBm Example (independent design)

Figure 9: Multifractional Brownian Motion - Bandwidth Iteration

MfBm Example (independent design)

Figure 10: Multifractional Brownian Motion - Local Hölder Exponents Iteration

MfBm Example (independent design)

Figure 11: Multifractional Brownian Motion - Estimated and True Regularity

Berkeley Growth Study Example (common design)

Figure 12: Berkeley Growth Example - Estimated Regularity

Canadian Weather Study Example (common design)

Figure 13: Canadian Weather Example - Estimated Regularity

\(H_t\) Comparison: growth, canWeather, MfBm

Figure 14: Estimated Regularity Comparison

Estimation of \(\mu(t)\) and \(\Gamma(s,t)\)

Estimation of \(\mu(t)\) and \(\Gamma(s,t)\)

  • In order to estimate \(\mu(t)\) at point \(t\) and \(\Gamma(s,t)\) at points \(s\) and \(t\), ideally we would use the (unknown, continuous) curves evaluated at these points, hence the ideal estimators given \(N\) curves would be \[\begin{align*} \widehat \mu(t)&=\frac{1}{N}\sum_{i=1}^N X^{(i)}(t)\\ \widehat \Gamma(s,t)&=\frac{1}{N}\sum_{i=1}^N \left(X^{(i)}(s)-\mu(s)\right)\left(X^{(i)}(t)-\mu(t)\right) \end{align*}\]

  • Of course, these are infeasible as we don’t observe the true curves, we observe \(M_i\) noisy sample pairs for each curve \(X^{(i)}\) measured at random points

  • That is, we observe \(Y^{(i)}_m=X^{(i)}(T_m^{(i)})+\varepsilon^{(i)}_m\) at discrete irregularly spaced \(T_m^{(i)}\), i.e., we observe vectors of pairs \((Y^{(i)}_m,T_m^{(i)})\) of length \(M_i\), \(i=1,\dots,N\)

Estimation of \(\mu(t)\) and \(\Gamma(s,t)\) Cont.

  • Our estimates of \(\mu(t)\) and \(\Gamma(s,t)\), like those of \(X^{(i)}\), are local in nature and adapt to \(H_t\) and \(L_t\)

  • It can be shown that to estimate \(\mu(t)\) and \(\Gamma(s,t)\) at points \(s\) and \(t\) with local Hölder exponent \(H_t\) and local Hölder constant \(L_t\), the MSE-optimal bandwidths are given by \[\begin{align*} h^*_{t,\mu} &= \left[ \frac{\sigma_t^2 \int K^2(u)du }{2H_t L_t^2\times \int |u|^{2H_t}|K(u)|du\times f_T(t)}\times \frac{1}{N\bar M_i} \right]^{\frac{1}{2H_t+1}},\\ h^*_{t,\Gamma} &= \left[ \frac{\sigma_t^2 \int K^2(u)du }{4H_t L_t^2\times \int |u|^{2H_t}|K(u)|du\times f_T(t)}\times \frac{1}{N\bar M^2_i} \right]^{\frac{1}{2H_t+2}} \end{align*}\] (the expression shown for \(h^*_{t,\Gamma}\) is the “sparse”-case bandwidth, applicable when \(N\) exceeds \(M\))

  • Using estimates of \(H_t\), \(L_t\), \(\sigma_t\) and \(f_T(t)\) from the batch of \(N\) curves, we smooth \(X^{(i)}(s)\) and \(X^{(i)}(t)\) using \(\widehat h_{t,\mu}\) and \(\widehat h_{t,\Gamma}\) to construct \(\widehat\mu(t)\) and \(\widehat\Gamma(s,t)\) (here we use, e.g., \(\widehat X^{(i)}(t) = \sum_{m=1}^{M_i} W_{m}^{(i)}(t;\widehat h_{t,\Gamma}) Y^{(i)}_m\))
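
  • Given a matrix of curves smoothed on the grid with \(\widehat h_{t,\mu}\) (for the mean) or \(\widehat h_{t,\Gamma}\) (for the covariance), the pointwise estimators reduce to simple averages; a minimal sketch (illustrative names, ignoring for brevity the indicator \(w^{(i)}(t;h)\) that discards curves lacking local information):

```r
# Minimal sketch (illustrative names): `X_hat` is an N x |T_0| matrix of curves
# smoothed on the grid with h_mu (for the mean) or h_Gamma (for the covariance).
estimate_mu <- function(X_hat) colMeans(X_hat)

estimate_Gamma <- function(X_hat) {
  Xc <- sweep(X_hat, 2, colMeans(X_hat))  # de-mean each grid point
  crossprod(Xc) / nrow(X_hat)             # |T_0| x |T_0| covariance surface
}
```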

Example: Estimation of \(\Gamma(s,t)\)

Figure 15: Berkeley Growth

Example: Estimation of \(\Gamma(s,t)\)

Figure 16: Canadian Weather

Example: Estimation of \(\Gamma(s,t)\)

Figure 17: Multifractional Brownian Motion

Curve Recovery

Curve Recovery

  • Sometimes one needs to estimate a curve, e.g., a new online sample arrives and one is interested in estimating the (unknown) new online curve at some point \(t_0\in[0,1]\)

  • We wish to build an estimator of \(X^{(i)}(t_0)\) making use of information in the noisy measurements \((Y^{(i)}_m,T^{(i)}_m)\) of \(X^{(i)}\), and can do so by exploiting information provided by the mean and covariance functions of the process \(X\)

  • Recall that \(\mathcal T_{obs}^{(i)}\) represents the set of design points where the curve \(X^{(i)}\) is measured with error (\(t_0\) can be an element of \(\mathcal T_{obs}^{(i)}\), or not)

  • Let \(\mathbb{Y}^{(i)}=(Y^{(i)}_1,\dots,Y^{(i)}_{M_i})^T\), and \(\mathbb{X}^{(i)}=(X^{(i)}(T^{(i)}_1),\dots,X^{(i)}(T^{(i)}_{M_i}))^T\)

Curve Recovery Cont.

  • We use the following vectors and matrices determined by the mean and covariance structure of \(X\) and the noise variance: \[\begin{align*} \mu^{(i)}=\left(\mu(T^{(i)}_m)\right)_{1\le m\le M_i},&\qquad \Gamma^{(i)}=\left(\Gamma(T^{(i)}_m,T^{(i)}_{m'})\right)_{1\le m,m'\le M_i},\\\text{and}\quad \Sigma^{(i)}&=\text{diag}\left(\sigma^2(T^{(i)}_m)\right)_{1\le m\le M_i} \end{align*}\]

  • The type of prediction we consider has the form (\(\widetilde X_{inf}^{(i)}(t_0)\) is infeasible) \[\begin{equation*} \widetilde X_{inf}^{(i)}(t_0)=\mu(t_0)+\beta_{t_0}^T\left\{\mathbb{Y}^{(i)}-\mu^{(i)}\right\} \end{equation*}\]

  • Here, \(\beta_{t_0}\) is a theoretical quantity which depends on unknown quantities, but a feasible version can be readily constructed by noting that \[\begin{equation*} \beta_{t_0}=\operatorname{Var}^{-1}_{M,T}\left(\mathbb{Y}^{(i)}\right)\operatorname{Cov}_{M,T}\left(\mathbb{Y}^{(i)},X^{(i)}(t_0)\right) \end{equation*}\]

Curve Recovery Cont.

  • It can be shown that \[\begin{equation*} \beta_{t_0}=\left(\Gamma^{(i)}+\Sigma^{(i)}\right)^{-1}\Gamma^{(i)}_{t_0}, \quad\text{where}\quad \Gamma^{(i)}_{t_0}=\left(\Gamma(T^{(i)}_m,t_0)\right)_{1\le m\le M_i} \end{equation*}\]

  • Of course, this estimator is infeasible, but given existing estimates of \(h\), \(H\), \(L\), \(\mu\), \(\Gamma\) and \(\sigma\), we have a feasible estimator given by \[\begin{equation*} \widetilde X^{(i)}(t_0)=\widehat\mu(t_0)+\widehat\beta_{t_0}^T\left\{\mathbb{Y}^{(i)}-\widehat\mu^{(i)}\right\}\text{ where } \widehat\beta_{t_0}=\left(\widehat\Gamma^{(i)}+\widehat\Sigma^{(i)}\right)^{-1} \widehat\Gamma^{(i)}_{t_0} \end{equation*}\]

  • We expect (and simulations underscore) that, given a sufficient number of curves, this estimator can outperform in some settings the individual local curve estimates used to construct \(H_t\) and \(L_t\) (i.e., the \(\widehat X^{(i)}(t)\) considered previously)

  • We also propose a “combined” estimator that is a data-driven convex combination of the reconstructed estimator \(\widetilde X^{(i)}(t_0)\) and the individual curve estimator \(\widehat X^{(i)}(t_0)\)
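
  • A minimal sketch of the feasible predictor at \(t_0\) (illustrative argument names, not the authors' code), computed directly from the plug-in estimates evaluated at the curve's design points:

```r
# Minimal sketch of the feasible curve-recovery predictor at t0 (illustrative names):
#   Y        - observed noisy values for the curve (length M_i)
#   mu_i     - mu-hat evaluated at the curve's design points T_m
#   Gamma_i  - M_i x M_i matrix Gamma-hat(T_m, T_m')
#   Sigma_i  - M_i x M_i diagonal matrix of sigma-hat^2(T_m)
#   Gamma_t0 - vector Gamma-hat(T_m, t0); mu_t0 - scalar mu-hat(t0)
recover_curve <- function(Y, mu_i, Gamma_i, Sigma_i, Gamma_t0, mu_t0) {
  beta <- solve(Gamma_i + Sigma_i, Gamma_t0)  # beta-hat_{t0}
  mu_t0 + sum(beta * (Y - mu_i))
}
```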

Example: Curve Recovery Versus Smooth Estimation

Figure 18: MfBm Curve Reconstruction

Online Recursive Updating

Online Recursive Updating

  • As \(N\) increases batch computation can become infeasible

  • Furthermore, new online curves may arrive in real-time

  • We may need to update our batch estimates in real-time, i.e., our estimates of \(H_t\), \(L_t\), \(h_{t,HL}\), \(h_{t,\mu}\), \(h_{t,\Gamma}\), \(\mu(t)\), \(\sigma(t)\), \(f_T(t)\), and \(\Gamma(s,t)\)

  • We shall rely on recursive methods, i.e., “stochastic approximation” algorithms, which date to Robbins and Monro (1951) and Kiefer and Wolfowitz (1952)

  • The challenge is a) to avoid batch computation when the number of curves gets unmanageably large, and b) to process online curves in real-time with negligible memory and computational overhead

  • Though recursive updating will necessarily be sub-optimal, if done properly the efficiency loss will be negligible and can be controlled (batch computation on all curves would be optimal but infeasible)

Online Recursive Updating of Local Hölder Exponent

  • Given observations from a new online curve \(X^{(i+1)}\), let the local bandwidth \(\widehat h_{t,HL}\) be that based on the estimates of \(H_t\) and \(L_t\) for recursion \(i\) (call this \(\widehat h_{t,HL}^{(i)}\))

  • Let \(\gamma_{i+1}=(\sum_{j=1}^iM_j)/(\sum_{j=1}^iM_j+M_{i+1})\) (which equals \(i/(i+1)\) if \(M_1=\dots=M_{i+1}\)), and let \(\widehat{\theta}_i(u,v)=\frac{1}{i}\sum_{j = 1}^i\left\{\widehat X^{(j)}(u)-\widehat X^{(j)}(v)\right\}^2\)

  • The recursively updated estimator of \(\theta(u,v)\) using \(\widehat h_{t,HL}^{(i)}\) is given by \[\begin{equation*} \widehat\theta_{i+1}(u,v) = \gamma_{i+1} \widehat\theta_{i}(u,v)+ (1-\gamma_{i+1}) \left\{\widehat{X}^{(i+1)}(u) - \widehat{X}^{(i+1)}(v)\right\}^2 \end{equation*}\]

  • The recursively updated estimators of \(H_t\) and \(L_t\) are then obtained via \[\begin{equation*} \widehat H_{t,i+1} = \frac{\log\big( \widehat\theta_{i+1}(t_1,t_3)\big) - \log \big(\widehat\theta_{i+1}(t_1,t_2)\big)}{2\log (2)},\quad\widehat L_{t,i+1} = \frac{\sqrt{\widehat\theta_{i+1}(t_1,t_3)}}{|t_1-t_3|^{\widehat H_{t,i+1}}} \end{equation*}\]
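
  • A minimal sketch of one such online update (illustrative names; the new curve is assumed to have been smoothed at \(t_1,t_2,t_3\) with the bandwidth \(\widehat h_{t,HL}^{(i)}\)):

```r
# Minimal sketch of one online update of theta, H_t and L_t (illustrative names).
# `state` holds theta13 = theta-hat_i(t1, t3) and theta12 = theta-hat_i(t1, t2);
# x1, x2, x3 are the new curve smoothed at t1, t2, t3; gamma is the weight
# (sum_j M_j) / (sum_j M_j + M_{i+1}).
update_HL <- function(state, x1, x2, x3, gamma, t1, t3) {
  state$theta13 <- gamma * state$theta13 + (1 - gamma) * (x1 - x3)^2
  state$theta12 <- gamma * state$theta12 + (1 - gamma) * (x1 - x2)^2
  state$H <- (log(state$theta13) - log(state$theta12)) / (2 * log(2))
  state$L <- sqrt(state$theta13) / abs(t1 - t3)^state$H
  state
}
```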

Online Recursive Updating of Local Hölder Exponent

  • Updated estimators of \(\sigma(t)\) and \(f_T(t)\) are recursively computable

  • The recursively updated estimators of \(h_{t,HL}\), \(h_{t,\mu}\), \(h_{t,\Gamma}\) can then be computed (recall they depend on \(H_t\), \(L_t\), \(\sigma(t)\), and \(f_T(t)\)), and call these \(\widehat h_{t,HL}^{(i+1)}\), \(\widehat h_{t,\mu}^{(i+1)}\), and \(\widehat h_{t,\Gamma}^{(i+1)}\)

  • Using \(\widehat h_{t,\mu}^{(i+1)}\) or \(\widehat h_{t,\Gamma}^{(i+1)}\) we can estimate \(X^{(i+1)}(t)\) and recursively update the estimators of \(\mu(t)\) or \(\Gamma(s,t)\), again using \(\gamma_{i+1}\) and Robbins and Monro (1951) (\(\widehat h_{t,HL}^{(i+1)}\) will start the next recursion for \(X^{(i+2)}(t)\), etc.)

  • The “memory footprint” is determined by the grid \(\mathcal T_0\subset [0,1]\) (e.g., 100 equidistant points) since we construct estimates of \(\mu(\cdot)\), \(\sigma(\cdot)\), \(f_T(\cdot)\) on \(\mathcal T_0\) and estimates of \(\Gamma(\cdot,\cdot)\) on \(\mathcal T_0 \times \mathcal T_0\)

  • Thus, our procedures require only that we retain and update vectors of length \(|\mathcal T_0|\) and matrices of dimension \(|\mathcal T_0| \times |\mathcal T_0|\)
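
  • A minimal sketch of the corresponding online updates of \(\mu(t)\) and \(\Gamma(s,t)\) on the grid, using the same weight \(\gamma_{i+1}\) (illustrative names; the exact centring used in the paper may differ, this is generic Robbins-Monro-type averaging):

```r
# Minimal sketch of the online mean and covariance updates on the grid T_0
# (illustrative names). `x_new` is the new curve smoothed on the grid with the
# updated bandwidth; `gamma` is the same weight as before.
update_mu <- function(mu, x_new, gamma) gamma * mu + (1 - gamma) * x_new

update_Gamma <- function(Gamma, mu, x_new, gamma) {
  d <- x_new - mu                          # deviation from the current mean estimate
  gamma * Gamma + (1 - gamma) * tcrossprod(d)
}
```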

Online Recursive Updating Clip

Recursive Updating: Finite-Sample Performance

Table 1 summarizes RMSE performance for 1,000 Monte Carlo replications from a multifractional Brownian motion with covariance \(\Gamma(s,t)\) based on \(\tau=1\), \(L_t=1\), \(H_t\in[0.25,0.75]\), \(\sigma=0.25\), with \(\mu(t)=\sin(2\pi t)\), \(t\sim U[0,1]\), independent design with \(M_i=M\), \(N=16000\) online curves, an initial batch of 5 curves to “burn in” values for \(H_t\) and \(L_t\), and a grid of 100 equidistant points for \(\mathcal T_0\subset[0,1]\)

Table 1: Mean RMSE after final recursion
| \(M\) | \(\widehat h_{H_t,L_t}\) | \(\widehat H_t\) | \(\widehat X^{(i)}\) | \(\widehat\mu(t)\) | \(\widehat\Gamma(s,t)\) |
|---:|---:|---:|---:|---:|---:|
| 100 | 0.0108 | 0.1479 | 0.1899 | 0.0202 | 0.0495 |
| 200 | 0.0076 | 0.1395 | 0.1622 | 0.0186 | 0.0425 |
| 400 | 0.0051 | 0.1172 | 0.1384 | 0.0167 | 0.0360 |
| 800 | 0.0032 | 0.0860 | 0.1185 | 0.0149 | 0.0309 |

Table 2 summarizes the finite sample RMSE performance of two \(\Gamma(s,t)\) estimators that either a) discard degenerate points vertically (“skip”) or b) use the recovered curves based on the previous estimate of \(\Gamma(s,t)\) (“lp”), as the number of online curves processed increases through \(N=1000, 2000, \ldots, 16000\), with \(M=200\)

Table 2: Mean RMSE after final recursion
| \(N\) | \(\widehat\Gamma(s,t)_{\text{skip}}\) | \(\widehat\Gamma(s,t)_{\text{lp}}\) |
|---:|---:|---:|
| 1000 | 0.1006 | 0.0596 |
| 2000 | 0.0799 | 0.0416 |
| 4000 | 0.0642 | 0.0297 |
| 8000 | 0.0524 | 0.0216 |
| 16000 | 0.0425 | 0.0152 |

Summary and Appendices

Summary

  • Though this project has a lot of moving parts, we demonstrate that the approach can deliver consistent, computationally feasible FDA

  • Most importantly, our method is data-driven and locally adaptive to the regularity of the stochastic process

  • We support both batch estimation and online updating using computationally attractive approaches

  • Though not mentioned explicitly so far, the pointwise asymptotics for the individual curve, mean curve, and covariance curve estimates are established and can be used to construct interval estimates etc.

  • R code exists and we expect to be releasing this publicly in the near future

Appendix A: Resources

Appendix B: MfBm Gaussian Processes

  • A Multifractional Brownian Motion (MfBm, Peltier and Lévy Véhel 1995), say \((W(t))_{t\geq 0}\), with Hurst index function \(H_t \in(0,1)\), is a centred Gaussian process with covariance function \[\begin{equation*} C(s,t) = \mathbb{E}\left[W(s)W(t)\right] = D(H_s,H_t )\left[ s^{H_s+H_t} + t^{H_s+H_t} - |t-s|^{H_s+H_t}\right],\, s, t\geq 0, \end{equation*}\] where \[\begin{equation*} D(x,y)=\frac{\sqrt{\Gamma (2x+1)\Gamma (2y+1)\sin(\pi x)\sin(\pi y)}} {2\Gamma (x+y+1)\sin(\pi(x+y)/2)}, \, D(x,x) = 1/2,\, x,y >0 \end{equation*}\]

  • Define the Hurst index function given \(0 <\underline H \leq \overline H <1\), a change point \(t_c \in (0,1)\), and a slope \(S>0\), by \[\begin{equation*} H_t = \underline H + \frac{\overline H - \underline H}{1+\exp(- S (t-t_c))}, \qquad t\in [0,1] \end{equation*}\]

Appendix B: MfBm Gaussian Processes Cont.

  • The MfBm data are then generated as follows (Chan and Wood 1998)

    • for each \(i\), an integer \(M_i\) is generated as a realization of some random variable with mean \(\mathfrak m\) (e.g., Poisson), or \(M_i\) could be a constant

    • next, generate \(M_i\) independent draws \(T_1^{(i)},\ldots,T_{M_i}^{(i)}\) from a uniform random variable on \([0,1]\)

    • using the covariance formula above, the \(M_i\times M_i\) matrix \(C^{(i)}\) with the entries \[\begin{equation*} C^{(i)}(T_m^{(i)}, T_{m^\prime}^{(i)}),\quad 1\leq m,m^\prime \leq M_i, \end{equation*}\] is computed

  • The \(M_i\)-dimensional vector with components \(X^{(i)}(T_m^{(i)})\), \(1\leq m \leq M_i\), is a realization of a zero-mean Gaussian distribution with covariance matrix \(C^{(i)}\)

  • While the increments for Brownian Motion (i.e., when \(H_t=1/2\)) are stationary and independent, increments for MfBm are neither
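
  • A minimal sketch of this simulation (illustrative function names; the change point \(t_c\) and slope \(S\) below are placeholder values):

```r
# Minimal sketch of simulating one MfBm sample path on an independent design
# (illustrative names and placeholder parameter values).
D_fun <- function(x, y) {
  if (isTRUE(all.equal(x, y))) return(0.5)           # D(x, x) = 1/2
  sqrt(gamma(2 * x + 1) * gamma(2 * y + 1) * sin(pi * x) * sin(pi * y)) /
    (2 * gamma(x + y + 1) * sin(pi * (x + y) / 2))
}
H_fun <- function(t, H_lo = 0.25, H_hi = 0.75, t_c = 0.5, S = 10)
  H_lo + (H_hi - H_lo) / (1 + exp(-S * (t - t_c)))    # logistic Hurst index function
C_fun <- function(s, t) {
  Hs <- H_fun(s); Ht <- H_fun(t)
  D_fun(Hs, Ht) * (s^(Hs + Ht) + t^(Hs + Ht) - abs(t - s)^(Hs + Ht))
}
simulate_mfbm <- function(M, sigma = 0.25) {
  Tm <- sort(runif(M))                                # independent design points
  C  <- outer(Tm, Tm, Vectorize(C_fun))               # M x M covariance matrix
  X  <- drop(crossprod(chol(C + 1e-10 * diag(M)), rnorm(M)))  # zero-mean Gaussian draw
  list(T = Tm, X = X, Y = X + sigma * rnorm(M))       # add measurement noise
}
```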

Appendix C: (Non-)Smooth, (Ir)Regular

  • We use the terms “smooth” and “regular”, and their opposites “non-smooth” and “irregular”, throughout, so we provide some background that might be helpful

  • We use “smoothness” of a function in the standard sense, i.e.,

    • smoothness is measured by the number of continuous derivatives over some domain (called the “differentiability class”)

    • at minimum, a function is considered smooth if it is “differentiable everywhere” (hence continuous)

    • at the other extreme, if it also possesses continuous derivatives of all orders it is said to be “infinitely differentiable” and is often referred to as a “\(C^{{\infty }}\) function”

  • Thus, “non-smooth” functions are not differentiable everywhere, and in the extreme may be “nowhere differentiable”

Appendix C: (Non-)Smooth, (Ir)Regular Cont.

  • We use the term “irregular” to refer to non-smooth functions whose “Hölder continuity exponent” varies over some domain

  • We have in mind Hölder continuity where the “Hölder exponent” \(H\) defines the regularity of the function (also called its “Hurst” index)

  • But, in addition, we have in mind that such regularity might vary over \(t\) hence we write \(H_t\) and call this the “local Hölder exponent”

  • The Hölder continuity condition you may be familiar with is \[|X_u-X_v|\le L|u-v|^H,\quad 0<L<\infty,\quad 0<H< 1\]

  • So taking the square we have \[|X_u-X_v|^2\le L^2|u-v|^{2H}\]

Appendix C: (Non-)Smooth, (Ir)Regular Cont.

  • Now, a sample path of \(X\) is just one realization of a stochastic process, so the condition is stated with an expectation over all realizations; and since we want to conduct statistical analysis we cannot work with the inequality \(\le\), we must instead work with \(\approx\)

  • Near-equality is needed in order to obtain estimates of \(H\) and \(L\) (an inequality alone is not enough), but the good news is that almost all processes found in probability texts do satisfy the almost-equal condition we use (e.g., derivatives, squares, and log(1+…) transformations of Gaussian processes satisfy it with equality), so in fact there is little loss of generality and the condition is not restrictive

  • Furthermore, we adopt an augmented Hölder continuity condition and allow \(H_t\) and \(L_t\) to vary on \(t\in[0,1]\), thus we work with \[\mathbb{E}\left[(X_u-X_v)^2\right]\approx L_t^2|u-v|^{2H_t},\quad u,v\text{ in some neighborhood of } t\]

Appendix D: Pointwise Local Curve Properties

  • The Hölder-class MSE of one kernel-smoothed curve at a point \(t\), where \(M_i\) is the number of observations for the \(i\)th curve, satisfies \[MSE_i\leq C_1(t)h^{2H_t} + \frac{C_2(t)}{M_i f_T(t) h}\]

  • The constants in the above formula are \(C_1(t)=L_t^2\int |u|^{2H_t}|K(u)|du\) and \(C_2(t)=\sigma^2_t \int K^2(u)du\), where \(K(u)\) is the kernel function

  • Given this, the MSE-optimal bandwidth for curve \(i\) at point \(t\) is \[h^*_t=\left[\frac{C_2(t)}{2H_tC_1(t)M_i f_T(t)}\right]^{\frac{1}{2H_t+1}}\]

  • Using the Epanechnikov kernel given by \(\frac{3}{4}(1-u^2)\) on \([-1,1]\), it can be shown that \(C_1(t)=\frac{3L_t^2}{[2H_t+1][2H_t+3]}\) and \(C_2(t)=\frac{3}{5}\sigma^2_t\)
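
  • These constants follow directly from the kernel moments: for the Epanechnikov kernel, \[\begin{align*} \int K^2(u)\,du &= \frac{9}{8}\int_0^1 (1-u^2)^2\,du = \frac{9}{8}\cdot\frac{8}{15} = \frac{3}{5},\\ \int |u|^{2H_t}|K(u)|\,du &= \frac{3}{2}\int_0^1 u^{2H_t}(1-u^2)\,du = \frac{3}{2}\left[\frac{1}{2H_t+1}-\frac{1}{2H_t+3}\right] = \frac{3}{(2H_t+1)(2H_t+3)}, \end{align*}\] so \(C_1(t)=L_t^2\int |u|^{2H_t}|K(u)|\,du\) and \(C_2(t)=\sigma^2_t\int K^2(u)\,du\) reduce to the expressions above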

Appendix E: Estimation of \(\sigma^2_t\)

  • To construct a consistent estimator of \(\sigma^2_t\) appearing in the MSE-optimal bandwidth formula, we use the following expression: \[\begin{equation*} \widehat\sigma^2_t=\frac{1}{2N}\sum_{i=1}^N\left(Y^{(i)}_{m(t)}-Y^{(i)}_{m(t-1)}\right)^2 \end{equation*}\]

  • To obtain this expression, let \(m(t)\) denote the index of the design point closest to \(t\) (i.e., \(T^{(i)}_{m(t)}\) is the closest sample design point to \(t\), \(T^{(i)}_{m(t-1)}\) the second closest, etc.)

  • Recalling that \(\varepsilon^{(i)}_m = \sigma(T^{(i)}_m) e^{(i)}_m\), we can express \((Y^{(i)}_{m(t)}-Y^{(i)}_{m(t-1)})\) in the expression above as follows: \[\begin{align*} Y^{(i)}_{m(t)}-Y^{(i)}_{m(t-1)}&=X^{(i)}(T^{(i)}_{m(t)})+\varepsilon^{(i)}_{m(t)}-X^{(i)}(T^{(i)}_{m(t-1)})-\varepsilon^{(i)}_{m(t-1)}\\ &=\left(X^{(i)}(T^{(i)}_{m(t)})-X^{(i)}(T^{(i)}_{m(t-1)})\right)+\left(\varepsilon^{(i)}_{m(t)}-\varepsilon^{(i)}_{m(t-1)}\right) \end{align*}\]
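
  • A minimal sketch of this estimator (illustrative names; each element of `curves` holds the vectors \(T^{(i)}_m\) and \(Y^{(i)}_m\) for one curve):

```r
# Minimal sketch of the noise-variance estimator at a point t (illustrative names).
# Each element of `curves` is a list with vectors T (design points) and Y (noisy values).
estimate_sigma2 <- function(curves, t) {
  d2 <- vapply(curves, function(cv) {
    o <- order(abs(cv$T - t))[1:2]   # the two design points closest to t
    diff(cv$Y[o])^2                  # squared difference of the corresponding Y's
  }, numeric(1))
  mean(d2) / 2                       # one-half, averaged vertically over curves
}
```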

Appendix E: Estimation of \(\sigma^2_t\) Cont.

  • Given continuity of \(X^{(i)}(T^{(i)}_{m(t)})\), the first term on line 2 is negligible

  • Given this, and the independence of the \(e^{(i)}_{m(t)}\), we have \[\begin{align*} \mathbb{E}\left(Y^{(i)}_{m(t)}-Y^{(i)}_{m(t-1)}\right)^2 &\approx\mathbb{E}\left(\varepsilon^{(i)}_{m(t)}-\varepsilon^{(i)}_{m(t-1)}\right)^2=2\sigma^2(t), \end{align*}\] which leads to the expression \(\widehat\sigma^2_t=\frac{1}{2N}\sum_{i=1}^N\left(Y^{(i)}_{m(t)}-Y^{(i)}_{m(t-1)}\right)^2\)

  • A similar approach has been used in a classical homoscedastic nonparametric setting (Horowitz and Spokoiny 2001, Equation (2.9))

  • We modify this for use in an FDA setting, and by exploiting “replication” (i.e., by averaging vertically over all curves at point \(t\)) we obtain a simple method for computing the measurement noise variance that allows for heteroscedasticity of unknown form

References (scrollable)

Belhakem, Ryad, Franck Picard, Vincent Rivoirard, and Angelina Roche. 2021. “Minimax Estimation of Functional Principal Components from Noisy Discretized Functional Data.” https://doi.org/10.48550/arXiv.2110.12739.
Cai, T. Tony, Mark Low, and Zongming Ma. 2014. “Adaptive Confidence Bands for Nonparametric Regression Functions.” Journal of the American Statistical Association 109 (507): 1054–70.
Cai, T. Tony, and Ming Yuan. 2011. Optimal estimation of the mean function based on discretely sampled functional data: Phase transition.” The Annals of Statistics 39 (5): 2330–55. https://doi.org/10.1214/11-AOS898.
Chan, Grace, and Andrew T. A. Wood. 1998. “Simulation of Multifractional Brownian Motion.” In COMPSTAT, edited by Roger Payne and Peter Green, 233–38. Heidelberg: Physica-Verlag HD.
Epanechnikov, V. A. 1969. “Nonparametric Estimation of a Multidimensional Probability Density.” Theory of Probability and Its Applications 14: 153–58.
Febrero-Bande, Manuel, and Manuel Oviedo de la Fuente. 2012. “Statistical Computing in Functional Data Analysis: The R Package fda.usc.” Journal of Statistical Software 51 (4): 1–28. https://www.jstatsoft.org/v51/i04/.
Giné, Evarist, and Richard Nickl. 2010. Confidence bands in density estimation.” The Annals of Statistics 38 (2): 1122–70. https://doi.org/10.1214/09-AOS738.
Gloter, Arnaud, and Marc Hoffmann. 2007. Estimation of the Hurst parameter from discrete noisy data.” The Annals of Statistics 35 (5): 1947–74. https://doi.org/10.1214/009053607000000316.
Golovkine, Steven, Nicolas Klutchnikoff, and Valentin Patilea. 2022. Learning the smoothness of noisy curves with application to online curve estimation.” Electronic Journal of Statistics 16 (1): 1485–1560. https://doi.org/10.1214/22-EJS1997.
Hall, Peter, and Joel L. Horowitz. 2007. “Methodology and Convergence Rates for Functional Linear Regression.” The Annals of Statistics 35 (1): 70–91.
Horowitz, Joel L., and Vladimir G. Spokoiny. 2001. “An Adaptive, Rate-Optimal Test of a Parametric Mean-Regression Model Against a Nonparametric Alternative.” Econometrica 69 (3): 599–631.
Horváth, Lajos, and Piotr Kokoszka. 2012. Inference for Functional Data with Applications. Vol. 200. Springer Series in Statistics. https://doi.org/10.1007/978-1-4614-3655-3.
Kiefer, J., and J. Wolfowitz. 1952. Stochastic Estimation of the Maximum of a Regression Function.” The Annals of Mathematical Statistics 23 (3): 462–66. https://doi.org/10.1214/aoms/1177729392.
Kokoszka, P., and M. Reimherr. 2017. Introduction to Functional Data Analysis. 1st ed. Chapman & Hall/CRC. https://doi.org/10.1201/9781315117416.
Lepski, O. V., E. Mammen, and V. G. Spokoiny. 1997. Optimal spatial adaptation to inhomogeneous smoothness: an approach based on kernel estimates with variable bandwidth selectors.” The Annals of Statistics 25 (3): 929–47. https://doi.org/10.1214/aos/1069362731.
Levitin, Daniel J., Regina L. Nuzzo, Bradley W. Vines, and James O. Ramsay. 2007. “Introduction to Functional Data Analysis.” Canadian Psychology 48: 135–55.
Nadaraya, E. A. 1965. “On Nonparametric Estimates of Density Functions and Regression Curves.” Theory of Probability and Its Applications 10: 186–90.
Padilla-Segarra, Adrián, Mabel González-Villacorte, Isidro R. Amaro, and Saba Infante. 2020. “Brief Review of Functional Data Analysis: A Case Study on Regional Demographic and Economic Data.” In Information and Communication Technologies, edited by Germania Rodriguez Morales, Efraín R. Fonseca C., Juan Pablo Salgado, Pablo Pérez-Gosende, Marcos Orellana Cordero, and Santiago Berrezueta, 163–76. Cham: Springer International Publishing.
Peltier, Romain-François, and Jacques Lévy Véhel. 1995. Multifractional Brownian Motion : Definition and Preliminary Results.” Research Report RR-2645. INRIA. https://hal.inria.fr/inria-00074045.
Ramsay, J. O., Spencer Graves, and Giles Hooker. 2022. fda: Functional Data Analysis. https://CRAN.R-project.org/package=fda.
Ramsay, J. O., and B. W. Silverman. 2005. Functional Data Analysis. 2nd ed. New York: Springer-Verlag.
Ray, Kolyan. 2017. Adaptive Bernstein–von Mises theorems in Gaussian white noise.” The Annals of Statistics 45 (6): 2511–36. https://doi.org/10.1214/16-AOS1533.
Reiss, Philip T., Jeff Goldsmith, Han Lin Shang, and R. Todd Ogden. 2017. “Methods for Scalar-on-Function Regression.” International Statistical Review / Revue Internationale de Statistique 85 (2): 228–49. http://www.jstor.org/stable/44840886.
Robbins, Herbert, and Sutton Monro. 1951. A Stochastic Approximation Method.” The Annals of Mathematical Statistics 22 (3): 400–407. https://doi.org/10.1214/aoms/1177729586.
Tian, Renfang. 2020. “On Functional Data Analysis: Methodologies and Applications.” PhD thesis, UWSpace. http://hdl.handle.net/10012/15811.
Wang, Jane-Ling, Jeng-Min Chiou, and Hans-Georg Müller. 2016. “Functional Data Analysis.” Annual Review of Statistics and Its Application 3 (1): 257–95. https://doi.org/10.1146/annurev-statistics-041715-033624.
Watson, G. S. 1964. “Smooth Regression Analysis.” Sankhya 26 (15): 175–84.
Zhou, Yidong, Satarupa Bhattacharjee, Cody Carroll, Yaqing Chen, Xiongtao Dai, Jianing Fan, Alvaro Gajardo, et al. 2022. fdapace: Functional Data Analysis and Empirical Dynamics. https://CRAN.R-project.org/package=fdapace.