ENSAI & CREST\(^\dagger\), valentin.patilea@ensai.fr
McMaster University\(^\ddagger\), racinej@mcmaster.ca
Tuesday, June 25, 2024
Link to slides - jeffreyracine.github.io/Braga (case sensitive, Google Translate)
Link to paper - https://ideas.repec.org/p/mcm/deptwp/2024-04.html
View full screen by pressing the F key (press the Esc key to revert)
Access navigation menu by pressing the M key (click X in navigation menu to close)
Advance using arrow keys
Zoom in by holding down the Alt key in MS Windows, Opt key in macOS or Ctrl key in Linux, and clicking on any screen element (Alt/Opt/Ctrl click again to zoom out)
Export to a PDF by pressing the E key (wait a few seconds, then print [or print using system dialog], enable landscape layout, then save as PDF - press the E key to revert)
Enable drawing tools - chalk board by pressing the B key (B to revert), notes canvas by pressing the C key (C to revert), press the Del key to erase, press the D key to download drawings
One drawback with classical smoothing methods (kernels, splines, wavelets, etc.) is their reliance on assuming the degree of smoothness (and thereby assuming continuous differentiability up to some order) of the underlying object being estimated. However, the underlying object may in fact be irregular (i.e., non-smooth and perhaps even nowhere differentiable) and, moreover, the (ir)regularity of the underlying function may vary across its support. Elaborate adaptive methods for curve estimation have been proposed; however, their intrinsic complexity presents a formidable and perhaps even insurmountable barrier to their widespread adoption by practitioners. We contribute to the functional data literature by providing a pointwise MSE-optimal, data-driven, iterative plug-in estimator of “local regularity” and a computationally attractive, recursive, online updating method. In so doing we are able to separate measurement error “noise” from “irregularity” thanks to “replication”, a hallmark of functional data. Our results open the door to the construction of minimax optimal rates, “honest” confidence intervals, and the like, for various quantities of interest.
Modern data is often functional in nature (e.g., an electrocardiogram (ECG) and many other measures recorded by wearable devices)
The analysis of functional data requires nonparametric methods
However, nonparametric methods rely on smoothness assumptions (i.e., they require one to assume something we don’t know)
We show how we can learn the degree of (non)smoothness in functional data settings and we separate this from measurement noise (this cannot be done with classical data)
This allows us to conduct functional data analysis that is optimal (we do this in a statistical framework)
We emphasize online computation (i.e., how to update when new functional data becomes available)
A defining feature of classical regression analysis is that
sample elements are random pairs, \((y_i,x_i)\)
the function of interest \(\mathbb{E}(Y|X=x)\) is non-random
A defining feature of functional data analysis is that
sample elements are random functions, \(X^{(i)}\)
these are also functions of interest
The following figure (Figure 1) presents \(N=25\) sample elements (classical left plot, functional right plot)
FDA is the statistical analysis of samples of curves (i.e., samples of random variables taking values in spaces of functions)
FDA has heterogeneous, longitudinal aspects (“individual trajectories”)
Curves are continua, so we never observe the curve values at all points
The curves are only available at discrete points (e.g., \((Y^{(i)}_m , T^{(i)}_m) \in\mathbb R \times [0,1]\))
The points at which curves are available can differ across curves
The curves may be measured with error
Consider measurements taken from one random curve:
Functional data carry information along the curves and among the curves
Consider a second-order stochastic process with continuous trajectories, \(X = (X_t : t\in [0,1])\)
The mean and covariance functions are \[\begin{equation*} \mu(t) = \mathbb{E}(X_t)\text{ and } \Gamma (s,t) = \mathbb{E}\left\{ [X_s - \mu(s)] [X_t-\mu(t)]\right\},\, s,t\in [0,1] \end{equation*}\]
The framework we consider is one where independent sample path realizations \(X^{(i)}\), \(i=1,2,\ldots,N\), of \(X\) are measured with error at discrete times
The data associated with the \(i\)th sample path \(X^{(i)}\) consists of the pairs \((Y^{(i)}_m , T^{(i)}_m) \in\mathbb R \times [0,1]\) generated as \[\begin{equation*} Y^{(i)}_m = X^{(i)}(T^{(i)}_m) + \varepsilon^{(i)}_m, \qquad 1\leq m \leq M_i \end{equation*}\]
The \(\varepsilon^{(i)}_m\) are measurement errors, and we allow \[\begin{equation*} \varepsilon^{(i)}_m = \sigma(T^{(i)}_m) e^{(i)}_m, \quad 1\leq m \leq M_i \end{equation*}\]
The \(e^{(i)}_m\) are independent copies of a centred variable \(e\) with unit variance, and \(\sigma(T^{(i)}_m)\) is some unknown bounded function which accounts for possibly heteroscedastic measurement errors
Our approach applies to both independent design and common design cases
Relative to the number of curves, \(N\), the number of points per curve, \(M_i\), may be small (“sparse”) or large (“dense”)
One key distinguishing feature of FDA is that of “replication” (i.e., common structure among curves)
Essentially, there is prior information in the \(N-1\) sample curves that can be exploited to learn about the \(N\)th, which is not available in, say, classical regression analysis
This common structure can be exploited for a variety of purposes
For instance, it will allow us to obtain estimates of the regularity of the curves that may vary across their domain \(t\in[0,1]\) (i.e., local regularity estimates)
This would not be possible in the classical nonparametric setting where we are restricted to a single curve only
“Smoothness […] such as the existence of continuous second derivatives, is often imposed for regularization and is especially useful if nonparametric smoothing techniques are employed, as is prevalent in FDA” (Wang, Chiou, and Müller 2016)
This is problematic since imposing an unknown (and global) degree of smoothness may be incompatible with the underlying stochastic process
A key feature of our approach is its data-driven locally adaptive nature
We consider a meaningful regularity concept for the data generating process based on probability theory
We propose simple estimates for local regularity and link process regularity to sample path regularity
A key element of our approach is “local regularity”, which here is the largest order fractional derivative admitted by the sample paths of \(X\) as measured by the value of \(H_t\), the “local Hölder exponent”, which may vary with \(t\)
More precisely, here “local regularity” is the largest value \(H_t\) for which, uniformly with respect to \(u\) and \(v\) in a neighborhood of \(t\), the second order moment of \((X_u-X_v)/|u-v|^{H_t}\) is finite
We can then assume \[\begin{equation*} \mathbb{E}\left[(X_u-X_v)^2\right]\approx L_t^2|u-v|^{2H_t} \end{equation*}\] when \(u\) and \(v\) lie in a neighborhood of \(t\)
If a function is smooth (i.e., continuously differentiable), then \(H_t=1\), otherwise the function is non-smooth with \(0<H_t<1\)
If a function is a constant function then \(L_t=0\), otherwise \(L_t>0\)
Consider the neighborhood \([t-\Delta_*/2, t + \Delta_*/2] \cap [0,1]\) of \(t\), and define \[\begin{align*} \theta(u,v) &= \mathbb{E}\left[ (X_u-X_v)^{2} \right], \quad\text{ hence }\\ \theta(u,v) &\approx L_t^2 |u-v|^{2H_t} \quad \text{if } |u-v| \text{ is small and } u,v \text{ are close to } t \end{align*}\]
Letting \(\Delta_* =2^{-1}e^{-\log(\bar M_i)^{1/3}}>0\), \(t_1=t-\Delta_*/2\), \(t_3= t + \Delta_*/2\), and \(t_2=(t_1+t_3)/2\) (the definition of \(t_1\) and \(t_3\) is adjusted in boundary regions), then we show that \[\begin{equation*} H_t \approx \frac{\log(\theta(t_1,t_3)) - \log(\theta(t_1,t_2))}{2\log(2)} \quad \text{if } |t_3-t_1| \text{ is small} \end{equation*}\]
Moreover, \[\begin{equation*} L_t \approx \frac{\sqrt{\theta(t_1,t_3)}}{|t_1-t_3|^{H_t} } \quad \text{if } |t_3-t_1| \text{ is small} \end{equation*}\]
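To see where the expression for \(H_t\) comes from, note that \(|t_3-t_1|=\Delta_*\) while \(|t_2-t_1|=\Delta_*/2\), so taking the ratio of the two second-order moments cancels the unknown constant \(L_t^2\): \[\begin{equation*} \frac{\theta(t_1,t_3)}{\theta(t_1,t_2)} \approx \frac{L_t^2\,\Delta_*^{2H_t}}{L_t^2\,(\Delta_*/2)^{2H_t}} = 2^{2H_t} \quad\Longrightarrow\quad 2H_t\log(2) \approx \log(\theta(t_1,t_3)) - \log(\theta(t_1,t_2)) \end{equation*}\]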
The idea is to estimate \(\theta(t_1,t_3)\) and \(\theta(t_1,t_2)\) by averaging vertically over curves
Given estimates \(\widehat\theta(t_1,t_3)\) and \(\widehat\theta(t_1,t_2)\), the estimators of \(H_{t}\) and \(L_t\) are given by \[\begin{equation*} \widehat H_t = \frac{\log(\widehat\theta(t_1,t_3)) - \log(\widehat\theta(t_1,t_2))}{2\log(2)},\quad \widehat L_t = \frac{\sqrt{\widehat\theta(t_1,t_3)}}{|t_1-t_3|^{\widehat H_t} } \end{equation*}\]
The estimator \(\widehat{\theta}(t_l,t_j)\) is the average of local curve smoothers, i.e., \[\begin{equation*} \widehat{\theta}(t_l,t_j)=\frac{1}{N}\sum_{i = 1}^N\left(\widetilde X^{(i)}(t_l)-\widetilde X^{(i)}(t_j)\right)^2 \end{equation*}\]
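A minimal R sketch of these estimators, assuming each curve has already been pre-smoothed at \(t_1\), \(t_2\), \(t_3\) (all names here are illustrative, not taken from the authors’ code):

```r
## Minimal sketch: estimate H_t and L_t from pre-smoothed curves.
## X_tilde is an N x 3 matrix whose columns hold the smoothed curve
## values at t1, t2, t3 (t2 the midpoint of t1 and t3).
estimate_HL <- function(X_tilde, t1, t3) {
  theta_13 <- mean((X_tilde[, 1] - X_tilde[, 3])^2)  # theta-hat(t1, t3)
  theta_12 <- mean((X_tilde[, 1] - X_tilde[, 2])^2)  # theta-hat(t1, t2)
  H <- (log(theta_13) - log(theta_12)) / (2 * log(2))
  L <- sqrt(theta_13) / abs(t1 - t3)^H
  list(H = H, L = L)
}
```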
The smoother \(\widetilde X^{(i)}(t)\) depends on a bandwidth \(h_t\) that, post-iteration, adapts to the local regularity of the underlying process
The stochastic process \(X\) is a random function taking values in \(L^2 (\mathcal T)\), with \(\mathbb E (\| X\|^2) <\infty\)
The process is not deterministic with all sample paths equal to a common path
The increments of the process have finite moments of all orders, and the distributions of the increments are sub-Gaussian
The functions \(X^{(i)}(t)\) may be nowhere differentiable
The process \(X\) may be non-stationary with non-stationary increments
The measurement errors \(\varepsilon^{(i)}\) may be heteroscedastic
The mean function \(\mu(t)\) may be smoother than the \(X^{(i)}(t)\) functions
\(0<L_t<\infty\) and \(0<H_t<1\)
We take the optimal bandwidth expression for \(h_t\) (which depends on \(H_t\) and \(L_t\)) that minimizes pointwise MSE using a general squared bias term (not the usual term one gets assuming twice differentiable curves)
Then, given an initial batch of \(N\) curves, we estimate \(H_t\) and \(L_t\) for each \(t\in \mathcal T_0\), which involves an iterative plug-in procedure:
begin with some starting values for the local bandwidths \(h_t\)
construct preliminary estimates of each curve for every \(t\in \mathcal T_0\) using the data pairs \((Y^{(i)}_m , T^{(i)}_m)\) and local bandwidth starting values
use these preliminary curve estimates to get starting values for \(H_t\) and \(L_t\) for every \(t\in \mathcal T_0\), and plug these into the optimal bandwidth expression
repeat steps 1-3 using the updated plug-in bandwidths; continue iterating \(H_t\), \(L_t\), and \(h_t\) for every \(t\in \mathcal T_0\) until the procedure stabilizes (this occurs quite quickly, typically after 10 or so iterations; see the sketch below)
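Below is a self-contained toy R sketch of this loop at a single point \(t\); purely for illustration it assumes a common design across curves, a known homoscedastic noise variance, a uniform design density (\(f_T=1\)), and the Epanechnikov kernel (whose constants are derived in the appendix), and it is not the authors’ implementation:

```r
## Toy sketch of the iterative plug-in loop at one point t. Y is an
## N x M matrix of noisy observations, T the common design points.
K <- function(u) 0.75 * pmax(1 - u^2, 0)            # Epanechnikov kernel

nw <- function(y, T, t, h) {                        # Nadaraya-Watson smoother
  w <- K((T - t) / h)
  sum(w * y) / sum(w)
}

iterate_HL <- function(Y, T, t, sigma2, n_iter = 10) {
  M <- ncol(Y)
  Delta <- 0.5 * exp(-log(M)^(1 / 3))               # spacing from the slides
  t1 <- t - Delta / 2; t3 <- t + Delta / 2; t2 <- (t1 + t3) / 2
  h <- 0.1                                          # bandwidth starting value
  for (iter in 1:n_iter) {
    Xs <- sapply(c(t1, t2, t3),                     # smooth each curve at t1, t2, t3
                 function(s) apply(Y, 1, nw, T = T, t = s, h = h))
    th13 <- mean((Xs[, 1] - Xs[, 3])^2)             # theta-hat(t1, t3)
    th12 <- mean((Xs[, 1] - Xs[, 2])^2)             # theta-hat(t1, t2)
    H <- (log(th13) - log(th12)) / (2 * log(2))
    L <- sqrt(th13) / Delta^H
    C1 <- 3 * L^2 / ((2 * H + 1) * (2 * H + 3))     # Epanechnikov constants
    C2 <- 0.6 * sigma2
    h <- (C2 / (2 * H * C1 * M))^(1 / (2 * H + 1))  # plug-in optimal bandwidth
  }
  list(H = H, L = L, h = h)
}
```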
To estimate the \(i\)th curve at a point \(t\) with local Hölder exponent \(H_t\) and local Hölder constant \(L_t\), the MSE-optimal bandwidth \(h^*_{t,HL}\) is \[\begin{equation*} h^*_{t,HL} = \left[ \frac{\sigma_t^2 \int K^2(u)du }{2H_t L_t^2\times \int |u|^{2H_t}|K(u)|du\times f_T(t)}\times \frac{1}{\bar M_i} \right]^{\frac{1}{2H_t+1}} \end{equation*}\]
The kernel function \(K(u)\) is provided by the user, hence \(\int K^2(u)du\) and \(\int |u|^{2H_t}|K(u)|du\) can be computed given \(H_t\)
\(\sigma_t^2\) is estimated using one-half the squared differences of the two closest \(Y^{(i)}\) observations at \(t\), averaged across all curves
The design density \(f_T(t)\) is straightforward to estimate (a sketch of the bandwidth computation follows below)
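A short R sketch of this bandwidth, with the kernel integrals computed numerically for a user-supplied kernel supported on \([-1,1]\) (the function and argument names are ours, not the authors’):

```r
## Sketch of the MSE-optimal single-curve bandwidth h*_{t,HL}.
h_opt <- function(H, L, sigma2, f_T, M_bar,
                  K = function(u) 0.75 * (1 - u^2)) {   # Epanechnikov default
  int_K2 <- integrate(function(u) K(u)^2, -1, 1)$value
  int_uK <- integrate(function(u) abs(u)^(2 * H) * abs(K(u)), -1, 1)$value
  ((sigma2 * int_K2) /
     (2 * H * L^2 * int_uK * f_T * M_bar))^(1 / (2 * H + 1))
}
h_opt(H = 0.5, L = 1, sigma2 = 0.25, f_T = 1, M_bar = 100)
```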
We estimate \(H_t\) and \(L_t\) for some batch of \(N\) curves as outlined on the previous slide, then we recursively update them as online data arrives
In order to estimate \(\mu(t)\) at point \(t\) and \(\Gamma(s,t)\) at points \(s\) and \(t\), ideally we would use the (unknown, continuous) curves evaluated at these points, hence the ideal estimators given \(N\) curves would be \[\begin{align*} \widehat \mu(t)&=\frac{1}{N}\sum_{i=1}^N X^{(i)}(t)\\ \widehat \Gamma(s,t)&=\frac{1}{N}\sum_{i=1}^N \left(X^{(i)}(s)-\mu(s)\right)\left(X^{(i)}(t)-\mu(t)\right) \end{align*}\]
Of course, these are infeasible as we don’t observe the true curves; rather, we observe \(M_i\) noisy sample pairs for each curve \(X^{(i)}\), measured at random points
That is, we observe \(Y^{(i)}_m=X^{(i)}(T_m^{(i)})+\varepsilon^{(i)}_m\) at discrete irregularly spaced \(T_m^{(i)}\), i.e., we observe vectors of pairs \((Y^{(i)}_m,T_m^{(i)})\) of length \(M_i\), \(i=1,\dots,N\)
Our estimates of \(\mu(t)\) and \(\Gamma(s,t)\), like those of \(X^{(i)}\), are local in nature and adapt to \(H_t\) and \(L_t\)
It can be shown that to estimate \(\mu(t)\) and \(\Gamma(s,t)\) at points \(s\) and \(t\) with local Hölder exponent \(H_t\) and local Hölder constant \(L_t\), the MSE-optimal bandwidths are given by \[\begin{align*} h^*_{t,\mu} &= \left[ \frac{\sigma_t^2 \int K^2(u)du }{2H_t L_t^2\times \int |u|^{2H_t}|K(u)|du\times f_T(t)}\times \frac{1}{N\bar M_i} \right]^{\frac{1}{2H_t+1}},\\ h^*_{t,\Gamma} &= \left[ \frac{\sigma_t^2 \int K^2(u)du }{4H_t L_t^2\times \int |u|^{2H_t}|K(u)|du\times f_T(t)}\times \frac{1}{N\bar M^2_i} \right]^{\frac{1}{2H_t+2}} \end{align*}\] (the expression shown for \(h^*_{t,\Gamma}\) is the “sparse”-case bandwidth, which applies when \(N\) exceeds \(M\))
Using estimates of \(H_t\), \(L_t\), \(\sigma_t\) and \(f_T(t)\) from the batch of \(N\) curves, we smooth \(X^{(i)}(s)\) and \(X^{(i)}(t)\) using \(\widehat h_{t,\mu}\) and \(\widehat h_{t,\Gamma}\) to construct \(\widehat\mu(t)\) and \(\widehat\Gamma(s,t)\) (here we use, e.g., \(\widehat X^{(i)}(t) = \sum_{m=1}^{M_i} W_{m}^{(i)}(t;\widehat h_{t,\Gamma}) Y^{(i)}_m\))
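For concreteness, a minimal R sketch of the feasible estimators, given a matrix of per-curve smoothed values on the grid (the matrix `X_hat` and its construction with the appropriate plug-in bandwidth are assumed, not shown; names are ours):

```r
## Minimal sketch: feasible mean and covariance estimates on the grid t0,
## given an N x |t0| matrix X_hat of per-curve smoothers built with the
## plug-in bandwidth (h_mu for the mean, h_Gamma for the covariance).
mean_cov_hat <- function(X_hat) {
  mu    <- colMeans(X_hat)               # mu-hat(t) for t in t0
  Xc    <- sweep(X_hat, 2, mu)           # centre each curve
  Gamma <- crossprod(Xc) / nrow(X_hat)   # Gamma-hat(s, t) on t0 x t0
  list(mu = mu, Gamma = Gamma)
}
```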
Given observations from a new online curve \(X^{(i+1)}\), let the local bandwidth \(\widehat h_{t,HL}\) be that based on the estimates of \(H_t\) and \(L_t\) for recursion \(i\) (call this \(\widehat h_{t,HL}^{(i)}\))
Let \(\gamma_{i+1}=(\sum_{j=1}^iM_j)/(\sum_{j=1}^iM_j+M_{i+1})\) (which equals \(i/(i+1)\) if \(M_1=\dots=M_{i+1}\)), and let \(\widehat{\theta}_i(u,v)=\frac{1}{i}\sum_{j = 1}^i\left\{\widehat X^{(j)}(u)-\widehat X^{(j)}(v)\right\}^2\)
The recursively updated estimator of \(\theta(u,v)\) using \(\widehat h_{t,HL}^{(i)}\) is given by \[\begin{equation*} \widehat\theta_{i+1}(u,v) = \gamma_{i+1} \widehat\theta_{i}(u,v)+ (1-\gamma_{i+1}) \left\{\widehat{X}^{(i+1)}(u) - \widehat{X}^{(i+1)}(v)\right\}^2 \end{equation*}\]
The recursively updated estimators of \(H_t\) and \(L_t\) are then obtained via \[\begin{equation*} \widehat H_{t,i+1} = \frac{\log\big( \widehat\theta_{i+1}(t_1,t_3)\big) - \log \big(\widehat\theta_{i+1}(t_1,t_2)\big)}{2\log (2)},\quad\widehat L_{t,i+1} = \frac{\sqrt{\widehat\theta_{i+1}(t_1,t_3)}}{|t_1-t_3|^{\widehat H_{t,i+1}}} \end{equation*}\]
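A minimal R sketch of this online update when curve \(i+1\) arrives; `th13` and `th12` hold the running estimates of \(\theta\) at \((t_1,t_3)\) and \((t_1,t_2)\), and `x1`, `x2`, `x3` are the new curve’s smoothed values at \(t_1\), \(t_2\), \(t_3\) (names are ours):

```r
## Recursive update of theta, H_t, and L_t for one new curve.
update_HL <- function(th13, th12, x1, x2, x3, M_seen, M_new, t1, t3) {
  g    <- M_seen / (M_seen + M_new)            # gamma_{i+1}
  th13 <- g * th13 + (1 - g) * (x1 - x3)^2     # recursive theta updates
  th12 <- g * th12 + (1 - g) * (x1 - x2)^2
  H    <- (log(th13) - log(th12)) / (2 * log(2))
  L    <- sqrt(th13) / abs(t1 - t3)^H
  list(th13 = th13, th12 = th12, H = H, L = L)
}
```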
Updated estimators of \(\sigma(t)\) and \(f_T(t)\) are recursively computable
The recursively updated estimators of \(h_{t,HL}\), \(h_{t,\mu}\), \(h_{t,\Gamma}\) can then be computed (recall they depend on \(H_t\), \(L_t\), \(\sigma(t)\), and \(f_T(t)\)), and call these \(\widehat h_{t,HL}^{(i+1)}\), \(\widehat h_{t,\mu}^{(i+1)}\), and \(\widehat h_{t,\Gamma}^{(i+1)}\)
Using \(\widehat h_{t,\mu}^{(i+1)}\) or \(\widehat h_{t,\Gamma}^{(i+1)}\) we can estimate \(X^{(i+1)}(t)\) and recursively update the estimators of \(\mu(t)\) or \(\Gamma(s,t)\), again using \(\gamma_{i+1}\) and Robbins and Monro (1951) (\(\widehat h_{t,HL}^{(i+1)}\) will start the next recursion for \(X^{(i+2)}(t)\), etc.)
The “memory footprint” is determined by the grid \(\mathcal T_0\subset [0,1]\) (e.g., 100 equidistant points) since we construct estimates of \(\mu(\cdot)\), \(\sigma(\cdot)\), \(f_T(\cdot)\) on \(\mathcal T_0\) and estimates of \(\Gamma(\cdot,\cdot)\) on \(\mathcal T_0 \times \mathcal T_0\)
Thus, our procedures require only that we retain and update vectors of length \(|\mathcal T_0|\) and matrices of dimension \(|\mathcal T_0| \times |\mathcal T_0|\)
Though this project has a lot of moving parts, we demonstrate that the approach can deliver consistent, computationally feasible FDA
Most importantly, our method is data-driven and locally adaptive to the regularity of the stochastic process
We support both batch estimation and online updating using computationally attractive approaches
Though not mentioned explicitly so far, the pointwise asymptotics for the individual curve, mean curve, and covariance curve estimates are established and can be used to construct interval estimates etc.
R code exists, and we expect to release it publicly in the near future
A Multifractional Brownian Motion (MfBm, Peltier and Lévy Véhel 1995), say \((W(t))_{t\geq 0}\), with Hurst index function \(H_t \in(0,1)\), is a centred Gaussian process with covariance function \[\begin{equation*} C(s,t) = \mathbb{E}\left[W(s)W(t)\right] = D(H_s,H_t )\left[ s^{H_s+H_t} + t^{H_s+H_t} - |t-s|^{H_s+H_t}\right],\, s, t\geq 0, \end{equation*}\] where \[\begin{equation*} D(x,y)=\frac{\sqrt{\Gamma (2x+1)\Gamma (2y+1)\sin(\pi x)\sin(\pi y)}} {2\Gamma (x+y+1)\sin(\pi(x+y)/2)}, \, D(x,x) = 1/2,\, x,y >0 \end{equation*}\]
Define the Hurst index function given \(0 <\underline H \leq \overline H <1\), a change point \(t_c \in (0,1)\), and a slope \(S>0\), by \[\begin{equation*} H_t = \underline H + \frac{\overline H - \underline H}{1+\exp(- S (t-t_c))}, \qquad t\in [0,1] \end{equation*}\]
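In R, this Hurst index function is (the parameter defaults below are our choices, not the paper’s):

```r
## Logistic Hurst index function H_t on [0, 1].
H_fun <- function(t, H_lo = 0.2, H_hi = 0.8, t_c = 0.5, S = 20) {
  H_lo + (H_hi - H_lo) / (1 + exp(-S * (t - t_c)))
}
curve(H_fun, 0, 1)   # smooth transition from H_lo to H_hi around t_c
```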
The MfBm data are then generated as follows (Chan and Wood 1998)
for each \(i\), an integer \(M_i\) is generated as a realization of some random variable with mean \(\mathfrak m\) (e.g., Poisson), or \(M_i\) could be a constant
next, generate \(M_i\) independent draws \(T_1^{(i)},\ldots,T_{M_i}^{(i)}\) from a uniform random variable on \([0,1]\)
using the covariance formula above, the \(M_i\times M_i\) matrix \(C^{(i)}\) with the entries \[\begin{equation*} C^{(i)}(T_m^{(i)}, T_{m^\prime}^{(i)}),\quad 1\leq m,m^\prime \leq M_i, \end{equation*}\] is computed
The \(M_i\)-dimensional vector with components \(X^{(i)}(T_m^{(i)})\), \(1\leq m \leq M_i\), is then drawn from a zero-mean Gaussian distribution with covariance matrix \(C^{(i)}\) (see the sketch below)
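An illustrative R simulator following the covariance formula above (the Hurst parameters, noise level, and \(M\) below are our choices):

```r
## Simulate one noisy MfBm sample path on a random design.
D <- function(x, y) {
  sqrt(gamma(2 * x + 1) * gamma(2 * y + 1) * sin(pi * x) * sin(pi * y)) /
    (2 * gamma(x + y + 1) * sin(pi * (x + y) / 2))
}
C_mfbm <- function(s, t, H_fun) {                    # covariance C(s, t)
  Hs <- H_fun(s); Ht <- H_fun(t)
  D(Hs, Ht) * (s^(Hs + Ht) + t^(Hs + Ht) - abs(t - s)^(Hs + Ht))
}
H_fun <- function(t) 0.2 + 0.6 / (1 + exp(-20 * (t - 0.5)))
set.seed(42)
M   <- 51
T_i <- sort(runif(M))                                # random design points
C_i <- outer(T_i, T_i, C_mfbm, H_fun = H_fun)        # M x M covariance matrix
X_i <- drop(crossprod(chol(C_i + 1e-10 * diag(M)),   # one zero-mean Gaussian
                      rnorm(M)))                     # sample path
Y_i <- X_i + 0.1 * rnorm(M)                          # add measurement noise
plot(T_i, Y_i); lines(T_i, X_i)
```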
While the increments for Brownian Motion (i.e., when \(H_t=1/2\)) are stationary and independent, increments for MfBm are neither
We use the terms “smooth” and “regular”, or their opposites “non-smooth” and “irregular”, so we provide some background that might be helpful
We use “smoothness” of a function in the standard sense, i.e.,
smoothness is measured by the number of continuous derivatives over some domain (called the “differentiability class”)
at minimum, a function is considered smooth if it is “differentiable everywhere” (hence continuous)
at the other extreme, if it also possesses continuous derivatives of all orders it is said to be “infinitely differentiable” and is often referred to as a “\(C^{{\infty }}\) function”
Thus, “non-smooth” functions are not differentiable everywhere, and in the extreme may be “nowhere differentiable”
We use the term “irregular” to refer to non-smooth functions whose “Hölder continuity exponent” varies over some domain
We have in mind Hölder continuity where the “Hölder exponent” \(H\) defines the regularity of the function (also called its “Hurst” index)
But, in addition, we have in mind that such regularity might vary over \(t\) hence we write \(H_t\) and call this the “local Hölder exponent”
The Hölder continuity condition you may be familiar with is \[|X_u-X_v|\le L|u-v|^H,\quad 0<L<\infty,\quad 0<H< 1\]
So taking the square we have \[|X_u-X_v|^2\le L^2|u-v|^{2H}\]
Now, any given sample path is just one realization of the stochastic process \(X\), so we impose the condition on the expectation over all realizations; moreover, since we want to conduct statistical analysis we cannot work with the inequality \(\le\), and instead must use \(\approx\)
Near-equality is needed in order to obtain estimates of \(H\) and \(L\) (with a one-sided inequality alone, identification fails), but the good news is that almost all processes found in probability texts satisfy the near-equality condition we use (e.g., derivatives, squares, and log(1+…) transformations of Gaussian processes satisfy it with equality), so there is essentially no loss of generality and the condition is not restrictive
Furthermore, we adopt an augmented Hölder continuity condition and allow \(H_t\) and \(L_t\) to vary on \(t\in[0,1]\), thus we work with \[\mathbb{E}\left[(X_u-X_v)^2\right]\approx L_t^2|u-v|^{2H_t},\quad u,v\text{ in some neighborhood of } t\]
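Standard Brownian motion gives a concrete example where the condition holds with equality: for \(W\) a standard Brownian motion, \[\mathbb{E}\left[(W_u-W_v)^2\right]=|u-v|=1^2\cdot|u-v|^{2\times\frac{1}{2}},\] so \(H_t=1/2\) and \(L_t=1\) for every \(t\)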
The Hölder class MSE of one kernel smoothed curve at a point \(t\), where \(M_i\) is the number of observations for the \(i\)th curve, is given by \[MSE_i\leq C_1(t)h^{2H_t} + \frac{C_2(t)}{M_i f_T(t) h}\]
The constants in the above formula are \(C_1(t)=L_t^2\int |u|^{2H_t}|K(u)|du\) and \(C_2(t)=\sigma^2_t \int K^2(u)du\), where \(K(u)\) is the kernel function
Given this, the MSE-optimal bandwidth for curve \(i\) at point \(t\) is \[h^*_t=\left[\frac{C_2(t)}{2H_tC_1(t)M_i f_T(t)}\right]^{\frac{1}{2H_t+1}}\]
Using the Epanechnikov kernel given by \(\frac{3}{4}(1-u^2)\) on \([-1,1]\), it can be shown that \(C_1(t)=\frac{3L_t^2}{[2H_t+1][2H_t+3]}\) and \(C_2(t)=\frac{3}{5}\sigma^2_t\) (verified numerically below)
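A quick numerical check of these closed-form constants in R (\(L_t\) and \(\sigma_t^2\) set to 1 for the check; purely illustrative):

```r
## Verify the Epanechnikov constants C1(t) and C2(t) by quadrature.
K  <- function(u) 0.75 * (1 - u^2)
Ht <- 0.3
C1_closed <- 3 / ((2 * Ht + 1) * (2 * Ht + 3))
C1_num <- integrate(function(u) abs(u)^(2 * Ht) * abs(K(u)), -1, 1)$value
C2_num <- integrate(function(u) K(u)^2, -1, 1)$value   # = 3/5
c(C1_closed, C1_num, C2_num)
```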
To construct a consistent estimator of \(\sigma^2_t\) appearing in the MSE-optimal bandwidth formula, we use the following expression: \[\begin{equation*} \widehat\sigma^2_t=\frac{1}{2N}\sum_{i=1}^N\left(Y^{(i)}_{m(t)}-Y^{(i)}_{m(t-1)}\right)^2 \end{equation*}\]
To obtain this expression, let \(m(t)\) denote the index of the design point ordered by proximity to \(t\) (i.e., \(T^{(i)}_{m(t)}\) is the closest sample design point to \(t\), \(T^{(i)}_{m(t-1)}\) the second closest, etc.)
Recalling that \(\varepsilon^{(i)}_m = \sigma(T^{(i)}_m) e^{(i)}_m\), we can express \((Y^{(i)}_{m(t)}-Y^{(i)}_{m(t-1)})\) in the expression above as follows: \[\begin{align*} Y^{(i)}_{m(t)}-Y^{(i)}_{m(t-1)}&=X^{(i)}(T^{(i)}_{m(t)})+\varepsilon^{(i)}_{m(t)}-X^{(i)}(T^{(i)}_{m(t-1)})-\varepsilon^{(i)}_{m(t-1)}\\ &=\left(X^{(i)}(T^{(i)}_{m(t)})-X^{(i)}(T^{(i)}_{m(t-1)})\right)+\left(\varepsilon^{(i)}_{m(t)}-\varepsilon^{(i)}_{m(t-1)}\right) \end{align*}\]
Given continuity of \(X^{(i)}\) and the proximity of \(T^{(i)}_{m(t)}\) and \(T^{(i)}_{m(t-1)}\), the first term on the second line is negligible
Given this, and the independence of the \(e^{(i)}_{m(t)}\), we have \[\begin{align*} \mathbb{E}\left(Y^{(i)}_{m(t)}-Y^{(i)}_{m(t-1)}\right)^2 &\approx\mathbb{E}\left(\varepsilon^{(i)}_{m(t)}-\varepsilon^{(i)}_{m(t-1)}\right)^2=2\sigma^2(t), \end{align*}\] which leads to the expression \(\widehat\sigma^2_t=\frac{1}{2N}\sum_{i=1}^N\left(Y^{(i)}_{m(t)}-Y^{(i)}_{m(t-1)}\right)^2\)
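A minimal R sketch of this estimator, where `Y` and `T` are lists holding each curve’s observations and design points (names are ours, not the authors’):

```r
## Noise variance estimate at a point t: for each curve, difference the
## Y values at the two design points closest to t, then average one-half
## the squared differences across curves.
sigma2_hat <- function(Y, T, t) {
  d2 <- vapply(seq_along(Y), function(i) {
    idx <- order(abs(T[[i]] - t))[1:2]   # two closest design points to t
    (Y[[i]][idx[1]] - Y[[i]][idx[2]])^2
  }, numeric(1))
  mean(d2) / 2                           # one-half the averaged squared diffs
}
```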
A similar approach has been used in a classical homoscedastic nonparametric setting (Horowitz and Spokoiny 2001, Equation (2.9))
We modify this for use in an FDA setting, and by exploiting “replication” (i.e., by averaging vertically over all curves at point \(t\)) we obtain a simple method for computing the measurement noise variance that allows for heteroscedasticity of unknown form