Estimating Fisher discriminant error in a linear integrator model of neural population activity

Decoding approaches provide a useful means of estimating the information contained in neuronal circuits. In this work, we analyze the expected classification error of a decoder based on Fisher linear discriminant analysis. We provide expressions that relate decoding error to the specific parameters of a population model that performs linear integration of sensory input. Results show conditions that lead to beneficial and detrimental effects of noise correlation on decoding. Further, the proposed framework sheds light on the contribution of neuronal noise, highlighting cases where, counter-intuitively, increased noise may lead to improved decoding performance. Finally, we examined the impact of dynamical parameters, including neuronal leak and integration time constant, on decoding. Overall, this work presents a fruitful approach to the study of decoding using a comprehensive theoretical framework that merges dynamical parameters with estimates of readout error.


Introduction
In recent years, neuronal decoding has emerged as a key aspect of understanding the neural code [1]. The aim of decoding algorithms is to read out the sensory-driven responses of a neuronal population and classify them following a given criterion. Popular criteria include Fisher information [2,3], mutual information [4], and machine learning approaches [5,6]. While many types of decoders exist [7], a linear readout of neural activity has often been employed to perform sensory classification [8,9] and predict motor decisions [10,11]. Further, different classes of linear readouts are amenable to mathematical analysis and capture biological learning rules such as Hebbian learning [12].
In this work, we formally analyze the optimal decoding error of a linear decoder based on Fisher linear discriminant analysis (LDA). Assuming discrete classes of stimuli, LDA provides an upper bound on linear decoding capacity [13]. In addition, LDA shows good agreement with decision-making tasks and offers a bridge between cortical activity and behavioral performance [14,15].
Importantly, most theoretical approaches based on neural decoding are not concerned with how linear decoders would be influenced by specific dynamical parameters of mod- where the noise correlation is adjusted via a parameter ρ. Each population receives two distinct inputs (ν 1 and ν 2 ) and a private source of noise whose gain is β x and β y , respectively. The stimulus-driven response of each population is described by a tuning curve relating stimulus orientation to firing rate. (B) Activity for populations x and y is shown at discrete time-points (solid black circles). The optimal decision boundary (c) obtained by LDA discriminates amongst the neural activity generated by each of the two inputs. Neural responses follow a Gaussian distribution. The shaded area shows the proportion of discrimination error for stimulus 2 eled neural systems [16]. Here, we address this concern by providing expressions that relate decoding error to the adjustable parameters of a rate-based population model with a linear neural integrator [17,18]. This model captures the average spiking activity of neuronal populations [19][20][21] and the quasi-linear responses of neurons found in many experimental contexts [22]. Preliminary results have been presented in previous work [13,14], yet the full analytical solution had remained incomplete and limited to positive noise correlation; we now present the complete solution.
The framework relies on the simplifying assumption that signal and noise correlations originate from independent sources. While this assumption does not hold in biological circuits, where signal and noise are related [1], it allows us to systematically explore a wide range of scenarios that describe the impact of neuronal inputs, noise, correlations, and dynamical parameters on linear decoding, where the contribution of each parameter can be examined independently. This paper begins by describing the neural integrator model and the LDA readout. Then, we provide expressions for LDA error that rely on the parameters of the integrator model. Finally, we consider the effect of correlation, noise, and dynamical parameters on neuronal decoding using both analytical expressions and numerical simulations.

Linear population model
As a starting point, we assume two independent neuronal populations, each projecting in a feedforward manner to a readout discriminating amongst two inputs, ν 1 and ν 2 , that are constant over time ( Fig. 1(A)). Each population's mean firing rate in response to stimuli is conceptualized by a tuning curve where a stimulus feature, for instance visual orientation, generates a graded response. This scenario is analogous to analyses that examine population responses after performing a dimensionality reduction to generate a "population tuning curve" [23]. While a more complex model could account for a heterogeneity of responses within each population, we choose to limit our model to two homogeneous populations in order for the classification problem to remain tractable.
The activity of each population is described by a linear neural integrator where x i and y i are the firing rates of each population in response to a given stimulus i, τ is a time constant, α is a leak term, ξ (t) is Gaussian white noise (N (0, 1)), and β is the gain of the noise. Network parameters τ , α, and β are bound to R >0 . We make no distinction between noise induced by stimuli and noise generated through intrinsic neural activity. While their effect on mean rate activity is similar [24], their impact on noise correlations differs [1]; in the model, we explicitly separate the effect of firing rate and noise correlation. This will be done by controlling noise correlation through a tunable parameter, as detailed in Sect. 4. An advantage of this formalism is that the effect of noise correlation can be systematically isolated from changes in firing rates and signal correlation that would be induced through reciprocal connections between the two populations. Further, depending on the choice of parameters, the addition of recurrent weights to Eq. (1) may prevent the system from reaching a stationary state, which is a fundamental assumption of LDA.

Fisher linear discriminant decoder
A linear decoder based on LDA reads out the activity of the population model in order to perform a binary discrimination ( Fig. 1(B)). Discrimination error generated by LDA provides an estimate of the statistical confidence in distinguishing pairs of stimuli based on network activity. We focus on pairwise discrimination given that error rates obtained from more than two stimuli are well approximated by values obtained from all combinations of pairwise comparisons [25]. LDA assumes that neural activity is sampled from a multivariate Gaussian distribution with class covariance matrix i and class mean vector μ i . Further, LDA assumes equal class covariance, therefore 1 = 2 = . LDA attempts to find a projection line w, perpendicular to the decision boundary, onto which the input space is projected. The optimal projection line maximizes the Fisher criterion J (w) defined as the ratio of the projected between-to within-class variance: Given the assumption of equal class covariance, we set W = 2 . By taking the derivative of J (w) with respect to w and setting it to zero, one finds the closed-form solution for the optimal projection line to be W = (2 ) -1 (μ 2μ 1 ). (2)

Formulating a model-based linear decoder
To analytically derive means (μ 1 and μ 2 ) and covariance ( ) from the neural population model, we rearrange Eq. (1) as follows, using population x as example: Given that a white noise process is by definition the time derivative of a Weiner process, ξ (t) = dW t /dt, we can rewrite Eq. (3) as with (4) is an Orstein-Uhlenbeck process with known solution Equation (5) is a mean reverting process whose stable state follows a Gaussian distribution. A full derivation of this process is found in Sections A.1-A.2. To summarize this derivation, the expected mean and variance are The stationary mean and variance of Eq. (5) are With the assumption that the mean of x is much larger than the variance, there is negligible probability that x would fall below zero. Imposing strictly positive values of x could be achieved by the addition of a constant and would not alter the results obtained from the linear classifier.
The readout of neural activity depends on the following feature space: where Z is obtained from the probability distribution of a multivariate Gaussian with mean μ i and covariance . Setting the parameter ρ = 0 would be equivalent to a so-called "diagonal decoder" where off-diagonal elements of the covariance matrix are neglected, thus ignoring noise correlations altogether [16].
The closed form solution of LDA (Eq. (2)) can be expressed using the parameters of the population model (Eq. (1)) as follows. First, the total within-class scatter S w is To alleviate the notation, we define μ T = [ μ x , μ y ] T = μ 0 -μ 1 , where μ u = ν u /α u , and ν u is the absolute difference between the two stimuli, given an index u that stands for either population x or y. In this way, Eq. (2) becomes From the law of total probability, the error rate of classification is given by where P[k = 1] is the probability that a randomly sampled point from any distribution belongs to class j and P[y = i|k = j] is the probability that a point is classified as belonging to class i when it belongs to class j. Given that the classifier is unbiased towards each of the two neural populations, P[k = 0] = P[k = 1] = 0.5. To calculate conditional probabilities P[y = i|k = j], one must define a threshold c that serves as a boundary between the two distributions. The value of c is chosen to be the midpoint between the means of the projected distributions. We calculate P[y = i|k = j] as the area under the curve of the density function for j in the region where i is the correct class. As a first step, we shift the projected distributions by a factor c, so that the threshold becomes zero to simplify the integration. More specifically, the unshifted threshold c, the means of the shifted distributions η i , and their variance ζ 2 are with bias term b. The error rate from Eq. (6) then becomes Details of the full integration of error can be found in Section A.3. The final expression is This expression is further simplified by introducing the squared Mahalanobis distance where Because of equal class covariance, the above expression has the property that Using Eq. (8), we rewrite d 2 from the network parameters: As the ratio μ u / σ 2 u appears often in the above solution, we simplify our notation by introducing This expression simplifies the Mahalanobis distance to The full derivation of expected error using Mahalanobis distance is found in Sections A.3-A.4. The above analysis provides a relationship between classification error and the network parameters of the population model. In the sections to follow, we explore the various links between these quantities.

Noise correlation
Neurons that are in close physical proximity exhibit correlations in their activity. An extensive body of work has examined the impact of these noise correlations on behavioral tasks [26] and the activity of brain circuits [27][28][29][30][31][32][33][34][35]. Noise correlations may be advantageous or detrimental to cognitive and sensory processing; however, the specific network-level properties that give rise to these effects have not been fully elucidated.
In the proposed model, the effect of noise correlation on classification error is highly dependent upon the sensory inputs (ν 1 and ν 2 ). We distinguish four main cases that lead to qualitatively different conclusions on the impact of noise correlations. Details of these analyses are provided in Sections A.5-A.6. A first case arises when the tuning curves of populations x and y are identical in terms of both their orientation preference and their gain ( Fig. 2(A)). In this case, r x → r y , leading to monotonically increasing error as a function of correlation. Intuitively, this happens because correlation forces the firing rate distributions to "stretch" towards each other. We verified the analytical solution by comparing it to numerical estimates of the error rate as a function of noise correlation. These numerical estimates were obtained with Eq. (1), where populations x and y both received inputs ν 1 = 11 and ν 2 = 14 in order for the model to mimick a scenario where the two populations have identical tuning properties. The goal here is not to capture the model's response to a continuum of stimulus values along the tuning curves, but rather to illustrate the behavior of the model using discrete stimuli. We set τ = 1, β = 1, and α = 1 for both populations. We then numerically generated 5000 points per stimulus class. A subset of 80% of the total number of data points were randomly selected to train the LDA classifier. The proportion of misclassified points was calculated based on the remaining data points. We found good agreement between the numerical estimates and analytical solution (Fig. 2(A)).
Note that the range of error may be increased by moving the firing rate distributions closer to each other without altering the overall shape of the function relating error and noise correlation. While the goal here was to show the distribution of readout error across a broad range of correlation values, we acknowledge that not all combinations of tuning curves and noise correlations are physiologically plausible. In fact, while noise correlations in cortex vary across experimental conditions, regions, and behavioral states, they are typically reported to be on the order of 0.1-0.3 for nearby cells [26]. Therefore, extreme values (both positive and negative) are unlikely to arise in living circuits.
In a second scenario, the two populations are offset in terms of their orientation preference ( Fig. 2(B)). We examined classification error in this scenario by setting the input of population x to ν 1 = 11 and ν 2 = 14, while population y was set to ν 1 = 14 and ν 2 = 11. Analytically, this scenario leads to r x → -r y , resulting in a monotonically decreasing error as correlation increases from -1 to 1. Intuitively, this scenario arises because correlation stretches the distributions of responses along parallel lines, decreasing the overlap between them.
A third case arises when the tuning curve of one of the two populations yields the same response for two stimuli (Fig. 2(C)). This happens if the tuning curve of population x exhibits a broad region where firing rate remains constant despite changes in stimulus orientation. Analytically, this would lead to r x = 0. We illustrate this scenario by setting ν 1 = 11 and ν 2 = 11 for population x, and ν 1 = 11 and ν 2 = 14 for population y. This case yields a "symmetrical" effect of correlation on readout error, where maximum error is found at ρ * = 0 and error tends towards zero as ρ approaches either 1 or -1.
Finally, a fourth scenario occurs when the two populations have tuning curves that are aligned in terms of orientation preference, but where one population has higher response gain (Fig. 2(D)). This case is defined by |r x | = |r y |. Error tends to zero as noise correlation (ρ) goes to either -1 or 1. The correlation associated with maximum error is found somewhere in between these extremes and is given by To illustrate this scenario, we set ν 1 = 11 and ν 2 =13 for population x, and ν 1 = 11 and ν 2 = 14 for population y. Graphically, this scenario arises when noise correlation "stretches" the distribution of responses along parallel lines and their centroids do not align on either dimension. Starting from a correlation of zero, as correlation increases, the distributions will stretch towards each other, thus increasing overlap and error. After a maximum overlap defined by ρ * , further stretching of the distributions will force them to spread too thinly for them to overlap, until the extreme case of a correlation of one, where both distributions would appear as perfectly parallel lines, leading to zero error.
A continuum of cases exists between the different scenarios illustrated in Fig. 2(A)-(D). For instance, the peak error (ρ * ) in Fig. 2(D) can shift to lower correlation values by offsetting one of the tuning curves, yielding a curve closer to Fig. 2(B).
In sum, the above results show that, depending upon the structure of the input delivered to the two neural populations, noise correlations produce widely different effects on classification error. While insights into these results can be obtained without the full formalism described here [34], such formalism becomes pivotal when examining the effect of specific network parameters, as described next.

Impact of noise gain on classification error
To explore the effect of network parameters on error, we first modify Eq. (9) as follows: where the ratio r x /r y can be expressed using network parameters We define a set containing all network parameters G u = {α u , τ u , β u , ν u }. If g is a subset of these parameters, we can manipulate them using a function f (g) while setting the other parameters to a constant c g . In this way, we can rewrite Eq. (10) as We can investigate the effect of network parameters on ρ * . For example, the effect of noise gain (β x and β y ) on ρ * when keeping all other parameters constant except for the input is expressed as for |r x | < |r y |.
For illustration purposes, we explored the scenario described in Fig. 2(A), where two populations have equivalent tuning properties. Keeping all parameters constant while altering both β x and β y simultaneously has no effect on ρ * (Fig. 3(A)). The main impact is an increase in the amount of classification error (Fig. 3(B)). This result is not surprising: increasing the gain of the noise worsens readout performance.
However, markedly different results emerge in a scenario where tuning curves are offset (Fig. 2(B)) and β x is altered while keeping β y unchanged. In this case, ρ * = f (β x )c β x with c β x given by and f (β x ) = 1/β x . Alterations in β x impact ρ * in a non-monotonic fashion (Fig. 3(C)). A small increase from β x = 1 to β x =2 shifts ρ * towards a more negative value. However, further increasing to β x = 3 and β x = 4 increases ρ * and alters the relationship between correlation and readout error (Fig. 3(D)). Hidden in these results is a counter-intuitive finding: under certain circumstances, increasing β x leads to a decrease in classification error. This can be seen with β x = 10 ( Fig.  3(D), dashed line), leading to lower error than β x = 3 (green line) and β x = 4 (red line) for negative correlations. Intuitively, this can happen when increasing β x stretches the distribution of activity for population x along a single dimension away from the classification boundary [13]. Similar findings are borne out of graphical explanations where noise covariance stretches the distribution of firing rates [36].
The benefits of noise gain are even more pronounced in a scenario where one population has higher gain than the other, as in Fig. 2(D). In this case, β x monotonically shifts ρ * towards decreasing values (Fig. 3(E)). For a broad range of positive correlation values, a high noise gain (β x > 1) leads to lower classification error (Fig. 3(F)).

Impact of dynamical parameters
The approach described in the previous section can be applied to study the impact of the model's dynamical parameters on readout error. The two parameters of interest are the leak term (α) and the time constant (τ ).
The effect of the time constants τ x and τ y on ρ * can be expressed as for |r x | < |r y |. To study the effect of a single term (e.g., τ x ), we set ρ * = f (τ x )c τ x with c τ x given by and f (τ x ) = τ x . Similarly, the role of leak terms α x and α y on ρ * is and f (α x ) = 1/α x . Taking one scenario as illustration, we examined the case where tuning curves are offset by a fixed orientation (r x → -r y ). In this case, the time constant affects the relation between noise correlation and readout error, with larger values of τ x shifting ρ * towards smaller negative values of correlation (Fig. 4(A)). The reason for this shift follows from an earlier example (Fig. 2(D)), where an increased correlation resulted in greater overlap between the firing rate distributions, but only up to a point beyond which these distributions became too narrow to overlap. With larger values of τ x , a given correlation does not create as much overlap as it would for smaller values of τ x , thus leading to a shift in ρ * . The overall impact of a larger time constant is a decrease in classification error ( Fig.  4(B)): as τ x increases, there is less overlap between the distributions of firing rate across stimuli ( Fig. 4(B), inset). By contrast, shifting the leak term α x towards higher values decreases ρ * (Fig. 4(C)) and increases overall readout error ( Fig. 4(D)). The impact of increasing α x on error is due to an increase in the overlap between firing rate distributions ( Fig. 4(D), inset). The inverse effects of τ x and α x on these distributions explain their opposite impact on ρ * .
More complex, non-monotonic relations between ρ * and values of τ x and α x are found in different scenarios where tuning curves of the two populations are aligned (Fig. 5(A)) or when the gain of one population is larger (Fig. 5(B)).
Together, these results show that the integration time constant and leak term of the population model mediate the impact of noise correlation on classification error by shifting the value ρ * at which correlation reaches maximal error. The impact of network parameters on readout error is therefore not straightforward to describe but is brought to light using a framework that derives error estimates from the dynamical parameters of a population model.

Discussion
This work described an analytical framework for performing Fisher linear decoding in a rate-based neural model. With this formalism, we began by capturing well-documented findings on the role of noise gain and correlations on discrimination error. Going further, the framework allowed us to analytically examine the mediating role of dynamical parameters (neuronal leak and time constant) on the relation between noise correlation and error. Overall, this framework suggests that linear decoding is highly sensitive to dynamical model parameters as well as the characteristics of the sensory input.
One surprising finding was the presence of conditions where increased neuronal noise led to reduced classification error. This result was especially prominent when the gain of the two population tuning curves was unmatched (Fig. 3(E)-(F)). Taken together, our findings cover all possible qualitative scenarios where noise correlations have either a beneficial, detrimental, or null effect on decoding [36].
A related approach termed the leaky competing accumulator model was proposed in order to account for perceptual decision making [37]. Some key differences exist between this model and ours. Firstly, our framework assumes a steady-state of neural activity that is characteristic of a decision point and does not capture the time-course of deliberation. Our framework assumes an optimal bound on decision accuracy given a linear decoder, representing a ceiling in accuracy that would be associated with long response times (typically >500 ms in human subjects). Secondly, the accumulator model provides explicit connections, through lateral inhibition, that modulate correlations. These lateral connections, however, may also impact firing rates. By comparison, our framework isolates analytically the contribution of firing rates and correlations, and examines their relative role on perceptual discrimination.
It would be challenging to speculate on whether the analytical results provided would generalize to other classes of neural network models, particularly those that include a nonlinear transfer function [38]. However, our work opens the door to such analyses by describing a framework for linking neuronal readout and dynamical modeling.
Limitations and future work. While the framework described here strived to cover all possible scenarios involving firing rates, noise correlations, and network parameters, it is important to emphasize that not all such scenarios are plausible from a physiological standpoint. In particular, the framework treats firing rates and noise correlations as independent contributors to decoding error and allows for implausible cases where increases in firing rate would lead to an increase, a decrease, or no impact on correlations. Interactions between stimulus and noise correlations are a crucial factor limiting the coding capacity of neural circuits [1,23] and should be considered alongside the dynamical parameters discussed in this work.
Several future directions based on the proposed framework will be worth exploring. First, the assumption of equal class covariances in LDA is challenged by experimental work showing input-dependent neuronal variance [39]. This assumption could be relaxed by replacing LDA with quadratic discriminant analysis, albeit at the cost of a more complex solution when relating readout error to model parameters.
An extension of the current framework could consider the impact of pooling more than two neural populations, as well as more than two stimuli, when performing decoding. This extension would be helpful in examining the interactions between several populations of neurons, each with a unique tuning curve. Going further, one could examine decoding error at the limit of a large number of neurons with heterogeneous tuning curves that vary in both orientation preference and gain [2].
Conclusion. In summary, this work described a theoretical framework that merges Fisher linear decoding with a population model of sensory integration. This approach highlighted the role of correlation, neuronal noise, and network parameters, revealing a broad range of potential outcomes where different conditions generated either detrimental, beneficial, or null impacts on classification performance. These results motivate further developments in theoretical work that systematically link neural network models to optimal decoders in order to reveal the impact of key neurophysiological variables on sensory information processing.

A.2 Expected value and variance
We sought to find the expected mean and variance of the random variable x such that The expected mean is Given the zero-mean property of Ito integrals, we have The expected variance is By Ito isometry, Hence, the expected variance can be concisely written as

A.3 Classification error
The classification error as a function of neural activity is given by Substituting the mean and variance from the previous section, this becomes

A.4 Mahalanobis distance
We began with the following definitions: Expanding η i yields we expanded W using the property u · υ = u T υ, Hence, the squared Mahalanobis distance between means is We can rewrite η i as Similarly, for the variance ς 2 ,

A.5 Derivation of error
We analyzed the extrema of the error function in relation to noise correlation by taking its first derivative through the chain rule with The first derivative is given by The second derivative is The third derivative is x + r 2 y -2ρr x r y 2ρ + 1ρ 2 (-2r x r y ) = 1 (1ρ 2 ) 2 2ρr 2 x + 2ρr 2 y -2ρ2ρr x r y + -2r x r y + 2r x r y ρ 2 = -2 (1ρ 2 ) 2 ρ 2 r x r yρ r 2 x + r 2 y + r x r y .

A.6 Extrema of error
We evaluated the extrema of error by finding the points where Eqs. (14)-(16) are equal to zero, We assumed that the ratios r x and r y are finite and the Euclidean distance between the distribution means is finite and non-null. In other words, if d 2 → ∞ it is exclusively due to the correlation coefficient. Then, We proceeded in a similar fashion for the second derivative (Eq. (15)): The third derivative (Eq. (16)) is 0 = dd 2 dρ ⇔ 0 = -2 (1ρ 2 ) 2 ρ 2 r x r yρ r 2 x + r 2 y + r x r y , ⇔ 0 = ρ 2 r x r yρ r 2 x + r 2 y + r x r y .
Depending on network parameters, two cases are possible. One case arises if one of the ratios, either r x or r y , is zero. This happens if the mean activity of one population is equal across inputs. If the mean activity of both units remained unchanged, the resulting multivariate distributions would overlap, thus breaking the basic assumptions justifying the choice of LDA. In this first case, 0 = dd 2 dρ ⇐ 0 = ρ if r x = 0 or r y = 0.
The last expression can be decomposed into four distinct cases. First, when r x → r y , ρ → r 2 y + r 2 y 2r y r y → 1.
Third, when r x = r y , we examined the positive and negative roots of ρ. The positive root is ρ + = r 2 x + r 2 y + max(r 2 x , r 2 y ) -min(r 2 x , r 2 y ) 2r x r y = max(r 2 x , r 2 y ) r x r y .
Because | max(r 2 x , r 2 y )| > |r x r y | from the assumption that one ratio is smaller than the other (or unequal, non-null), this means that |ρ + | > 1 ∀r x , r y . Since the correlation is bound in the range ]-1, 1[, the positive root must be rejected. The negative root does not suffer from the same problem, ρ -= r 2 x + r 2 y -max(r 2 x , r 2 y ) + min(r 2 x , r 2 y ) 2r x r y = min(r 2 x , r 2 y ) r x r y .

A.7 Minima and maxima
We determined upward and downward trends of the error curve by calculating the sign of the derivative between the potential maxima (considering that they are mutually exclusive). Taking Eqs. (14) -2 (1ρ 2 ) 2 ρ 2 r x r yρ r 2 x + r 2 y + r x r y ⇔ sign dε dρ = sign ρ 2 r x r yρ r 2 x + r 2 y + r x r y .
For the condition where either r x = 0 or r y = 0, we have already found the zeros of ρ 2 r x r yρ(r 2 x + r 2 y ) + r x r y to be ρand ρ + . To determine sign(dε/dρ), we need to know whether the extremum of the parabola is a minimum or a maximum, d 2 dρ 2 ρ 2 r x r yρ r 2 x + r 2 y + r x r y = 2 > 0.
Given ρ + > 1, Regardless of the conditions for r x and r y , following Eqs. (27)- (29), the error curve as a function of correlation increases from ρ = -1 until its maximum, found at a value of ρ * = 0 or ρ * = ρ -, and then decreases until ρ = 1.