Complex systems are made of a large number of interacting elements, leading to non-trivial behaviors. They arise in various areas of research such as biology, the social sciences, physics, and communication networks. In neuroscience in particular, the nervous system is composed of billions of interconnected neurons interacting with their environment. Two specific features of this class of complex systems are that (i) external inputs and (ii) internal sources of random fluctuations influence their dynamics. Their theoretical understanding is a great challenge and involves high-dimensional non-linear mathematical models integrating non-autonomous and stochastic perturbations.

Modeling these systems gives rise to many different scales, both in space and in time. In particular, learning processes in the brain involve three time-scales: neuronal activity (fast), external stimulation (intermediate), and synaptic plasticity (slow). Here, the fast time-scale corresponds to a few milliseconds and the slow time-scale to minutes/hours, while the intermediate time-scale generally lies between the two, although some stimuli may be faster than the neuronal activity time-scale (*e.g.*, submillisecond auditory signals [1]). The separation of these time-scales is an important and useful property in their study. Indeed, multiscale methods appear particularly relevant for handling and simplifying such complex systems.

First, the stochastic averaging principle [2, 3] is a powerful tool for analyzing the impact of noise on slow-fast dynamical systems. This method relies on approximating the fast dynamics by its quasi-stationary measure and averaging the slow evolution with respect to this measure. In the asymptotic regime of perfect time-scale separation, this leads to a reduced slow system whose analysis enables a better understanding of the original stochastic model.
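To make the principle concrete, here is a minimal numerical sketch. All choices are hypothetical toy examples, not the models studied in this paper: the fast variable is a scalar Ornstein-Uhlenbeck process relaxing towards the slow variable, and the slow drift $G(v, w) = v^2 - w$ is chosen so that the averaged vector field has a closed form.

```python
import numpy as np

# Toy slow-fast system (hypothetical choices, for illustration only):
#   dv = (1/eps) * (w - v) dt + (sigma / sqrt(eps)) dB   (fast OU around w)
#   dw = (v**2 - w) dt                                   (slow)
# For frozen w, the quasi-stationary law of v is N(w, sigma**2 / 2),
# so averaging gives the reduced ODE  dw/dt = w**2 + sigma**2 / 2 - w.

rng = np.random.default_rng(0)
eps, sigma, dt, T = 1e-3, 0.2, 1e-5, 1.0

v, w = 0.1, 0.1
for _ in range(int(T / dt)):
    v += (w - v) / eps * dt + sigma / np.sqrt(eps) * np.sqrt(dt) * rng.standard_normal()
    w += (v**2 - w) * dt

# Reduced (averaged) ODE, integrated with a coarser Euler step
wa, dta = 0.1, 1e-3
for _ in range(int(T / dta)):
    wa += (wa**2 + sigma**2 / 2 - wa) * dta

print(w, wa)  # the two slow trajectories should be close for small eps
```

The slow variable of the full simulation stays close to the averaged ODE because, on the slow time-scale, the fast variable has already equilibrated to its quasi-stationary law.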

Second, periodic averaging theory [4], which was originally developed for celestial mechanics, is particularly relevant for studying the effect of fast deterministic periodic perturbations (external inputs) on dynamical systems. This method also leads to a reduced model in which the external perturbation is time-averaged.
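The same idea can be illustrated with a hypothetical scalar example: a rapidly oscillating decay rate averages out to its mean over one period.

```python
import numpy as np

# Toy periodic averaging (hypothetical example): the fast oscillating
# coefficient in  dx/dt = -(1 + cos(t / eps)) * x  has mean 1 over one
# period, so the averaged equation is simply  dx/dt = -x.

eps, dt, T = 1e-3, 1e-5, 1.0
x = 1.0
for k in range(int(T / dt)):
    x += -(1.0 + np.cos(k * dt / eps)) * x * dt

x_avg = np.exp(-T)  # solution of the averaged ODE at time T
print(x, x_avg)
```

The fast oscillation only contributes a correction of order eps to the trajectory, which vanishes in the limit.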

It seems appropriate to combine these two methods to address our case of a noisy, input-driven slow-fast dynamical system. This combined approach provides a novel way to understand the interactions between the three time-scales relevant in our models. More precisely, we will consider the following class of multiscale stochastic differential equations (SDEs), with $\epsilon_1, \epsilon_2 > 0$ two small parameters:

$\begin{cases} d\mathbf{v}^{\epsilon} = \frac{1}{\epsilon_1} F\big(\mathbf{v}^{\epsilon}, \mathbf{w}^{\epsilon}, \mathbf{u}(\tfrac{t}{\epsilon_2})\big)\, dt + \frac{1}{\sqrt{\epsilon_1}}\, \mathbf{\Sigma}\, dB(t), \\ d\mathbf{w}^{\epsilon} = G(\mathbf{v}^{\epsilon}, \mathbf{w}^{\epsilon})\, dt, \end{cases}$

(1)

where $\mathbf{v}^{\epsilon} \in \mathbb{R}^{p}$ represents the fast activity of the individual elements, $\mathbf{w}^{\epsilon} \in \mathbb{R}^{q}$ represents the connectivity weights, which vary slowly due to plasticity, and $\mathbf{u}(t) \in \mathbb{R}^{p}$ represents the value of the external input at time *t*. Random perturbations are included in the form of a diffusion term, where $(B(t))$ is a standard Brownian motion.
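A direct Euler-Maruyama discretization of system (1) can be sketched as follows. The particular choices of F, G, Σ, and u below are hypothetical placeholders (linear fast dynamics, a Hebbian-like slow drift with linear decay, a rotating periodic input); they are not the models analyzed later in the paper.

```python
import numpy as np

# Euler-Maruyama sketch of system (1) with toy choices of F, G, Sigma, u.

rng = np.random.default_rng(1)
eps1, eps2 = 1e-3, 1e-3
sigma, dt, T = 0.1, 1e-5, 0.5
p = 2

F = lambda v, w, u: -v + w @ v + u                    # fast drift
G = lambda v, w: np.outer(v, v) - w                   # slow plasticity drift
u = lambda s: 0.3 * np.array([np.cos(s), np.sin(s)])  # fast periodic input

v = np.zeros(p)
w = 0.1 * np.eye(p)
for k in range(int(T / dt)):
    t = k * dt
    dB = np.sqrt(dt) * rng.standard_normal(p)
    v = v + F(v, w, u(t / eps2)) / eps1 * dt + sigma / np.sqrt(eps1) * dB
    w = w + G(v, w) * dt

print(w)  # slowly evolving connectivity matrix at time T
```

Note the two characteristic scalings of (1): the drift of the fast variable is divided by eps1, while its noise is divided by sqrt(eps1), so that the fast dynamics keeps a non-degenerate stationary behavior as eps1 goes to 0.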

We are interested in the double limit $\epsilon_1 \to 0$ and $\epsilon_2 \to 0$ to describe the evolution of the slow variable **w** in the asymptotic regime where both the variable **v** and the external input are much faster than **w**. This asymptotic regime corresponds to the study of a neuronal network in which both the external input **u** and the neuronal activity **v** operate on a faster time-scale than the slow plasticity-driven evolution of the synaptic weights **w**. To account for the possible difference of time-scales between **v** and the input, we introduce the time-scale ratio $\mu = \epsilon_1/\epsilon_2 \in [0, \infty]$. In the interesting case where $\mu \in (0, \infty)$, one needs to understand the long-time behavior of the rescaled periodically forced SDE, for any fixed $\mathbf{w}_0$:

$d\mathbf{v} = F(\mathbf{v}, \mathbf{w}_0, \mu t)\, dt + \mathbf{\Sigma}(\mathbf{v}, \mathbf{w}_0)\, dB(t).$

Recently, in an important contribution [5], a precise understanding of the long-time behavior of such processes has been obtained using methods from partial differential equations. In particular, conditions ensuring the existence of a periodic family of probability measures to which the law of **v** converges as time grows have been identified, together with a sharp estimate of the speed of mixing. These results are at the heart of the extension of the classical stochastic averaging principle [2] to the case of periodically forced slow-fast SDEs [6]. As a result, we obtain a reduced equation describing the slow evolution of the variable **w** in the form of an ordinary differential equation,

$\frac{d\mathbf{w}}{dt}=\overline{G}(\mathbf{w}),$

where $\overline{G}$ is constructed as an average of *G* with respect to a specific probability measure, as explained in Section 2.
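When no closed form is available, $\overline{G}$ can be estimated numerically by simulating the fast equation at a frozen **w** and averaging *G* along the trajectory. The sketch below does this for a hypothetical scalar example (an Ornstein-Uhlenbeck fast variable and $G(v, w) = v^2 - w$) for which the exact average is known, so the estimate can be checked.

```python
import numpy as np

# Monte Carlo estimate of the averaged vector field G-bar at a frozen w.
# Toy fast equation: dv = (w - v) dt + sigma dB (OU process, with
# quasi-stationary law N(w, sigma**2 / 2)) and toy rule G(v, w) = v**2 - w,
# so that exactly G_bar(w) = w**2 + sigma**2 / 2 - w.

rng = np.random.default_rng(2)

def G_bar(w, sigma=0.2, dt=1e-3, burn=5_000, n=200_000):
    noise = sigma * np.sqrt(dt) * rng.standard_normal(burn + n)
    v, acc = w, 0.0
    for k in range(burn + n):
        v += (w - v) * dt + noise[k]
        if k >= burn:
            acc += v**2 - w       # accumulate G(v, w) after burn-in
    return acc / n

w0 = 0.3
est = G_bar(w0)
exact = w0**2 + 0.2**2 / 2 - w0
print(est, exact)
```

The burn-in phase discards the transient so that the time average is taken against the quasi-stationary measure, which is exactly the measure appearing in the averaging principle.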

This paper first introduces the appropriate mathematical framework and then focuses on applying these multiscale methods to learning neural networks.

The individual elements of these networks are neurons or populations of neurons. A common assumption at the basis of mathematical neuroscience [7] is to model their behavior by a stochastic differential equation made of four different contributions: (i) an intrinsic dynamics term, (ii) a communication term, (iii) a term for the external input, and (iv) a stochastic term for the intrinsic variability. Assuming that their activity is represented by the fast variable $\mathbf{v}\in {\mathbb{R}}^{n}$, the first equation of system (1) is a generic representation of a neural network (the function *F* gathers the first three contributions to the dynamics). In the literature, the non-linearity of the function *F* ranges from linear (or almost-linear) systems to spiking neuron dynamics [8], yet the structure of the system is universal.

These neurons are interconnected through a connectivity matrix which represents the strength of the synapses connecting them. The slow modification of the connectivity between neurons is commonly thought to be the essence of learning. Unsupervised learning rules update the connectivity based exclusively on the value of the activity variable. Therefore, this mechanism is represented by the slow equation above, where $\mathbf{w}\in {\mathbb{R}}^{n\times n}$ is the connectivity matrix and *G* is the learning rule. Probably the most famous of these rules is the Hebbian learning rule introduced in [9]. It says that if two neurons A and B are active at the same time, then the synapses from A to B and from B to A should be strengthened proportionally to the product of the activities of A and B. Many variations of this correlation-based principle can be found in [10, 11]. Another recent, unsupervised, biologically motivated learning rule is spike-timing-dependent plasticity (STDP), reviewed in [12]. It is similar to Hebbian learning except that it focuses on causation instead of correlation and that it occurs on a faster time-scale. Both of these types of rules correspond to *G* being quadratic in **v**.
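In a discrete toy form (with hypothetical learning-rate and decay parameters, not taken from the paper), one Hebbian step can be sketched as follows; note that the update is quadratic in the activity, as stated above.

```python
import numpy as np

# Minimal sketch of the Hebbian principle: the weight between neurons
# i and j grows in proportion to the product of their activities.
# Learning rate and decay below are arbitrary illustrative values.

def hebbian_step(W, v, lr=0.01, decay=0.1):
    """One Euler step of the quadratic rule dW/dt = v v^T - decay * W."""
    return W + lr * (np.outer(v, v) - decay * W)

W = np.zeros((3, 3))
v = np.array([1.0, 0.5, 0.0])
W = hebbian_step(W, v)
print(W)  # entry W[i, j] is proportional to v[i] * v[j]
```

A silent neuron (here the third one) gains no connections, while co-active neurons (the first two) strengthen their mutual weights symmetrically.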

The previous literature on dynamic learning networks is extensive, yet we take a significantly different approach to the problem. A historical focus was the understanding of feedforward deterministic networks [13-15]. Another approach consisted in precomputing the connectivity of a recurrent network according to the principles underlying the Hebbian rule [16]. Most current research in the field focuses on STDP and is based on the precise times of the spikes, making them explicit in computations [17-20]. Our approach differs from the others in at least one of the following points: (i) we consider recurrent networks, (ii) we study the evolution of the coupled activity/connectivity system, and (iii) we consider bounded dynamical systems for the activity without requiring them to be spiking. Besides, our approach is a rigorous mathematical analysis in a field where most results rely heavily on heuristic arguments and numerical simulations. To our knowledge, this is the first time such models, expressed in a slow-fast SDE formalism, are analyzed using temporal averaging principles.

The purpose of this application is to understand what the network learns from exposure to time-dependent inputs. In other words, we are interested in the evolution of the connectivity variable, which evolves on a slow time-scale, under the influence of the external input and of noise added to the fast variable. More precisely, we intend to compute explicitly the equilibrium connectivities of such systems. This final matrix corresponds to the knowledge the network has extracted from the inputs. Although the derivation of the results is technically demanding for untrained readers, we have tried to extract widely understandable conclusions from our mathematical results, and we believe this paper brings novel elements to the debate about the role and mechanisms of learning in large-scale networks.

Although the averaging method is a generic principle, we have made significant assumptions to keep the analysis of the averaged system mathematically tractable. In particular, we will assume that the activity evolves according to a linear stochastic differential equation. This is not very realistic for modeling individual neurons, but it is more reasonable for modeling populations of neurons; see Chapter 11 of [7].
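One payoff of the linearity assumption is that the stationary law of the activity is Gaussian, with a covariance solving a Lyapunov equation. The sketch below (with an arbitrary stable matrix A, chosen purely for illustration) solves this equation by vectorization.

```python
import numpy as np

# Under a linear-activity assumption, for  dv = A v dt + Sigma dB  with
# A stable, the stationary covariance P solves the Lyapunov equation
#   A P + P A^T + Sigma Sigma^T = 0.
# The matrix A below is an arbitrary stable example.

A = np.array([[-1.0, 0.5],
              [0.0, -2.0]])
Sigma = 0.3 * np.eye(2)
Q = Sigma @ Sigma.T

n = A.shape[0]
I = np.eye(n)
# Vectorized form: (I kron A + A kron I) vec(P) = -vec(Q)
M = np.kron(I, A) + np.kron(A, I)
P = np.linalg.solve(M, -Q.reshape(-1)).reshape(n, n)

print(P)
residual = A @ P + P @ A.T + Q  # should be numerically zero
```

The vectorized system is solvable whenever no two eigenvalues of A sum to zero, which is automatic when A is stable.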

The paper is organized as follows. Section 2 is devoted to introducing the temporal averaging theory; Theorem 2.2, the main result of this section, provides the technical tool to tackle learning neural networks. Section 3 applies the mathematical tools developed in the previous section to models of learning neural networks: a generic model is described, and three particular models of increasing complexity are analyzed, first Hebbian learning, then trace learning, and finally STDP learning, all with linear activities. Finally, Section 4 discusses the consequences of the previous results from the viewpoint of their biological interpretation.