A bioinspired geometric model for sound reconstruction
The Journal of Mathematical Neuroscience volume 11, Article number: 2 (2021)
Abstract
The reconstruction mechanisms built by the human auditory system during sound reconstruction are still a matter of debate. The purpose of this study is to propose a mathematical model of sound reconstruction based on the functional architecture of the auditory cortex (A1). The model is inspired by the geometrical modelling of vision, which has undergone a great development in the last ten years. There are, however, fundamental dissimilarities, due to the different role played by time and the different group of symmetries. The algorithm transforms the degraded sound into an ‘image’ in the time–frequency domain via a short-time Fourier transform. Such an image is then lifted to the Heisenberg group and is reconstructed via a Wilson–Cowan integro-differential equation. Preliminary numerical experiments are provided, showing the good reconstruction properties of the algorithm on synthetic sounds concentrated around two frequencies.
1 Introduction
Listening to speech requires the capacity of the auditory system to map incoming sensory input to lexical representations. When the sound is intelligible, this mapping (‘recognition’) process is successful. With reduced intelligibility (e.g. due to background noise), the listener has to face the task of recovering the loss of acoustic information. This task is very complex as it requires a higher cognitive load and the ability to repair missing input. (See [28] for a review on noise in speech.) Yet, (normal-hearing) humans are quite able to recover sounds in several effortful listening situations (see, for instance, [27]), ranging from sounds degraded at the source (e.g. hypoarticulated and pathological speech) or during transmission (e.g. reverberation) to sounds corrupted by the presence of environmental noise.
So far, work on degraded speech has informed us a lot on the acoustic cues that help the listener to reconstruct missing information (e.g. [18, 31]); the several adverse conditions in which listeners may be able to reconstruct speech sounds (e.g. [2, 28]); and whether (and at which stage of the auditory process) higher-order knowledge (i.e. our information about words and sentences) helps the system to recover lower-level perceptual information (e.g. [22]). However, most of these studies adopt a phenomenological and descriptive approach. More specifically, techniques from previous studies consist in adding synthetic noise to speech sound stimuli, performing spectral and temporal analyses on the stimuli with noise and the same ones without it to identify acoustic differences, and linking the results of these analyses with the outcome of perceptual experiments. In some of these behavioural experiments, for instance, listeners are asked to identify speech units (such as consonants or words) when listening to the noisy stimuli. Their accuracy scores provide a measure of the listeners’ speech recognition ability.
As it stands, a mathematical model informing us on how the human auditory system is able to reconstruct a degraded speech sound is still missing. The aim of this study is to build a neurogeometric model for sound reconstruction, stemming from the description of the functional architecture of the auditory cortex.
1.1 Modelling the auditory cortex
Knowledge about the functional architecture of the auditory cortex is scarce, and there are difficulties in the application of Gestalt principles for auditory perception. For these reasons, the model we propose is strongly inspired by recent advances in the mathematical modeling of the functional architecture of the primary visual cortex and the processing of visual inputs [9, 13, 24, 32], which have recently yielded very successful applications to image processing [10, 16, 20, 35]. This idea is not new: neuroscientists take models of V1 as a starting point for understanding the auditory system (see, e.g. [30] for a comparison, and [23] for a related discussion in speech processing). Indeed, biological similarities between the structure of the primary visual cortex (V1) and the primary auditory cortex (A1) are well-known to exist.
An often cited V1–A1 similarity is their ‘topographic’ organization, a general principle determining how visual and auditory inputs are mapped to those neurons responsible for their processing [38]. Substantial evidence for a V1–A1 relation is also provided by studies on animals and on humans with deprived hearing or visual functions showing cross-talk interactions between sensory regions [41, 44]. More relevant for our study is the existence of receptive fields of neurons in V1 and A1 that allow for a subdivision of neurons into ‘simple’ and ‘complex’ cells, which supports the idea of a ‘common canonical processing algorithm within cortical columns’ [42, p. 1]. Together with the appearance in A1 of singularities typical of V1 (e.g. pinwheels) [34, 41], these findings speak in favour of the idea that V1 and A1 share similar mechanisms of sensory input reconstruction. In the next section we present the mathematical model for V1 that will be the basis for our sound reconstruction algorithm.
1.2 Neurogeometric model of V1
The neurogeometric model of V1 finds its roots in the experimental results of Hubel and Wiesel [25], which inspired Hoffman [24] to model V1 as a contact space.^{Footnote 1} This model has then been extended to the so-called sub-Riemannian model in [8, 9, 13, 33]. On the basis of such a model, exceptionally efficient algorithms for image inpainting have been developed (e.g. [10, 15, 16]). These algorithms now have several medical imaging applications (e.g. [45]).
The main idea behind this model is that an image, seen as a function \(f:\mathbb {R}^{2}\to \mathbb {R}_{+}\) representing the grey level, is lifted to a distribution on \(\mathbb {R}^{2}\times P^{1}\), the bundle of directions of the plane.^{Footnote 2} Here, \(P^{1}\) is the projective line, i.e. \(P^{1} = \mathbb {R}/\pi \mathbb{Z}\). More precisely, the lift is given by \(Lf(x,y,\theta ) = \delta _{S_{f}}(x,y,\theta )f(x,y)\) where \(\delta _{S_{f}}\) is the Dirac mass supported on the set \(S_{f}\subset \mathbb {R}^{2}\times P^{1}\) of points \((x,y,\theta )\) such that θ is the direction of the tangent line to the level curve of f through \((x,y)\). Notice that, under suitable regularity assumptions on f, \(S_{f}\) is a surface.
When f is corrupted (i.e. when f is not defined in some region of the plane), the lift is corrupted as well, and the reconstruction is obtained by applying a deeply anisotropic diffusion adapted to the problem. Such diffusion mimics the flow of information along the horizontal and vertical connections of V1 and uses as an initial condition the surface \(S_{f}\) and the values of the function f. Mathematicians call such a diffusion the sub-Riemannian diffusion in \(\mathbb {R}^{2}\times P^{1}\), cf. [1, 29]. One of the main features of this diffusion is that it is invariant by roto-translation of the plane, a feature that will not be possible to translate to the case of sounds, due to the special role of the time variable.
In what follows, we explain how similar ideas could be translated to the problem of sound reconstruction.
1.3 From V1 to sound reconstruction
The sensory input reaching A1 comes directly from the cochlea [14]: a spiral-shaped, fluid-filled cavity that composes the inner ear. Vibrations coming from the ossicles in the middle ear are transmitted to the cochlea, where they propagate and are picked up by sensors (so-called hair cells). These sensors are tonotopically organized along the spiral ganglion of the cochlea in a frequency-specific fashion, with cells close to the base of the ganglion being more sensitive to high-frequency sounds and cells near the apex more sensitive to low-frequency sounds, see Fig. 1. This early ‘spectrogram’ of the signal is then transmitted to higher-order layers of the auditory cortex.
Mathematically speaking, this means that when we hear a sound (which we can think of as represented by a function \(s:[0,T]\to \mathbb {R}\)), our primary auditory cortex A1 is fed by its time–frequency representation^{Footnote 3} \(S:[0,T]\times \mathbb {R}\to \mathbb {C}\). If, say, \(s\in L^{2}(\mathbb {R})\), the time–frequency representation S is given by the short-time Fourier transform of s, defined as
\[ S(\tau ,\omega ) := \operatorname{STFT}(s)(\tau ,\omega ) = \int _{\mathbb {R}} s(t)\,W(\tau -t)\,e^{-2\pi i t\omega }\,dt. \]
Here \(W:\mathbb {R}\to [0,1]\) is a compactly supported (smooth) window, so that \(S\in L^{2}(\mathbb {R}^{2})\). Since S is complex-valued, it can be thought of as the collection of two black-and-white images, \(|S|\) and \(\arg S\). The function S depends on two variables: the first is time, which we denote by τ, and the second is frequency, denoted by ω. Roughly speaking, \(|S(\tau ,\omega )|\) represents the strength of the presence of the frequency ω at time τ. In the following, we call S the sound image (see Fig. 2).
A first attempt to model the process of sound reconstruction in A1 is to apply the algorithm for image reconstruction described in Sect. 1.2. In a sound image, however, time plays a special role. Indeed,

1. While for images the reconstruction can be done by evolving the whole image simultaneously, the whole sound image does not reach the auditory cortex simultaneously, but sequentially. Hence, the reconstruction can be performed only in a sliding window.

2. A rotated sound image corresponds to a completely different input sound and thus the invariance by roto-translations is lost.

As a consequence, different symmetries have to be taken into account (see Appendix B) and a different model for both the lift and the processing in the lifted space is required.
In order to introduce this model, let us recall that, in V1, neural stimulation stems not only from the input but also from its variations. That is, mathematically speaking, the input image is considered as a real-valued function on a 2-dimensional space, and the orientation sensitivity arises from the sensitivity to first-order derivative information on this function, i.e. the tangent directions to level lines. This additional variational information allows lifting the 2-dimensional image space to the aforementioned contact space, and defining the sub-Riemannian diffusion [1, 11].
In our model of A1, we follow the same idea: we consider the variations of the input as additional variables. Input sound signals are time-dependent real-valued functions subjected to a short-time Fourier transform by the cochlea. As a result the A1 input is considered as a function of time and frequency. The first time derivative \(\nu =d\omega /d\tau \) of this object, corresponding to the instantaneous chirpiness of the sound, allows adding a supplementary dimension to the domain of the input. As in the case of V1, this gives rise to a natural lift of the signal to an augmented space, which in this case turns out to be \(\mathbb{R}^{3}\) with the Heisenberg group structure. (This structure very often appears in signal processing; see, for instance, [21] and Appendix B.)
As we already mentioned, the special role played by time in sound signals does not permit modeling the flow of information as a pure hypoelliptic diffusion, as was done for static images in V1. We thus turn to a different kind of model, namely Wilson–Cowan equations [43]. Such a model, based on an integro-differential equation, has been successfully applied to describe the evolution of neural activations. In particular, it allowed theoretically predicting complex perceptual phenomena in V1, such as the emergence of hallucinatory patterns [12, 17], and has been used in various computational models of the auditory cortex [26, 37, 46]. Recently, these equations have been coupled with the neurogeometric model of V1 to great benefit. For instance, in [4–6] they allowed replicating orientation-dependent brightness illusory phenomena, which had proved to be difficult to implement for non-cortical-inspired models. See also [39] for applications to the detection of perceptual units.
On top of these positive results, Wilson–Cowan equations present many advantages from the point of view of A1 modelling: (i) they can be applied independently of the underlying structure, which is only encoded in the kernel of the integral term; (ii) they allow for a natural implementation of delay terms in the interactions; and (iii) they can be easily tuned via few parameters with a clear effect on the results. We therefore emulate this approach in the A1 context. Namely, we will consider the lifted sound image \(I(t,\omega ,\nu )\) to yield an A1 activation \(a(t,\omega ,\nu )\) via the following Wilson–Cowan equation:
\[ \partial _{t} a(t,\omega ,\nu ) = -\alpha \,a(t,\omega ,\nu ) + \beta \,I(t,\omega ,\nu ) + \gamma \int _{\mathbb {R}^{2}} k_{\delta }(\omega ,\nu \|\omega ',\nu ')\,\sigma \bigl(a(t-\delta ,\omega ',\nu ')\bigr)\,d\omega '\,d\nu '. \tag{WC} \]
Here \((t,\omega ,\nu )\) are coordinates on the augmented space corresponding to time, frequency, and chirpiness, respectively; \(\alpha , \beta , \gamma >0\) are parameters; \(\sigma :\mathbb {C}\to \mathbb {C}\) is a nonlinear sigmoid; \(k_{\delta }(\omega ,\nu \|\omega ',\nu ')\) is a weight modelling the interaction between \((\omega ,\nu )\) and \((\omega ',\nu ')\) after a delay of \(\delta >0\). The presence of this delay term models the fact that the timescales of the input signal and of the neuronal activation are comparable.
The proposed algorithm to process a sound signal \(s:[0,T]\to \mathbb {R}\) is the following:

A. Preprocessing

(a) Compute the time–frequency representation \(S:[0,T]\times \mathbb {R}\to \mathbb {C}\) of s, via standard short-time Fourier transform (STFT);

(b) Lift this representation to the Heisenberg group, which encodes redundant information about chirpiness, obtaining \(I:[0,T]\times \mathbb {R}\times \mathbb {R}\to \mathbb {C}\) (see Sect. 2.1 for details);

B. Processing: Process the lifted representation I via Wilson–Cowan equations adapted to the Heisenberg structure, obtaining \(a:[0,T]\times \mathbb {R}\times \mathbb {R}\to \mathbb {C}\).

C. Postprocessing: Project a to the processed time–frequency representation \(\hat{S}:[0,T]\times \mathbb{R}\to \mathbb{C}\) and then apply an inverse STFT to obtain the resulting sound signal \(\hat{s}:[0,T]\to \mathbb {R}\).
Remark 1
All the above operations can be performed in real time, as they only require the knowledge of the sound on a short window \([t-\delta ,t+\delta ]\).
Remark 2
Notice that in the presented algorithm we are assuming neural activations to be complex-valued functions, due to the use of the STFT. This is inconsistent with neural modelling, as it is known that the cochlea sends to A1 only the spectrogram of the STFT (that is, \(|S|\)), see [40]. When striving for a biologically plausible description, one can easily modify the above algorithm in this direction (i.e. by computing the lifted representation I starting from \(|S|\) instead of S). However, during the postprocessing phase, in order to invert the STFT and obtain an audible signal, one then needs to reconstruct the missing phase information via heuristic algorithms. See, for instance, [19].
1.4 Structure of the paper
In Sect. 2, we present the reconstruction model. We first present the lift procedure of a sound signal to a function on the augmented space, and then introduce the Wilson–Cowan equations modelling the cortical stimulus. In Sect. 3, we describe the numerical implementation of the algorithm, together with some of its crucial properties. This implementation is then tested in Sect. 4, where we show the results of the algorithm on some simple synthetic signals. Such numerical examples can be listened to at www.github.com/dprn/WCA1, and should be considered as a very preliminary step toward the construction of an efficient cortical-inspired algorithm for sound reconstruction. Finally, in Appendix B, we show how the proposed algorithm preserves the natural symmetries of sound signals.
2 The reconstruction model
As discussed in the introduction, the cochlea decomposes the input sound \(s:[0,T]\to \mathbb {R}\) in its time–frequency representation \(S:[0,T]\times \mathbb {R}\to \mathbb {C}\), obtained via a short-time Fourier transform (STFT). This corresponds to interpreting the ‘instantaneous sound’ at time \(\tau \in [0,T]\), instead of as a sound level \(s(\tau )\in \mathbb {R}\), as a function \(\omega \mapsto S(\tau ,\omega )\) which encodes the instantaneous presence of each given frequency, with phase information.
2.1 The lift to the augmented space
In this section, we present an extension of the time–frequency representation of a sound, which is at the core of the proposed algorithm. Roughly speaking, the instantaneous sound will be represented as a function \((\omega , \nu )\mapsto I(\tau ,\omega ,\nu )\), encoding the presence of both the frequency and the chirpiness \(\nu = {d\omega }/{d\tau }\).
Assume for the moment that the sound has a single time-varying frequency, e.g.
\[ s(t) = A\sin \bigl(\omega (t)\,t\bigr),\qquad A\in \mathbb {R}. \tag{1} \]
If the frequency is varying slowly enough and the window of the STFT is large enough, its sound image (up to the choice of normalising constants in the Fourier transform) coincides roughly with
\[ S(\tau ,\omega ) \sim \frac{A}{2i}\bigl(\delta _{0}(\omega -\omega (\tau )) - \delta _{0}(\omega +\omega (\tau ))\bigr), \]
where \(\delta _{0}\) is the Dirac delta distribution centered at 0. That is, S is concentrated on the two curves \(\tau \mapsto (\tau ,\omega (\tau ))\) and \(\tau \mapsto (\tau ,-\omega (\tau ))\), see Fig. 3. Let us focus only on the first curve.
Because of the sensitivity to variations of the input, as discussed in Sect. 1, the curve \(\omega (\tau )\) is lifted to a bigger space by adding a new variable \(\nu =d\omega /d\tau \). In mathematical terms, the 3-dimensional space \((\tau ,\omega ,\nu )\) is called the augmented space. It will be the basis for the geometric model of A1 that we are going to present.
Up to now the curve \(\omega (\tau )\) was parameterized by one of the coordinates of the contact space (the variable τ), but it will be more convenient to consider it as a parametric curve in the space \((\tau ,\omega )\). More precisely, the original curve \(\omega (\tau )\) is represented in the space \((\tau ,\omega )\) as \(t\mapsto (t,\omega (t))\) (thus imposing \(\tau =t\)). Similarly, the lifted curve is parameterized as \(t\mapsto (t,\omega (t),\nu (t))\). To every regular enough curve \(t\mapsto (t,\omega (t))\), one can associate a lift \(t\mapsto (t,\omega (t),\nu (t))\) in the contact space simply by computing \(\nu (t)=d\omega /dt\). Conversely, a regular enough curve in the contact space \(t\mapsto (\tau (t),\omega (t),\nu (t))\) is a lift of a planar curve \(t\mapsto (t,\omega (t))\) if \(\tau (t)=t\) and if \(\nu (t)=d\omega /dt\). Now, defining \(u(t)=d\nu /dt\), we can say that a curve in the contact space \(t\mapsto (\tau (t),\omega (t),\nu (t))\) is a lift of a planar curve if there exists a function \(u(t)\) such that
\[ \frac{d}{dt} \begin{pmatrix} \tau \\ \omega \\ \nu \end{pmatrix} = \begin{pmatrix} 1 \\ \nu \\ 0 \end{pmatrix} + u(t) \begin{pmatrix} 0 \\ 0 \\ 1 \end{pmatrix}. \tag{2} \]
Letting \(q=(\tau ,\omega ,\nu )\), equation (2) can be equivalently written as the control system
\[ \frac{d}{dt}q(t) = X_{0}(q(t)) + u(t)\,X_{1}(q(t)), \]
where \(X_{0}\) and \(X_{1}\) are two vector fields in \(\mathbb {R}^{3}\) given by
\[ X_{0}(\tau ,\omega ,\nu ) = (1,\nu ,0)^{\top }, \qquad X_{1}(\tau ,\omega ,\nu ) = (0,0,1)^{\top }. \]
Notice that the two vector fields appearing in this formula generate the Heisenberg group. However, we are not dealing here with a subRiemannian structure, since the space \(\{X_{0}+uX_{1}\mid u\in \mathbb {R}\}\) is a line and not a plane. (One would get a plane by considering two controls, namely \(\{u_{0}X_{0}+u_{1}X_{1}\mid (u_{0},u_{1})\in \mathbb {R}^{2}\}\).)
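As a quick check of the Heisenberg claim, here is a short worked computation. It assumes the two fields are the ones read off from \(\dot{\tau }=1\), \(\dot{\omega }=\nu \), \(\dot{\nu }=u\), that is, \(X_{0}=\partial _{\tau }+\nu \,\partial _{\omega }\) and \(X_{1}=\partial _{\nu }\) when regarded as first-order differential operators. For any smooth test function f,
\[ [X_{0},X_{1}]f = (\partial _{\tau }+\nu \,\partial _{\omega })\,\partial _{\nu }f - \partial _{\nu }\bigl(\partial _{\tau }f+\nu \,\partial _{\omega }f\bigr) = -\partial _{\omega }f, \]
so that \([X_{0},X_{1}]=-\partial _{\omega }\), while \([X_{0},[X_{0},X_{1}]]=[X_{1},[X_{0},X_{1}]]=0\). The three fields \(X_{0}\), \(X_{1}\), \([X_{0},X_{1}]\) are linearly independent at every point and satisfy the commutation relations of the Heisenberg Lie algebra.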
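Following [9], when s is a general sound signal, we lift each level line of \(|S|\). By the implicit function theorem, this yields the following subset of the contact space:
\[ \Sigma = \bigl\{ (\tau ,\omega ,\nu )\in \mathbb {R}^{3} \mid \nu \,\partial _{\omega }|S|(\tau ,\omega ) + \partial _{\tau }|S|(\tau ,\omega ) = 0 \bigr\}. \tag{3} \]
If \(|S|\in C^{2}\) and \(\operatorname{Hess}|S|\) is non-degenerate, the set Σ is indeed a surface. Finally, the external input from the cochlea is given by
\[ I(\tau ,\omega ,\nu ) = S(\tau ,\omega )\,\delta _{\Sigma }(\tau ,\omega ,\nu ) = \begin{cases} S(\tau ,\omega ) & \text{if } \nu \,\partial _{\omega }|S|(\tau ,\omega ) = -\partial _{\tau }|S|(\tau ,\omega ), \\ 0 & \text{otherwise.} \end{cases} \tag{4} \]
Here \(\delta _{\Sigma }\) denotes the Dirac delta distribution concentrated at Σ. The presence of this distributional term is necessary for a well-defined solution to the evolution equation (WC). Such an equation is introduced in the next section.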
2.2 Cortical activations in A1
On the basis of what was described in the previous section and the well-known tonotopical organization of A1 (cf. Sect. 1), we propose to consider A1 to be the space of \((\omega ,\nu )\in \mathbb {R}^{2}\). When hearing a sound \(s(\cdot )\), the external input fed to A1 at time \(t>0\) is then given as the slice at \(\tau =t\) of the lift I of s to the contact space. That is, hearing an ‘instantaneous sound level’ \(s(t)\) reflects in the external input \(I(t,\omega ,\nu )\) to the ‘neuron’ \((\omega ,\nu )\) in A1 as follows: The ‘neuron’ receives an external charge \(S(t,\omega )\) if \((t,\omega ,\nu )\in \Sigma \), and no charge otherwise, where Σ is defined in (3).
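We model the neuronal activation induced by the external stimulus I by adapting to this setting the well-known Wilson–Cowan equations. These equations are widely used and have proved to be very effective in the study of V1 [12, 43]. According to this framework, the resulting activation \(a:[0,T]\times \mathbb {R}\times \mathbb {R}\to \mathbb {C}\) is the solution of the following equation with delay \(\delta >0\):
\[ \partial _{t} a(t,\omega ,\nu ) = -\alpha \,a(t,\omega ,\nu ) + \beta \,I(t,\omega ,\nu ) + \gamma \int _{\mathbb {R}^{2}} k_{\delta }(\omega ,\nu \|\omega ',\nu ')\,\sigma \bigl(a(t-\delta ,\omega ',\nu ')\bigr)\,d\omega '\,d\nu ', \tag{5} \]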
with initial condition \(a(t,\cdot ,\cdot )\equiv 0\) for \(t\le 0\). Here \(\alpha ,\beta , \gamma >0\) are parameters, \(k_{\delta }\) is an interaction kernel, and \(\sigma :\mathbb {C}\to \mathbb {C}\) is a (nonlinear) saturation function, or sigmoid. In the following, we let \(\sigma (\rho e^{i\theta })=\tilde{\sigma }(\rho ) e^{i\theta }\) where \(\tilde{\sigma }(x) = \min \{1,\max \{0,\kappa x\}\}\), \(x\in \mathbb {R}\), for some fixed \(\kappa >0\). The fact that the nonlinearity σ does not act on the phase is one of the key ingredients in proving that this processing preserves the natural symmetries of sound signals, see Proposition 4 in Appendix B.
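The saturation function described above can be illustrated with a minimal Python sketch. The function name `sigma` and the default `kappa=1.0` are choices made for this example only. The sketch clips the modulus of a complex number to \([0,1]\) after scaling by κ and leaves the phase untouched, then checks numerically that applying a phase rotation before or after σ gives the same result.

```python
import cmath

def sigma(z: complex, kappa: float = 1.0) -> complex:
    """Saturate the modulus of z to [0, 1] after scaling by kappa; keep the phase."""
    rho, theta = cmath.polar(z)
    rho_saturated = min(1.0, max(0.0, kappa * rho))
    return cmath.rect(rho_saturated, theta)

z = 3.0 * cmath.exp(1j * 0.7)          # modulus 3, phase 0.7
phase_shift = cmath.exp(1j * 1.1)      # an arbitrary phase rotation

# Rotating the phase before or after applying sigma should agree.
rotate_then_saturate = sigma(z * phase_shift)
saturate_then_rotate = sigma(z) * phase_shift
```

Because σ only acts on the modulus, it also commutes with complex conjugation, a property used later when checking that the reconstructed signal is real-valued.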
When \(\gamma = 0\), equation (5) becomes the standard low-pass filter \(\partial _{t} a = -\alpha a + \beta I\), whose solution is the convolution of the input signal \(\beta I\) with the function
\[ \varphi (t) = \begin{cases} e^{-\alpha t} & \text{if } t\ge 0, \\ 0 & \text{otherwise.} \end{cases} \]
Setting \(\gamma \neq 0\) adds a nonlinear delayed interaction term on top of this exponential smoothing, encoding the inhibitory and excitatory interconnections between neurons. The next section is devoted to the choice of the integral kernel \(k_{\delta }\).
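The low-pass behaviour of the case \(\gamma =0\) can be checked numerically with a small sketch. It takes \(\beta =1\), a constant input switched on at \(t=0\), and an arbitrary step size, all of which are assumptions of this example. It compares a forward Euler integration of \(\partial _{t}a=-\alpha a+I\) with a Riemann-sum convolution of the input against \(e^{-\alpha t}\).

```python
import math

def lowpass_euler(inputs, alpha, dt):
    """Forward Euler for da/dt = -alpha * a + I(t), with a(0) = 0."""
    a, history = 0.0, []
    for value in inputs:
        history.append(a)
        a += dt * (-alpha * a + value)
    return history

def lowpass_convolution(inputs, alpha, dt):
    """Riemann sum for the convolution of I with exp(-alpha * t) on t >= 0."""
    history = []
    for n in range(len(inputs)):
        total = sum(math.exp(-alpha * (n - m) * dt) * inputs[m] for m in range(n))
        history.append(total * dt)
    return history

alpha, dt = 55.0, 2e-4               # alpha as in the experiments; dt is an arbitrary choice
step_input = [1.0] * 1000            # constant input switched on at t = 0
euler = lowpass_euler(step_input, alpha, dt)
conv = lowpass_convolution(step_input, alpha, dt)
```

Both sequences relax towards the steady state \(1/\alpha \) on a timescale of order \(1/\alpha \), and they differ only by the discretisation error of the two schemes.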
Remark 3
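In (5) we chose to consider a simple form for the interaction term. A more precise choice would indeed need to take into account the whole history of the process, for example, by considering
\[ \gamma \int _{0}^{+\infty }\int _{\mathbb {R}^{2}} k_{s}(\omega ,\nu \|\omega ',\nu ')\,\sigma \bigl(a(t-s,\omega ',\nu ')\bigr)\,d\omega '\,d\nu '\,ds. \]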
2.3 The neuronal interaction kernel
Considering A1 as a slice of the augmented space allows deducing a natural structure for neuron connections as follows. Going back to a sound composed of a single time-varying frequency \(t\mapsto \omega (t)\), we have that its lift is concentrated on the curve \(t\mapsto (\omega (t),\nu (t))\) such that
\[ \frac{d}{dt} \begin{pmatrix} \omega \\ \nu \end{pmatrix} = Y_{0}(\omega ,\nu ) + u(t)\,Y_{1}(\omega ,\nu ), \tag{6} \]
where \(Y_{0}(\omega ,\nu ) = (\nu ,0)^{\top }\), \(Y_{1}(\omega ,\nu ) = (0,1)^{\top }\), and \(u:[0,T]\to \mathbb {R}\).
As in the case of V1 [8], we model neuronal connections via these dynamics. In practice, this amounts to assuming that the excitation starting at a neuron \((\omega ',\nu ')\) evolves as the stochastic process \(\{A_{t}\}_{t\ge 0}\) naturally associated with (6). This is given by the following stochastic differential equation:
\[ dA_{t} = Y_{0}(A_{t})\,dt + Y_{1}(A_{t})\,dW_{t}, \qquad A_{0} = (\omega ',\nu '), \tag{7} \]
where \(\{W_{t}\}_{t\ge 0}\) is a Wiener process. The generator of \(\{A_{t}\}_{t\ge 0}\) is the second-order differential operator
\[ \mathcal{L} = Y_{0} + b\,(Y_{1})^{2} = \nu \,\partial _{\omega } + b\,\partial _{\nu }^{2}. \]
In this formula, the vector fields \(Y_{0}\) and \(Y_{1}\) are interpreted as first-order differential operators. Moreover, we added a scaling parameter \(b>0\), modelling the relative strength of the two terms.
It is natural to model the influence \(k_{\delta }(\omega ,\nu \|\omega ',\nu ')\) of neuron \((\omega ',\nu ' )\) on neuron \((\omega ,\nu )\) at time \(\delta >0\) as the transition density of the process \(\{A_{t}\}_{t\ge 0}\). It is well-known that such a transition density is obtained by computing the integral kernel at time δ of the Fokker–Planck equation corresponding to (7), which reads
\[ \partial _{t} p = \mathcal{L}^{*}p, \qquad \mathcal{L}^{*} = -Y_{0} + b\,(Y_{1})^{2} = -\nu \,\partial _{\omega } + b\,\partial _{\nu }^{2}. \tag{8} \]
The existence of an integral kernel for (8) is a consequence of the hypoellipticity^{Footnote 4} of \((\partial _{t} - \mathcal{L}^{*})\). The explicit expression of \(k_{\delta }\) is well-known, and we recall it in the following result, proved in Appendix A.
Proposition 1
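The integral kernel of equation (8) is
\[ k_{\delta }(\omega ,\nu \|\omega ',\nu ') = \frac{\sqrt{3}}{2\pi b\delta ^{2}} \exp \biggl( -\frac{g_{\delta }(\omega ,\nu \|\omega ',\nu ')}{b\delta ^{3}} \biggr), \tag{9} \]
where
\[ g_{\delta }(\omega ,\nu \|\omega ',\nu ') = 3(\omega -\omega ')^{2} - 3\delta (\omega -\omega ')(\nu +\nu ') + \delta ^{2}\bigl(\nu ^{2}+\nu \nu '+\nu '^{2}\bigr). \]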
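As a sanity check on a kernel of this type, the sketch below evaluates the classical Kolmogorov fundamental solution for \(\partial _{t}p=-\nu \,\partial _{\omega }p+b\,\partial _{\nu }^{2}p\), namely \(\frac{\sqrt{3}}{2\pi b\delta ^{2}}\exp (-g/(b\delta ^{3}))\) with \(g=3(\omega -\omega ')^{2}-3\delta (\omega -\omega ')(\nu +\nu ')+\delta ^{2}(\nu ^{2}+\nu \nu '+\nu '^{2})\). This closed form, the function name `k_delta`, and the values of δ and b are assumptions of this example. The code integrates the kernel over a grid (a transition density should have total mass one) and evaluates it at the point reached by pure drift, \((\omega '+\delta \nu ',\nu ')\).

```python
import math

def k_delta(omega, nu, omega_p, nu_p, delta, b):
    """Kolmogorov-type transition density from (omega_p, nu_p) to (omega, nu) after time delta."""
    d_omega = omega - omega_p
    g = (3.0 * d_omega ** 2
         - 3.0 * delta * d_omega * (nu + nu_p)
         + delta ** 2 * (nu ** 2 + nu * nu_p + nu_p ** 2))
    return math.sqrt(3.0) / (2.0 * math.pi * b * delta ** 2) * math.exp(-g / (b * delta ** 3))

delta, b = 1.0, 1.0                  # illustrative values, not the ones used in the experiments
source = (0.5, -1.0)                 # starting neuron (omega', nu')
step = 0.1

# Riemann sum of the density over a box centred at the source.
mass = 0.0
for i in range(-100, 101):
    for j in range(-100, 101):
        omega, nu = source[0] + i * step, source[1] + j * step
        mass += k_delta(omega, nu, source[0], source[1], delta, b) * step * step

peak = math.sqrt(3.0) / (2.0 * math.pi * b * delta ** 2)
value_at_drifted_point = k_delta(source[0] + delta * source[1], source[1],
                                 source[0], source[1], delta, b)
```

The kernel is maximal at the drifted point, which reflects the transport along \(Y_{0}\): an excitation at frequency ω' with chirpiness ν' is mostly propagated to frequency \(\omega '+\delta \nu '\).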
3 Numerical implementation
For the numerical experiments, we chose to implement the proposed algorithm in Julia [7]. As already presented, this process consists of a preprocessing phase, in which we build an input function I on the 3D contact space, a main part, where I is used as the input of the Wilson–Cowan equation (WC), and a postprocessing phase, where the reconstructed sound is recovered from the result of the main part.
In the following, we present these phases separately.
3.1 Preprocessing
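The input sound s is lifted to a time–frequency representation S via a classical implementation of STFT, i.e. by performing FFTs of a windowed discretised input. In the proposed implementation, we chose to use a standard Hann window (see, e.g. [36])
\[ W(t) = \begin{cases} \dfrac{1+\cos (2\pi t/L)}{2} & \text{if } |t|<L/2, \\ 0 & \text{otherwise,} \end{cases} \]
where \(L>0\) denotes the window length.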
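The windowed transform can be sketched in a few lines of standard-library Python. This is an illustration only: it uses a direct DFT instead of an FFT, a periodic Hann window, the \(e^{-2\pi i\cdot }\) sign convention, and hypothetical names (`hann`, `stft`) and sizes (window of 64 samples, hop of 32) that are not taken from the authors' Julia code.

```python
import cmath
import math

def hann(n_window):
    """Periodic Hann window of length n_window."""
    return [0.5 * (1.0 - math.cos(2.0 * math.pi * n / n_window)) for n in range(n_window)]

def stft(signal, n_window, hop):
    """Return frames[m][k]: DFT bin k of the m-th Hann-windowed frame."""
    window = hann(n_window)
    frames = []
    for start in range(0, len(signal) - n_window + 1, hop):
        chunk = [signal[start + n] * window[n] for n in range(n_window)]
        frames.append([
            sum(chunk[n] * cmath.exp(-2j * math.pi * k * n / n_window) for n in range(n_window))
            for k in range(n_window)
        ])
    return frames

sample_rate = 512                    # samples per second (arbitrary for this example)
tone_hz = 64.0                       # a pure tone sitting exactly on a DFT bin
signal = [math.sin(2.0 * math.pi * tone_hz * n / sample_rate) for n in range(512)]
S = stft(signal, n_window=64, hop=32)

# Among the non-negative frequency bins, find the one with the largest magnitude.
peak_bin = max(range(32), key=lambda k: abs(S[0][k]))
```

With these sizes the bin spacing is `sample_rate / n_window = 8` Hz, so the tone is expected to concentrate around bin `tone_hz / 8`. Since the input is real, each frame is Hermitian-symmetric, which is why only non-negative frequencies need to be displayed.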
The resulting time–frequency signal is then lifted to the contact space through an approximate computation of the gradient \(\nabla |S|\) and the following discretisation of (4):
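A minimal sketch of one way such a discretised lift could look is given below. It is an assumption of this example and not necessarily the scheme used in the authors' Julia code: centred finite differences approximate \(\nabla |S|\), the slope \(\nu =-\partial _{\tau }|S|/\partial _{\omega }|S|\) is rounded to the nearest point of a uniform chirpiness grid, and grid points with (numerically) vanishing \(\partial _{\omega }|S|\) are skipped. The names `lift`, `nu_grid` and `eps` are hypothetical.

```python
import math

def lift(S, d_tau, d_omega, nu_grid, eps=1e-12):
    """Return {(i, j, n): S[i][j]} for the nonzero entries of the lifted input.

    S[i][j] is the sound image at time index i and frequency index j;
    nu_grid is a uniform grid of admissible chirpinesses.
    """
    n_tau, n_omega = len(S), len(S[0])
    mag = [[abs(value) for value in row] for row in S]
    nu_step = nu_grid[1] - nu_grid[0]
    lifted = {}
    for i in range(1, n_tau - 1):
        for j in range(1, n_omega - 1):
            d_mag_tau = (mag[i + 1][j] - mag[i - 1][j]) / (2.0 * d_tau)
            d_mag_omega = (mag[i][j + 1] - mag[i][j - 1]) / (2.0 * d_omega)
            if abs(d_mag_omega) < eps:
                continue  # vertical contour line or flat region: skipped in this sketch
            nu = -d_mag_tau / d_mag_omega
            n = round((nu - nu_grid[0]) / nu_step)
            if 0 <= n < len(nu_grid):
                lifted[(i, j, n)] = S[i][j]
    return lifted

# Synthetic sound image whose level lines are straight lines of slope mu = 2.
mu, d_tau, d_omega = 2.0, 1.0, 2.0
S = [[complex(math.exp(-((j * d_omega - mu * i * d_tau) ** 2) / 18.0)) for j in range(12)]
     for i in range(12)]
nu_grid = [-4.0 + n for n in range(9)]      # admissible chirpinesses -4, -3, ..., 4
lifted = lift(S, d_tau, d_omega, nu_grid)
slopes = {nu_grid[n] for (_, _, n) in lifted}
```

Storing the lift as a dictionary of nonzero entries reflects the fact that I is supported on the surface Σ, so most of the 3D grid is empty.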
Discretisation issues
While the discretisation of the time and frequency domains is a well-understood problem, dealing with the additional chirpiness variable requires some care. Indeed, even if we assume that the significant frequencies of the input sound s belong to a bounded interval \(\Lambda \subset \mathbb {R}\), in general the set \(\{ \nu \in \mathbb {R}\mid I(\tau ,\omega ,\nu )\neq 0\}\) is unbounded. Indeed, one can check that as \((\tau ,\omega )\) moves to a point where the contour lines of \(|S|\) become vertical, the set of chirpinesses ν such that \(\nu \,\partial _{\omega }|S|(\tau ,\omega ) = -\partial _{\tau }|S|(\tau , \omega )\) will converge to ±∞.
In the numerical implementation, we chose to restrict the admissible chirpinesses to a bounded interval \(N\subset \mathbb {R}\). This set is chosen in a case by case fashion in order to contain the relevant slopes for the examples under consideration. Work is ongoing to automate this procedure.
3.2 Processing
Equation (WC) can be solved via a standard forward Euler method. Hence, the delicate part of the numerical implementation is the computation of the interaction term.
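The time stepping can be illustrated on a toy network. The sketch below is a simplification made for this example: activations are real-valued, the neurons form a small finite set, the interaction weights are given as a plain matrix `kernel[p][q]`, and the delay is an integer number of time steps. It applies forward Euler to \(\partial _{t}a=-\alpha a+\beta I+\gamma \sum _{q}k_{pq}\,\sigma (a_{q}(t-\delta ))\) with zero history for \(t\le 0\).

```python
def sigma_real(x, kappa=1.0):
    """Real-valued saturation min(1, max(0, kappa * x))."""
    return min(1.0, max(0.0, kappa * x))

def solve_wc(inputs, kernel, alpha, beta, gamma, delta_steps, dt):
    """Forward Euler with delay.

    inputs[n][p]: external input to neuron p at step n;
    kernel[p][q]: weight of neuron q on neuron p;
    delta_steps:  delay expressed in time steps.
    """
    n_steps, n_neurons = len(inputs), len(inputs[0])
    a = [[0.0] * n_neurons]                      # a = 0 at t = 0 (and for t < 0)
    for n in range(n_steps - 1):
        delayed = a[n - delta_steps] if n >= delta_steps else [0.0] * n_neurons
        fired = [sigma_real(value) for value in delayed]
        nxt = []
        for p in range(n_neurons):
            interaction = sum(kernel[p][q] * fired[q] for q in range(n_neurons))
            rate = -alpha * a[n][p] + beta * inputs[n][p] + gamma * interaction
            nxt.append(a[n][p] + dt * rate)
        a.append(nxt)
    return a

# Two neurons: only neuron 0 receives input, and it excites neuron 1 after the delay.
kernel = [[0.0, 0.0],
          [1.0, 0.0]]
inputs = [[1.0, 0.0] for _ in range(100)]
a = solve_wc(inputs, kernel, alpha=1.0, beta=1.0, gamma=1.0, delta_steps=10, dt=0.01)
```

In this toy run neuron 1 stays silent until the activity of neuron 0 has had time to travel through the delayed interaction term, which is the mechanism used to propagate information along the lifted slopes.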
As is clear from the explicit expression given in Proposition 1, \(k_{\delta }\) is not a convolution kernel. That is, \(k_{\delta }(\omega ,\nu \|\omega ',\nu ')\) cannot be expressed as a function of \((\omega -\omega ',\nu -\nu ')\). As a consequence, a priori we need to explicitly compute all values \(k_{\delta }(\omega ,\nu \|\omega ',\nu ')\) for \((\omega ,\nu )\) and \((\omega ',\nu ')\) in the considered domain. As is customary, in order to reduce computation times, we fix a threshold \(\varepsilon >0\) and for any given \((\omega ,\nu )\) we compute only values for \((\omega ',\nu ')\) in the compact set
\[ \mathrm{K}_{\delta }^{\varepsilon }(\omega ,\nu ) := \bigl\{ (\omega ',\nu ')\in \mathbb {R}^{2} \mid k_{\delta }(\omega ,\nu \|\omega ',\nu ')\ge \varepsilon \bigr\}. \]
The structure of \(\mathrm{K}_{\delta }^{\varepsilon }(\omega ,\nu )\) is given in the following proposition, whose proof we defer to Appendix A.
Proposition 2
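For any \(\varepsilon >0\) and \((\omega ,\nu )\in \mathbb {R}^{2}\), we have that \(\mathrm{K}_{\delta }^{\varepsilon }(\omega ,\nu )\) is the set of those \((\omega ',\nu ')\in \mathbb {R}^{2}\) that satisfy
\[ 3\Bigl(\omega -\omega '-\frac{\delta }{2}(\nu +\nu ')\Bigr)^{2} + \frac{\delta ^{2}}{4}(\nu -\nu ')^{2} \le C_{\varepsilon }, \qquad C_{\varepsilon } := b\delta ^{3}\log \biggl(\frac{\sqrt{3}}{2\pi b\delta ^{2}\varepsilon }\biggr). \]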
Remark 4
One has \(C_{\varepsilon }\ge 0\) if and only if
\[ \varepsilon \le \frac{\sqrt{3}}{2\pi b\delta ^{2}}. \]
Indeed, for any \((\omega ,\nu )\in \mathbb {R}^{2}\), the right-hand side above corresponds to \(\max k_{\delta }(\omega ,\nu \|\cdot ,\cdot )\), and thus \(\mathrm{K}_{\delta }^{\varepsilon }(\omega ,\nu ) = \varnothing \) for larger values of ε.
The above allows numerically implementing \(k_{\delta }\) as a family of sparse arrays. That is, let \(G\subset \Lambda \times N\) be the chosen discretisation of the significant set of frequencies and chirpinesses. Then to \(\xi = (\omega , \nu )\in G\) we associate the array \(M_{\xi }:G \to \mathbb {R}\) defined by
\[ M_{\xi }(\xi ') = \begin{cases} k_{\delta }(\xi \|\xi ') & \text{if } \xi '\in \mathrm{K}_{\delta }^{\varepsilon }(\xi ), \\ 0 & \text{otherwise.} \end{cases} \]
Therefore, provided the tolerance \(\varepsilon \ll 1\) is chosen sufficiently small, the interaction term in (WC), evaluated at \(\xi =(\omega ,\nu )\in G\), can be efficiently estimated by
\[ \sum _{\xi '\in G} M_{\xi }(\xi ')\,\sigma \bigl(a(t-\delta ,\xi ')\bigr). \]
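The thresholding idea can be sketched as follows, under the same assumption on the closed form of the kernel as the classical Kolmogorov fundamental solution (restated in the code so that the block is self-contained). The names `sparse_rows` and `interaction`, the grid, and the values of δ, b and ε are hypothetical; each sparse row is stored as a dictionary that keeps only the weights above the threshold, and no quadrature weight is included.

```python
import math

def k_delta(omega, nu, omega_p, nu_p, delta, b):
    """Kolmogorov-type kernel assumed for this sketch."""
    d_omega = omega - omega_p
    g = (3.0 * d_omega ** 2
         - 3.0 * delta * d_omega * (nu + nu_p)
         + delta ** 2 * (nu ** 2 + nu * nu_p + nu_p ** 2))
    return math.sqrt(3.0) / (2.0 * math.pi * b * delta ** 2) * math.exp(-g / (b * delta ** 3))

def sparse_rows(grid, delta, b, eps):
    """For each xi in grid, keep only the weights k_delta(xi || xi') >= eps."""
    rows = {}
    for xi in grid:
        row = {}
        for xi_p in grid:
            weight = k_delta(xi[0], xi[1], xi_p[0], xi_p[1], delta, b)
            if weight >= eps:
                row[xi_p] = weight
        rows[xi] = row
    return rows

def interaction(rows, xi, fired):
    """Sparse estimate of the interaction term at xi; fired[xi'] = sigma(a(t - delta, xi'))."""
    return sum(weight * fired[xi_p] for xi_p, weight in rows[xi].items())

# Grid of (omega, nu) pairs with spacing 0.5 on [-3, 3] x [-3, 3].
grid = [(-3.0 + 0.5 * i, -3.0 + 0.5 * j) for i in range(13) for j in range(13)]
rows = sparse_rows(grid, delta=1.0, b=1.0, eps=1e-3)
fired = {xi: 1.0 for xi in grid}
value_at_origin = interaction(rows, (0.0, 0.0), fired)
```

Choosing ε above the maximum of the kernel, \(\sqrt{3}/(2\pi b\delta ^{2})\), leaves every row empty, in line with Remark 4; smaller values of ε trade accuracy for sparsity.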
3.3 Postprocessing
Both operations in the preprocessing phase are invertible: the STFT by inverse STFT, and the lift by integration along the ν variable (that is, summation of the discretized solution). The final output signal is thus obtained by applying the inverse of the preprocessing (integration then inverse STFT) to the solution a of (WC). That is, the resulting signal is given by
\[ \hat{s}(t) = \operatorname{STFT}^{-1} \biggl( \int _{\mathbb {R}} a(t,\omega ,\nu )\,d\nu \biggr). \]
The following guarantees that ŝ is real-valued and thus correctly represents a sound signal. From the numerical point of view, this implies that we can focus on solutions of (WC) in the half-space \(\{\omega \geq 0\}\), which can then be extended to the whole space by mirror symmetry.
Proposition 3
It holds that \(\hat{s}(t)\in \mathbb {R}\) for all \(t>0\).
Proof
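Let us denote
\[ \hat{S}(t,\omega ) = \int _{\mathbb {R}} a(t,\omega ,\nu )\,d\nu , \]
so that \(\hat{s} = \operatorname{STFT}^{-1}(\hat{S})\). Moreover, for any function \(f(t,\omega ,\nu )\), we let \(f^{\star }(t,\omega ,\nu ):=\bar{f}(t,-\omega ,-\nu )\).
To prove the statement, it is enough to show that
\[ a(t,\omega ,\nu ) = a^{\star }(t,\omega ,\nu ) \quad \text{for all } t\ge 0. \tag{10} \]
Indeed, integrating (10) in ν yields \(\hat{S}(t,-\omega )=\overline{\hat{S}(t,\omega )}\), which makes the inverse STFT real-valued. Identity (10) is trivially satisfied for \(t\le 0\), since in this case \(a(t,\cdot ,\cdot )\equiv 0\).
We now claim that if (10) holds on \([0,T]\) it holds on \([0,T+\delta ]\), which will prove it for all \(t\ge 0\). By definition of I and the fact that \(S(t,-\omega )=\overline{S(t,\omega )}\), we immediately have \(I \equiv I^{\star }\). On the other hand, the explicit expression of \(k_{\delta }\) in (9) yields that
\[ k_{\delta }(-\omega ,-\nu \|-\omega ',-\nu ') = k_{\delta }(\omega ,\nu \|\omega ',\nu '). \]
Then, for all \(t\le T+\delta \), we have
\[ \biggl( \int _{\mathbb {R}^{2}} k_{\delta }(\omega ,\nu \|\omega ',\nu ')\,\sigma \bigl(a(t-\delta ,\omega ',\nu ')\bigr)\,d\omega '\,d\nu ' \biggr)^{\star } = \int _{\mathbb {R}^{2}} k_{\delta }(\omega ,\nu \|\omega ',\nu ')\,\sigma \bigl(a^{\star }(t-\delta ,\omega ',\nu ')\bigr)\,d\omega '\,d\nu ' = \int _{\mathbb {R}^{2}} k_{\delta }(\omega ,\nu \|\omega ',\nu ')\,\sigma \bigl(a(t-\delta ,\omega ',\nu ')\bigr)\,d\omega '\,d\nu ', \]
where the first equality uses the symmetry of \(k_{\delta }\) and the fact that σ commutes with complex conjugation, and the second uses (10) at time \(t-\delta \le T\).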
A simple argument, e.g. using the variation of constants method, shows that these two facts imply the claim, and thus the statement. □
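The role of the mirror symmetry can be seen on a single frame with a small standard-library sketch. The helper names `mirror` and `inverse_dft` and the sample values are hypothetical, and a plain inverse DFT stands in for the inverse STFT of one frame. A spectrum known on non-negative frequency bins is extended by Hermitian symmetry, and the inverse transform is then checked to have (numerically) vanishing imaginary part.

```python
import cmath
import math

def inverse_dft(spectrum):
    """Plain inverse discrete Fourier transform."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * math.pi * k * m / n) for k in range(n)) / n
            for m in range(n)]

def mirror(half):
    """Extend bins 0..n/2 to a full Hermitian-symmetric spectrum of even length n."""
    n = 2 * (len(half) - 1)
    full = list(half) + [0j] * (n - len(half))
    for k in range(1, n // 2):
        full[n - k] = half[k].conjugate()
    return full

# Bins 0..4 of a length-8 frame; bins 0 and 4 are real, as Hermitian symmetry requires.
half = [1.0 + 0j, 0.5 - 2.0j, -1.5 + 0.25j, 0.75 + 1.0j, 2.0 + 0j]
samples = inverse_dft(mirror(half))
max_imag = max(abs(x.imag) for x in samples)
```

This is the discrete counterpart of restricting the computation to the half-space \(\{\omega \ge 0\}\) and recovering the negative frequencies by symmetry.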
4 Experiments
In Figs. 4–7 we present a series of experiments on simple synthetic sounds in order to exhibit some key features of our algorithm. These experiments can be reproduced via the code available at https://www.github.com/dprn/WCA1. For all experiments, the chosen delay is \(\delta =0.0625\) s and we present the STFT of the original and the processed sound. Each time, only the positive frequencies are shown: negative frequencies are recovered via the Hermitian symmetry of the Fourier transform on real signals.
The first example, Fig. 4, is a simple linear chirp such that the dominating frequency depends linearly on time (i.e. corresponding to \(\omega (t)=\mu t\) for some \(\mu \in \mathbb{R}\)). One observes that the processed sound presents the same feature but for a longer duration. The parameters in the experiment (\(\alpha =55\), \(\beta =1\), \(\gamma =55\), \(b=0.05\)) have been chosen to emphasize the effect of the modelling equation: the reconstruction should not present a tail that is as pronounced; however, this choice highlights the diffusive effect along the lifted slope.
The second example, Fig. 5, corresponds to the same linear chirp as Fig. 4, that has been interrupted in its middle section, creating two disjoint linear chirps. The parameters are the same as in the previous experiment. Thanks to the transport effect of the algorithm, the gap between the two chirps is bridged in the processed signal. For this illustration, the interruption lasts about twice as long as the delay.
The third example, Fig. 6, consists of the sum of two linear chirps with different slopes. The slopes have been picked to suggest that linear continuations of the chirps should intersect. This is indeed what happens in the processed signal with parameters \(\alpha =53\), \(\beta =1\), \(\gamma =55\), \(b=0.01\). However, notice that the resulting crossing happens almost as a sum of the two chirps processed independently, with close to no interaction at the crossing. This is purely an effect of the lift procedure. The increasing chirp is (predominantly) lifted to a stratum corresponding to a positive slope, while the decreasing chirp is lifted to a negative slope stratum. De facto, their evolution under the Wilson–Cowan equation is decoupled in the 3D augmented space.
The fourth and last example, Fig. 7, corresponds to a nonlinear chirp, roughly corresponding to choosing \(\omega (\tau ) = \sin (m \tau )\) in (1). The chosen parameters are \(\alpha =53\), \(\beta =1\), \(\gamma =55\), \(b=0.2\). The construction of the model favors linearity in the evolution of perceived frequencies. We can observe how the more linear elements of the input result in more diffusion.
5 Conclusion
In this work we presented a sound reconstruction framework inspired by the analogies between visual and auditory cortices. Building upon the successful cortical-inspired image reconstruction algorithms, the proposed framework lifts time–frequency representations of signals to the 3D contact space, by adding instantaneous chirpiness information. These redundant representations are then processed via adapted integro-differential Wilson–Cowan equations.
The promising results obtained on simple synthetic sounds, although preliminary, suggest possible applications of this framework to the problem of degraded speech. The next step will be to test the reconstruction ability of normal-hearing humans on originally degraded speech material compared to the same speech material after algorithm reconstruction. Such an endeavour will contribute to the understanding of the auditory mechanisms emerging in adverse listening conditions. It will furthermore help to deepen our knowledge of the general organization principles underlying the functioning of the human auditory cortex.
Availability of data and materials
Not applicable.
Notes
A 3-dimensional manifold M becomes a contact space once it is endowed with a smooth map \(M\ni q\mapsto {\mathcal{D}}(q)\) where \({\mathcal{D}}(q)\) is a plane in the tangent space \(T_{q} M\) passing through q. There is an additional requirement on this map. Locally one can always write \({\mathcal{D}}(q)=\operatorname{span}\{X_{0}(q),X_{1}(q) \}\), where \(X_{0}\) and \(X_{1}\) are two smooth vector fields. Then at every point q one should require \(\operatorname{dim}(\operatorname{span} _{q}\{X_{0},X_{1},[X_{0},X_{1}]\})=3\). Here \([\cdot ,\cdot ]\) is the Lie bracket of the vector fields. The main consequence of this condition is that no surface can be tangent to \({\mathcal{D}}\) at all points.
By assigning to every \({\mathcal{D}}(q)\) an inner (Euclidean) product that is smooth as a function of q, we endow M with a sub-Riemannian structure. The simplest way of defining locally such a structure on a 3-dimensional manifold is to assign two vector fields \(X_{0}\) and \(X_{1}\) postulating, on the one hand, that \({\mathcal{D}}(q)=\operatorname{span}\{X_{0}(q),X_{1}(q) \}\) (assigning in this way the contact structure) and, on the other hand, that they have norm one and are mutually orthogonal (assigning in this way the inner product).
The simplest example of a sub-Riemannian structure on \(\mathbb {R}^{3}\) is given by the so-called Heisenberg group, for which the vector fields \(X_{0} = (1,\nu ,0)^{\top }\) and \(X_{1} = (0,0,1)^{\top }\) are orthonormal (here we write coordinates in \(\mathbb {R}^{3}\) as \((\tau ,\omega ,\nu )\)). Such a structure is called the Heisenberg group since, defining \(X_{2}=(0,1,0)^{\top }\), one has the Lie brackets \([X_{0},X_{1}]=-X_{2}\), \([X_{0},X_{2}]=[X_{1},X_{2}]=0\), which are the commutation relations appearing in quantum mechanics.
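The bracket relation and the contact condition can be checked numerically. The sketch below uses the convention \([X,Y]=(DY)X-(DX)Y\), under which \([X_{0},X_{1}]\) evaluates to \(-X_{2}\); the opposite convention flips the sign. The helper names, the step `h` and the test point `q` are hypothetical.

```python
def X0(q):
    tau, omega, nu = q
    return (1.0, nu, 0.0)

def X1(q):
    return (0.0, 0.0, 1.0)

def directional_derivative(field, q, v, h=1e-3):
    """Central-difference approximation of (D field)(q) applied to v."""
    plus = field(tuple(qi + h * vi for qi, vi in zip(q, v)))
    minus = field(tuple(qi - h * vi for qi, vi in zip(q, v)))
    return tuple((p - m) / (2 * h) for p, m in zip(plus, minus))

def lie_bracket(X, Y, q):
    """[X, Y](q) = (DY)(q) X(q) - (DX)(q) Y(q)."""
    dY_X = directional_derivative(Y, q, X(q))
    dX_Y = directional_derivative(X, q, Y(q))
    return tuple(a - b for a, b in zip(dY_X, dX_Y))

def det3(a, b, c):
    """Determinant of the 3x3 matrix with rows a, b, c."""
    return (a[0] * (b[1] * c[2] - b[2] * c[1])
            - a[1] * (b[0] * c[2] - b[2] * c[0])
            + a[2] * (b[0] * c[1] - b[1] * c[0]))

q = (0.3, -1.2, 0.7)               # arbitrary point (tau, omega, nu)
bracket = lie_bracket(X0, X1, q)   # close to (0, -1, 0), i.e. -X2
volume = det3(X0(q), X1(q), bracket)
```

A nonzero `volume` means that \(X_{0}\), \(X_{1}\) and their bracket span the whole tangent space at `q`, which is the dimension requirement stated in the definition of a contact space.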
Note that in mathematics, the term ‘direction’ corresponds to what neurophysiologists call ‘orientation’, and vice versa. In this study, we use the mathematical terminology.
Actually, its spectrogram \(S:[0,T]\times \mathbb {R}\to [0,+\infty )\), see Remark 2.
That is, if f is a distribution defined on an open set Ω and such that \((\partial _{t}-\mathcal{L}^{*})f\in C^{\infty }(\Omega )\), then \(f\in C^{\infty }(\Omega )\).
References
Agrachev A, Barilari D, Boscain U. A Comprehensive Introduction to Sub-Riemannian Geometry. Cambridge Studies in Advanced Mathematics. Cambridge: Cambridge University Press; 2020.
Assmann P, Summerfield Q. The Perception of Speech Under Adverse Conditions. New York: Springer; 2004. p. 231–308.
Barilari D, Boarotto F. Kolmogorov–Fokker–Planck operators in dimension two: heat kernel and curvature. SIAM J Control Optim. 2018.
Bertalmío M, Calatroni L, Franceschi V, Franceschiello B, Gomez Villa A, Prandi D. Visual illusions via neural dynamics: Wilson–Cowan-type models and the efficient representation principle. J Neurophysiol. 2020;PMID:32159409.
Bertalmío M, Calatroni L, Franceschi V, Franceschiello B, Prandi D. A cortical-inspired model for orientation-dependent contrast perception: A link with Wilson–Cowan equations. In: Scale Space and Variational Methods in Computer Vision. Cham: Springer; 2019.
Bertalmío M, Calatroni L, Franceschi V, Franceschiello B, Prandi D. Cortical-inspired Wilson–Cowan-type equations for orientation-dependent contrast perception modelling. J Math Imaging Vis. 2020.
Bezanson J, Edelman A, Karpinski S, Shah VB. Julia: A fresh approach to numerical computing. SIAM Rev. 2017;59(1):65–98.
Boscain U, Chertovskih R, Gauthier JP, Remizov A. Hypoelliptic diffusion and human vision: a semidiscrete new twist on the Petitot theory. SIAM J Imaging Sci. 2014;7(2):669–95.
Boscain U, Duplaix J, Gauthier JP, Rossi F. Anthropomorphic image reconstruction via hypoelliptic diffusion. 2010.
Boscain UV, Chertovskih R, Gauthier JP, Prandi D, Remizov A. Highly corrupted image inpainting through hypoelliptic diffusion. J Math Imaging Vis. 2018;60(8):1231–45.
Bramanti M. An invitation to hypoelliptic operators and Hörmander’s vector fields. SpringerBriefs in Mathematics. Cham: Springer; 2014.
Bressloff PC, Cowan JD, Golubitsky M, Thomas PJ, Wiener MC. Geometric visual hallucinations, Euclidean symmetry and the functional architecture of striate cortex. Philos Trans R Soc Lond B, Biol Sci. 2001;356(1407):299–330.
Citti G, Sarti A. A Cortical Based Model of Perceptual Completion in the Roto-Translation Space. J Math Imaging Vis. 2006;24(3):307–26.
Dallos P. Overview: Cochlear Neurobiology. New York: Springer; 1996. p. 1–43.
Duits R, Franken E. Left-invariant parabolic Evolutions on SE(2) and Contour Enhancement via Invertible Orientation Scores. Part I: Linear Left-invariant Diffusion Equations on SE(2). Q Appl Math. 2010;68(2):255–92.
Duits R, Franken E. Left-invariant parabolic evolutions on SE(2) and contour enhancement via invertible orientation scores. Part II: nonlinear left-invariant diffusions on invertible orientation scores. Q Appl Math. 2010;68(2):293–331.
Ermentrout GB, Cowan JD. A mathematical theory of visual hallucination patterns. Biol Cybern. 1979;34:137–50.
Fernandes T, Ventura P, Kolinsky R. Statistical information and coarticulation as cues to word boundaries: A matter of signal quality. Percept Psychophys. 2007;69(6):856–64.
Fienup J. Phase retrieval algorithms: a comparison. Appl Opt. 1982;21(15):2758–69.
Franken E, Duits R. Crossing-Preserving Coherence-Enhancing Diffusion on Invertible Orientation Scores. Int J Comput Vis. 2009;85(3):253–78.
Gröchenig K. Foundations of time-frequency analysis. Applied and Numerical Harmonic Analysis. Boston: Birkhäuser Boston; 2001.
Hannemann R, Obleser J, Eulitz C. Top-down knowledge supports the retrieval of lexical information from degraded speech. Brain Res. 2007;1153:134–43.
Hickok G, Poeppel D. The cortical organization of speech processing. Nat Rev Neurosci. 2007;8(5):393–402.
Hoffman WC. The visual cortex is a contact bundle. Appl Math Comput. 1989;32(2–3):137–67.
Hubel DH, Wiesel TN. Receptive fields of single neurons in the cat’s striate cortex. J Physiol. 1959;148(3):574–91.
Loebel A, Nelken I, Tsodyks M. Processing of Sounds by Population Spikes in a Model of Primary Auditory Cortex. Front Neurosci. 2007;1(1):197–209.
Luce PA, McLennan CT. Spoken Word Recognition: The Challenge of Variation. New York: Wiley; 2008. p. 590–609.
Mattys S, Davis M, Bradlow A, Scott S. Speech recognition in adverse conditions: A review. Lang Cogn Neurosci. 2012;27(7–8):953–78.
Montgomery R. A tour of sub-Riemannian geometries, their geodesics and applications. Mathematical Surveys and Monographs. vol. 91. Providence: Am. Math. Soc.; 2002.
Nelken I, Calford MB. Processing Strategies in Auditory Cortex: Comparison with Other Sensory Modalities. Boston: Springer; 2011. p. 643–56.
Parikh G, Loizou PC. The influence of noise on vowel and consonant cues. J Acoust Soc Am. 2005;118(6):3874–88.
Petitot J, Tondut Y. Vers une neurogéométrie. Fibrations corticales, structures de contact et contours subjectifs modaux. Math Sci Hum. 1999;145:5–101.
Petitot J, Tondut Y. Vers une Neurogéométrie. Fibrations corticales, structures de contact et contours subjectifs modaux. 1999;1–96.
Polger TW, Shapiro LA. The multiple realization book. Oxford: Oxford University Press; 2016.
Prandi D, Gauthier JP. A semidiscrete version of the Citti–Petitot–Sarti model as a plausible model for anthropomorphic image reconstruction and pattern recognition. SpringerBriefs in Mathematics. Cham: Springer; 2017.
Press WH, Teukolsky SA, Vetterling WT, Flannery BP. Numerical Recipes. 3rd ed. Cambridge: Cambridge University Press; 2007.
Rankin J, Sussman E, Rinzel J. Neuromechanistic Model of Auditory Bistability. PLoS Comput Biol. 2015;11(11):e1004555.
Rauschecker JP. Auditory and visual cortex of primates: a comparison of two sensory systems. Eur J Neurosci. 2015;41(5):579–85.
Sarti A, Citti G. The constitution of visual perceptual units in the functional architecture of V1. J Comput Neurosci. 2015;38(2):285–300.
Sethares W. Tuning, Timbre, Spectrum, Scale. London: Springer; 2005.
Sharma J, Angelucci A, Sur M. Induction of visual orientation modules in auditory cortex. Nature. 2000;404(6780):841–7.
Tian B, Kuśmierek P, Rauschecker JP. Analogues of simple and complex cells in rhesus monkey auditory cortex. Proc Natl Acad Sci. 2013;110(19):7892–7.
Wilson HR, Cowan JD. Excitatory and inhibitory interactions in localized populations of model neurons. Biophys J. 1972;12(1):1–24.
Zatorre RJ. Do you see what I’m saying? Interactions between auditory and visual cortices in cochlear implant users. Neuron. 2001;31(1):13–4.
Zhang J, Dashtbozorg B, Bekkers E, Pluim JPW, Duits R, ter Haar Romeny BM. Robust retinal vessel segmentation via locally adaptive derivative frames in orientation scores. IEEE Trans Med Imaging. 2016;35(12):2631–44.
Zulfiqar I, Moerel M, Formisano E. Spectro-Temporal Processing in a Two-Stream Computational Model of Auditory Cortex. Front Comput Neurosci. 2020;13:95.
Acknowledgements
The authors thank Jean-Paul Gauthier for stimulating discussions.
Funding
The authors acknowledge the support of the ANR project SRGI ANR-15-CE40-0018 and of the ANR project Quaco ANR-17-CE40-0007-01. This study was also supported by the IdEx Université de Paris, ANR-18-IDEX-0001, awarded to the last author, and by a public grant overseen by the French National Research Agency (ANR) as part of the program “Investissements d’Avenir” (reference: ANR-10-LABX-0083).
Author information
Contributions
All authors have contributed equally. All authors read and approved the final manuscript.
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Consent for publication
Not applicable.
Additional information
Abbreviations
Not applicable.
Appendices
Appendix A: Integral kernel of the Kolmogorov operator
The result in Proposition 1 is well-known. For example, by applying [3, Proposition 9] and letting \(x=(\omega ,\nu )\) and \(x'=(\omega ',\nu ')\), one gets that the kernel is
where
and
Direct computations yield
Therefore,
Finally, the statement follows by letting
We now turn to an argument for Proposition 2. Observe that \(k_{\delta }(x\mid x')\ge \varepsilon \) if and only if
Then, we start by solving \(z^{\top }M z\le \eta \), for \(z\in \mathbb {R}^{2}\). One can check that this is verified if and only if
Since \(C_{\varepsilon }= 4\eta /\delta ^{2}\), the statement follows by computing the above at \(z = x'-e^{\tau A}x\).
Appendix B: Heisenberg group action on the contact space
Recall that the short-time Fourier transform of a signal \(s\in L^{2}(\mathbb {R})\) is given by
Here \(W:\mathbb {R}\to [0,1]\) is a compactly supported (smooth) window, so that \(S\in L^{2}(\mathbb {R}^{2})\). Fundamental operators in time–frequency analysis [21] are time and phase shifts, acting on signals \(s\in L^{2}(\mathbb {R})\) by
for \(\theta ,\lambda \in \mathbb {R}\). One easily checks that \(T_{\theta }\) and \(M_{\lambda }\) are unitary operators on \(L^{2}(\mathbb {R})\). By conjugation with the short-time Fourier transform, they naturally define the unitary operators on \(L^{2}(\mathbb {R}^{2})\) given by
The relevance of the Heisenberg group in time–frequency analysis is a direct consequence of the commutation relation
Indeed, this shows that the operator algebra generated by \((T_{\theta })_{\theta \in \mathbb {R}}\) and \((M_{\lambda })_{\lambda \in \mathbb {R}}\) coincides with the Heisenberg group \(\mathbb{H}^{1}\) via the representation \(U:\mathbb{H}^{1}\to \mathcal{U}(L^{2}(\mathbb {R}))\) defined by
The above discussion shows that the Heisenberg group can be regarded as the natural space of symmetries of sound signals. In particular, any meaningful treatment of these signals should respect such a symmetry. In the case of our model, this is the content of the following result.
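The commutation relation behind this symmetry can be checked on a discrete, periodic analogue. The sketch below assumes the common convention \(T_{\theta }s(t)=s(t-\theta )\), \(M_{\lambda }s(t)=e^{2\pi i\lambda t}s(t)\), for which \(T_{\theta }M_{\lambda }=e^{-2\pi i\lambda \theta }M_{\lambda }T_{\theta }\); the sign and normalization used in this article may differ. The functions `translate` and `modulate`, the signal length and the shift values are hypothetical.

```python
import cmath

def translate(signal, k):
    """Circular time shift: (T_k s)[n] = s[(n - k) mod N]."""
    n_samples = len(signal)
    return [signal[(n - k) % n_samples] for n in range(n_samples)]

def modulate(signal, l):
    """Frequency shift: (M_l s)[n] = exp(2*pi*i*l*n/N) * s[n]."""
    n_samples = len(signal)
    return [cmath.exp(2j * cmath.pi * l * n / n_samples) * x
            for n, x in enumerate(signal)]

N, k, l = 16, 3, 5
s = [complex(n % 4, -n) for n in range(N)]  # arbitrary test signal

# Left-hand side: T_k M_l s
lhs = translate(modulate(s, l), k)
# Right-hand side: exp(-2*pi*i*k*l/N) * M_l T_k s
phase = cmath.exp(-2j * cmath.pi * k * l / N)
rhs = [phase * x for x in modulate(translate(s, k), l)]

max_error = max(abs(a - b) for a, b in zip(lhs, rhs))
```

The two sides agree up to rounding error, so time and frequency shifts commute only up to a phase factor. This extra phase is what makes a third, central coordinate necessary in the group generated by the shifts.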
Proposition 4
The sound processing algorithm presented in this paper commutes with the Heisenberg group action (12) on sound signals. That is, if the input sound signal \(s\in L^{2}(\mathbb {R})\) yields ŝ as a result, then, for any \((\theta ,\lambda ,\zeta )\in \mathbb{H}^{1}\), the input \(U(\theta ,\lambda ,\zeta )s\) yields \(U(\theta ,\lambda ,\zeta )\hat{s}\) as a result.
Proof
We can schematically write the algorithm as:
Here Lift is the lift operator defined in Sect. 2.1, WC denotes the Wilson–Cowan evolution (WC), and Proj denotes the projection from the augmented space to the time–frequency representation, defined by
Observe that (11) shows that U induces a representation of \(\mathbb{H}^{1}\) on \(L^{2}(\mathbb {R}^{2})\), the codomain of the STFT, which we will denote by Ũ. Thus, to prove the statement it suffices to show that
Recall now that Lift associates with \(S\in L^{2}(\mathbb {R}^{2})\) a distribution of the form \(\operatorname{Lift}[ S] (\tau ,\omega ,\nu )=S(\tau ,\omega )\delta _{\Sigma }(\tau ,\omega ,\nu )\) for some \(\Sigma \subset \mathbb {R}^{3}\). Due to the fact that Σ is defined via the modulus of S, it is unaffected by the phase factors appearing in the representation Ũ. That is, the lift of \(\tilde{U}(\theta ,\lambda ,\zeta )S\) is given by
It is then immediate to check that \([\operatorname{Proj}\circ \operatorname{Lift}, \tilde{U}(\theta , \lambda ,\zeta )]=0\).
We are left to verify that the operator WC commutes with \(\tilde{U}(\theta ,\lambda ,\zeta )\). The commutation is trivial for the linear terms. On the other hand, the nonlinearity introduced in the integral term commutes with \(\tilde{U}(\theta ,\lambda ,\zeta )\) thanks to the fact that \(\sigma (\rho e^{i\phi }) = e^{i\phi }\sigma (\rho )\) for all \(\rho >0\), \(\phi \in \mathbb {R}\). □
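The property of σ used in the last step can be illustrated numerically. The sketch below assumes a nonlinearity that saturates the modulus and keeps the phase, \(\sigma (\rho e^{i\phi })=e^{i\phi }\min \{1,\kappa \rho \}\); this specific form and the gain `kappa` are assumptions made for illustration, and any σ with \(\sigma (\rho e^{i\phi }) = e^{i\phi }\sigma (\rho )\) behaves in the same way.

```python
import cmath

def sigma(z, kappa=1.0):
    """Phase-preserving saturation (assumed form):
    sigma(rho * e^{i*phi}) = e^{i*phi} * min(1, kappa * rho)."""
    rho = abs(z)
    if rho == 0.0:
        return 0j
    return (z / rho) * min(1.0, kappa * rho)

z = 0.4 + 0.3j   # modulus 0.5, below the saturation threshold
phi = 1.234      # arbitrary phase shift

rotated_then_sigma = sigma(cmath.exp(1j * phi) * z)
sigma_then_rotated = cmath.exp(1j * phi) * sigma(z)
error = abs(rotated_then_sigma - sigma_then_rotated)

# A large input is clipped to modulus 1 while its phase is kept.
saturated = sigma(3.0 + 4.0j)
```

Applying the phase factor before or after σ gives the same value up to rounding, which is the property that lets the integral term commute with the phase factors appearing in \(\tilde{U}(\theta ,\lambda ,\zeta )\).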
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Boscain, U., Prandi, D., Sacchelli, L. et al. A bio-inspired geometric model for sound reconstruction. J. Math. Neurosc. 11, 2 (2021). https://doi.org/10.1186/s13408-020-00099-4
DOI: https://doi.org/10.1186/s13408-020-00099-4