A bio-inspired geometric model for sound reconstruction

The reconstruction mechanisms built by the human auditory system during sound reconstruction are still a matter of debate. The purpose of this study is to propose a mathematical model of sound reconstruction based on the functional architecture of the auditory cortex (A1). The model is inspired by the geometrical modelling of vision, which has undergone a great development in the last ten years. There are, however, fundamental dissimilarities, due to the different role played by time and the different group of symmetries. The algorithm transforms the degraded sound in an ‘image’ in the time–frequency domain via a short-time Fourier transform. Such an image is then lifted to the Heisenberg group and is reconstructed via a Wilson–Cowan integro-differential equation. Preliminary numerical experiments are provided, showing the good reconstruction properties of the algorithm on synthetic sounds concentrated around two frequencies.


Introduction
Listening to speech requires the capacity of the auditory system to map incoming sensory input to lexical representations.When the sound is intelligible, this mapping ("recognition") process is successful.With reduced intelligibility (e.g., due to background noise), the listener has to face the task of recovering the loss of acoustic information.This task is very complex as it requires a higher cognitive load and the ability of repairing missing input ( [27] for a review on noise in speech).Yet, (normal hearing) humans are quite able to recover sounds in several effortful listening situations (e.g.see for instance [26], ranging from sounds degraded at the source (e.g., hypoarticulated and pathological speech), during transmission (e.g., noise, reverberation) or corrupted because of physiological deficits (e.g.hearing loss; [27] among others).
So far, work on degraded speech has informed us a lot on the acoustic cues that help the recogniser to reconstruct missing information (e.g., [17,30]); the several adverse conditions in which listeners are able to reconstruct speech sounds (e.g., [2,27]); and whether (and at which stage of the auditory process) higher-order knowledge (i.e.our information about words and sentences) helps the system to recover lower-level perceptual information (e.g., [20]).
However, most of these studies are typically based on a statistical approach.A mathematical model informing us on how the human auditory system is able to reconstruct a degraded speech sound is still missing.The aim of this study is to build a neuro-geometric model for sound reconstruction.
Mathematical modelling of sensory input reconstruction has made a lot of progresses in the field of vision [31], [13], and [8].In later years, algorithms inspired by the structure of the primary visual cortex (V1) have been very successful in image processing and in particular for image Figure 1: Perceived pitch of a sound depends on the location in the cochlea that the sound wave stimulated.High-frequency sound waves, which correspond to high-pitched noises, stimulate the basal region of the cochlea (red).Low-frequency sound waves are targeted to the apical region of the cochlear structure and correspond with low-pitched sounds (purple).Image from [40].reconstruction tasks, see, .e.g., [18,15,34,9].To our knowledge, such work has not been done for sound processing, probably due to the lack of information regarding the primary auditory cortex (A1) with respect to V1.
The model proposed here is highly inspired by the one successfully applied for the primary visual cortex.The analogy between the structure of V1 and A1 is well-grounded on the existence of several biological similarities between the two cortex.For neuroscientists, models of the visual cortex are taken as a starting point for understanding mechanisms of the auditory system (see, for instance, [29] for a comparison, [21] for a related discussion in speech processing).A welloften cited case is the "topographic" organization of the cortex, a general principle according to which the processing of sensory information strongly lies on for mapping visual input and auditory-frequency input to neurons [36].
Within the specific case of the auditory system, sensors (so-called hair cells) are tonotopically organized along the spiral ganglion of the cochlea in a frequency-specific fashion, with cells close to the base of the ganglion being more sensitive to low-frequency sounds and cells near the apex more sensitive to high-frequency sounds, see Figure 1.This early 'spectrogram' of the signal is then transmitted to higher-order layers of the auditory cortex.Strong evidence for V1-A1 analogy comes from studies on animals and on humans with deprived hearing or visual functions showing cross-talk interactions between sensory regions [38,42].More relevant for our study is the existence of receptive fields of neurons in V1 and A1 ("simple" and "complex" cells), which supports the idea of a "common canonical processing algorithm within cortical columns" [39]: p.1.The presence of S-cells/C-cells and the appearance of singularities in the feature preference map typical of V1 (e.g., pinwheels) in certain situations [38,33] speaks in favour of the idea that V1 and A1 share similar mechanisms of sensory input reconstruction.However, there are certain differences to take into account, as explained in the following.
In the next section we present the mathematical model for V1 that will be the basis for our sound reconstruction algorithm.

Neuro-geometric model of V1
The neuro-geometric model of V1 can be traced back to the work of [22], which, inspired by the experimental results of [23], first proposed to model the primary visual cortex as a contact space.This model has then been extended to the so-called sub-Riemannian model in [32], [13], and [8,7].On the basis of such a model, exceptionally efficient algorithms for image inpainting have been developed (e.g.[9,14,15]), resulting in several medical imaging applications (e.g., [43]).
The main idea behind this model is that an image, seen as a function f : R 2 → R + representing the grey level, is lifted into a surface in the bundle of the direction of the plane R 2 × P 1 .Here, P1 is the space of directions on the plane measured without orientation 1 , namely the circle S 1 in which antipodal points are identified.The lift is realized by adding to each point of the image the direction of the tangent line to the level set of f .Under suitable regularity assumptions on f , such lift is a surface S f .When f is corrupted (i.e. when f is not defined in some region of the plane), the lift is corrupted as well and the reconstruction is obtained by applying a deeply anisotropic diffusion adapted to the problem.Such diffusion mimics the flow of information along the horizontal and vertical connections of V1 and uses as an initial condition the surface S f and the values of the function f .Mathematicians call such a diffusion the sub-Riemannian diffusion in R 2 × P 1 , cf. [28,1].One of the main features of this image reconstruction model is that it is invariant by rototranslation of the plane, a feature that will not be possible to translate to the case of sounds, due to the special role of the time variable.
In what follows, we explain how similar ideas could be translated to the problem of sound reconstruction.

From V1 to sound reconstruction
A sound (that we can think as represented by a function s : [0, T ] → R) is transformed to its time-frequency representation S : [0, T ] × R → C, which can be thought as the collection of two black-and-white images: |S| and arg S. The function S depends on two variables: the first one is time, that here we indicate with the letter τ , and the second one is frequency, denoted by ω.Roughly speaking, |S(τ, ω)| represents the strength of the presence of the frequency ω at time τ .In the following, we call S the sound image (see Figure 2).
A first attempt to model the process of sound reconstruction into A1 is to apply the same algorithm of image reconstruction discussed above.In a sound image, however, time plays a special role.Indeed: 1.While for images the reconstruction can be done by evolving the whole image at the same time, the sound image is not reaching the auditory cortex all at the same time and thus in this case the reconstruction can be performed only in a sliding window.
2. A rotated sound image corresponds to a completely different original sound and thus the invariance by rototranslations is lost.
As a consequence, different symmetries have to be taken into account (see Appendix B) and a different model for both the lift and the processing in the lifted space is required.
In order to introduce this model, let us recall that, in V1, neural stimulation stems not only from the input but also from its variations.That is, mathematically speaking, the input image is considered as a real valued function on a 2-dimensional space, and the orientation sensitivity arises from the sensitivity to a first order derivative information on this function.Furthermore, the geometric relation between the perceived orientation and the derivatives of the input signal yields a variational problem on an underlying non-commutative structure.This structure, endowed with a metric naturally associated with the variational problem, gives rise to the sub-Riemannian diffusion [1,10].
In our model, we follow these same principles in order to study sound inputs à la V1.Input sound signals are time dependent real valued functions transformed, by the action of the cochlea, via a short time Fourier transform.As a result the neuronal input is considered as a function of time and frequency.The first time derivative of this object allows to add a supplementary dimension to the domain of the input.Variation of the perceived frequency ν = dω/dτ can be understood as chirpiness, and this notion gives rise to a natural lift of the signal to the contact space in the sense of [22,31], i.e., R 3 with the Heisenberg group structure.(This structure very often appears in signal processing, see for instance [19] and Appendix B.) As in the case of V1, this observation implies the presence of a non-commutative structure associated with this relation.The hypoelliptic operator associated with this structure is the celebrated Kolmogorov operator [24,25].
As we already mentioned, the special role played by the time in sound signals, does not permit to model the flow of information as a pure hypoelliptic diffusion, as was done for static images in V1.We thus turn to a more sophisticated and flexible model known as Wilson-Cowan equations [41].Such a model, based on a differo-integral equation, has been successfully applied to describe the evolution of neural activations and allowed to predict complex perceptual phenomena in V1, such as the emergence of hallucinatory pattern [16,11].Recently, these equations have been coupled with the neuro-geometric model of V1 to great benefit.For instance, in [5,4] they allowed to replicate orientation-dependent brightness illusory phenomena, which had proved to be a hurdle for non-cortical-inspired models.See also [37], for applications to the detection of perceptual units.
On top of these positive results, Wilson-Cowan equations present many advantages from the point of view of A1 modelling: i) they can be applied independently of the underlying structure, which is only encoded in the kernel of the integral term; ii) they allow for a natural implementation of delay terms in the interactions; iii) they can be easily tuned via few parameters with a clear effect on the results.Motivated by these positive results, we emulate this approach in the A1 context.Namely, we will consider the lifted sound image I(τ, ω, ν) to yield an A1 activation a(τ, ω, ν) via the following Wilson-Cowan equations: Here, (τ, ω, ν) are coordinates on the contact space corresponding to time, frequency, and chirpiness, respectively; α, β, γ > 0 are parameters; σ : C → C is a non-linear sigmoid; w(ω, ν ω , ν ) is a weight modelling the interaction between (ω, ν) and (ω , ν ) via the kernel of the Kolmogorov operator (i.e., taking into account the contact structure of A1); and δ > 0 is a delay.The presence of this delay term models the fact that the time-scale of the input signal and of the neuronal activation are comparable.Wilson-Cowan equations with delay have been applied, e.g., to feedback stabilisation of deep-brain stimulation, cf.[12].
The proposed algorithm to treat a sound signal s : [0, T ] → R, is the following: A. Preprocessing: In Section 2, we present the lift to the contact space and the associated diffusion operator.In Section 3 we apply Wilson-Cowan equations to obtain our model for sound reconstruction.In Section 4, we describe the numerical implementation of the algorithm, together with some of its crucial properties.
This implementation is then tested in Section 5, were we show the results of the algorithm on some simple synthetic signals.Such numerical examples can be listened at www.github.com/dprn/WCA1, and should be considered as a very preliminary step toward the construction of an efficient cortical-inspired algorithm for sound reconstruction.
Finally, in Appendix B, we show how the proposed algorithm preserves the natural symmetries of sound signals.

The lift to the contact space
It is well-known that the coclhea decomposes the input sound s : [0, T ] → R in its time-frequency representation S : [0, T ] × R → C, obtained via a short-time Fourier tranform (STFT).This corresponds to interpret the "instantaneous sound" at time τ ∈ [0, T ], instead than as a sound level s(τ ) ∈ R, as a function ω → S(τ, ω) which encodes the instantaneous presence of each given frequency, with phase information.
In this section, we present an extension of this procedure, which is at the core of the proposed algorithm.Roughly speaking, the instantaneous sound will be represented as a function (ω, ν) → I(τ, ω, ν), encoding the presence of both the frequency and the chirpiness ν = dω/dτ .
Assume for the moment that the sound is a single (positive) time varying frequency, e.g., If the frequency is varying slowly enough and the window of the STFT is large enough, its sound image (up to the choice of normalisations constants in the Fourier transform) coincides roughly with where δ is the Dirac delta distribution centered at 0. That is, S is concentrated on the two curves τ → (τ, ω(τ )) and τ → (τ, −ω(τ )), see Figure 3. Let us focus only on the first curve.
Because of the sensitivity to directions (due to the presence of S-cells and pinwheels, cf.Section 1), the curve ω(τ ) is lifted in a bigger space by adding a new variable ν = dω/dτ .In mathematical terms the 3-dimensional space (τ, ω, ν) is called the contact space.It will be the basis for the geometric model of A1 that we are going to present.
Up to now the curve ω(τ ) was parameterized by one of the coordinates of the contact space (the variable τ ), but it will be more convenient to introduce a new parameter that we call t.The original curve ω(τ ) is then represented in the space (τ, ω) as t → (t, ω(t)) (thus imposing τ = t).
Similarly, the lifted curve is parameterized as t → (t, ω(t), ν(t)).To every regular enough curve t → (t, ω(t)), one can associate a lift t → (t, ω(t), ν(t)) in the contact space simply by computing ν(t) = dω/dt.Vice-versa, a regular enough curve in the contact space t → (τ (t), ω(t), ν(t)) is a lift of planar curve t → (t, ω(t)) if τ (t) = t and if ν(t) = dω/dt.Now, defining u(t) = dν/dt we can say that a curve in the contact space t → (τ (t), ω(t), ν(t)) is a lift of a planar curve if there exists a function u(t) such that: Letting q = (τ, ω, ν), equation ( 2) can be equivalently written as the control system where the X 0 and X 1 are the two vector fields in R 3 Notice that defining X 2 = (0, 1, 0) , we have the commutation relations (here Since these are precisely the commutation relations appearing in quantum mechanics, one says that {X 0 , X 1 , X 2 } generates the Heisenberg algebra.The corresponding Lie group structure on R 3 is called the Heisenberg group.Following [8], when s is a general sound signal, we lift each level line of |S|.By the implicit function theorem, this yields the following subset of the contact space: If |S| ∈ C 2 and Hess |S| is non-degenerate, the set Σ is indeed a surface.Finally, the external input from the cochlea is given by I(τ, ω, ν) = S(τ, ω)δ Σ (τ, ω, ν). ( Here, δ Σ denotes the Dirac delta distribution concentrated at Σ.The presence of this distributional term is necessary for having a well-defined solution to the evolution equation (WC) introduced in the next section.

The tonotopical model of A1
On the basis of what described in the previous section and the well-known tonotopical organization of A1 (cf.Section 1), we propose to consider A1 to be the space of (ω, ν) ∈ R 2 .The external input at time t > 0 corresponding to a sound s(•) to A1 is then given as the slice at τ = t of the lift I of s to the contact space.That is, hearing an "instantaneous sound level" s(t) reflects in the external input I(t, ω, ν) to the "neuron" (ω, ν) in A1 as follows: The "neuron" receives an external charge S(t, ω) if (t, ω, ν) ∈ Σ, and no charge otherwise.Interpreting A1 as a slice of the contact space allows to deduce a natural structure for neuron connections as follows.Going back to a sound composed by a single time-varying frequency t → ω(t), we have that its lift is concentrated on the curve t → (ω(t), ν(t)), such that where Y 0 (ω, ν) = (ν, 0) , Y 1 (ω, ν) = (0, 1) , and u : [0, T ] → R. It is well-known that the PDE naturally associated with the above control system is the following Kolmogorov equation for some parameter b > 0. In this formula, the vector fields Y 0 and Y 1 are interpreted as first-order differential operators.The operator (∂ t − L) is well-known to be hypoelliptic, meaning that if f is a distribution defined on an open set Ω such that (∂ t − L)f ∈ C ∞ (Ω), then f ∈ C ∞ (Ω).As a consequence, this operator admits a smooth kernel k τ , that we will consider to model neuronal connections.In the following result, proved in Appendix A, we recall the explicit expression of k τ .
Proposition 1.The integral kernel of equation ( 4) is where, In the next section we describe how the redundant representation I is actually processed.

The reconstruction model
The procedure presented here is strongly inspired by the successful results obtained by Wilson-Cowan equations in the primary visual cortex [41,11].According to this framework, the resulting signal a : [0, T ] × R × R → C is the solution of the following equation with delay τ > 0 (which can be seen as encoding the dendritic delays of neuronal connections): with initial condition a(t, •, •) ≡ 0 for t ≤ 0. Here, α, β, γ > 0 are parameters, w is an interaction kernel, and σ : C → C is a (non-linear) saturation function, or sigmoid.In the following, we let σ(ρe iθ ) = σ(ρ)e iθ where σ(x) = min{1, max{0, κx}}, x ∈ R, for some fixed κ > 0. When γ = 0, the above reduces to the standard low-pass filter ∂ t a = −αa + I, whose solution is the convolution of the input signal I with the function ϕ(t) given by ϕ(t) = e −tα for t ∈ (0, +∞) and ϕ(t) = 0 elsewhere.Setting γ = 0 adds a non-linear interaction term on top of this exponential smoothing.
In neuroscience applications, this encodes the inhibitory and excitatory interconnections between neurons.As described in Section 2.1, the interaction weight is chosen as w = k τ , the integral kernel of (4) at time τ = δ.This integral kernel is well known, and its explicit expression can be found, e.g., in [3].See also Section 4.

Numerical implementation
For the numerical experiments, we chose to implement the proposed algorithm in Julia [6].As already presented, this process consists in a preprocessing phase, in which we build an input function I on the 3D contact space, a main part, where I is used as the input of the Wilson-Cowan equation (WC), and a post-processing phase, where the reconstructed sound is recovered from the result of the first part.

Pre-processing
The input sound s is lifted to a time-frequency representation S via a classical implementation of STFT, i.e., by performing FFTs of a windowed discretised input.In the proposed implementation we chose to use a standard Hann window (see, e.g., [35]): The resulting time-frequency signal is then lifted to the contact space through an approximate computation of the gradient ∇|S| and the following discretisation of (3): Discretisation issues While the discretisation of the time and frequency domains is a well understood problem, dealing with the additional chirpiness variable requires some care.We will henceforth assume that the significant frequencies of the input sound s belong to a bounded interval Λ ⊂ R. Nevertheless, the set of chirpinesses ν's such that ν∂ ω |S|(τ, ω) = −∂ τ |S|(τ, ω) for some τ ∈ [0, T ] and ω ∈ Λ will cover in general the whole real line.This is due to the presence of points where the contour lines of |S| become vertical.
In the numerical implementation we chose to restrict the admissible chirpinesses to a bounded interval N ⊂ R obtained as follows.Let : R → [0, +∞) be the chirpiness density function, which associates to ν ∈ R the measure (ν) of the set of those (τ, ω) Here p ∈ [0, 1] is some tolerance, that in our numerical tests has been taken as p = 0.95.

Processing
Equation (WC) can be solved via a standard forward Euler method.Hence, the delicate part of the numerical implementation is the computation of the interaction term.
As it is clear from the explicit expression given in Proposition 1, k τ is not a convolution kernel in the sense that k τ (ω, ν ω , ν ) cannot be expressed as a function of (ω − ω , ν − ν ).As a consequence, a priori we need to explicitly compute all values k τ (ω, ν ω , ν ) for (ω, ν) and (ω , ν ) in the considered domain.As it is customary, in order to reduce computation times, we fix a threshold ε > 0 and for any given (ω, ν) we compute only values for (ω , ν ) in the compact set The structure of K ε (ω, ν) is given in the following, whose proof we defer to Appendix A. Proposition 2. For any ε > 0 and (ω, ν) ∈ R 2 , we have that Indeed, for any (ω, ν) ∈ R 2 , the r.h.s.above corresponds to max k τ (ω, ν •, •), and thus K ε (ω, ν) = ∅ for larger values of ε.
The above allows to numerically implement k τ as a family of sparse arrays.That is, let G ⊂ Λ × N be the chosen discretisation of the significant set of frequencies and chirpinesses.Then, to ξ = (ω, ν) ∈ G we associate the array M ξ : G → R defined by Therefore, up to choosing the tolerance ε 1 sufficiently small, the interaction term in (WC), evaluated at ξ = (ω, ν) ∈ G, can be efficiently estimated by

Post-processing
Both operations in the pre-processing phase are inversible: the STFT by inverse STFT, and the lift by integration along the ν variable (that is, summation of the discretized solution).The final output signal is thus obtained by applying the inverse of the pre-processing (integration then inverse STFT) to the solution a of (WC).That is, the resulting signal is given by The following guarantees that ŝ is real-valued and thus correctly represents a sound signal.From the numerical point of view, this implies that we can focus on solutions of (WC) in the half-space {ω ≥ 0}, which can then be extended to the whole space by mirror symmetry.Proposition 3. It holds that ŝ(t) ∈ R for all t > 0.
To prove the statement it is then enough to show that This is trivially satisfied for t ≤ 0, since in this case a(t, •, •) ≡ 0. We now claim that if the above holds on [0, T ] it holds on [0, T + τ ], which will prove (6).By definition of I and the fact that S(t, −ω) = S(t, ω), we immediately have I ≡ I .On the other hand, the explicit expression of k τ in (5) yields that Then, for all t ≤ T + τ , we have A simple argument, e.g., using the variation of constants method, shows that these two facts imply the claim, and thus the statement.

Experiments
In Figures 4 and 5 we present a simple synthetic experiment, where the input sound is assumed to consist of two distinct frequencies depending linearly on time.One observes that the processed sound presents the same features, but with a longer duration.This is especially true for the higher-frequency component, since this is taken to be dominant in the starting signal.

Conclusion
In this work we presented a sound reconstruction framework inspired by the analogies between visual and auditory cortices.Building upon the successful cortical inspired image reconstruction algorithms, the proposed framework lifts time-frequencies representations of signals to the 3D contact space, by adding instantaneous chirpiness information.These redundant representations are then processed via adapted diffeo-integral Wilson-Cowan equations.
The promising results obtained on simple synthetic sounds, although preliminary, suggest possible applications of this framework to the problem of degraded speech.This should be done via psycholinguistic experiments, testing the reconstruction ability of normal-hearing humans on originally degraded speech material compared to the same material after algorithm reconstruction.Such an endeavour will contribute to further the understanding of the auditory mechanisms emerging in effortful listening conditions and help to refine our knowledge on current theories and models of human speech perception as well as on general organization principles underlying the functioning of the human cortex.for θ, λ ∈ R. One easily checks that T θ and M λ are unitary operators on L 2 (R).By conjugation with the short-time Fourier transform, they naturally define the unitary operators on L 2 (R 2 ) given by T θ S(τ, ω) = e −2πiωθ S(τ − θ, ω), M λ S(τ, ω) = S(τ, ω − λ).
The relevance of the Heisenberg group in time-frequency analysis is a direct consequence of the commutation relation T θ M λ T −1 θ M −1 λ = e −2πiλθ Id .Indeed, this shows that the operator algebra generated by (T θ ) θ∈R and (M λ ) λ∈R coincides with the Heisenberg group H 1 via the representation U : H 1 → U(L 2 (R 2 )) defined by Since L 2 (R 3 ) = L 2 (R 2 ) ⊗ L 2 (R), the above representation trivially yields the representation U ⊗ Id on U(L 2 (R 3 )) that we still denote, abusing notation, by U .Recall now that the lift operation defined in Section 2 associates with S ∈ L 2 (R 2 ) a distribution of the form I(τ, ω, ν) = S(ω, ν)δ Σ (τ, ω, ν) for some Σ ⊂ R 3 .Due to the fact that Σ is defined via the modulus of S it ignores the phase factors appearing in the representation U .Thus, U naturally extends to this class of inputs as One observes that the above coincides with the result of the lift operation applied to U (θ, λ, ζ)S.That is, the lift operation commutes with the H 1 symmetry underlying time-frequency analysis.
The same goes for the evolution (WC).The commutation is trivial for the linear terms.On the other hand, the non-linearity introduced in the integral term commutes with U thanks to the fact that σ(ρe iφ ) = e iφ σ(ρ) for all ρ > 0, φ ∈ R.

Figure 4 :
Figure 4: Experiment on a synthetic sound with two linearly varying frequencies.Left: The STFT of the original sound.Right: The STFT of the processed sound with parameters α = 16, β = 1, γ = 18, τ = 0.0625s, and b = 4.In both cases only the positive frequencies are shown: the others are recovered via the Hermitian symmetry of the Fourier transform on real signals.

Figure 5 :
Figure 5: Sound waves corresponding to the STFT of Figure 4.The discontinuity at the end of the resulting sound (right) is due to the cut-off of the window function used for the STFT at the boundary.