Greedy low-rank algorithm for spatial connectome regression

Recovering brain connectivity from tract tracing data is an important computational problem in the neurosciences. Mesoscopic connectome reconstruction was previously formulated as a structured matrix regression problem (Harris et al. in Neural Information Processing Systems, 2016), but existing techniques do not scale to the whole-brain setting. The corresponding matrix equation is challenging to solve due to large scale, ill-conditioning, and a general form that lacks a convergent splitting. We propose a greedy low-rank algorithm for the connectome reconstruction problem in very high dimensions. The algorithm approximates the solution by a sequence of rank-one updates which exploit the sparse and positive definite problem structure. This algorithm was described previously (Kressner and Sirković in Numer Lin Alg Appl 22(3):564–583, 2015) but never implemented for this connectome problem, leading to a number of challenges. We have had to design judicious stopping criteria and employ efficient solvers for the three main sub-problems of the algorithm, including an efficient GPU implementation that alleviates the main bottleneck for large datasets. The performance of the method is evaluated on three examples: an artificial “toy” dataset and two whole-cortex instances using data from the Allen Mouse Brain Connectivity Atlas. We find that the method is significantly faster than previous methods and that moderate ranks offer a good approximation. This speedup allows for the estimation of increasingly large-scale connectomes across taxa as these data become available from tracing experiments. The data and code are available online.


Introduction
Neuroscience and machine learning are now enjoying a shared moment of intense interest and exciting progress.Many computational neuroscientists find themselves inspired by unprecedented datasets to develop innovative methods of analysis.Exciting examples of such next-generation experimental methodology and datasets are large-scale recordings and precise manipulations of brain activity, genetic atlases, and neuronal network tracing efforts.Thus, techniques which summarize many experiments into an estimate of the overall brain network are increasingly important.Many believe that uncovering such network structures will help us unlock the principles underlying neural computation and brain disorders [17].Initial versions of such connectomes [26] are already being integrated into large-scale modeling projects [41].We present a method which allows us to perform these reconstructions faster, for larger datasets.
Structural connectivity refers to the synaptic connections formed between axons (outputs) and dendrites (inputs) of neurons, which allow them to communicate chemically and electrically.We represent such networks as a weighted, directed graph encoded by a nonnegative adjacency matrix W .The network of whole-brain connections or connectome is currently studied at a number of scales [46,25]: Microscopic connectivity catalogues individual neuron connections but currently is restricted to small volumes due to difficult tracing of convoluted geometries [24].Macroscopic connectivity refers to connections between larger brain regions and is currently known for a number of model organisms [7].Mesoscopic connectivity [33] lies between these two extremes and captures projection patterns of groups of hundreds to thousands of neurons among the 10 6 -10 10 neurons in a typical mammalian brain.
Building on previous work [20,26], we present a scalable method to infer spatiallyresolved mesoscopic connectome from tracing data.We apply our method to data from the Allen Mouse Brain Connectivity Atlas [35] to reconstruct mouse cortical connectivity.This resource is one of the most comprehensive publicly available datasets, but similar data are being collected for fly [23], rat [6], and marmoset [31], among others.Our focus is presenting and profiling an improved algorithm for connectome inference.By developing scalable methods as in this work, we hope to enable the reconstruction of high-resolution connectomes in these diverse organisms.

Mathematical formulation of a spatial connectome regression problem
We focus on the mesoscale because it is naturally captured by viral tracing experiments (Figure 1).In these experiments, a virus is injected into a specific location in the brain, where it loads the cells with proteins that can then be imaged, tracing out the projections of those neurons with cell bodies located in the injection site.The source and target signals, within and outside of the injection sites, are measured as the fraction of fluorescing pixels within cubic voxels.These form the data matrices X ∈ R n X ×n inj and Y ∈ R n Y ×n inj , where parameters n X and n Y are the number of locations in the discretized source and target regions of the d-D brain, and n inj is the number of injections.In general, n X and n Y may be unequal, e.g. if injections were only delivered to the right hemisphere of the brain.Each experiment only traces out the projections from that particular injection site.By performing many such experiments, with multiple mice, and varying the injection sites to cover the brain, one can then "stitch" together a mesoscopic connectome for the average mouse.We refer the interested reader to [35] for more details of the experimental procedures.
We present a new low-rank approach to solving the smoothness-regularized optimization problem posed by [20].Specifically, they considered solving the regularized least squares Figure 1: In this paper, we present an improved method for the mesoscale connectome inference problem.(A) The goal is to find a voxel-by-voxel matrix W so that the pattern of neural projections y arising from an injection x is reproduced by matrix-vector multiplication, y ≈ W x. The vectors x and y contain the fraction of fluorescing pixels in each voxel from viral tracing experiments.(B) An example of the data, in this case a coronal slice from a tracing experiment delivered to primary motor cortex (MOp).Bright green areas are neural cells expressing the green fluorescent protein.(C) The raw data are preprocessed to separate the injection site (red/orange) from its projections (green).Fluorescence values in the injection site enter into the source vector x, whereas fluorescence everywhere else is stored in the entries of the target vector y.The x and y are discretized volume images of the brain reshaped into vector form.Entry W ij models the expected fluorescence at location i from one unit of source fluorescence at location j, a linear operator mapping from source images to target images.Image credit (B and C): Allen Institute for Brain Science. problem where the minimum is taken over nonnegative matrices.The operator P Ω defines an entrywise product (Hadamard product) P Ω (M ) = M • Ω, for any matrix M ∈ R n Y ×n inj , and Ω is a binary matrix, of the same size, which masks out the injection sites where the entries of Y are unknown. 1We take the smoothing matrices L y ∈ R n Y ×n Y and L x ∈ R n X ×n X to be discrete Laplace operator, i.e. the graph Laplacians of the voxel face adjacency graphs for discretized source and target regions.We choose a regularization parameter λ and set λ = λn inj n X to avoid dependence on n X , n Y and n inj , since the loss term is a sum over n Y × n inj entries and the regularization sums over n Y × n X many entries.
We now comment on the typical parameters for problem (1).The mouse brain gridded at 100 µm resolution contains approximately n X , n Y ∈ O(10 5 ) voxels in 3-D.On the other hand, the number of experiments n inj is less than O(103 ).By projecting the 3-D cortical data into 2-D, as we do in this paper, we can reduce the size by an order of magnitude to n X , n Y ∈ O(10 4 ), but focusing on the cortex reduces n inj to O(102 ).Since n inj n X , n Y , least squares estimation of W (i.e.λ = 0) is highly underdetermined and will remain underdetermined unless orders of magnitude more tracing experiments are performed.Regularization is thus essential for filling the gaps in injection coverage.Furthermore, the vast size of the n Y × n X matrix W for whole-brain connectivities has motivated our search for scalable and fast low-rank methods.

Previous methods of mesoscale connectome regression
Much of the work on mesoscale mouse connectomes leverages the data and processing pipelines of the Allen Mouse Brain Connectivity Atlas available at http://connectivity. brain-map.org[29,35].In the early examples of such work, [35] used viral tracing data to construct regional connectivity matrices.Nonnegative matrix regression was used to estimate the regional connectivity.First, the injection data were processed into a pair of matrices X Reg and Y Reg containing the regionalized injection volumes and projection volumes, respectively.The rows of these matrices are the regions and the columns index injection experiments.[35] then used nonnegative least squares to fit a region-by-region matrix W Reg such that Y Reg ≈ W Reg X Reg .Due to numerical ill-conditioning and a lack of data, some regions were excluded from the analysis.Similar techniques have been used to estimate regional connectomes in other animals.[49] took a different approach, using a likelihood-based Markov chain Monte Carlo method to infer regional connectivity and weight uncertainty from the Allen data.
[20] made a conceptual and methodological leap when they presented a method to use such data for spatially-explicit mesoscopic connectivity.The Allen Mouse Brain Atlas is essentially a coordinate mapping which discretizes the average mouse brain into cubic voxels, where each voxel is assigned to a given region in a hierarchy of brain regions.They used an assumption of spatial smoothness to formulate (1), where the specific smoothing term results in a high-dimensional thin plate spline fit [48].They then solved (1) using the generic quasi-Newton algorithm L-BFGS-B [8].This technique was applied to the mouse visual areas but limited to small datasets since W was dense.Using a simple low-rank version based on projected gradient descent, [20] argued that such a method could scale to larger brain areas.However, the initial low-rank implementation turned out to be too slow to converge for large-scale applications.Times to convergence were not reported in the original paper, but the full rank version typically took around a day, while the low-rank version needed multiple days to reach a minimum. 2  [26] simplified the mathematical problem by assuming that the injections were delivered to just a single voxel at the injection center of mass.Using a kernel smoother led to a method which is explicitly low-rank, with smoothing performed only in the injection space (columns of W ).This kernel method was applied to the whole mouse brain, yielding the first estimate of voxel-voxel whole-brain connectivity for this animal.However, these assumptions do not hold in reality: The injections affect a volume of the brain that encompasses much more than the center of mass. 3 We also expect that the connectivity is also smooth across projection space (rows of W ), since the incoming projections to a voxel are strongly correlated with those of nearby voxels.These inaccuracies mean that the kernel method is prone to artifacts, in particular ones arising from the injection site locations, since there is no ability for that method to translate the source of projections smoothly away from injection sites.It is thus imperative to develop an efficient method for the spline problem that works for large datasets.

Continuous formulation motivates the need for sophisticated solvers
We will now describe, for the first time, the continuous mathematical properties of this problem, in order to illuminate why it is challenging to solve.Equation ( 1) can be seen as a discrete version of an underlying continuous problem (similar to [42], among others) where we define the cost as The cost is minimized over W : T × S → R, the continuous connectome, in an appropriate Sobolev space (square-integrable derivatives up to fourth order on T × S is sufficiently regular).The function W may be seen as the kernel of an integral operator from S to T .These regions S and T are both compact subsets of R d representing source and target regions of the brain.The mask region Ω i ⊂ T is the subset of the brain excluding the injection site.Finally, the discrete Laplacian terms L have been replaced by the continuous Laplacian operator ∆ on S × T .The parameter λ again sets the level of smoothing. 4 For simplicity, consider S = T = the whole brain, Ω i = T for all i = 1, . . ., n inj and relax the constraint of nonnegativity on W. Taking the first variational derivative of (2) and setting it to zero yields the Euler-Lagrange equations for this simplified problem: where for convenience we have defined the data covariance functions f (x , x) = , analogous to XX T and Y X T .The operator ∆ 2 is the biharmonic operator or bi-Laplacian.Equation ( 3) is a fourth-order partial integro-differential equation in 2d dimensions.
Iterative solutions via gradient descent or quasi-Newton methods to biharmonic and similar equations can be slow to converge [1].It takes many iterations to propagate the highly local action of the biharmonic differential operator across global spatial scales due to the small stable step size [42], whereas the integral part is inherently nonlocal.Very slow convergence is what we have found when applying methods like gradient descent to 4 One may consider rescaling λ as before, but subtle differences arise.In the continuous versus discrete cases the units of the equations are different, since the functions X i (x) and Y i (y) are now viewed as densities.Furthermore, there is a mismatch in units between (1) and (2), because the graph Laplacian is unitless whereas the Laplace operator is not.This explains the lack of any dependence on the grid size in the scaling of the discrete problem.Regardless, choosing the exact scaling to make the continuous and discrete cases match is not necessary for the more qualitative argument we are making.
problem (1), also for low-rank versions.This includes quasi-Newton methods such as L-BFGS [8].When we attempted to solve the whole-cortex top view and flatmap problems as in Sections 3.2 and 3.3, the method had not converged (from a naive initialization) after a week of computation.These difficulties motivated the development of the method we present here.

Outline of the paper
We present a greedy, low-rank algorithm tailored to the connectome inference problem.To leverage powerful linear methods, we consider solutions to the unconstrained problem where all of the matrices and parameters are as in (1).In practice, solutions to the linear problem ( 4) are often very close to those of the nonnegative problem (1), since the data matrices X and Y and the "true" W are nonnegative.Setting any negative entries in the computed solution W * to zero is adequate, or can serve as an initial guess to an iterative solver for the slower nonnegative problem.Equation ( 4) is another regularized least-squares problem.In Section 2.1, we show that taking the gradient and setting it equal to zero leads to a linear matrix equation in the unknown W .This takes the form of a generalized Sylvester equation with coefficient matrices formed from the data and Laplacian terms.The data matrices are, in fact, of low-rank since n inj n X , n Y , and thus we can expect a low-rank approximation W ≈ U V to the full solution to perform well (see [20], although we do not know how to justify this rigorously).We provide a brief survey of some low-rank methods for linear matrix equations in Section 2.2.We employ a greedy solver that finds rank-one components u i v i one at a time, explained in Section 2.3.After a new component is found, it is orthogonalized and a Galerkin refinement step is applied.This leads to Algorithm 1, our complete method.
We then test the method on a few connectome fitting problems.First, in Section 3.1, we test on a fake "toy" connectome, where we know the truth.This is a test problem consisting of a 1-D brain with smooth connectivity [20].We find that the output of our algorithm converges to the true solution as the rank increases and as the stopping tolerance decreases.Next, we present two benchmarks using real viral tracing data from the isocortices of mice, provided by the Allen Institute for Brain Science.In each case, we work with 2-D data in order to limit the problem size and because the cortex is a relatively flat, 2-D shape.It has also been argued that such a projection also denoises such data [47,18].In Section 3.2, we work with data that are averaged directly over the superior-inferior axis to obtain a flattened cortex.We refer to this as the top view projection.In contrast, for Section 3.3, the data are flattened by averaging along curved streamlines of cortical depth.We call this the flatmap projection.
Finally, Section 4 discusses the limitations of our method and directions for future research.Our data and code are described in Section 5 and freely available for anyone who would like to reproduce the results.
2 Greedy low-rank method

Linear matrix equation for the unknown connectivity
We now derive the equivalent of the "normal equations" for our problem.Denote the objective function (4) as J(W ), with decomposition Writing J loss indexwise, we obtain (note that Ω • Ω = Ω) .
The derivative reads or in the vector form where X α is the α-th column of X and likewise for Ω.Setting the derivative equal to zeros leads to the system of normal equations where vec(W ) is the vector of all columns of W stacked on top of each other.This linear system features the following (n Y n X ) × (n Y n X ) matrix, consisting of n inj + 3 Kronecker products, Note that without the observation mask, Ω is a matrix of all ones, and the first term compresses to XX ⊗ I n Y .The linear system (5) can be recast as the linear matrix equation with the operator A(W ) := λB(W ) + C(W ), where The smoothing term B can be expressed as a squared standard Sylvester operator B(W ) = L(L(W )), where L(W ) := L y W + W L x .The operator L is the graph Laplacian operator on the discretization of T × S. Furthermore, the right hand side D is a matrix of rank n inj , since it is an outer product of two rank n inj matrices.

Numerical low-rank methods for linear matrix equations
Because of the potentially high dimensions n X , n Y , directly solving the algebraic matrix equation ( 7) is numerically inefficient since the solution will be a dense n Y × n X matrix, making even storing it infeasible.However, the rank of the right hand side of ( 7) is at most n inj n X , n Y .It is often observed and theoretically shown [16,3,22] that the solutions of large matrix equations with low-rank right hand sides exhibit rapidly decaying singular values.Hence, the solution W is expected to have small numerical rank in the sense that few of its singular values are larger than machine precision or the experimental noise floor.Intuitively, since we also seek very smooth solutions, this also helps control the rank, since high frequency components tend to be associated with small singular values.This motivates us to approximate the solution of ( 7) by a low-rank approximation The low-rank factors are then typically computed by iterative methods which never form the approximation U V explicitly.
Several low-rank methods for computing U, V have been proposed, starting from methods for standard Sylvester equations AX + XB = D, e.g., [2,4,5,44] and more recently for general linear matrix equations like (7) [13,3,43,14,22,40].However, these methods are specialized and require the problem to have particular structures or properties (e.g., B, C have to form a convergent splitting of A), which are not present in the problem at hand.The main structures present in (7) are positive definiteness and sparsity of L x , L y .
An approach that is applicable to the matrix equation ( 7) is a greedy method as proposed by [27], which is based on successive rank-1 approximations of the error.Because this method is quite general, we tailored specific components of the algorithm to our problem.Three main challenges were overcome: First, we choose a simpler stopping criterion for the ALS routine.Second, specific solvers were chosen for the three main sub-problems of the algorithm, which maximizes its efficiency.Third, we developed a GPU implementation of the Galerkin refinement, to make this bottleneck step more efficient.We advocate for this method in the rest of the paper.

Description and application of the greedy low-rank solver
Here we briefly review the algorithm from [27] and explain how it is specialized for our particular problem.Assume there is already an approximate solution W j ≈ W * of the linear matrix equation A(W ) = D, equation (7), with solution W * .We will improve our solution by an update of rank one: The update vectors u j+1 , v j+1 are computed by minimizing an error functional that we will soon define.Since the operator A is positive definite, it induces the A-inner product X, Y A = Tr(Y A(X)) and the A-norm Y A := Y, Y A .So we find u j+1 , v j+1 by minimizing the squared error in the A-norm: Discarding constant terms, noting that X, Y A = Y, X A , and setting Notice that the rank-1 decomposition uv is not unique, because we can rescale the factors by any nonzero scalar c such that (uc)(v/c) represents the same matrix.This reflects the fact that the optimization problem ( 8) is not convex.However, it is convex in each of the factors u and v separately.We obtain the updates u j+1 , v j+1 via an alternating linear (ALS) scheme [36].Although we only consider low-rank approximations of matrices here, ALS methods are also used for computing low-rank approximations of higher order tensors by means of polyadic decompositions, e.g., [21,45].First, a fixed v is used in (8) and a minimizing u is computed which is in the next stage kept fixed and ( 8) is solved for a minimizing v.For a fixed vector v with v = 1 the minimizing problem is and, hence, û is obtained by solving the linear system of equations The second half iteration starts from the fixed u = û/ û and tries to find a minimizing v by solving which can be derived by similar steps.The linear systems (9a) and (9b) inherit the sparsity from L x , L y and Ω. Therefore they can be solved by sparse direct or iterative methods.We use a sparse direct solver for (9a), as this was faster than the alternatives.The coefficient matrix B in (9b) is the sum of a sparse (Laplacian terms) and a low-rank (rank n inj data terms) matrices.In this case, we solve (9b) using the Sherman-Morrison-Woodbury formula [15] and a direct solver for the sparse inversion.Both half steps form the ALS iteration which should be stopped when the iterates are close enough to a critical point, which might be difficult to check.Here we propose a simpler approach compared to the one in [27].Since we rescale u and v such that u 2 = v 2 = 1, the norm of the other factor is equal to the norm of the full matrix.In other words, û 2 = ûv 2 after solving for û, and hence û 2 should converge to the norm of the exact solution.This motivates a simple criterion: we stop the ALS when (1 where û and v are taken from two consecutive ALS steps, and δ < 1 is a small threshold.It turns out that a relatively crude tolerance of δ = 0.1, corresponding to 2-4 ALS iterations, is sufficient in practice for the overall convergence of the algorithm.
The second stage of the method is a non-greedy Galerkin refinement of the low-rank factors.Suppose a rank j approximation W j = j i=1 u i v i of W has been already computed.Let U ∈ R n Y ×j and V ∈ R n X ×j have orthonormal columns, spanning the spaces span{u 1 , . . ., u j } and span{v 1 , . . ., v j }, respectively.We compute a refined approximation U ZV for Z ∈ R j×j by imposing the following condition onto the residual: Algorithm 1: Greedy rank-1 method with Galerkin projection for ( 7) Pick initial vector v for ALS with v = 1, then get rank-1 update: Solve Eqn.(10) for Z j (CG to tolerance τ /2) Galerkin update 11 This leads to the dense, square matrix equation in Equation ( 10) is a projected version of (7) and inherits its structure including the positive definiteness of the operator which acts on Z. Instead of using a direct method to solve (10) as in [27], we employ an iterative method similar to [40].Due to the positive definiteness, the obvious method of choice is a dense, matrix-valued conjugate gradient method (CG).Moreover, we reduce the number of iterations significantly by taking the solution Z from the previous greedy step as an initial guess.The improved solution W j+1 = U ZV yields a new residual R j+1 = D − A(W j+1 ) onto which the ALS scheme is applied to obtain the next rank-1 updates.The complete procedure is illustrated in Algorithm 1.
This Galerkin refinement substantially improves the greedy approximation, leading to a faster convergence rate [27].The ALS stage is primarily used to sketch the projection bases for the Galerkin solution, which justifies the limited number of ALS steps.Use of the Galerkin refinement in low-rank decomposition literature can be traced back to the greedy approximation in the CP tensor format [34], as well as orthogonal matching pursuit approaches in sparse recovery and compressed sensing [38] and deflation strategies in lowrank matrix completion [19].

Performance of the greedy low-rank solver on three problems
There are three test problems to which we apply Algorithm 1: a toy problem with synthetic data (Section 3.1), the top view projected mouse connectivity data (Section 3.2), and the  flatmap projected data (Section 3.3).These tests show that the method easily scales to whole-brain connectome reconstruction.We investigate the computational complexity and convergence of the greedy algorithm.Since the matrices in (9a) are sparse, the ALS steps need O(nr 2 n inj ) operations in total for the final solution rank r, where n = max(n X , n Y ).In turn, if the solution of (10) takes γ CG iterations, this step will have a cost of O(γr 3 n inj ).Although γ can be kept at the same level for all j, it depends on the stopping tolerance τ , as does the rank r.We will therefore investigate the cost in terms of the total computation time and the corresponding solution accuracy for a range of solution rank values.
The numerical experiments were performed on an Intel ® E5-2650 v2 CPU with 8 threads and 64Gb RAM.We employ an Nvidia ® P100 GPU card for some subtasks: The Galerkin update relies on dense linear algebra to solve (10) by the CG method, so this stage admits an efficient GPU implementation.Algorithm 1 is implemented in MATLAB ® R2017b, and was run on the Balena High Performance Computing Service at the University of Bath.See Section 5 for additional data and code resources.We measure errors in the solution using the root mean squared error.Given any reference solution W of size n Y ×n X , e.g. the truth or a large-rank solution when the truth is unknown, and a low-rank solution W r , the RMS error is computed as We also report the relative error in the Frobenius norm E rel (W r , W ) =

Test problem: a toy brain
We use the same test problem as in [20], a one-dimensional "toy brain."The source and target space are S = T = [0, 1].The true connectivity kernel corresponds to a Gaussian profile about the diagonal plus an off-diagonal bump: The input and output spaces were discretized using n X = n Y = 200 uniformly lattice points.Injections are delivered at n inj = 5 locations in S, with a width of 0.12 + 0.1 , where ∼ Uniform(0, 1).The values of X are set to 1 within the injection region and 0 elsewhere, Ω ij = 1 − X ij , Y is set to 0 within the injection region, and we add Gaussian noise with standard deviation σ = 0.1.The matrices L x = L y are the 3-point graph Laplacians for the 1-d chain.
We depict the true toy connectivity W true as well as a number of low-rank solutions output by our method in Figure 2.Both the mask and regularization are required for good performance: If we remove the mask, setting Ω equal to the matrix of all ones, then there are holes in the data at the location of the injections.If we try fitting with λ = 0, i.e. no smoothing, then the method cannot fill in holes or extrapolate outside the injection sites.It is only with the combination of all ingredients that we recover the true connectivity.
In Table 1 we show the performance of the algorithm for ranks r = 10, 20, 40, 60, and 80.The output W is compared to W true as well as the rank 140 output of the algorithm.The stopping tolerance was τ = 10 −7 to ensure that the algorithm has reached this maximal rank.We see that the RMS distance to the reference solution W 140 decreases as we increase the rank, as does the relative distance.However, the RMS and relative distances from W true asymptote to roughly 0.07 and 10%, respectively, by rank 40.This shows that rank 40 is a suitable maximum rank for this problem given the input data and noise.
The computing time of the greedy method (in this example we use the CPU only version) remains in the order of seconds even for the largest considered ranks.In contrast, the unpreconditioned CG method needs thousands of iterations (and hundreds of second of time) to compute a solution within the same order of accuracy.Since it is unclear how to develop a preconditioner for Eq. ( 5), especially for a non-trivial Ω, in the next sections we focus only on the greedy algorithm.

Mouse cortex: top view connectivity
We next benchmark Algorithm 1 on mouse cortical data projected into a top-down view.See Section 5 for details about how we obtained these data.Here, the problem sizes are We run the low-rank solver with the target solution rank varying from r = 125 to 1000.The stopping tolerances τ were decreased geometrically from 10 −3 for r = 125 to 10 −6 for r = 1000.This delivers accurate but cheap solution to the Galerkin system (10) while ensuring that the algorithm reached the target rank.
These low-rank solutions are compared to the full-rank solution W full with r = n X = 22 377 found by L-BFGS [8], similar to [20], which used L-BFGS-B to deal with the nonnegativity constraint.Note that this full rank algorithm was initialized from the output of the low-rank algorithm.This led to a significant speedup: The full rank method, initialized naively, had not reached a similar value of the cost function (4) after a week of computation, but this "warm start" allowed it to finish within hours.
The computing times and errors are presented in Figure 3.We see that the RMS errors are relatively small for ranks above 500, below 10 −6 .Neither the RMS or relative error seem to have plateaued at rank 1000, but they are small.At rank 1000, the vector ∞ error (maximum absolute deviation of the matrices as vectors, not the matrix ∞-norm) W 1000 − W full ∞ is less than 10 −6 , which is certainly within experimental uncertainty.In Figure 5, the value of the cost function J(W r ) is plotted against the rank r of the approximation W r for the top view and flatmap data.Apparently, around r = 500 the cost function begins to stagnate indicating that the approximation quality does not significantly improve any more.Hence, we continue the investigation with the numerical rank set to r = 500.
We analyze the leading singular vectors of the solution.The output of the algorithm is W r = U ZV , which is not the SVD of W r because Z is not diagonal.We perform a final SVD of the Galerkin matrix, Z = Ũ Σ Ṽ and set Û = U Ũ and V = V Ṽ , so that W r = Û Σ V is the SVD of the solution.
The first four of these singular vectors are shown in Figure 6.The brain is oriented with the medial-lateral axis aligned left-right and anterior-posterior axis aligned top-bottom, as in a transverse slice.The midline of the cortex is in the center of the target plots, whereas it is on the left edge of the source plots.We observe that the leading component is a strong projection from medial areas of the cortex near the midline to nearby locations.The second component provides a correction which adds local connectivity among posterior areas  and anterior areas.Note that increased anterior connectivity arises from negative entries in both source and target vectors.The sign change along the roughly anterior-posterior axis manifests as a reduction in connectivity from anterior to posterior regions as well as from posterior to anterior regions.The third component is a strong local connectivity among somatomotor areas located medially along the anterior-posterior axis and stronger on the lateral side where the barrel fields, important sensory areas for whisking, are located.Finally, the fourth component is concentrated in posterior locations, mostly corresponding to the visual areas, as well as more anterior and medial locations in the retrosplenial cortex (thought to be a memory and association area).The visual and retrosplenial parts of the component show opposite signs, reflecting stronger local connectivity within these regions than distal connectivity between them.
These patterns in Figure 6 are reasonable, since connectivity in the brain is dominantly local with some specific long-range projections.We also observe that the projection patterns  (left components Û Σ) are fairly symmetric across the midline.This is also expected due to the mirroring of major brain areas in both hemispheres, despite the evidence for some lateralization, especially in humans.The more specific projections between brain regions will show up in later, higher frequency components.However, it becomes increasingly difficult to interpret lower energy components as specific pathways, since these combine in complicated ways.

Mouse cortex: flatmap connectivity
Finally, we test the method on another problem which is a flatmap projection of the brain (see Section 5 for details).This projection more faithfully represents areas of the cortex which are missing from the top view since they curl underneath that vantage point.The flatmap is closer to the kind of transformation used by cartographers to flatten the globe, whereas the top view is like a satellite image taken far from the globe.
The problem size is now larger by roughly a factor of three relative to the top view.Here, n Y = 126 847 and n X = 63 435.The number of experiments is the same, n inj = 126, whereas the regularization parameter is set to λ = 3 × 10 7 .The smoothing parameter was set to give the same level of smoothness, measured "by eye," in the components as in the top view experiment.The tolerances τ were as in the top view case.
In this case, the computing time of the full solver would be excessively large, so we do not estimate the error by comparison to the full solution, instead taking the solution with r = 1000 as the reference solution W = W 1000 .The computing times and the errors are shown in Figure 4. Here, the benefits by using the GPU implementation for solving (10) were more significant than for the top view case.We obtained the rank 500 solution in approximately 1.5 hours, which is significantly less than pure CPU implementation, which took 6.4 hours.Comparing Figs. 3 and 4, the computation times for the flatmap problem with r = 500 and 1000 are roughly twice as large as for the top view problem.On the other hand, for r = 125 and 250, the compute times are about three times as long for flatmap versus top view.The observed scaling in compute time appears slightly slower than O(n) in these tests.The growth rate of the computing time on the GPU is better than that of the CPU version since the matrix multiplications, which dominate the CPU cost for large ranks, are calculated in nearly constant time, mainly due to communication overhead, on the GPU.The RMS error between rank 500 and 1000 is again less than 10 −6 , so we believe rank 500 is probably a very good approximation to the full solution.Figure 5 shows the costs versus the approximation rank.Again, we see that r = 500 is reasonable and the distance from W is smaller than 10%.
The four dominant singular vectors of the flatmap solution are shown in Figure 7, oriented as in Figure 6, with the anterior-posterior axis from top-bottom and the medial-lateral axis from left-right.The first two factors are directly comparable between the two problem outputs, although we see more structure in the flatmap components.This could be due to employing a projection which more accurately represents these 3-D data in 2-D, or due to the choice of smoothing parameter λ.The third and fourth components, on the other hand, are comparable to the fourth and third components in the top view problem, respectively.Again, these patterns are reasonable and expected.The raw 3-D data that were fed into the top view and flatmap projections were the same, but the greedy algorithm is run using different projected datasets.It is reassuring that we can interpret the first few factors and directly compare them against those in the top view.

Dropping the nonnegativity constraint does not strongly affect the solutions
In order to apply linear methods, we relaxed the nonnegativity constraint when formulating problem (4) (the unconstrained problem), as opposed to the original problem (1) (the problem with nonnegativity constraint).We now show that the resulting solutions are not significantly different between the two problems.This justifies the major simplification that we have made.In all of our experiments with the test problem (Section 3.1), the resulting matrices were nearly nonnegative.The solution W 40 has 48 out of 40 000 negative entries.These negative entries were all greater than -0.0023 in the lower-left corner of the matrix (see Figure 2), where the truth is approximately zero.
We were able to solve the top view problem with the nonnegative constraint using L-BFGS-B by initializing with W full projected onto the nonnegative orthant.Let W proj be the matrix with entries (W proj ) ij = max(0, (W full ) ij ), and let W nonneg denote the solution to the constrained problem obtained in this way.Comparing the nonnegative versus unconstrained solutions, we found that E(W full , W nonneg ) = 3.99e-04.Projecting W full onto the nonnegative orthant leads to E(W proj , W nonneg ) = 3.67e-04.In either case the ∞-norm difference is 0.009.These results show that the solution to the unconstrained problem is close to the solution of the constrained problem, and that the projection of the solution to the unconstrained problem is also close to the constrained solution.Algorithm 1 thus offers an efficient way to approximate the solution to the more difficult nonnegative problem, while retaining low rank.

Discussion
We have studied a numerical method specifically tailored for the important neuroscience problem of connectome regression from mesoscopic tract tracing experiments.This connectome inference problem was formulated as the regression problem (4).The optimality conditions for this problem turn out to be a linear matrix equation in the unknown connectivity W , which we propose to solve with Algorithm 1.Our numerical results show that the low-rank greedy algorithm, as proposed by [27], is a viable choice for acquiring low-rank factors of W with a computation cost that was significantly smaller compared to other approaches [20,3,28].This allows us to infer the flatmap matrix, with approximately 140× more entries than previously obtained for the visual system, while taking significantly less time: computing the flatmap solution took hours versus days for the smaller low-rank visual network [20].The first few singular vector components of these cortical connectivities are interpretable and reasonable from a neuroanatomy standpoint, although a full anatomical study of this inferred connectivity is outside the scope of the current paper.
The main ingredients of Algorithm 1 are solving the large, sparse linear systems of equations at each ALS iteration and solving the dense but small projected version of the original linear matrix equation for the Galerkin step.We had to carefully choose the solvers for each of these phases of the algorithm.The Galerkin step forms the principal bottleneck due to the absence of direct numerical methods to handle dense linear matrix equations of moderate size.We employed a matrix-valued CG iteration to approximately solve (10), implementing it on the GPU for speed.This lead to cubic complexity in r at this step.One could argue that equipping this CG iteration with a preconditioner could speed up its convergence, but so far we were not successful in finding a preconditioner that both reduced the number of CG steps and the computational time.A future research direction could be to derive an adequate preconditioning strategy for the problem structure in (7), that would increase the efficiency of any Krylov method.
Matrix-valued Krylov subspace methods [13,28,3,37] offer an alternative class of possible algorithms to solving the overall linear matrix equation (7).However, for rapid convergence of these methods we typically need a preconditioner.In our tests on (7), these approaches performed poorly, because rank truncations (e.g. via thin QR or SVD) are required after major subcalculations which occur at every iteration.Computing these decompositions quickly became expensive because of the sheer amount of necessary rank truncations in the Krylov method.If a suitable preconditioner for our problem would be found, it would make sense to give low-rank matrix-valued Krylov methods another try.
The original regression problem proposed by [20] (1) demands that the solution W be nonnegative.So far, this constraint is not considered by the employed algorithm.However, for the test problem and data we have tried, the computed matrix turns out to be majority nonnegative.We find typically small negative entries that can be safely neglected without sacrificing accuracy.Although a mostly nonnegative solution is not generally expected when solving the unconstrained problem (4), it appears that such behavior is typical for nonnegative data matrices X and Y .
Working directly with nonnegative factors U ≥ 0 and V ≥ 0 was originally proposed by [20], where they applied a projected gradient method to find such an approximation for connectome of mouse visual areas albeit very slowly.Such a formulation is preferred, since it leads to a nonnegative W , and it allows interpreting the leading factors as the most important neural pathways in the brain.Modifying Algorithm 1 to compute nonnegative low-rank factors or enforcing that the low-rank approximation U V ≈ W is nonnegative-a nonlinear constraint-is a much harder goal to achieve.For instance, even if one generated nonnegative factor matrices U and V , e.g. by changing the ALS step to nonnegative ALS, the orthogonalization and Galerkin update each destroy this nonnegativity.New methods of NMF which incorporate regularizations similar to our Laplacian terms [12,9] are an area of ongoing research, and the optimization techniques developed there could accelerate the nonnegative low-rank formulation of (1).These include other techniques developed with neuroscience in mind, such as neuron segmentation and calcium deconvolution [39] as well as sequence identification [30].The greedy method we have presented is an excellent way to initialize the nonnegative version of the problem, similar to how SVD is used to initialize NMF.We hope to improve upon nonnegative low-rank methods in the future.
Model (1) is certainly not the only approach to solving the connectome inference problem.The loss term P Ω (W X − Y ) 2  F is standard and arises from Gaussian noise assumptions combined with missing data and is standard loss in matrix completion problems with noisy observations, e.g., [32,10].The regularization term is a thin plate spline penalty [48].This is one of many possible choices for smoothing, among them penalties such as grad(W ) 2 or the total variation semi-norm [42,11], which favors piecewise-constant solutions.While we recognize that there are many possible choices for the regularizer, the thin plate penalty is reasonable, linear and thus convenient to work with.Previous work [20] has shown that it is useful for the connectome problem.Testing other forms of regularization is a worthy goal but not straightforward to implement at scale.This is outside the scope of the current paper.
Finally, the most exciting prospects for this class of algorithms is what can be learned when we apply them to next-generation tract tracing datasets.Such techniques can be used to resolve differences between the rat [6] brain and mouse [35], or uncover unknown topographies, see [41] in these and other animals (like the marmoset [31]).The mesoscale is also naturally the same resolution as obtained by wide-field calcium imaging.Spatial connectome modeling could elucidate the largely mysterious interactions different sensory modalities, proprioception, and motor areas, hopefully leading to better understanding of integrative functions.

Data and code
We tested our algorithm on two datasets (top view and flatmap) generated from Allen Institute for Brain Science Mouse Connectivity Atlas data http://connectivity.brain-map.org.These data were obtained with the Python SDK allensdk version 0.13.1 available from http://alleninstitute.github.io/AllenSDK/.Our data pulling and processing scripts are available from https://github.com/kharris/allen-voxel-network.
We used the allensdk to retrieve 10 µm injection and projection density volumetric data for 126 wildtype experiments in cortex.These data were projected from 3-D to 2-D using either the top view or flatmap paths and saved as 2-D arrays.Next, the projected coordinates were split into left and right hemispheres.Since wildtype injections were always delivered into the right hemisphere, this becomes our source space S whereas the union of left and right are the target space T .We constructed 2-D 5-point Laplacian matrices on these grids with "free" Neumann boundary conditions on the cortical edge.Finally, the 2-D projected data were downsampled 4 times along each dimension to obtain 40 µm resolution.The injection and projection data were then stacked into the matrices X and Y , respectively.The mask Ω was set via Ω ij = 1 {X ij ≤0.4} .MATLAB code which implements our greedy low-rank algorithm (1) is included in the repository: https://gitlab.mpi-magdeburg.mpg.de/kuerschner/lowrank_connectome.We also include the problem inputs X, Y, L x , L y , Ω for our three example problems (test, top view, and flatmap) as MATLAB files.Note that Ω is stored as 1 − Ω in these files, as this matches the convention of [20].

) 4 .Figure 2 :
Figure 2: Toy brain test problem.Top: true connectivity map W true (left) and the low-rank solution with rank = 40 and λ = 100 (right).Bottom: solutions with Ω = 1 (left) and λ = 0 (right).The locations of simulated injections are shown by the grey bars.This shows that both the mask (Ω) and smoothing (λ > 0) are necessary for good recovery.

Figure 3 : 2 rFigure 4 : 2 rFigure 5 :
Figure 3: Computing times and errors for the top view data.The errors are computed with reference to the full-rank solution W = W full .Full rank time: 6 × 10 5 s (see text).

Figure 6 :
Figure 6: Top four singular vectors of the top view connectivity with r = 500.Left: scaled target vectors Û Σ. Right: source vectors V .

Figure 7 :
Figure 7: Top four singular vectors of the flatmap connectivity with r = 500.Left: scaled target vectors Û Σ. Right: source vectors V .

Table 1 :
Computing times and errors for the toy brain test problem.The output W is compared to truth and a rank 140 reference solution.
rel (W, W true