Application of Efficient Matrix Inversion to the Decomposition of Hierarchical Matrices
On this page, you can find my Bachelor's thesis, supplementary material, and errata.
Keywords: hierarchical matrices, linear systems, matrix inversion, nested dissection, numerical math, optimization, parallel computing.
Summary
(Copied from § 1 of the written thesis.)
Solving linear systems – that is, given M ∈ ℝ^{n × n} and b^{(1)}, …, b^{(k)} ∈ ℝ^{n} for n, k ∈ ℕ, finding x^{(1)}, …, x^{(k)} ∈ ℝ^{n} such that M x^{(l)} = b^{(l)} for l ∈ {1, …, k} – is one of the fundamental tasks of numerical mathematics.
We are studying systems where the coefficient matrix M is symmetric and positive definite and has a special hierarchical structure that is obtained by performing a procedure known as nested dissection that yields a structure like the following
where the blank regions are known to be zero and the shaded blocks may contain anything. Such hierarchical matrices (H-matrices) are stored as a tree of submatrices with ordinary matrix representations at the leaves. Many operations on such matrices can be formulated elegantly using recursion.
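To illustrate the storage scheme, here is a minimal sketch in C++ (the language of our implementation, though all names here are illustrative and not taken from the actual source code) of such a block tree with dense leaves and a recursive matrix-vector product:

```cpp
#include <array>
#include <cstddef>
#include <memory>
#include <vector>

struct HMatrix {
    // Leaf: dense row-major storage; internal node: four children
    // {A11, A12, A21, A22}, where off-diagonal children may be null
    // for the blank (known-zero) regions produced by nested dissection.
    std::vector<double> dense;                      // n*n entries if leaf
    std::array<std::unique_ptr<HMatrix>, 4> child;  // empty if leaf
    std::size_t n;                                  // dimension of this square block

    bool leaf() const { return !child[0]; }
};

// y += M * x, formulated recursively over the block tree.
void matvec(const HMatrix& M, const double* x, double* y) {
    if (M.leaf()) {
        for (std::size_t i = 0; i < M.n; ++i)
            for (std::size_t j = 0; j < M.n; ++j)
                y[i] += M.dense[i * M.n + j] * x[j];
        return;
    }
    const std::size_t h = M.n / 2;
    if (M.child[0]) matvec(*M.child[0], x, y);          // A11 * x1
    if (M.child[1]) matvec(*M.child[1], x + h, y);      // A12 * x2
    if (M.child[2]) matvec(*M.child[2], x, y + h);      // A21 * x1
    if (M.child[3]) matvec(*M.child[3], x + h, y + h);  // A22 * x2
}
```

The zero blocks never appear in the tree at all, so the recursion skips them for free; this is where the nested dissection structure pays off.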
Previous Work
Maurer and Wieners have adapted block LU decomposition for such matrices and presented an implementation closely related to finite-element computations for distributed-memory systems.
Meanwhile, Sanders, Speck and Steffen have presented an algorithm that combines Newton and Strassen inversion to obtain a work-efficient matrix inversion algorithm (NeSt) for symmetric and positive definite matrices with polylogarithmic time complexity.
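The Newton part of such a scheme can be sketched as follows. This is the classical Newton(–Schulz) iteration X ← X (2I − M X) on a small dense matrix, not the actual NeSt algorithm (which additionally uses Strassen inversion to make the overall work efficient); all names are illustrative:

```cpp
#include <cstddef>
#include <vector>

using Mat = std::vector<double>;  // row-major n x n matrix

// C = A * B (naive dense product; a real implementation would call BLAS)
static Mat mul(const Mat& A, const Mat& B, std::size_t n) {
    Mat C(n * n, 0.0);
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t k = 0; k < n; ++k)
            for (std::size_t j = 0; j < n; ++j)
                C[i * n + j] += A[i * n + k] * B[k * n + j];
    return C;
}

// Newton iteration for the inverse of an SPD matrix M:
//   X <- X (2I - M X).
// For SPD M, the guess X0 = I / trace(M) guarantees convergence,
// since trace(M) bounds the largest eigenvalue from above.
Mat newton_inverse(const Mat& M, std::size_t n, int steps) {
    double tr = 0.0;
    for (std::size_t i = 0; i < n; ++i) tr += M[i * n + i];
    Mat X(n * n, 0.0);
    for (std::size_t i = 0; i < n; ++i) X[i * n + i] = 1.0 / tr;
    for (int s = 0; s < steps; ++s) {
        Mat MX = mul(M, X, n);              // M X
        Mat R(n * n, 0.0);                  // R = 2I - M X
        for (std::size_t i = 0; i < n * n; ++i) R[i] = -MX[i];
        for (std::size_t i = 0; i < n; ++i) R[i * n + i] += 2.0;
        X = mul(X, R, n);                   // X <- X R
    }
    return X;
}
```

Each step costs two matrix-matrix products and squares the residual norm ‖I − M X‖, which is what makes the approach attractive when fast parallel multiplication is available.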
Our Contribution
We have combined block LU decomposition of H-matrices with nested dissection structure – which is block LL^{T} decomposition for symmetric and positive definite matrices – with matrix inversion to obtain a new scheme that we call block LDL^{T} decomposition. It computes a decomposition into a lower triangular block matrix and a block diagonal matrix where the latter is inverted.
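For a single level of the block structure, with M partitioned into four blocks, the decomposition reads as follows (this is the standard block factorization for symmetric positive definite M with symmetric A; the thesis applies it recursively over the nested dissection tree, and the diagonal blocks A and S are the parts that get inverted):

```latex
M =
\begin{pmatrix} A & B^{T} \\ B & C \end{pmatrix}
=
\underbrace{\begin{pmatrix} I & 0 \\ B A^{-1} & I \end{pmatrix}}_{L}
\underbrace{\begin{pmatrix} A & 0 \\ 0 & S \end{pmatrix}}_{D}
\underbrace{\begin{pmatrix} I & A^{-1} B^{T} \\ 0 & I \end{pmatrix}}_{L^{T}},
\qquad
S = C - B A^{-1} B^{T} .
```

Multiplying out confirms the factorization: L D L^{T} has A in the upper-left block, B^{T} and B off the diagonal, and B A^{-1} B^{T} + S = C in the lower-right block.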
Using NeSt for matrix inversion, this algorithm's work is dominated almost exclusively by matrix-matrix products, for which the most highly optimized libraries are available off-the-shelf. Moreover, matrix-matrix multiplication has a logarithmic critical path length when computed in parallel, while the Cholesky factorization and substitution needed for the block LL^{T} version both have a linear critical path. Our variant is therefore expected to get more out of the hardware and scale better to highly parallel systems. On the downside, the LDL^{T} variant has to perform more work for the decomposition.
Once the decomposition is computed, both algorithms can solve systems for the various right-hand sides with equal work. However, the LDL^{T} variant again has a shorter critical path and needs only matrix-vector products, which are likewise readily available in highly optimized versions.
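To make this concrete, here is the solve phase spelled out for one block level, writing M = L D L^{T} with unit-diagonal block lower triangular L (off-diagonal block L_{21}) and the block diagonal inverse D^{-1} = diag(D_1^{-1}, D_2^{-1}) precomputed:

```latex
\begin{aligned}
&\text{forward:}  & y_1 &= b_1,            & y_2 &= b_2 - L_{21} b_1,\\
&\text{diagonal:} & z_1 &= D_1^{-1} y_1,   & z_2 &= D_2^{-1} y_2,\\
&\text{backward:} & x_2 &= z_2,            & x_1 &= z_1 - L_{21}^{T} z_2 .
\end{aligned}
```

Every step is a matrix-vector product followed by at most a vector subtraction; no triangular substitutions with their sequential dependencies are needed.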
We have implemented block LL^{T} and LDL^{T} decomposition for shared-memory systems using the C++ programming language. The source code is available as free software.
Tests on a machine with 32 CPUs and 500 GiB of memory show that the achieved rate of effective floating-point operations per unit time is better by roughly a factor of two. This sometimes compensates for the additional work, so with regard to overall execution time there are problems where either algorithm outperforms the other. However, the advantage of block LDL^{T} is not significant, and for smaller systems the LL^{T} version is much faster.
On the other hand, we could show a significant improvement in execution time for solving individual linear systems once a decomposition is computed.
Contrary to what one might expect, the additional errors introduced by computing the inverse matrix via an iterative procedure – as opposed to a direct Cholesky factorization – are measurable but very moderate (≈ 10 %), at least for the well-conditioned systems we have investigated.
Downloads
File | Signature | Description
thesis_desktop.pdf | thesis_desktop.pdf.sig | Written thesis with colorized embedded hyperlinks. Good for reading on-screen.
thesis_print.pdf | thesis_print.pdf.sig | Written thesis without colorized embedded hyperlinks. The graphics still use color. Good for printing.
raw_data.sqlite | raw_data.sqlite.sig | Raw experimental data in an SQLite database file (schema.sql).
report1.0.tar.gz | report1.0.tar.gz.sig | Source code for the report (includes experimental data).
code1.0.tar.gz | code1.0.tar.gz.sig | Source code for the implementation of the algorithms.
blasbenchmarks1.0.tar.gz | blasbenchmarks1.0.tar.gz.sig | Source code for the examples and benchmarks we have written for comparing some linear algebra libraries.
Errata
Here I list errata found in the printed thesis. Included are only errors that might affect understanding, not mere spelling and grammar mistakes (there are just too many of them!). Also not listed are hindsight insights that some things might better have been done differently.
- The variables k_{120} and k_{127} in figure 5.6 should have been named d_{120} and d_{127} to match the names used in the surrounding text.
- In the first paragraph of § 1.1, it should be "distributed", not "shared" memory systems.
- In figure 4.7, the type of M^{−1} in the InversePreconditioner class is wrong. It should be FullMatrix, not ArrangedMatrix.