# Application of Efficient Matrix Inversion to the Decomposition of Hierarchical Matrices

On this page, you can find my Bachelor's Thesis, supplementary material and errata.

Keywords: hierarchical matrices, linear systems, matrix inversion, nested dissection, numerical math, optimization, parallel computing.

# Summary

(Copied from § 1 of the written thesis.)

Solving linear systems – that is, given M ∈ ℝ^(n × n) and b^(1), …, b^(k) ∈ ℝ^n for n, k ∈ ℕ, finding x^(1), …, x^(k) ∈ ℝ^n such that M x^(l) = b^(l) for l ∈ {1, …, k} – is one of the fundamental tasks of numerical mathematics.

We study systems where the coefficient matrix M is symmetric and positive definite and has a special hierarchical structure obtained by a procedure known as nested dissection. The resulting block structure has blank regions that are known to be zero, while the remaining blocks may contain anything. Such hierarchical matrices (H-matrices) are stored as a tree of sub-matrices with ordinary matrix representations at the leaves. Many operations on such matrices can be formulated elegantly using recursion.
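As a rough illustration, such a tree of sub-matrices might look as follows. This is a minimal sketch with my own names and layout, not the data structure from the thesis code:

```cpp
#include <array>
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical sketch: an H-matrix node is either a dense leaf or an
// inner node holding a 2 x 2 grid of children; blocks that are known
// to be zero are represented by null children and cost no storage.
struct HMatrix {
    std::vector<std::vector<double>> dense;          // non-empty at leaves
    std::array<std::unique_ptr<HMatrix>, 4> child;   // row-major 2 x 2 grid

    bool is_leaf() const { return !dense.empty(); }
};

// Many operations recurse over the tree; counting the explicitly
// stored entries is the simplest example of this pattern.
std::size_t stored_entries(const HMatrix& m) {
    if (m.is_leaf())
        return m.dense.size() * m.dense.front().size();
    std::size_t n = 0;
    for (const auto& c : m.child)
        if (c) n += stored_entries(*c);   // blank blocks cost nothing
    return n;
}
```

Matrix-matrix multiplication, factorization, and inversion follow the same recursive pattern, dispatching to ordinary dense routines at the leaves.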

## Previous Work

Maurer and Wieners have adapted block LU decomposition for such matrices and presented an implementation closely related to finite-element computations for distributed memory systems.

Meanwhile, Sanders, Speck and Steffen have presented an algorithm that combines Newton and Strassen inversion to obtain a work-efficient matrix inversion algorithm (NeSt) for symmetric and positive definite matrices with poly-logarithmic time complexity.

## Our Contribution

We have combined block LU decomposition of H-matrices with nested dissection structure – which becomes block LLᵀ (Cholesky) decomposition for symmetric and positive definite matrices – with matrix inversion to obtain a new scheme that we call block LDLᵀ decomposition. It computes a decomposition into a lower triangular block matrix and a block diagonal matrix, where the latter is inverted.
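The recursion behind such schemes can be sketched on a two-by-two block partition (this is the standard block factorization identity; the notation is mine, not copied from the thesis): for symmetric positive definite M,

```latex
M = \begin{pmatrix} A & B^T \\ B & C \end{pmatrix}
  = \underbrace{\begin{pmatrix} I & 0 \\ B A^{-1} & I \end{pmatrix}}_{L}
    \underbrace{\begin{pmatrix} A & 0 \\ 0 & S \end{pmatrix}}_{D}
    \begin{pmatrix} I & A^{-1} B^T \\ 0 & I \end{pmatrix},
  \qquad S = C - B A^{-1} B^T .
```

The algorithm inverts A, forms the Schur complement S, and recurses on S; storing A⁻¹ and S⁻¹ then yields the inverted block diagonal D⁻¹.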

Using NeSt for matrix inversion, this algorithm's work is dominated almost exclusively by matrix-matrix products, for which highly optimized off-the-shelf libraries are available. Furthermore, matrix-matrix multiplication has a logarithmic critical path when computed in parallel, while the Cholesky factorization and substitution needed for the block LLᵀ version both have a linear critical path. Therefore, our variant is expected to get more out of the hardware and scale better to highly parallel systems. On the downside, the LDLᵀ variant has to perform more work for the decomposition.

Once the decomposition is computed, both algorithms can solve systems for the various right-hand sides with equal work. However, the LDLᵀ variant again has a shorter critical path and needs only matrix-vector products, which are likewise available in highly optimized versions.

We have implemented block LLᵀ and LDLᵀ decomposition for shared-memory systems using the C++ programming language. The source code is available as free software.

Tests on a machine with 32 CPUs and 500 GiB of memory show that the achieved rate of effective floating-point operations per unit time is better by roughly a factor of two. This sometimes compensates for the additional work to be done, so with regard to overall execution time, there are problems where either of the algorithms outperforms the other. However, the advantage of block LDLᵀ is not significant, and for smaller systems, the LLᵀ version is much faster.

On the other hand, we could show a significant improvement in execution time when it comes to solving individual linear systems once a decomposition is computed.

Contrary to what one might expect, the additional errors introduced by computing the inverse matrix via an iterative procedure – as opposed to a direct Cholesky factorization – are noticeable but very moderate (≈ 10 %), at least for the well-conditioned systems we have investigated.

| File | Signature | Description |
| --- | --- | --- |
| `thesis_desktop.pdf` | `thesis_desktop.pdf.sig` | Written thesis with colorized embedded hyperlinks. Good for reading on-screen. |
| `thesis_print.pdf` | `thesis_print.pdf.sig` | Written thesis without colorized embedded hyperlinks. The graphics still use color. Good for printing. |
| `raw_data.sqlite` | `raw_data.sqlite.sig` | Raw experimental data in an SQLite database file (`schema.sql`). |
| `report-1.0.tar.gz` | `report-1.0.tar.gz.sig` | Source code for the report (includes experimental data). |
| `code-1.0.tar.gz` | `code-1.0.tar.gz.sig` | Source code for the implementation of the algorithms. |
| `blas-benchmarks-1.0.tar.gz` | `blas-benchmarks-1.0.tar.gz.sig` | Source code for the examples and benchmarks we have written for comparing some linear algebra libraries. |

# Errata

Here I list errata found in the printed thesis. Included are only errors that might affect understanding, not mere spelling and grammar mistakes (there are just too many of them!). Also not listed are insights that some things might better have been done differently.

• The variables k120 and k127 in figure 5.6 should have been named d120 and d127 to match the names used in the surrounding text.
• In the first paragraph of § 1.1, it should be distributed, not shared memory systems.
• In figure 4.7, the type of M⁻¹ in the `InversePreconditioner` class is wrong. It should be `FullMatrix`, not `ArrangedMatrix`.