NEW ALGORITHMS OF FAST DIAGONALIZATION OF REAL SYMMETRIC MATRICES.
Author: Yurii F. Sigolaev (St. Petersburg). E-mail: y_sigolaev@hotmail.com
Last revision: January 9, 2007
The performance of personal computers was dramatically improved during the last decade. Therefore, cardinal revision of the majority of algorithms becomes an urgent problem. In so doing, it is necessary to lift the contradiction between the 32-bit architecture (imposing significant limitations upon the volume of information being processed) and increased processor speed.
I have developed a new complex of algorithms of diagonalization of dense real symmetric matrices taking into account this factor. This complex (SDIAG) is realized in C and has a number of advantages compared to other known packages realizing diagonalization algorithms:
- noticeable increase in the computation speed;
- considerable saving of RAM. The difference (by a factor of 6 when finding all eigenvectors and by a factor of 8 when finding a part of eigenvectors) between all the known high-speed modern diagonalization methods and the algorithms I suggest is so large that my algorithms will apparently remain urgent with the 64-bit architecture also.
These results were attained thanks to the solution of a basic problem of numerical analysis of tridiagonal matrices, development of new algorithm for vector multiplication and improvement of the block algorithms for packed matrices. Some of the algorithms use the IEEE-arithmetic support and use SSE2 and SSE3 for efficiency. For a comparative analysis, I used Intel MKL package because of the common source of the code, LAPACK package.
Description of the new algorithms and the references are given on the page devoted to P4 processor: http://www.thesa-store.com/products/.
Please address all remarks and suggestions to: y_sigolaev@hotmail.com
Tel/Fax: 7(812)2691898
Comparison results.
I compared the results I obtained with those from the known Intel MKL package (Package ID: w_mkl_p_9.0.017). Hardware configuration: Core 2 Duo 2.66 GHz processor (Processor E6700, Speed: 3.00 GHz (BIOS setup: Overclock Options [FSB1200/DDR2-800])), Data cache size : L1 32 KB, L2 4096 KB, L3 0 KB, 1066 MHz FSB, Motherboard: P5WDG2 WS Professional, 1066 MHz FSB, DDR2 800 MHz (2 GB), OS Windows XP Professional SP2.
1. Consider a packed symmetric Toplitz matrix:
Aij = arcsin(1 / (i - j + 1))/(i - j + 1)3 (i , j = 1, 2, ..., n; i >= j)
Application of the algorithms described above in Sections 1.1, 2 and 3.1 on the page devoted to P4 CPU to the solution of full algebraic problem for this matrix gave results showing that the DSYEVD Intel MKL procedure (currently the only procedure combining calculation speed and accuracy but requiring a huge memory resource) is not effective. The comparative results obtained using one core (OMP_NUM_THREADS=1) and two cores (OMP_NUM_THREADS=2) are given in Tables 1 and 2, respectively. The SDIAG speed on one core exceeds the DSYEVD Intel MKL speed on two cores.
TABLE 1
| n | t (sec) | Δt (%) | |
| SDIAG 3.0 |
DSYEVD Intel MKL 9.0 |
||
|
1000 2000 3000 4000 5000 6000 7000 8000 9000 |
0.81 5.1 15.9 35.8 67.9 114 178 263 370 |
1.06 7.8 25.1 56.9 111 186 297 436 618 |
31 53 58 59 63 63 67 66 67 |
TABLE 2
| n | t (sec) | Δt (%) | |
| SDIAG 3.0 |
DSYEVD Intel MKL 9.0 |
||
|
1000 2000 3000 4000 5000 6000 7000 8000 9000 |
0.56 3.5 11.0 24.8 46.9 79.2 124 184 259 |
0.81 5.6 17.9 40.4 75.7 131 200 298 418 |
45 60 63 63 61 65 61 62 61 |
2. One of the remarkable differences in the speed of diagonalization procedure is difference in the speed of basic operations of linear algebra BLAS Level 2 and BLAS Level 3. Let's find product of 2 square matrices:
Aij = sin(i - j - 1) (i , j = 1, 2, ..., n).
Bij = sin(i - j + 1) (i , j = 1, 2, ..., n).
The algorithm of rapid matrix multiplication was developed. It is not incorporated yet in algorithm of diagonalization and possesses a number of advantages as compared with known one by Strassen's:
Minimal influence of rounding errors on the precision of calculations.
Minimal RAM demands.
The comparative results obtained using one core (OMP_NUM_THREADS=1) are given in Tables 3.
Break for BLAS a level 2 is 35-36% for IA32 and EM64T platforms.
TABLE 3
| n | t (sec) | Δt (%) | |
| DGEMM SDIAG 3.2 |
DGEMM Intel MKL 10.0 |
||
|
1800 3600 5400 7200 9000 |
1.05 8.09 26.9 63.3 123 |
1.57 12.4 41.9 99.1 194 |
49.5 53.3 55.8 56.6 57.7 |