NEW ALGORITHMS OF FAST DIAGONALIZATION OF REAL SYMMETRIC MATRICES.

 

Author: Yurii F. Sigolaev (St. Petersburg). E-mail: y_sigolaev@hotmail.com

Last revision: January 9, 2007

 

     The performance of personal computers was dramatically improved during the last decade. Therefore, cardinal revision of the majority of algorithms becomes an urgent problem. In so doing, it is necessary to lift the contradiction between the 32-bit architecture (imposing significant limitations upon the volume of information being processed) and increased processor speed.

      I have developed a new complex of algorithms of diagonalization of dense real symmetric matrices taking into account this factor. This complex (SDIAG) is realized in C and has a number of advantages compared to other known packages realizing diagonalization algorithms:

 

-          noticeable increase in the computation speed;

-          considerable saving of RAM. The difference (by a factor of 6 when finding all eigenvectors and by a factor of 8 when finding a part of eigenvectors) between all the known high-speed modern diagonalization methods and the algorithms I suggest is so large that my algorithms will apparently remain urgent with the 64-bit architecture also.

 

      These results were attained thanks to the solution of a basic problem of numerical analysis of tridiagonal matrices, development of new algorithm for vector multiplication and improvement of the block algorithms for packed matrices. Some of the algorithms use the IEEE-arithmetic support and use SSE2 and SSE3 for efficiency. For a comparative analysis, I used Intel MKL package because of the common source of the code, LAPACK package.

      Description of the new algorithms and the references are given on the page devoted to P4 processor:  http://www.thesa-store.com/products/.

     

Please address all remarks and suggestions to: y_sigolaev@hotmail.com

Tel/Fax: 7(812)2691898

   Comparison results.

    

I compared the results I obtained with those from the known Intel MKL package (Package ID: w_mkl_p_9.0.017). Hardware configuration: Core 2 Duo 2.66 GHz processor (Processor E6700, Speed: 3.00 GHz (BIOS setup: Overclock Options [FSB1200/DDR2-800])), Data cache size : L1 32 KB, L2 4096 KB, L3 0 KB, 1066 MHz FSB, Motherboard:  P5WDG2 WS Professional, 1066 MHz FSB, DDR2 800 MHz (2 GB), OS Windows XP Professional SP2.

1. Consider a packed symmetric Toplitz matrix:

Aij = arcsin(1 / (i - j + 1))/(i - j + 1)3 (i , j = 1, 2, ..., n; i >= j)

Application of the algorithms described above in Sections 1.1, 2 and 3.1 on the page devoted to P4 CPU to the solution of full algebraic problem for this matrix gave results showing that the DSYEVD Intel MKL procedure (currently the only procedure combining calculation speed and accuracy but requiring a huge memory resource) is not effective. The comparative results obtained using one core (OMP_NUM_THREADS=1) and two cores (OMP_NUM_THREADS=2) are given in Tables 1 and 2, respectively. The SDIAG speed on one core exceeds the DSYEVD Intel MKL speed on two cores.

TABLE 1

n t (sec)  Δt (%)
SDIAG 3.0 DSYEVD
Intel  MKL 9.0

1000

2000

3000

4000

5000

6000

7000

8000

9000

0.81

5.1

15.9

35.8

67.9

114

178

263

370

1.06

7.8

 25.1

56.9

111

186

297

436

618

31

53

 58

 59

63

63

67

66

67

TABLE 2

n t (sec)  Δt (%)
SDIAG 3.0 DSYEVD
Intel  MKL 9.0

1000

2000

3000

4000

5000

6000

7000

8000

9000

0.56

3.5

11.0

24.8

46.9

79.2

124

184

259

0.81

5.6

 17.9

40.4

75.7

131

200

298

418

45

60

 63

 63

61

65

61

62

61

 

2. One of the remarkable differences in the speed of diagonalization procedure is difference in the speed of basic operations of linear algebra BLAS Level 2 and BLAS Level 3. Let's find product of 2 square matrices:

Aij = sin(i - j - 1)                    (i , j = 1, 2, ..., n).

Bij = sin(i - j + 1)                   (i , j = 1, 2, ..., n).

The algorithm of rapid matrix multiplication was developed. It is not incorporated yet in algorithm of diagonalization and possesses a number of advantages as compared with known one by Strassen's:

The comparative results obtained using one core (OMP_NUM_THREADS=1) are given in Tables 3.

    Break for BLAS a level 2 is 35-36% for IA32 and EM64T platforms.

TABLE 3

n t (sec)  Δt (%)
DGEMM SDIAG 3.2 DGEMM
Intel  MKL 10.0

1800

3600

5400

7200

9000

1.05

8.09

26.9

63.3

123

1.57

 12.4

41.9

99.1

194

49.5

 53.3

 55.8

56.6

57.7