gelato.org IMPACT Advanced Compiler Technology

University of Illinois Urbana Champaign


OpenSSL 0.9.7b optimized with OpenIMPACT


10/1/03 Update:
OpenSSL 0.9.7b has a security hole; see http://www.openssl.org/news/secadv_20030930.txt for details. New OpenIMPACT-compiled binaries for OpenSSL 0.9.7c will be available here soon.

These tarballs contain a complete installation of OpenSSL 0.9.7b. This version uses OpenSSL's hand-written assembly bignum routines. This OpenSSL distribution does not include the mdc2, idea, or rc5 algorithms.

The -cspec version has control speculation enabled. It is therefore slightly faster, but requires the general speculation kernel patch. If you do not wish to apply the patch, the -nocspec version is available.

openssl-0.9.7c.tar.gz (coming soon) - OpenSSL source code.
openssl-0.9.7c-nocspec.tar.gz (coming soon) - OpenSSL 0.9.7b with control speculation turned off. This version works on unpatched kernels.
openssl-0.9.7c-cspec.tar.gz (coming soon) - OpenSSL 0.9.7b with control speculation turned on. This version requires the general speculation kernel patch.
http://www.openssl.org/ The OpenSSL web site.

Benchmarking and Compilation:

Compiler arguments:
gcc - Compiled with GCC 3.2 using gcc -O3 -fomit-frame-pointer.
ecc - Compiled with ECC 7.0 using ecc -O3.
oicc (cspec) - Compiled with the OpenIMPACT compiler using oicc -O.
oicc (nocspec) - Compiled with oicc -O --no-cspec.
RedHat 7.2 stock - OpenSSL from the RPM distributed with RedHat Linux 7.2. Compiled using gcc -fPIC -O2 -mfixed-range=f32-f127. The GCC version used is not indicated.

Assembly language tweaks:
OpenSSL includes hand-written assembly language bignum routines for several platforms, including IA-64. These routines are used heavily by the RSA and DSA algorithms. Depending on the compiler, they speed up the RSA and DSA algorithms by a factor of 3 to 6.

Because gcc is the only compiler that supports the assembly language bignum routines as written, I could not use the standard Makefile with ecc and oicc. For those two compilers, I first compiled using the C bignum routines. I then ran the assembly source through GNU as by hand, swapped the new bignum object (bn_asm.o) into libcrypto.a, and relinked openssl. The assembly file is free of any C code, so this method does not introduce one compiler's optimizations into the output of another.

Benchmarking setup:
OpenSSL's built-in speed test was used for benchmarking. The machine used was an unloaded 900MHz rx2600 (2 x Itanium 2, 8 GB RAM) running kernel 2.4.21-pre5 with the control speculation patch. The command openssl speed was executed five times for each build, and the results were averaged to generate the tables below. The full output is available in the benchmarking output section.
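The averaging and the speedup figures can be reproduced with a calculation along these lines. This is a minimal sketch, not the script used for this page: the function names are made up for illustration, and the throughput numbers in the example are hypothetical placeholders rather than values from the logs.

```python
from statistics import mean


def average_runs(runs):
    """Average per-algorithm throughput over several benchmark runs.

    `runs` is a list of dicts, one per run, mapping algorithm name to
    throughput (any consistent unit, e.g. 1000s of bytes per second).
    """
    algorithms = runs[0].keys()
    return {alg: mean(run[alg] for run in runs) for alg in algorithms}


def speedup_over_baseline(averages, baseline="gcc"):
    """Divide each build's averaged throughput by the baseline build's."""
    base = averages[baseline]
    return {
        build: {alg: results[alg] / base[alg] for alg in base}
        for build, results in averages.items()
    }


if __name__ == "__main__":
    # Hypothetical numbers, five runs per build, for illustration only.
    runs_by_build = {
        "gcc": [{"md5": 100.0}] * 5,
        "oicc (cspec)": [{"md5": v} for v in (110.0, 112.0, 108.0, 110.0, 110.0)],
    }
    averages = {build: average_runs(runs) for build, runs in runs_by_build.items()}
    for build, row in speedup_over_baseline(averages).items():
        print(build, {alg: f"{ratio:.2f}x" for alg, ratio in row.items()})
```

A ratio above 1.00x means the build processed more bytes per second than the gcc build on that algorithm.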

Performance:
Speedup over gcc for symmetric ciphers on 8KB blocks (graph)
Algorithm                 gcc     ecc     oicc (cspec)  oicc (nocspec)  RedHat 7.2 stock (0.9.6b)
md2 (see notes)           1.00x   1.28x   0.82x         0.82x           0.55x
md4                       1.00x   1.05x   1.04x         1.04x           1.01x
md5                       1.00x   1.13x   1.10x         1.10x           1.01x
hmac (md5)                1.00x   1.14x   1.09x         1.09x           1.02x
sha1 (see notes)          1.00x   1.57x   1.37x         1.37x           1.14x
rmd160                    1.00x   1.66x   1.90x         1.90x           0.99x
rc4 (see notes)           1.00x   2.08x   2.34x         2.34x           1.95x
des cbc                   1.00x   1.16x   1.37x         1.37x           0.93x
des ede3                  1.00x   1.16x   1.35x         1.35x           0.90x
rc2 cbc                   1.00x   1.11x   1.06x         1.06x           0.91x
blowfish cbc (see notes)  1.00x   0.78x   1.11x         1.12x           0.98x
cast cbc                  1.00x   1.07x   1.18x         1.19x           1.10x
aes-128 cbc               1.00x   1.19x   1.17x         1.11x           N/A
aes-192 cbc               1.00x   1.20x   1.25x         1.15x           N/A
aes-256 cbc               1.00x   1.16x   1.25x         1.11x           N/A

Notes:
md2: oicc spends much of its time waiting for data from the L1 cache. Loads and stores tend to be clustered, so the requests overwhelm the L1 data cache. A secondary loss of performance is branch misprediction, although L1 stalls dominate.
sha1: The main loop of sha1 is much larger than the L1 instruction cache. ecc generates a more compact loop than oicc, giving it a slight advantage.
rc4: ecc and oicc stall roughly the same amount. Virtually all of ecc's stalls are due to the D-cache (expensive L3 misses), while oicc's are spread across the D-cache (~50%), branch misprediction (~25%), and register spills (~25%). gcc spends 50% of its run time waiting on data from the L3.
blowfish: oicc does a better job of finding parallelism (averaging 2.4 instructions/cycle) than ecc (1.9) or gcc (2.0).

Speedup over gcc for asymmetric ciphers* (graph)
Algorithm                 gcc     ecc     oicc (cspec)  oicc (nocspec)  RedHat 7.2 stock (0.9.6b)
rsa (512 bit, sign)       1.00x   1.01x   0.97x         0.97x           0.84x
rsa (512 bit, verify)     1.00x   1.08x   0.95x         0.96x           1.03x
rsa (1024 bit, sign)      1.00x   1.02x   0.95x         0.96x           0.70x
rsa (1024 bit, verify)    1.00x   1.08x   0.96x         0.97x           0.97x
rsa (2048 bit, sign)      1.00x   1.02x   0.95x         0.97x           0.72x
rsa (2048 bit, verify)    1.00x   1.07x   0.97x         0.98x           0.85x
rsa (4096 bit, sign)      1.00x   1.01x   0.97x         0.98x           0.69x
rsa (4096 bit, verify)    1.00x   1.05x   0.98x         0.99x           0.75x
dsa (512 bit, sign)       1.00x   1.03x   0.97x         0.97x           0.71x
dsa (512 bit, verify)     1.00x   1.01x   0.95x         0.95x           0.70x
dsa (1024 bit, sign)      1.00x   1.03x   0.96x         0.97x           0.73x
dsa (1024 bit, verify)    1.00x   1.04x   0.95x         0.98x           0.74x
dsa (2048 bit, sign)      1.00x   1.02x   0.97x         0.98x           0.70x
dsa (2048 bit, verify)    1.00x   1.01x   0.96x         0.98x           0.69x
*The asymmetric ciphers are a poor test of compiler performance. They rely heavily on the hand-tuned assembly code for speed, which is why the numbers vary little between compilers. This table is included merely for completeness.

Full benchmarking output:
This section contains the raw output from running openssl speed 2>&1 > log during the benchmarking tests.
gcc: 1 2 3 4 5
ecc: 1 2 3 4 5
oicc (cspec): 1 2 3 4 5
oicc (nocspec): 1 2 3 4 5
RedHat 7.2 stock (0.9.6b): 1 2 3 4 5
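
The 8KB column used in the symmetric-cipher table could be pulled out of a raw log roughly as follows. This is a sketch under stated assumptions: it assumes each summary line consists of an algorithm name followed by exactly five throughput figures with a trailing "k" (for 16, 64, 256, 1024, and 8192 byte blocks), and the sample text and its numbers are hypothetical rather than copied from the logs above. The function name is likewise made up for illustration.

```python
import re

# Assumed line shape: <name> <16B>k <64B>k <256B>k <1024B>k <8192B>k
SUMMARY_LINE = re.compile(r"^(?P<name>\S.*?)\s+((?:\d+(?:\.\d+)?k\s*){5})$")


def parse_8k_throughput(log_text):
    """Return {algorithm: throughput on 8192-byte blocks} from a speed log."""
    results = {}
    for line in log_text.splitlines():
        match = SUMMARY_LINE.match(line.strip())
        if not match:
            continue  # progress messages, headers, RSA/DSA lines, etc.
        values = [float(v.rstrip("k")) for v in match.group(2).split()]
        results[match.group("name")] = values[-1]  # last column = 8192 bytes
    return results


if __name__ == "__main__":
    # Hypothetical excerpt, for illustration only.
    sample = """\
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
md2               1000.00k     2000.00k     3000.00k     4000.00k     5000.00k
des cbc           1100.50k     2100.50k     3100.50k     4100.50k     5100.50k
"""
    print(parse_8k_throughput(sample))
```

Lines that do not match the assumed shape are skipped, so the header row and the RSA and DSA timing lines are ignored rather than misread.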