OpenSSL 0.9.7b optimized with OpenIMPACT
10/1/03 Update:
OpenSSL 0.9.7b has a security hole - see http://www.openssl.org/news/secadv_20030930.txt. New
OpenIMPACT compiled binaries for OpenSSL 0.9.7c will soon be available here.
These tarballs contain a complete installation of OpenSSL 0.9.7b. This
build uses OpenSSL's hand-written assembly bignum routines and does not
include the mdc2, idea, or rc5 algorithms.
The -cspec version has control speculation enabled. It is therefore
slightly faster, but requires the general speculation kernel patch. If
you do not wish to apply the patch, the -nocspec version is
available.
| openssl-0.9.7c.tar.gz - Coming soon | OpenSSL source code. |
| openssl-0.9.7c-nocspec.tar.gz - Coming soon | OpenSSL 0.9.7b with control speculation turned off. This version works on unpatched kernels. |
| openssl-0.9.7c-cspec.tar.gz - Coming soon | OpenSSL 0.9.7b with control speculation turned on. This version requires the general speculation kernel patch. |
| http://www.openssl.org/ | The OpenSSL web site. |
Benchmarking and Compilation:
Compiler arguments:
gcc - Compiled with GCC 3.2 using gcc -O3 -fomit-frame-pointer.
ecc - Compiled with ECC 7.0 using ecc -O3.
oicc (cspec) - Compiled with the OpenIMPACT compiler using oicc -O.
oicc (nocspec) - Compiled with oicc -O --no-cspec.
RedHat 7.2 stock - OpenSSL from the RPM distributed with RedHat Linux 7.2. Compiled using gcc -fPIC -O2 -mfixed-range=f32-f127. The GCC version used is not indicated.
Assembly language tweaks:
OpenSSL includes hand-written assembly language bignum routines for
several platforms, including IA-64. These routines are used heavily by the
RSA and DSA algorithms. Depending on the compiler, they speed up RSA and
DSA by a factor of 3x to 6x compared with the C bignum routines.
As gcc is the only compiler that supports the assembly language bignum
routines as written, I could not use the standard Makefile with ecc and oicc.
Instead, I first compiled using the C bignum routines. I then ran the
assembly source through GNU as by hand, swapped the new bignum object
(bn_asm.o) into libcrypto.a, and relinked openssl. The assembly file contains
no C code, so this method does not introduce one compiler's optimizations
into the output of another.
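The object swap described above can be sketched as a small Python helper that builds the two commands involved. This is only an illustration under stated assumptions: the assembly file name bn_ia64.s is hypothetical, the exact flags used for the original builds are not recorded here, and any preprocessing the assembly source may need as well as the final relink of openssl depend on the build tree and are not shown.

```python
import subprocess


def bignum_swap_commands(asm_source, archive="libcrypto.a", obj="bn_asm.o"):
    """Build the commands that assemble the bignum source and swap it in.

    asm_source is a hypothetical path to the assembly file; archive and
    obj default to the names mentioned in the text.
    """
    return [
        # Assemble the hand-written routines with GNU as.
        ["as", "-o", obj, asm_source],
        # Replace the C-compiled member of the archive with the new object.
        ["ar", "r", archive, obj],
    ]


def run_swap(asm_source, dry_run=True, **kwargs):
    """Print the commands (dry run) or execute them in order."""
    commands = bignum_swap_commands(asm_source, **kwargs)
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            # check=True stops at the first failing step.
            subprocess.run(cmd, check=True)
    return commands


if __name__ == "__main__":
    run_swap("bn_ia64.s")  # hypothetical file name, dry run only
```

With dry_run left at its default, the helper only prints the commands, so it can be used to review the steps before touching a real libcrypto.a.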
Benchmarking setup:
OpenSSL's built-in speed test was used for benchmarking. The machine used was
an unloaded 900 MHz rx2600 (2 x Itanium 2, 8 GB RAM) running kernel
2.4.21-pre5 with the control speculation patch. The command openssl
speed was executed five times and the results averaged to
generate the tables below. The full output is available in the benchmarking output section.
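As a rough illustration of how such a table can be derived, the sketch below averages per-algorithm throughput over several runs and then divides each compiler's average by the gcc average. The throughput numbers are made up for the example, and the order of operations (average first, then divide) is an assumption; the page does not state exactly how the ratios were computed.

```python
from statistics import mean


def average_runs(runs):
    """Average throughput per algorithm over a list of runs.

    Each run is a dict mapping algorithm name -> throughput
    (higher is better).
    """
    algorithms = runs[0].keys()
    return {alg: mean(run[alg] for run in runs) for alg in algorithms}


def speedup_over_baseline(averages, baseline="gcc"):
    """Divide every compiler's average by the baseline compiler's average."""
    base = averages[baseline]
    return {
        compiler: {alg: avg[alg] / base[alg] for alg in base if alg in avg}
        for compiler, avg in averages.items()
    }


if __name__ == "__main__":
    # Hypothetical throughput values for five runs, not measured data.
    raw = {
        "gcc": [{"md5": v} for v in (100.0, 102.0, 98.0, 101.0, 99.0)],
        "ecc": [{"md5": v} for v in (113.0, 112.0, 114.0, 113.0, 113.0)],
    }
    averages = {name: average_runs(runs) for name, runs in raw.items()}
    for name, ratios in speedup_over_baseline(averages).items():
        for alg, ratio in ratios.items():
            print(f"{name:4s} {alg}: {ratio:.2f}x")
```

Because the baseline is divided by itself, the gcc column always comes out as 1.00x, which matches the layout of the tables.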
Performance:
Speedup over gcc for symmetric ciphers on 8KB blocks (graph)
| Algorithm | gcc | ecc | oicc (cspec) | oicc (nocspec) | RedHat 7.2 stock (0.9.6b) |
| md2 (see notes) | 1.00x | 1.28x | 0.82x | 0.82x | 0.55x |
| md4 | 1.00x | 1.05x | 1.04x | 1.04x | 1.01x |
| md5 | 1.00x | 1.13x | 1.10x | 1.10x | 1.01x |
| hmac (md5) | 1.00x | 1.14x | 1.09x | 1.09x | 1.02x |
| sha1 (see notes) | 1.00x | 1.57x | 1.37x | 1.37x | 1.14x |
| rmd160 | 1.00x | 1.66x | 1.90x | 1.90x | 0.99x |
| rc4 (see notes) | 1.00x | 2.08x | 2.34x | 2.34x | 1.95x |
| des cbc | 1.00x | 1.16x | 1.37x | 1.37x | 0.93x |
| des ede3 | 1.00x | 1.16x | 1.35x | 1.35x | 0.90x |
| rc2 cbc | 1.00x | 1.11x | 1.06x | 1.06x | 0.91x |
| blowfish cbc (see notes) | 1.00x | 0.78x | 1.11x | 1.12x | 0.98x |
| cast cbc | 1.00x | 1.07x | 1.18x | 1.19x | 1.10x |
| aes-128 cbc | 1.00x | 1.19x | 1.17x | 1.11x | N/A |
| aes-192 cbc | 1.00x | 1.20x | 1.25x | 1.15x | N/A |
| aes-256 cbc | 1.00x | 1.16x | 1.25x | 1.11x | N/A |
Notes:
md2: oicc spends much of its time waiting for data from the L1 cache. Loads and stores tend to be clustered, so the requests overwhelm the L1 data cache. A secondary loss of performance is branch misprediction, although L1 stalls dominate.
sha1: The main loop of sha1 is much larger than
the L1 instruction cache. ecc generates a more compact loop than
oicc, giving it a slight advantage.
rc4: ecc and oicc stall roughly the same amount.
Virtually all of ecc's stalls are due to the D-cache (expensive L3
misses), while oicc's are spread across the D-cache (~50%), branch
misprediction (~25%), and register spills (~25%). gcc spends 50%
of its run time waiting on data from the L3.
blowfish: oicc does a better job of finding
parallelism (averaging 2.4 instructions/cycle) than ecc (1.9) or gcc
(2.0).
Speedup over gcc for asymmetric ciphers* (graph)
| Algorithm | gcc | ecc | oicc (cspec) | oicc (nocspec) | RedHat 7.2 stock (0.9.6b) |
| rsa (512 bit, sign) | 1.00x | 1.01x | 0.97x | 0.97x | 0.84x |
| rsa (512 bit, verify) | 1.00x | 1.08x | 0.95x | 0.96x | 1.03x |
| rsa (1024 bit, sign) | 1.00x | 1.02x | 0.95x | 0.96x | 0.70x |
| rsa (1024 bit, verify) | 1.00x | 1.08x | 0.96x | 0.97x | 0.97x |
| rsa (2048 bit, sign) | 1.00x | 1.02x | 0.95x | 0.97x | 0.72x |
| rsa (2048 bit, verify) | 1.00x | 1.07x | 0.97x | 0.98x | 0.85x |
| rsa (4096 bit, sign) | 1.00x | 1.01x | 0.97x | 0.98x | 0.69x |
| rsa (4096 bit, verify) | 1.00x | 1.05x | 0.98x | 0.99x | 0.75x |
| dsa (512 bit, sign) | 1.00x | 1.03x | 0.97x | 0.97x | 0.71x |
| dsa (512 bit, verify) | 1.00x | 1.01x | 0.95x | 0.95x | 0.70x |
| dsa (1024 bit, sign) | 1.00x | 1.03x | 0.96x | 0.97x | 0.73x |
| dsa (1024 bit, verify) | 1.00x | 1.04x | 0.95x | 0.98x | 0.74x |
| dsa (2048 bit, sign) | 1.00x | 1.02x | 0.97x | 0.98x | 0.70x |
| dsa (2048 bit, verify) | 1.00x | 1.01x | 0.96x | 0.98x | 0.69x |
Full benchmarking output:
This section contains the raw output from running openssl speed
2>&1 > log during the benchmarking tests.
gcc: runs 1, 2, 3, 4, 5
ecc: runs 1, 2, 3, 4, 5
oicc (cspec): runs 1, 2, 3, 4, 5
oicc (nocspec): runs 1, 2, 3, 4, 5
RedHat 7.2 stock (0.9.6b): runs 1, 2, 3, 4, 5