Tuesday, January 27, 2015

Hardware accelerated SHA-1

There are many articles out there describing hardware accelerated AES, which has been out for a few years now, but SHA instructions are not out yet, but should be appearing in ARM and x86 instruction sets sometime this year. So I figure it's time to start talking about them now. The first and formost standard for the SHA-1 algorithm is NIST's current FIPS-180 document, specifically section 6.1.2 "SHA-1 Hash Computation" part 3, which says:

T = ROTL5(a) + ft(b, c, d) + e + Kt + Wt

This formula (which I call the T-formula) is probably the most computationally intensive part of SHA-1, and one of the most surprising things about my research into the ARM NEON and x86 SHA instruction sets is that neither of them implement this formula exactly as is.

SHA-1 Instructions

sha1h aRotate left 30
sha1su0 a, b, csha1msg1 a, bMessage schedule part 1
sha1su1 a, bsha1msg2 a, bMessage schedule part 2
sha1c hash, e, msgsha1rnds4 hash, msg, 0Rounds 0 to 20
sha1p hash, e, msgsha1rnds4 hash, msg, 1Rounds 20 to 40
sha1m hash, e, msgsha1rnds4 hash, msg, 2Rounds 40 to 60
sha1p hash, e, msgsha1rnds4 hash, msg, 3Rounds 60 to 80
sha1nexte hash, msgInvisible e operand

As you can see, ARM and x86 are a bit different. From my research, it seems that ARM omits "Kt" from the T-formula, and x86 omits "e" from the T-formula. These implementation choices imply that SHA-1 optimizations on each architecture will use these instructions very differently. x86 has a publication documenting the SHA instruction set extensions, and if there are any conflicts with this article, then the final word is there. X86 omits "e" in the T-formula, so it must be calculated with the sha1nexte instruction. In pseudo-code, a representative sample of 4 SHA-1 rounds on x86 can be computed by:

work18 = sha1msg2(sha1msg1(work14, work15) ^ work16, work17);
hash19 = sha1rnds4(hash18, sha1nexte(hash17, work18), 3);

ARM on the other hand, has absolutely no documentation about their SHA instruction set, except how to determine if the processor supports it. While this is handy, it does very little to describe the inner workings of each instruction. So the following is just a guess on my part, but given that the inputs to (sha1c, sha1p, and sha1m) are two vectors, there is enough information to compute 4 rounds of SHA-1, just like the x86 sha1rnds4 instruction. ARM omits "Kt" in the T-formula, so it must be added to the work vector. In pseudo-code, a representative sample of 4 SHA-1 rounds on ARM can be computed by:

work18 = sha1su1(sha1su0(work14, work15, work16), work17);
hash19 = sha1p(hash18, sha1h(hash17[0]), work18 + SHA1_K3V);

where SHA1_K3V is a vector of 4 uint32's all of which are SHA1_K3.

For more information see also: sha1.rs