Specialized double-SHA256 with 64 byte inputs with SSE4.1 and AVX2 #13191

pull sipa wants to merge 7 commits into bitcoin:master from sipa:201709_dsha256_64 changing 16 files +1347 −204
  1. sipa commented at 6:36 PM on May 8, 2018: member

    This introduces a framework for specialized double-SHA256 with 64 byte inputs. 4 different implementations are provided:

    • Generic C++ (reusing the normal SHA256 code)
    • Specialized C++ for 64-byte inputs, but no special instructions
    • 4-way using SSE4.1 intrinsics
    • 8-way using AVX2 intrinsics

    On my own system (AVX2 capable), I get these benchmarks for computing the Merkle root of 9001 leaves (supported lengths / special instructions / parallellism):

    • 7.2 ms with varsize/naive/1way (master, non-SSE4 hardware)
    • 5.8 ms with size64/naive/1way (this PR, non-SSE4 capable systems)
    • 4.8 ms with varsize/SSE4/1way (master, SSE4 hardware)
    • 2.9 ms with size64/SSE4/4way (this PR, SSE4 hardware)
    • 1.1 ms with size64/AVX2/8way (this PR, AVX2 hardware)
  2. Benchmark Merkle root computation 0df017889b
  3. Refactor SHA256 code 57f34630fb
  4. sipa force-pushed on May 8, 2018
  5. sipa force-pushed on May 8, 2018
  6. sipa force-pushed on May 8, 2018
  7. sipa force-pushed on May 8, 2018
  8. sipa force-pushed on May 8, 2018
  9. sipa force-pushed on May 8, 2018
  10. sipa force-pushed on May 8, 2018
  11. sipa force-pushed on May 8, 2018
  12. laanwj added the label Validation on May 9, 2018
  13. laanwj commented at 1:18 PM on May 9, 2018: member

    Looks like these whole chains of functions are unused after this, except in the merkle tests:

    BlockMerkleBranch → ComputeMerkleBranch → MerkleComputation
    ComputeMerkleRootFromBranch
    

    Might want to move some functions there.

  14. in src/crypto/sha256_avx2.cpp:48 in 6904dc15cd outdated
      43 | +    __m256i t2 = Add(Sigma0(a), Maj(a, b, c));
      44 | +    d = Add(d, t1);
      45 | +    h = Add(t1, t2);
      46 | +}
      47 | +
      48 | +__m256i inline Read8(const unsigned char* chunk, int offset) {
    


    laanwj commented at 1:33 PM on May 9, 2018:

    Read8 and Write8 appear to read and write values respectively in opposite order, I suppose this is intentional?

    unrelated: I also wonder if this could be done with a parallel instruction instead of calling into ReadBE32/WriteBE32 for each component, as we know the host endianness.


    sipa commented at 3:28 PM on May 9, 2018:

    Fixed! Indeed, there exist byte-shuffle intrinsics for SSE4 and AVX2; I've used those instead of individual byteswaps.

  15. sipa force-pushed on May 9, 2018
  16. in src/bench/crypto_hash.cpp:57 in b36eac01fc outdated
      51 | @@ -52,6 +52,14 @@ static void SHA256_32b(benchmark::State& state)
      52 |      }
      53 |  }
      54 |  
      55 | +static void DSHA256_64b(benchmark::State& state)
      56 | +{
      57 | +    std::vector<uint8_t> in(64 * 1024,0);
    


    kristapsk commented at 5:06 PM on May 9, 2018:

    Shouldn't there be a space between "1024," and "0"?

  17. TheBlueMatt cross-referenced this on May 9, 2018 from issue Add POWER8 ASM for 4-way SHA256 by TheBlueMatt
  18. theuni commented at 7:54 PM on May 9, 2018: member

    Concept ACK! @sipa Please see the build-system comments on #13203, I didn't realize this was a different PR. I'll follow-up here for non-power changes.

  19. in src/crypto/sha256.cpp:500 in d473e1e0d0 outdated
     481 |      if (__get_cpuid(1, &eax, &ebx, &ecx, &edx) && (ecx >> 19) & 1) {
     482 |          Transform = sha256_sse4::Transform;
     483 |          TransformD64 = TransformD64Wrapper<sha256_sse4::Transform>;
     484 | -        assert(SelfTest(Transform));
     485 | -        return "sse4";
     486 | +#if defined(ENABLE_SSE41) && !defined(BUILD_BITCOIN_INTERNAL)
    


    theuni commented at 8:30 PM on May 9, 2018:

    Just because libbitcoinconsensus doesn't take advantage of 4way? Or some buildsystem limitation?


    sipa commented at 8:31 PM on May 9, 2018:

    Both.

    I didn't feel like creating copies of all the architecture-specialized libs for use within libbitcoinconsensus - especially as it doesn't benefit it.

  20. ghost commented at 6:35 AM on May 10, 2018: none

    Great job! Any plans to add support for AVX512?

    I know there are not so many people with AVX512 CPU's but why not add it from now?

    All Intel CPU's will have support for AVX512 in coming months (mainstream + server).

  21. ghost commented at 6:36 AM on May 10, 2018: none
  22. laanwj commented at 6:58 AM on May 10, 2018: member

    I know there are not so many people with AVX512 CPU's but why not add it from now?

    That is not how open source development works. A PR is for reviewing the code. Future improvements can be done in future PRs. For example @TheBlueMatt adds support for POWER8 instructions in #13203. I might add ARM intrinsics support at some point.

  23. sipa commented at 7:46 PM on May 11, 2018: member

    @laanwj The algorithm in MerkleComputation is actually potentially more efficient (better memory locality) than the one in ComputeMerkleRoot now, apart from the fact that it doesn't take advantage of multi-way hashing. I'd like to keep it around for a bit and see if I can adapt it to use multi-way instead, in which case it could be used for everything. I can also move it and move it back if used. What do you think? @Kick1986 AVX512 is cool, but it's low-impact (even machines that support it right now do it with reduced clock rate), and I'm unable to benchmark it. Feel free to add in follow-up work yoursel, though.

  24. ghost commented at 12:47 AM on May 12, 2018: none

    @sipa Thanks for answering me! You are right about AVX512 clock speed, it will go mainstream sometime 2020. I will do my best to add AVX512 to BTC code before that. Thanks

  25. sipa force-pushed on May 12, 2018
  26. sipa commented at 7:14 PM on May 12, 2018: member

    @laanwj I moved the unused Merkle branch functions to the test code. Also:

    I might add ARM intrinsics support at some point.

    :+1: That sounds far more impactful than POWER8, to be honest ;) @theuni I've addressed some of your build system comments from #13203, but left the ENABLE_AVX2 and ENABLE_SSE41 macros, because there isn't a clean platform independent way to use compiler defines. __AVX2__ exists in both GCC and MSVC, but there is no __SSE4_1__ in MSVC (and worse, there is no way to test for SSE4 at all there; you have to test whether the FP code is x87 based or SSE based). @kristapsk Ok, I've added a space.

  27. gmaxwell commented at 7:00 PM on May 15, 2018: contributor

    Concept ACK. Lightly tested ACK.

    We should open a issue to track doing a specialized 1-way 64-byte SSE4 function for this later as that is a pretty much guaranteed performance gain (as the non-specialized 1-way SSE4 that does more work is faster than the specialized non-SSE4 1-way code) which can be done by someone who knows assembly but knows little about Bitcoin.

    We might want to add a note to explain the specialization along the lines of: the 64-byte input (and 32-byte input in the second SHA256 invocation) mean that most of the input to the message expansion is zeros, which lets us drop out a lot of additions.

  28. sipa commented at 12:25 AM on May 17, 2018: member

    @gmaxwell Yes, perhaps. But the impact of that would be pretty low at best, as every system in which that optimized 1-way code can be used also supports 4-way 64-byte optimized code already; meaning such an implementation would only be used for up to 3 of the last hashes in each level of a Merkle tree.

  29. sipa commented at 7:29 PM on May 20, 2018: member

    @theuni Could you have a look at the build system changes again?

  30. theuni commented at 4:47 PM on May 29, 2018: member

    @sipa thanks for the fixups. utACK build-system changes.

  31. in src/consensus/merkle.cpp:133 in 0bc5ec3732 outdated
     130 | @@ -131,9 +131,23 @@ static void MerkleComputation(const std::vector<uint256>& leaves, uint256* proot
     131 |  }
     132 |  
     133 |  uint256 ComputeMerkleRoot(const std::vector<uint256>& leaves, bool* mutated) {
    


    theuni commented at 4:55 PM on May 29, 2018:

    Nit: pass leaves by value. It can be std::move'd from BlockMerkleRoot.


    sipa commented at 9:22 PM on May 29, 2018:

    Done.

  32. in src/crypto/sha256.cpp:64 in 57f34630fb outdated
     126 | -        Round(f, g, h, a, b, c, d, e, 0x8cc70208, w11 += sigma1(w9) + w4 + sigma0(w12));
     127 | -        Round(e, f, g, h, a, b, c, d, 0x90befffa, w12 += sigma1(w10) + w5 + sigma0(w13));
     128 | -        Round(d, e, f, g, h, a, b, c, 0xa4506ceb, w13 += sigma1(w11) + w6 + sigma0(w14));
     129 | -        Round(c, d, e, f, g, h, a, b, 0xbef9a3f7, w14 + sigma1(w12) + w7 + sigma0(w15));
     130 | -        Round(b, c, d, e, f, g, h, a, 0xc67178f2, w15 + sigma1(w13) + w8 + sigma0(w0));
     131 | +        Round(a, b, c, d, e, f, g, h, 0x428a2f98 + (w0 = ReadBE32(chunk + 0)));
    


    theuni commented at 5:31 PM on May 29, 2018:

    Since there's arithmetic involved now and not just a simple cast, please give the constants an 'ul' suffix just to be explicit.


    sipa commented at 9:22 PM on May 29, 2018:

    Fixed (also in SSE4 and AVX2 code).

  33. in src/crypto/sha256.cpp:158 in 0150c8d85e outdated
     153 | +    uint32_t g = 0x1f83d9abul;
     154 | +    uint32_t h = 0x5be0cd19ul;
     155 | +
     156 | +    uint32_t w0, w1, w2, w3, w4, w5, w6, w7, w8, w9, w10, w11, w12, w13, w14, w15;
     157 | +
     158 | +    Round(a, b, c, d, e, f, g, h, 0x428a2f98 + (w0 = ReadBE32(in + 0)));
    


    theuni commented at 5:32 PM on May 29, 2018:

    'ul' suffix for these as well.

  34. theuni approved
  35. theuni commented at 6:07 PM on May 29, 2018: member

    utACK, nothing major to complain about. Admittedly I'm relying on tests to check the new sha2 code itself, I only glanced. Very nice work :)

    Might it be helpful to add a non-double 4way/8way function as well? That would allow (for example) batched txid calculations where the first iteration is lazy and variable-sized as it is now, but 4way/8way could be used for the fixed-size second iterations.

  36. Specialized double sha256 for 64 byte inputs d0c9632883
  37. Use SHA256D64 in Merkle root computation 1f0e7ca09c
  38. 4-way SSE4.1 implementation for double SHA256 on 64-byte inputs 230294bf5f
  39. 8-way AVX2 implementation for double SHA256 on 64-byte inputs 4437d6e1f3
  40. [MOVEONLY] Move unused Merkle branch code to tests 4defdfab94
  41. sipa force-pushed on May 29, 2018
  42. sipa commented at 9:28 PM on May 29, 2018: member

    Addressed @theuni's nits.

    Also:

    Might it be helpful to add a non-double 4way/8way function as well? That would allow (for example) batched txid calculations where the first iteration is lazy and variable-sized as it is now, but 4way/8way could be used for the fixed-size second iterations.

    Yes, but it's significantly more complicated. You need to schedule multiple variable length things in groups of 64 bytes, interspersed with padding. Not going to do that in this PR. Also, @jl2012 suggested writing an optimized 32-byte-input single-SHA256 for use in the second half of double-SHA256 computations, which could help there as well.

  43. in src/bench/merkle_root.cpp:1 in 4defdfab94
       0 | @@ -0,0 +1,26 @@
       1 | +// Copyright (c) 2016 The Bitcoin Core developers
    


    Empact commented at 12:25 AM on May 30, 2018:

    2018

  44. in src/bench/merkle_root.cpp:5 in 4defdfab94
       0 | @@ -0,0 +1,26 @@
       1 | +// Copyright (c) 2016 The Bitcoin Core developers
       2 | +// Distributed under the MIT software license, see the accompanying
       3 | +// file COPYING or http://www.opensource.org/licenses/mit-license.php.
       4 | +
       5 | +#include "bench.h"
    


    Empact commented at 12:25 AM on May 30, 2018:

    #include <

  45. DesWurstes commented at 1:29 PM on June 2, 2018: contributor

    In the future, please don’t spend time coding AVX/AVX2 SHA256. SHA-NI specialized instruction set has SHA2 opcodes, which is the possible most efficient way. Besides, it was released before SSE4.2 and AVX, so there are compatible more processors.

    If anyone is interested: https://github.com/01org/isa-l_crypto/blob/master/sha256_mb/sha256_ni_x2.asm

    Thanks pwuille. I guess you're right.

  46. sipa commented at 2:21 PM on June 2, 2018: member

    @DesWurstes I'm perfectly capable of deciding for myself what I find interesting, thanks. Hardly any systems today support SHA-NI (only very recent low-power Intel CPUs, and AMD Ryzen), while AVX2 is available on all Intel chips since 2013 and AMD chips since 2015.

  47. laanwj commented at 3:41 PM on June 2, 2018: member

    @DesWurstes Apparently you didn't bother to read the posts at all before replying, as this came up before. People decide for themselves what to work on. If you think something else is more important, you can submit your own pull request.

  48. sipa cross-referenced this on Jun 3, 2018 from issue SHA256 implementations based on Intel SHA Extensions by sipa
  49. sipa commented at 11:42 PM on June 3, 2018: member
  50. laanwj commented at 10:08 AM on June 4, 2018: member

    utACK 4defdfab94504018f822dc34a313ad26cedc8255 Verified that FreeBSD+OpenBSD builds still pass.

  51. laanwj merged this on Jun 4, 2018
  52. laanwj closed this on Jun 4, 2018

  53. laanwj referenced this in commit 0de7cc848e on Jun 4, 2018
  54. sipa cross-referenced this on Jun 4, 2018 from issue Enable double-SHA256-for-64-byte code on 32-bit x86 by sipa
  55. Sjors cross-referenced this on Jun 5, 2018 from issue ARMv8 sha2 support by Sjors
  56. Bushstar cross-referenced this on Jun 6, 2018 from issue commits from bitcoin/master by Bushstar
  57. Empact cross-referenced this on Jun 6, 2018 from issue Break circular dependency: init -> * -> init by extracting shutdown.h by Empact
  58. droark commented at 6:44 PM on June 6, 2018: contributor

    After-the-fact tACK

    The code's running nicely on my AVX2-enabled machine. Good job!

    I do have one question. Is this going to affect Gitian builds for people who are on older machines? Maybe I'm missing something silly but it seems like this could cause mismatches. I remember that the SSE4-enabled SHA-256 code initially required --enable-experimental-asm and was eventually enabled by default.

    Thanks.

  59. sipa commented at 6:53 PM on June 6, 2018: member

    @droark It does require an AVX2 compatible compiler, but GCC 4.7 is sufficient for that (which we already required for release builds). You can perfectly well build AVX2 code even if your own hardware doesn't support AVX2.

  60. droark commented at 7:20 PM on June 6, 2018: contributor

    @sipa - Thanks. I had a bad understanding of compilers from long ago that I somehow never shook.

  61. sipa cross-referenced this on Jun 11, 2018 from issue Improve coverage of SHA256 SelfTest code by sipa
  62. sipa cross-referenced this on Jun 11, 2018 from issue Convert the 1-way SSE4 SHA256 code from asm to intrinsics by sipa
  63. laanwj referenced this in commit a607d23ae8 on Jun 12, 2018
  64. laanwj referenced this in commit 450055bdbd on Jun 18, 2018
  65. laanwj referenced this in commit 3a3eabef40 on Jul 9, 2018
  66. TheBlueMatt cross-referenced this on Jul 31, 2018 from issue Use crates from RustCrypto org instead of rust-crypto by newpavlov
  67. sipa cross-referenced this on Aug 8, 2018 from issue Assembly optimisations are compiled even with --disable-asm by luke-jr
  68. sipa added the label Needs release notes on Aug 14, 2018
  69. HashUnlimited cross-referenced this on Sep 11, 2018 from issue SHA256 implementations based on Intel SHA Extensions by HashUnlimited
  70. fanquake removed the label Needs release note on Mar 20, 2019
  71. rollmeister commented at 7:51 PM on April 13, 2019: contributor

    Can you please provide any documentation as to how you calculated the custom constants in Transform 2? I would like to adapt it for a 112 byte input. Could it be made to work if Transform 1 uses the state calculated from the first 64 bytes and the remaining 48 bytes (with appropriate padding) as the input message digest along with said calculated custom constants for Transform 2?

  72. codablock referenced this in commit 1fe1d86a06 on Oct 1, 2019
  73. codablock referenced this in commit 136765cfa8 on Oct 1, 2019
  74. codablock referenced this in commit 84194ace4a on Oct 1, 2019
  75. codablock referenced this in commit dda04a300e on Oct 1, 2019
  76. codablock referenced this in commit 5a23934df1 on Oct 1, 2019
  77. codablock referenced this in commit e44b6c7e58 on Oct 1, 2019
  78. codablock referenced this in commit 1df17c02e5 on Oct 1, 2019
  79. codablock referenced this in commit 1b2252c28c on Oct 1, 2019
  80. sickpig cross-referenced this on Oct 25, 2019 from issue [port] Specialized double-SHA256 with 64 byte inputs with SSE4.1 and AVX2 by sickpig
  81. barrystyle referenced this in commit 23187a4b35 on Jan 22, 2020
  82. barrystyle referenced this in commit b2c9a95587 on Jan 22, 2020
  83. barrystyle referenced this in commit 1eee3932e0 on Jan 22, 2020
  84. barrystyle referenced this in commit 9be0431b4c on Jan 22, 2020
  85. str4d cross-referenced this on Dec 6, 2021 from issue Backport SHA-256 assembly optimisations by str4d
  86. bitcoin locked this on Dec 16, 2021

github-metadata-mirror

This is a metadata mirror of the GitHub repository bitcoin/bitcoin. This site is not affiliated with GitHub. Content is generated from a GitHub metadata backup.
generated: 2026-05-20 06:55 UTC