    [Dev] NeoScrypt GPU Miner - Public Beta Test

    • W
      Wolf0 Regular Member last edited by

      Can he post a picture? Never saw 1100 GPU farm yet.

      Dunno if he wants people to know - I could ask, I guess, but we have more of a business relationship.

        RIPPEDDRAGON Regular Member last edited by

        I’m with wemineftc at the moment (although they’ve really pissed me & others off the past few days!) and the hashrate is 155Mh/s with 345 miners.

        I left them and I am not going back, join me on ftc.theblocksfactory.com or ftc.nut2pools.com

          cisahasa last edited by

          Can he post a picture? Never saw 1100 GPU farm yet.

          there is 2-4 of them… seen it

            A Former User last edited by

            … they have the potential to do some dark stuff if they wanted with that hashrate,

            I’m under the impression that ACP will keep things in order, so I wouldn’t worry too much.

              A Former User last edited by

              While it’s possible technically to design and produce ASICs for any algorithm, I don’t think they can implement NeoScrypt easily. Even if they do at some point of time, I can always release a 64-bit version of NeoScrypt incompatible with 32-bit devices or very slow there.

              So essentially, there won’t be ASICs until we feel that we are ready for them?

                Wolf0 Regular Member last edited by

                So essentially, there won’t be ASICs until we feel that we are ready for them?

                At the cost of changing the algorithm often, you can evade ASICs easily.

                  ghostlander Regular Member last edited by

                  So essentially, there won’t be ASICs until we feel that we are ready for them?

                  It isn’t very complicated to replace or add a PoW algorithm if it’s for the common good. If the block hashing algorithm stays the same (SHA-256d currently), it doesn’t matter to most crypto software. For example, there was no need to update Abe, a popular block explorer, after switching to NeoScrypt, because it doesn’t verify PoW hashes at all.

                    Wolf0 Regular Member last edited by

                    I want to know why on earth permuting the state for parallel chacha ONCE is slower than doing it dozens of times. This is nuts.

                      SpartanC001 Regular Member last edited by

                      I want to know why on earth permuting the state for parallel chacha ONCE is slower than doing it dozens of times. This is nuts.

                      When you shove it 1, it cries for a while and then does it, but if you feed it 20, it’s got no room to cry so it calculates them then cries after… jk, but I’d like to know as well, goes against logic

                        Wolf0 Regular Member last edited by

                        When you shove it 1, it cries for a while and then does it, but if you feed it 20, it’s got no room to cry so it calculates them then cries after… jk, but I’d like to know as well, goes against logic

                        Yeah, some manual inlining, unrolling, and memory optimizations are the difference between 130kh/s on a 270X and 300kh/s. That compiler is stupid.
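To make "manual unrolling" concrete, here is a minimal sketch in plain C rather than OpenCL. It assumes the standard ChaCha quarter round and a 20-round permutation (the kernel header in this thread mentions ChaCha20/20); the function names are made up for illustration and this is not Wolf0's kernel code. Both functions compute the same thing, the second simply spells out two double rounds per loop iteration instead of trusting the compiler to honour an unroll pragma. Whether it is actually faster depends entirely on the compiler and the GPU.

```c
#include <assert.h>
#include <stdint.h>

/* Rotate a 32-bit word left by n bits (0 < n < 32). */
static uint32_t rotl32(uint32_t x, unsigned n)
{
    assert(n > 0 && n < 32);
    return (x << n) | (x >> (32u - n));
}

/* Standard ChaCha quarter round on four words of the state. */
#define QR(s, a, b, c, d) do {                        \
    s[a] += s[b]; s[d] = rotl32(s[d] ^ s[a], 16);     \
    s[c] += s[d]; s[b] = rotl32(s[b] ^ s[c], 12);     \
    s[a] += s[b]; s[d] = rotl32(s[d] ^ s[a], 8);      \
    s[c] += s[d]; s[b] = rotl32(s[b] ^ s[c], 7);      \
} while (0)

/* One double round: four column rounds, then four diagonal rounds. */
#define DOUBLE_ROUND(s) do {                          \
    QR(s, 0, 4,  8, 12); QR(s, 1, 5,  9, 13);         \
    QR(s, 2, 6, 10, 14); QR(s, 3, 7, 11, 15);         \
    QR(s, 0, 5, 10, 15); QR(s, 1, 6, 11, 12);         \
    QR(s, 2, 7,  8, 13); QR(s, 3, 4,  9, 14);         \
} while (0)

/* Rolled: ten loop iterations, any unrolling is left to the compiler. */
void chacha20_permute_rolled(uint32_t s[16])
{
    for (int i = 0; i < 10; ++i)
        DOUBLE_ROUND(s);
}

/* Manually unrolled by two: five iterations, two double rounds each. */
void chacha20_permute_unrolled2(uint32_t s[16])
{
    for (int i = 0; i < 5; ++i) {
        DOUBLE_ROUND(s);
        DOUBLE_ROUND(s);
    }
}
```

Feeding the same 16-word state to both functions should give identical output; only the loop structure the compiler sees is different.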

                          slowhash Regular Member last edited by

                          I’ve been putting 60+ hour weeks at work and dealing with vehicle issues and work xmas parties, so I haven’t had much of a chance to follow neoscrypt development.

                          Has anyone anywhere compiled an SGminer for windoz that will work with wolf’s latest kernel for the 290/290x GPU’s?

                            kris_davison last edited by

                            Here? https://forum.feathercoin.com/index.php?/topic/8194-download-the-latest-neogpuminer-378-and-sgminer-501-git-here/#entry71386

                              slowhash Regular Member last edited by

                              Looking into it now, thank you. :D

                                slowhash Regular Member last edited by

                                Both of those builds won’t run on my rig at all. Noted in thread.

                                  SpartanC001 Regular Member last edited by

                                  Couldn’t get either of the builds to run on my R7 240 (Catalyst 14.12 Omega).

                                  Got cgminer working with Wolf’s latest kernel, but no change otherwise it seems: only 20 kh/s. I know even this pathetic chip can do better than that -_-

                                    Wolf0 Regular Member last edited by

                                    Couldn’t get either of the builds to run on my R7 240 (Catalyst 14.12 Omega).

                                    Got cgminer working with Wolf’s latest kernel, but no change otherwise it seems: only 20 kh/s. I know even this pathetic chip can do better than that -_-

                                    How much memory does it have? Try 14.6/14.7.

                                      Alpha Wolf last edited by

                                      Wow, no updates or posts since last year :-* :)

                                      How’s it going? Hope everyone had happy holidays.

                                        slowhash Regular Member last edited by

                                        There are people doing minor mods to the latest wolf kernel, and bumping the speed up just a tad. One decreased my speed by about 1.5%, the other increased by about 1.5%, but combined they gave me about 9 kh/s on my 290’s, roughly 2.5%.
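One of the tweaks in the kernel pasted below is in `XORBytesInPlace`, which picks the widest access the byte offset allows before XORing 32 bytes. A rough plain-C sketch of that idea follows. The helper names are hypothetical, and it uses `memcpy` for the wide loads and stores because portable C does not allow the pointer casts the OpenCL kernel gets away with, so treat it as an illustration of the dispatch rather than a copy of the kernel.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define XOR_LEN 32u  /* the kernel's XORBytesInPlace always covers 32 bytes */

/* Reference version: one byte at a time, valid for any offset. */
void xor32_bytes(uint8_t *dst, const uint8_t *src)
{
    for (unsigned i = 0; i < XOR_LEN; ++i)
        dst[i] ^= src[i];
}

/* Four bytes per step; memcpy keeps the accesses well defined in C. */
static void xor32_words(uint8_t *dst, const uint8_t *src)
{
    for (unsigned i = 0; i < XOR_LEN; i += 4) {
        uint32_t d, s;
        memcpy(&d, dst + i, sizeof d);
        memcpy(&s, src + i, sizeof s);
        d ^= s;
        memcpy(dst + i, &d, sizeof d);
    }
}

/* Two bytes per step, for offsets that are even but not a multiple of 4. */
static void xor32_halfwords(uint8_t *dst, const uint8_t *src)
{
    for (unsigned i = 0; i < XOR_LEN; i += 2) {
        uint16_t d, s;
        memcpy(&d, dst + i, sizeof d);
        memcpy(&s, src + i, sizeof s);
        d ^= s;
        memcpy(dst + i, &d, sizeof d);
    }
}

/* XOR src into buf starting at `offset`, choosing the step from offset % 4. */
void xor32_dispatch(uint8_t *buf, const uint8_t *src, unsigned offset)
{
    assert(buf != NULL && src != NULL);
    uint8_t *dst = buf + offset;
    switch (offset % 4u) {
    case 0:  xor32_words(dst, src);     break;
    case 2:  xor32_halfwords(dst, src); break;
    default: xor32_bytes(dst, src);     break;
    }
}
```

All three paths should produce the same bytes as the byte-wise reference; the only thing that changes is how many memory operations it takes to get there.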

                                        // NeoScrypt(128, 2, 1) with Salsa20/20 and ChaCha20/20

                                        // Stupid AMD compiler ignores the unroll pragma in these two
                                        #define SALSA_SMALL_UNROLL 3
                                        #define CHACHA_SMALL_UNROLL 3

                                        // If SMALL_BLAKE2S is defined, BLAKE2S_UNROLL is interpreted
                                        // as the unroll factor; must divide cleanly into ten.
                                        // Usually a bad idea.
                                        //#define SMALL_BLAKE2S
                                        //#define BLAKE2S_UNROLL 5

                                        #define BLOCK_SIZE 64U
                                        #define FASTKDF_BUFFER_SIZE 256U
                                        #ifndef PASSWORD_LEN
                                        #define PASSWORD_LEN 80U
                                        #endif

                                        #if !defined(cl_khr_byte_addressable_store)
                                        #error "Device does not support unaligned stores"
                                        #endif

                                        // Swaps 128 bytes at a time without using temp vars (XOR swap).
                                        // len is in bytes; A and B must not overlap.
                                        void SwapBytes128(void *restrict A, void *restrict B, uint len)
                                        {
                                            #pragma unroll 2
                                            for(int i = 0; i < (len >> 7); ++i)
                                            {
                                                ((ulong16 *)A)[i] ^= ((ulong16 *)B)[i];
                                                ((ulong16 *)B)[i] ^= ((ulong16 *)A)[i];
                                                ((ulong16 *)A)[i] ^= ((ulong16 *)B)[i];
                                            }
                                        }

                                        // Copies len 128-byte chunks (len counts ulong16 elements, not bytes)
                                        void CopyBytes128(void *restrict dst, const void *restrict src, uint len)
                                        {
                                            #pragma unroll 2
                                            for(int i = 0; i < len; ++i)
                                                ((ulong16 *)dst)[i] = ((ulong16 *)src)[i];
                                        }

                                        // Copies len bytes, one byte at a time
                                        void CopyBytes(void *restrict dst, const void *restrict src, uint len)
                                        {
                                            for(int i = 0; i < len; ++i)
                                                ((uchar *)dst)[i] = ((uchar *)src)[i];
                                        }

                                        //
                                        // a bit of byte alignment checking goes a long ways…
                                        // XORs 32 bytes of src into dst; mod selects the access width.
                                        //
                                        void XORBytesInPlace(void *restrict dst, const void *restrict src, uint mod)
                                        {
                                            switch(mod % 4)
                                            {
                                                case 0:
                                                    // 4 x uint2 = 32 bytes
                                                    #pragma unroll 2
                                                    for(int i = 0; i < 4; i+=2)
                                                    {
                                                        ((uint2 *)dst)[i] ^= ((uint2 *)src)[i];
                                                        ((uint2 *)dst)[i+1] ^= ((uint2 *)src)[i+1];
                                                    }
                                                    break;

                                                case 2:
                                                    // 16 x uchar2 = 32 bytes
                                                    #pragma unroll 8
                                                    for(int i = 0; i < 16; i+=2)
                                                    {
                                                        ((uchar2 *)dst)[i] ^= ((uchar2 *)src)[i];
                                                        ((uchar2 *)dst)[i+1] ^= ((uchar2 *)src)[i+1];
                                                    }
                                                    break;

                                                default:
                                                    // 32 x uchar = 32 bytes
                                                    #pragma unroll 8
                                                    for(int i = 0; i < 31; i+=4)
                                                    {
                                                        ((uchar *)dst)[i] ^= ((uchar *)src)[i];
                                                        ((uchar *)dst)[i+1] ^= ((uchar *)src)[i+1];
                                                        ((uchar *)dst)[i+2] ^= ((uchar *)src)[i+2];
                                                        ((uchar *)dst)[i+3] ^= ((uchar *)src)[i+3];
                                                    }
                                            }
                                        }

                                        // dst = src1 ^ src2 over len bytes, one byte at a time
                                        void XORBytes(void *restrict dst, const void *restrict src1, const void *restrict src2, uint len)
                                        {
                                            #pragma unroll 1
                                            for(int i = 0; i < len; ++i)
                                                ((uchar *)dst)[i] = ((uchar *)src1)[i] ^ ((uchar *)src2)[i];
                                        }

                                        // Blake2S

                                        #define BLAKE2S_BLOCK_SIZE 64U
                                        #define BLAKE2S_OUT_SIZE 32U
                                        #define BLAKE2S_KEY_SIZE 32U

                                        static const __constant uint BLAKE2S_IV[8] =
                                        {
                                            0x6A09E667, 0xBB67AE85, 0x3C6EF372, 0xA54FF53A,
                                            0x510E527F, 0x9B05688C, 0x1F83D9AB, 0x5BE0CD19
                                        };

                                        static const __constant uchar BLAKE2S_SIGMA[10][16] =
                                        {
                                            {  0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15 },
                                            { 14, 10,  4,  8,  9, 15, 13,  6,  1, 12,  0,  2, 11,  7,  5,  3 },
                                            { 11,  8, 12,  0,  5,  2, 15, 13, 10, 14,  3,  6,  7,  1,  9,  4 },
                                            {  7,  9,  3,  1, 13, 12, 11, 14,  2,  6,  5, 10,  4,  0, 15,  8 },
                                            {  9,  0,  5,  7,  2,  4, 10, 15, 14,  1, 11, 12,  6,  8,  3, 13 },
                                            {  2, 12,  6, 10,  0, 11,  8,  3,  4, 13,  7,  5, 15, 14,  1,  9 },
                                            { 12,  5,  1, 15, 14, 13,  4, 10,  0,  7,  6,  3,  9,  2,  8, 11 },
                                            { 13, 11,  7, 14, 12,  1,  3,  9,  5,  0, 15,  4,  8,  6,  2, 10 },
                                            {  6, 15, 14,  9, 11,  3,  0,  8, 12,  2, 13,  7,  1,  4, 10,  5 },
                                            { 10,  2,  8,  4,  7,  6,  1,  5, 15, 11,  9, 14,  3, 12, 13,  0 },
                                        };

                                        // BLAKE2s G function. OpenCL rotate() is a left rotate, so 16/20/24/25
                                        // correspond to the spec's right rotates by 16/12/8/7.
                                        #define BLAKE_G(idx0, idx1, a, b, c, d, key) do { \
                                            a += b + key[BLAKE2S_SIGMA[idx0][idx1]]; \
                                            d = rotate(d ^ a, 16U); \
                                            c += d; \
                                            b = rotate(b ^ c, 20U); \
                                            a += b + key[BLAKE2S_SIGMA[idx0][idx1 + 1]]; \
                                            d = rotate(d ^ a, 24U); \
                                            c += d; \
                                            b = rotate(b ^ c, 25U); \
                                        } while(0)

                                        void Blake2S(uint *restrict inout, const uint *restrict inkey)
                                        {
                                            uint16 V;
                                            uint8 tmpblock;

                                            // Load first block (IV into V.lo) and constants (IV into V.hi)
                                            V.lo = V.hi = vload8(0U, BLAKE2S_IV);

                                            // XOR with initial constant
                                            V.s0 ^= 0x01012020;

                                            // Copy input block for later
                                            tmpblock = V.lo;

                                            // XOR length of message so far (including this block)
                                            // There are two uints for this field, but high uint is zero
                                            V.sc ^= BLAKE2S_BLOCK_SIZE;

                                            // Compress state, using the key as the key
                                            #ifdef SMALL_BLAKE2S
                                            #pragma unroll BLAKE2S_UNROLL
                                            #else
                                            #pragma unroll
                                            #endif
                                            for(int x = 0; x < 10; ++x)
                                            {
                                                BLAKE_G(x, 0x00, V.s0, V.s4, V.s8, V.sc, inkey);
                                                BLAKE_G(x, 0x02, V.s1, V.s5, V.s9, V.sd, inkey);
                                                BLAKE_G(x, 0x04, V.s2, V.s6, V.sa, V.se, inkey);
                                                BLAKE_G(x, 0x06, V.s3, V.s7, V.sb, V.sf, inkey);
                                                BLAKE_G(x, 0x08, V.s0, V.s5, V.sa, V.sf, inkey);
                                                BLAKE_G(x, 0x0A, V.s1, V.s6, V.sb, V.sc, inkey);
                                                BLAKE_G(x, 0x0C, V.s2, V.s7, V.s8, V.sd, inkey);
                                                BLAKE_G(x, 0x0E, V.s3, V.s4, V.s9, V.se, inkey);
                                            }

                                            // XOR low part of state with the high part,
                                            // then with the original input block.
                                            V.lo ^= V.hi ^ tmpblock;

                                            // Load constants (IV into V.hi)
                                            V.hi = vload8(0U, BLAKE2S_IV);

                                            // Copy input block for later
                                            tmpblock = V.lo;

                                            // XOR length of message into block again
                                            V.sc ^= BLAKE2S_BLOCK_SIZE << 1;

                                            // Last block compression - XOR final constant into state
                                            V.se ^= 0xFFFFFFFFU;

                                            // Compress block, using the input as the key
                                            #ifdef SMALL_BLAKE2S
                                            #pragma unroll BLAKE2S_UNROLL
                                            #else
                                            #pragma unroll
                                            #endif
                                            for(int x = 0; x < 10; ++x)
                                            {
                                                BLAKE_G(x, 0x00, V.s0, V.s4, V.s8, V.sc, inout);
                                                BLAKE_G(x, 0x02, V.s1, V.s5, V.s9, V.sd, inout);
                                                BLAKE_G(x, 0x04, V.s2, V.s6, V.sa, V.se, inout);
                                                BLAKE_G(x, 0x06, V.s3, V.s7, V.sb, V.sf, inout);
                                                BLAKE_G(x, 0x08, V.s0, V.s5, V.sa, V.sf, inout);
                                                BLAKE_G(x, 0x0A, V.s1, V.s6, V.sb, V.sc, inout);
                                                BLAKE_G(x, 0x0C, V.s2, V.s7, V.s8, V.sd, inout);
                                                BLAKE_G(x, 0x0E, V.s3, V.s4, V.s9, V.se, inout);
                                            }

                                            // XOR low part of state with high part, then with input block
                                            V.lo ^= V.hi ^ tmpblock;

                                            // Store result in input/output buffer
                                            vstore8(V.lo, 0, inout);
                                        }

                                        /* FastKDF, a fast buffered key derivation function:
                                        * FASTKDF_BUFFER_SIZE must be a power of 2;
                                        * password_len, salt_len and output_len should not exceed FASTKDF_BUFFER_SIZE;
                                        * prf_output_size must be
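Two constants in the `Blake2S` function above are easy to misread, so here is a small plain-C check of where they appear to come from. It assumes the BLAKE2s parameter block layout from the BLAKE2 specification (digest length, key length, fanout, depth packed into the first word) and that OpenCL's `rotate` is a left rotate; the helper names are made up for illustration. With the kernel's `BLAKE2S_OUT_SIZE` and `BLAKE2S_KEY_SIZE` both 32, the first parameter word works out to the `0x01012020` that gets XORed into `V.s0`.

```c
#include <assert.h>
#include <stdint.h>

/* Left rotate, matching OpenCL's rotate(x, n) for 32-bit values. */
uint32_t rotl32(uint32_t x, unsigned n)
{
    assert(n > 0 && n < 32);
    return (x << n) | (x >> (32u - n));
}

/* Right rotate, the form the BLAKE2s specification uses in G. */
uint32_t rotr32(uint32_t x, unsigned n)
{
    assert(n > 0 && n < 32);
    return (x >> n) | (x << (32u - n));
}

/* First word of the BLAKE2s parameter block:
 * digest length | key length << 8 | fanout << 16 | depth << 24. */
uint32_t blake2s_param_word0(uint8_t digest_len, uint8_t key_len,
                             uint8_t fanout, uint8_t depth)
{
    return (uint32_t)digest_len
         | ((uint32_t)key_len << 8)
         | ((uint32_t)fanout << 16)
         | ((uint32_t)depth << 24);
}
```

So `rotate(x, 20U)` in `BLAKE_G` is the same operation as the spec's right rotate by 12, and likewise 24 and 25 stand in for 8 and 7.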

                                          Alpha Wolf last edited by

                                          There are people doing minor mods to the latest wolf kernel, and bumping the speed up just a tad. One decreased my speed by about 1.5%, the other increased by about 1.5%, but combined they gave me about 9 kh/s on my 290’s, roughly 2.5%.

                                          With my already tweaked sgminer.conf and your file/tweaks I increase from 134.5 Kh/s per R9 270 non X to 149.2 Kh/s per card.

                                          I have dual 270 non X in this system I tested with. Well done with your tweaking, thanks for sharing.
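For anyone comparing numbers, the figures quoted in this and the previous post can be sanity-checked with a couple of lines of C. This is only arithmetic on the values as posted; the function names are made up, and the roughly 360 kh/s baseline for the 290 is merely what "9 kh/s, roughly 2.5%" implies, not something slowhash stated.

```c
#include <assert.h>

/* Relative gain in percent when a hashrate moves from `before` to `after`. */
double percent_gain(double before, double after)
{
    assert(before > 0.0);
    return (after - before) / before * 100.0;
}

/* Baseline implied by an absolute gain that equals `percent` of it. */
double implied_baseline(double gain, double percent)
{
    assert(percent > 0.0);
    return gain * 100.0 / percent;
}
```

`percent_gain(134.5, 149.2)` comes out at a little under 11%, and `implied_baseline(9.0, 2.5)` gives 360 kh/s.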

                                          sgminer-5.1-dev-2014-11-10-win32

                                          ,
                                          "xintensity" : "4,4",
                                          "vectors" : "1,1",
                                          "worksize" : "64,64",
                                          "thread-concurrency" : "8192,8192",
                                          "gpu-engine" : "1100,1100",
                                          "gpu-memclock" : "1450,1450",
                                          
                                            slowhash Regular Member last edited by

                                            Nice to know that the interest in improving the kernel didn’t go away when Wolf said he was keeping his improvements to himself…

                                            I’m not in any way faulting him for that decision, but that doesn’t mean that I have to like it either… ;)

                                            BTW, I got into X11 mining, and guess who showed up as the top kernel writer… lol
