Again, not an issue, just an implementation question.
I noticed that while you followed Bernstein and Schwabe's approach to use an unrolled version of Horner's method for Neon and AVX implementations, i.e.
((inp[0]*r^4+inp[2]*r^2+inp[4])*r^4+inp[6]*r^2
+ ((inp[1]*r^4+inp[3]*r^2+inp[5])*r^3+inp[7]*r
for AVX2 you instead use unperturbed Horner's method with the same power of r for all but tail blocks:
((inp[0]*r^4+inp[4])*r^4+inp[ 8])*r^4
+ ((inp[1]*r^4+inp[5])*r^4+inp[ 9])*r^3
+ ((inp[2]*r^4+inp[6])*r^4+inp[10])*r^2
+ ((inp[3]*r^4+inp[7])*r^4+inp[11])*r^1
This has ramification of performing reduction twice more often.
I was wondering why didn't you chose the same level of unrolling as for 128-bit vectors for 256-bit vectors, e.g. for l==24:
((m[0])*r^8 + m[4]*r^4 + m[8])*r^8 + m[12]*r^4 + m[16])*r^8 + m[20]*r^4
+ ((m[1])*r^8 + m[5]*r^4 + m[9])*r^8 + m[13]*r^4 + m[17])*r^7 + m[21]*r^3 \\
+ ((m[2])*r^8 + m[6]*r^4 + m[10])*r^8 + m[14]*r^4 + m[18])*r^6 + m[22]*r^2 \\
+ ((m[3])*r^8 + m[7]*r^4 + m[11])*r^8 + m[15]*r^4 + m[17])*r^5 + m[21]*r
This will come at an additional overhead of pre-calculating r^5-r^8, but should be quite more performant for the main loop, no?
I was just wondering if you tried this and it proved less efficient, or for some reason you decided to stick to the original Horner's method from scratch? Thanks!
Again, not an issue, just an implementation question.
I noticed that while you followed Bernstein and Schwabe's approach to use an unrolled version of Horner's method for Neon and AVX implementations, i.e.
for AVX2 you instead use unperturbed Horner's method with the same power of
rfor all but tail blocks:This has ramification of performing reduction twice more often.
I was wondering why didn't you chose the same level of unrolling as for 128-bit vectors for 256-bit vectors, e.g. for
l==24:This will come at an additional overhead of pre-calculating
r^5-r^8, but should be quite more performant for the main loop, no?I was just wondering if you tried this and it proved less efficient, or for some reason you decided to stick to the original Horner's method from scratch? Thanks!