Skip to content

Horner's method for POLY1305-AVX2 #18

@yapolyak

Description

@yapolyak

Again, not an issue, just an implementation question.

I noticed that while you followed Bernstein and Schwabe's approach to use an unrolled version of Horner's method for Neon and AVX implementations, i.e.

((inp[0]*r^4+inp[2]*r^2+inp[4])*r^4+inp[6]*r^2
+ ((inp[1]*r^4+inp[3]*r^2+inp[5])*r^3+inp[7]*r

for AVX2 you instead use unperturbed Horner's method with the same power of r for all but tail blocks:

((inp[0]*r^4+inp[4])*r^4+inp[ 8])*r^4
+ ((inp[1]*r^4+inp[5])*r^4+inp[ 9])*r^3
+ ((inp[2]*r^4+inp[6])*r^4+inp[10])*r^2
+ ((inp[3]*r^4+inp[7])*r^4+inp[11])*r^1

This has ramification of performing reduction twice more often.

I was wondering why didn't you chose the same level of unrolling as for 128-bit vectors for 256-bit vectors, e.g. for l==24:

((m[0])*r^8 + m[4]*r^4 + m[8])*r^8 + m[12]*r^4 + m[16])*r^8 + m[20]*r^4
+ ((m[1])*r^8 + m[5]*r^4 + m[9])*r^8 + m[13]*r^4 + m[17])*r^7 + m[21]*r^3 \\
+ ((m[2])*r^8 + m[6]*r^4 + m[10])*r^8 + m[14]*r^4 + m[18])*r^6 + m[22]*r^2 \\
+ ((m[3])*r^8 + m[7]*r^4 + m[11])*r^8 + m[15]*r^4 + m[17])*r^5 + m[21]*r

This will come at an additional overhead of pre-calculating r^5-r^8, but should be quite more performant for the main loop, no?

I was just wondering if you tried this and it proved less efficient, or for some reason you decided to stick to the original Horner's method from scratch? Thanks!

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions