Horner's method for POLY1305-AVX2

Again, not an issue, just an implementation question.

I noticed that while you followed Bernstein and Schwabe's approach to use an unrolled version of Horner's method for Neon and AVX implementations, i.e.
```
((inp[0]*r^4+inp[2]*r^2+inp[4])*r^4+inp[6]*r^2
+ ((inp[1]*r^4+inp[3]*r^2+inp[5])*r^3+inp[7]*r
```
for AVX2 you instead use unperturbed Horner's method with the same power of `r` for all but tail blocks:
```
((inp[0]*r^4+inp[4])*r^4+inp[ 8])*r^4
+ ((inp[1]*r^4+inp[5])*r^4+inp[ 9])*r^3
+ ((inp[2]*r^4+inp[6])*r^4+inp[10])*r^2
+ ((inp[3]*r^4+inp[7])*r^4+inp[11])*r^1
```
This has ramification of performing reduction twice more often. 

I was wondering why didn't you chose the same level of unrolling as for 128-bit vectors for 256-bit vectors, e.g. for `l==24`:
```
((m[0])*r^8 + m[4]*r^4 + m[8])*r^8 + m[12]*r^4 + m[16])*r^8 + m[20]*r^4
+ ((m[1])*r^8 + m[5]*r^4 + m[9])*r^8 + m[13]*r^4 + m[17])*r^7 + m[21]*r^3 \\
+ ((m[2])*r^8 + m[6]*r^4 + m[10])*r^8 + m[14]*r^4 + m[18])*r^6 + m[22]*r^2 \\
+ ((m[3])*r^8 + m[7]*r^4 + m[11])*r^8 + m[15]*r^4 + m[17])*r^5 + m[21]*r
```
 This will come at an additional overhead of pre-calculating `r^5`-`r^8`, but should be quite more performant for the main loop, no?

I was just wondering if you tried this and it proved less efficient, or for some reason you decided to stick to the original Horner's method from scratch? Thanks!

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Horner's method for POLY1305-AVX2 #18

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Horner's method for POLY1305-AVX2 #18

Description

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions