 When using Montgomery reduction, some advantage comes from "fusing" the
 multiplication with the modular reduction and unrolling the loops.
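 To illustrate what "fusing" means here, the following is a minimal sketch of
 an interleaved Montgomery multiplication, where each partial product is
 followed immediately by one reduction step. It is not the code that
 fastest.c generates: the limb size (16 bits), the limb count (2, so R = 2^32)
 and every function name below are assumptions chosen to keep the example
 small, and the loops are left rolled for readability. It is also not
 constant-time.

```c
#include <assert.h>
#include <stdint.h>

#define NLIMBS    2        /* hypothetical: 2 limbs of 16 bits, so R = 2^32 */
#define LIMB_BITS 16
#define LIMB_MASK 0xFFFFu

/* Returns -m0^{-1} mod 2^16 for odd m0, using Newton iteration. */
static uint32_t neg_inv16(uint32_t m0)
{
    uint32_t x = m0;                 /* m0*m0 = 1 mod 8, so 3 bits are right */
    for (int i = 0; i < 4; i++)
        x = x * (2u - m0 * x);       /* each step doubles the correct bits */
    return (0u - x) & LIMB_MASK;
}

/* r = a*b*R^{-1} mod m, with the reduction interleaved into the multiply. */
static void monty_mul(uint32_t r[NLIMBS], const uint32_t a[NLIMBS],
                      const uint32_t b[NLIMBS], const uint32_t m[NLIMBS],
                      uint32_t nd)
{
    uint32_t t[NLIMBS + 2] = {0};
    for (int i = 0; i < NLIMBS; i++) {
        /* Multiplication step: t += a * b[i] */
        uint64_t s, carry = 0;
        for (int j = 0; j < NLIMBS; j++) {
            s = (uint64_t)t[j] + (uint64_t)a[j] * b[i] + carry;
            t[j] = (uint32_t)(s & LIMB_MASK);
            carry = s >> LIMB_BITS;
        }
        s = (uint64_t)t[NLIMBS] + carry;
        t[NLIMBS] = (uint32_t)(s & LIMB_MASK);
        t[NLIMBS + 1] = (uint32_t)(s >> LIMB_BITS);

        /* Reduction step: add q*m so the low limb becomes zero, then shift */
        uint32_t q = (t[0] * nd) & LIMB_MASK;
        s = (uint64_t)t[0] + (uint64_t)q * m[0];
        carry = s >> LIMB_BITS;
        for (int j = 1; j < NLIMBS; j++) {
            s = (uint64_t)t[j] + (uint64_t)q * m[j] + carry;
            t[j - 1] = (uint32_t)(s & LIMB_MASK);
            carry = s >> LIMB_BITS;
        }
        s = (uint64_t)t[NLIMBS] + carry;
        t[NLIMBS - 1] = (uint32_t)(s & LIMB_MASK);
        t[NLIMBS] = t[NLIMBS + 1] + (uint32_t)(s >> LIMB_BITS);
    }

    /* Final conditional subtraction: t < 2m at this point */
    uint32_t d[NLIMBS], borrow = 0;
    for (int j = 0; j < NLIMBS; j++) {
        uint32_t diff = t[j] - m[j] - borrow;   /* wraps mod 2^32 if negative */
        d[j] = diff & LIMB_MASK;
        borrow = diff >> 31;
    }
    int use_d = (t[NLIMBS] >= borrow);          /* true when t >= m */
    for (int j = 0; j < NLIMBS; j++)
        r[j] = use_d ? d[j] : t[j];
}

/* Convenience wrapper on 32-bit values: returns a*b*2^{-32} mod m. */
uint32_t monty_mul32(uint32_t a, uint32_t b, uint32_t m)
{
    assert(m & 1u);                  /* Montgomery reduction needs odd m */
    assert(a < m && b < m);
    uint32_t al[NLIMBS] = { a & LIMB_MASK, a >> LIMB_BITS };
    uint32_t bl[NLIMBS] = { b & LIMB_MASK, b >> LIMB_BITS };
    uint32_t ml[NLIMBS] = { m & LIMB_MASK, m >> LIMB_BITS };
    uint32_t rl[NLIMBS];
    monty_mul(rl, al, bl, ml, neg_inv16(ml[0]));
    return rl[0] | (rl[1] << LIMB_BITS);
}
```

 With m = 2^32 - 5, R mod m is 5 and R^2 mod m is 25, so
 monty_mul32(x, 25, m) converts x into Montgomery form and
 monty_mul32(y, 1, m) converts y back. Unrolling both loops for a fixed
 limb count is the part that a generated fragment would take further.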

 For example, for a 32-bit build using 256-bit BIGs and a base of 2^28
 with the NIST256 curve, replace XXX with 256_28 and YYY with NIST256 in
 fastest.c


 Then compile and run fastest.c as follows (using the MinGW port of GCC
 as an example), in the same directory as arch.h and fp_NIST256.h

 gcc -O2 -std=c99 fastest.c -o fastest.exe
 fastest > t.txt

 Now extract the code fragment from t.txt and insert it where indicated
 into fp_NIST256.c (look for FUSED_MODMUL)

 Then make sure that

 #define FUSED_MODMUL

 appears somewhere in fp_NIST256.h
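
 As a rough sketch of how such a define can switch between two code paths,
 consider the following. The function name, its signature and the
 used_fused flag are hypothetical and exist only for illustration; the real
 layout of fp_NIST256.c may differ.

```c
#include <assert.h>
#include <stdint.h>

/* In the library this define would live in fp_NIST256.h.
   Comment it out to fall back to the generic path. */
#define FUSED_MODMUL

/* Hypothetical stand-in that returns (a*b) mod m for small operands and
   reports which branch was compiled in. */
uint32_t demo_modmul(uint32_t a, uint32_t b, uint32_t m, int *used_fused)
{
    assert(m != 0);
#ifdef FUSED_MODMUL
    /* The fragment extracted from t.txt would be pasted at a point like this */
    *used_fused = 1;
#else
    /* Generic multiply-then-reduce path */
    *used_fused = 0;
#endif
    return (uint32_t)(((uint64_t)a * b) % m);
}
```

 Because the choice is made by the preprocessor, the module has to be
 recompiled after the define is added or removed.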

 Finally, recompile and replace the fp_YYY module in the library. This may
 give a speed-up of around 30%. If there is no significant improvement,
 don't use this method!

 NOTE: This method is experimental. It might affect numerical stability.