Learning to Port to AArch64: Fossil (3)
I did some research into a potential optimization: adding rotate helpers for the ARM machine. The existing x86_64 code uses the following inline-assembly rotate macros:
#define SHA_ROT(op, x, k) \
({ unsigned int y; asm(op " %1,%0" : "=r" (y) : "I" (k), "0" (x)); y; })
#define rol(x,k) SHA_ROT("roll", x, k)
#define ror(x,k) SHA_ROT("rorl", x, k)
On AArch64, the corresponding instruction is ROR:
ROR Wd, Wm, #uimm
    Rotate Right (immediate): alias for EXTR Wd, Wm, Wm, #uimm.
ROR Xd, Xm, #uimm
    Rotate Right (immediate), 64-bit form: alias for EXTR Xd, Xm, Xm, #uimm.
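One thing worth noting (my own observation): A64 only provides a rotate-right instruction; there is no rotate-left, so a left rotate by k has to be expressed as a right rotate by 32 - k. A minimal plain-C sanity check of that equivalence:

#include <assert.h>
#include <stdint.h>

/* Valid for k in 1..31; k == 0 would shift by 32, which is undefined in C. */
static uint32_t rol32(uint32_t x, unsigned k) { return (x << k) | (x >> (32 - k)); }
static uint32_t ror32(uint32_t x, unsigned k) { return (x >> k) | (x << (32 - k)); }

int main(void)
{
    assert(rol32(0x12345678u, 2) == ror32(0x12345678u, 30));   /* left 2 == right 30 */
    return 0;
}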
First, I tried a test on x86_64:
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>

#define INT_BITS 32
#define TESTNUM 16

// Portable C rotate macros (what the plain-C / aarch64 build uses today):
//#define SHA_ROT(x,l,r) ((x) << (l) | (x) >> (r))
//#define rol(x,k) SHA_ROT(x,k,32-(k))
//#define ror(x,k) SHA_ROT(x,32-(k),k)

// x86_64 inline-assembly rotate macros:
#define SHA_ROT(op, x, k) \
  ({ unsigned int y; asm(op " %1,%0" : "=r" (y) : "I" (k), "0" (x)); y; })
#define rol(x,k) SHA_ROT("roll", x, k)
#define ror(x,k) SHA_ROT("rorl", x, k)
char *bit_representation(unsigned int num) {
    char *bit_string = (char *)malloc(sizeof(char) * sizeof(unsigned int) * 8 + 1);
    unsigned int i = 1, j;
    for (i = i << (sizeof(unsigned int) * 8 - 1), j = 0; i > 0; i = i >> 1, j++) {
        if (num & i) {
            *(bit_string + j) = '1';
        } else {
            *(bit_string + j) = '0';
        }
    }
    *(bit_string + j) = '\0';
    return bit_string;
}
/* Driver program to test above functions */
int main()
{
    unsigned int display;   /* holds the rotated value; not printed in the timed run */

    display = rol(TESTNUM, 2);
    display = rol(TESTNUM, 2);
    display = rol(TESTNUM, 2);
    display = rol(TESTNUM, 2);
    display = rol(TESTNUM, 2);
    display = rol(TESTNUM, 2);
    display = rol(TESTNUM, 2);
    display = rol(TESTNUM, 2);
    display = rol(TESTNUM, 2);
    display = rol(TESTNUM, 2);
    display = rol(TESTNUM, 2);
    display = ror(TESTNUM, 2);
    display = ror(TESTNUM, 2);
    display = ror(TESTNUM, 2);
    display = ror(TESTNUM, 2);
    display = ror(TESTNUM, 2);
    display = ror(TESTNUM, 2);
    display = ror(TESTNUM, 2);
    display = ror(TESTNUM, 2);
    display = ror(TESTNUM, 2);
    display = ror(TESTNUM, 2);
    display = ror(TESTNUM, 2);

    (void)display;          /* silence set-but-unused warnings */
    return 0;
}
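The program only exercises the macros; nothing is ever printed and bit_representation() is never called. For a quick correctness check outside the timed run (printing would dominate the timing), something like the following could be added to main(); this is my own illustration, not part of the benchmark:

    unsigned int r = rol(TESTNUM, 2);
    char *bits = bit_representation(r);
    printf("rol(%d, 2) = %u = 0b%s\n", TESTNUM, r, bits);   /* 16 rotated left by 2 is 64 */
    free(bits);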
I ran this test program 20,000 times and recorded the elapsed time. I repeated the whole measurement 5 times, kept the 3 middle runs, and averaged their user times:
Under the x86_64 assembly version:
real 0m7.057s
user 0m0.546s
sys 0m0.918s
real 0m7.065s
user 0m0.499s
sys 0m0.959s
real 0m7.049s
user 0m0.534s
sys 0m0.906s
real 0m7.101s
user 0m0.486s
sys 0m0.986s
real 0m7.069s
user 0m0.486s
sys 0m0.983s
Under the portable C version:
real 0m7.073s
user 0m0.569s
sys 0m0.857s
real 0m7.008s
user 0m0.549s
sys 0m0.856s
real 0m7.065s
user 0m0.528s
sys 0m0.897s
real 0m7.001s
user 0m0.568s
sys 0m0.833s
real 0m7.044s
user 0m0.549s
sys 0m0.862s
Result:
x86_64 asm:
    user: (0.499 + 0.534 + 0.486) / 3 = 0.50633 s
C:
    user: (0.549 + 0.528 + 0.568) / 3 = 0.54833 s
So under x86_64, the assembly rotation is about 7.66% faster than the C rotation: (0.54833 - 0.50633) / 0.54833 ≈ 7.66%.
Next, I will try to create the assembly code for rotating under AArch64.
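As a rough sketch of what that might look like (this is only a draft under my assumptions, not the final patch; in particular the #if guard reflects how I expect the per-architecture macros to be selected):

#if defined(__GNUC__) && defined(__aarch64__)
/* A64 has no rotate-left instruction, so rol(x,k) becomes ror(x, 32-k). */
#define SHA_ROR(x,k) \
  ({ unsigned int y, s = (k); \
     asm("ror %w0, %w1, %w2" : "=r" (y) : "r" (x), "r" (s)); y; })
#define ror(x,k) SHA_ROR(x, (k))
#define rol(x,k) SHA_ROR(x, 32 - (k))
#else
/* Portable C fallback. */
#define SHA_ROT(x,l,r) ((x) << (l) | (x) >> (r))
#define rol(x,k) SHA_ROT(x,k,32-(k))
#define ror(x,k) SHA_ROT(x,32-(k),k)
#endif

The register form (ror Wd, Wn, Wm) is used here instead of the immediate form so the macro also works when the rotate amount is not a compile-time constant; whether that is actually faster than what the compiler generates for the plain-C macros still needs to be measured on the AArch64 machine.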