Tuesday 6 May 2014

conclusion for cxxtools

After a few days of research, I found out why the cxxtools build fails on aarch64, but even after changing the source code I can't build it successfully. Maybe I need to change the compiler command-line parameters to force it to ignore the warnings.

Conclusion for cxxtools:

Possible Optimization:
Not much. The cxxtools source code already includes assembly for the ARM architecture, so we don't need to modify it.

Building:

Even though I found the problem (see my last post), I can't solve it by modifying the source code; maybe I need to modify the build command's parameters. For now I can't build it successfully.
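If the warnings really are what blocks the build, another option besides changing the compiler flags for the whole project is to silence -Wsign-compare only around the affected code. This is a hedged sketch with a hypothetical helper (value_equals is not a cxxtools function), using GCC's diagnostic pragmas, which Clang also accepts:

```c
#include <wchar.h>

/* Hypothetical helper: suppress -Wsign-compare for this function only,
   rather than for the whole build. */
#pragma GCC diagnostic push
#pragma GCC diagnostic ignored "-Wsign-compare"
static int value_equals(unsigned int value, wchar_t ch) {
    return value == ch;   /* signedness of wchar_t depends on the ABI */
}
#pragma GCC diagnostic pop
```

The command-line route mentioned above would correspond to adding GCC's -Wno-sign-compare option to the compiler flags instead.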

Porting & Optimization (4) and conclusion for fossil


In my last post, I tested the C code and the assembly code on x86_64; now I will test them on an aarch64 machine.

I will use the same test program as in the last post, but change the assembly code so that it runs on aarch64.

//Rotation using C
//#define SHA_ROT(x,l,r) ((x) << (l) | (x) >> (r))
//#define rol(x,k) SHA_ROT(x,k,32-(k))
//#define ror(x,k) SHA_ROT(x,32-(k),k)

//Rotation using assembly under x86_64
//#define SHA_ROT(op, x, k) \
        ({ unsigned int y; asm(op " %1,%0" : "=r" (y) : "I" (k), "0" (x)); y; })
//#define rol(x,k) SHA_ROT("roll", x, k)
//#define ror(x,k) SHA_ROT("rorl", x, k)

//Rotation using assembly under aarch64
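// %w0/%w1/%w2 select 32-bit W registers (a plain %0 prints a 64-bit X register)
#define SHA_ROT(op, x, k) \
        ({ unsigned int y; asm(op " %w0,%w2,%w1" : "=r" (y) : "r" (k), "r" (x)); y; })
#define rol(x,k) SHA_ROT("ror", x, 32-(k))   // no rol on aarch64: rotate right by 32-k
#define ror(x,k) SHA_ROT("ror", x, k)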
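One thing to watch in an aarch64 rotate is register width: GCC prints an operand such as %0 as a 64-bit X register by default, so a 32-bit rotation needs the w modifier (%w0) to use the W registers. Otherwise the bits wrap at bit 63 instead of bit 31. A minimal sketch, assuming GCC-style inline assembly; ror32 and rol32 are hypothetical helper names, and non-aarch64 builds fall back to plain C:

```c
#include <stdint.h>

/* Hypothetical helper: 32-bit rotate right by k. On AArch64 this uses ROR
   (register form) on W registers, so the rotation wraps at bit 31.
   Other targets use a portable C expression. */
static inline uint32_t ror32(uint32_t x, unsigned int k) {
#if defined(__aarch64__) && defined(__GNUC__)
    uint32_t y;
    __asm__("ror %w0, %w1, %w2" : "=r"(y) : "r"(x), "r"(k));  /* amount is taken mod 32 */
    return y;
#else
    k &= 31;
    return (x >> k) | (x << ((32 - k) & 31));  /* masking avoids a shift by 32 */
#endif
}

/* Hypothetical helper: AArch64 has no ROL, so rotating left by k is
   rotating right by 32 - k. */
static inline uint32_t rol32(uint32_t x, unsigned int k) {
    return ror32(x, (32 - (k & 31)) & 31);
}
```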


Testing method:

Run 7 tests. In each test, run the program 2000 times and record the total time.

Drop the first test (it preloads the cache), the longest test and the shortest test, then average the user time of the remaining 4 tests.
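The averaging rule above can be written as a small C function. This is a sketch; trimmed_average is a hypothetical name, and it assumes the first run is the warm-up run and that there are at least 4 runs. The two arrays hold the user times from the logs in this post:

```c
/* Hypothetical helper for this post's averaging rule: ignore run 0 (it
   warms up the cache), then drop the single longest and the single
   shortest of the remaining runs and average the rest. Assumes n >= 4. */
static double trimmed_average(const double *times, int n) {
    double sum = 0.0;
    double shortest = times[1];
    double longest = times[1];
    for (int i = 1; i < n; i++) {
        sum += times[i];
        if (times[i] < shortest) shortest = times[i];
        if (times[i] > longest) longest = times[i];
    }
    return (sum - shortest - longest) / (n - 3);
}

/* User times (s) of the 7 runs of each version, copied from the logs. */
static const double c_user[7]   = {1.070, 0.960, 0.940, 0.980, 1.080, 1.110, 1.030};
static const double asm_user[7] = {1.180, 1.030, 1.010, 1.190, 1.030, 1.000, 1.010};
```

With these numbers it gives 1.0125 s for C and 1.02 s for assembly, the same as the hand calculation.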

Test results, C rotation (test_c.sh):

[root@localhost test]# time ./test_c.sh

real    0m16.349s
user    0m1.070s
sys    0m2.660s
[root@localhost test]# vi ./test_c.sh
[root@localhost test]# time ./test_c.sh

real    0m16.379s
user    0m0.960s
sys    0m2.760s
[root@localhost test]# time ./test_c.sh

real    0m16.479s
user    0m0.940s
sys    0m2.850s
[root@localhost test]# time ./test_c.sh

real    0m16.408s
user    0m0.980s
sys    0m2.760s
[root@localhost test]# time ./test_c.sh

real    0m16.506s
user    0m1.080s
sys    0m2.670s
[root@localhost test]# time ./test_c.sh

real    0m16.414s
user    0m1.110s
sys    0m2.620s
[root@localhost test]# time ./test_c.sh

real    0m16.410s
user    0m1.030s
sys    0m2.720s
[root@localhost test]#

Test results, aarch64 assembly rotation (test_arm.sh):

[root@localhost test]# time ./test_arm.sh

real    0m16.440s
user    0m1.180s
sys    0m2.570s

[root@localhost test]# time ./test_arm.sh

real    0m16.438s
user    0m1.030s
sys    0m2.720s
[root@localhost test]# time ./test_arm.sh

real    0m16.451s
user    0m1.010s
sys    0m2.750s
[root@localhost test]# time ./test_arm.sh

real    0m16.473s
user    0m1.190s
sys    0m2.580s
[root@localhost test]# time ./test_arm.sh

real    0m16.519s
user    0m1.030s
sys    0m2.800s
[root@localhost test]# time ./test_arm.sh

real    0m16.432s
user    0m1.000s
sys    0m2.780s
[root@localhost test]# time ./test_arm.sh1

real    0m16.441s
user    0m1.010s
sys    0m2.760s

C (average user time):
(0.98 + 0.96 + 1.08 + 1.03) / 4 = 1.0125 s

assembly (average user time):
(1.03 + 1.01 + 1.03 + 1.01) / 4 = 1.02 s


We can see that the two kinds of rotation perform almost the same (1.0125 s vs 1.02 s of user time), so the C code does not need to be converted to assembly.

Conclusion:
Build:
To build fossil on aarch64, you need to replace its config.guess file with the latest version,
available here:

http://git.savannah.gnu.org/gitweb/?p=config.git;a=blob_plain;f=config.guess;hb=HEAD

Optimization:
According to the test results, changing the rotation code to assembly does not improve performance significantly, so the modification is not necessary.

Friday 4 April 2014

Learning Porting to Aarch64: Fossil(3)

I looked into a potential optimization: adding an assembly rotate function for ARM machines.

#define SHA_ROT(op, x, k) \
        ({ unsigned int y; asm(op " %1,%0" : "=r" (y) : "I" (k), "0" (x)); y; })
#define rol(x,k) SHA_ROT("roll", x, k)
#define ror(x,k) SHA_ROT("rorl", x, k)

In the AArch64 instruction set, the rotate instructions are:

ROR Wd, Wm, #uimm
Rotate Right (immediate): alias for EXTR Wd,Wm,Wm,#uimm.
ROR Xd, Xm, #uimm
Rotate Right (extended immediate): alias for EXTR Xd,Xm,Xm,#uimm.
 
I tried the test on x86_64 first.


#include<stdio.h>
#include<stdlib.h>
#include <sys/types.h>


#define INT_BITS 32
#define TESTNUM 16

//original generic C code (the version used on aarch64)
//#define SHA_ROT(x,l,r) ((x) << (l) | (x) >> (r))
//#define rol(x,k) SHA_ROT(x,k,32-(k))
//#define ror(x,k) SHA_ROT(x,32-(k),k)


//under X86_64 original code
#define SHA_ROT(op, x, k) \
        ({ unsigned int y; asm(op " %1,%0" : "=r" (y) : "I" (k), "0" (x)); y; })
#define rol(x,k) SHA_ROT("roll", x, k)
#define ror(x,k) SHA_ROT("rorl", x, k)

char * bit_representation(unsigned int num) {
  char * bit_string = (char *)malloc(sizeof(char)*sizeof(unsigned int)*8+1);
  unsigned int i=1, j;
  for(i=i<<(sizeof(unsigned int)*8-1), j=0; i>0; i=i>>1, j++) {
    if(num&i) {
      *(bit_string+j)='1';
    } else {
      *(bit_string+j)='0';
    }
  }
  *(bit_string+j)='\0';
  return bit_string;
}

/* Driver program to test above functions */
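int main(void)
{
  /* volatile keeps every store (and the rotation feeding it) in the binary */
  volatile unsigned int display;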
  display = rol(TESTNUM, 2);
  display = rol(TESTNUM, 2);
  display = rol(TESTNUM, 2);
  display = rol(TESTNUM, 2);
  display = rol(TESTNUM, 2);
  display = rol(TESTNUM, 2);
  display = rol(TESTNUM, 2);
  display = rol(TESTNUM, 2);
  display = rol(TESTNUM, 2);
  display = rol(TESTNUM, 2);
  display = rol(TESTNUM, 2);

  display = ror(TESTNUM, 2);
  display = ror(TESTNUM, 2);
  display = ror(TESTNUM, 2);
  display = ror(TESTNUM, 2);
  display = ror(TESTNUM, 2);
  display = ror(TESTNUM, 2);
  display = ror(TESTNUM, 2);
  display = ror(TESTNUM, 2);
  display = ror(TESTNUM, 2);
  display = ror(TESTNUM, 2);
  display = ror(TESTNUM, 2);

  return 0;
}
I ran this test program 20000 times and recorded the time.

I repeated this 5 times, took the middle 3 runs (the 2nd to the 4th) and averaged their user times.
x86_64 inline assembly version:

real    0m7.057s
user    0m0.546s
sys    0m0.918s

real    0m7.065s
user    0m0.499s
sys    0m0.959s

real    0m7.049s
user    0m0.534s
sys    0m0.906s

real    0m7.101s
user    0m0.486s
sys    0m0.986s

real    0m7.069s
user    0m0.486s
sys    0m0.983s
Generic C version:

real    0m7.073s
user    0m0.569s
sys    0m0.857s

real    0m7.008s
user    0m0.549s
sys    0m0.856s

real    0m7.065s
user    0m0.528s
sys    0m0.897s

real    0m7.001s
user    0m0.568s
sys    0m0.833s

real    0m7.044s
user    0m0.549s
sys    0m0.862s

Result (average user time of the middle 3 runs):
x86_64 asm:
user: (0.499 + 0.534 + 0.486) / 3 = 0.50633 s

C:
user: (0.549 + 0.528 + 0.568) / 3 = 0.54833 s

On x86_64, the assembly rotation used about 7.66% less user time than the C rotation ((0.54833 - 0.50633) / 0.54833).

I will try to write the rotation in assembly for aarch64.
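Each timed run of the test program starts a new process that performs only 22 rotations, so process start-up probably dominates the measured time. A hedged sketch of an in-process alternative, assuming a POSIX system with clock_gettime; rol_c and time_rol_c are hypothetical names:

```c
#define _POSIX_C_SOURCE 199309L
#include <stdint.h>
#include <time.h>

/* Hypothetical portable rotate-left; k must be in 1..31. */
static inline uint32_t rol_c(uint32_t x, unsigned int k) {
    return (x << k) | (x >> (32 - k));
}

/* Hypothetical micro-benchmark: runs `iterations` dependent rotations inside
   one process and returns the elapsed seconds. The final value is written to
   *checksum, so the loop's result is used and cannot simply be dropped. */
static double time_rol_c(uint32_t seed, long iterations, uint32_t *checksum) {
    struct timespec start, end;
    uint32_t x = seed;

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (long i = 0; i < iterations; i++)
        x = rol_c(x, 5) ^ (uint32_t)i;   /* each step depends on the previous one */
    clock_gettime(CLOCK_MONOTONIC, &end);

    *checksum = x;
    return (double)(end.tv_sec - start.tv_sec)
         + (double)(end.tv_nsec - start.tv_nsec) / 1e9;
}
```

Timing the same loop with the assembly rol macro in place of rol_c, with a large iteration count, would compare the two rotations more directly; checking that both versions produce the same checksum also confirms that the assembly rotates correctly.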

Learning Porting to Aarch64: cxxtools(2)

 

Today I took a look at cxxtools, a set of C++ libraries that provides a lot of functionality.

First, download the source and run the prep step, so that everything is in the rpmbuild directory.

Looking at the source code, we can see that it has a separate file containing the assembly code for ARM.

So it looks like everything is already there?

When I build the project, it throws a lot of warnings:

/bin/sh ../libtool --tag=CXX   --mode=compile g++ -DHAVE_CONFIG_H -I.  -I../src -I../include -I../include -Wno-long-long -Wall -pedantic  -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector --param=ssp-buffer-size=4 -grecord-gcc-switches  -fno-stack-protector  -c -o csvformatter.lo csvformatter.cpp
libtool: compile:  g++ -DHAVE_CONFIG_H -I. -I../src -I../include -I../include -Wno-long-long -Wall -pedantic -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector --param=ssp-buffer-size=4 -grecord-gcc-switches -fno-stack-protector -c csvformatter.cpp  -fPIC -DPIC -o .libs/csvformatter.o
In file included from ../include/cxxtools/string.h:34:0,
                 from ../include/cxxtools/formatter.h:32,
                 from ../include/cxxtools/csvformatter.h:32,
                 from csvformatter.cpp:29:
../include/cxxtools/char.h: In function 'bool cxxtools::operator==(const cxxtools::Char&, wchar_t)':
../include/cxxtools/char.h:143:35: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
             { return a.value() == b; }
                                   ^
../include/cxxtools/char.h: In function 'bool cxxtools::operator==(wchar_t, const cxxtools::Char&)':
../include/cxxtools/char.h:145:35: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
             { return a == b.value(); }
                                   ^
../include/cxxtools/char.h: In function 'bool cxxtools::operator!=(const cxxtools::Char&, wchar_t)':
../include/cxxtools/char.h:156:35: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
             { return a.value() != b; }
                                   ^
../include/cxxtools/char.h: In function 'bool cxxtools::operator!=(wchar_t, const cxxtools::Char&)':
../include/cxxtools/char.h:158:35: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
             { return a != b.value(); }
                                   ^
../include/cxxtools/char.h: In function 'bool cxxtools::operator<(const cxxtools::Char&, wchar_t)':
../include/cxxtools/char.h:169:34: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
             { return a.value() < b; }
                                  ^
../include/cxxtools/char.h: In function 'bool cxxtools::operator<(wchar_t, const cxxtools::Char&)':
../include/cxxtools/char.h:171:34: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
             { return a < b.value(); }
                                  ^
......
 
And then the build gets stuck.

I was thinking about how to fix this until I looked at their latest patch:
....
-            else if (ch == L'\n' || ch == L'\r')
+            else if ( (int) ch == (int) L'\n' || (int) ch == (int) L'\r')
             {
                 log_debug("title=\"" << _titles.back() << '"');
                 _noColumns = 1;
-                _state = (ch == L'\r' ? state_cr : state_rowstart);
+                _state = ( (int) ch == (int) L'\r' ? state_cr : state_rowstart);
             }
-            else if (ch == L'\'' || ch == L'"')
+            else if ( (int) ch == (int) L'\'' || (int) ch == (int) L'"')
.....

They fixed a similar issue by explicitly casting both operands to int, which makes the comparison signed on both sides and removes the warning.

I will do the same thing in char.h and try building again.
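The casting fix can be illustrated in plain C. A likely reason the warning shows up on aarch64 is that wchar_t is an unsigned type on AArch64 Linux (it is signed on x86_64 Linux), so comparing it with a signed value mixes signedness. This sketch uses a hypothetical Char struct with a signed int value as a stand-in for cxxtools::Char; it is not the cxxtools code:

```c
#include <wchar.h>

/* Hypothetical stand-in for cxxtools::Char, holding a signed value. */
typedef struct {
    int value;
} Char;

/* Same idea as the upstream patch: cast both operands to int, so the
   comparison is signed on both sides and -Wsign-compare stays quiet. */
static int char_equals(Char a, wchar_t b) {
    return (int)a.value == (int)b;
}

static int char_less(Char a, wchar_t b) {
    return (int)a.value < (int)b;
}
```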
 

Thursday 3 April 2014

Learning Porting to Aarch64: Fossil(2)

These days I am learning about porting software to AArch64, and Fossil is one of the packages that cannot be built in an aarch64 environment.

Fossil is a distributed version control system like Git and Mercurial. Fossil also supports distributed bug tracking, a distributed wiki, and a distributed blog mechanism, all in a single integrated package.

I use the Foundation Model as the virtual aarch64 environment and the rpmbuild tools to build the software. The OS is Fedora 19.

1. Install all the tools needed for rpmbuild:
  • "Fedora Packager"
  • rpmdevtools
  • rpmlint
  • yum-utils
2. Download the source:

    fedpkg clone -a fossil
    cd fossil
    fedpkg srpm

3. Check the dependencies (run in the fossil directory):

    yum-builddep *.rpm

4. Prepare for rpmbuild by installing the source RPM (same directory as above):

    rpm -i *.rpm

5. Build it!

    cd ~/rpmbuild/SPECS/
    rpmbuild -ba fossil.spec

Issue: the build fails because autosetup's config.guess file cannot recognize the aarch64 machine.

Then I checked config.guess and found that its last modified date is 2010-09-24. Searching online, I found that the latest version was made on 2014-03-23, and that script supports aarch64.
This is the link:
http://git.savannah.gnu.org/gitweb/?p=config.git;a=blob_plain;f=config.guess;hb=HEAD
Then I replaced the config.guess file and built again.
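config.guess works out the machine type largely from what uname reports, which is why an old copy that predates aarch64 cannot match it. The machine string it sees can also be checked from C with uname(2); machine_name is a hypothetical helper:

```c
#include <stdio.h>
#include <sys/utsname.h>

/* Hypothetical helper: copy the machine name reported by the kernel (the
   same string `uname -m` prints, such as "aarch64" or "x86_64") into buf.
   Returns 0 on success and -1 on failure. */
static int machine_name(char *buf, size_t size) {
    struct utsname info;
    if (size == 0 || uname(&info) != 0)
        return -1;
    snprintf(buf, size, "%s", info.machine);
    return 0;
}
```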

The build succeeded:

Wrote: /root/rpmbuild/SRPMS/fossil-1.28-1.20140127173344.fc19.src.rpm
Wrote: /root/rpmbuild/RPMS/aarch64/fossil-1.28-1.20140127173344.fc19.aarch64.rpm
Wrote: /root/rpmbuild/RPMS/aarch64/fossil-doc-1.28-1.20140127173344.fc19.aarch64.rpm
Wrote: /root/rpmbuild/RPMS/aarch64/fossil-debuginfo-1.28-1.20140127173344.fc19.aarch64.rpm
Executing(%clean): /bin/sh -e /var/tmp/rpm-tmp.NTvRE9
+ umask 022
+ cd /root/rpmbuild/BUILD
+ cd fossil-src-20140127173344
+ /usr/bin/rm -fr /root/rpmbuild/BUILDROOT/fossil-1.28-1.20140127173344.fc19.aarch64
+ exit 0

Then it is time to look at the assembly code in the source to see if I can do something.

There is only one piece of inline assembly:

#define SHA_ROT(op, x, k) \
        ({ unsigned int y; asm(op " %1,%0" : "=r" (y) : "I" (k), "0" (x)); y; })
#define rol(x,k) SHA_ROT("roll", x, k)
#define ror(x,k) SHA_ROT("rorl", x, k)

#else
/* Generic C equivalent */
#define SHA_ROT(x,l,r) ((x) << (l) | (x) >> (r))
#define rol(x,k) SHA_ROT(x,k,32-(k))
#define ror(x,k) SHA_ROT(x,32-(k),k)
#endif


#define blk0le(i) (block[i] = (ror(block[i],8)&0xFF00FF00) \
    |(rol(block[i],8)&0x00FF00FF))
#define blk0be(i) block[i]
#define blk(i) (block[i&15] = rol(block[(i+13)&15]^block[(i+8)&15] \
    ^block[(i+2)&15]^block[i&15],1))

Obviously, there is no assembly version for aarch64, so on aarch64 the generic C code "((x) << (l) | (x) >> (r))" is used for this part.

AArch64 only supports ror (rotate right); there is no rol instruction. A left rotation by k is the same as a right rotation by 32-k, so we can try writing a rotate in assembly for aarch64, run it, and compare it with the C code to see which is faster.
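The rotations matter here because blk0le is applied to the input words on little-endian machines: it masks a right rotation and a left rotation by 8 bits and combines them, which reverses the byte order of the 32-bit word. A small sketch reusing the generic C macros from the excerpt; swap_bytes_rot is a hypothetical name:

```c
#include <stdint.h>

/* Generic C rotate macros, as in the fossil excerpt. */
#define SHA_ROT(x,l,r) ((x) << (l) | (x) >> (r))
#define rol(x,k) SHA_ROT(x,k,32-(k))
#define ror(x,k) SHA_ROT(x,32-(k),k)

/* Hypothetical helper: blk0le's expression applied to a single word.
   Bytes 3 and 1 come from the right rotation, bytes 2 and 0 from the left
   rotation, so the byte order is reversed (0x11223344 -> 0x44332211). */
static uint32_t swap_bytes_rot(uint32_t x) {
    return (ror(x, 8) & 0xFF00FF00u) | (rol(x, 8) & 0x00FF00FFu);
}
```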

Hua



Thursday 13 March 2014

about fossil(1)

Fossil is another piece of software I am studying. It is a distributed version control system, bug tracking system and wiki server for software development, similar to Git.

For now I just want to build it on my x86_64 machine first.

After downloading it and generating the source code, I took a look at the whole directory.
A configure file again! That makes life easy! But let me take a look at BUILD.txt first.
Nothing special, so let's begin building.

  mkdir build
  cd build
  ../configure

Then I ran make, but got an error because sqlite3.c was missing.
The official website says it is included in src, but I could not find it there, so I downloaded a copy separately.





Now it is built.


fossil study(1) is done.

Hua

about cxxtools(1)

These days I have spent some time looking at the cxxtools library. It is a library written in C++ that provides classes for serialization, unicode text, multithreading, networking, RPC, an HTTP client and server, XML, logging and much more. What I want to do is build and test this library, and find out whether any code needs to be modified when porting it to aarch64. My environment is Fedora Linux on an x86_64 CPU.

The first step is downloading the source code. The command is:
          fedpkg clone -a cxxtools

When the download finishes, go to the cxxtools directory and run:
          fedpkg prep

Next, go to the cxxtools-2.2.1 directory and look for assembly code as a first step (to see whether anything needs to be modified when porting to aarch64).

Under the src directory, I found many .cpp files for atomicity functions, all written in assembly language. I think that is probably what I should focus on.
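For comparison, C11's <stdatomic.h> offers portable atomic operations that the compiler maps to suitable instructions on each architecture, including aarch64. This is only a sketch of that alternative, not how cxxtools implements its atomicity functions; counter_increment, counter_decrement and counter_cas are hypothetical names:

```c
#include <stdatomic.h>

/* Hypothetical portable helpers (not cxxtools' API) built on C11 atomics. */
static int counter_increment(atomic_int *value) {
    return atomic_fetch_add(value, 1) + 1;   /* fetch_add returns the old value */
}

static int counter_decrement(atomic_int *value) {
    return atomic_fetch_sub(value, 1) - 1;
}

/* Stores `desired` if *value == expected; returns the value seen before. */
static int counter_cas(atomic_int *value, int desired, int expected) {
    /* On failure, atomic_compare_exchange_strong writes the current value
       into `expected`; on success `expected` already equals the old value. */
    atomic_compare_exchange_strong(value, &expected, desired);
    return expected;
}
```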

All the code that uses assembly:

I'll leave it for now; I want to build it on the x86_64 machine first.

I read the README and INSTALL files and learned that this library's author uses automake and autoconf for development. A configure script is provided for configuring and building, so I ran ./configure and then make, and it started to build.

After 10 minutes (so long...), it was done, and now I have cxxtools built on my x86_64 machine.

I will continue studying about it.

Hua

Monday 10 March 2014

What is the meaning of do{...} while(0)

    Sometimes we see C code using this pattern: do{...} while(0). It looks like it does nothing except run everything in the braces once. When I saw this code for the first time, I also felt it was weird. Then I did some research and wrote this post. I hope it helps you.

    This pattern is mainly used in C and C++ macros, when you want a #define to expand to multiple statements. A simple example looks like this:

#define function(x)  do { a(x); b(x); }while (0)

So we can use it like this, even in an if statement without braces:

if (some_condition)
    function(x);
else
    blah blah....

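     Why do we have to use this pattern? When you use #define to represent multiple statements,
this version doesn't work:
#define function(x)  foo(x); bar(x)   // only foo(x) belongs to the if, so the else causes a syntax error

this version still doesn't work:
#define function(x)  { foo(x); bar(x); } // the ';' after function(x) ends the if, so the else causes a syntax error
 
Only this pattern works:
#define function(x)  do { foo(x); bar(x); } while (0)
It works well in the if statement, because do { ... } while (0) is a single statement that needs exactly one trailing semicolon, which the caller supplies.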
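A compilable sketch of the working version; foo, bar and run are hypothetical functions, and here the macro takes a pointer to a running total so that its effect can be checked:

```c
/* Hypothetical helpers that add to a running total. */
static void foo(int *total, int x) { *total += x; }
static void bar(int *total, int x) { *total += 10 * x; }

/* Expands to exactly one statement; the caller's ';' completes it. */
#define function(total, x) do { foo((total), (x)); bar((total), (x)); } while (0)

static int run(int condition) {
    int total = 0;
    if (condition)
        function(&total, 1);   /* both calls belong to the if branch */
    else
        total = -1;            /* this else still pairs with the if above */
    return total;
}
```

run(1) executes both foo and bar, and run(0) takes the else branch, which would not even compile with the two broken macro versions.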

That is what I learned about this pattern. I hope it helps you.

Hua

Sunday 2 February 2014

Write a simple program with Assembler language

Assembler Lab

I will write a simple program that prints "Loop:" followed by the loop index, from 0 to 30, one line per iteration.

1. This is my code for the x86_64 processor, with comments:

.text
.globl _start
start = 0                               /* starting value for the loop index */
max = 31                              /* loop exits when loop condition is i<max */
_start:
    mov     $start,%r15          /* loop index */
loop:
        mov     $0,%rdx            /* assign value 0 to rdx (refresh the rdx) */
        mov     %r15,%rax       /* move the loop index to rax (for later divide operation) */
        mov     $10,%r10          /* move value 10 to r10 register */
        div     %r10                   /* divide rax by r10 (10), places quotient into rax, */
                                             /* remainder into rdx, rdx must be zero before this step */
        mov     %rax,%r14       /* move the quotient to r14 */
        mov     %rdx,%r13       /* move the remainder to r13 */
        mov     $'0',%r12          /* assign ascii '0' to r12 */
        cmp     $0,%r14             /* compare the quotient with zero */
        je      skip                      /* skip if equal */
        add     %r12,%r14         /* modify the quotient to become the ascii character, */
                                              /* the character will be put into the msg later */
        movq    $high,%rsi        /* location the quotient will be added into */
        movb    %r14b,(%rsi)    /* move the quotient to msg */
skip:
        add     %r12,%r13         /* modify the remainder to become the ascii character */
        movq    $low,%rsi          /* location the remainder will be added into */
        movb    %r13b,(%rsi)    /* move the remainder to msg */

        movq    $len,%rdx         /* message length */
        movq    $msg,%rsi         /* message location */
        movq    $1,%rdi              /* file descriptor stdout */
        movq    $1,%rax             /* syscall sys_write */
        syscall

    inc     %r15                      /* increment index */
    cmp     $max,%r15           /* see if we're done */
    jne     loop                        /* loop if we're not */
    mov     $0,%rdi                /* exit status */
    mov     $60,%rax              /* syscall sys_exit */
    syscall
.data
msg:    .ascii      "Loop:   \n" /* the content of msg  */
len = . - msg                          /* length of msg */
high = . - 3                            /* define variable high */
low = . - 2                              /* define variable low */



2. I also wrote the code for the aarch64 processor:


.text
.globl _start
_start:
        mov             x28, 0                     /* move 0 to x28, loop index */
loop:
                mov     x27,x28                  /* move x28 to x27 for later use */
                mov     x26,'0'                    /* move ascii '0' to x26 */
                mov     x25,10                    /* move 10 to x25 for division operation */

                udiv    x24, x27, x25           /* x27 (loop index) divides by x25 (10), */

                                                          /* quotient is in the x24 */
                msub    x23, x24, x25, x27  /* (x27 - x24 * x25) will be stored in x23, */

                                                          /*which is the remainder of x27 divides by x25 */
                adr     x20,msg                   /* get the location of msg */
                cmp     x24,0                       /* compare the quotient with 0 */
                beq     skip                         /* go to skip if equal */
                add     x24, x24, x26            /* add x26 to x24,  */

                                                          /* which make the quotient become ascii value */
                strb    w24,[x20,6]               /* move the quotient ascii value to the msg offset 6 */
skip:
                add     x23, x23, x26            /* make the remainder become ascii value */
                strb    w23,[x20,7]               /* move the remainder ascii value to msg offset 7 */

                mov     x0, 1                        /* file descriptor: 1 is stdout */
                adr     x1, msg                     /* message location (memory address) */
                mov     x2, len                     /* message length (bytes) */
                mov     x8, 64                      /* write is syscall #64 */
                svc     0                               /* invoke syscall */

        add     x28, x28, 1                        /* add 1 to loop index */
        cmp     x28, 31                             /* compare index with 31 */
        bne     loop                                  /* go to loop if not equal */
        mov     x0, 0                                /* status -> 0 */
        mov     x8, 93                              /* exit is syscall #93 */
        svc     0                                       /* invoke syscall */

.data
msg:    .ascii      "Loop:   \n"               /* define msg */
len = . - msg                                        /* length of msg */
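Both programs build each line the same way: divide the loop index by 10, turn the quotient and remainder into ASCII digits by adding '0', and write them into offsets 6 and 7 of msg, skipping the quotient when it is zero. The same logic in C, as a reference sketch (format_loop_line is a hypothetical name):

```c
#include <string.h>

/* Hypothetical helper: builds the same 9-byte line as the assembly
   programs ("Loop:" plus two digit positions and '\n').
   buf must have room for 10 chars, including the terminating '\0'. */
static void format_loop_line(unsigned int index, char buf[10]) {
    memcpy(buf, "Loop:   \n", 10);        /* template, like msg in .data */
    unsigned int quotient = index / 10;    /* div on x86_64, udiv on aarch64 */
    unsigned int remainder = index % 10;   /* rdx on x86_64, msub on aarch64 */
    if (quotient != 0)                     /* same as "je skip" / "beq skip" */
        buf[6] = (char)('0' + quotient);
    buf[7] = (char)('0' + remainder);
}
```

Calling it for index 0 to 30 and writing each buffer to stdout gives the same output as the assembly programs.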