Project Stage 3 - Optimization - Re-Visited

So, while testing different ways to optimize the program with compiler options, I realized that there were flaws in the method I used to benchmark and compare the different builds.

The initial timings were taken while running under perf, so they included profiling overhead, which most likely affected the measured execution time of the program.
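To illustrate what timing without a profiler attached could look like, here is a minimal sketch in Python. It is not the tool used in this post (that is multitime, shown further down). The function name `time_command` and the stand-in command are my own assumptions for illustration.

```python
import subprocess
import sys
import time


def time_command(cmd, runs=5):
    """Run cmd `runs` times and return the wall-clock seconds of each run."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        # Discard stdout, similar to redirecting the output to /dev/null.
        subprocess.run(cmd, stdout=subprocess.DEVNULL, check=True)
        timings.append(time.perf_counter() - start)
    return timings


if __name__ == "__main__":
    # Stand-in command so the sketch runs anywhere; for a real measurement
    # it would be something like ["./bzip2", "-c", "file"].
    print(time_command([sys.executable, "-c", "pass"], runs=3))
```

Because nothing else is attached to the process, the numbers contain only the program's own run time plus process start-up cost.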

Also, on top of the compiler options I had already come across (-fmerge-all-constants, -floop-parallelize-all, -ftree-loop-distribution), I found a few more options:
-Ofast
**-Ofast enables all -O3 optimizations plus others that disregard strict standards compliance (for example, it turns on -ffast-math). There is some risk in using this option, because some of the enabled optimizations may not be safe for, or compatible with, the program's code.

-mtune=cortex-a57 (aarchie) / -mtune=intel (xerxes)
**-mtune tunes the generated code for the given CPU type without changing which instructions the compiler is allowed to use.

-march=core2 (xerxes, Intel Core 2) / -march=native (aarchie)
**-march lets the compiler generate instructions specific to the given CPU type, so the binary may not run on CPUs that lack them.

Based on these, the following are the build options that I'll be testing:

1) Out of the box (-O2)
2) -O3
3) -O3 -fmerge-all-constants -floop-parallelize-all -ftree-loop-distribution
4) -Ofast
5) -Ofast -march=core2/native -mtune=intel/cortex-a57
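The five builds above can be written down as a small table of flag sets, which makes it harder to mistype a flag when rebuilding on both machines. This is only a sketch: the names `COMMON_BUILDS`, `ARCH_FLAGS`, `build_matrix` and `make_command` are hypothetical, and it assumes the Makefile accepts a CFLAGS override on the make command line.

```python
# Flag sets for builds 1-4, copied from the list above.
COMMON_BUILDS = {
    "1-baseline": ["-O2"],
    "2-O3": ["-O3"],
    "3-O3-extra": [
        "-O3",
        "-fmerge-all-constants",
        "-floop-parallelize-all",
        "-ftree-loop-distribution",
    ],
    "4-Ofast": ["-Ofast"],
}

# Machine-specific flags for build 5, as listed in the post.
ARCH_FLAGS = {
    "xerxes": ["-march=core2", "-mtune=intel"],
    "aarchie": ["-march=native", "-mtune=cortex-a57"],
}


def build_matrix(host):
    """Return all five flag sets for the given host name."""
    builds = {name: list(flags) for name, flags in COMMON_BUILDS.items()}
    builds["5-Ofast-tuned"] = ["-Ofast"] + ARCH_FLAGS[host]
    return builds


def make_command(flags):
    """Build a make invocation (assumes CFLAGS can be overridden this way)."""
    return ["make", "clean", "all", "CFLAGS=" + " ".join(flags)]


if __name__ == "__main__":
    for name, flags in build_matrix("aarchie").items():
        print(name, " ".join(make_command(flags)))
```

Note that overriding CFLAGS replaces whatever the Makefile sets by default, so any other flags the project needs would have to be added back to each list.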

Command: multitime -n 50 ./bzip2 -c file > /dev/null

There are changes to the test as well. The number of test iterations stays the same at 5 (discarding the first 2 for cache warm-up).
I created a new random text file that is 300MB in size.
Also, I used a document created in my PRJ566 class as a real-world file to compare against the random text file. This file contains text and images and is 138 pages long, just like the files that would be used in real life.
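The "5 iterations, discard the first 2" rule is easy to get wrong by hand, so here is a small sketch of how the remaining runs could be summarized. The function name `summarize` is hypothetical, and the timings in the usage example are made-up numbers, not results from these tests.

```python
import statistics


def summarize(timings, warmup=2):
    """Drop the first `warmup` runs and summarize the rest (in seconds)."""
    measured = timings[warmup:]
    if not measured:
        raise ValueError("no runs left after discarding warm-up runs")
    return {
        "runs": len(measured),
        "mean": statistics.mean(measured),
        "min": min(measured),
        "max": max(measured),
    }


if __name__ == "__main__":
    # Made-up timings for illustration only; the first two are warm-up runs.
    print(summarize([9.0, 8.0, 4.0, 5.0, 6.0]))
```

Reporting min and max next to the mean helps show how noisy the three measured runs were.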


!!!Note!!!
On xerxes, multitime was not available, so I downloaded it, created a directory and built it inside that directory. The multitime I used on xerxes therefore did not run from /usr/bin but from ~/multitime/multitime-1.4/


Results

The results are quite interesting.
On x86_64 the improvement is consistent at 2%. On aarch64 the improvement varies per file, but it averages out to 2% as well.
So it looks like the -Ofast -march -mtune build gives an average improvement of 2%.
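For reference, this is roughly how a percentage improvement and its per-file average can be computed from two sets of timings. It is a sketch with hypothetical function names, and the numbers in the usage example are placeholders, not my measured results.

```python
import statistics


def percent_improvement(baseline_s, optimized_s):
    """Percentage of run time saved relative to the baseline build."""
    if baseline_s <= 0:
        raise ValueError("baseline time must be positive")
    return (baseline_s - optimized_s) / baseline_s * 100.0


def average_improvement(pairs):
    """Mean of per-file improvements; pairs are (baseline_s, optimized_s)."""
    return statistics.mean(percent_improvement(b, o) for b, o in pairs)


if __name__ == "__main__":
    # Placeholder timings in seconds for two files, for illustration only.
    print(average_improvement([(100.0, 99.0), (50.0, 48.5)]))
```

Averaging the per-file percentages treats every file equally, whatever its size, which is the sense in which an "average improvement" is used here.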

Let's see if I can optimize the code itself to get a bigger improvement.
