Profiling Code Using gprof

When you are attempting to improve the performance of code, you first need to know which parts of it consume the most execution time. This tells you which code is worth optimizing (the code responsible for most of the program's run time), and which code you can safely ignore (where little time is spent executing).

A "profiler" is a tool used to accomplish this task. Almost every development environment has one (compilers, Java SDKs, R, matlab, ...), and they all work in a similar manner - you compile (if you're using a compiled language) your program using some extra options to support profiling, you run your program once, and you run the profiler program to produce a report of how much time was spent in each of the program's functions.

The vast majority of code today is compiled using the GCC suite. Its profiling tool is called gprof, so that is the profiler demonstrated in this HOWTO.

1. Compile the Program With Profiling Options Set

In order to produce profiling information, the program must be compiled with specific options to tell it to record this information. The important options are:

  1. '-pg' (generate profiling code) and '-g' (debug information) must be included. NOTE: These must be supplied both when compiling and when linking.
  2. Optimization settings should be conservative. In particular, you should not use -O3, and you should avoid any compiler options that introduce "inlining": when the compiler inlines a function, its time is charged to its caller and the function disappears from the profile entirely. -O2 is usually acceptable, although it can still shift the numbers slightly.

So, if you were to compile a program composed of a single file, let's say hello.c, you would compile it with a command line resembling:

gcc -pg -g -o hello hello.c

However, if you were to compile a program consisting of multiple files, or where you have to do separate compilation and linking, you would have to supply -pg and -g on all compilation and linking commands. For example, given the files hello.c and world.c, the commands would resemble the following:

gcc -pg -g -c hello.c
gcc -pg -g -c world.c
gcc -pg -g -o hello hello.o world.o

Often in larger software, you would add these flags to the flags variables (CFLAGS, CXXFLAGS, FFLAGS, LD_OPTS and/or LDFLAGS) in your Makefile, or you would set and export them as environment variables before running configure. An example of the latter approach:

export CFLAGS='-pg -g'
export LDFLAGS='-pg -g'
./configure
make

2. Run the Program Once

When you compiled the program in step 1, the compiler added code that performs timings as the program runs and saves them to a file. So, in order to perform the profiling, all you have to do is run your program.

However, there is one common issue: if the program takes interactive input, the run is held up while you type, which makes profiling runs slow and hard to repeat. It's therefore best to write a file containing the input you are going to give the program, and use input redirection to supply it. OK, that might sound like Greek, so here's an example. Suppose you have a program which asks you to type in two numbers as the dimensions of a matrix, and a third number for the number of iterations of an algorithm. To avoid typing the input by hand on every run, create a file, let's say it's called "input.dat", containing the input. For example, for 100 iterations on a 1000x2000 matrix, it may contain:

1000
2000
100

Then, to have your program read from the file, instead of having to type the input, you can run:

myprogram < input.dat

3. View The Execution Profile

Now that the program has been compiled and run, and the profiling data has been generated (normally stored in a file called 'gmon.out'), it is time to look at where your program spent its time.

To view the profile information, run:

gprof -b myprogram gmon.out

You should get a report similar to the following:

    Flat profile:
    
    Each sample counts as 0.01 seconds.
     no time accumulated
    
      %   cumulative   self              self     total
     time   seconds   seconds    calls  Ts/call  Ts/call  name
      0.00      0.00     0.00        2     0.00     0.00  fact
      0.00      0.00     0.00        1     0.00     0.00  global constructors keyed to factGCOV
    
    
                            Call graph
    
    
    granularity: each sample hit covers 4 byte(s) no time propagated
    
    index % time    self  children    called     name
                                      17             fact [1]
                    0.00    0.00       2/2           main [8]
    [1]      0.0    0.00    0.00       2+17      fact [1]
                                      17             fact [1]
    -----------------------------------------------
                    0.00    0.00       1/1           __do_global_ctors_aux [13]
    [9]      0.0    0.00    0.00       1         global constructors keyed to factGCOV [9]
    -----------------------------------------------
    
    
    Index by function name
    
       [9] global constructors keyed to factGCOV (serial_example.c) [1] fact
    

The easiest place to start is the "Flat profile", i.e. the first table. In it, self seconds is the time spent executing code within the function itself, excluding anything it called. Cumulative seconds is a running total: the self seconds of that function plus those of every function above it in the table (the table is sorted by self time, so the top rows account for most of the run). The calls column tells you how many times the function was called. (If you want the time spent in a function including everything it called, look at the "self" and "children" columns of the call graph, the second table.)

Those pieces of information are extremely useful. The top of the flat profile tells you, at a high level, where your code is spending its time. This points out places where you can benefit either by tuning or by rethinking the algorithm entirely. For example, if the time is going into a routine called 'integration', you might look at implementing a different integration routine.

The self seconds column tells you the number of seconds spent directly inside the routine's own code. This can point out specific functions to tune by hand. However, of the three columns, acting on this one is usually the least effective: shaving time off a function's body rarely pays off as much as calling it less often or changing the algorithm.

The calls column can be very useful, particularly when the total s/call value is also high (in the sample output it appears as Ts/call only because no measurable time accumulated). It tells you how many times the function was called, and reducing that count is often the biggest win. For example, if you were to implement a Monte Carlo integration of a function fn(x) to three digits of accuracy, you would see fn() called about 1,000,000 times, since Monte Carlo error shrinks as 1/sqrt(N). To reduce this, one might consider using another integration routine such as VEGAS.
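To make the Monte Carlo example concrete, here is a hedged sketch (fn(), the counter, and the sample count are illustrative, not from the original): the explicit counter mimics what gprof's calls column would report for fn().

```c
#include <stdlib.h>

static long fn_calls = 0;          /* what gprof's calls column would show */

/* The integrand; x*x stands in for something expensive. */
static double fn(double x) {
    fn_calls++;
    return x * x;
}

/* Plain Monte Carlo estimate of the integral of fn over [0,1].
 * Error shrinks as 1/sqrt(n), so ~3 digits needs n around 10^6. */
static double mc_integrate(long n) {
    double sum = 0.0;
    for (long i = 0; i < n; i++) {
        double x = (double)rand() / RAND_MAX;
        sum += fn(x);
    }
    return sum / (double)n;
}
```

Running mc_integrate(1000000) calls fn() exactly a million times; an importance-sampling scheme like VEGAS reaches the same accuracy with far fewer evaluations.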

Another frequent benefit of watching the calls column is that it can reveal a function being called more often than necessary. Copy-and-paste coding often results in a deterministic function being called several times with the same input; calling it once and storing the result in a variable for later use can significantly improve performance.
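A minimal sketch of that copy-and-paste pattern (hypothetical code): the call counter plays the role of gprof's calls column, and the cached version halves it.

```c
static long calls = 0;

/* A deterministic function; the counter shows how often it runs. */
static double expensive(double x) {
    calls++;
    return x * x + 1.0;    /* stands in for costly work */
}

/* Copy-and-paste style: the same input evaluated twice. */
static double wasteful(void) {
    return expensive(3.0) + expensive(3.0);
}

/* Call once, reuse the stored result. */
static double cached(void) {
    double e = expensive(3.0);
    return e + e;
}
```

Both functions return the same value, but in the profile wasteful() would show twice as many calls to expensive() as cached() does.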
