Integrated Performance Monitoring toolkit
The Integrated Performance Monitoring (IPM) toolkit is an application-profiling infrastructure that produces detailed profiles of the MPI communication and computation of an application while having negligible impact on its performance. It is currently used at most US Department of Energy supercomputing sites, such as NERSC and the Argonne and Oak Ridge National Laboratories. IPM's key developers are based at Lawrence Berkeley National Laboratory and the University of California (both the Berkeley and San Diego campuses).
IPM is extremely portable and is currently available on all HPC platforms at CSCS.
What can IPM do?
It can generate detailed profiles of the following:
- MPI communication (calls, buffer sizes, communication patterns)
- PAPI hardware performance counters
- physical memory used per compute node
IPM is excellent at making profiles of the information listed above on a per-MPI-task basis. This makes detecting load imbalance much easier. However, there are a few significant limitations that one must keep in mind.
- Only MPI communication is profiled; there is no OpenMP or Pthreads support. PGAS languages such as UPC and Co-array Fortran are also not supported.
- I/O profiling is not available. However, a version of IPM with I/O support is currently under development.
- User-written functions that do not involve communication are not profiled automatically.
How to use IPM
- Using IPM is very easy. First, load the necessary module.
module load ipm
- Then relink your application as shown below. Please note that a full recompile is not necessary. Because you are using the ftn or cc wrappers, you do NOT need to add the IPM include or library paths to your compilation command; the wrapper does this automatically.
ftn -o test.exe test.f90
cc -o test.exe test.c
- Before running the application, check that the IPM library has been linked into your executable:
nm -o test.exe | grep ipm_finalize
ipm:00000000004017f0 T ipm_finalize
ipm:00000000004017f6 t __ipm_finalizeEND
- Now run your application as normal. A brief IPM report will be added to the STDOUT of your job. This report highlights the walltime spent in some of the most commonly used MPI functions called by the application, the overall walltime, the maximum amount of physical memory used per compute node, and the aggregate GFLOPS for the job. An XML-format IPM log file will also be generated in the directory where the executable is run. More detailed application performance analysis is done with this file, as explained in a later section.
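For example, a minimal job script might look like the sketch below. A SLURM batch system and the srun launcher are assumed here purely for illustration, along with the test.exe binary built above; adapt the directives and launch command to your scheduler.
#!/bin/bash
#SBATCH --job-name=ipm_test
#SBATCH --nodes=2
#SBATCH --time=00:30:00

# Make the IPM environment available in the job
module load ipm

# Run the IPM-linked executable as usual; the brief report appears in the
# job STDOUT and the XML log file is written to the working directory
srun ./test.exe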
- By default, the only measurements gathered from the hardware performance counters are the total number of cycles and flops. Additional counts can be requested by setting the environment variable IPM_HPM to a specific group of hardware performance counter measurements. This has to be done inside your job submission file (a sketch appears after the list below). Some popular hardware performance counter groups provide the following measurements:
- total cycles, total instructions and TLB metrics
- total instructions, L1 and L2 cache metrics
- total cycles, instructions and flops
Please note that IPM currently does not have support for L3 cache measurements.
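Continuing the job-script sketch above, IPM_HPM would be exported before the launch line. The group identifier below is only a placeholder; the valid group values depend on the platform and the IPM installation, and the srun launcher is again assumed for illustration.
# In the job submission file, before the launch command:
export IPM_HPM=<group>
srun ./test.exe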
- One can add IPM regions to an application for a more precise analysis. This is done by wrapping the code of interest with MPI_Pcontrol calls:
CALL MPI_Pcontrol( 1, "Region_1\0" )
...code to be profiled...
CALL MPI_Pcontrol(-1, "Region_1\0")
The first argument to MPI_Pcontrol is the switch that denotes the start or end of a region: 1 denotes the start and -1 the end. The second argument is a string containing the name given to the region. The string must contain normal ASCII characters with no spaces and must end with the null character "\0". There is no limit on the number of regions that can be defined. One can amalgamate non-consecutive IPM regions by giving them the same name. For example, to profile a subroutine that is called in different places throughout your application, wrap MPI_Pcontrol calls around each call to that subroutine using the same region name; IPM will concatenate the performance data for the separate regions into a single IPM record. IPM regions can also be useful for measuring walltime and hardware performance counters for sections of code that contain no MPI calls.
Detailed performance analysis with IPM
As stated previously, a brief performance report is added to the STDOUT of any job in which a binary linked with the IPM library is executed. Much more performance information can be gathered by parsing the IPM log file generated in the directory where the job is run. Generally, the name of this XML-based log file contains your username and a unique identifier for the job.
- In order to parse the log file, first load the IPM module and then call the parser program:
module load ipm
ipm_parse -html <name_of_log_file>
With the -html flag present, the IPM parser creates a new subdirectory containing a number of HTML webpages that present the gathered performance data for the profiled job. There are a number of different output options for the IPM parser; these can be listed with the command ipm_parse -h. However, the three most useful and commonly used output options are the following:
The IPM parser output options can:
- recreate the brief report generated in STDOUT
- create a more detailed performance report in ASCII format
- create interactive webpages with text and graphics (the -html flag used above)
The IPM parser can be used on almost any CSCS platform; however, only the Cray XT login nodes have the public HTML support available, so that users can access and view the IPM web reports via a web browser. It is important to realize that the computational resources (i.e. walltime and physical memory) needed to parse a log file increase with the number of MPI tasks used in the profiled job. For instance, parsing an IPM log file from a 1536 MPI-task job takes about 45 minutes to complete on a 1.4 GHz Itanium CPU.
Will my application run slower if it is using IPM?
Generally, applications that are linked against the IPM library run up to 5% slower. However, internal testing at CSCS has shown little negative performance impact on a variety of applications. In some rare situations, the application actually runs faster when linked against the IPM library.
I get no IPM output (in STDOUT) and no IPM log file is created
Did you use a binary linked against the IPM library? Is the necessary IPM module loaded? In some applications the call to MPI_Finalize is wrapped in a separate subroutine, as in the following example:
CALL stop_run()
END PROGRAM test

SUBROUTINE stop_run()
CALL MPI_FINALIZE( ierr )
END SUBROUTINE stop_run
This code structure can sometimes confuse IPM and make it "miss" the MPI_Finalize call. If this happens, IPM will abort the profiling process and no information is collected. The solution is to remove the wrapped call and insert an explicit MPI_Finalize call into the main program file.
I just want the brief IPM report in STDOUT. Can I configure IPM to not create a logfile?
Yes. When re-linking your binary with the IPM library, add the flag -DIPM_DISABLE_LOG to your link statement. This will prevent IPM from creating the XML logfile in your execution directory.
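For example, reusing the ftn wrapper and source file from the relinking example above (substitute cc and your own files as appropriate):
ftn -DIPM_DISABLE_LOG -o test.exe test.f90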