Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Heap profiling with OpenMPI
From: Jan Ploski (Jan.Ploski_at_[hidden])
Date: 2008-08-06 03:29:47

users-bounces_at_[hidden] wrote on 08/05/2008 05:51:51 PM:

> Jan,
> I'm using valgrind with Open MPI on a [very] regular basis and I never
> had any problems. I usually want to know the execution path on the MPI
> applications. For this I use:
> mpirun -np XX valgrind --tool=callgrind -q --log-file=some_file ./my_app
> I just ran your example:
> mpirun -np 2 -bynode --mca btl tcp,self valgrind --tool=massif -q ./NPmpi -u 4
> and I got 2 non empty files in the current directory:
> bosilca_at_dancer:~/NetPIPE_3.6.2$ ls -l massif.out.*
> -rw------- 1 bosilca bosilca 140451 2008-08-05 11:57 massif.out.21197
> -rw------- 1 bosilca bosilca 131471 2008-08-05 11:57 massif.out.21210


Thanks for the info. Which versions of OpenMPI, the compiler, and valgrind
did you try with? I checked on two different clusters with OpenMPI 1.2.4
compiled with two different versions of the PGI compiler and valgrind
3.3.1, with the same bad result. I also noticed that the MPI processes,
despite producing the expected output, do not terminate cleanly. I can
see in the stderr log (for each process):

==7909== Warning: client syscall munmap tried to modify addresses
==7909== Process terminating with default action of signal 11 (SIGSEGV)
==7909== Access not within mapped region at address 0x8053D8000
==7909== at 0x5284996: _int_free (in
==7909== by 0x52837A7: free (in
==7909== by 0x593C76A: free_mem (in /lib64/
==7909== by 0x593C3E1: __libc_freeres (in /lib64/
==7909== by 0x491D31C: _vgnU_freeres (vg_preloaded.c:60)
==7909== by 0x587D1C4: exit (in /lib64/
==7909== by 0x586815A: (below main) (in /lib64/

That probably explains why my massif.out.* files are effectively empty
(<200 bytes long), but why do the processes crash? The same program runs
fine with valgrind+MVAPICH, or with OpenMPI without valgrind, on their
respective clusters. I see this both with a simple test program and with
a real application (WRF).
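
To spot this symptom quickly across many ranks, a minimal Python sketch
along the following lines could list the massif.out.* files that fall
below a size threshold. The helper name small_massif_outputs and the
200-byte default are assumptions for illustration, taken from the size
mentioned above; neither is part of valgrind or Open MPI.

```python
import glob
import os


def small_massif_outputs(directory=".", threshold=200):
    """Return (path, size) pairs for massif.out.* files under `threshold` bytes.

    Hypothetical helper: the 200-byte default mirrors the size of the
    truncated output files described in this message.
    """
    pattern = os.path.join(directory, "massif.out.*")
    results = []
    for path in sorted(glob.glob(pattern)):
        size = os.path.getsize(path)
        if size < threshold:
            results.append((path, size))
    return results


if __name__ == "__main__":
    # Report suspiciously small profiles in the current directory.
    for path, size in small_massif_outputs():
        print(f"{path}: {size} bytes (likely incomplete)")
```

Run in the directory where mpirun was started, this prints one line per
undersized profile; a healthy run such as the one quoted above (files of
roughly 130-140 KB) would print nothing.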

Jan Ploski