Thanks Ashley, I'll try your tool.

I would think that this is an error in the programs I am trying to use, too, but this is a problem with 2 different programs, written by 2 different groups. One of them might be bad, but both? That seems unlikely.

Interestingly, the connectivity_c test that is included with OMPI works fine with -np <8. For -np >8 it works some of the time; other times it HANGS. I have got to believe that this is a big clue! Also, when it hangs, I sometimes get the message "mpirun was unable to cleanly terminate the daemons on the nodes shown below". Note that NO nodes are shown below. Once, I got -np 250 to pass the connectivity test, but I was not able to replicate this reliably, so I'm not sure if it was a fluke or what. Here is a link to a screenshot of top while connectivity_c is hung with -np 14. I see that 2 processes are only at 50% CPU usage. Hmmmm.
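
To put a number on "works some of the time", something like the following could be used. It is only a rough sketch: count_hangs is a made-up helper name, it assumes GNU coreutils timeout is installed (it exits with status 124 when it has to kill the command), and the time limit is a guess at what counts as a hang.

```shell
# count_hangs TRIALS LIMIT CMD [ARGS...]   (hypothetical helper, not part of OMPI)
# Runs CMD TRIALS times, kills any run that lasts longer than LIMIT seconds,
# and prints how many runs had to be killed, i.e. were treated as hung.
# Assumes GNU coreutils `timeout`, which exits with status 124 on a timeout.
count_hangs() {
    trials=$1
    limit=$2
    shift 2
    hangs=0
    i=0
    while [ "$i" -lt "$trials" ]; do
        timeout "$limit" "$@" >/dev/null 2>&1
        if [ "$?" -eq 124 ]; then
            hangs=$((hangs + 1))
        fi
        i=$((i + 1))
    done
    echo "$hangs"
}
```

For example, "count_hangs 20 60 mpirun -np 14 ./connectivity_c" would report how many of 20 runs were still going after 60 seconds (the ./connectivity_c path is an assumption about where the test binary lives). Ranks left behind by a killed mpirun may still need cleaning up by hand.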

The other tests, ring_c and hello_c, as well as the cxx versions of these guys, work with all values of -np.

Unfortunately, I could not get valgrind to work...

Thanks, Matt

On Dec 9, 2009, at 2:37 AM, Ashley Pittman wrote:

On Tue, 2009-12-08 at 08:30 -0800, Matthew MacManes wrote:
There are 8 physical cores, or 16 with hyperthreading enabled.

That should be meaty enough.

1st of all, let me say that when I specify that -np is less than 4
processors (1, 2, or 3), both programs seem to work as expected. Also,
the non-mpi version of each of them works fine.

Presumably the non-mpi version is serial, however? This doesn't mean
the program is bug-free or that the parallel version isn't broken.
There are any number of apps that don't work above N processes; in fact
probably all programs break for some value of N, it's normally a little
higher than 3 however.

Thus, I am pretty sure that this is a problem with MPI rather than
with the program code or something else.

What happens is simply that the program hangs.

I presume you mean here the output stops?  The program continues to use
CPU cycles but no longer appears to make any progress?

I'm of the opinion that this is most likely an error in your program; I
would start by using either valgrind or padb.

You can run the app under valgrind using the following mpirun options,
this will give you four files named v.log.0 to v.log.3 which you can
check for errors in the normal way.  The "--mca btl tcp,self" option
will disable shared memory which can create false positives.

mpirun -n 4 --mca btl tcp,self valgrind --log-file=v.log.%
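
As a rough sketch of a first pass over those log files: valgrind normally ends each log with a line of the form "ERROR SUMMARY: N errors from M contexts", so the logs that report at least one error can be picked out with grep. The function name logs_with_errors is made up for illustration.

```shell
# logs_with_errors FILE...   (hypothetical helper)
# Prints the names of the valgrind log files whose "ERROR SUMMARY" line
# reports a non-zero error count. Prints nothing if every log is clean.
logs_with_errors() {
    grep -l 'ERROR SUMMARY: [1-9]' "$@"
}
```

Running "logs_with_errors v.log.*" in the directory holding the logs would list only the ranks worth reading in full.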

Alternatively you can run the application, wait for it to hang and then
in another window run my tool, padb, which will show you the MPI message
queues and stack traces which should show you where it's hung,
instructions and sample output are on this page.

There are no error messages, and there is no clue from anything else
(system working fine otherwise, no RAM issues, etc). It does not hang
at the same place every time: sometimes in the very beginning, sometimes
near the middle.

Could this be an issue with hyperthreading? A conflict with something?

Unlikely; if there was a problem in OMPI running more than 3 processes
it would have been found by now.  I regularly run 8 process applications
on my dual-core netbook alongside all my desktop processes without
issue; it runs fine, a little slowly but fine.

All this talk about binding and affinity won't help either; process
binding is about squeezing the last 15% of performance out of a system
and making performance reproducible, it has no bearing on correctness or
scalability.  If you're not running on a dedicated machine, which with
firefox running I guess you aren't, then there would be a good case for
leaving it off anyway.



Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing

Matthew MacManes
PhD Candidate
University of California- Berkeley
Museum of Vertebrate Zoology
Phone: 510-495-5833