Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] mpirun only works when -np <4
From: Matthew MacManes (macmanes_at_[hidden])
Date: 2009-12-09 13:20:11


Thanks Ashley, I'll try your tool.

I would also suspect an error in the programs I am trying to use, but the problem occurs with 2 different programs, written by 2 different groups. One of them might be bad, but both? That seems unlikely.

Interestingly, the connectivity_c test that is included with OMPI works fine with -np <8. For -np >8 it works some of the time; other times it HANGS. I have got to believe that this is a big clue!! Also, when it hangs, I sometimes get the message "mpirun was unable to cleanly terminate the daemons on the nodes shown below". Note that NO nodes are shown below. Once, I got -np 250 to pass the connectivity test, but I was not able to replicate this reliably, so I'm not sure if it was a fluke or what. Here is a link to a screenshot of top when connectivity_c is hung with -np 14. I see that 2 processes are only at 50% CPU usage. Hmmmm

http://picasaweb.google.com/lh/photo/87zVEucBNFaQ0TieNVZtdw?authkey=Gv1sRgCLKokNOVqo7BYw&feat=directlink
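Since the hang is intermittent, one way to put a number on it would be to run the test repeatedly under a time limit and count the runs that get killed. The following is only a sketch: count_hangs is a made-up helper name, the 60-second limit is an arbitrary guess, and it assumes the GNU coreutils "timeout" command is installed (it exits with status 124 when the limit is hit).

```shell
#!/bin/sh
# count_hangs RUNS LIMIT CMD [ARGS...]
# Hypothetical helper: runs CMD up to RUNS times, each under a LIMIT-second
# time limit, and prints how many runs were killed for exceeding the limit.
count_hangs() {
    runs=$1
    limit=$2
    shift 2
    hangs=0
    i=0
    while [ "$i" -lt "$runs" ]; do
        # GNU timeout exits with status 124 when the command timed out.
        timeout "$limit" "$@" >/dev/null 2>&1
        rc=$?
        if [ "$rc" -eq 124 ]; then
            hangs=$((hangs + 1))
        fi
        i=$((i + 1))
    done
    echo "$hangs"
}
```

For example, "count_hangs 20 60 mpirun -np 14 ./connectivity_c" would report how many of 20 runs failed to finish within a minute. A run that is merely slow would be counted as a hang, so the limit needs to be well above the normal run time.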

The other tests, ring_c and hello_c, as well as the cxx versions of these, work with all values of -np.

Unfortunately, I could not get valgrind to work...

Thanks, Matt

On Dec 9, 2009, at 2:37 AM, Ashley Pittman wrote:

> On Tue, 2009-12-08 at 08:30 -0800, Matthew MacManes wrote:
>> There are 8 physical cores, or 16 with hyperthreading enabled.
>
> That should be meaty enough.
>
>> 1st of all, let me say that when I specify that -np is less than 4
>> processors (1, 2, or 3), both programs seem to work as expected. Also,
>> the non-mpi version of each of them works fine.
>
> Presumably the non-mpi version is serial, however? This doesn't mean
> the program is bug-free or that the parallel version isn't broken.
> There are any number of apps that don't work above N processes; in fact
> probably all programs break for some value of N, it's normally a little
> higher than 3, however.
>
>> Thus, I am pretty sure that this is a problem with MPI rather than
>> with the program code or something else.
>>
>> What happens is simply that the program hangs..
>
> I presume you mean here the output stops? The program continues to use
> CPU cycles but no longer appears to make any progress?
>
> I'm of the opinion that this is most likely an error in your program, I
> would start by using either valgrind or padb.
>
> You can run the app under valgrind using the following mpirun options,
> this will give you four files named v.log.0 to v.log.3 which you can
> check for errors in the normal way. The "--mca btl tcp,self" option
> will disable shared memory which can create false positives.
>
> mpirun -n 4 --mca btl tcp,self valgrind --log-file=v.log.%q{OMPI_COMM_WORLD_RANK} <app>
>
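As a rough illustration of checking those per-rank logs, the sketch below prints the error count from each file's final "ERROR SUMMARY" line. summarize_vlogs is a hypothetical helper name, and the script assumes the logs use the v.log.N names from the command above and valgrind's usual summary line format.

```shell
#!/bin/sh
# summarize_vlogs [DIR]
# Hypothetical helper: prints "<file>: <error count>" for every v.log.* file
# in DIR (default: the current directory).
summarize_vlogs() {
    dir=${1:-.}
    for f in "$dir"/v.log.*; do
        [ -f "$f" ] || continue
        # Valgrind normally ends each log with a line such as:
        #   ==1234== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 4 from 4)
        n=$(sed -n 's/.*ERROR SUMMARY: \([0-9][0-9,]*\) error.*/\1/p' "$f" | tail -n 1)
        echo "$(basename "$f"): ${n:-no summary found}"
    done
}
```

Any rank with a nonzero count is worth opening in full. A log with no summary line may mean that rank never exited, which is itself a hint about where the hang is.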
> Alternatively you can run the application, wait for it to hang and then
> in another window run my tool, padb, which will show you the MPI message
> queues and stack traces which should show you where it's hung,
> instructions and sample output are on this page.
>
> http://padb.pittman.org.uk/full-report.html
>
>> There are no error messages, and there is no clue from anything else
>> (system working fine otherwise- no RAM issues, etc). It does not hang
>> at the same place every time, sometimes in the very beginning, sometimes
>> near the middle..
>>
>> Could this be an issue with hyperthreading? A conflict with something?
>
> Unlikely, if there was a problem in OMPI running more than 3 processes
> it would have been found by now. I regularly run 8 process applications
> on my dual-core netbook alongside all my desktop processes without
> issue, it runs fine, a little slowly but fine.
>
> All this talk about binding and affinity won't help either, process
> binding is about squeezing the last 15% of performance out of a system
> and making performance reproducible, it has no bearing on correctness or
> scalability. If you're not running on a dedicated machine, which with
> firefox running I guess you aren't, then there would be a good case for
> leaving it off anyway.
>
> Ashley,
>
> --
>
> Ashley Pittman, Bath, UK.
>
> Padb - A parallel job inspection tool for cluster computing
> http://padb.pittman.org.uk
>

_________________________________
Matthew MacManes
PhD Candidate
University of California- Berkeley
Museum of Vertebrate Zoology
Phone: 510-495-5833
Lab Website: http://ib.berkeley.edu/labs/lacey
Personal Website: http://macmanes.com/