
Open MPI User's Mailing List Archives


From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2007-10-31 11:19:11

I would try attaching to the processes to see where things are
getting stuck.
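
One common way to make attaching easier is to have each process print its PID and host and then wait until a debugger releases it. The following is only a minimal sketch under stated assumptions: the helper name wait_for_attach and its parameters are hypothetical (they are not part of Open MPI), and it uses only POSIX calls, so it does not depend on MPI itself.

```c
#define _POSIX_C_SOURCE 200112L
#include <stdio.h>
#include <unistd.h>

/*
 * Hypothetical helper (not an Open MPI API): announce this process and
 * poll until a debugger sets *attached to nonzero, or until max_polls
 * polls have elapsed. Returns 1 if the flag was set, 0 on timeout.
 */
static int wait_for_attach(volatile int *attached, int max_polls,
                           unsigned poll_seconds)
{
    char host[256];

    /* Report where this process lives so the right host/PID can be picked. */
    if (gethostname(host, sizeof(host)) != 0) {
        host[0] = '?';
        host[1] = '\0';
    }
    host[sizeof(host) - 1] = '\0';
    fprintf(stderr, "PID %d on %s waiting for debugger attach\n",
            (int)getpid(), host);

    /* The flag is volatile so that a write from the debugger is seen here. */
    for (int n = 0; n < max_polls && !*attached; n++) {
        sleep(poll_seconds);
    }
    return *attached != 0;
}
```

Assuming this is called early in main() of both the master and the spawned children, one could attach with something like "gdb -p <PID>" on the reported host, set the flag from the debugger, and then step or take a backtrace to see which call each process is blocked in. The timeout is a design choice so that a forgotten process does not wait forever.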

On Oct 31, 2007, at 5:51 AM, Murat Knecht wrote:

> Jeff Squyres wrote:
>> On Oct 31, 2007, at 1:18 AM, Murat Knecht wrote:
>>> Yes I am, (master and child 1 running on the same machine). But
>>> knowing the oversubscribing issue, I am using mpi_yield_when_idle
>>> which should fix precisely this problem, right?
>> It won't *fix* the problem -- you're still oversubscribing the
>> nodes, so things will run slowly. But it should help, in that the
>> processes will yield regularly.
> Yes. By "fix" I meant "solving the blocking problem by letting
> others get some CPU time".
>> What version of OMPI are you using?
> I am using 1.2.4
>>> I did give both machines multiple slots, so OpenMPI "knows" that
>>> the possibility for more oversubscription may arise.
>> I'm not sure what you mean by this -- you should not "lie" to OMPI
>> and tell it that it has more slots than it physically does. But
>> keep in mind that, as I described in my first mail, OMPI does not
>> currently re-compute the number of processes on a host as you
>> spawn (which can lead to the oversubscription problem). If you're
>> explicitly setting yield_when_idle, that *may* help, but we may or
>> may not be explicitly propagating that value to spawned
>> processes... I'll have to check.
> In the hostfile I specified for each host the number of physically
> available cores. Together with the "yield" setting I hoped the
> oversubscription would be recognised even if the "oversubscribing"
> processes are dynamically started.
> I re-checked the high/low parameter, but it does seem alright.
> Would be kind of awkward for this to be the reason, as the problem
> seems to depend on the host and the order.
> Thanks,
> Murat

Jeff Squyres
Cisco Systems