On 14.03.2012 at 04:02, Joshua Baker-LePain wrote:
> On Tue, 13 Mar 2012 at 5:31pm, Ralph Castain wrote:
>> FWIW: I have a Centos6 system myself, and I have no problems running OMPI on it (1.4 or 1.5). I can try building it the same way you do and see what happens.
> I can run as many threads as I like on a single system with no problems, even if those threads are running at different nice levels.
How do they get different nice levels? Do you renice them? I would assume they all start at the same nice level as the parent. In the test program you posted there are no threads.
> The problem seems to arise when I'm both a) running across multiple machines and b) running threads at differing nice levels (which often happens as a result of our queueing setup).
This sounds like slots from different queues are being assigned to one and the same job. My experience: don't do it unless you need it. The problem is that SGE can't decide, in its `qrsh -inherit ...` call, which queue is the correct one for that particular call. As a result, all calls to a slave machine can end up in one and the same queue. Although this is not correct, it won't oversubscribe the node, as the overall slot count is usually limited already; it's more a matter of the names SGE sets in the job's environment:
As a result, the $TMPDIR set by SGE can differ between the master of the parallel job and a slave, because the queue name is part of $TMPDIR. When a wrong $TMPDIR is set on a node (by Open MPI's forwarding?), strange things can happen, depending on the application.
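To see which queue name ends up in $TMPDIR on each node, something like the following could be started on every host of the job. It is only a rough sketch: it assumes the common layout `<base>/<job_id>.<task_id>.<queue_name>` for $TMPDIR, and the helper name `queue_from_tmpdir` is made up for this example.

```python
import os
import socket


def queue_from_tmpdir(tmpdir):
    """Return the queue name embedded in an SGE-style $TMPDIR, or None.

    Assumed layout (not verified for every SGE version):
    <base>/<job_id>.<task_id>.<queue_name>
    """
    base = os.path.basename(tmpdir.rstrip("/"))
    # Split only twice: the queue name itself may contain dots (e.g. all.q).
    parts = base.split(".", 2)
    if len(parts) < 3 or not parts[0].isdigit():
        return None
    return parts[2]


if __name__ == "__main__":
    tmpdir = os.environ.get("TMPDIR", "")
    # One line per process: host, raw $TMPDIR, queue name derived from it.
    print(socket.gethostname(), tmpdir, queue_from_tmpdir(tmpdir))
```

If the queue name printed on a slave differs from the one the slots on that host were actually granted from, that would match the mix-up described above.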
Do you see the same problem if you stay in one and the same queue across the machines? If you want to limit the number of available PEs in your setup for the user, you could request a PE by wildcard; once a PE is selected, SGE will stay in that PE. Attaching each PE to only one queue then avoids mixing slots from different queues (orte1 PE => all.q, orte2 PE => extra.q, and you request orte*).
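To check whether a given job really got slots from more than one queue, the file named in $PE_HOSTFILE can be inspected from inside the job script. A minimal sketch, assuming the usual four-column format (host, slots, queue@host, processor range); the function names are invented for this example.

```python
import os
from collections import defaultdict


def queues_per_host(hostfile_text):
    """Map each host to the set of queue names it was granted slots from."""
    queues = defaultdict(set)
    for line in hostfile_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        host, queue_instance = fields[0], fields[2]
        # The third column is a queue instance such as all.q@node01.
        queues[host].add(queue_instance.split("@", 1)[0])
    return dict(queues)


def queues_in_job(hostfile_text):
    """Return the sorted list of distinct queues used by the whole job."""
    found = set()
    for names in queues_per_host(hostfile_text).values():
        found |= names
    return sorted(found)


if __name__ == "__main__":
    path = os.environ.get("PE_HOSTFILE")
    if path:
        with open(path) as handle:
            names = queues_in_job(handle.read())
        print("queues in this job:", ", ".join(names))
        if len(names) > 1:
            print("warning: slots come from more than one queue")
```

A job that prints more than one queue name here is the mixed case; with one PE per queue and a wildcard request, it should always print exactly one.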
> I can't guarantee that the problem *never* happens when I run across multiple machines with all the threads un-niced, but I haven't been able to reproduce that at will like I can for the other case.
> Joshua Baker-LePain
> QB3 Shared Cluster Sysadmin
> users mailing list