Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Spawn_multiple with tight integration to SGE grid engine
From: Reuti (reuti_at_[hidden])
Date: 2012-02-06 08:14:13


On 04.02.2012 at 00:15, Tom Bryan wrote:

> OK. I misunderstood you. I thought that you were saying that spawn_multiple
> had to call mpiexec for each spawned process. If you just meant that mpi.sh
> should launch the initial process with mpiexec, that seems reasonable. I
> tried it with and without, and I definitely get better results when using
> mpiexec.

Yep.

> If I need MPI_THREAD_MULTIPLE, and openmpi is compiled with thread support,
> it's not clear to me whether MPI::Init_thread() and
> MPI::Init_thread(MPI::THREAD_MULTIPLE) would give me the same behavior from
> Open MPI.

If you need thread support, you will need MPI::Init_thread, and it takes either one argument (the required thread level) or three (argc, argv and the required thread level).

The MPI 2.2 standard specifies this:

http://www.mpi-forum.org/docs/

page 384.

>>>> NB: What is MPI::Init( MPI::THREAD_MULTIPLE ) supposed to do, output a
>>>> feature of MPI?
>
> From the man page:
> MPI_Init_thread, as compared to MPI_Init, has a provision to request a
> certain level of thread support in required....The level of thread support
> available to the program is set in provided, except in C++, where it is the
> return value of the function.
>
>> For me it's not hanging. Did you try the alternative startup using mpiexec?
>> Aha - BTW: I use 1.4.4
>
> Right, I'm on 1.5.4.

I suggest using the stable version 1.4.4 for your experiments. As you said you are new to MPI, you could easily be misled: it is hard to tell misleading error messages and bugs in Open MPI apart from error messages caused by a programming error on your side.

> Yes, I did try starting with mpiexec. That helps, but I still don't know
> whether I understand all of the results.
>
> For each experiment, I've attached the output of
> qstat -f
> qstat -g t
> pstree -Aalp <pid of sge_execd>
> output of mpitest parent and children (mpi.sh.o<job>)
>
> I ran each test with two different SGE queue configurations. In one case,
> the queue with the orte pe is set to include all 5 exec hosts in my grid.
> In the "single" case, the queue with the orte pe is set to use only a single
> host. (The queue configuration isn't shown here, but I changed the queue's
> hostlist to use either a single host or a host group that includes all of
> my machines.)
>
> I run qsub on node 17. The grid machines available for this run are 3, 4,
> 10, 11, and 16.
>
> Some observations:
>
> 1. I'm still surprised that the SGE behavior is so different when I
> configure my SGE queue differently. See test "a" in the .tgz. When I just
> run mpitest in mpi.sh and ask for exactly 5 slots (-pe orte 5-5), it works
> if the queue is configured to use a single host. I see 1 MASTER and 4
> SLAVES in qstat -g t, and I get the correct output.

Fine. ("job_is_first_task true" in the PE according to this.)

> If the queue is set to
> use multiple hosts, the jobs hang in spawn/init, and I get errors
> [grid-03.cisco.com][[19159,2],0][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint
> _complete_connect] connect() to 192.168.122.1 failed: Connection refused
> (111)

What is the setting in SGE for:

$ qconf -sconf
...
qlogin_command builtin
qlogin_daemon builtin
rlogin_command builtin
rlogin_daemon builtin
rsh_command builtin
rsh_daemon builtin

If it's set to use ssh, you will need passphrase-less login to the other nodes or (better) hostbased authentication, which is a one-time setup that also covers all future users:

http://arc.liv.ac.uk/SGE/howto/hostbased-ssh.html
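
Whether such a login is already in place can be checked from the node where the job script runs. A minimal sketch, assuming OpenSSH (its BatchMode option makes ssh fail instead of prompting); the function name and the SSH override variable are made up for illustration:

```shell
# Return 0 if a non-interactive login to host "$1" succeeds.
# BatchMode=yes makes OpenSSH fail instead of asking for a password or passphrase.
# SSH may be overridden for a dry run; by default the real ssh is used.
can_login_without_prompt() {
    ${SSH:-ssh} -o BatchMode=yes -o ConnectTimeout=5 "$1" true
}

# Hypothetical usage with the exec hosts mentioned in this thread:
#   for h in grid-03 grid-04 grid-10 grid-11 grid-16; do
#       can_login_without_prompt "$h" || echo "$h: non-interactive login failed"
#   done
```

This only matters if rsh_command really points to ssh; with the builtin method no ssh login is involved.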

But I wonder why it's working for some nodes. Are there custom configurations per node, and are some of them faulty? You can list the hosts that have a local configuration:

$ qconf -sconfl

And then you can check for each listed one:

$ qconf -sconf grid-04

and so on.
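
The per-host check can also be done in one go. A minimal sketch, assuming `qconf -sconfl` prints one host name per line and `qconf -sconf <host>` prints one "key value" pair per line as in the output above; the function names are made up for illustration:

```shell
# Keep only the remote-startup keys from `qconf -sconf` output read on stdin.
remote_startup_settings() {
    grep -E '^(qlogin|rlogin|rsh)_(command|daemon)[[:space:]]'
}

# Show the remote-startup keys of every host that has a local configuration.
# Needs a working SGE installation; a host that does not override these keys
# prints only its header line.
check_all_hosts() {
    for host in $(qconf -sconfl); do
        echo "== $host"
        qconf -sconf "$host" | remote_startup_settings
    done
}
```

Any host that shows a different rsh_command or rsh_daemon than the global configuration would be the first candidate to look at.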

In case you are interested in the meaning and behavior behind these settings:

http://arc.liv.ac.uk/SGE/htmlman/htmlman5/remote_startup.html

> [grid-10.cisco.com:05327] [[19159,0],3] routed:binomial: Connection to
> lifeline [[19159,0],0] lost
> [grid-16.cisco.com:25196] [[19159,0],1] routed:binomial: Connection to
> lifeline [[19159,0],0] lost
> [grid-11.cisco.com:63890] [[19159,0],2] routed:binomial: Connection to
> lifeline [[19159,0],0] lost
> So, I'll just assume that mpiexec does some magic that is needed in the
> multi-machine scenario but not in the single machine scenario.
>
> 2. I guess I'm not sure how SGE is supposed to behave. Experiment "a" and
> "b" were identical except that I changed -pe orte 5-5 to -pe orte 5-. The
> single case works like before, and the multiple exec host case fails as
> before. The difference is that qstat -g t shows additional SLAVEs that
> don't seem to correspond to any jobs on the exec hosts. Are these SLAVEs
> just slots that are reserved for my job but that I'm not using? If my job
> will only use 5 slots, then I should set the SGE qsub job to ask for exactly
> 5 with "-pe orte 5-5", right?

Correct. The remaining ones are just unused. Of course, you could adjust your application to check how many slots were granted and start as many slaves as it takes to use all of them.
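
One way to find out how many slots were granted is to read it from the job environment: SGE exports NSLOTS and, for parallel jobs, PE_HOSTFILE, whose lines carry the host name, the slot count, the queue and a processor range. A minimal sketch that sums the slot column; the function name is made up for illustration:

```shell
# Sum the slot counts (second column) of the SGE PE hostfile given as "$1".
granted_slots() {
    awk '{ total += $2 } END { print total + 0 }' "$1"
}

# Inside a job script (hypothetical usage):
#   slots=$(granted_slots "$PE_HOSTFILE")
#   echo "granted $slots slots, NSLOTS says $NSLOTS"
```

How the number is then handed to the application (argument, environment variable) is up to you; with one parent process already running, it would spawn one child fewer than the number of granted slots.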

> 3. Experiment "d" was similar to "b", but mpi.sh uses "mpiexec -np 1
> mpitest" instead of running mpitest directly. Now both the single machine
> queue and multiple machine queue work. So, mpiexec seems to make my
> multi-machine configuration happier. In this case, I'm still using "-pe
> orte 5-", and I'm still seeing the extra SLAVE slots granted in qstat -g t.

Then case a) could show a bug in 1.5.4. For me both were working, but the allocation was different. I only got the correct allocation with "mpiexec -np 1". In your case 4 were routed to one remote machine: the machine where the jobscript runs is usually the first entry in the machinefile, but on grid-03 you happened to get only one slot, so the 4 additional ones were routed to the next machine found in the machinefile.
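
To see such an allocation directly, the job script can print how many slots each host received before it starts anything. A minimal sketch, assuming the usual PE_HOSTFILE layout (host name in the first column, slot count in the second); the function name is made up for illustration:

```shell
# Print the number of slots granted on host "$2" according to PE hostfile "$1"
# (0 if the host is not listed). The name must match column 1 exactly.
slots_on_host() {
    awk -v host="$2" '$1 == host { n += $2 } END { print n + 0 }' "$1"
}

# Inside a job script (hypothetical usage):
#   cat "$PE_HOSTFILE"
#   echo "slots on this host: $(slots_on_host "$PE_HOSTFILE" "$(hostname)")"
```

Note that `hostname` may return a short name while the hostfile lists fully qualified ones (or the other way round), so the name may need adjusting before the comparison.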

> 4. Based on "d", I thought that I could follow the approach in "a". That
> is, for experiment "e", I used mpiexec -np 1, but I also used -pe orte 5-5.
> I thought that this would make the multi-machine queue reserve only the 5
> slots that I needed. The single machine queue works correctly, but now the
> multi-machine case hangs with no errors. The output from qstat and pstree
> are what I'd expect, but it seems to hang in Spawn_multiple and Init_thread.
> I really expected this to work.

Yes, this should work across multiple machines. It is also using `qrsh -inherit ...`, so it's failing somewhere in Open MPI. Does it work with 1.4.4?

-- Reuti

> I'm really confused by experiment "e" with multiple machines in the queue.
> Based on "a" and "d", I thought that a combination of mpiexec -np 1 would
> permit the multi-machine scheduling to work with MPI while the "-pe orte
> 5-5" would limit the slots to exactly the number that it needed to run.
>
> ---Tom
>
> <mpiExperiments.tgz>