
Subject: Re: [OMPI users] Spawn_multiple with tight integration to SGE grid engine
From: Tom Bryan (tombry_at_[hidden])
Date: 2012-02-06 16:28:51


On 2/6/12 8:14 AM, "Reuti" <reuti_at_[hidden]> wrote:

>> If I need MPI_THREAD_MULTIPLE, and openmpi is compiled with thread support,
>> it's not clear to me whether MPI::Init_Thread() and
>> MPI::Inint_Thread(MPI::THREAD_MULTIPLE) would give me the same behavior from
>> Open MPI.
>
> If you need thread support, you will need MPI::Init_Thread and it needs one
> argument (or three).

Sorry, typo on my side. I meant to compare
MPI::Init_thread(MPI::THREAD_MULTIPLE) and MPI::Init(). I think that your
first reply mentioned replacing MPI::Init_thread by MPI::Init.

> I suggest using the stable version 1.4.4 for your experiments. As you said you
> are new to MPI, you could be misled between error messages caused by bugs and
> error messages due to a programming error on your side.

OK. I'll certainly set it up so that I can validate what's supposed to
work. I'll have to check with our main MPI developers to see whether
there's anything in 1.5.x that they need.

>> 1. I'm still surprised that the SGE behavior is so different when I
>> configure my SGE queue differently. See test "a" in the .tgz. When I just
>> run mpitest in mpi.sh and ask for exactly 5 slots (-pe orte 5-5), it works
>> if the queue is configured to use a single host. I see 1 MASTER and 4
>> SLAVES in qstat -g t, and I get the correct output.
>
> Fine. ("job_is_first_task true" in the PE according to this.)

Yes, I believe that job_is_first_task will need to be true for our
environment.
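
For reference, the PE definition I have in mind looks roughly like this (the
slot count and allocation rule are just placeholders; the two TRUE settings are
the ones that matter here):

$ qconf -sp orte
pe_name            orte
slots              999
allocation_rule    $fill_up
control_slaves     TRUE
job_is_first_task  TRUE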

>> If the queue is set to
>> use multiple hosts, the jobs hang in spawn/init, and I get errors
>> [grid-03.cisco.com][[19159,2],0][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint
>> _complete_connect] connect() to 192.168.122.1 failed: Connection refused
>> (111)
>
> What is the setting in SGE for:
>
> $ qconf -sconf
> ...
> qlogin_command builtin
> qlogin_daemon builtin
> rlogin_command builtin
> rlogin_daemon builtin
> rsh_command builtin
> rsh_daemon builtin
> If it's set to use ssh,

Nope. My output is the same as yours.
qlogin_command builtin
qlogin_daemon builtin
rlogin_command builtin
rlogin_daemon builtin
rsh_command builtin
rsh_daemon builtin

> But I wonder why it's working for some nodes?

I don't think that it's working on some nodes. In my other cases where it
hangs, I don't always get those "connection refused" errors.

I'm not sure, but the "connection refused" errors might be a red herring.
The machines' primary NICs are on a different private network (172.28.*.*).
The 192.168.122.1 address is actually the machine's own virbr0 device, which
the documentation says is a "xen interface used by Virtualization guest and
host oses for network communication."
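
If virbr0 is what the TCP connections are picking up, one thing I can try is to
keep Open MPI off that interface explicitly, e.g. something like this (the
btl_tcp_if_exclude/oob_tcp_if_exclude parameters are my reading of the Open MPI
docs, not something I've tested yet):

mpiexec --mca btl_tcp_if_exclude lo,virbr0 \
        --mca oob_tcp_if_exclude lo,virbr0 \
        -np 1 mpitest

Note that lo has to stay in the exclude list, since setting the parameter
replaces the default exclusion.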

> Are there custom configurations per node, and some are faulty?

I ran "qconf -sconf <machine>" for each host in my grid and got identical
output for each machine, like this:
$ qconf -sconf grid-03
#grid-03.cisco.com:
mailer /bin/mail
xterm /usr/bin/xterm

So, I think that the SGE config is the same across those machines.

>> 2. I guess I'm not sure how SGE is supposed to behave. Experiment "a" and
>> "b" were identical except that I changed -pe orte 5-5 to -pe orte 5-. The
>> single case works like before, and the multiple exec host case fails as
>> before. The difference is that qstat -g t shows additional SLAVEs that
>> don't seem to correspond to any jobs on the exec hosts. Are these SLAVEs
>> just slots that are reserved for my job but that I'm not using? If my job
>> will only use 5 slots, then I should set the SGE qsub job to ask for exactly
>> 5 with "-pe orte 5-5", right?
>
> Correct. The remaining ones are just unused. You could of course adjust your
> application to check how many slots were granted, and start slaves according to
> that information, so that all granted slots are used.

OK. That makes sense. In our intended uses, I believe that we'll know
exactly how many slots the application will need, and it will use the same
number of slots throughout the entire job.
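
That said, in case we ever do want to use whatever was granted, my
understanding is that the granted size can be read from the MPI::UNIVERSE_SIZE
attribute before spawning, roughly like this (sketch only; the slave command
name is made up):

// How many processes has the environment (here: SGE) granted us?
void *v;
int n_slaves = 0;
if (MPI::COMM_WORLD.Get_attr(MPI::UNIVERSE_SIZE, &v))
    n_slaves = *static_cast<int *>(v) - 1;   // minus the master itself

// Spawn one slave per remaining granted slot.
MPI::Intercomm children =
    MPI::COMM_WORLD.Spawn("mpitest_slave", MPI::ARGV_NULL, n_slaves,
                          MPI::INFO_NULL, 0);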

>> 3. Experiment "d" was similar to "b", but mpi.sh uses "mpiexec -np 1
>> mpitest" instead of running mpitest directly. Now both the single machine
>> queue and multiple machine queue work. So, mpiexec seems to make my
>> multi-machine configuration happier. In this case, I'm still using "-pe
>> orte 5-", and I'm still seeing the extra SLAVE slots granted in qstat -g t.
>
> Then case a) could show a bug in 1.5.4. For me both were working, but the

OK. That helps to explain my confusion. Our previous experiments (where I
was told that case (a) was working) were with Open MPI 1.4.x. Should I open
a bug for this issue?

> allocation was different. I only got the correct allocation with "mpiexec -np
> 1". In your case 4 were routed to one remote machine: the machine where the
> jobscript runs is usually the first entry in the machinefile, but on grid-03
> you got only one slot by accident, and so the 4 additional ones were routed to
> the next machine it found in the machinefile.

FYI, I think that this particular allocation was a misconfiguration on my
side. It looks like SGE thinks that grid-03 only has 1 slot available.
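
To confirm that, I plan to check the per-host slot counts in the queue
definition, e.g. something like this (all.q is just a placeholder for whatever
queue the PE is attached to):

$ qconf -sq all.q | grep slots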

>> 4. Based on "d", I thought that I could follow the approach in "a". That
>> is, for experiment "e", I used mpiexec -np 1, but I also used -pe orte 5-5.
>> I thought that this would make the multi-machine queue reserve only the 5
>> slots that I needed. The single machine queue works correctly, but now the
>> multi-machine case hangs with no errors. The output from qstat and pstree
>> are what I'd expect, but it seems to hang in Spawn_multiple and Init_thread.
>> I really expected this to work.
>
> Yes, this should work across multiple machines. And it's using `qrsh -inherit
> ...` so it's failing somewhere in Open MPI - is it working with 1.4.4?

I'm not sure. We no longer have our 1.4 test environment, so I'm in the
process of building that now. I'll let you know once I have a chance to run
that experiment.

Thanks,
---Tom