
Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Spawn_multiple with tight integration to SGE grid engine
From: Reuti (reuti_at_[hidden])
Date: 2012-02-06 17:10:19


On 06.02.2012, at 22:28, Tom Bryan wrote:

> On 2/6/12 8:14 AM, "Reuti" <reuti_at_[hidden]> wrote:
>
>>> If I need MPI_THREAD_MULTIPLE, and openmpi is compiled with thread support,
>>> it's not clear to me whether MPI::Init_Thread() and
>>> MPI::Inint_Thread(MPI::THREAD_MULTIPLE) would give me the same behavior from
>>> Open MPI.
>>
>> If you need thread support, you will need MPI::Init_thread, and it needs one
>> argument (or three).
>
> Sorry, typo on my side. I meant to compare
> MPI::Init_thread(MPI::THREAD_MULTIPLE) and MPI::Init(). I think that your
> first reply mentioned replacing MPI::Init_thread by MPI::Init.

Yes. If you don't need threads, I don't see why it would add anything to the environment that you could make use of.
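
Just to illustrate the difference, a minimal sketch (the check of the returned thread level is my addition; whether MPI::THREAD_MULTIPLE is actually granted depends on how Open MPI was compiled):

    #include <mpi.h>
    #include <cstdlib>

    int main(int argc, char* argv[])
    {
        // Ask for full thread support; the C++ binding returns the
        // level the library actually provides.
        int provided = MPI::Init_thread(argc, argv, MPI::THREAD_MULTIPLE);

        if (provided < MPI::THREAD_MULTIPLE) {
            // The library wasn't built with full thread support:
            // better to stop than to run outside the standard's guarantees.
            MPI::COMM_WORLD.Abort(EXIT_FAILURE);
        }

        // ... threaded MPI work here ...

        MPI::Finalize();
        return 0;
    }

With a plain MPI::Init() you only get whatever thread level the library defaults to, so calls from additional threads would not be covered by the standard.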

>>> <snip>
>>
>> What is the setting in SGE for:
>>
>> $ qconf -sconf
>> ...
>> qlogin_command builtin
>> qlogin_daemon builtin
>> rlogin_command builtin
>> rlogin_daemon builtin
>> rsh_command builtin
>> rsh_daemon builtin
>> If it's set to use ssh,
>
> Nope. My output is the same as yours.
> qlogin_command builtin
> qlogin_daemon builtin
> rlogin_command builtin
> rlogin_daemon builtin
> rsh_command builtin
> rsh_daemon builtin

Fine.
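
(Just for contrast: if it were configured for ssh, those entries would typically look something like this instead - not your case, as your output shows:

    rsh_command /usr/bin/ssh
    rsh_daemon /usr/sbin/sshd -i

With "builtin" none of this applies, and `qrsh -inherit ...` goes through SGE's own mechanism.)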

>> But I wonder why it's working for some nodes?
>
> I don't think that it's working on some nodes. In my other cases where it
> hangs, I don't always get those "connection refused" errors.

If "builtin" is used, there is no reason to get "connection refused". The error message from Open MPI should be different in case of a closed firewall IIRC.

> I'm not sure, but the "connection refused" errors might be a red herring.
> The machines' primary NICs are on a different private network (172.28.*.*).
> The 192.168.122.1 address is actually the machine's own virbr0 device, which
> the documentation says is a "xen interface used by Virtualization guest and
> host oses for network communication."

By default Open MPI uses the primary interface for its communication, AFAIK.
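
If the virbr0 interface is indeed confusing things, you could try to exclude it explicitly via the TCP MCA parameters (just an idea on my side, not a confirmed fix; note that "lo" must be listed too once you override the exclude list):

    $ mpiexec --mca btl_tcp_if_exclude virbr0,lo \
              --mca oob_tcp_if_exclude virbr0,lo \
              -np 4 ./mpitest

Alternatively, btl_tcp_if_include with the name of the NIC on your 172.28.*.* network would pin the communication to the primary interfaces.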

>> Are there custom configurations per node, and are some of them faulty:
>
> I did a qconf -sconf machine for each host in my grid. I get identical
> output like this for each machine.
> $ qconf -sconf grid-03
> #grid-03.cisco.com:
> mailer /bin/mail
> xterm /usr/bin/xterm
>
> So, I think that the SGE config is the same across those machines.

Yes, ok. Then it's fine.

>>> <snip>
>>> 3. Experiment "d" was similar to "b", but mpi.sh uses "mpiexec -np 1
>>> mpitest" instead of running mpitest directly. Now both the single machine
>>> queue and multiple machine queue work. So, mpiexec seems to make my
>>> multi-machine configuration happier. In this case, I'm still using "-pe
>>> orte 5-", and I'm still seeing the extra SLAVE slots granted in qstat -g t.
>>
>> Then case a) could show a bug in 1.5.4. For me both were working, but the
>
> OK. That helps to explain my confusion. Our previous experiments (where I
> was told that case (a) was working) were with Open MPI 1.4.x. Should I open
> a bug for this issue?

I'm not sure, as it's working for me. Maybe it really has something to do with the virtual machine setup.
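
For comparison, the job script I have in mind for case d) looks roughly like this (a sketch only; the PE name "orte", the slot range "5-" and the binary name mpitest are taken from your mails, the rest is my assumption):

    #!/bin/sh
    #$ -pe orte 5-
    #$ -cwd -j y
    # With tight integration, mpiexec gets the granted slots from SGE
    # and starts the remote daemons via `qrsh -inherit`.
    mpiexec -np 1 ./mpitest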

>> Yes, this should work across multiple machines. And it's using `qrsh -inherit
>> ...` so it's failing somewhere in Open MPI - is it working with 1.4.4?
>
> I'm not sure. We no longer have our 1.4 test environment, so I'm in the
> process of building that now. I'll let you know once I have a chance to run
> that experiment.

Ok.

-- Reuti