Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Spawn_multiple with tight integration to SGE grid engine
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2012-02-10 11:38:54


Tom and I talked more about this off list, and I eventually logged in to his cluster to see what I could see.

The issue turned out to be not related to SGE or THREAD_MULTIPLE at all. The issue was that RHEL6, by default, activated a virtualization IP interface on all of Tom's nodes. All nodes had a local IP interface in the 192.168.1.x/24 subnet, but that address was only used to communicate to the local Xen interface.

But OMPI saw the interface, saw that every MPI process had an address in that IP subnet, and assumed that it could be used for MPI communication.
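A quick way to spot this kind of interface is to list the IPv4 addresses on each node and look for a subnet that shows up everywhere but is not actually routed between nodes. The following is only a sketch: the helper name `list_ipv4_ifaces` is made up for illustration, and it assumes the iproute2 `ip` tool and its one-line-per-address (`-o`) output format.

```shell
# Print "interface address/prefix" for each IPv4 address found in the
# one-line-per-address output of `ip -o -4 addr show` (read from stdin).
# list_ipv4_ifaces is a hypothetical helper name, not part of Open MPI.
list_ipv4_ifaces() {
    awk '{ print $2, $4 }'
}

# Typical use on each node (assumes iproute2 is installed):
#   ip -o -4 addr show | list_ipv4_ifaces
```

An interface that reports an address in the same subnet on every node, yet only talks to something local, is a candidate for the exclude list.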

Nope!

The simple solution here was to add the RHEL6 Xen virtualization device (virbr0) to OMPI's exclude list, like this:

    mpirun --mca btl_tcp_if_exclude lo,virbr0 \
        --mca oob_tcp_if_exclude lo,virbr0 ...

Then everything worked fine.
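If you would rather not repeat those options on every command line, the same MCA parameters can be set through the environment, since Open MPI reads variables named `OMPI_MCA_<param>`. A minimal sketch, assuming the interface names are the same on all nodes as they were here:

```shell
# Same effect as the --mca options above: Open MPI picks up MCA
# parameters from environment variables prefixed with OMPI_MCA_.
export OMPI_MCA_btl_tcp_if_exclude=lo,virbr0
export OMPI_MCA_oob_tcp_if_exclude=lo,virbr0
```

The same settings can also be written as `name = value` lines in `$HOME/.openmpi/mca-params.conf`. Whether that is preferable to the command line depends on how your jobs are launched.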

On Feb 9, 2012, at 4:19 PM, Reuti wrote:

> On 08.02.2012 at 22:52, Tom Bryan wrote:
>
>> <snip>
>>>>> Yes, this should work across multiple machines. And it's using
>>>>> `qrsh -inherit ...` so it's failing somewhere in Open MPI - is it
>>>>> working with 1.4.4?
>>>>
>>>> I'm not sure. We no longer have our 1.4 test environment, so I'm in the
>>>> process of building that now. I'll let you know once I have a chance to run
>>>> that experiment.
>>
>> You said that both of these cases worked for you in 1.4. Were you running a
>> modified version that did not use THREAD_MULTIPLE? I ask because I'm
>> getting worse errors in 1.4. I'm using the same code that was working (in
>> some cases) with 1.5.4.
>>
>> I built 1.4.4 with (among other options)
>> --with-threads=posix --enable-mpi-threads
>
> ./configure --prefix=$HOME/local/openmpi-1.4.4-default-thread --with-sge --with-threads=posix --enable-mpi-threads
>
> No problems even with THREAD_MULTIPLE.
>
> The only difference, as stated, is that in singleton mode I get one or more additional lines (it looks like one per slave host, but not always; a race condition?):
>
> [pc15370:31390] [[24201,0],1] routed:binomial: Connection to lifeline [[24201,0],0] lost
>
>> <snip>
>> ompi_mpi_init: orte_init failed
>> --> Returned "Data unpack would read past end of buffer" (-26) instead of
>> "Success" (0)
>> --------------------------------------------------------------------------
>> *** The MPI_Init_thread() function was called before MPI_INIT was invoked.
>> *** This is disallowed by the MPI standard.
>> *** Your MPI job will now abort.
>
> Interesting error message, since it's not true that this is disallowed.
>
> -- Reuti

-- 
Jeff Squyres
jsquyres_at_[hidden]
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/