
Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] mpirun hangs
From: Maciej Kazulak (kazulakm_at_[hidden])
Date: 2009-01-06 15:11:29


2009/1/6 Ralph Castain <rhc_at_[hidden]>

>
> On Jan 5, 2009, at 5:19 PM, Jeff Squyres wrote:
>
>> On Jan 5, 2009, at 5:01 PM, Maciej Kazulak wrote:
>>
>>> Interesting though. I thought in such a simple scenario shared memory
>>> would be used for IPC (or whatever's fastest) . But nope. Even with one
>>> process still it wants to use TCP/IP to communicate between mpirun and
>>> orted.
>>>
>>
>> Correct -- we only have TCP enabled for MPI process <--> orted
>> communication. There are several reasons why; the simplest is that this is
>> our "out of band" channel and it is only used to setup and tear down the
>> job. As such, we don't care that it's a little slower than other possible
>> channels (such as sm). MPI traffic will use shmem, OpenFabrics-based
>> networks, Myrinet, ...etc. But not MPI process <--> orted communication.
>>
>>> What's even more surprising to me is that it won't use loopback for that. Hence
>>> my maybe a little bit over-restrictive iptables rules were the problem. I
>>> allowed traffic from 127.0.0.1 to 127.0.0.1 on lo but not from <eth0_addr>
>>> to <eth0_addr> on eth0 and both processes ended up waiting for IO.
>>>
>>> Can I somehow configure it to use something other than TCP/IP here? Or at
>>> least switch it to loopback?
>>>
>>
>> I don't remember how it works in the v1.2 series offhand; I think it's
>> different in the v1.3 series (where all MPI processes *only* talk to the
>> local orted, vs. MPI processes making direct TCP connections back to mpirun
>> and any other MPI process with which it needs to bootstrap other
>> communication channels). I'm *guessing* that the MPI process <--> orted
>> communication either uses a named unix socket or TCP loopback. Ralph -- can
>> you explain the details?
>>
>
> In the 1.2 series, mpirun spawns a local orted to handle all local procs.
> The code that discovers local interfaces specifically ignores any interfaces
> that are not up or are just local loopbacks. My guess is that the person who
> wrote that code long, long ago was assuming that the sole purpose was to
> talk to remote nodes, not to loop back onto yourself.
>
> I imagine it could be changed to include loopback, but I would first need
> to work with other developers to ensure there are no unexpected consequences
> in doing so. As it stands, since no TCP interface is found, mpirun fails.
>
> In the 1.3 series, mpirun handles the local procs itself. Thus, this issue
> does not appear and things run just fine.
>
>
> Ralph
>
>
>>
>> --
>> Jeff Squyres
>> Cisco Systems
>>
>> _______________________________________________
>> users mailing list
>> users_at_[hidden]
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>
>

Thanks for the answer. I think I'll just update my firewall rules for now and
wait for the 1.3 release.
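For anyone hitting the same symptom: on Linux, traffic a host sends to its own
eth0 address is actually delivered over the loopback interface (with the eth0
address as source and destination), not over eth0 itself. So a rule that only
accepts 127.0.0.1 <-> 127.0.0.1 on lo is too narrow. A minimal sketch of the
firewall change (assuming an iptables setup like the one described above) is:

```shell
# Host-local traffic addressed to the machine's own eth0 address is
# delivered over the loopback interface, so accept everything on lo,
# not just 127.0.0.1 <-> 127.0.0.1. This covers the mpirun <-> orted
# out-of-band TCP channel discussed in this thread.
iptables -A INPUT  -i lo -j ACCEPT
iptables -A OUTPUT -o lo -j ACCEPT
```

Separately, the out-of-band TCP component can be pinned to particular
interfaces with an MCA parameter such as `--mca oob_tcp_if_include eth0`
(where supported), though per the discussion above the 1.2 series skips
loopback interfaces entirely when discovering candidates.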