
Open MPI User's Mailing List Archives


From: Brian Barrett (brbarret_at_[hidden])
Date: 2006-03-02 20:32:25


On Mar 2, 2006, at 8:19 PM, Xiaoning (David) Yang wrote:

> My G5s only have one ethernet card each and are connected to the
> network through those cards. I upgraded to Open MPI 1.0.2. The
> problem remains the same.
>
> A somewhat detailed description of the problem is this. When I run
> jobs from the 4-cpu machine, specifying 6 processes, orted, orterun
> and 4 processes will start on this machine, and orted and 2
> processes will start on the 2-cpu machine. The processes hang for a
> while and then I get the error message. After that, the processes
> still hang. If I Ctrl+C, all processes on both machines, including
> both orteds and the orterun, will quit. If I run jobs from the
> 2-cpu machine, specifying 6 processes, orted, orterun and 2
> processes will start on this machine. Only orted will start on the
> 4-cpu machine and no processes will start. The job then hangs and I
> don't get any response. If I Ctrl+C, orted, orterun and the 2
> processes on the 2-cpu machine will quit, but orted on the 4-cpu
> machine will not.
>
> Does this have anything to do with the IP addresses? The IP address
> xxx.xxx.aaa.bbb for one machine is different from the IP address
> xxx.xxx.cc.dd for the other machine in that not only is bbb not dd,
> but aaa is also not cc.

Well, you can't guess right all the time :). But I think you gave
enough information for the next thing to try. It sounds like there
might be a firewall running on the 2-process machine. When you run
orterun on the 4-cpu machine, the remote orted can clearly connect
back to orterun, because it is getting the process startup and
shutdown messages. Things only fail when the MPI processes on the
4-cpu machine try to connect to the other processes. On the other
hand, when you start on the 2-cpu machine, the orted on the 4-cpu
machine starts but can't even connect back to orterun to find out
which processes to start, nor can it get the shutdown request. So
you get a hang.

If you go into System Preferences -> Sharing, make sure the firewall
is turned off on the "Firewall" tab. Hopefully, that will do the
trick.

Brian
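
[A sketch of one more thing to try if turning off the firewall does
not help: pin the TCP BTL to the built-in wired interface so that
routing confusion between the two machines cannot enter the picture.
The interface name en0 and the program name ./a.out below are
assumptions; substitute whatever your machines actually use:

   mpirun -np 6 -mca btl tcp,self -mca btl_tcp_if_include en0 ./a.out

The btl_tcp_if_include MCA parameter (and its counterpart,
btl_tcp_if_exclude) restricts which interfaces the TCP BTL is allowed
to use. A minimal cross-node test program is also sketched at the end
of this message, after the quoted thread.]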

>> From: Brian Barrett <brbarret_at_[hidden]>
>> Reply-To: Open MPI Users <users_at_[hidden]>
>> Date: Thu, 2 Mar 2006 18:50:35 -0500
>> To: Open MPI Users <users_at_[hidden]>
>> Subject: Re: [OMPI users] Problem running open mpi across nodes.
>>
>> On Mar 2, 2006, at 3:56 PM, Xiaoning (David) Yang wrote:
>>
>>> I installed Open MPI on two Mac G5s, one with 2 cpus and the other
>>> with 4 cpus. I can run jobs on either of the machines fine. But
>>> when I ran a job on machine one across the two nodes, all the
>>> processes I requested would start, but they then seemed to hang
>>> and I got the error message:
>>>
>>> [0,1,1][btl_tcp_endpoint.c:559:mca_btl_tcp_endpoint_complete_connect]
>>> connect() failed with errno=60
>>> [0,1,0][btl_tcp_endpoint.c:559:mca_btl_tcp_endpoint_complete_connect]
>>> connect() failed with errno=60
>>>
>>> When I ran the job on machine two across the nodes, only the
>>> processes on this machine would start, and then they hung. No
>>> processes would start on machine one and I didn't get any
>>> messages. In both cases, I had to Ctrl+C to kill the jobs. Any
>>> idea what was wrong? Thanks a lot!
>>
>> errno 60 is ETIMEDOUT, which means that the connect() timed out
>> before the remote side answered. The other way was probably a
>> similar problem - there's something strange going on with the routing
>> on the two nodes that's causing OMPI to get confused. Do your G5
>> machines have ethernet adapters other than the primary GigE cards
>> (wireless, a second GigE card, a Firewire TCP stack) by any chance?
>> There's an issue in situations where there are multiple ethernet
>> cards that causes the TCP btl to behave badly like this. We think we
>> have it fixed in the latest 1.0.2 pre-release tarball of Open MPI, so
>> it might help to upgrade to that version:
>>
>> http://www.open-mpi.org/software/ompi/v1.0/
>>
>> Brian
>>
>> --
>> Brian Barrett
>> Open MPI developer
>> http://www.open-mpi.org/
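
[Regarding the errno=60 (ETIMEDOUT) failures quoted above, a minimal
test program along the following lines can help confirm whether a
firewall or routing change has fixed things. It is only a sketch; the
file name ping.c and the hostfile name are assumptions, not part of
this thread. Rank 0 exchanges one integer with every other rank,
which forces the TCP BTL to open connections between the nodes, the
exact step that was timing out above.

   /* ping.c: exchange one integer between rank 0 and every other
    * rank so that the TCP BTL has to connect across the nodes. */
   #include <stdio.h>
   #include <mpi.h>

   int main(int argc, char **argv)
   {
       int rank, size, i, token;
       MPI_Status status;

       MPI_Init(&argc, &argv);
       MPI_Comm_rank(MPI_COMM_WORLD, &rank);
       MPI_Comm_size(MPI_COMM_WORLD, &size);

       if (rank == 0) {
           for (i = 1; i < size; i++) {
               token = i;
               MPI_Send(&token, 1, MPI_INT, i, 0, MPI_COMM_WORLD);
               MPI_Recv(&token, 1, MPI_INT, i, 0, MPI_COMM_WORLD,
                        &status);
               printf("rank 0 <-> rank %d ok\n", i);
           }
       } else {
           /* Every non-zero rank answers rank 0 once. */
           MPI_Recv(&token, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
           MPI_Send(&token, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
       }

       MPI_Finalize();
       return 0;
   }

Compile and run it across both machines, for example:

   mpicc ping.c -o ping
   mpirun -np 6 -hostfile hosts ./ping

If the firewall or interface problem is still present, the run will
hang (or report connect() failures) before all of the "ok" lines
appear.]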