
Open MPI User's Mailing List Archives


From: Xiaoning (David) Yang (xyang_at_[hidden])
Date: 2006-03-03 11:33:19


Brian,

Thank you so much! It is working now.

David

***** Correspondence *****

> From: Brian Barrett <brbarret_at_[hidden]>
> Reply-To: Open MPI Users <users_at_[hidden]>
> Date: Thu, 2 Mar 2006 20:32:25 -0500
> To: Open MPI Users <users_at_[hidden]>
> Subject: Re: [OMPI users] Problem running open mpi across nodes.
>
> On Mar 2, 2006, at 8:19 PM, Xiaoning (David) Yang wrote:
>
>> My G5s only have one ethernet card each and are connected to the
>> network
>> through those cards. I upgraded to Open MPI 1.0.2. The problem
>> remains the
>> same.
>>
>> A somewhat more detailed description of the problem follows. When I
>> run jobs
>> from the 4-cpu machine, specifying 6 processes, orted, orterun and 4
>> processes will start on this machine. orted and 2 processes will
>> start on
>> the 2-cpu machine. The processes hang for a while and then I get
>> the error
>> message. After that, the processes still hang. If I Ctrl+c, all
>> processes
>> on both machines including both orteds and the orterun will quit.
>> If I run
>> jobs from the 2-cpu machine, specifying 6 processes, orted, orterun
>> and 2
>> processes will start on this machine. Only orted will start on the
>> 4-cpu
>> machine and no processes will start. The job then hangs and I don't
>> get any
>> response. If I Ctrl+c, orted, orterun and the 2 processes on the 2-cpu
>> machine will quit. But orted on the 4-cpu machine will not quit.
>>
>> Does this have anything to do with the IP addresses? The two machines'
>> addresses, xxx.xxx.aaa.bbb and xxx.xxx.cc.dd, differ not only in the last
>> octet (bbb vs. dd) but also in the third octet (aaa vs. cc).
>
> Well, you can't guess right all the time :). But I think you gave
> enough information for the next thing to try. It sounds like there
> might be a firewall running on the 2 process machine. When you
> run orterun on the 4 cpu machine, the remote orted can clearly connect
> back to orterun because it is getting the process startup and
> shutdown messages. Things only fail when the MPI processes on the 4
> cpu machine try to connect to the other processes. On the other
> hand, when you start on the 2 cpu machine, the orted on the 4 cpu
> machine starts but can't even connect back to orterun to find out
> what processes to start, nor can it get the shutdown request. So you
> get a hang.
>
> If you go into System Preferences -> Sharing, make sure that the
> firewall is turned off in the "firewall" tab. Hopefully, that will
> do the trick.
>
> Brian
>
>
>
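
For readers following along in the archives: once the firewall is off, a
quick way to confirm that the two G5s can actually reach each other over
TCP is a small point-to-point test in which every rank talks to rank 0.
The sketch below is not from this thread, and the file name, hostfile, and
launch line in its header comment are assumptions, not the poster's setup.

/* Minimal cross-node connectivity sketch (not from this thread): each
 * rank reports its host name, then every non-root rank sends its rank
 * number to rank 0, so the TCP connections between the two machines
 * actually get exercised. Assumed build/launch (file name, hostfile
 * name, and process count are placeholders):
 *   mpicc conntest.c -o conntest
 *   orterun -np 6 --hostfile hosts ./conntest
 */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, token, len, i;
    char host[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(host, &len);
    printf("rank %d of %d running on %s\n", rank, size, host);

    if (rank != 0) {
        /* Non-root ranks send their rank number to rank 0. */
        MPI_Send(&rank, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
    } else {
        for (i = 1; i < size; i++) {
            MPI_Recv(&token, 1, MPI_INT, i, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("rank 0 got a token from rank %d\n", token);
        }
    }

    MPI_Finalize();
    return 0;
}

If the firewall (or routing) problem is still present, the MPI_Send /
MPI_Recv exchange is where this test will hang or report the connect()
failure.
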
>>> From: Brian Barrett <brbarret_at_[hidden]>
>>> Reply-To: Open MPI Users <users_at_[hidden]>
>>> Date: Thu, 2 Mar 2006 18:50:35 -0500
>>> To: Open MPI Users <users_at_[hidden]>
>>> Subject: Re: [OMPI users] Problem running open mpi across nodes.
>>>
>>> On Mar 2, 2006, at 3:56 PM, Xiaoning (David) Yang wrote:
>>>
>>>> I installed Open MPI on two Mac G5s, one with 2 cpus and the other
>>>> with 4
>>>> cpus. I can run jobs on either of the machines fine. But when I ran
>>>> a job on
>>>> machine one across the two nodes, all the processes I requested
>>>> would start,
>>>> but they then seemed to hang and I got the error message:
>>>>
>>>> [0,1,1][btl_tcp_endpoint.c:559:mca_btl_tcp_endpoint_complete_connect]
>>>> connect() failed with errno=60
>>>> [0,1,0][btl_tcp_endpoint.c:559:mca_btl_tcp_endpoint_complete_connect]
>>>> connect() failed with errno=60
>>>>
>>>> When I ran the job on machine two across the nodes, only the processes
>>>> on that machine would start, and they then hung. No processes would
>>>> start on machine one and I didn't get any messages. In both cases, I had
>>>> to Ctrl+C to kill the jobs. Any idea what was wrong? Thanks a lot!
>>>
>>> errno 60 is ETIMEDOUT, which means that the connect() timed out
>>> before the remote side answered. The other way was probably a
>>> similar problem - there's something strange going on with the routing
>>> on the two nodes that's causing OMPI to get confused. Do your G5
>>> machines have ethernet adapters other than the primary GigE cards
>>> (wireless, a second GigE card, a Firewire TCP stack) by any chance?
>>> There's a known issue in situations with multiple ethernet cards that
>>> causes the TCP btl to behave badly like this. We think we
>>> have it fixed in the latest 1.0.2 pre-release tarball of Open MPI, so
>>> it might help to upgrade to that version:
>>>
>>> http://www.open-mpi.org/software/ompi/v1.0/
>>>
>>> Brian
>>>
>>> --
>>> Brian Barrett
>>> Open MPI developer
>>> http://www.open-mpi.org/
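
A side note for readers hitting the same "connect() failed with errno=60"
output: errno numbers are platform specific, so a few lines of C run on the
machine that produced the error will confirm what that number means there.
The sketch below is not from this thread; it just prints the local meaning
of 60 and the local value of ETIMEDOUT (on Mac OS X the two should match,
as Brian says; on Linux ETIMEDOUT is a different number).

/* Minimal sketch: decode an errno value such as the "errno=60" reported
 * by the TCP BTL, on the platform where the error occurred. */
#include <errno.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    int code = 60;  /* value taken from the BTL error message */
    printf("errno %d on this system means: %s\n", code, strerror(code));
    printf("ETIMEDOUT on this system is %d\n", ETIMEDOUT);
    return 0;
}
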
>>>
>>>
>>
>>
>