
Open MPI User's Mailing List Archives


From: Brian Barrett (brbarret_at_[hidden])
Date: 2006-03-02 18:50:35


On Mar 2, 2006, at 3:56 PM, Xiaoning (David) Yang wrote:

> I installed Open MPI on two Mac G5s, one with 2 cpus and the other with 4
> cpus. I can run jobs on either of the machines fine. But when I ran a job
> on machine one across the two nodes, all the processes I requested would
> start, but they then seemed to hang and I got the error message:
>
> [0,1,1][btl_tcp_endpoint.c:559:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=60
> [0,1,0][btl_tcp_endpoint.c:559:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=60
>
> When I ran the job on machine two across the nodes, only processes on this
> machine would start and then hung. No processes would start on machine one
> and I didn't get any messages. In both cases, I have to Ctrl+C to kill the
> jobs. Any idea what was wrong? Thanks a lot!

errno 60 is ETIMEDOUT on Mac OS X, which means that the connect() timed
out before the remote side answered. The failure in the other direction
was probably a similar problem: something strange is going on with the
routing on the two nodes that's causing OMPI to get confused. Do your G5
machines have network adapters other than the primary GigE cards
(wireless, a second GigE card, a FireWire TCP stack) by any chance?
There's an issue where multiple network interfaces cause the TCP btl to
behave badly like this. We think we have it fixed in the latest 1.0.2
pre-release tarball of Open MPI, so it might help to upgrade to that
version:

   http://www.open-mpi.org/software/ompi/v1.0/
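
As a rough aid for checking the two points above on each node, here is a
minimal Python sketch. It is not part of Open MPI; the helper names
describe_errno and list_interfaces are made up for illustration. It
decodes an errno value on the machine where it runs (errno numbering is
platform-specific, so 60 only means ETIMEDOUT on Mac OS X and other BSDs)
and lists the network interfaces the OS reports, which should show whether
there is more than one candidate for the TCP btl to pick from.

```python
import errno
import os
import socket


def describe_errno(code):
    """Return 'NAME: message' for an errno value on the current platform."""
    # errno.errorcode maps numeric values to symbolic names such as ETIMEDOUT.
    name = errno.errorcode.get(code, "UNKNOWN")
    return f"{name}: {os.strerror(code)}"


def list_interfaces():
    """Return the names of the network interfaces the OS reports."""
    # socket.if_nameindex() yields (index, name) pairs, e.g. (1, 'lo0').
    return [name for _index, name in socket.if_nameindex()]


if __name__ == "__main__":
    # Run this on the Mac itself: the meaning of 60 differs on other systems.
    print(describe_errno(60))
    print(list_interfaces())
```

If the interface list shows more than the loopback and the primary GigE
card on either machine, that would be consistent with the multiple-adapter
issue described above.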

Brian

-- 
   Brian Barrett
   Open MPI developer
   http://www.open-mpi.org/