Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] mca_oob_tcp_peer_try_connect problem
From: Bob Soliday (soliday_at_[hidden])
Date: 2007-11-29 12:12:37


Jeff Squyres (jsquyres) wrote:
> Interesting. Would you mind sharing your patch?
>
> -----Original Message-----
> From: users-bounces_at_[hidden] [mailto:users-bounces_at_[hidden]] On
> Behalf Of Bob Soliday
> Sent: Thursday, November 29, 2007 11:35 AM
> To: Ralph H Castain
> Cc: Open MPI Users <users_at_[hidden]>
> Subject: Re: [OMPI users] mca_oob_tcp_peer_try_connect problem
>
> I solved the problem by making a change to
> orte/mca/oob/tcp/oob_tcp_peer.c
>
> On Linux 2.6, I have read that after a failed connect system call the
> next call to connect can immediately return ECONNABORTED without
> actually trying to connect; the call after that will then work. So I
> changed mca_oob_tcp_peer_try_connect to test for ECONNABORTED and then
> call connect again. The hello_c example program is now working.
>
> I don't think this has solved the underlying cause of why connect is
> failing in the first place, but at least now I can move on to the next
> step. My best guess at the moment is that it is using eth0 initially
> when I want it to use eth1. This fails, and then when it moves on to
> eth1 I run into the "can't call connect after it just failed" bug.
>
> --Bob
>
>

I changed oob_tcp_peer.c at line 289 from:

/* start the connect - will likely fail with EINPROGRESS */
if(connect(peer->peer_sd,
     (struct sockaddr*)&inaddr, sizeof(struct sockaddr_in)) < 0) {
   /* non-blocking so wait for completion */
   if(opal_socket_errno == EINPROGRESS || opal_socket_errno == EWOULDBLOCK) {
     opal_event_add(&peer->peer_send_event, 0);
     return ORTE_SUCCESS;
   }
   opal_output(0, "[%lu,%lu,%lu]-[%lu,%lu,%lu] mca_oob_tcp_peer_try_connect: "
               "connect to %s:%d failed: %s (%d)",
               ORTE_NAME_ARGS(orte_process_info.my_name),
               ORTE_NAME_ARGS(&(peer->peer_name)),
               inet_ntoa(inaddr.sin_addr),
               ntohs(inaddr.sin_port),
               strerror(opal_socket_errno),
               opal_socket_errno);
   continue;
}

to:

/* start the connect - will likely fail with EINPROGRESS */
if(connect(peer->peer_sd,
     (struct sockaddr*)&inaddr, sizeof(struct sockaddr_in)) < 0) {
   if (opal_socket_errno == ECONNABORTED) {
     /* the previous connect failed, so this one was aborted without
        being attempted - call connect once more */
     if(connect(peer->peer_sd,
         (struct sockaddr*)&inaddr, sizeof(struct sockaddr_in)) < 0) {
       /* non-blocking so wait for completion */
       if(opal_socket_errno == EINPROGRESS || opal_socket_errno == EWOULDBLOCK) {
         opal_event_add(&peer->peer_send_event, 0);
         return ORTE_SUCCESS;
       }
       opal_output(0, "[%lu,%lu,%lu]-[%lu,%lu,%lu] mca_oob_tcp_peer_try_connect: "
                   "connect to %s:%d failed: %s (%d)",
                   ORTE_NAME_ARGS(orte_process_info.my_name),
                   ORTE_NAME_ARGS(&(peer->peer_name)),
                   inet_ntoa(inaddr.sin_addr),
                   ntohs(inaddr.sin_port),
                   strerror(opal_socket_errno),
                   opal_socket_errno);
       continue;
     }
   } else {
     /* non-blocking so wait for completion */
     if(opal_socket_errno == EINPROGRESS || opal_socket_errno == EWOULDBLOCK) {
       opal_event_add(&peer->peer_send_event, 0);
       return ORTE_SUCCESS;
     }
     opal_output(0, "[%lu,%lu,%lu]-[%lu,%lu,%lu] mca_oob_tcp_peer_try_connect: "
                 "connect to %s:%d failed: %s (%d)",
                 ORTE_NAME_ARGS(orte_process_info.my_name),
                 ORTE_NAME_ARGS(&(peer->peer_name)),
                 inet_ntoa(inaddr.sin_addr),
                 ntohs(inaddr.sin_port),
                 strerror(opal_socket_errno),
                 opal_socket_errno);
     continue;
   }
}
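
The patch above repeats the same error handling in both branches. Below is a minimal sketch of the same retry-once idea with the duplication factored out. It is not Open MPI code: the names connect_fn, connect_retry_once, connect_in_progress, and stub_connect are hypothetical, it uses plain errno in place of opal_socket_errno, and it assumes that a single retry after ECONNABORTED is sufficient, as described in the message.

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/socket.h>

/* Same shape as connect(2), so a stub can stand in for it during testing. */
typedef int (*connect_fn)(int, const struct sockaddr *, socklen_t);

/* Hypothetical helper: call do_connect once, and if it fails with
   ECONNABORTED, call it exactly one more time. Returns the result of
   the last call; errno is left as set by that call. */
int connect_retry_once(connect_fn do_connect, int sd,
                       const struct sockaddr *addr, socklen_t len)
{
    int rc = do_connect(sd, addr, len);
    if (rc < 0 && errno == ECONNABORTED) {
        rc = do_connect(sd, addr, len);
    }
    return rc;
}

/* Returns 1 if rc/errno describe a non-blocking connect that is still
   in progress (the caller should wait for completion), 0 otherwise. */
int connect_in_progress(int rc)
{
    return rc < 0 && (errno == EINPROGRESS || errno == EWOULDBLOCK);
}

/* Illustration-only stub: the first call fails with ECONNABORTED, later
   calls report EINPROGRESS as a non-blocking connect normally would. */
int stub_calls = 0;

int stub_connect(int sd, const struct sockaddr *addr, socklen_t len)
{
    (void)sd; (void)addr; (void)len;
    errno = (stub_calls++ == 0) ? ECONNABORTED : EINPROGRESS;
    return -1;
}
```

In real use, connect itself would be passed as the first argument, and the caller would register for the completion event when connect_in_progress returns 1, or log the error and move on to the next address otherwise. The function-pointer parameter is only there so the retry logic can be exercised without a network.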