Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] mca_oob_tcp_peer_try_connect problem
From: Bob Soliday (soliday_at_[hidden])
Date: 2007-11-29 12:12:37


Jeff Squyres (jsquyres) wrote:
> Interesting. Would you mind sharing your patch?
>
> -----Original Message-----
> From: users-bounces_at_[hidden] [mailto:users-bounces_at_[hidden]] On
> Behalf Of Bob Soliday
> Sent: Thursday, November 29, 2007 11:35 AM
> To: Ralph H Castain
> Cc: Open MPI Users <users_at_[hidden]>
> Subject: Re: [OMPI users] mca_oob_tcp_peer_try_connect problem
>
> I solved the problem by making a change to
> orte/mca/oob/tcp/oob_tcp_peer.c
>
> On Linux 2.6, I have read that after a failed connect() system call the
> next call to connect() can immediately return ECONNABORTED without
> actually trying to connect; the call after that will then work. So I
> changed mca_oob_tcp_peer_try_connect to test for ECONNABORTED and then
> call connect() again. The hello_c example program is now working.
>
> I don't think this has solved the underlying cause of why connect() is
> failing in the first place, but at least now I can move on to the next
> step. My best guess at the moment is that it is using eth0 initially
> when I want it to use eth1. This fails, and then when it moves on to
> eth1 I run into the "can't call connect after it just failed" bug.
>
> --Bob
>
>

I changed oob_tcp_peer.c at line 289 from:

/* start the connect - will likely fail with EINPROGRESS */
if(connect(peer->peer_sd,
     (struct sockaddr*)&inaddr, sizeof(struct sockaddr_in)) < 0) {
   /* non-blocking so wait for completion */
   if(opal_socket_errno == EINPROGRESS || opal_socket_errno == EWOULDBLOCK) {
     opal_event_add(&peer->peer_send_event, 0);
     return ORTE_SUCCESS;
   }
   opal_output(0, "[%lu,%lu,%lu]-[%lu,%lu,%lu] mca_oob_tcp_peer_try_connect: "
               "connect to %s:%d failed: %s (%d)",
               ORTE_NAME_ARGS(orte_process_info.my_name),
               ORTE_NAME_ARGS(&(peer->peer_name)),
               inet_ntoa(inaddr.sin_addr),
               ntohs(inaddr.sin_port),
               strerror(opal_socket_errno),
               opal_socket_errno);
   continue;
}

to:

/* start the connect - will likely fail with EINPROGRESS */
if(connect(peer->peer_sd,
     (struct sockaddr*)&inaddr, sizeof(struct sockaddr_in)) < 0) {
   /* workaround: a connect() right after a failed one may return
      ECONNABORTED without trying, so call connect() once more */
   if (opal_socket_errno == ECONNABORTED) {
     if(connect(peer->peer_sd,
         (struct sockaddr*)&inaddr, sizeof(struct sockaddr_in)) < 0) {
       if(opal_socket_errno == EINPROGRESS || opal_socket_errno == EWOULDBLOCK) {
         opal_event_add(&peer->peer_send_event, 0);
         return ORTE_SUCCESS;
       }
       opal_output(0, "[%lu,%lu,%lu]-[%lu,%lu,%lu] mca_oob_tcp_peer_try_connect: "
                   "connect to %s:%d failed: %s (%d)",
                   ORTE_NAME_ARGS(orte_process_info.my_name),
                   ORTE_NAME_ARGS(&(peer->peer_name)),
                   inet_ntoa(inaddr.sin_addr),
                   ntohs(inaddr.sin_port),
                   strerror(opal_socket_errno),
                   opal_socket_errno);
       continue;
     }
   } else {
     if(opal_socket_errno == EINPROGRESS || opal_socket_errno == EWOULDBLOCK) {
       opal_event_add(&peer->peer_send_event, 0);
       return ORTE_SUCCESS;
     }
     opal_output(0, "[%lu,%lu,%lu]-[%lu,%lu,%lu] mca_oob_tcp_peer_try_connect: "
                 "connect to %s:%d failed: %s (%d)",
                 ORTE_NAME_ARGS(orte_process_info.my_name),
                 ORTE_NAME_ARGS(&(peer->peer_name)),
                 inet_ntoa(inaddr.sin_addr),
                 ntohs(inaddr.sin_port),
                 strerror(opal_socket_errno),
                 opal_socket_errno);
     continue;
   }
}
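
The patch above repeats the error handling in both branches. As a rough sketch only, not the code that is in Open MPI, the retry could be pulled into a small helper so the existing EINPROGRESS/EWOULDBLOCK handling stays in one place. The names connect_retry_once, connect_fn and fake_connect are made up for this illustration; fake_connect is only a stand-in that mimics the "first call returns ECONNABORTED" behavior described above so the helper can be exercised without a network.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/socket.h>

/* Same shape as connect(2), so a stand-in can be substituted for testing. */
typedef int (*connect_fn)(int, const struct sockaddr *, socklen_t);

/* Hypothetical helper: call do_connect once and, only if that call fails
 * with ECONNABORTED, call it one more time. Returns the result of the
 * last call; errno is whatever that last call left behind, so the caller
 * can still check for EINPROGRESS / EWOULDBLOCK as before. */
static int connect_retry_once(connect_fn do_connect, int sd,
                              const struct sockaddr *addr, socklen_t len)
{
    int rc;

    assert(do_connect != NULL);
    rc = do_connect(sd, addr, len);
    if (rc < 0 && errno == ECONNABORTED) {
        rc = do_connect(sd, addr, len);
    }
    return rc;
}

/* Stand-in for connect(), used only to exercise the helper: the first
 * call fails with ECONNABORTED, every later call succeeds. */
static int fake_calls = 0;

static int fake_connect(int sd, const struct sockaddr *addr, socklen_t len)
{
    (void)sd;
    (void)addr;
    (void)len;
    if (fake_calls++ == 0) {
        errno = ECONNABORTED;
        return -1;
    }
    return 0;
}
```

With something like this, the original if(connect(...) < 0) block at line 289 could stay as it was, with only the connect call swapped for the helper (passing connect itself as do_connect). Note that it retries exactly once; a second ECONNABORTED is reported to the caller like any other failure.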