Jeff Squyres (jsquyres) wrote:
> Interesting. Would you mind sharing your patch?
>
> -----Original Message-----
> From: users-bounces_at_[hidden] [mailto:users-bounces_at_[hidden]] On
> Behalf Of Bob Soliday
> Sent: Thursday, November 29, 2007 11:35 AM
> To: Ralph H Castain
> Cc: Open MPI Users <users_at_[hidden]>
> Subject: Re: [OMPI users] mca_oob_tcp_peer_try_connect problem
>
> I solved the problem by making a change to
> orte/mca/oob/tcp/oob_tcp_peer.c
>
> I have read that on Linux 2.6, after a failed connect system call, the
> next call to connect can immediately return ECONNABORTED without
> actually trying to connect; the call after that will then work. So I
> changed mca_oob_tcp_peer_try_connect to test for ECONNABORTED and, if it
> is set, call connect again. The hello_c example program is now working.
>
> I don't think this has solved the underlying cause of why connect is
> failing in the first place, but at least now I can move on to the next
> step. My best guess at the moment is that it is using eth0 initially
> when I want it to use eth1. That fails, and then when it moves on to
> eth1 I run into the "can't call connect after it just failed" bug.
>
> --Bob
>
>
I changed oob_tcp_peer.c at line 289 from:
    /* start the connect - will likely fail with EINPROGRESS */
    if(connect(peer->peer_sd,
        (struct sockaddr*)&inaddr, sizeof(struct sockaddr_in)) < 0) {
        /* non-blocking so wait for completion */
        if(opal_socket_errno == EINPROGRESS || opal_socket_errno == EWOULDBLOCK) {
            opal_event_add(&peer->peer_send_event, 0);
            return ORTE_SUCCESS;
        }
        opal_output(0, "[%lu,%lu,%lu]-[%lu,%lu,%lu] mca_oob_tcp_peer_try_connect: "
            "connect to %s:%d failed: %s (%d)",
            ORTE_NAME_ARGS(orte_process_info.my_name),
            ORTE_NAME_ARGS(&(peer->peer_name)),
            inet_ntoa(inaddr.sin_addr),
            ntohs(inaddr.sin_port),
            strerror(opal_socket_errno),
            opal_socket_errno);
        continue;
    }
to:
    /* start the connect - will likely fail with EINPROGRESS */
    if(connect(peer->peer_sd,
        (struct sockaddr*)&inaddr, sizeof(struct sockaddr_in)) < 0) {
        if (opal_socket_errno == ECONNABORTED) {
            /* previous connect failed; this one was aborted without
               being attempted, so call connect once more */
            if(connect(peer->peer_sd,
                (struct sockaddr*)&inaddr, sizeof(struct sockaddr_in)) < 0) {
                /* non-blocking so wait for completion */
                if(opal_socket_errno == EINPROGRESS || opal_socket_errno == EWOULDBLOCK) {
                    opal_event_add(&peer->peer_send_event, 0);
                    return ORTE_SUCCESS;
                }
                opal_output(0, "[%lu,%lu,%lu]-[%lu,%lu,%lu] mca_oob_tcp_peer_try_connect: "
                    "connect to %s:%d failed: %s (%d)",
                    ORTE_NAME_ARGS(orte_process_info.my_name),
                    ORTE_NAME_ARGS(&(peer->peer_name)),
                    inet_ntoa(inaddr.sin_addr),
                    ntohs(inaddr.sin_port),
                    strerror(opal_socket_errno),
                    opal_socket_errno);
                continue;
            }
        } else {
            /* non-blocking so wait for completion */
            if(opal_socket_errno == EINPROGRESS || opal_socket_errno == EWOULDBLOCK) {
                opal_event_add(&peer->peer_send_event, 0);
                return ORTE_SUCCESS;
            }
            opal_output(0, "[%lu,%lu,%lu]-[%lu,%lu,%lu] mca_oob_tcp_peer_try_connect: "
                "connect to %s:%d failed: %s (%d)",
                ORTE_NAME_ARGS(orte_process_info.my_name),
                ORTE_NAME_ARGS(&(peer->peer_name)),
                inet_ntoa(inaddr.sin_addr),
                ntohs(inaddr.sin_port),
                strerror(opal_socket_errno),
                opal_socket_errno);
            continue;
        }
    }
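
The retry itself is only a few lines once the logging is set aside. Below is a minimal standalone sketch of the same idea, not the Open MPI code: the helper names connect_retry_once and connect_is_pending are made up for illustration, and it reads plain errno where the patch above uses opal_socket_errno.

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/socket.h>

/* Hypothetical helper (not part of Open MPI): call connect() and, if it
 * fails immediately with ECONNABORTED, call it exactly one more time.
 * Returns whatever the last connect() returned; errno is left as set by
 * that last call. */
int connect_retry_once(int sd, const struct sockaddr *addr, socklen_t len)
{
    int rc = connect(sd, addr, len);
    if (rc < 0 && errno == ECONNABORTED) {
        /* the aborted call did not actually attempt a connection */
        rc = connect(sd, addr, len);
    }
    return rc;
}

/* Hypothetical helper: returns 1 if a failed non-blocking connect() is
 * still in progress and the caller should wait for completion, else 0. */
int connect_is_pending(int err)
{
    return (err == EINPROGRESS || err == EWOULDBLOCK) ? 1 : 0;
}
```

With helpers like these, the caller would make a single connect_retry_once call, wait for the send event when connect_is_pending(errno) is true, and otherwise log the failure and move on to the next address, so the error-handling code would not need to appear twice.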