Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Problem with sending messages from one of the machines
From: Krzysztof Zarzycki (k.zarzycki_at_[hidden])
Date: 2010-11-18 05:19:41


We just discovered this ticket, which might describe the same problem that
we have:

https://svn.open-mpi.org/trac/ompi/ticket/1505

It seems unresolved... Do you have a workaround for it? I've seen the
"-mca opal_net_private_ipv4" parameter, but I don't know exactly how to use
it. At least, my experiments with it had no visible effect.

I'll be very grateful for your help,
Krzysztof
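
For reference, a small sketch of what a value for opal_net_private_ipv4 looks like and how an address can be checked against it. The list below is assumed to match Open MPI's documented default (a semicolon-delimited list of CIDR entries); please verify it with ompi_info on your installation. The helper names parse_nets and is_private are hypothetical, and this is not Open MPI's own code.

```python
import ipaddress

# Assumption: mirrors the documented default of opal_net_private_ipv4.
PRIVATE_NETS = "10.0.0.0/8;172.16.0.0/12;192.168.0.0/16;169.254.0.0/16"


def parse_nets(spec):
    """Split a semicolon-delimited CIDR list into network objects."""
    return [ipaddress.ip_network(part) for part in spec.split(";") if part]


def is_private(addr, nets):
    """Return True if addr falls inside any of the given networks."""
    ip = ipaddress.ip_address(addr)
    return any(ip in net for net in nets)


if __name__ == "__main__":
    nets = parse_nets(PRIVATE_NETS)
    # 10.0.0.2 is the eth2 address that appears in the strace output.
    print("10.0.0.2 private:", is_private("10.0.0.2", nets))
```

With this list, the eth2 address 10.0.0.2 from the strace output is classified as private, so it may be affected by whatever handling the ticket describes for private networks.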

2010/11/17 Grzegorz Maj <maju3_at_[hidden]>

> 2010/11/11 Jeff Squyres <jsquyres_at_[hidden]>:
> > On Nov 11, 2010, at 3:23 PM, Krzysztof Zarzycki wrote:
> >
> >> No, unfortunately specification of interfaces is a little more
> >> complicated... eth0/1/2 is not common for both machines.
> >
> > Can you define "common"? Do you mean that eth0 on one machine is on a
> > different network than eth0 on the other machine?
> >
> > Is there any way that you can make them the same? It would certainly
> > make things easier.
>
> Yes, they are on different networks and unfortunately we are not
> allowed to play with this.
>
> >
> >> I've tried to play with (oob/btl)_tcp_if_include, but actually... I
> >> don't know exactly how.
> >
> > See my other mail:
> >
> > http://www.open-mpi.org/community/lists/users/2010/11/14737.php
> >
> >> Anyway, do you have any ideas how to further debug the communication
> >> problem?
> >
> > The connect() is not getting through somehow. Sadly, we don't have
> > enough debug messages to show exactly what is going wrong when these kinds
> > of things happen; I have a half-finished branch that has much better
> > debug/error messages, but I've never had the time to finish it (indeed, I
> > think there's a bug in that development branch right now, otherwise I'd
> > recommend giving it a whirl). :-\
>
> Analyzing the strace of both processes shows that on both sides the
> call to 'poll' after connect/accept succeeds. As I understand it, they
> even exchange some information, which is always 8 bytes, like
> D\227\0\1\0\0\0\0. One of them sends this information and the other
> receives it. But after receiving it, the receiver does:
>
> ----
> recv(8, "\5g\0\1\0\0\0\0", 8, 0) = 8
> fcntl64(8, F_GETFL) = 0x2 (flags O_RDWR)
> fcntl64(8, F_SETFL, O_RDWR|O_NONBLOCK) = 0
> getpeername(8, {sa_family=AF_INET, sin_port=htons(57885), sin_addr=inet_addr("10.0.0.2")}, [16]) = 0
> close(8)
> ----
>
> In a working scenario (on other machines), after these bytes are
> received they are sent back, and then the proper communication
> proceeds (my 'hello' message is sent).
>
> The above address 10.0.0.2 is eth2 on the host machine, which indeed
> should be used in this communication.
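
A short sketch for reading the 8-byte payloads in the strace output above. It assumes, without confirmation from this thread, that the 8 bytes are two 32-bit unsigned integers in network byte order; the function name decode_handshake and that interpretation of the fields are illustrative only. strace prints non-printable bytes as octal escapes, which Python bytes literals accept as-is.

```python
import struct


def decode_handshake(raw):
    """Split an 8-byte payload into two big-endian 32-bit integers.

    Assumption: the layout (two uint32 values, network byte order) is a
    guess made for illustration, not something stated in the thread.
    """
    if len(raw) != 8:
        raise ValueError("expected exactly 8 bytes, got %d" % len(raw))
    return struct.unpack("!II", raw)


if __name__ == "__main__":
    # Payloads copied from the strace output; octal escapes kept as printed.
    for payload in (b"D\227\0\1\0\0\0\0", b"\5g\0\1\0\0\0\0"):
        first, second = decode_handshake(payload)
        print("first=0x%08x second=%d" % (first, second))
```

Comparing the decoded values from the sending and receiving sides may help check whether each process sees the identifier it expects before the socket is closed.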
>
> While playing with the network interfaces, it turned out that when we
> bring down one of the aliases (eth2:0), it starts working. How can we
> force mpirun not to use this alias when it is up? We tried using
> (oob/btl)_tcp_if_exclude and specifying eth2:0, but it doesn't
> seem to help.
>
> Regards,
> Grzegorz
>
>
> >
> > --
> > Jeff Squyres
> > jsquyres_at_[hidden]
> > For corporate legal information go to:
> > http://www.cisco.com/web/about/doing_business/legal/cri/
> >
> >
> > _______________________________________________
> > users mailing list
> > users_at_[hidden]
> > http://www.open-mpi.org/mailman/listinfo.cgi/users
> >