We just discovered this ticket, which might describe the same problem that we have:

https://svn.open-mpi.org/trac/ompi/ticket/1505

It seems unresolved. Do you have a workaround for it? I've seen the "-mca opal_net_private_ipv4" parameter, but I don't know exactly how to use it; at least, my experiments with it had no effect.

I'll be very grateful for your help,
Krzysztof


2010/11/17 Grzegorz Maj <maju3@wp.pl>
2010/11/11 Jeff Squyres <jsquyres@cisco.com>:
> On Nov 11, 2010, at 3:23 PM, Krzysztof Zarzycki wrote:
>
>> No, unfortunately specification of interfaces is a little more complicated...  eth0/1/2 is not common for both machines.
>
> Can you define "common"?  Do you mean that eth0 on one machine is on a different network than eth0 on the other machine?
>
> Is there any way that you can make them the same?  It would certainly make things easier.

Yes, they are on different networks and unfortunately we are not
allowed to play with this.

>
>> I've tried to play with (oob/btl)_tcp_if_include, but actually... I don't know exactly how.
>
> See my other mail:
>
>    http://www.open-mpi.org/community/lists/users/2010/11/14737.php
>
>> Anyway, do you have any ideas how to further debug the communication problem?
>
> The connect() is not getting through somehow.  Sadly, we don't have enough debug messages to show exactly what is going wrong when these kinds of things happen; I have a half-finished branch that has much better debug/error messages, but I've never had the time to finish it (indeed, I think there's a bug in that development branch right now, otherwise I'd recommend giving it a whirl).  :-\

Analyzing the strace output of both processes shows that on both
sides the call to 'poll' after connect/accept succeeds. As I
understand it, they even exchange some information, which is always
8 bytes, such as D\227\0\1\0\0\0\0. One side sends these bytes and
the other receives them. But after receiving them, the receiver does:

----
recv(8, "\5g\0\1\0\0\0\0", 8, 0)        = 8
fcntl64(8, F_GETFL)                     = 0x2 (flags O_RDWR)
fcntl64(8, F_SETFL, O_RDWR|O_NONBLOCK)  = 0
getpeername(8, {sa_family=AF_INET, sin_port=htons(57885), sin_addr=inet_addr("10.0.0.2")}, [16]) = 0
close(8)
----
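
To compare the 8-byte payloads from the two traces more easily, they can be split into fixed-width fields. The sketch below is only an illustration: reading the payload as two unsigned 32-bit integers in network byte order is an assumption, not something the strace output confirms, and the helper name split_handshake is made up for this example.

```python
import struct


def split_handshake(payload):
    """Split an 8-byte payload into two unsigned 32-bit big-endian integers.

    ASSUMPTION: the two-field, network-byte-order layout is a guess made
    for illustration; the trace itself only shows 8 opaque bytes.
    """
    if len(payload) != 8:
        raise ValueError("expected 8 bytes, got %d" % len(payload))
    return struct.unpack("!II", payload)


# Payloads copied from the strace output. strace prints non-printable
# bytes as octal escapes, so "\5g\0\1" is 0x05 0x67 0x00 0x01 and
# "D\227\0\1" is 0x44 0x97 0x00 0x01.
payloads = {
    "recv() shown in the trace": b"\x05g\x00\x01\x00\x00\x00\x00",
    "example quoted in the text": b"D\x97\x00\x01\x00\x00\x00\x00",
}

for label, payload in payloads.items():
    first, second = split_handshake(payload)
    print("%s: first=0x%08x second=0x%08x" % (label, first, second))
```

Printing the fields in hex makes it easier to check whether the value one side sends is the value the other side receives just before it closes the socket.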

In a working scenario (on other machines), these bytes are sent back
after being received, and then the proper communication proceeds (my
'hello' message is sent).

The address 10.0.0.2 above belongs to eth2 on the host machine, which
is indeed the interface that should be used for this communication.

While experimenting with the network interfaces, it turned out that
when we bring down one of the aliases (eth2:0), it starts working.
How can we force mpirun not to use this alias when it is up? We tried
(oob/btl)_tcp_if_exclude with eth2:0, but it doesn't seem to help.
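
For reference, the kind of invocation meant here can be sketched as a small Python helper that assembles the mpirun argument list. The host names, the ./hello program name, and the helper itself are hypothetical; listing lo next to eth2:0 is also an assumption, made because an explicit exclude list replaces whatever the default list contains.

```python
import shlex


def mpirun_command(program, hosts, exclude=("lo", "eth2:0"),
                   frameworks=("oob", "btl")):
    """Build an mpirun argument list that sets <framework>_tcp_if_exclude.

    ASSUMPTIONS: the host names, program name, and the choice to list
    "lo" alongside the alias are illustrative only.
    """
    argv = ["mpirun", "--host", ",".join(hosts)]
    value = ",".join(exclude)
    for framework in frameworks:
        # Produces oob_tcp_if_exclude and btl_tcp_if_exclude in turn.
        argv += ["--mca", "%s_tcp_if_exclude" % framework, value]
    argv.append(program)
    return argv


# Print the command as a single shell-quoted line for copy and paste.
print(shlex.join(mpirun_command("./hello", ["host1", "host2"])))
```

Building the list in one place keeps the oob and btl settings identical, which rules out a mismatch between the two as the reason the exclusion has no effect.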

Regards,
Grzegorz


>
> --
> Jeff Squyres
> jsquyres@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
>
>
> _______________________________________________
> users mailing list
> users@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
>
