Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Problem with sending messages from one of the machines
From: George Bosilca (bosilca_at_[hidden])
Date: 2010-11-29 17:14:20


You can use the tcp_if_include/tcp_if_exclude parameters with address ranges instead of names. ompi_info --mca btl tcp gives you some hints:

> MCA btl: parameter "btl_tcp_if_include" (current value: <none>, data
> source: default value)
> Comma-delimited list of devices or CIDR notation of networks
> to use for MPI communication (e.g., "eth0,eth1" or
> "192.168.0.0/16,10.1.4.0/24"). Mutually exclusive with
> btl_tcp_if_exclude.
> MCA btl: parameter "btl_tcp_if_exclude" (current value: <lo,sppp>, data
> source: default value)
> Comma-delimited list of devices or CIDR notation of networks
> to NOT use for MPI communication -- all devices not matching
> these specifications will be used (e.g., "eth0,eth1" or
> "192.168.0.0/16,10.1.4.0/24"). Mutually exclusive with
> btl_tcp_if_include.
>
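
To illustrate why an address range helps when interface names differ between machines, here is a minimal Python sketch of CIDR matching. It is not Open MPI's implementation. The helper name select_interfaces, both host tables, and every address except 10.0.0.2 are hypothetical, and the /24 mask is an assumption. Unlike the real parameter, the sketch accepts only CIDR entries, not device names.

```python
import ipaddress


def select_interfaces(interfaces, include_spec):
    """Return the names of interfaces whose address lies in any CIDR
    network listed in include_spec (a comma-delimited string)."""
    networks = [
        ipaddress.ip_network(item.strip(), strict=False)
        for item in include_spec.split(",")
    ]
    return [
        name
        for name, addr in interfaces.items()
        if any(ipaddress.ip_address(addr) in net for net in networks)
    ]


# Hypothetical interface tables: the shared network sits on eth2 on one
# host and on eth0 on the other, so no single device name fits both.
host_a = {"eth0": "192.168.1.10", "eth2": "10.0.0.2", "eth2:0": "172.16.5.2"}
host_b = {"eth0": "10.0.0.3", "eth1": "192.168.7.20"}

spec = "10.0.0.0/24"  # assumed mask for the network containing 10.0.0.2
print(select_interfaces(host_a, spec))  # ['eth2']
print(select_interfaces(host_b, spec))  # ['eth0']
```

The same range selects a different device on each host, which is the point of passing a network rather than a name, for example as --mca btl_tcp_if_include 10.0.0.0/24 on the mpirun command line (again assuming that mask).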

  george.

On Nov 18, 2010, at 05:19 , Krzysztof Zarzycki wrote:

> We just discovered this ticket, which might describe the same problem that we have:
>
> https://svn.open-mpi.org/trac/ompi/ticket/1505
>
> It seems unresolved... do you have a workaround for it? I've seen the "-mca opal_net_private_ipv4" parameter, but I don't know exactly how to use it... At least my experiments with it had no effect.
>
> I'll be very grateful for your help,
> Krzysztof
>
>
> 2010/11/17 Grzegorz Maj <maju3_at_[hidden]>
> 2010/11/11 Jeff Squyres <jsquyres_at_[hidden]>:
> > On Nov 11, 2010, at 3:23 PM, Krzysztof Zarzycki wrote:
> >
> >> No, unfortunately specification of interfaces is a little more complicated... eth0/1/2 is not common for both machines.
> >
> > Can you define "common"? Do you mean that eth0 on one machine is on a different network than eth0 on the other machine?
> >
> > Is there any way that you can make them the same? It would certainly make things easier.
>
> Yes, they are on different networks and unfortunately we are not
> allowed to play with this.
>
> >
> >> I've tried to play with (oob/btl)_tcp_if_include, but actually... I don't know exactly how.
> >
> > See my other mail:
> >
> > http://www.open-mpi.org/community/lists/users/2010/11/14737.php
> >
> >> Anyway, do you have any ideas how to further debug the communication problem?
> >
> > The connect() is not getting through somehow. Sadly, we don't have enough debug messages to show exactly what is going wrong when these kinds of things happen; I have a half-finished branch that has much better debug/error messages, but I've never had the time to finish it (indeed, I think there's a bug in that development branch right now, otherwise I'd recommend giving it a whirl). :-\
>
> Analyzing the strace of both processes shows that on both sides the
> call to 'poll' after connect/accept succeeds. As I understand it, they
> even exchange some information, which is always 8 bytes, like
> D\227\0\1\0\0\0\0. One of them sends this information and the other
> receives it. But after receiving it, the receiver does:
>
> ----
> recv(8, "\5g\0\1\0\0\0\0", 8, 0) = 8
> fcntl64(8, F_GETFL) = 0x2 (flags O_RDWR)
> fcntl64(8, F_SETFL, O_RDWR|O_NONBLOCK) = 0
> getpeername(8, {sa_family=AF_INET, sin_port=htons(57885),
> sin_addr=inet_addr("10.0.0.2")}, [16]) = 0
> close(8)
> ----
>
> In a working scenario (on other machines), after these bytes are
> received, they are sent back and then the proper communication
> proceeds (my 'hello' message is sent).
>
> The above address 10.0.0.2 is eth2 on the host machine, which indeed
> should be used in this communication.
>
> While playing with the network interfaces, we found that when we bring
> down one of the aliases (eth2:0), it starts working. How can we
> force mpirun not to use this alias when it is up? We tried
> (oob/btl)_tcp_if_exclude with eth2:0 specified, but it doesn't
> seem to help.
>
> Regards,
> Grzegorz
>
>
> >
> > --
> > Jeff Squyres
> > jsquyres_at_[hidden]
> > For corporate legal information go to:
> > http://www.cisco.com/web/about/doing_business/legal/cri/
> >
> >
> > _______________________________________________
> > users mailing list
> > users_at_[hidden]
> > http://www.open-mpi.org/mailman/listinfo.cgi/users