
Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] Problem with sending messages from one of the machines
From: Krzysztof Zarzycki (k.zarzycki_at_[hidden])
Date: 2010-11-11 15:23:47


No, unfortunately specifying the interfaces is a little more
complicated... eth0/1/2 are not common to both machines.

I've tried to play with (oob/btl)_tcp_if_include, but honestly... I don't
know exactly how to use them.
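
For concreteness, I assume the invocation would look roughly like this (the
interface name and program name are placeholders, since our interface names
differ per machine, and I haven't verified this on our setup):

----------
# placeholder interface and program names
mpirun -np 2 --host host,client01 \
    --mca oob_tcp_if_include eth0 \
    --mca btl_tcp_if_include eth0 \
    ./a.out
----------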

Anyway, do you have any ideas on how to debug the communication
problem further?

Cheers,
Krzysztof

2010/11/11 Ralph Castain <rhc_at_[hidden]>

> There are two connections to be specified:
>
> -mca oob_tcp_if_include xxx
> -mca btl_tcp_if_include xxx
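>
> Since these values apply to the whole job, if the interface names differ
> from machine to machine, one option (a sketch using the standard per-host
> MCA parameter file; eth0 is a placeholder) is to set them separately on
> each host:
>
> ----------
> # $HOME/.openmpi/mca-params.conf on each host
> oob_tcp_if_include = eth0
> btl_tcp_if_include = eth0
> ----------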
>
>
> On Nov 11, 2010, at 12:04 PM, Krzysztof Zarzycki wrote:
>
> Hi,
> I'm working with Grzegorz on the mentioned problem.
> If I'm checking the firewall settings correctly, "iptables --list" shows
> an empty list of rules.
> The second host does not have iptables installed at all.
>
> So what could be the next possible cause of this problem?
>
> By the way, how can I force mpirun to use a specific Ethernet interface
> for connections if I have several available?
>
> Cheers,
> Krzysztof
>
> 2010/11/11 Jeff Squyres <jsquyres_at_[hidden]>
>
>> I'd check the firewall settings. The stack trace indicates that the one
>> host is trying to connect to the other (Open MPI initiates non-blocking TCP
>> connections that can be polled on later).
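>>
>> In code terms, that pattern looks roughly like the following sketch (plain
>> POSIX calls, not Open MPI's actual implementation; the peer address is the
>> one visible in the strace output below):
>>
>> ----------
>> #include <arpa/inet.h>
>> #include <cerrno>
>> #include <cstring>
>> #include <fcntl.h>
>> #include <netinet/in.h>
>> #include <poll.h>
>> #include <sys/socket.h>
>> #include <unistd.h>
>>
>> int main()
>> {
>>     // peer address as it appears in the strace output below
>>     sockaddr_in addr;
>>     std::memset(&addr, 0, sizeof(addr));
>>     addr.sin_family = AF_INET;
>>     addr.sin_port = htons(1024);
>>     addr.sin_addr.s_addr = inet_addr("10.0.7.97");
>>
>>     int fd = socket(AF_INET, SOCK_STREAM, 0);
>>     fcntl(fd, F_SETFL, O_NONBLOCK);          // make connect() non-blocking
>>
>>     // a non-blocking connect() returns -1/EINPROGRESS immediately ...
>>     if (connect(fd, (sockaddr *) &addr, sizeof(addr)) < 0
>>         && errno == EINPROGRESS) {
>>         // ... and completion is detected later: the socket polls writable
>>         pollfd p = { fd, POLLOUT, 0 };
>>         poll(&p, 1, -1);
>>         int err = 0;
>>         socklen_t len = sizeof(err);
>>         // SO_ERROR of 0 means the connection was actually established
>>         getsockopt(fd, SOL_SOCKET, SO_ERROR, &err, &len);
>>     }
>>     close(fd);
>>     return 0;
>> }
>> ----------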
>>
>>
>> On Nov 10, 2010, at 12:46 PM, David Zhang wrote:
>>
>> > Have you double-checked that your firewall settings, TCP/IP settings, and SSH keys are all set up correctly on all machines, including the host?
>> >
>> > On Wed, Nov 10, 2010 at 2:57 AM, Grzegorz Maj <maju3_at_[hidden]> wrote:
>> > Hi all,
>> > I've got a problem with sending messages from one of my machines. It
>> > appears during MPI_Send/MPI_Recv and MPI_Bcast. The simplest case I've
>> > found is two processes, rank 0 sending a simple message and rank 1
>> > receiving this message. I execute these processes using mpirun with
>> > -np 2.
>> > - when both processes are executed on the host machine, it works fine;
>> > - when both processes are executed on client machines (whether on the
>> > same machine or on different ones), it works fine;
>> > - when the sender is executed on one of the client machines and the
>> > receiver on the host machine, it works fine;
>> > - when the sender is executed on the host machine and the receiver on a
>> > client machine, it blocks.
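>> >
>> > The test program is essentially the following minimal sketch (the
>> > payload contents are a placeholder; the Send call and tag 13 are the
>> > ones shown below):
>> >
>> > ----------
>> > #include <mpi.h>
>> > #include <cstdio>
>> >
>> > int main(int argc, char** argv)
>> > {
>> >     MPI::Init(argc, argv);
>> >     int rank = MPI::COMM_WORLD.Get_rank();
>> >     char message[5] = "ping";   // placeholder payload, 5 chars with '\0'
>> >     if (rank == 0) {
>> >         // this is the call that blocks when rank 0 runs on the host
>> >         MPI::COMM_WORLD.Send(message, 5, MPI::CHAR, 1, 13);
>> >     } else if (rank == 1) {
>> >         MPI::COMM_WORLD.Recv(message, 5, MPI::CHAR, 0, 13);
>> >         std::printf("received: %s\n", message);
>> >     }
>> >     MPI::Finalize();
>> >     return 0;
>> > }
>> > ----------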
>> >
>> > This last case is my problem. When I add the option '--mca
>> > btl_base_verbose 30' to mpirun, I get:
>> >
>> > ----------
>> > [host:28186] mca: base: components_open: Looking for btl components
>> > [host:28186] mca: base: components_open: opening btl components
>> > [host:28186] mca: base: components_open: found loaded component self
>> > [host:28186] mca: base: components_open: component self has no register function
>> > [host:28186] mca: base: components_open: component self open function successful
>> > [host:28186] mca: base: components_open: found loaded component sm
>> > [host:28186] mca: base: components_open: component sm has no register function
>> > [host:28186] mca: base: components_open: component sm open function successful
>> > [host:28186] mca: base: components_open: found loaded component tcp
>> > [host:28186] mca: base: components_open: component tcp has no register function
>> > [host:28186] mca: base: components_open: component tcp open function successful
>> > [host:28186] select: initializing btl component self
>> > [host:28186] select: init of component self returned success
>> > [host:28186] select: initializing btl component sm
>> > [host:28186] select: init of component sm returned success
>> > [host:28186] select: initializing btl component tcp
>> > [host:28186] select: init of component tcp returned success
>> > [client01:19803] mca: base: components_open: Looking for btl components
>> > [client01:19803] mca: base: components_open: opening btl components
>> > [client01:19803] mca: base: components_open: found loaded component self
>> > [client01:19803] mca: base: components_open: component self has no register function
>> > [client01:19803] mca: base: components_open: component self open function successful
>> > [client01:19803] mca: base: components_open: found loaded component sm
>> > [client01:19803] mca: base: components_open: component sm has no register function
>> > [client01:19803] mca: base: components_open: component sm open function successful
>> > [client01:19803] mca: base: components_open: found loaded component tcp
>> > [client01:19803] mca: base: components_open: component tcp has no register function
>> > [client01:19803] mca: base: components_open: component tcp open function successful
>> > [client01:19803] select: initializing btl component self
>> > [client01:19803] select: init of component self returned success
>> > [client01:19803] select: initializing btl component sm
>> > [client01:19803] select: init of component sm returned success
>> > [client01:19803] select: initializing btl component tcp
>> > [client01:19803] select: init of component tcp returned success
>> > 00 of 2 host
>> > [host:28186] btl: tcp: attempting to connect() to address 10.0.7.97 on port 53255
>> > 01 of 2 client01
>> > ----------
>> >
>> > The lines "00 of 2 host" and "01 of 2 client01" are just my debug
>> > output, printing "mpirank of comm_size hostname". The next-to-last line
>> > appears during the call to Send:
>> > MPI::COMM_WORLD.Send(message, 5, MPI::CHAR, 1, 13);
>> >
>> > When I run the sender on the host under strace, I get:
>> >
>> > ----------
>> > ...
>> > connect(10, {sa_family=AF_INET, sin_port=htons(1024), sin_addr=inet_addr("10.0.7.97")}, 16) = -1 EINPROGRESS (Operation now in progress)
>> > poll([{fd=4, events=POLLIN}, {fd=5, events=POLLIN}, {fd=6, events=POLLIN}, {fd=7, events=POLLIN}, {fd=8, events=POLLIN}, {fd=9, events=POLLIN}, {fd=10, events=POLLOUT}], 7, 0) = 0 (Timeout)
>> > poll([{fd=4, events=POLLIN}, {fd=5, events=POLLIN}, {fd=6, events=POLLIN}, {fd=7, events=POLLIN}, {fd=8, events=POLLIN}, {fd=9, events=POLLIN}, {fd=10, events=POLLOUT}], 7, 0) = 1 ([{fd=10, revents=POLLOUT}])
>> > getsockopt(10, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
>> > send(10, "D\227\0\1\0\0\0\0", 8, 0) = 8
>> > poll([{fd=4, events=POLLIN}, {fd=5, events=POLLIN}, {fd=6, events=POLLIN}, {fd=7, events=POLLIN}, {fd=8, events=POLLIN}, {fd=9, events=POLLIN}, {fd=10, events=POLLIN}], 7, 0) = 0 (Timeout)
>> > poll([{fd=4, events=POLLIN}, {fd=5, events=POLLIN}, {fd=6, events=POLLIN}, {fd=7, events=POLLIN}, {fd=8, events=POLLIN}, {fd=9, events=POLLIN}, {fd=10, events=POLLIN}], 7, 0) = 1 ([{fd=10, revents=POLLIN}])
>> > recv(10, "", 8, 0) = 0
>> > close(10) = 0
>> > poll([{fd=4, events=POLLIN}, {fd=5, events=POLLIN}, {fd=6, events=POLLIN}, {fd=7, events=POLLIN}, {fd=8, events=POLLIN}, {fd=9, events=POLLIN}], 6, 0) = 0 (Timeout)
>> > poll([{fd=4, events=POLLIN}, {fd=5, events=POLLIN}, {fd=6, events=POLLIN}, {fd=7, events=POLLIN}, {fd=8, events=POLLIN}, {fd=9, events=POLLIN}], 6, 0) = 0 (Timeout)
>> > poll([{fd=4, events=POLLIN}, {fd=5, events=POLLIN}, {fd=6, events=POLLIN}, {fd=7, events=POLLIN}, {fd=8, events=POLLIN}, {fd=9, events=POLLIN}], 6, 0) = 0 (Timeout)
>> > poll([{fd=4, events=POLLIN}, {fd=5, events=POLLIN}, {fd=6, events=POLLIN}, {fd=7, events=POLLIN}, {fd=8, events=POLLIN}, {fd=9, events=POLLIN}], 6, 0) = 0 (Timeout)
>> > poll([{fd=4, events=POLLIN}, {fd=5, events=POLLIN}, {fd=6, events=POLLIN}, {fd=7, events=POLLIN}, {fd=8, events=POLLIN}, {fd=9, events=POLLIN}], 6, 0) = 0 (Timeout)
>> > ...
>> > (forever)
>> > ...
>> > ----------
>> >
>> > To me it looks like the connect() above is responsible for establishing
>> > the connection, but I'm afraid I don't understand what those poll()
>> > calls are supposed to do.
>> >
>> > Attaching gdb to the sender gives me:
>> >
>> > ----------
>> > (gdb) bt
>> > #0 0xffffe410 in __kernel_vsyscall ()
>> > #1 0x0064993b in poll () from /lib/libc.so.6
>> > #2 0xf7df07b5 in poll_dispatch () from /home/gmaj/openmpi/lib/libopen-pal.so.0
>> > #3 0xf7def8c3 in opal_event_base_loop () from /home/gmaj/openmpi/lib/libopen-pal.so.0
>> > #4 0xf7defbe7 in opal_event_loop () from /home/gmaj/openmpi/lib/libopen-pal.so.0
>> > #5 0xf7de323b in opal_progress () from /home/gmaj/openmpi/lib/libopen-pal.so.0
>> > #6 0xf7c51455 in mca_pml_ob1_send () from /home/gmaj/openmpi/lib/openmpi/mca_pml_ob1.so
>> > #7 0xf7ed9c60 in PMPI_Send () from /home/gmaj/openmpi/lib/libmpi.so.0
>> > #8 0x0804e900 in main ()
>> > ----------
>> >
>> > If anybody knows what may cause this problem, or what I can do to find
>> > the reason, any help is appreciated.
>> >
>> > My Open MPI version is 1.4.1.
>> >
>> >
>> > Regards,
>> > Grzegorz Maj
>> >
>> > --
>> > David Zhang
>> > University of California, San Diego
>>
>> --
>> Jeff Squyres
>> jsquyres_at_[hidden]
>> For corporate legal information go to:
>> http://www.cisco.com/web/about/doing_business/legal/cri/