Open MPI Development Mailing List Archives

From: Adrian Knoth (adi_at_[hidden])
Date: 2006-11-01 18:20:56


Hi,

I'm currently testing the new IPv6 code in a lot of
different setups.

It works fine with Linux and Solaris, both on x86, and there
are also no problems between multiple amd64 machines, but I
wasn't able to communicate between x86 and amd64.

The OOB connection is up, but the BTL hangs. gdb (attached to
the remote process) shows:

#0 0xb7d3bac9 in sigprocmask () from /lib/tls/libc.so.6
#1 0xb7eb956c in opal_evsignal_recalc ()
   from /home/racl/adi/ompi/trunk/Linux-i686/lib/libopal.so.0
#2 0xb7eba033 in poll_dispatch ()
   from /home/racl/adi/ompi/trunk/Linux-i686/lib/libopal.so.0
#3 0xb7eb8d5d in opal_event_loop ()
   from /home/racl/adi/ompi/trunk/Linux-i686/lib/libopal.so.0
#4 0xb7eb2f58 in opal_progress ()
   from /home/racl/adi/ompi/trunk/Linux-i686/lib/libopal.so.0
#5 0xb7c72505 in mca_pml_ob1_recv ()
   from /home/racl/adi/ompi/trunk/Linux-i686//lib/openmpi/mca_pml_ob1.so
#6 0xb7fa8c10 in PMPI_Recv ()
   from /home/racl/adi/ompi/trunk/Linux-i686/lib/libmpi.so.0
#7 0x080488cd in main ()

and the local gdb:

#0 0x00002aaaab4b4d99 in __libc_sigaction () from /lib/libpthread.so.0
#1 0x00002aaaaaee4c26 in opal_evsignal_recalc ()
   from /home/adi//trunk/Linux-x86_64/lib/libopal.so.0
#2 0x00002aaaaaee44b1 in opal_event_loop ()
   from /home/adi//trunk/Linux-x86_64/lib/libopal.so.0
#3 0x00002aaaaaedfc10 in opal_progress ()
   from /home/adi//trunk/Linux-x86_64/lib/libopal.so.0
#4 0x00002aaaad6a0c8c in mca_pml_ob1_recv ()
   from /home/adi/trunk/Linux-x86_64//lib/openmpi/mca_pml_ob1.so
#5 0x00002aaaaac429f9 in PMPI_Recv ()
   from /home/adi//trunk/Linux-x86_64/lib/libmpi.so.0
#6 0x0000000000400b39 in main ()

The Open MPI 1.1.2 release also shows this problem, so I'm not
sure whether it's my fault.

I've added some debug output to my ringtest (see below) and
got the following result:

1: waiting for message
0: sending message (0) to 1
0: sent message

Here's the code:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char** argv)
{
    int rank;
    int size;
    int message = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (!rank) {
        printf("%i: sending message (%i) to %i\n", rank, message, 1);
        MPI_Send(&message, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        printf("%i: sent message\n", rank);
        MPI_Recv(&message, 1, MPI_INT, size-1, 0, MPI_COMM_WORLD,
                MPI_STATUS_IGNORE);
        printf("%i: got message (%i) from %i\n", rank, message, size-1);
    } else {
        printf("%i: waiting for message\n");
        MPI_Recv(&message, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
                MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        message += 1;
        MPI_Send(&message, 1, MPI_INT, (rank+1)%size, 0, MPI_COMM_WORLD);
        printf("%i: got message (%i) from %i, sending to %i\n", rank, message,
               rank-1, (rank+1)%size);
    }

    MPI_Finalize();
    return 0;
}

Nothing special, but as the gdb output and the debug lines show,
both processes are waiting in PMPI_Recv(), each expecting a
message to arrive.

Is this a known problem? What's wrong: the user code or Open MPI?
As far as I can see (tcpdump and strace), all TCP connections
are up, so the message seems to be stuck between rank 0 and rank 1.
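
If it helps, here's a non-blocking variant of the receive side
that I could use to confirm the message really never arrives at
rank 1, instead of blocking forever inside the progress engine
(just a sketch for two processes, not the test shown above):

#include <mpi.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char** argv)
{
    int rank;
    int message = -1;
    int flag = 0;
    int waited = 0;
    MPI_Request req;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        message = 0;
        MPI_Send(&message, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        printf("0: sent message\n");
    } else if (rank == 1) {
        /* Post the receive non-blocking and poll, so a missing
         * message shows up as periodic "still waiting" output
         * instead of a silent hang inside opal_progress(). */
        MPI_Irecv(&message, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &req);
        while (!flag) {
            MPI_Test(&req, &flag, MPI_STATUS_IGNORE);
            if (!flag) {
                printf("1: still waiting after %i s\n", ++waited);
                fflush(stdout);
                sleep(1);
            }
        }
        printf("1: got message (%i)\n", message);
    }

    MPI_Finalize();
    return 0;
}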

-- 
mail: adi_at_[hidden]  	http://adi.thur.de	PGP: v2-key via keyserver
Windows not found - Abort/Retry/Smile