
Open MPI User's Mailing List Archives


From: Åke Sandgren (ake.sandgren_at_[hidden])
Date: 2006-11-16 16:53:40


Hi!

I'm having problems running the Allgather test of IMB 3.0.

System: Ubuntu Dapper, dual AMD Opteron, Myricom MX 1.1.5
OMPI version: 1.1.2 and 1.2b1
buildflags -O0 -g

Started with:
  mpirun -mca mpi_yield_when_idle 1 -mca mpi_keep_peer_hostnames 0

(The problem also exists when mpi_yield_when_idle is 0.)

When running on 88 nodes (one task per node) the test runs fine, but
with 89 nodes or more it never returns any data. It prints the header
up to

# List of Benchmarks to run

# Allgather

and then nothing.

If I attach gdb to the task 0 process, it shows:
#0 0x00002aaaab7185f9 in sched_yield () from /lib/libc.so.6
#1 0x00002aaaaaf48d06 in opal_progress () at runtime/opal_progress.c:301
#2 0x00002aaaae5948f4 in opal_condition_wait (c=0x2aaaae69a890,
    m=0x2aaaae69a840) at condition.h:81
#3 0x00002aaaae59471d in __ompi_free_list_wait (fl=0x2aaaae69a790,
    item=0x7fffffae7948) at ompi_free_list.h:180
#4 0x00002aaaae594c86 in mca_btl_mx_prepare_src (btl=0x557200,
    endpoint=0x7bd690, registration=0x0, convertor=0x5712e0, reserve=32,
    size=0x7fffffae79e8) at btl_mx.c:263
#5 0x00002aaaae157507 in mca_bml_base_prepare_src (bml_btl=0x7bded0, reg=0x0,
    conv=0x5712e0, reserve=32, size=0x7fffffae79e8, des=0x7fffffae7a00)
    at bml.h:315
#6 0x00002aaaae157b57 in mca_pml_ob1_send_request_start_rndv (
    sendreq=0x571200, bml_btl=0x7bded0, size=16292, flags=8)
    at pml_ob1_sendreq.c:803
#7 0x00002aaaae14ba66 in mca_pml_ob1_send_request_start_btl (
    sendreq=0x571200, bml_btl=0x7bded0) at pml_ob1_sendreq.h:332
#8 0x00002aaaae14b6f4 in mca_pml_ob1_send_request_start (sendreq=0x571200)
    at pml_ob1_sendreq.h:374
#9 0x00002aaaae14bf6d in mca_pml_ob1_send (buf=0x2aaab0832010, count=65536,
    datatype=0x50c180, dst=1, tag=-17, sendmode=MCA_PML_BASE_SEND_STANDARD,
    comm=0x7ed4b0) at pml_ob1_isend.c:103
#10 0x00002aaaaecd7a0d in ompi_coll_tuned_bcast_intra_chain (
    buff=0x2aaab0832010, count=373293056, datatype=0x50c180, root=0,
    comm=0x7ed4b0, segsize=65536, chains=1) at coll_tuned_bcast.c:109
#11 0x00002aaaaecd7e90 in ompi_coll_tuned_bcast_intra_pipeline (
    buffer=0x2aaab0832010, count=373293056, datatype=0x50c180, root=0,
    comm=0x7ed4b0, segsize=65536) at coll_tuned_bcast.c:208
#12 0x00002aaaaecd2d79 in ompi_coll_tuned_bcast_intra_dec_fixed (
    buff=0x2aaab0832010, count=373293056, datatype=0x50c180, root=0,
    comm=0x7ed4b0) at coll_tuned_decision_fixed.c:205
#13 0x00002aaaae9bce6f in mca_coll_basic_allgather_intra (sbuf=0x2aaab0431010,
    scount=4194304, sdtype=0x50c180, rbuf=0x2aaab0832010, rcount=4194304,
    rdtype=0x50c180, comm=0x7ed4b0) at coll_basic_allgather.c:77
#14 0x00002aaaaac2efb2 in PMPI_Allgather (sendbuf=0x2aaab0431010,
    sendcount=4194304, sendtype=0x50c180, recvbuf=0x2aaab0832010,
    recvcount=4194304, recvtype=0x50c180, comm=0x7ed4b0) at pallgather.c:75
#15 0x00000000004088e8 in IMB_allgather ()
#16 0x00000000004065a2 in IMB_warm_up ()
#17 0x000000000040347a in main ()

The last task shows:
#0 0x00002aaaab7185f9 in sched_yield () from /lib/libc.so.6
#1 0x00002aaaaaf48d06 in opal_progress () at runtime/opal_progress.c:301
#2 0x00002aaaae14aace in opal_condition_wait (c=0x2aaaaadb2880,
    m=0x2aaaaadb2900) at condition.h:81
#3 0x00002aaaae14a9ad in mca_pml_ob1_recv (addr=0x2aaab002f010, count=65536,
    datatype=0x50c180, src=87, tag=-17, comm=0x7fc170, status=0x0)
    at pml_ob1_irecv.c:107
#4 0x00002aaaaecd7d07 in ompi_coll_tuned_bcast_intra_chain (
    buff=0x2aaab002f010, count=373293056, datatype=0x50c180, root=0,
    comm=0x7fc170, segsize=65536, chains=1) at coll_tuned_bcast.c:179
#5 0x00002aaaaecd7e90 in ompi_coll_tuned_bcast_intra_pipeline (
    buffer=0x2aaab002f010, count=373293056, datatype=0x50c180, root=0,
    comm=0x7fc170, segsize=65536) at coll_tuned_bcast.c:208
#6 0x00002aaaaecd2d79 in ompi_coll_tuned_bcast_intra_dec_fixed (
    buff=0x2aaab002f010, count=373293056, datatype=0x50c180, root=0,
    comm=0x7fc170) at coll_tuned_decision_fixed.c:205
#7 0x00002aaaae9bce6f in mca_coll_basic_allgather_intra (sbuf=0x2aaaafc2e010,
    scount=4194304, sdtype=0x50c180, rbuf=0x2aaab002f010, rcount=4194304,
    rdtype=0x50c180, comm=0x7fc170) at coll_basic_allgather.c:77
#8 0x00002aaaaac2efb2 in PMPI_Allgather (sendbuf=0x2aaaafc2e010,
    sendcount=4194304, sendtype=0x50c180, recvbuf=0x2aaab002f010,
    recvcount=4194304, recvtype=0x50c180, comm=0x7fc170) at pallgather.c:75
#9 0x00000000004088e8 in IMB_allgather ()
#10 0x00000000004065a2 in IMB_warm_up ()
#11 0x000000000040347a in main ()
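
A side observation from the traces (my own arithmetic, not additional
debug output): the count passed to the tuned bcast in both traces,
373293056, is exactly the Allgather scount (4194304) times 89 ranks,
which matches the basic allgather being implemented as a broadcast of
the entire gathered buffer. A quick check:

```python
# Counts taken verbatim from the gdb backtraces above.
scount = 4194304         # per-rank Allgather send count (PMPI_Allgather frame)
bcast_count = 373293056  # count passed to ompi_coll_tuned_bcast_intra_chain
ranks = 89               # the node count at which the hang first appears

# The broadcast moves the whole gathered buffer: per-rank count x ranks.
assert scount * ranks == bcast_count
print(bcast_count // 2**20, "MiB")  # total buffer size per rank
```

So each rank is broadcasting/receiving a ~356 MiB buffer at the point
of the hang.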

Any ideas?

I have no problem running the Reduce_scatter or Allreduce tests of IMB.