
Open MPI User's Mailing List Archives


From: Marco Sbrighi (m.sbrighi_at_[hidden])
Date: 2007-08-31 12:51:43


Dear Open MPI developers,

I'm using Open MPI 1.2.2 over OFED 1.1 on a 680-node dual-Opteron,
dual-core Linux cluster, with an InfiniBand interconnect.
During the execution of big jobs (more than 128 processes) I have
experienced performance slowdowns and deadlocks in collective MPI
operations. The job processes often terminate with "RETRY EXCEEDED
ERROR" (provided btl_openib_ib_timeout is set appropriately).
This kind of error would seem to point to the fabric, but roughly
half of the MPI processes are hitting the timeout.....
To investigate this behaviour further, I tried some "constrained"
tests using SKaMPI, but it is quite difficult to isolate a single
collective operation with SKaMPI: even if the SKaMPI script requests
only (say) a Reduce, with many communicator sizes, the SKaMPI code
itself also performs a lot of bcasts, alltoalls, etc.
So I tried a hand-made piece of code that performs "only" one
repeated collective operation at a time.
The code is attached to this message; the file is named
collect_noparms.c.
What happened when I tried to run this code is reported below:
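Roughly, the reduce part of the test does something like the following
(a simplified sketch of the attached code; the real collect_noparms.c
also handles timing and other collectives, and the counts here are just
illustrative):

```c
/* Simplified sketch: one collective (MPI_Reduce), repeated, so that
 * nothing else touches the BTLs between iterations. Buffer sizes are
 * illustrative, not the exact values from the attachment. */
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

#define COUNT       32768   /* doubles per reduce */
#define REPETITIONS 100

int main(int argc, char **argv)
{
    int rank, size, i;
    double *sendbuf, *recvbuf;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    sendbuf = malloc(COUNT * sizeof(double));
    recvbuf = malloc(COUNT * sizeof(double));
    for (i = 0; i < COUNT; i++)
        sendbuf[i] = (double)rank;

    /* The repeated collective under test. */
    for (i = 0; i < REPETITIONS; i++)
        MPI_Reduce(sendbuf, recvbuf, COUNT, MPI_DOUBLE, MPI_SUM, 0,
                   MPI_COMM_WORLD);

    if (rank == 0)
        printf("%d reduces of %d doubles on %d processes done\n",
               REPETITIONS, COUNT, size);

    free(sendbuf);
    free(recvbuf);
    MPI_Finalize();
    return 0;
}
```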

......

011 - 011 - 039 NOOT START
000 - 000 of 38 - 655360 0.000000
[node1049:11804] *** Process received signal ***
[node1049:11804] Signal: Segmentation fault (11)
[node1049:11804] Signal code: Address not mapped (1)
[node1049:11804] Failing at address: 0x18
035 - 035 - 039 NOOT START
000 - 000 of 38 - 786432 0.000000
[node1049:11804] [ 0] /lib64/tls/libpthread.so.0 [0x2a964db420]
000 - 000 of 38 - 917504 0.000000
[node1049:11804] [ 1] /cineca/prod/openmpi/1.2.2/mr/gnu3.4-bc_no_memory_mgr_dbg/lib/libmpi.so.0 [0x2a9573fa18]
[node1049:11804] [ 2] /cineca/prod/openmpi/1.2.2/mr/gnu3.4-bc_no_memory_mgr_dbg/lib/libmpi.so.0 [0x2a9573f639]
[node1049:11804] [ 3] /cineca/prod/openmpi/1.2.2/mr/gnu3.4-bc_no_memory_mgr_dbg/lib/libmpi.so.0(mca_btl_sm_send+0x122) [0x2a9573f5e1]
[node1049:11804] [ 4] /cineca/prod/openmpi/1.2.2/mr/gnu3.4-bc_no_memory_mgr_dbg/lib/libmpi.so.0 [0x2a957acac6]
[node1049:11804] [ 5] /cineca/prod/openmpi/1.2.2/mr/gnu3.4-bc_no_memory_mgr_dbg/lib/libmpi.so.0(mca_pml_ob1_send_request_start_copy+0x303) [0x2a957ace52]
[node1049:11804] [ 6] /cineca/prod/openmpi/1.2.2/mr/gnu3.4-bc_no_memory_mgr_dbg/lib/libmpi.so.0 [0x2a957a2788]
[node1049:11804] [ 7] /cineca/prod/openmpi/1.2.2/mr/gnu3.4-bc_no_memory_mgr_dbg/lib/libmpi.so.0 [0x2a957a251c]
[node1049:11804] [ 8] /cineca/prod/openmpi/1.2.2/mr/gnu3.4-bc_no_memory_mgr_dbg/lib/libmpi.so.0(mca_pml_ob1_send+0x2e2) [0x2a957a2d9e]
[node1049:11804] [ 9] /cineca/prod/openmpi/1.2.2/mr/gnu3.4-bc_no_memory_mgr_dbg/lib/libmpi.so.0(ompi_coll_tuned_reduce_generic+0x651) [0x2a95751621]
[node1049:11804] [10] /cineca/prod/openmpi/1.2.2/mr/gnu3.4-bc_no_memory_mgr_dbg/lib/libmpi.so.0(ompi_coll_tuned_reduce_intra_pipeline+0x176) [0x2a95751bff]
[node1049:11804] [11] /cineca/prod/openmpi/1.2.2/mr/gnu3.4-bc_no_memory_mgr_dbg/lib/libmpi.so.0(ompi_coll_tuned_reduce_intra_dec_fixed+0x3f4) [0x2a957475f6]
[node1049:11804] [12] /cineca/prod/openmpi/1.2.2/mr/gnu3.4-bc_no_memory_mgr_dbg/lib/libmpi.so.0(PMPI_Reduce+0x3a6) [0x2a9570a076]
[node1049:11804] [13] /bcx/usercin/asm0/mpptools/mpitools/debug/src/collect_noparms_bc.x(reduce+0x3e) [0x404e64]
[node1049:11804] [14] /bcx/usercin/asm0/mpptools/mpitools/debug/src/collect_noparms_bc.x(main+0x620) [0x404c8e]
[node1049:11804] [15] /lib64/tls/libc.so.6(__libc_start_main+0xdb) [0x2a966004bb]
[node1049:11804] [16] /bcx/usercin/asm0/mpptools/mpitools/debug/src/collect_noparms_bc.x [0x40448a]
[node1049:11804] *** End of error message ***

.......

The behaviour is more or less identical whether I use the InfiniBand
or the Gigabit interconnect. If I use another MPI implementation
(say MVAPICH), everything works fine.
I then compiled both my code and Open MPI with gcc 3.4.4, with
bounds checking and compiler debugging flags, and without the OMPI
memory manager. The behaviour is the same, but now I have the line
where the SIGSEGV is trapped:

----------------------------------------------------------------------------------------------------------------
gdb collect_noparms_bc.x core.11580
GNU gdb Red Hat Linux (6.3.0.0-1.96rh)
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB. Type "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu"...Using host libthread_db library "/lib64/tls/libthread_db.so.1".

warning: core file may not match specified executable file.
Core was generated by `/bcx/usercin/asm0/mpptools/mpitools/debug/src/collect_noparms_bc.x'.
Program terminated with signal 11, Segmentation fault.
Reading symbols from /prod/openmpi/1.2.2/mr/gnu3.4-bc_no_memory_mgr_dbg/lib/libmpi.so.0...done.
Loaded symbols for /cineca/prod/openmpi/1.2.2/mr/gnu3.4-bc_no_memory_mgr_dbg/lib/libmpi.so.0
Reading symbols from /prod/openmpi/1.2.2/mr/gnu3.4-bc_no_memory_mgr_dbg/lib/libopen-rte.so.0...done.
Loaded symbols for /cineca/prod/openmpi/1.2.2/mr/gnu3.4-bc_no_memory_mgr_dbg/lib/libopen-rte.so.0
Reading symbols from /prod/openmpi/1.2.2/mr/gnu3.4-bc_no_memory_mgr_dbg/lib/libopen-pal.so.0...done.
Loaded symbols for /cineca/prod/openmpi/1.2.2/mr/gnu3.4-bc_no_memory_mgr_dbg/lib/libopen-pal.so.0
Reading symbols from /usr/local/ofed/lib64/libibverbs.so.1...done.
Loaded symbols for /usr/local/ofed/lib64/libibverbs.so.1
Reading symbols from /lib64/tls/librt.so.1...done.
Loaded symbols for /lib64/tls/librt.so.1
Reading symbols from /usr/lib64/libnuma.so.1...done.
Loaded symbols for /usr/lib64/libnuma.so.1
Reading symbols from /lib64/libnsl.so.1...done.
Loaded symbols for /lib64/libnsl.so.1
Reading symbols from /lib64/libutil.so.1...done.
Loaded symbols for /lib64/libutil.so.1
Reading symbols from /lib64/tls/libm.so.6...done.
Loaded symbols for /lib64/tls/libm.so.6
Reading symbols from /lib64/libdl.so.2...done.
Loaded symbols for /lib64/libdl.so.2
Reading symbols from /lib64/tls/libpthread.so.0...done.
Loaded symbols for /lib64/tls/libpthread.so.0
Reading symbols from /lib64/tls/libc.so.6...done.
Loaded symbols for /lib64/tls/libc.so.6
Reading symbols from /usr/lib64/libsysfs.so.1...done.
Loaded symbols for /usr/lib64/libsysfs.so.1
Reading symbols from /lib64/ld-linux-x86-64.so.2...done.
Loaded symbols for /lib64/ld-linux-x86-64.so.2
Reading symbols from /lib64/libnss_files.so.2...done.
Loaded symbols for /lib64/libnss_files.so.2
Reading symbols from /usr/local/ofed/lib64/infiniband/ipathverbs.so...done.
Loaded symbols for /usr/local/ofed/lib64/infiniband/ipathverbs.so
Reading symbols from /usr/local/ofed/lib64/infiniband/mthca.so...done.
Loaded symbols for /usr/local/ofed/lib64/infiniband/mthca.so
Reading symbols from /lib64/libgcc_s.so.1...done.
Loaded symbols for /lib64/libgcc_s.so.1
#0 0x0000002a9573fa18 in ompi_cb_fifo_write_to_head_same_base_addr (data=0x2a96f7df80, fifo=0x0)
    at /cineca/prod/build/mpich/openmpi-1.2.2/ompi/class/ompi_circular_buffer_fifo.h:370
370 h_ptr=fifo->head;
(gdb) bt
#0 0x0000002a9573fa18 in ompi_cb_fifo_write_to_head_same_base_addr (data=0x2a96f7df80, fifo=0x0)
    at /cineca/prod/build/mpich/openmpi-1.2.2/ompi/class/ompi_circular_buffer_fifo.h:370
#1 0x0000002a9573f639 in ompi_fifo_write_to_head_same_base_addr (data=0x2a96f7df80, fifo=0x2a96e476a0, fifo_allocator=0x674100)
    at /cineca/prod/build/mpich/openmpi-1.2.2/ompi/class/ompi_fifo.h:312
#2 0x0000002a9573f5e1 in mca_btl_sm_send (btl=0x2a95923440, endpoint=0x6e9670, descriptor=0x2a96f7df80, tag=1 '\001')
    at /cineca/prod/build/mpich/openmpi-1.2.2/ompi/mca/btl/sm/btl_sm.c:894
#3 0x0000002a957acac6 in mca_bml_base_send (bml_btl=0x67fc00, des=0x2a96f7df80, tag=1 '\001')
    at /cineca/prod/build/mpich/openmpi-1.2.2/ompi/mca/bml/bml.h:283
#4 0x0000002a957ace52 in mca_pml_ob1_send_request_start_copy (sendreq=0x594080, bml_btl=0x67fc00, size=1024)
    at /cineca/prod/build/mpich/openmpi-1.2.2/ompi/mca/pml/ob1/pml_ob1_sendreq.c:565
#5 0x0000002a957a2788 in mca_pml_ob1_send_request_start_btl (sendreq=0x594080, bml_btl=0x67fc00)
    at /cineca/prod/build/mpich/openmpi-1.2.2/ompi/mca/pml/ob1/pml_ob1_sendreq.h:278
#6 0x0000002a957a251c in mca_pml_ob1_send_request_start (sendreq=0x594080)
    at /cineca/prod/build/mpich/openmpi-1.2.2/ompi/mca/pml/ob1/pml_ob1_sendreq.h:345
#7 0x0000002a957a2d9e in mca_pml_ob1_send (buf=0x7b8400, count=256, datatype=0x51b8b0, dst=37, tag=-21,
    sendmode=MCA_PML_BASE_SEND_STANDARD, comm=0x521c00) at /cineca/prod/build/mpich/openmpi-1.2.2/ompi/mca/pml/ob1/pml_ob1_isend.c:103
#8 0x0000002a95751621 in ompi_coll_tuned_reduce_generic (sendbuf=0x7b8000, recvbuf=0x8b9000, original_count=32512,
    datatype=0x51b8b0, op=0x51ba40, root=0, comm=0x521c00, tree=0x520b00, count_by_segment=256)
    at /cineca/prod/build/mpich/openmpi-1.2.2/ompi/mca/coll/tuned/coll_tuned_reduce.c:187
#9 0x0000002a95751bff in ompi_coll_tuned_reduce_intra_pipeline (sendbuf=0x7b8000, recvbuf=0x8b9000, count=32768, datatype=0x51b8b0,
    op=0x51ba40, root=0, comm=0x521c00, segsize=1024)
    at /cineca/prod/build/mpich/openmpi-1.2.2/ompi/mca/coll/tuned/coll_tuned_reduce.c:255
#10 0x0000002a957475f6 in ompi_coll_tuned_reduce_intra_dec_fixed (sendbuf=0x7b8000, recvbuf=0x8b9000, count=32768, datatype=0x51b8b0,
    op=0x51ba40, root=0, comm=0x521c00) at /cineca/prod/build/mpich/openmpi-1.2.2/ompi/mca/coll/tuned/coll_tuned_decision_fixed.c:353
#11 0x0000002a9570a076 in PMPI_Reduce (sendbuf=0x7b8000, recvbuf=0x8b9000, count=32768, datatype=0x51b8b0, op=0x51ba40, root=0,
    comm=0x521c00) at preduce.c:96
#12 0x0000000000404e64 in reduce (comm=0x521c00, count=32768) at collect_noparms.c:248
#13 0x0000000000404c8e in main (argc=1, argv=0x7fbffff308) at collect_noparms.c:187
(gdb)
-----------------------------------------

I think this bug is not related to my performance slowdown in collective
operations, but ..... something seems to be wrong at a higher level in
the MCA framework .....
Is anyone able to reproduce a similar bug?
Is anyone else seeing performance slowdowns in collective operations
with big jobs using OFED 1.1 over an InfiniBand interconnect?
Do I need some further btl or coll tuning? (I've tried SRQ, but
that doesn't resolve my problems.)
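For completeness, these are the kinds of MCA settings I've been
experimenting with (parameter names as reported by ompi_info on our
1.2.2 install; the values shown are just examples I tried, not
recommendations, and the executable name is illustrative):

```shell
# Raise the IB retry timeout (roughly 4.096 us * 2^timeout per retry):
mpirun --mca btl_openib_ib_timeout 20 -np 256 ./collect_noparms.x

# Shared receive queues (tried; did not help in my case):
mpirun --mca btl_openib_use_srq 1 -np 256 ./collect_noparms.x

# List the available openib BTL parameters on a given install:
ompi_info --param btl openib
```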

Marco

-- 
-----------------------------------------------------------------
 Marco Sbrighi  m.sbrighi_at_[hidden]
 HPC Group
 CINECA Interuniversity Computing Centre
 via Magnanelli, 6/3
 40033 Casalecchio di Reno (Bo) ITALY
 tel. 051 6171516