
Subject: Re: [OMPI users] openmpi-1.2.4-1/OFED 1.2.5.4 ConnectX MPI_Reduce hang
From: Mostyn Lewis (Mostyn.Lewis_at_[hidden])
Date: 2008-01-25 17:41:29


Using today's SVN trunk (1.3a1r17234), built against the installed
OFED 1.2.5.4, the test works!
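
For reference, building a trunk checkout against an existing OFED install
goes roughly like this (a sketch, not the exact commands used; the --prefix
and the OFED root are assumptions taken from the paths quoted below):

   # Configure the Open MPI SVN trunk against the OFED 1.2.5.4 verbs stack.
   # --with-openib takes the OFED install root (path is an assumption).
   ./configure --prefix=$HOME/openmpi-trunk \
       --with-openib=/tools/ofed/1.2.5.4/suse_sles_10_1/x86_64/xeon
   make all install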

Regards,
Mostyn

On Thu, 24 Jan 2008, Mostyn Lewis wrote:

> Hello,
>
> I have a very simple MPI program hanging in MPI_Reduce using openmpi-1.2.4-1
> as supplied with OFED 1.2.5.4 (which we are also running).
>
> It works on the same hardware with the supplied mvapich (mvapich-0.9.9).
>
> The hardware is a Mellanox Technologies MT25418 [ConnectX IB DDR] (rev a0) HCA
> (SUN/Voltaire) and the switch is a Voltaire ISR9024D (running at DDR rate).
>
> ------------------------------------------------------------------------------
> Switch software/firmware is:
> ISR9024D-2c0c> version show
> ISR 9024 version: 3.4.5
> date: Oct 09 2007 11:46:00 AM
> build Id:467
>
> ISR9024D-2c0c> module-firmware show
> Anafa self address: lid 1 firmware 1.0.0 gid 0xfe800000000000000008f10400412c0c
>
> ------------------------------------------------------------------------------
> HCA info:
> /tools/ofed/1.2.5.4/suse_sles_10_1/x86_64/xeon/bin$ ./ibv_devinfo
> hca_id: mlx4_0
>         fw_ver:             2.2.000
>         node_guid:          0003:ba00:0100:5cf0
>         sys_image_guid:     0003:ba00:0100:5cf3
>         vendor_id:          0x03ba
>         vendor_part_id:     25418
>         hw_ver:             0xA0
>         board_id:           SUN0060000001
>         phys_port_cnt:      2
>                 port:   1
>                         state:          PORT_ACTIVE (4)
>                         max_mtu:        2048 (4)
>                         active_mtu:     2048 (4)
>                         sm_lid:         1
>                         port_lid:       10
>                         port_lmc:       0x00
>
>                 port:   2
>                         state:          PORT_DOWN (1)
>                         max_mtu:        2048 (4)
>                         active_mtu:     2048 (4)
>                         sm_lid:         0
>                         port_lid:       0
>                         port_lmc:       0x00
>
>
> ./ibstatus
>
> Infiniband device 'mlx4_0' port 1 status:
>         default gid:     fe80:0000:0000:0000:0003:ba00:0100:5cf1
>         base lid:        0xa
>         sm lid:          0x1
>         state:           4: ACTIVE
>         phys state:      5: LinkUp
>         rate:            20 Gb/sec (4X DDR)
>
>
> The program is an old LAM test (cpi.c):
>
> ------------------------------------------------------------------------------
> #include <stdio.h>
> #include <sys/types.h>
> #include <unistd.h>
> #include <math.h>
> #include <mpi.h>
>
> /* Constant for how many values we'll estimate */
> #define NUM_ITERS 1000
>
> /* Prototype the function that we'll use below. */
> static double f(double);
>
> int
> main(int argc, char *argv[])
> {
>     int iter, rank, size, i;
>     double PI25DT = 3.141592653589793238462643;
>     double mypi, pi, h, sum, x;
>     double startwtime = 0.0, endwtime;
>     int namelen;
>     char processor_name[MPI_MAX_PROCESSOR_NAME];
>
>     MPI_Init(&argc, &argv);
>     MPI_Comm_size(MPI_COMM_WORLD, &size);
>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>     MPI_Get_processor_name(processor_name, &namelen);
>
>     printf("Process %d of %d on %s\n", rank, size, processor_name);
>
>     for (iter = 2; iter < NUM_ITERS; ++iter) {
>         h = 1.0 / (double) iter;
>         sum = 0.0;
>
>         for (i = rank + 1; i <= iter; i += size) {
>             x = h * ((double) i - 0.5);
>             sum += f(x);
>         }
>         mypi = h * sum;
>
>         MPI_Reduce(&mypi, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
>     }
>     MPI_Finalize();
>     return 0;
> }
>
> static double
> f(double a)
> {
>     return (4.0 / (1.0 + a * a));
> }
> ------------------------------------------------------------------------------
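>
> Compiling and running it standalone is just (a sketch; the full job
> script used appears further below):
>
>    mpicc cpi.c
>    mpirun -np 9 -machinefile ic48scali ./a.out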
>
> The gcc-compiled hang from gdb looks like:
> (gdb) where
> #0 0x00002b60d54428e5 in pthread_spin_lock () from /lib64/libpthread.so.0
> #1 0x00002b60d8705aec in mlx4_poll_cq (ibcq=0x5b0bf0, ne=1, wc=0x7fffd6051390) at src/cq.c:334
> #2 0x00002b60d7c865bc in btl_openib_component_progress ()
>    at /tmp/OFED-1.2.5.4/OFED/tools/ofed/1.2.5.4/suse_sles_10_1/x86_64/xeon/include/infiniband/verbs.h:883
> #3 0x00002b60d7b7925a in mca_bml_r2_progress () at bml_r2.c:106
> #4 0x00002b60d4e6d11a in opal_progress () at runtime/opal_progress.c:288
> #5 0x00002b60d7a6b8b8 in mca_pml_ob1_recv (addr=0x7fffd60517c8, count=1, datatype=0x501660, src=8,
> tag=-21, comm=<value optimized out>, status=0x0) at ../../../../opal/threads/condition.h:81
> #6 0x00002b60d84e3cfa in ompi_coll_tuned_reduce_intra_basic_linear (sbuf=0x7fffd60517d0,
> rbuf=0x7fffd60517c8, count=1, dtype=0x501660, op=0x5012f0, root=<value optimized out>,
> comm=0x5014a0) at coll_tuned_reduce.c:385
> #7 0x00002b60d4bcd32f in PMPI_Reduce (sendbuf=0x7fffd60517d0, recvbuf=0x7fffd60517c8, count=1,
> datatype=0x501660, op=0x5012f0, root=0, comm=0x5014a0) at preduce.c:96
> #8 0x0000000000400cee in main ()
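>
> (A trace like this can be grabbed by attaching gdb to one of the hung
> ranks; a sketch, with a placeholder PID:)
>
>    # Find a hung rank's PID on a compute node, attach, and dump the stack.
>    ps -ef | grep a.out
>    gdb ./a.out <pid>       # <pid> = whatever ps reported
>    (gdb) where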
>
> A pgi-compiled hang from gdb looks like:
>
> (gdb) where
> #0 0x00002ac216e408e5 in pthread_spin_lock () from /lib64/libpthread.so.0
> #1 0x00002ac2177ceaec in mlx4_poll_cq (ibcq=0x5b52c0, ne=1, wc=0x7fff97255600) at src/cq.c:334
> #2 0x00002ac216bf51c2 in ibv_poll_cq ()
>    from /tools/ofed/1.2.5.4/suse_sles_10_1/x86_64/xeon/mpi/pgi/openmpi-1.2.4-1/lib64/openmpi/mca_btl_openib.so
> #3 0x00002ac216bf8182 in btl_openib_component_progress ()
>    from /tools/ofed/1.2.5.4/suse_sles_10_1/x86_64/xeon/mpi/pgi/openmpi-1.2.4-1/lib64/openmpi/mca_btl_openib.so
> #4 0x00002ac216ae9b24 in mca_bml_r2_progress ()
>    from /tools/ofed/1.2.5.4/suse_sles_10_1/x86_64/xeon/mpi/pgi/openmpi-1.2.4-1/lib64/openmpi/mca_bml_r2.so
> #5 0x00002ac213d60be4 in opal_progress ()
>    from /tools/ofed/1.2.5.4/suse_sles_10_1/x86_64/xeon/mpi/pgi/openmpi-1.2.4-1/lib64/libopen-pal.so.0
> #6 0x00002ac2169d4f45 in opal_condition_wait ()
>    from /tools/ofed/1.2.5.4/suse_sles_10_1/x86_64/xeon/mpi/pgi/openmpi-1.2.4-1/lib64/openmpi/mca_pml_ob1.so
> #7 0x00002ac2169d5a83 in mca_pml_ob1_recv ()
>    from /tools/ofed/1.2.5.4/suse_sles_10_1/x86_64/xeon/mpi/pgi/openmpi-1.2.4-1/lib64/openmpi/mca_pml_ob1.so
> #8 0x00002ac2175a1e67 in ompi_coll_tuned_reduce_intra_basic_linear ()
>    from /tools/ofed/1.2.5.4/suse_sles_10_1/x86_64/xeon/mpi/pgi/openmpi-1.2.4-1/lib64/openmpi/mca_coll_tuned.so
> #9 0x00002ac217597ca5 in ompi_coll_tuned_reduce_intra_dec_fixed ()
>    from /tools/ofed/1.2.5.4/suse_sles_10_1/x86_64/xeon/mpi/pgi/openmpi-1.2.4-1/lib64/openmpi/mca_coll_tuned.so
> #10 0x00002ac213a07e38 in PMPI_Reduce ()
>    from /tools/ofed/1.2.5.4/suse_sles_10_1/x86_64/xeon/mpi/pgi/openmpi-1.2.4-1/lib64/libmpi.so.0
> #11 0x0000000000402551 in main ()
>
> ------------------------------------------------------------------------------
> The openmpi_gcc script was:
> #!/bin/ksh
> set -x
>
> export PATH=/tools/ofed/1.2.5.4/suse_sles_10_1/x86_64/xeon/mpi/gcc/openmpi-1.2.4-1/bin:$PATH
> PREFIX="--prefix /tools/ofed/1.2.5.4/suse_sles_10_1/x86_64/xeon/mpi/gcc/openmpi-1.2.4-1"
> MCA="-mca btl openib,self -mca btl_tcp_if_exclude lo,eth1 -mca oob_tcp_if_exclude lo,eth1"
> mpicc cpi.c
> mpirun $PREFIX $MCA -np 9 -machinefile ic48scali ./a.out
>
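> To help isolate the openib BTL, the same job can be rerun over TCP only
> (a sketch, not something run above; if the hang disappears, the openib
> path is implicated):
>
>    mpirun $PREFIX -mca btl tcp,self -np 9 -machinefile ic48scali ./a.out
>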
>
> Any ideas as to who the culprit may be in this hang?
>
> Regards,
> Mostyn