
Open MPI Development Mailing List Archives


From: Georg Wassen (wassen_at_[hidden])
Date: 2007-06-26 11:06:26


Hello all,

I have temporarily worked around my earlier problem by using synchronous
communication and shifting the initialization
into the first call of a collective operation.

Nevertheless, I found a performance bug in btl_openib.

When I execute the attached sendrecv.c on 4 (or more) nodes of a Pentium
D cluster with InfiniBand,
each receiving process gets only 8 messages within a few seconds and then
does nothing for at least 20 seconds.
(I executed the following command and hit Ctrl-C 20 seconds after the last
output.)

wassen_at_elrohir:~/src/mpi_test$ mpirun -np 4 -host
pd-01,pd-02,pd-03,pd-04 -mca btl openib,self sendrecv
[3] received data[0]=1
[1] received data[0]=1
[1] received data[1]=2
[1] received data[2]=3
[1] received data[3]=4
[1] received data[4]=5
[1] received data[5]=6
[1] received data[6]=7
[1] received data[7]=8
[2] received data[0]=1
[2] received data[1]=2
[2] received data[2]=3
[2] received data[3]=4
[2] received data[4]=5
[2] received data[5]=6
[2] received data[6]=7
[2] received data[7]=8
[3] received data[1]=2
[3] received data[2]=3
[3] received data[3]=4
[3] received data[4]=5
[3] received data[5]=6
[3] received data[6]=7
[3] received data[7]=8
{20 sec. later...}
mpirun: killing job...

When I execute the same program with "-mca btl udapl,self" or "-mca btl
tcp,self", it runs fine and terminates in less than a second.
I tried with Open MPI 1.2.1 and 1.2.3. The test program runs fine with
several other MPIs (Intel MPI and MVAPICH with InfiniBand, MP-MPICH
with SCI).

I hope my information suffices to reproduce the problem.

Best regards,
Georg Wassen.

P.S. I know that I could transmit the array in a single MPI_Send, but this
pattern is extracted from my real problem.
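For readers without the attachment, the test amounts to the following pattern, judging from the output above (a minimal reconstruction, not the exact sendrecv.c; the array size and tags are my assumptions): rank 0 sends the elements of an 8-int array one MPI_Send per element to every other rank, and each receiver prints them as they arrive.

```c
/* Hypothetical reconstruction of sendrecv.c from the observed output:
 * rank 0 sends 8 ints individually to each other rank; receivers print them. */
#include <stdio.h>
#include <mpi.h>

#define N 8  /* assumed from the "data[0]..data[7]" lines in the output */

int main(int argc, char **argv)
{
    int rank, size, i, r;
    int data[N];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {
        for (i = 0; i < N; i++)
            data[i] = i + 1;
        for (r = 1; r < size; r++)
            for (i = 0; i < N; i++)   /* one MPI_Send per array element */
                MPI_Send(&data[i], 1, MPI_INT, r, i, MPI_COMM_WORLD);
    } else {
        for (i = 0; i < N; i++) {
            MPI_Recv(&data[i], 1, MPI_INT, 0, i, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("[%d] received data[%d]=%d\n", rank, i, data[i]);
        }
    }

    MPI_Finalize();
    return 0;
}
```

With a correct transport this should complete almost immediately for np=4; under "-mca btl openib,self" it reproduces the hang after 8 received messages per rank.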

--------------------1st node-----------------------
wassen_at_pd-01:~$ /opt/infiniband/bin/ibv_devinfo
hca_id: mthca0
        fw_ver: 1.2.0
        node_guid: 0002:c902:0020:b680
        sys_image_guid: 0002:c902:0020:b683
        vendor_id: 0x02c9
        vendor_part_id: 25204
        hw_ver: 0xA0
        board_id: MT_0230000001
        phys_port_cnt: 1
                port: 1
                        state: PORT_ACTIVE (4)
                        max_mtu: 2048 (4)
                        active_mtu: 2048 (4)
                        sm_lid: 1
                        port_lid: 1
                        port_lmc: 0x00

---------------------------------------------------------
wassen_at_pd-01:~$ /sbin/ifconfig
...
ib0 Link encap:UNSPEC HWaddr
00-00-04-04-FE-80-00-00-00-00-00-00-00-00-00-00
          inet addr:192.168.0.11 Bcast:192.168.0.255
Mask:255.255.255.0
          inet6 addr: fe80::202:c902:20:b681/64
Scope:Link
          UP BROADCAST RUNNING MULTICAST MTU:2044 Metric:1
          RX packets:260 errors:0 dropped:0 overruns:0 frame:0
          TX packets:331 errors:0 dropped:2 overruns:0 carrier:0
          collisions:0 txqueuelen:128
          RX bytes:14356 (14.0 KiB) TX bytes:24960 (24.3 KiB)
-------------------------------------------------------