Open MPI User's Mailing List Archives

Subject: [OMPI users] gadget2 infiniband openmpi hang
From: Craig West (umassastrohpcc_at_[hidden])
Date: 2011-03-17 10:57:55


Hi,
I'm a system administrator trying to help users resolve hangs in the Gadget-2
code during MPI_Sendrecv (similar to
http://www.open-mpi.org/community/lists/users/2010/05/13057.php).
I'm trying to determine appropriate values for mpool_rdma_rcache_size_limit
for our hardware, and to make sure the RDMA settings are appropriate and do
not lead to data corruption
(http://www.open-mpi.org/faq/?category=openfabrics#setting-mpi-leave-pinned-1.3.2).
The Gadget code ran fine under Open MPI 1.2.9; the hangs appeared in 1.4.3
(and also in 1.3.2).
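As a starting point for choosing values, the parameters currently in effect can be listed with ompi_info. This is only a sketch: it assumes ompi_info from the 1.4.x install is first in PATH, and the STATUS variable is just an illustrative name.

```shell
# List the MCA parameters relevant to the hang, if ompi_info is available.
# STATUS is a hypothetical helper variable used only in this sketch.
if command -v ompi_info >/dev/null 2>&1; then
    # Registration cache settings (includes mpool_rdma_rcache_size_limit).
    ompi_info --param mpool rdma
    # openib BTL settings (includes btl_openib_flags).
    ompi_info --param btl openib
    # MPI-level settings, filtered down to the leave_pinned parameters.
    ompi_info --param mpi all | grep leave_pinned
    STATUS=found
else
    echo "ompi_info not found in PATH" >&2
    STATUS=missing
fi
```

Comparing this output between the 1.2.9 and 1.4.3 installs may show which defaults changed between the two series.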

The code runs using TCP (-mca btl tcp,self,sm).

The code hangs using InfiniBand.

The code runs using InfiniBand with "-mca btl_openib_flags 1" and "-mca
mpool_rdma_rcache_size_limit 209715200" (a suggestion from the poster in the
thread referenced above).
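For reference, the three cases can be written out as full command lines. The sketch below only prints them (a dry run). The process count, binary name and parameter file are placeholders rather than values from our setup, and "-mca btl openib,self,sm" is an assumption about how InfiniBand was selected.

```shell
# Dry run: build and print three mpirun command lines without launching anything.
# NP, APP and PARAMS are hypothetical placeholders.
NP=16
APP=./Gadget2
PARAMS=param.txt

# 209715200 bytes is exactly 200 MiB.
RCACHE_LIMIT=$((200 * 1024 * 1024))

# Case 1: TCP only (runs).
TCP_CMD="mpirun -np $NP -mca btl tcp,self,sm $APP $PARAMS"

# Case 2: InfiniBand with default settings (hangs).
# Assumption: InfiniBand is selected via the openib BTL.
IB_CMD="mpirun -np $NP -mca btl openib,self,sm $APP $PARAMS"

# Case 3: InfiniBand with the two extra settings (runs).
IB_WORKAROUND_CMD="mpirun -np $NP -mca btl openib,self,sm -mca btl_openib_flags 1 -mca mpool_rdma_rcache_size_limit $RCACHE_LIMIT $APP $PARAMS"

printf '%s\n' "$TCP_CMD" "$IB_CMD" "$IB_WORKAROUND_CMD"
```

As far as I understand the FAQ, btl_openib_flags 1 limits the openib BTL to send/receive semantics (no RDMA put/get), so case 3 may trade some bandwidth for avoiding the hang.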

Any suggestions would be appreciated.
Regards,
Gretchen

0. Open MPI 1.4.3 (ompi_info output attached; config.log is missing, but may
not be needed since this is a more general usage/settings question)
1. OFED 1.4.2 from git.openfabrics.org
2. Debian 5.0, kernel 2.6.26-2-amd64
3. opensm-3.2.6
4. ibv_devinfo
hca_id: mlx4_0
    fw_ver: 2.6.000
    node_guid: 0002:c903:0002:848c
    sys_image_guid: 0002:c903:0002:848f
    vendor_id: 0x02c9
    vendor_part_id: 25408
    hw_ver: 0xA0
    board_id: MT_04A0130005
    phys_port_cnt: 2
        port: 1
            state: PORT_ACTIVE (4)
            max_mtu: 2048 (4)
            active_mtu: 2048 (4)
            sm_lid: 30
            port_lid: 99
            port_lmc: 0x00

5. ifconfig
ib0 Link encap:UNSPEC HWaddr 80-00-00-48-FE-80-00-00-00-00-00-00-00-00-00-00
          inet addr:10.16.10.20 Bcast:10.16.10.255 Mask:255.255.255.0
          inet6 addr: fe80::202:c903:2:848d/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST MTU:65520 Metric:1
          RX packets:1936 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:5 overruns:0 carrier:0
          collisions:0 txqueuelen:256
          RX bytes:189055 (184.6 KiB) TX bytes:0 (0.0 B)
6. unlimited
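
Assuming item 6 reports the locked-memory limit (ulimit -l), a check along these lines could be run on each compute node. It is a minimal sketch; LOCKED_LIMIT is an illustrative variable name.

```shell
# Read the locked-memory limit of the current shell; fall back to "unknown"
# if this shell's ulimit builtin does not support -l.
LOCKED_LIMIT=$(ulimit -l 2>/dev/null) || LOCKED_LIMIT=unknown

if [ "$LOCKED_LIMIT" = "unlimited" ]; then
    echo "locked memory: unlimited"
else
    echo "locked memory: $LOCKED_LIMIT (a finite or unknown limit may restrict memory registration)"
fi
```

This reports the limit of the shell it runs in. Processes started through ssh or a resource manager may see a different limit, so it is probably worth running the check through the same launch path as the MPI job.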