Hi,
I'm a system administrator trying to help users resolve hangs in the Gadget-2 code during MPI_Sendrecv (similar to http://www.open-mpi.org/community/lists/users/2010/05/13057.php).
I'm trying to determine an appropriate value for mpool_rdma_rcache_size_limit on our hardware, and to make sure our RDMA settings are appropriate and do not lead to data corruption (http://www.open-mpi.org/faq/?category=openfabrics#setting-mpi-leave-pinned-1.3.2).
The Gadget code ran fine under Open MPI 1.2.9; the hangs appeared with 1.4.3 (and also with 1.3.2).
What we have observed so far:
The code runs using TCP (-mca btl tcp,self,sm).
The code hangs using InfiniBand.
The code runs using InfiniBand with "-mca btl_openib_flags 1" and "-mca mpool_rdma_rcache_size_limit 209715200" (the suggestion from the poster in the thread linked above).
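For reference, the three cases correspond roughly to the command lines built by the sketch below. This is only an illustration under stated assumptions: the process count, the binary name and the parameter file are hypothetical placeholders, and the "openib,self,sm" BTL list is my assumption about how the InfiniBand runs are selected. Only the MCA parameter names and values quoted in this message come from the actual runs.

```python
import shlex

# Hypothetical placeholders, not taken from the actual runs.
NP = 16
APP = ["./Gadget2", "param.txt"]

# 209715200 bytes is exactly 200 MiB.
RCACHE_LIMIT = 200 * 1024 * 1024


def mpirun_cmd(mca_params):
    """Build an mpirun argument list from (name, value) MCA pairs."""
    args = ["mpirun", "-np", str(NP)]
    for name, value in mca_params:
        args += ["-mca", name, str(value)]
    return args + APP


cases = {
    # From the message: TCP run completes.
    "tcp (runs)": [("btl", "tcp,self,sm")],
    # Assumption: InfiniBand selected via the openib BTL.
    "infiniband (hangs)": [("btl", "openib,self,sm")],
    # From the message: these two settings avoid the hang.
    "infiniband + workaround (runs)": [
        ("btl", "openib,self,sm"),
        ("btl_openib_flags", 1),
        ("mpool_rdma_rcache_size_limit", RCACHE_LIMIT),
    ],
}

for label, params in cases.items():
    print(f"{label}: {shlex.join(mpirun_cmd(params))}")
```

Writing the limit as 200 * 1024 * 1024 makes it clear that the value suggested in the linked thread is 200 MiB, which may be a useful starting point when scaling it to a node's memory.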
Any suggestions would be appreciated.
Regards,
Gretchen
0. Open MPI 1.4.3 (ompi_info output attached; config.log is missing, but may not be needed since this is a more general usage/settings question)
1. OFED 1.4.2 from git.openfabrics.org
2. Debian 5.0, kernel 2.6.26-2-amd64
3. opensm-3.2.6
4. ibv_devinfo
hca_id: mlx4_0
    fw_ver:          2.6.000
    node_guid:       0002:c903:0002:848c
    sys_image_guid:  0002:c903:0002:848f
    vendor_id:       0x02c9
    vendor_part_id:  25408
    hw_ver:          0xA0
    board_id:        MT_04A0130005
    phys_port_cnt:   2
        port: 1
            state:       PORT_ACTIVE (4)
            max_mtu:     2048 (4)
            active_mtu:  2048 (4)
            sm_lid:      30
            port_lid:    99
            port_lmc:    0x00
5. ifconfig
ib0  Link encap:UNSPEC  HWaddr 80-00-00-48-FE-80-00-00-00-00-00-00-00-00-00-00
     inet addr:10.16.10.20  Bcast:10.16.10.255  Mask:255.255.255.0
     inet6 addr: fe80::202:c903:2:848d/64 Scope:Link
     UP BROADCAST RUNNING MULTICAST  MTU:65520  Metric:1
     RX packets:1936 errors:0 dropped:0 overruns:0 frame:0
     TX packets:0 errors:0 dropped:5 overruns:0 carrier:0
     collisions:0 txqueuelen:256
     RX bytes:189055 (184.6 KiB)  TX bytes:0 (0.0 B)
6. ulimit -l: unlimited