Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] gadget2 infiniband openmpi hang
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2011-03-17 12:45:32


Are you able to run if you add --mca btl_openib_cpc_include rdmacm to your mpirun command line?
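
For reference, a minimal dry-run sketch of what that test might look like next to the workaround reported in the quoted message below. The process count, executable name, parameter file, and the explicit "btl openib,self,sm" list are placeholders and assumptions, not details from the report; the script only prints the command lines so they can be reviewed before being run.

```shell
#!/bin/sh
# Dry-run sketch: prints mpirun command lines instead of executing them.
# NP, APP and PARAMFILE are hypothetical placeholders, not from the report.
NP=${NP:-16}
APP=${APP:-./Gadget2}
PARAMFILE=${PARAMFILE:-run.param}

# Assumption: InfiniBand runs select the openib BTL explicitly.
BTL="--mca btl openib,self,sm"

# Suggested test: use the RDMA CM connection manager for the openib BTL.
rdmacm_cmd() {
    echo "mpirun -np $NP $BTL --mca btl_openib_cpc_include rdmacm $APP $PARAMFILE"
}

# Workaround from the quoted message: btl_openib_flags 1 (send/receive only)
# plus a registration cache limit of 209715200 bytes (200 MiB).
workaround_cmd() {
    echo "mpirun -np $NP $BTL --mca btl_openib_flags 1 --mca mpool_rdma_rcache_size_limit 209715200 $APP $PARAMFILE"
}

rdmacm_cmd
workaround_cmd
```

If the rdmacm run completes where the default run hangs, that would point at connection setup rather than at the registration cache settings.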

On Mar 17, 2011, at 10:57 AM, Craig West wrote:

> Hi,
> I'm a system administrator trying to help users resolve hangs in the Gadget2 code during MPI_Sendrecv (similar to http://www.open-mpi.org/community/lists/users/2010/05/13057.php).
> I'm trying to determine an appropriate value of mpool_rdma_rcache_size_limit for our hardware, and to make sure our RDMA settings are appropriate and do not lead to data corruption (http://www.open-mpi.org/faq/?category=openfabrics#setting-mpi-leave-pinned-1.3.2).
> The Gadget2 code ran fine under Open MPI 1.2.9; the hangs appeared in 1.4.3 (and also in 1.3.2).
>
> The code runs over TCP (-mca btl tcp,self,sm).
>
> The code hangs over InfiniBand.
>
> The code runs over InfiniBand with "-mca btl_openib_flags 1" and "-mca mpool_rdma_rcache_size_limit 209715200" (a suggestion from a poster in the thread linked above).
>
> Any suggestions would be appreciated.
> Regards,
> Gretchen
> 0. openmpi 1.4.3 (ompi_info attached, config.log is missing but may not be needed as this is a more general usage/settings question)
> 1. OFED 1.4.2 from git.openfabrics.org
> 2. Debian 5.0, kernel 2.6.26-2-amd64
> 3. opensm-3.2.6
> 4. ibv_devinfo
> hca_id: mlx4_0
>     fw_ver: 2.6.000
>     node_guid: 0002:c903:0002:848c
>     sys_image_guid: 0002:c903:0002:848f
>     vendor_id: 0x02c9
>     vendor_part_id: 25408
>     hw_ver: 0xA0
>     board_id: MT_04A0130005
>     phys_port_cnt: 2
>     port: 1
>         state: PORT_ACTIVE (4)
>         max_mtu: 2048 (4)
>         active_mtu: 2048 (4)
>         sm_lid: 30
>         port_lid: 99
>         port_lmc: 0x00
>
> 5. ifconfig
> ib0 Link encap:UNSPEC HWaddr 80-00-00-48-FE-80-00-00-00-00-00-00-00-00-00-00
>     inet addr:10.16.10.20 Bcast:10.16.10.255 Mask:255.255.255.0
>     inet6 addr: fe80::202:c903:2:848d/64 Scope:Link
>     UP BROADCAST RUNNING MULTICAST MTU:65520 Metric:1
>     RX packets:1936 errors:0 dropped:0 overruns:0 frame:0
>     TX packets:0 errors:0 dropped:5 overruns:0 carrier:0
>     collisions:0 txqueuelen:256
>     RX bytes:189055 (184.6 KiB) TX bytes:0 (0.0 B)
> 6. unlimited
>
> <ompi_info.txt>

-- 
Jeff Squyres
jsquyres_at_[hidden]