Judging from the "(openib) BTL failed to initialize" message below, my guess is that the code is running over TCP.  To prove this, restrict the run to the openib, sm and self BTLs, which excludes the tcp BTL, by adding "-mca btl openib,sm,self" to the mpirun line.  I believe that with this setting the code will abort instead of running to completion.
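
A minimal sketch of such a run is below. The process count (4) and the binary name (./my_app) are placeholders I made up, so substitute your own; only the "-mca btl openib,sm,self" part comes from the suggestion above.

```shell
#!/bin/sh
# Hypothetical values: adjust the process count and binary to your job.
NP=4
APP=./my_app

# Allow only the openib, sm and self BTLs. tcp is deliberately left out,
# so the job cannot quietly fall back to TCP if openib fails.
BTLS="openib,sm,self"

CMD="mpirun -np $NP -mca btl $BTLS $APP"
echo "$CMD"

# Run the job only when mpirun and the application are actually present.
if command -v mpirun >/dev/null 2>&1 && [ -x "$APP" ]; then
    $CMD
fi
```

If the job aborts at startup with this setting, the earlier runs were most likely using TCP.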

What version of the OFED stack are you using?  I wonder whether SRQ (shared receive queue) is supported on your system.
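
One way to gather both answers, as a rough sketch: it assumes the ofed_info and ibv_devinfo tools are installed on the node and that the device is mthca0, as in your output. srq_limits is just a helper name I made up for the filter.

```shell
#!/bin/sh
# Keep only the SRQ-related attributes (max_srq, max_srq_wr, ...) from
# whatever is piped in.
srq_limits() {
    grep -i 'srq'
}

# The first line of ofed_info output names the installed OFED release.
if command -v ofed_info >/dev/null 2>&1; then
    ofed_info | head -n 1
fi

# Verbose device attributes for mthca0, filtered down to the SRQ limits.
if command -v ibv_devinfo >/dev/null 2>&1; then
    ibv_devinfo -v -d mthca0 | srq_limits
fi
```

A max_srq of 0 would suggest that the device does not offer SRQs at all, though I have not checked what this particular HCA and firmware report.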


Allen Barnett wrote:
Hi: A customer is attempting to run our OpenMPI 1.4.2-based application
on a cluster of machines running RHEL4 with the standard OFED stack. The
HCAs are identified as:

03:01.0 PCI bridge: Mellanox Technologies MT23108 PCI Bridge (rev a1)
04:00.0 InfiniBand: Mellanox Technologies MT23108 InfiniHost (rev a1)

ibv_devinfo says that one port on the HCAs is active but the other is down:

hca_id:	mthca0
	fw_ver:				3.0.2
	node_guid:			0006:6a00:9800:4c78
	sys_image_guid:			0006:6a00:9800:4c78
	vendor_id:			0x066a
	vendor_part_id:			23108
	hw_ver:				0xA1
	phys_port_cnt:			2
		port:	1
			state:			active (4)
			max_mtu:		2048 (4)
			active_mtu:		2048 (4)
			sm_lid:			1
			port_lid:		26
			port_lmc:		0x00

		port:	2
			state:			down (1)
			max_mtu:		2048 (4)
			active_mtu:		512 (2)
			sm_lid:			0
			port_lid:		0
			port_lmc:		0x00

When the OMPI application is run, it prints the error message:

The OpenFabrics (openib) BTL failed to initialize while trying to
create an internal queue.  This typically indicates a failed
OpenFabrics installation, faulty hardware, or that Open MPI is
attempting to use a feature that is not supported on your hardware
(i.e., is a shared receive queue specified in the
btl_openib_receive_queues MCA parameter with a device that does not
support it?).  The failure occured here:

  Local host:  machine001.lan
  OMPI source: /software/openmpi-1.4.2/ompi/mca/btl/openib/btl_openib.c:250
  Function:    ibv_create_srq()
  Error:       Invalid argument (errno=22)
  Device:      mthca0

You may need to consult with your system administrator to get this
problem fixed.

The full log of a run with "btl_openib_verbose 1" is attached. My
application appears to run to completion, but I can't tell if it's just
running on TCP and not using the IB hardware.

I would appreciate any suggestions on how to proceed to fix this error.




Terry D. Dontje | Principal Software Engineer
Developer Tools Engineering | +1.650.633.7054
Oracle - Performance Technologies
95 Network Drive, Burlington, MA 01803