Hi: A customer is attempting to run our OpenMPI 1.4.2-based application
on a cluster of machines running RHEL4 with the standard OFED stack. The
HCAs are identified as:
03:01.0 PCI bridge: Mellanox Technologies MT23108 PCI Bridge (rev a1)
04:00.0 InfiniBand: Mellanox Technologies MT23108 InfiniHost (rev a1)
ibv_devinfo says that one port on the HCAs is active but the other is
down:
hca_id: mthca0
    fw_ver: 3.0.2
    node_guid: 0006:6a00:9800:4c78
    sys_image_guid: 0006:6a00:9800:4c78
    vendor_id: 0x066a
    vendor_part_id: 23108
    hw_ver: 0xA1
    phys_port_cnt: 2
    port: 1
        state: active (4)
        max_mtu: 2048 (4)
        active_mtu: 2048 (4)
        sm_lid: 1
        port_lid: 26
        port_lmc: 0x00
    port: 2
        state: down (1)
        max_mtu: 2048 (4)
        active_mtu: 512 (2)
        sm_lid: 0
        port_lid: 0
        port_lmc: 0x00
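
In case it helps when comparing nodes, the per-port state can be pulled out of the ibv_devinfo text mechanically. This is only a rough sketch: the function name port_states and the shortened SAMPLE text are made up for illustration, and it assumes the "hca_id:", "port:" and "state:" field labels look the way they do in the output above (other ibv_devinfo versions may print the state differently, e.g. PORT_ACTIVE).

```python
import re

def port_states(devinfo_text):
    """Return {(hca_id, port_number): state} parsed from ibv_devinfo-style text.

    Hypothetical helper: it only relies on the 'hca_id:', 'port:' and
    'state:' labels seen in the output quoted in this message.
    """
    states = {}
    hca = None
    port = None
    for raw in devinfo_text.splitlines():
        line = raw.strip()
        m = re.match(r"hca_id:\s*(\S+)", line)
        if m:
            # A new device section starts; forget the previous port.
            hca, port = m.group(1), None
            continue
        # 'port:' must match exactly, so 'port_lid:' and 'port_lmc:' are skipped.
        m = re.match(r"port:\s*(\d+)", line)
        if m:
            port = int(m.group(1))
            continue
        m = re.match(r"state:\s*(\w+)", line)
        if m and hca is not None and port is not None:
            states[(hca, port)] = m.group(1).lower()
    return states

# Shortened copy of the output above, used only as sample input.
SAMPLE = """\
hca_id: mthca0
    phys_port_cnt: 2
    port: 1
        state: active (4)
        port_lid: 26
    port: 2
        state: down (1)
        port_lid: 0
"""

if __name__ == "__main__":
    for (hca, port), state in sorted(port_states(SAMPLE).items()):
        print(f"{hca} port {port}: {state}")
```

For the machine above this reports port 1 of mthca0 as active and port 2 as down, matching what ibv_devinfo shows.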
When the OMPI application is run, it prints the error message:
--------------------------------------------------------------------
The OpenFabrics (openib) BTL failed to initialize while trying to
create an internal queue. This typically indicates a failed
OpenFabrics installation, faulty hardware, or that Open MPI is
attempting to use a feature that is not supported on your hardware
(i.e., is a shared receive queue specified in the
btl_openib_receive_queues MCA parameter with a device that does not
support it?). The failure occured here:
Local host: machine001.lan
OMPI source: /software/openmpi-1.4.2/ompi/mca/btl/openib/btl_openib.c:250
Function: ibv_create_srq()
Error: Invalid argument (errno=22)
Device: mthca0
You may need to consult with your system administrator to get this
problem fixed.
--------------------------------------------------------------------
The full log of a run with "btl_openib_verbose 1" is attached. My
application appears to run to completion, but I can't tell if it's just
running on TCP and not using the IB hardware.
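
One check that might settle the TCP question is to restrict Open MPI to the openib and self BTLs with the "--mca btl" option, so that a run which cannot use the IB hardware aborts instead of quietly falling back to TCP. Below is a minimal sketch of building such a command line. The helper name build_mpirun_cmd, the application path ./my_app and the process count of 4 are placeholders, not our real setup.

```python
import shlex

def build_mpirun_cmd(app, nprocs, btls=("openib", "self"), extra_mca=None):
    """Build an mpirun argument list that limits Open MPI to the given BTLs.

    Hypothetical helper for illustration. With the tcp BTL left out of
    the list, Open MPI has no TCP path to fall back on.
    """
    cmd = ["mpirun", "-np", str(nprocs), "--mca", "btl", ",".join(btls)]
    # Additional MCA parameters, e.g. the btl_openib_verbose setting
    # used for the attached log.
    for name, value in (extra_mca or {}).items():
        cmd += ["--mca", name, str(value)]
    cmd.append(app)
    return cmd

if __name__ == "__main__":
    # ./my_app and 4 processes are placeholders.
    cmd = build_mpirun_cmd("./my_app", 4, extra_mca={"btl_openib_verbose": 1})
    print(shlex.join(cmd))
```

If several processes share a node, the shared-memory BTL (sm) would normally be added to the list as well. The resulting list can be handed to subprocess.run on a machine where mpirun is installed.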
I would appreciate any suggestions on how to proceed to fix this error.
Thanks,
Allen
--
Allen Barnett
Transpire, Inc
E-Mail: allen_at_[hidden]