Hi:

In response to my own question, by studying the file mca-btl-openib-device-params.ini, I discovered that this option in OMPI-1.4.2:

  -mca btl_openib_receive_queues P,65536,256,192,128

was sufficient to prevent OMPI from trying to create shared receive queues and allowed my application to run to completion using the IB hardware.

I guess my question now is: What do these numbers mean? Presumably the size (or counts?) of buffers to allocate? Are there limits or a way to tune these values?

Thanks,
Allen

On Mon, 2010-08-02 at 12:49 -0400, Allen Barnett wrote:

Hi Terry:

It is indeed the case that the openib BTL has not been initialized. I ran with your tcp-disabled MCA option and it aborted in MPI_Init.

The OFED stack is what's included in RHEL4. It appears to be made up of the RPMs:

  openib-1.4-1.el4
  opensm-3.2.5-1.el4
  libibverbs-1.1.2-1.el4

How can I determine if srq is supported? Is there an MCA option to defeat it? (Our in-house cluster has more recent Mellanox IB hardware and is running this same IB stack and ompi 1.4.2 works OK, so I suspect srq is supported by the OpenFabrics stack. Perhaps.)

Thanks,
Allen

On Mon, 2010-08-02 at 06:47 -0400, Terry Dontje wrote:

My guess is from the message below saying "(openib) BTL failed to initialize" that the code is probably running over tcp. To absolutely prove this you can specify to only use the openib, sm and self btls to eliminate the tcp btl. To do that you add the following to the mpirun line "-mca btl openib,sm,self". I believe with that specification the code will abort and not run to completion.

What version of the OFED stack are you using? I wonder if srq is supported on your system or not?

--td

Allen Barnett wrote:

Hi:

A customer is attempting to run our OpenMPI 1.4.2-based application on a cluster of machines running RHEL4 with the standard OFED stack. The HCAs are identified as:

  03:01.0 PCI bridge: Mellanox Technologies MT23108 PCI Bridge (rev a1)
  04:00.0 InfiniBand: Mellanox Technologies MT23108 InfiniHost (rev a1)

ibv_devinfo says that one port on the HCAs is active but the other is down:

  hca_id: mthca0
      fw_ver:           3.0.2
      node_guid:        0006:6a00:9800:4c78
      sys_image_guid:   0006:6a00:9800:4c78
      vendor_id:        0x066a
      vendor_part_id:   23108
      hw_ver:           0xA1
      phys_port_cnt:    2
          port: 1
              state:        active (4)
              max_mtu:      2048 (4)
              active_mtu:   2048 (4)
              sm_lid:       1
              port_lid:     26
              port_lmc:     0x00

          port: 2
              state:        down (1)
              max_mtu:      2048 (4)
              active_mtu:   512 (2)
              sm_lid:       0
              port_lid:     0
              port_lmc:     0x00

When the OMPI application is run, it prints the error message:

--------------------------------------------------------------------
The OpenFabrics (openib) BTL failed to initialize while trying to
create an internal queue. This typically indicates a failed
OpenFabrics installation, faulty hardware, or that Open MPI is
attempting to use a feature that is not supported on your hardware
(i.e., is a shared receive queue specified in the
btl_openib_receive_queues MCA parameter with a device that does not
support it?). The failure occured here:

  Local host:  machine001.lan
  OMPI source: /software/openmpi-1.4.2/ompi/mca/btl/openib/btl_openib.c:250
  Function:    ibv_create_srq()
  Error:       Invalid argument (errno=22)
  Device:      mthca0

You may need to consult with your system administrator to get this
problem fixed.
--------------------------------------------------------------------

The full log of a run with "btl_openib_verbose 1" is attached.

My application appears to run to completion, but I can't tell if it's just running on TCP and not using the IB hardware.

I would appreciate any suggestions on how to proceed to fix this error.

Thanks,
Allen
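
Editorial sketch, not part of the original messages: the btl_openib_receive_queues value discussed in this thread (P,65536,256,192,128) can be read as a comma-separated list of fields for one queue, with several queues separated by colons. The small Python helper below labels each field so the numbers are easier to reason about. The field names are an assumption based on my reading of the Open MPI openib BTL documentation (per-peer "P" queues: buffer size in bytes, number of buffers, low watermark, credit window, reserved credit buffers; shared "S" and XRC "X" queues: buffer size in bytes, number of buffers, low watermark, maximum pending sends). The function name parse_receive_queues is hypothetical, and the meanings should be checked against the documentation for the Open MPI version in use.

```python
# Field names below are assumptions taken from a reading of the Open MPI
# openib BTL documentation; verify them for your Open MPI version.
PER_PEER_FIELDS = ["buffer_size_bytes", "num_buffers", "low_watermark",
                   "credit_window", "reserved_credit_buffers"]
SHARED_FIELDS = ["buffer_size_bytes", "num_buffers", "low_watermark",
                 "max_pending_sends"]
FIELDS_BY_TYPE = {"P": PER_PEER_FIELDS, "S": SHARED_FIELDS, "X": SHARED_FIELDS}


def parse_receive_queues(spec):
    """Split a colon-separated receive_queues string into labelled queues."""
    queues = []
    for queue_spec in spec.split(":"):
        parts = queue_spec.split(",")
        qtype = parts[0]
        values = [int(v) for v in parts[1:]]
        names = FIELDS_BY_TYPE[qtype]  # KeyError for an unknown queue type
        if not 1 <= len(values) <= len(names):
            raise ValueError(f"bad field count in {queue_spec!r}")
        # Trailing optional fields that were not given are simply left out.
        queues.append({"type": qtype, **dict(zip(names, values))})
    return queues


if __name__ == "__main__":
    for queue in parse_receive_queues("P,65536,256,192,128"):
        print(queue)
```

Under these assumptions, the value from the thread describes a single per-peer queue of 256 buffers of 65536 bytes each, with a low watermark of 192 and a credit window of 128. It contains no "S" entry, which would be consistent with ibv_create_srq() no longer being called.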