Your command line may have just picked up a typo in transit, but something isn't right:

-hostfile /home/sysgen/infiniband-mpi-test/machine/usr/mpi/gcc4/openmpi-1.2.2-1/tests/IMB-2.3/IMB-MPI1

That looks more like a path to a binary than a path to a hostfile. Is there a missing space or filename somewhere?

If not, then I would have expected this to error out since the argument would be taken as the hostfile, leaving no executable specified.

If you get that straightened out, then try adding -mca btl openib,sm,self to the command line. This directs mpirun to use only the OpenIB, shared-memory, and loopback (self) transports, so you shouldn't pick up uDAPL any more.
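For illustration only, here is roughly what I mean, assuming the space really was lost between the hostfile path and the binary path (adjust the split point to match your actual files):

```shell
# Hedged sketch: hostfile and executable as separate arguments,
# with the BTL list restricted to OpenIB, shared memory, and self.
/usr/mpi/gcc4/openmpi-1.2.2-1/bin/mpirun -np 4 \
    -hostfile /home/sysgen/infiniband-mpi-test/machine \
    -mca btl openib,sm,self \
    /usr/mpi/gcc4/openmpi-1.2.2-1/tests/IMB-2.3/IMB-MPI1
```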

Ralph


On Nov 20, 2008, at 12:38 PM, Michael Oevermann wrote:

Hi all,

I have "inherited" a small cluster with a head node and four compute
nodes which I have to administer.  The nodes are connected via infiniband (OFED), but the head is not. 
I am a complete novice to the infiniband stuff and here is my problem:

The infiniband configuration seems to be OK. The usual tests suggested in the OFED install guide give 
the expected output, e.g.

ibv_devinfo on the nodes:


************************* oscar_cluster *************************
--------- n01---------
hca_id: mthca0
fw_ver: 1.2.0
node_guid: 0002:c902:0025:930c
sys_image_guid: 0002:c902:0025:930f
vendor_id: 0x02c9
vendor_part_id: 25204
hw_ver: 0xA0
board_id: MT_03B0140001
phys_port_cnt: 1
port: 1
state: PORT_ACTIVE (4)
max_mtu: 2048 (4)
active_mtu: 2048 (4)
sm_lid: 2
port_lid: 1
port_lmc: 0x00

etc. for the other nodes.

sminfo on the nodes:

************************* oscar_cluster *************************
--------- n01---------
sminfo: sm lid 2 sm guid 0x2c90200259201, activity count 6881 priority 0 state 3 SMINFO_MASTER
--------- n02---------
sminfo: sm lid 2 sm guid 0x2c90200259201, activity count 6882 priority 0 state 3 SMINFO_MASTER
--------- n03---------
sminfo: sm lid 2 sm guid 0x2c90200259201, activity count 6883 priority 0 state 3 SMINFO_MASTER
--------- n04---------
sminfo: sm lid 2 sm guid 0x2c90200259201, activity count 6884 priority 0 state 3 SMINFO_MASTER



However, when I start an MPI job directly (without using a scheduler) via:

/usr/mpi/gcc4/openmpi-1.2.2-1/bin/mpirun -np 4 -hostfile /home/sysgen/infiniband-mpi-test/machine/usr/mpi/gcc4/openmpi-1.2.2-1/tests/IMB-2.3/IMB-MPI1

I get the error message:

[0,1,0]: uDAPL on host n01 was unable to find any NICs.
Another transport will be used instead, although this may result in
lower performance.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
[0,1,2]: uDAPL on host n01 was unable to find any NICs.
Another transport will be used instead, although this may result in
lower performance.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
[0,1,3]: uDAPL on host n02 was unable to find any NICs.
Another transport will be used instead, although this may result in
lower performance.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
[0,1,1]: uDAPL on host n02 was unable to find any NICs.
Another transport will be used instead, although this may result in
lower performance.
--------------------------------------------------------------------------
MPI over plain Gigabit Ethernet and IP networking works just fine, but InfiniBand doesn't. The MPI libs I am using
for the test are definitely compiled with IB support, and the tests have been run successfully on
the cluster before.
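In case it helps, one quick way to double-check which transports this Open MPI build actually includes (assuming ompi_info sits in the same bin directory as mpirun):

```shell
# List the BTL components compiled into this Open MPI installation;
# "openib" should appear if InfiniBand support is really built in.
/usr/mpi/gcc4/openmpi-1.2.2-1/bin/ompi_info | grep btl
```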

Any suggestions as to what is going wrong here?

Best regards and thanks for any help!

Michael




_______________________________________________
users mailing list
users@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users