
Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] infiniband problem
From: Ralph Castain (rhc_at_[hidden])
Date: 2008-11-20 14:51:56


Your command line may have just come across with a typo, but something
isn't right:

-hostfile /home/sysgen/infiniband-mpi-test/machine/usr/mpi/gcc4/
openmpi-1.2.2-1/tests/IMB-2.3/IMB-MPI1

That looks more like a path to a binary than a path to a hostfile. Is
there a missing space or filename somewhere?

If not, then I would have expected this to error out since the
argument would be taken as the hostfile, leaving no executable
specified.
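
In case it helps: a hostfile is just a plain text file listing one node
per line, optionally with a slot count. Assuming your nodes are named
n01 through n04 as in the sminfo output in your mail, a minimal hostfile
might look something like:

   n01 slots=1
   n02 slots=1
   n03 slots=1
   n04 slots=1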

If you get that straightened out, then try adding -mca btl
openib,sm,self to the command line. This will direct mpirun to use only
the OpenIB, shared memory, and loopback transports, so you shouldn't
pick up uDAPL any more.
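
Assuming the missing space belongs between the hostfile and the binary
(i.e. the hostfile is the "machine" file and the executable is
IMB-MPI1), the full command would then look something like:

   /usr/mpi/gcc4/openmpi-1.2.2-1/bin/mpirun -np 4 \
       -hostfile /home/sysgen/infiniband-mpi-test/machine \
       -mca btl openib,sm,self \
       /usr/mpi/gcc4/openmpi-1.2.2-1/tests/IMB-2.3/IMB-MPI1

That's just a guess at where the break belongs, of course. You can also
double-check that your Open MPI install has the openib BTL built in
with something like:

   ompi_info | grep btl

which should show an "MCA btl: openib" line if IB support was compiled
in.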

Ralph

On Nov 20, 2008, at 12:38 PM, Michael Oevermann wrote:

> Hi all,
>
> I have "inherited" a small cluster with a head node and four compute
> nodes which I have to administer. The nodes are connected via
> infiniband (OFED), but the head is not.
> I am a complete novice to the infiniband stuff and here is my problem:
>
> The infiniband configuration seems to be OK. The usual tests
> suggested in the OFED install guide give
> the expected output, e.g.
>
> ibv_devinfo on the nodes:
>
> ************************* oscar_cluster *************************
> --------- n01---------
> hca_id: mthca0
> fw_ver: 1.2.0
> node_guid: 0002:c902:0025:930c
> sys_image_guid: 0002:c902:0025:930f
> vendor_id: 0x02c9
> vendor_part_id: 25204
> hw_ver: 0xA0
> board_id: MT_03B0140001
> phys_port_cnt: 1
> port: 1
> state: PORT_ACTIVE (4)
> max_mtu: 2048 (4)
> active_mtu: 2048 (4)
> sm_lid: 2
> port_lid: 1
> port_lmc: 0x00
>
> etc. for the other nodes.
>
> sminfo on the nodes:
>
> ************************* oscar_cluster *************************
> --------- n01---------
> sminfo: sm lid 2 sm guid 0x2c90200259201, activity count 6881
> priority 0 state 3 SMINFO_MASTER
> --------- n02---------
> sminfo: sm lid 2 sm guid 0x2c90200259201, activity count 6882
> priority 0 state 3 SMINFO_MASTER
> --------- n03---------
> sminfo: sm lid 2 sm guid 0x2c90200259201, activity count 6883
> priority 0 state 3 SMINFO_MASTER
> --------- n04---------
> sminfo: sm lid 2 sm guid 0x2c90200259201, activity count 6884
> priority 0 state 3 SMINFO_MASTER
>
>
>
> However, when I directly start a mpi job (without using a scheduler)
> via:
>
> /usr/mpi/gcc4/openmpi-1.2.2-1/bin/mpirun -np 4 -hostfile /home/
> sysgen/infiniband-mpi-test/machine/usr/mpi/gcc4/openmpi-1.2.2-1/
> tests/IMB-2.3/IMB-MPI1
>
> I get the error message:
>
> [0,1,0]: uDAPL on host n01 was unable to find any NICs.
> Another transport will be used instead, although this may result in
> lower performance.
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> [0,1,2]: uDAPL on host n01 was unable to find any NICs.
> Another transport will be used instead, although this may result in
> lower performance.
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> [0,1,3]: uDAPL on host n02 was unable to find any NICs.
> Another transport will be used instead, although this may result in
> lower performance.
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> [0,1,1]: uDAPL on host n02 was unable to find any NICs.
> Another transport will be used instead, although this may result in
> lower performance.
> --------------------------------------------------------------------------
> MPI with normal GB Ethernet and IP networking works just fine, but
> the infiniband doesn't. The MPI libs I am using for the test are
> definitely compiled with IB support, and the tests have been run
> successfully on the cluster before.
>
> Any suggestions what is going wrong here?
>
> Best regards and thanks for any help!
>
> Michael
>
>
>
>
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users