
Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] Error - BTLs attempted: self sm - on a cluster with IB and openib btl enabled
From: Gus Correa (gus_at_[hidden])
Date: 2013-08-12 15:32:35


Thank you for the prompt help, Ralph!

Yes, it is OMPI 1.4.3 built with openib support:

$ ompi_info | grep openib
                  MCA btl: openib (MCA v2.0, API v2.0, Component v1.4.3)

There are three libraries in prefix/lib/openmpi,
but no mca_btl_openib library:

$ ls $PREFIX/lib/openmpi/
libompi_dbg_msgq.a libompi_dbg_msgq.la libompi_dbg_msgq.so

However, that may just be because it is an older OMPI version,
from the 1.4 series: those are exactly the libraries I have on
another cluster with IB and OMPI 1.4.3, where there isn't a problem.
The libraries' organization may have changed from
the 1.4 to the 1.6 series, right?
I only have mca_btl_openib libraries in the 1.6 series, and it
would be a hardship to migrate this program to OMPI 1.6.
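
(Side note: if openib had been compiled statically into libmpi,
say with --disable-dlopen, there would be no separate
mca_btl_openib.so either. A quick check that should work either
way on a standard 1.4-series install:

$ ompi_info --param btl openib | head
$ ls $PREFIX/lib/openmpi/mca_btl_openib* 2>/dev/null

If the first command prints openib parameters, the component is
known to this build, whether or not the second finds a shared
object.)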

(OK, I have a newer OMPI as well, but I still need the old one
for some programs.)

Why the heck is it not detecting the InfiniBand hardware?
[It used to detect it! :( ]
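
One thing I will try next, assuming the usual MCA verbosity knob
applies to 1.4.3, is to ask the BTL framework to report what it
keeps and what it discards at startup:

$ mpiexec --mca btl openib,sm,self --mca btl_base_verbose 30 \
      -np 2 ./connectivity_c

With luck that will say why openib is being rejected.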

Thank you,
Gus Correa

On 08/12/2013 03:01 PM, Ralph Castain wrote:
> Check ompi_info - was it built with openib support?
>
> Then check that the mca_btl_openib library is present in the prefix/lib/openmpi directory
>
> Sounds like it isn't finding the openib plugin
>
>
> On Aug 12, 2013, at 11:57 AM, Gus Correa <gus_at_[hidden]> wrote:
>
>> Dear Open MPI pros
>>
>> On one of the clusters here, which has InfiniBand,
>> I am getting this type of error from
>> Open MPI 1.4.3 (OK, I know it is old ...):
>>
>> *********************************************************
>> Tcl_InitNotifier: unable to start notifier thread
>> Abort: Command not found.
>> Tcl_InitNotifier: unable to start notifier thread
>> Abort: Command not found.
>> --------------------------------------------------------------------------
>> At least one pair of MPI processes are unable to reach each other for
>> MPI communications. This means that no Open MPI device has indicated
>> that it can be used to communicate between these processes. This is
>> an error; Open MPI requires that all MPI processes be able to reach
>> each other. This error can sometimes be the result of forgetting to
>> specify the "self" BTL.
>>
>> Process 1 ([[907,1],68]) is on host: node11.cluster
>> Process 2 ([[907,1],0]) is on host: node15
>> BTLs attempted: self sm
>>
>> Your MPI job is now going to abort; sorry.
>> --------------------------------------------------------------------------
>> *********************************************************
>>
>> Awkward, because I have "btl = ^tcp" in openmpi-mca-params.conf.
>> The same error also happens if I force --mca btl openib,sm,self
>> in mpiexec.
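>>
>> For completeness, here is the config line and the forced run,
>> sketched from memory ($NP stands in for the actual process count):
>>
>> $ grep btl $PREFIX/etc/openmpi-mca-params.conf
>> btl = ^tcp
>> $ mpiexec --mca btl openib,sm,self -np $NP ./connectivity_c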
>>
>> ** Why is it attempting only the self and sm BTLs, but not openib? **
>>
>> I also don't understand the initial errors,
>> "Tcl_InitNotifier: unable to start notifier thread".
>> Are they perhaps coming from Torque?
>>
>> As I said, the cluster has InfiniBand,
>> which is what we've been using forever, until
>> these errors started today.
>>
>> When I divert the traffic to tcp
>> (--mca btl tcp,sm,self), the jobs run normally.
>>
>> I am using the examples/connectivity_c.c program
>> to troubleshoot this problem.
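>>
>> That is the stock example shipped with the Open MPI tarball,
>> built and run roughly like this (paths from my setup):
>>
>> $ mpicc examples/connectivity_c.c -o connectivity_c
>> $ mpiexec -np $NP ./connectivity_c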
>>
>> ***
>> I checked a few things on the IB side.
>>
>> The output of ibstat on all nodes seems OK (links up, etc.),
>> as does the output of ibhosts and ibchecknet.
>>
>> Only two connected ports had errors, as reported by ibcheckerrors,
>> and I cleared them with ibclearerrors.
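>>
>> Roughly, the sequence of checks was (all standard OFED
>> infiniband-diags tools):
>>
>> $ ibstat          # port state and rate on each node
>> $ ibhosts         # HCAs visible on the fabric
>> $ ibchecknet      # fabric-wide connectivity sweep
>> $ ibcheckerrors   # per-port error counters
>> $ ibclearerrors   # reset the counters after inspecting them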
>>
>> The IB subnet manager is running on the head node.
>> I restarted the daemon, but nothing changed; the jobs continue to
>> fail with the same errors.
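>>
>> (For reference, the daemon restart, assuming OpenSM via its
>> stock init script, was something like
>>
>> $ /etc/init.d/opensmd restart
>>
>> though the exact service name varies by distribution.)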
>>
>> **
>>
>> Any hints of what is going on, how to diagnose it, and how to fix it?
>> Any gentler way than rebooting everything and power cycling
>> the IB switch? (And would that brute-force method work, at least?)
>>
>> Thank you,
>> Gus Correa