Subject: Re: [OMPI users] Error - BTLs attempted: self sm - on a cluster with IB and openib btl enabled
From: Gus Correa (gus_at_[hidden])
Date: 2013-08-12 15:32:35


Thank you for the prompt help, Ralph!

Yes, it is OMPI 1.4.3 built with openib support:

$ ompi_info | grep openib
                  MCA btl: openib (MCA v2.0, API v2.0, Component v1.4.3)

There are only three libraries in prefix/lib/openmpi,
and no mca_btl_openib library:

$ ls $PREFIX/lib/openmpi/
libompi_dbg_msgq.a libompi_dbg_msgq.la libompi_dbg_msgq.so
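
Listing all the BTL components that ompi_info reports may help tell
whether openib was linked in statically instead of being built as a
plugin (just a guess on my part):

$ ompi_info | grep "MCA btl"

If self, sm, tcp, and openib all show up there despite the empty
plugin directory, the component is presumably compiled into the main
libraries rather than loaded as a DSO.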

However, this may be just because it is an older OMPI version in
the 1.4 series: those are exactly the libraries I have on another
cluster with IB and OMPI 1.4.3, where there is no problem.
The library organization may have changed from
the 1.4 to the 1.6 series, right?
I only have mca_btl_openib libraries in the 1.6 series, and it
would be a hardship to migrate this program to OMPI 1.6.

(OK, I do have a newer OMPI as well, but I still need the old one
for some programs.)
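
A way to double-check that the openib component is actually loadable
in this build (as opposed to merely listed) might be to dump its MCA
parameters; if the component cannot be opened, nothing shows up:

$ ompi_info --param btl openib

I am assuming this 1.4.3 build behaves like my other one; the command
itself is standard across the 1.x series.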

Why the heck is it not detecting the InfiniBand hardware?
[It used to detect it! :( ]
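
If it helps the diagnosis, I will try a verbose run to see why openib
disqualifies itself at startup; something along these lines, where
btl_base_verbose is the standard verbosity knob and connectivity_c is
the test program mentioned below:

$ mpiexec --mca btl openib,sm,self --mca btl_base_verbose 30 \
      -np 2 ./connectivity_c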

Thank you,
Gus Correa

On 08/12/2013 03:01 PM, Ralph Castain wrote:
> Check ompi_info - was it built with openib support?
>
> Then check that the mca_btl_openib library is present in the prefix/lib/openmpi directory
>
> Sounds like it isn't finding the openib plugin
>
>
> On Aug 12, 2013, at 11:57 AM, Gus Correa <gus_at_[hidden]> wrote:
>
>> Dear Open MPI pros
>>
>> On one of the clusters here, which has InfiniBand,
>> I am getting this type of error from
>> Open MPI 1.4.3 (OK, I know it is old ...):
>>
>> *********************************************************
>> Tcl_InitNotifier: unable to start notifier thread
>> Abort: Command not found.
>> Tcl_InitNotifier: unable to start notifier thread
>> Abort: Command not found.
>> --------------------------------------------------------------------------
>> At least one pair of MPI processes are unable to reach each other for
>> MPI communications. This means that no Open MPI device has indicated
>> that it can be used to communicate between these processes. This is
>> an error; Open MPI requires that all MPI processes be able to reach
>> each other. This error can sometimes be the result of forgetting to
>> specify the "self" BTL.
>>
>> Process 1 ([[907,1],68]) is on host: node11.cluster
>> Process 2 ([[907,1],0]) is on host: node15
>> BTLs attempted: self sm
>>
>> Your MPI job is now going to abort; sorry.
>> --------------------------------------------------------------------------
>> *********************************************************
>>
>> Awkward, because I have "btl = ^tcp" in openmpi-mca-params.conf.
>> The same error also happens if I force --mca btl openib,sm,self
>> in mpiexec.
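>>
>> (For reference, the equivalent environment-variable form, which
>> should rule out a stale params file, would be something like:
>>
>> $ export OMPI_MCA_btl=openib,sm,self
>> $ mpiexec -np 2 ./connectivity_c
>>
>> though I have not tried it that way here.)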
>>
>> ** Why is it attempting only the self and sm BTLs, but not openib? **
>>
>> I also don't understand the initial
>> "Tcl_InitNotifier: unable to start notifier thread" errors.
>> Are they perhaps coming from Torque?
>>
>> As I said, the cluster has InfiniBand,
>> which is what we've been using forever, until
>> these errors started today.
>>
>> When I divert the traffic to tcp
>> (--mca btl tcp,sm,self), the jobs run normally.
>>
>> I am using the examples/connectivity_c.c program
>> to troubleshoot this problem.
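>>
>> (Built and run in the usual way, in case it matters:
>>
>> $ mpicc examples/connectivity_c.c -o connectivity_c
>> $ mpiexec -np 2 ./connectivity_c
>>
>> where the -np count is just an example.)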
>>
>> ***
>> I checked a few things on the IB side.
>>
>> The output of ibstat on all nodes seems OK (links up, etc.),
>> and so does the output of ibhosts and ibchecknet.
>>
>> Only two connected ports had errors, as reported by ibcheckerrors,
>> and I cleared them with ibclearerrors.
>>
>> The IB subnet manager is running on the head node.
>> I restarted the daemon, but nothing changed; the jobs continue to
>> fail with the same errors.
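>>
>> Two more things I could check, on the assumption that they are
>> relevant here: whether the verbs layer itself sees the HCA on the
>> compute nodes (openib goes through libibverbs, not the ib*
>> diagnostics), and whether the locked-memory limit is clamped for
>> jobs started under Torque:
>>
>> $ ibv_devinfo | grep -E 'hca_id|state'
>> $ ulimit -l
>>
>> A port state other than PORT_ACTIVE, or a small "max locked memory"
>> value, would both keep openib from being used, as far as I know.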
>>
>> **
>>
>> Any hints of what is going on, how to diagnose it, and how to fix it?
>> Any gentler way than rebooting everything and power-cycling
>> the IB switch? (And would this brute-force method work, at least?)
>>
>> Thank you,
>> Gus Correa