Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] Error - BTLs attempted: self sm - on a cluster with IB and openib btl enabled
From: Ralph Castain (rhc_at_[hidden])
Date: 2013-08-12 15:01:38


Check ompi_info - was it built with openib support?

Then check that the mca_btl_openib library is present in the prefix/lib/openmpi directory

Sounds like it isn't finding the openib plugin
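A quick sketch of those two checks (the /opt/openmpi prefix below is only an example -- substitute the prefix your build was configured with):

```shell
# Was this Open MPI build compiled with openib support?
# If the grep prints nothing, the build has no openib BTL at all.
ompi_info | grep -i openib

# Is the openib BTL plugin present where Open MPI looks for it?
# (replace /opt/openmpi with your actual installation prefix)
ls /opt/openmpi/lib/openmpi/mca_btl_openib*
```

If the first command lists the openib component but the second finds no file, the runtime is likely picking up a different installation than the one ompi_info describes.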

On Aug 12, 2013, at 11:57 AM, Gus Correa <gus_at_[hidden]> wrote:

> Dear Open MPI pros
>
> On one of the clusters here, which has InfiniBand,
> I am getting this type of error from
> Open MPI 1.4.3 (OK, I know it is old ...):
>
> *********************************************************
> Tcl_InitNotifier: unable to start notifier thread
> Abort: Command not found.
> Tcl_InitNotifier: unable to start notifier thread
> Abort: Command not found.
> --------------------------------------------------------------------------
> At least one pair of MPI processes are unable to reach each other for
> MPI communications. This means that no Open MPI device has indicated
> that it can be used to communicate between these processes. This is
> an error; Open MPI requires that all MPI processes be able to reach
> each other. This error can sometimes be the result of forgetting to
> specify the "self" BTL.
>
> Process 1 ([[907,1],68]) is on host: node11.cluster
> Process 2 ([[907,1],0]) is on host: node15
> BTLs attempted: self sm
>
> Your MPI job is now going to abort; sorry.
> --------------------------------------------------------------------------
> *********************************************************
>
> Awkward, because I have "btl = ^tcp" in openmpi-mca-params.conf.
> The same error also happens if I force --mca btl openib,sm,self
> in mpiexec.
>
> ** Why is it attempting only the self and sm BTLs, but not openib? **
>
> I also don't understand the initial errors,
> "Tcl_InitNotifier: unable to start notifier thread".
> Are they coming from Torque, perhaps?
>
> As I said, the cluster has InfiniBand,
> which is what we've been using forever, until
> these errors started today.
>
> When I divert the traffic to tcp
> (--mca btl tcp,sm,self), the jobs run normally.
>
> I am using the examples/connectivity_c.c program
> to troubleshoot this problem.
>
> ***
> I checked a few things on the IB side.
>
> The output of ibstat on all nodes seems OK (links up, etc.),
> and so is the output of ibhosts and ibchecknet.
>
> Only two connected ports had errors, as reported by ibcheckerrors,
> and I cleared them with iblclearerrors.
>
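The port-state side of those checks can be sketched like this (the grep patterns are assumptions about the usual ibstat output format):

```shell
# Confirm every HCA port is Active/LinkUp; anything else
# (Down, Initializing, Polling) points at a link or SM problem
ibstat | grep -E 'State:|Physical state:'

# Ask the fabric for its subnet manager state
# (should report exactly one master SM)
sminfo
```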
> The IB subnet manager is running on the head node.
> I restarted the daemon, but nothing changed; the jobs continue to
> fail with the same errors.
>
> **
>
> Any hints of what is going on, how to diagnose it, and how to fix it?
> Any gentler way than rebooting everything and power-cycling
> the IB switch? (And would this brute-force method work, at least?)
>
> Thank you,
> Gus Correa
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users