Open MPI User's Mailing List Archives

Subject: [OMPI users] Error - BTLs attempted: self sm - on a cluster with IB and openib btl enabled
From: Gus Correa (gus_at_[hidden])
Date: 2013-08-12 14:57:56

Dear Open MPI pros

On one of the clusters here, which has InfiniBand,
I am getting this type of error from
Open MPI 1.4.3 (OK, I know it is old ...):

Tcl_InitNotifier: unable to start notifier thread
Abort: Command not found.
Tcl_InitNotifier: unable to start notifier thread
Abort: Command not found.
At least one pair of MPI processes are unable to reach each other for
MPI communications. This means that no Open MPI device has indicated
that it can be used to communicate between these processes. This is
an error; Open MPI requires that all MPI processes be able to reach
each other. This error can sometimes be the result of forgetting to
specify the "self" BTL.

   Process 1 ([[907,1],68]) is on host: node11.cluster
   Process 2 ([[907,1],0]) is on host: node15
   BTLs attempted: self sm

Your MPI job is now going to abort; sorry.

This is odd, because I have "btl = ^tcp" in openmpi-mca-params.conf.
The same error also happens if I force --mca btl openib,sm,self
on the mpiexec command line.

** Why is it attempting only the self and sm BTLs, but not openib? **

I don't understand the initial errors either:
"Tcl_InitNotifier: unable to start notifier thread".
Are they coming from Torque perhaps?

As I said, the cluster has InfiniBand,
which is what we've been using all along, until
these errors started today.

When I divert the traffic to tcp
(--mca btl tcp,sm,self), the jobs run normally.
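
For reference, the two BTL selections I am comparing can be written out with a small dry-run sketch like the one below. It only prints the mpiexec command lines and launches nothing. The process count (-np 2) and the ./connectivity_c path are placeholders assumed for illustration, not the exact values from my jobs.

```shell
#!/bin/sh
# Dry-run sketch: print the mpiexec command line for each BTL selection
# discussed above. Nothing is launched; copy a printed line to run it.
# ASSUMPTIONS: "-np 2" and "./connectivity_c" are placeholders.

btl_cmd() {
    # $1 = comma-separated BTL list handed to "--mca btl"
    printf 'mpiexec --mca btl %s -np 2 ./connectivity_c\n' "$1"
}

btl_cmd openib,sm,self   # forced InfiniBand: aborts with "BTLs attempted: self sm"
btl_cmd tcp,sm,self      # TCP instead of InfiniBand: runs normally
```

Selecting the BTLs on the command line takes the "btl = ^tcp" line in openmpi-mca-params.conf out of the picture, so both cases are compared on equal terms.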

I am using the examples/connectivity_c.c program
to troubleshoot this problem.

I checked a few things on the IB side.

The output of ibstat on all nodes seems OK (links up, etc.),
and so is the output of ibhosts and ibchecknet.

Only two connected ports had errors, as reported by ibcheckerrors,
and I cleared them with ibclearerrors.
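
The read-only checks above can be repeated in one pass with a sketch along these lines. It assumes the infiniband-diags tools are on the PATH and are run with no arguments. The run_if_present helper is a hypothetical name introduced here, and a tool that is not installed is skipped rather than treated as an error.

```shell
#!/bin/sh
# Sketch: run the read-only InfiniBand checks named above, in order.
# ASSUMPTION: the tools are invoked with no arguments; adjust as needed.

run_if_present() {
    # Run the given command only if it exists on this node.
    if command -v "$1" >/dev/null 2>&1; then
        echo "== $1 =="
        "$@"
    else
        echo "skip: $1 not found"
    fi
}

for tool in ibstat ibhosts ibchecknet ibcheckerrors; do
    run_if_present "$tool"
done
```

The command that clears the error counters is left out on purpose, so the sketch changes nothing on the fabric.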

The IB subnet manager is running on the head node.
I restarted the daemon, but nothing changed; the jobs continue to
fail with the same errors.


Any hints about what is going on, how to diagnose it, and how to fix it?
Is there any gentler way than rebooting everything and power cycling
the IB switch? (And would this brute-force method work, at least?)

Thank you,
Gus Correa