
Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] Error - BTLs attempted: self sm - on a cluster with IB and openib btl enabled
From: Ralph Castain (rhc_at_[hidden])
Date: 2013-08-12 17:29:36


No, this has nothing to do with the registration limit. For some reason, the system is refusing to create a thread - i.e., it is pthread_create that is failing. I have no idea what would be causing that to happen.

Try setting it to unlimited and see if it allows the thread to start, I guess.

On Aug 12, 2013, at 2:20 PM, Gus Correa <gus_at_[hidden]> wrote:

> Hi Ralph, all
>
> I include more information below,
> after turning on btl_openib_verbose 30.
> As you can see, OMPI tries, and fails, to load openib.
>
> Last week I reduced the memlock limit from unlimited
> to ~12GB, as part of a general attempt to rein in memory
> use/abuse by jobs sharing a node.
> No parallel job ran until today, when the problem showed up.
> Could the memlock limit be the root of the problem?
>
> The OMPI FAQ says the memlock limit
> should be a "large number (or better yet, unlimited)":
>
> http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages
>
> The next two FAQ entries kind of indicate that
> it should be set to "unlimited", but don't say so clearly:
>
> http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages-user
> http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages-more
>
> QUESTION:
> Is "unlimited" a must, or is there any (magic) "large number"
> that would be OK for openib?
>
> I thought a 12GB memlock limit would be OK, but maybe it is not.
> The nodes have 64GB RAM.
>
> Thank you,
> Gus Correa
>
> *************************************************
> [node15.cluster][[8097,1],0][../../../../../ompi/mca/btl/openib/btl_openib_component.c:562:start_async_event_thread] Failed to create async event thread
> [node15.cluster][[8097,1],1][../../../../../ompi/mca/btl/openib/btl_openib_component.c:562:start_async_event_thread] Failed to create async event thread
> [node15.cluster][[8097,1],4][../../../../../ompi/mca/btl/openib/btl_openib_component.c:562:start_async_event_thread] Failed to create async event thread
> [node15.cluster][[8097,1],3][../../../../../ompi/mca/btl/openib/btl_openib_component.c:562:start_async_event_thread] Failed to create async event thread
> [node15.cluster][[8097,1],2][../../../../../ompi/mca/btl/openib/btl_openib_component.c:562:start_async_event_thread] Failed to create async event thread
> --------------------------------------------------------------------------
> WARNING: There was an error initializing an OpenFabrics device.
>
> Local host: node15.cluster
> Local device: mlx4_0
> --------------------------------------------------------------------------
> [node15.cluster][[8097,1],10][../../../../../ompi/mca/btl/openib/btl_openib_component.c:562:start_async_event_thread] Failed to create async event thread
> [node15.cluster][[8097,1],12][../../../../../ompi/mca/btl/openib/btl_openib_component.c:562:start_async_event_thread] Failed to create async event thread
> [node15.cluster][[8097,1],13][../../../../../ompi/mca/btl/openib/btl_openib_component.c:562:start_async_event_thread] Failed to create async event thread
> [node14.cluster][[8097,1],17][../../../../../ompi/mca/btl/openib/btl_openib_component.c:562:start_async_event_thread] Failed to create async event thread
> [node14.cluster][[8097,1],23][../../../../../ompi/mca/btl/openib/btl_openib_component.c:562:start_async_event_thread] Failed to create async event thread
> [node14.cluster][[8097,1],24][../../../../../ompi/mca/btl/openib/btl_openib_component.c:562:start_async_event_thread] Failed to create async event thread
> [node14.cluster][[8097,1],26][../../../../../ompi/mca/btl/openib/btl_openib_component.c:562:start_async_event_thread] Failed to create async event thread
> [node14.cluster][[8097,1],28][../../../../../ompi/mca/btl/openib/btl_openib_component.c:562:start_async_event_thread] Failed to create async event thread
> [node14.cluster][[8097,1],31][../../../../../ompi/mca/btl/openib/btl_openib_component.c:562:start_async_event_thread] Failed to create async event thread
> --------------------------------------------------------------------------
> At least one pair of MPI processes are unable to reach each other for
> MPI communications. This means that no Open MPI device has indicated
> that it can be used to communicate between these processes. This is
> an error; Open MPI requires that all MPI processes be able to reach
> each other. This error can sometimes be the result of forgetting to
> specify the "self" BTL.
>
> Process 1 ([[8097,1],4]) is on host: node15.cluster
> Process 2 ([[8097,1],16]) is on host: node14
> BTLs attempted: self sm
>
> Your MPI job is now going to abort; sorry.
> --------------------------------------------------------------------------
>
> *************************************************
>
> On 08/12/2013 03:32 PM, Gus Correa wrote:
>> Thank you for the prompt help, Ralph!
>>
>> Yes, it is OMPI 1.4.3 built with openib support:
>>
>> $ ompi_info | grep openib
>> MCA btl: openib (MCA v2.0, API v2.0, Component v1.4.3)
>>
>> There are three libraries in prefix/lib/openmpi,
>> no mca_btl_openib library.
>>
>> $ ls $PREFIX/lib/openmpi/
>> libompi_dbg_msgq.a libompi_dbg_msgq.la libompi_dbg_msgq.so
>>
>>
>> However, this may just be because it is an older OMPI version in
>> the 1.4 series: those are exactly the libraries I have on another
>> cluster with IB and OMPI 1.4.3, where there isn't a problem.
>> The libraries' organization may have changed from
>> the 1.4 to the 1.6 series, right?
>> I only have mca_btl_openib libraries in the 1.6 series, but it
>> will be a hardship to migrate this program to OMPI 1.6.
>>
>> (OK, I have newer OMPI installed, but I also need the old one
>> for some programs.)
>>
>> Why the heck is it not detecting the Infiniband hardware?
>> [It used to detect it! :( ]
>>
>> Thank you,
>> Gus Correa
>>
>>
>> On 08/12/2013 03:01 PM, Ralph Castain wrote:
>>> Check ompi_info - was it built with openib support?
>>>
>>> Then check that the mca_btl_openib library is present in the
>>> prefix/lib/openmpi directory
>>>
>>> Sounds like it isn't finding the openib plugin
>>>
>>>
>>> On Aug 12, 2013, at 11:57 AM, Gus Correa <gus_at_[hidden]> wrote:
>>>
>>>> Dear Open MPI pros
>>>>
>>>> On one of the clusters here, which has Infiniband,
>>>> I am getting this type of error from
>>>> OpenMPI 1.4.3 (OK, I know it is old ...):
>>>>
>>>> *********************************************************
>>>> Tcl_InitNotifier: unable to start notifier thread
>>>> Abort: Command not found.
>>>> Tcl_InitNotifier: unable to start notifier thread
>>>> Abort: Command not found.
>>>> --------------------------------------------------------------------------
>>>>
>>>> At least one pair of MPI processes are unable to reach each other for
>>>> MPI communications. This means that no Open MPI device has indicated
>>>> that it can be used to communicate between these processes. This is
>>>> an error; Open MPI requires that all MPI processes be able to reach
>>>> each other. This error can sometimes be the result of forgetting to
>>>> specify the "self" BTL.
>>>>
>>>> Process 1 ([[907,1],68]) is on host: node11.cluster
>>>> Process 2 ([[907,1],0]) is on host: node15
>>>> BTLs attempted: self sm
>>>>
>>>> Your MPI job is now going to abort; sorry.
>>>> --------------------------------------------------------------------------
>>>>
>>>> *********************************************************
>>>>
>>>> Awkward, because I have "btl = ^tcp" in openmpi-mca-params.conf.
>>>> The same error also happens if I force --mca btl openib,sm,self
>>>> in mpiexec.
>>>>
>>>> ** Why is it attempting only the self and sm BTLs, but not openib? **
>>>>
>>>> I don't understand either the initial errors
>>>> "Tcl_InitNotifier: unable to start notifier thread".
>>>> Are they coming from Torque perhaps?
>>>>
>>>> As I said, the cluster has Infiniband,
>>>> which is what we've been using forever, until
>>>> these errors started today.
>>>>
>>>> When I divert the traffic to tcp
>>>> (--mca btl tcp,sm,self), the jobs run normally.
>>>>
>>>> I am using the examples/connectivity_c.c program
>>>> to troubleshoot this problem.
>>>>
>>>> ***
>>>> I checked a few things on the IB side.
>>>>
>>>> The output of ibstat on all nodes seems OK (links up, etc.),
>>>> and so is the output of ibhosts and ibchecknet.
>>>>
>>>> Only two connected ports had errors, as reported by ibcheckerrors,
>>>> and I cleared them with iblclearerrors.
>>>>
>>>> The IB subnet manager is running on the head node.
>>>> I restarted the daemon, but nothing changed; the jobs continue
>>>> to fail with the same errors.
>>>>
>>>> **
>>>>
>>>> Any hints of what is going on, how to diagnose it, and how to fix it?
>>>> Any gentler way than rebooting everything and power cycling
>>>> the IB switch? (And would this brute-force method work, at least?)
>>>>
>>>> Thank you,
>>>> Gus Correa
>>>> _______________________________________________
>>>> users mailing list
>>>> users_at_[hidden]
>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>
>>
>