Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Error - BTLs attempted: self sm - on a cluster with IB and openib btl enabled
From: Ralph Castain (rhc_at_[hidden])
Date: 2013-08-12 19:43:05


Seems strange that it would have anything to do with IB - it looks like the allocation itself is failing, and at only 512 bytes, that doesn't sound like something IB would cause.

If you write a little program that just calls malloc/realloc (no MPI), does it also fail?
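
Something like this minimal sketch (plain C, no MPI, using the same sizes from your job's stderr) would tell you whether allocation itself is broken on that node:

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    /* the same sizes that failed in the job's stderr */
    void *a = malloc(512);
    if (a == NULL) {
        perror("malloc(512)");
        return 1;
    }
    void *b = realloc(a, 1600);
    if (b == NULL) {
        perror("realloc(1600)");
        free(a);
        return 1;
    }
    printf("alloc/realloc OK\n");
    free(b);
    return 0;
}

If it passes when run interactively but fails inside a Torque job, then the limits the job inherits are the suspect, not the allocator.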

On Aug 12, 2013, at 3:35 PM, Gus Correa <gus_at_[hidden]> wrote:

> Hi Ralph
>
> Sorry if this is more of an IB than an OMPI problem,
> but from where I sit it shows up as OMPI jobs failing.
>
> Yes, indeed I set memlock to unlimited in limits.conf
> and in the pbs_mom, restarted everything, and relaunched the job.
> The error message changes, but the job still fails on Infiniband,
> now complaining about the IB driver, and also that it cannot
> allocate memory.
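>
> In /etc/security/limits.conf that means entries like these
> (for all users - a sketch of the form, not a verbatim copy):
>
> * soft memlock unlimited
> * hard memlock unlimited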
>
> Weird, because when I ssh to the node and run ibstat,
> it responds (see below).
> I actually ran ibstat everywhere, and all IB host adapters seem OK.
>
> Thank you,
> Gus Correa
>
>
> *********************** the job stderr ******************************
> unable to alloc 512 bytes
> Abort: Command not found.
> unable to realloc 1600 bytes
> Abort: Command not found.
> libibverbs: Warning: couldn't load driver 'mlx4': libmlx4-rdmav2.so: failed to map segment from shared object: Cannot allocate memory
> libibverbs: Warning: couldn't load driver 'nes': libnes-rdmav2.so: failed to map segment from shared object: Cannot allocate memory
> libibverbs: Warning: couldn't load driver 'cxgb3': libcxgb3-rdmav2.so: failed to map segment from shared object: Cannot allocate memory
> libibverbs: Warning: couldn't load driver 'mthca': libmthca-rdmav2.so: failed to map segment from shared object: Cannot allocate memory
> libibverbs: Warning: couldn't load driver 'ipathverbs': libipathverbs-rdmav2.so: failed to map segment from shared object: Cannot allocate memory
> libibverbs: Warning: no userspace device-specific driver found for /sys/class/infiniband_verbs/uverbs0
> libibverbs: Warning: couldn't load driver 'mlx4': libmlx4-rdmav2.so: failed to map segment from shared object: Cannot allocate memory
> libibverbs: Warning: couldn't load driver 'nes': libnes-rdmav2.so: failed to map segment from shared object: Cannot allocate memory
> libibverbs: Warning: couldn't load driver 'cxgb3': libcxgb3-rdmav2.so: failed to map segment from shared object: Cannot allocate memory
> libibverbs: Warning: couldn't load driver 'mthca': libmthca-rdmav2.so: failed to map segment from shared object: Cannot allocate memory
> libibverbs: Warning: couldn't load driver 'ipathverbs': libipathverbs-rdmav2.so: failed to map segment from shared object: Cannot allocate memory
> [node15:29683] *** Process received signal ***
> [node15:29683] Signal: Segmentation fault (11)
> [node15:29683] Signal code: (128)
> [node15:29683] Failing at address: (nil)
> [node15:29683] *** End of error message ***
> --------------------------------------------------------------------------
> mpiexec noticed that process rank 0 with PID 29683 on node node15.cluster exited on signal 11 (Segmentation fault).
> --------------------------------------------------------------------------
> [node15.cluster:29682] [[7785,0],0]-[[7785,1],2] mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)
> ************************************************************
>
> *************** ibstat on node15 *************************
>
> [root_at_node15 ~]# ibstat
> CA 'mlx4_0'
>         CA type: MT26428
>         Number of ports: 1
>         Firmware version: 2.7.700
>         Hardware version: b0
>         Node GUID: 0x002590ffff16284c
>         System image GUID: 0x002590ffff16284f
>         Port 1:
>                 State: Active
>                 Physical state: LinkUp
>                 Rate: 40
>                 Base lid: 11
>                 LMC: 0
>                 SM lid: 1
>                 Capability mask: 0x02510868
>                 Port GUID: 0x002590ffff16284d
>                 Link layer: IB
>
>
> ************************************************************
>
> On 08/12/2013 05:29 PM, Ralph Castain wrote:
>> No, this has nothing to do with the registration limit.
>> For some reason, the system is refusing to create a thread -
>> i.e., it is pthread_create that is failing.
>> I have no idea what would be causing that to happen.
>>
>> Try setting it to unlimited and see if it allows the thread
>> to start, I guess.
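>>
>> If you want to confirm that, a bare pthread_create test (a minimal
>> sketch, no MPI; compile with -pthread) should reproduce it:
>>
>> #include <stdio.h>
>> #include <string.h>
>> #include <pthread.h>
>>
>> static void *worker(void *arg) { return arg; }
>>
>> int main(void)
>> {
>>     pthread_t t;
>>     int rc = pthread_create(&t, NULL, worker, NULL);
>>     if (rc != 0) {
>>         /* pthread_create returns the error code directly */
>>         fprintf(stderr, "pthread_create: %s\n", strerror(rc));
>>         return 1;
>>     }
>>     pthread_join(t, NULL);
>>     printf("thread created OK\n");
>>     return 0;
>> }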
>>
>>
>> On Aug 12, 2013, at 2:20 PM, Gus Correa<gus_at_[hidden]> wrote:
>>
>>> Hi Ralph, all
>>>
>>> I include more information below,
>>> after turning on btl_openib_verbose 30.
>>> As you can see, OMPI tries, and fails, to load openib.
>>>
>>> Last week I reduced the memlock limit from unlimited
>>> to ~12GB, as part of a general attempt to rein in memory
>>> use/abuse by jobs sharing a node.
>>> No parallel job had run since then until today, when the problem showed up.
>>> Could the memlock limit be the root of the problem?
>>>
>>> The OMPI FAQ says the memlock limit
>>> should be a "large number (or better yet, unlimited)":
>>>
>>> http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages
>>>
>>> The next two FAQ entries kind of indicate that
>>> it should be set to "unlimited", but don't say so clearly:
>>>
>>> http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages-user
>>> http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages-more
>>>
>>> QUESTION:
>>> Is "unlimited" a must, or is there any (magic) "large number"
>>> that would be OK for openib?
>>>
>>> I thought a 12GB memlock limit would be OK, but maybe it is not.
>>> The nodes have 64GB RAM.
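>>>
>>> To see what limit the MPI processes actually inherit (a pbs_mom
>>> started at boot does not necessarily pick up limits.conf), I could
>>> submit a small check like this through the queue - just a sketch:
>>>
>>> #include <stdio.h>
>>> #include <sys/resource.h>
>>>
>>> int main(void)
>>> {
>>>     struct rlimit rl;
>>>     if (getrlimit(RLIMIT_MEMLOCK, &rl) != 0) {
>>>         perror("getrlimit");
>>>         return 1;
>>>     }
>>>     if (rl.rlim_cur == RLIM_INFINITY)
>>>         printf("memlock soft limit: unlimited\n");
>>>     else
>>>         printf("memlock soft limit: %llu bytes\n",
>>>                (unsigned long long)rl.rlim_cur);
>>>     return 0;
>>> }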
>>>
>>> Thank you,
>>> Gus Correa
>>>
>>> *************************************************\
>>> [node15.cluster][[8097,1],0][../../../../../ompi/mca/btl/openib/btl_openib_component.c:562:start_async_event_thread] Failed to create async event thread
>>> [node15.cluster][[8097,1],1][../../../../../ompi/mca/btl/openib/btl_openib_component.c:562:start_async_event_thread] Failed to create async event thread
>>> [node15.cluster][[8097,1],4][../../../../../ompi/mca/btl/openib/btl_openib_component.c:562:start_async_event_thread] Failed to create async event thread
>>> [node15.cluster][[8097,1],3][../../../../../ompi/mca/btl/openib/btl_openib_component.c:562:start_async_event_thread] Failed to create async event thread
>>> [node15.cluster][[8097,1],2][../../../../../ompi/mca/btl/openib/btl_openib_component.c:562:start_async_event_thread] Failed to create async event thread
>>> --------------------------------------------------------------------------
>>> WARNING: There was an error initializing an OpenFabrics device.
>>>
>>> Local host: node15.cluster
>>> Local device: mlx4_0
>>> --------------------------------------------------------------------------
>>> [node15.cluster][[8097,1],10][../../../../../ompi/mca/btl/openib/btl_openib_component.c:562:start_async_event_thread] Failed to create async event thread
>>> [node15.cluster][[8097,1],12][../../../../../ompi/mca/btl/openib/btl_openib_component.c:562:start_async_event_thread] Failed to create async event thread
>>> [node15.cluster][[8097,1],13][../../../../../ompi/mca/btl/openib/btl_openib_component.c:562:start_async_event_thread] Failed to create async event thread
>>> [node14.cluster][[8097,1],17][../../../../../ompi/mca/btl/openib/btl_openib_component.c:562:start_async_event_thread] Failed to create async event thread
>>> [node14.cluster][[8097,1],23][../../../../../ompi/mca/btl/openib/btl_openib_component.c:562:start_async_event_thread] Failed to create async event thread
>>> [node14.cluster][[8097,1],24][../../../../../ompi/mca/btl/openib/btl_openib_component.c:562:start_async_event_thread] Failed to create async event thread
>>> [node14.cluster][[8097,1],26][../../../../../ompi/mca/btl/openib/btl_openib_component.c:562:start_async_event_thread] Failed to create async event thread
>>> [node14.cluster][[8097,1],28][../../../../../ompi/mca/btl/openib/btl_openib_component.c:562:start_async_event_thread] Failed to create async event thread
>>> [node14.cluster][[8097,1],31][../../../../../ompi/mca/btl/openib/btl_openib_component.c:562:start_async_event_thread] Failed to create async event thread
>>> --------------------------------------------------------------------------
>>> At least one pair of MPI processes are unable to reach each other for
>>> MPI communications. This means that no Open MPI device has indicated
>>> that it can be used to communicate between these processes. This is
>>> an error; Open MPI requires that all MPI processes be able to reach
>>> each other. This error can sometimes be the result of forgetting to
>>> specify the "self" BTL.
>>>
>>> Process 1 ([[8097,1],4]) is on host: node15.cluster
>>> Process 2 ([[8097,1],16]) is on host: node14
>>> BTLs attempted: self sm
>>>
>>> Your MPI job is now going to abort; sorry.
>>> --------------------------------------------------------------------------
>>>
>>> *************************************************
>>>
>>> On 08/12/2013 03:32 PM, Gus Correa wrote:
>>>> Thank you for the prompt help, Ralph!
>>>>
>>>> Yes, it is OMPI 1.4.3 built with openib support:
>>>>
>>>> $ ompi_info | grep openib
>>>> MCA btl: openib (MCA v2.0, API v2.0, Component v1.4.3)
>>>>
>>>> There are three files in prefix/lib/openmpi,
>>>> and no mca_btl_openib library.
>>>>
>>>> $ ls $PREFIX/lib/openmpi/
>>>> libompi_dbg_msgq.a libompi_dbg_msgq.la libompi_dbg_msgq.so
>>>>
>>>>
>>>> However, this may just be because it is an older OMPI version
>>>> in the 1.4 series: those are exactly the files I have on another
>>>> cluster with IB and OMPI 1.4.3, where there isn't a problem.
>>>> The library organization may have changed from
>>>> the 1.4 to the 1.6 series, right?
>>>> I only see mca_btl_openib libraries in the 1.6 series, and it
>>>> would be a hardship to migrate this program to OMPI 1.6.
>>>>
>>>> (OK, I have a newer OMPI too, but I also need the old one
>>>> for some programs.)
>>>>
>>>> Why the heck is it not detecting the Infiniband hardware?
>>>> [It used to detect it! :( ]
>>>>
>>>> Thank you,
>>>> Gus Correa
>>>>
>>>>
>>>> On 08/12/2013 03:01 PM, Ralph Castain wrote:
>>>>> Check ompi_info - was it built with openib support?
>>>>>
>>>>> Then check that the mca_btl_openib library is present in the
>>>>> prefix/lib/openmpi directory
>>>>>
>>>>> Sounds like it isn't finding the openib plugin
>>>>>
>>>>>
>>>>> On Aug 12, 2013, at 11:57 AM, Gus Correa<gus_at_[hidden]> wrote:
>>>>>
>>>>>> Dear Open MPI pros
>>>>>>
>>>>>> On one of the clusters here, which has Infiniband,
>>>>>> I am getting this type of error from
>>>>>> Open MPI 1.4.3 (OK, I know it is old ...):
>>>>>>
>>>>>> *********************************************************
>>>>>> Tcl_InitNotifier: unable to start notifier thread
>>>>>> Abort: Command not found.
>>>>>> Tcl_InitNotifier: unable to start notifier thread
>>>>>> Abort: Command not found.
>>>>>> --------------------------------------------------------------------------
>>>>>>
>>>>>> At least one pair of MPI processes are unable to reach each other for
>>>>>> MPI communications. This means that no Open MPI device has indicated
>>>>>> that it can be used to communicate between these processes. This is
>>>>>> an error; Open MPI requires that all MPI processes be able to reach
>>>>>> each other. This error can sometimes be the result of forgetting to
>>>>>> specify the "self" BTL.
>>>>>>
>>>>>> Process 1 ([[907,1],68]) is on host: node11.cluster
>>>>>> Process 2 ([[907,1],0]) is on host: node15
>>>>>> BTLs attempted: self sm
>>>>>>
>>>>>> Your MPI job is now going to abort; sorry.
>>>>>> --------------------------------------------------------------------------
>>>>>>
>>>>>> *********************************************************
>>>>>>
>>>>>> Awkward, because I have "btl = ^tcp" in openmpi-mca-params.conf.
>>>>>> The same error also happens if I force --mca btl openib,sm,self
>>>>>> in mpiexec.
>>>>>>
>>>>>> ** Why is it attempting only the self and sm BTLs, but not openib? **
>>>>>>
>>>>>> I don't understand the initial errors
>>>>>> "Tcl_InitNotifier: unable to start notifier thread" either.
>>>>>> Are they coming from Torque perhaps?
>>>>>>
>>>>>> As I said, the cluster has Infiniband,
>>>>>> which is what we've been using forever, until
>>>>>> these errors started today.
>>>>>>
>>>>>> When I divert the traffic to tcp
>>>>>> (--mca btl tcp,sm,self), the jobs run normally.
>>>>>>
>>>>>> I am using the examples/connectivity_c.c program
>>>>>> to troubleshoot this problem.
>>>>>>
>>>>>> ***
>>>>>> I checked a few things on the IB side.
>>>>>>
>>>>>> The output of ibstat on all nodes seems OK (links up, etc.),
>>>>>> and so does the output of ibhosts and ibchecknet.
>>>>>>
>>>>>> Only two connected ports had errors, as reported by ibcheckerrors,
>>>>>> and I cleared them with ibclearerrors.
>>>>>>
>>>>>> The IB subnet manager is running on the head node.
>>>>>> I restarted the daemon, but nothing changed; the jobs continue to
>>>>>> fail with the same errors.
>>>>>>
>>>>>> **
>>>>>>
>>>>>> Any hints of what is going on, how to diagnose it, and how to fix it?
>>>>>> Any gentler way than rebooting everything and power-cycling
>>>>>> the IB switch? (And would this brute-force method work, at least?)
>>>>>>
>>>>>> Thank you,
>>>>>> Gus Correa