
Open MPI Development Mailing List Archives


Subject: Re: [OMPI devel] 1.7.4rc2r30148 run failure NetBSD6-x86
From: Ralph Castain (rhc_at_[hidden])
Date: 2014-01-09 11:40:45


Should now be fixed in trunk (silently fall back to not binding if cores are not found) - scheduled for 1.7.4. If you could test the next trunk tarball, that would help, as I can't actually test it on my machines.

On Jan 9, 2014, at 6:25 AM, Ralph Castain <rhc_at_[hidden]> wrote:

> I see the issue - there are no "cores" on this topology, only "pu's", so "bind-to core" is going to fail even though binding is supported. Will adjust.
>
> Thanks!
>
> On Jan 8, 2014, at 9:06 PM, Paul Hargrove <phhargrove_at_[hidden]> wrote:
>
>> Requested verbose output below.
>> -Paul
>>
>> -bash-4.2$ mpirun -mca ess_base_verbose 10 -np 1 examples/ring_c
>> [pcp-j-17:02150] mca: base: components_register: registering ess components
>> [pcp-j-17:02150] mca: base: components_register: found loaded component env
>> [pcp-j-17:02150] mca: base: components_register: component env has no register or open function
>> [pcp-j-17:02150] mca: base: components_register: found loaded component hnp
>> [pcp-j-17:02150] mca: base: components_register: component hnp has no register or open function
>> [pcp-j-17:02150] mca: base: components_register: found loaded component singleton
>> [pcp-j-17:02150] mca: base: components_register: component singleton register function successful
>> [pcp-j-17:02150] mca: base: components_register: found loaded component tool
>> [pcp-j-17:02150] mca: base: components_register: component tool has no register or open function
>> [pcp-j-17:02150] mca: base: components_open: opening ess components
>> [pcp-j-17:02150] mca: base: components_open: found loaded component env
>> [pcp-j-17:02150] mca: base: components_open: component env open function successful
>> [pcp-j-17:02150] mca: base: components_open: found loaded component hnp
>> [pcp-j-17:02150] mca: base: components_open: component hnp open function successful
>> [pcp-j-17:02150] mca: base: components_open: found loaded component singleton
>> [pcp-j-17:02150] mca: base: components_open: component singleton open function successful
>> [pcp-j-17:02150] mca: base: components_open: found loaded component tool
>> [pcp-j-17:02150] mca: base: components_open: component tool open function successful
>> [pcp-j-17:02150] mca:base:select: Auto-selecting ess components
>> [pcp-j-17:02150] mca:base:select:( ess) Querying component [env]
>> [pcp-j-17:02150] mca:base:select:( ess) Skipping component [env]. Query failed to return a module
>> [pcp-j-17:02150] mca:base:select:( ess) Querying component [hnp]
>> [pcp-j-17:02150] mca:base:select:( ess) Query of component [hnp] set priority to 100
>> [pcp-j-17:02150] mca:base:select:( ess) Querying component [singleton]
>> [pcp-j-17:02150] mca:base:select:( ess) Skipping component [singleton]. Query failed to return a module
>> [pcp-j-17:02150] mca:base:select:( ess) Querying component [tool]
>> [pcp-j-17:02150] mca:base:select:( ess) Skipping component [tool]. Query failed to return a module
>> [pcp-j-17:02150] mca:base:select:( ess) Selected component [hnp]
>> [pcp-j-17:02150] mca: base: close: component env closed
>> [pcp-j-17:02150] mca: base: close: unloading component env
>> [pcp-j-17:02150] mca: base: close: component singleton closed
>> [pcp-j-17:02150] mca: base: close: unloading component singleton
>> [pcp-j-17:02150] mca: base: close: component tool closed
>> [pcp-j-17:02150] mca: base: close: unloading component tool
>> [pcp-j-17:02150] [[INVALID],INVALID] Topology Info:
>> [pcp-j-17:02150] Type: Machine Number of child objects: 2
>> Name=NULL
>> Backend=NetBSD
>> OSName=NetBSD
>> OSRelease=6.1
>> OSVersion="NetBSD 6.1 (CUSTOM) #0: Fri Sep 20 13:19:58 PDT 2013 phargrov_at_pcp-j-17:/home/phargrov/CUSTOM"
>> Architecture=i386
>> Backend=x86
>> Cpuset: 0x00000003
>> Online: 0x00000003
>> Allowed: 0x00000003
>> Bind CPU proc: TRUE
>> Bind CPU thread: TRUE
>> Bind MEM proc: FALSE
>> Bind MEM thread: FALSE
>> Type: PU Number of child objects: 0
>> Name=NULL
>> Cpuset: 0x00000001
>> Online: 0x00000001
>> Allowed: 0x00000001
>> Type: PU Number of child objects: 0
>> Name=NULL
>> Cpuset: 0x00000002
>> Online: 0x00000002
>> Allowed: 0x00000002
>> --------------------------------------------------------------------------
>> While computing bindings, we found no available cpus on
>> the following node:
>>
>> Node: pcp-j-17
>>
>> Please check your allocation.
>> --------------------------------------------------------------------------
>> [pcp-j-17:02150] mca: base: close: component hnp closed
>> [pcp-j-17:02150] mca: base: close: unloading component hnp
>>
>>
>>
>> On Wed, Jan 8, 2014 at 8:50 PM, Ralph Castain <rhc_at_[hidden]> wrote:
>> Hmmm...looks to me like the code should protect against this - unless the system isn't correctly reporting binding support. Could you run this with "-mca ess_base_verbose 10"? This will output the topology we found, including the binding support (which isn't in the usual output).
>>
>> On Jan 8, 2014, at 8:14 PM, Ralph Castain <rhc_at_[hidden]> wrote:
>>
>>> Hmmm...I see the problem. Looks like binding isn't supported on that system for some reason, so we need to turn "off" our auto-binding when we hit that condition. I'll check to see why that isn't happening (it was supposed to).
>>>
>>>
>>> On Jan 8, 2014, at 3:43 PM, Paul Hargrove <phhargrove_at_[hidden]> wrote:
>>>
>>>> While I have yet to get a working build on NetBSD for x86-64 h/w, I *have* successfully built Open MPI's current 1.7.4rc tarball on NetBSD-6 for x86. However, I can't *run* anything:
>>>>
>>>> Attempting the ring_c example on 2 cores:
>>>> -bash-4.2$ mpirun -mca btl sm,self -np 2 examples/ring_c
>>>> --------------------------------------------------------------------------
>>>> While computing bindings, we found no available cpus on
>>>> the following node:
>>>>
>>>> Node: pcp-j-17
>>>>
>>>> Please check your allocation.
>>>> --------------------------------------------------------------------------
>>>>
>>>> The failure is the same without "-mca btl sm,self".
>>>> Singleton runs fail just as the np=2 run did.
>>>>
>>>> I've attached compressed output from "ompi_info --all".
>>>>
>>>> Since this is probably an hwloc-related issue, I also built hwloc-1.7.2 from pristine sources.
>>>> I have attached compressed output of lstopo, which NOTABLY indicates a failure to bind to both of the CPUs.
>>>>
>>>> For now, an explicit "--bind-to none" is working for me.
>>>> Please let me know what additional info may be required.
>>>>
>>>> -Paul
>>>>
>>>> --
>>>> Paul H. Hargrove PHHargrove_at_[hidden]
>>>> Future Technologies Group
>>>> Computer and Data Sciences Department Tel: +1-510-495-2352
>>>> Lawrence Berkeley National Laboratory Fax: +1-510-486-6900
>>>> <ompi_info-netbsd-x86.txt.bz2><lstopo172-netbsd-x86.txt.bz2>
>>>> _______________________________________________
>>>> devel mailing list
>>>> devel_at_[hidden]
>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>
>>
>>
>>
>>
>>
>> --
>> Paul H. Hargrove PHHargrove_at_[hidden]
>> Future Technologies Group
>> Computer and Data Sciences Department Tel: +1-510-495-2352
>> Lawrence Berkeley National Laboratory Fax: +1-510-486-6900
>