
Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] I still have a problem with rankfiles in openmpi-1.6.4rc3
From: Siegmar Gross (Siegmar.Gross_at_[hidden])
Date: 2013-02-06 07:46:00


Hi

> We've been talking about this offline. Can you send us an lstopo
> output from your Solaris machine? Send us the text output and
> the xml output, e.g.:
>
> lstopo > solaris.txt
> lstopo solaris.xml

I have installed hwloc-1.3.2 and hwloc-1.6.1 and get the following
output (the text output is identical for both versions, but the xml
files differ).

sunpc1 bin 121 lstopo --version
lstopo 1.3.2
sunpc1 bin 122 lstopo
Machine (8191MB)
  NUMANode L#0 (P#1 4095MB) + Socket L#0
    Core L#0 + PU L#0 (P#0)
    Core L#1 + PU L#1 (P#1)
  NUMANode L#1 (P#2 4096MB) + Socket L#1
    Core L#2 + PU L#2 (P#2)
    Core L#3 + PU L#3 (P#3)

sunpc1 bin 123 cd ../../hwloc-1.6.1/bin/
sunpc1 bin 124 lstopo --version
lstopo 1.6.1
sunpc1 bin 125 lstopo
Machine (8191MB)
  NUMANode L#0 (P#1 4095MB) + Socket L#0
    Core L#0 + PU L#0 (P#0)
    Core L#1 + PU L#1 (P#1)
  NUMANode L#1 (P#2 4096MB) + Socket L#1
    Core L#2 + PU L#2 (P#2)
    Core L#3 + PU L#3 (P#3)
sunpc1 bin 126
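As a cross-check, the hexadecimal cpuset values in the attached xml files
can be reproduced from the PU physical indices shown above. A minimal
sketch in Python (cpuset_mask is a hypothetical helper, not part of hwloc):

```python
# From the lstopo output above: socket 0 holds PUs P#0-P#1,
# socket 1 holds PUs P#2-P#3 (one bit per physical PU index).

def cpuset_mask(pu_indices):
    """Build a cpuset bitmask from a list of physical PU indices."""
    mask = 0
    for pu in pu_indices:
        mask |= 1 << pu
    return mask

sockets = {0: [0, 1], 1: [2, 3]}   # socket -> physical PU indices

for sock, pus in sockets.items():
    print("Socket %d cpuset: 0x%08x" % (sock, cpuset_mask(pus)))
# Socket 0 cpuset: 0x00000003
# Socket 1 cpuset: 0x0000000c
```

These match the cpuset attributes of the two Socket objects in the xml
attachments below.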

I have attached the requested files.

sunpc1 bin 144 lstopo --version
lstopo 1.3.2
sunpc1 bin 145 lstopo > /tmp/sunpc1-hwloc-1.3.2.txt
sunpc1 bin 146 lstopo --of xml > /tmp/sunpc1-hwloc-1.3.2.xml
sunpc1 bin 147 cd ../../hwloc-1.6.1/bin/
sunpc1 bin 148 lstopo --version
lstopo 1.6.1
sunpc1 bin 149 lstopo > /tmp/sunpc1-hwloc-1.6.1.txt
sunpc1 bin 150 lstopo --of xml > /tmp/sunpc1-hwloc-1.6.1.xml
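For reference, the slot list syntax used in the rankfiles quoted below
can be expanded into (socket, core) pairs. A rough Python sketch
(parse_slot_list is a hypothetical helper, not Open MPI's actual parser):

```python
def parse_slot_list(spec):
    """Expand a rankfile slot list such as '0:0-1,1:0-1'
    into a list of (socket, core) pairs."""
    pairs = []
    for part in spec.split(","):
        sock, cores = part.split(":")
        if "-" in cores:
            lo, hi = cores.split("-")
            core_range = range(int(lo), int(hi) + 1)
        else:
            core_range = [int(cores)]
        for core in core_range:
            pairs.append((int(sock), core))
    return pairs

# rank 2's slot list: only core 0 of socket 1 should be bound
print(parse_slot_list("1:0"))          # [(1, 0)]
# rank 0's slot list: all four cores across both sockets
print(parse_slot_list("0:0-1,1:0-1"))  # [(0, 0), (0, 1), (1, 0), (1, 1)]
```

So for rank 2 I would expect a binding report of [. .][B .], not
[B B][B B] as shown below.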

Thank you very much in advance for your help.

Kind regards

Siegmar

> On Feb 5, 2013, at 12:30 AM, Siegmar Gross
<Siegmar.Gross_at_[hidden]> wrote:
>
> > Hi
> >
> > now I can use all our machines once more. I have a problem on
> > Solaris 10 x86_64, because the mapping of processes doesn't
> > correspond to the rankfile. I removed the output from "hostfile"
> > and wrapped around long lines.
> >
> > tyr rankfiles 114 cat rf_ex_sunpc
> > # mpiexec -report-bindings -rf rf_ex_sunpc hostname
> >
> > rank 0=sunpc0 slot=0:0-1,1:0-1
> > rank 1=sunpc1 slot=0:0-1
> > rank 2=sunpc1 slot=1:0
> > rank 3=sunpc1 slot=1:1
> >
> >
> > tyr rankfiles 115 mpiexec -report-bindings -rf rf_ex_sunpc hostname
> > [sunpc0:17920] MCW rank 0 bound to socket 0[core 0-1]
> > socket 1[core 0-1]: [B B][B B] (slot list 0:0-1,1:0-1)
> > [sunpc1:11265] MCW rank 1 bound to socket 0[core 0-1]:
> > [B B][. .] (slot list 0:0-1)
> > [sunpc1:11265] MCW rank 2 bound to socket 0[core 0-1]
> > socket 1[core 0-1]: [B B][B B] (slot list 1:0)
> > [sunpc1:11265] MCW rank 3 bound to socket 0[core 0-1]
> > socket 1[core 0-1]: [B B][B B] (slot list 1:1)
> >
> >
> > Can I provide any information to help solve this problem? My
> > rankfile works as expected if I use only Linux machines.
> >
> >
> > Kind regards
> >
> > Siegmar
> >
> >
> >
> >>> Hmmm....well, it certainly works for me:
> >>>
> >>> [rhc_at_odin ~/v1.6]$ cat rf
> >>> rank 0=odin093 slot=0:0-1,1:0-1
> >>> rank 1=odin094 slot=0:0-1
> >>> rank 2=odin094 slot=1:0
> >>> rank 3=odin094 slot=1:1
> >>>
> >>>
> >>> [rhc_at_odin ~/v1.6]$ mpirun -n 4 -rf ./rf --report-bindings
> >>> -mca opal_paffinity_alone 0 hostname
> >>> [odin093.cs.indiana.edu:04617] MCW rank 0 bound to
> >>> socket 0[core 0-1] socket 1[core 0-1]: [B B][B B] (slot list
> > 0:0-1,1:0-1)
> >>> odin093.cs.indiana.edu
> >>> odin094.cs.indiana.edu
> >>> [odin094.cs.indiana.edu:04426] MCW rank 1 bound to
> >>> socket 0[core 0-1]: [B B][. .] (slot list 0:0-1)
> >>> odin094.cs.indiana.edu
> >>> [odin094.cs.indiana.edu:04426] MCW rank 2 bound to
> >>> socket 1[core 0]: [. .][B .] (slot list 1:0)
> >>> [odin094.cs.indiana.edu:04426] MCW rank 3 bound to
> >>> socket 1[core 1]: [. .][. B] (slot list 1:1)
> >>> odin094.cs.indiana.edu
> >>
> >> Interesting that it works on your machines.
> >>
> >>
> >>> I see one thing of concern to me in your output - your second node
> >>> appears to be a Sun computer. Is it the same physical architecture?
> >>> Is it also running Linux? Are you sure it is using the same version
> >>> of OMPI, built for that environment and hardware?
> >>
> >> Both machines (in fact all four machines: sunpc0, sunpc1, linpc0, and
> >> linpc1) use the same hardware. "linpc" uses openSUSE 12.1 and "sunpc"
> >> Solaris 10 x86_64. All machines use the same version of Open MPI,
> >> built for that environment. At the moment I can only use sunpc1 and
> >> linpc1 ("my" developer machines). Next week I will have access to all
> >> machines so that I can test whether I get a different behaviour
> >> when I use two machines with the same operating system (although
> >> mixed operating systems weren't a problem in the past, only
> >> machines with different endians). I will let you know my results.
> >>
> >>
> >> Kind regards
> >>
> >> Siegmar
> >>
> >>
> >>
> >>
> >>> On Jan 30, 2013, at 2:08 AM, Siegmar Gross
> >> <Siegmar.Gross_at_[hidden]> wrote:
> >>>
> >>>> Hi
> >>>>
> >>>> I applied your patch "rmaps.diff" to openmpi-1.6.4rc3r27923 and
> >>>> it works for my previous rankfile.
> >>>>
> >>>>
> >>>>> #3493: Handle the case where rankfile provides the allocation
> >>>>> -----------------------------------+----------------------------
> >>>>> Reporter: rhc | Owner: jsquyres
> >>>>> Type: changeset move request | Status: new
> >>>>> Priority: critical | Milestone: Open MPI 1.6.4
> >>>>> Version: trunk | Keywords:
> >>>>> -----------------------------------+----------------------------
> >>>>> Please apply the attached patch that corrects the rmaps function for
> >>>>> obtaining the available nodes when rankfile is providing the
> > allocation.
> >>>>
> >>>>
> >>>> tyr rankfiles 129 more rf_linpc1
> >>>> # mpiexec -report-bindings -rf rf_linpc1 hostname
> >>>> rank 0=linpc1 slot=0:0-1,1:0-1
> >>>>
> >>>> tyr rankfiles 130 mpiexec -report-bindings -rf rf_linpc1 hostname
> >>>> [linpc1:31603] MCW rank 0 bound to socket 0[core 0-1]
> >>>> socket 1[core 0-1]: [B B][B B] (slot list 0:0-1,1:0-1)
> >>>>
> >>>>
> >>>>
> >>>> Unfortunately I don't get the expected result for the following
> >>>> rankfile.
> >>>>
> >>>> tyr rankfiles 114 more rf_bsp
> >>>> # mpiexec -report-bindings -rf rf_bsp hostname
> >>>> rank 0=linpc1 slot=0:0-1,1:0-1
> >>>> rank 1=sunpc1 slot=0:0-1
> >>>> rank 2=sunpc1 slot=1:0
> >>>> rank 3=sunpc1 slot=1:1
> >>>>
> >>>> I would expect that rank 0 gets all four cores from linpc1, rank 1
> >>>> both cores of socket 0 from sunpc1, rank 2 core 0 of socket 1, and
> >>>> rank 3 core 1 of socket 1 from sunpc1. Everything is fine for my
> >>>> processes with rank 0 and 1, but it's wrong for ranks 2 and 3,
> >>>> because they both get all four cores of sunpc1. Is something wrong
> >>>> with my rankfile or with your mapping of processes to cores? I have
> >>>> removed the output from "hostname" and wrapped long lines.
> >>>>
> >>>> tyr rankfiles 115 mpiexec -report-bindings -rf rf_bsp hostname
> >>>> [linpc1:31092] MCW rank 0 bound to socket 0[core 0-1] socket 1[core
> > 0-1]:
> >>>> [B B][B B] (slot list 0:0-1,1:0-1)
> >>>> [sunpc1:12916] MCW rank 1 bound to socket 0[core 0-1]:
> >>>> [B B][. .] (slot list 0:0-1)
> >>>> [sunpc1:12916] MCW rank 2 bound to socket 0[core 0-1] socket 1[core
> > 0-1]:
> >>>> [B B][B B] (slot list 1:0)
> >>>> [sunpc1:12916] MCW rank 3 bound to socket 0[core 0-1] socket 1[core
> > 0-1]:
> >>>> [B B][B B] (slot list 1:1)
> >>>>
> >>>>
> >>>> I get the following output, if I add the options which you mentioned
> >>>> in a previous email.
> >>>>
> >>>> tyr rankfiles 124 mpiexec -report-bindings -rf rf_bsp \
> >>>> -display-allocation -mca ras_base_verbose 5 hostname
> >>>> [tyr.informatik.hs-fulda.de:19401] mca:base:select:( ras)
> >>>> Querying component [cm]
> >>>> [tyr.informatik.hs-fulda.de:19401] mca:base:select:( ras)
> >>>> Skipping component [cm]. Query failed to return a module
> >>>> [tyr.informatik.hs-fulda.de:19401] mca:base:select:( ras)
> >>>> No component selected!
> >>>> [tyr.informatik.hs-fulda.de:19401] [[27101,0],0] ras:base:allocate
> >>>> [tyr.informatik.hs-fulda.de:19401] [[27101,0],0] ras:base:allocate
> >>>> nothing found in module - proceeding to hostfile
> >>>> [tyr.informatik.hs-fulda.de:19401] [[27101,0],0] ras:base:allocate
> >>>> parsing default hostfile
> >>>> /usr/local/openmpi-1.6.4_64_cc/etc/openmpi-default-hostfile
> >>>> [tyr.informatik.hs-fulda.de:19401] [[27101,0],0] ras:base:allocate
> >>>> nothing found in hostfiles or dash-host - checking for rankfile
> >>>> [tyr.informatik.hs-fulda.de:19401] [[27101,0],0]
> >>>> ras:base:node_insert inserting 2 nodes
> >>>> [tyr.informatik.hs-fulda.de:19401] [[27101,0],0]
> >>>> ras:base:node_insert node linpc1
> >>>> [tyr.informatik.hs-fulda.de:19401] [[27101,0],0]
> >>>> ras:base:node_insert node sunpc1
> >>>>
> >>>> ====================== ALLOCATED NODES ======================
> >>>>
> >>>> Data for node: tyr.informatik.hs-fulda.de Num slots: 0 Max slots: 0
> >>>> Data for node: linpc1 Num slots: 1 Max slots: 0
> >>>> Data for node: sunpc1 Num slots: 3 Max slots: 0
> >>>>
> >>>> =================================================================
> >>>> [linpc1:31532] MCW rank 0 bound to socket 0[core 0-1] socket 1[core
> > 0-1]:
> >>>> [B B][B B] (slot list 0:0-1,1:0-1)
> >>>> [sunpc1:13136] MCW rank 1 bound to socket 0[core 0-1]:
> >>>> [B B][. .] (slot list 0:0-1)
> >>>> [sunpc1:13136] MCW rank 2 bound to socket 0[core 0-1] socket 1[core
> > 0-1]:
> >>>> [B B][B B] (slot list 1:0)
> >>>> [sunpc1:13136] MCW rank 3 bound to socket 0[core 0-1] socket 1[core
> > 0-1]:
> >>>> [B B][B B] (slot list 1:1)
> >>>>
> >>>>
> >>>> Thank you very much for any suggestions and any help in advance.
> >>>>
> >>>>
> >>>> Kind regards
> >>>>
> >>>> Siegmar
> >>>>
> >>>> _______________________________________________
> >>>> users mailing list
> >>>> users_at_[hidden]
> >>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
> >>>
> >>>
> >>
> >
>
>
> --
> Jeff Squyres
> jsquyres_at_[hidden]
> For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/
>
>

Machine (8191MB)
  NUMANode L#0 (P#1 4095MB) + Socket L#0
    Core L#0 + PU L#0 (P#0)
    Core L#1 + PU L#1 (P#1)
  NUMANode L#1 (P#2 4096MB) + Socket L#1
    Core L#2 + PU L#2 (P#2)
    Core L#3 + PU L#3 (P#3)


<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE topology SYSTEM "hwloc.dtd">
<topology>
  <object type="Machine" os_level="-1" os_index="0" cpuset="0x0000000f" complete_cpuset="0x0000000f" online_cpuset="0x0000000f" allowed_cpuset="0x0000000f" nodeset="0x00000006" complete_nodeset="0x00000006" allowed_nodeset="0x00000006">
    <info name="OSName" value="SunOS"/>
    <info name="OSRelease" value="5.10"/>
    <info name="OSVersion" value="Generic_147441-21"/>
    <info name="HostName" value="sunpc1"/>
    <info name="Architecture" value="i86pc"/>
    <object type="NUMANode" os_level="-1" os_index="1" cpuset="0x00000003" complete_cpuset="0x00000003" online_cpuset="0x00000003" allowed_cpuset="0x00000003" nodeset="0x00000002" complete_nodeset="0x00000002" allowed_nodeset="0x00000002" local_memory="4293435392">
      <page_type size="4096" count="0"/>
      <object type="Socket" os_level="-1" os_index="0" cpuset="0x00000003" complete_cpuset="0x00000003" online_cpuset="0x00000003" allowed_cpuset="0x00000003" nodeset="0x00000002" complete_nodeset="0x00000002" allowed_nodeset="0x00000002">
        <object type="Core" os_level="-1" os_index="0" cpuset="0x00000001" complete_cpuset="0x00000001" online_cpuset="0x00000001" allowed_cpuset="0x00000001" nodeset="0x00000002" complete_nodeset="0x00000002" allowed_nodeset="0x00000002">
          <object type="PU" os_level="-1" os_index="0" cpuset="0x00000001" complete_cpuset="0x00000001" online_cpuset="0x00000001" allowed_cpuset="0x00000001" nodeset="0x00000002" complete_nodeset="0x00000002" allowed_nodeset="0x00000002"/>
        </object>
        <object type="Core" os_level="-1" os_index="1" cpuset="0x00000002" complete_cpuset="0x00000002" online_cpuset="0x00000002" allowed_cpuset="0x00000002" nodeset="0x00000002" complete_nodeset="0x00000002" allowed_nodeset="0x00000002">
          <object type="PU" os_level="-1" os_index="1" cpuset="0x00000002" complete_cpuset="0x00000002" online_cpuset="0x00000002" allowed_cpuset="0x00000002" nodeset="0x00000002" complete_nodeset="0x00000002" allowed_nodeset="0x00000002"/>
        </object>
      </object>
    </object>
    <object type="NUMANode" os_level="-1" os_index="2" cpuset="0x0000000c" complete_cpuset="0x0000000c" online_cpuset="0x0000000c" allowed_cpuset="0x0000000c" nodeset="0x00000004" complete_nodeset="0x00000004" allowed_nodeset="0x00000004" local_memory="4294967296">
      <page_type size="4096" count="0"/>
      <object type="Socket" os_level="-1" os_index="1" cpuset="0x0000000c" complete_cpuset="0x0000000c" online_cpuset="0x0000000c" allowed_cpuset="0x0000000c" nodeset="0x00000004" complete_nodeset="0x00000004" allowed_nodeset="0x00000004">
        <object type="Core" os_level="-1" os_index="2" cpuset="0x00000004" complete_cpuset="0x00000004" online_cpuset="0x00000004" allowed_cpuset="0x00000004" nodeset="0x00000004" complete_nodeset="0x00000004" allowed_nodeset="0x00000004">
          <object type="PU" os_level="-1" os_index="2" cpuset="0x00000004" complete_cpuset="0x00000004" online_cpuset="0x00000004" allowed_cpuset="0x00000004" nodeset="0x00000004" complete_nodeset="0x00000004" allowed_nodeset="0x00000004"/>
        </object>
        <object type="Core" os_level="-1" os_index="3" cpuset="0x00000008" complete_cpuset="0x00000008" online_cpuset="0x00000008" allowed_cpuset="0x00000008" nodeset="0x00000004" complete_nodeset="0x00000004" allowed_nodeset="0x00000004">
          <object type="PU" os_level="-1" os_index="3" cpuset="0x00000008" complete_cpuset="0x00000008" online_cpuset="0x00000008" allowed_cpuset="0x00000008" nodeset="0x00000004" complete_nodeset="0x00000004" allowed_nodeset="0x00000004"/>
        </object>
      </object>
    </object>
  </object>
</topology>

Machine (8191MB)
  NUMANode L#0 (P#1 4095MB) + Socket L#0
    Core L#0 + PU L#0 (P#0)
    Core L#1 + PU L#1 (P#1)
  NUMANode L#1 (P#2 4096MB) + Socket L#1
    Core L#2 + PU L#2 (P#2)
    Core L#3 + PU L#3 (P#3)


<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE topology SYSTEM "hwloc.dtd">
<topology>
  <object type="Machine" os_index="0" cpuset="0x0000000f" complete_cpuset="0x0000000f" online_cpuset="0x0000000f" allowed_cpuset="0x0000000f" nodeset="0x00000006" complete_nodeset="0x00000006" allowed_nodeset="0x00000006">
    <info name="Backend" value="Solaris"/>
    <info name="OSName" value="SunOS"/>
    <info name="OSRelease" value="5.10"/>
    <info name="OSVersion" value="Generic_147441-21"/>
    <info name="HostName" value="sunpc1"/>
    <info name="Architecture" value="i86pc"/>
    <object type="NUMANode" os_index="1" cpuset="0x00000003" complete_cpuset="0x00000003" online_cpuset="0x00000003" allowed_cpuset="0x00000003" nodeset="0x00000002" complete_nodeset="0x00000002" allowed_nodeset="0x00000002" local_memory="4293435392">
      <page_type size="4096" count="0"/>
      <object type="Socket" os_index="0" cpuset="0x00000003" complete_cpuset="0x00000003" online_cpuset="0x00000003" allowed_cpuset="0x00000003" nodeset="0x00000002" complete_nodeset="0x00000002" allowed_nodeset="0x00000002">
        <info name="CPUType" value=""/>
        <info name="CPUModel" value="Dual Core AMD Opteron(tm) Processor 280"/>
        <object type="Core" os_index="0" cpuset="0x00000001" complete_cpuset="0x00000001" online_cpuset="0x00000001" allowed_cpuset="0x00000001" nodeset="0x00000002" complete_nodeset="0x00000002" allowed_nodeset="0x00000002">
          <object type="PU" os_index="0" cpuset="0x00000001" complete_cpuset="0x00000001" online_cpuset="0x00000001" allowed_cpuset="0x00000001" nodeset="0x00000002" complete_nodeset="0x00000002" allowed_nodeset="0x00000002"/>
        </object>
        <object type="Core" os_index="1" cpuset="0x00000002" complete_cpuset="0x00000002" online_cpuset="0x00000002" allowed_cpuset="0x00000002" nodeset="0x00000002" complete_nodeset="0x00000002" allowed_nodeset="0x00000002">
          <object type="PU" os_index="1" cpuset="0x00000002" complete_cpuset="0x00000002" online_cpuset="0x00000002" allowed_cpuset="0x00000002" nodeset="0x00000002" complete_nodeset="0x00000002" allowed_nodeset="0x00000002"/>
        </object>
      </object>
    </object>
    <object type="NUMANode" os_index="2" cpuset="0x0000000c" complete_cpuset="0x0000000c" online_cpuset="0x0000000c" allowed_cpuset="0x0000000c" nodeset="0x00000004" complete_nodeset="0x00000004" allowed_nodeset="0x00000004" local_memory="4294967296">
      <page_type size="4096" count="0"/>
      <object type="Socket" os_index="1" cpuset="0x0000000c" complete_cpuset="0x0000000c" online_cpuset="0x0000000c" allowed_cpuset="0x0000000c" nodeset="0x00000004" complete_nodeset="0x00000004" allowed_nodeset="0x00000004">
        <info name="CPUType" value=""/>
        <info name="CPUModel" value="Dual Core AMD Opteron(tm) Processor 280"/>
        <object type="Core" os_index="2" cpuset="0x00000004" complete_cpuset="0x00000004" online_cpuset="0x00000004" allowed_cpuset="0x00000004" nodeset="0x00000004" complete_nodeset="0x00000004" allowed_nodeset="0x00000004">
          <object type="PU" os_index="2" cpuset="0x00000004" complete_cpuset="0x00000004" online_cpuset="0x00000004" allowed_cpuset="0x00000004" nodeset="0x00000004" complete_nodeset="0x00000004" allowed_nodeset="0x00000004"/>
        </object>
        <object type="Core" os_index="3" cpuset="0x00000008" complete_cpuset="0x00000008" online_cpuset="0x00000008" allowed_cpuset="0x00000008" nodeset="0x00000004" complete_nodeset="0x00000004" allowed_nodeset="0x00000004">
          <object type="PU" os_index="3" cpuset="0x00000008" complete_cpuset="0x00000008" online_cpuset="0x00000008" allowed_cpuset="0x00000008" nodeset="0x00000004" complete_nodeset="0x00000004" allowed_nodeset="0x00000004"/>
        </object>
      </object>
    </object>
  </object>
</topology>