Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] problem with rankfile
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2012-09-10 16:01:19


We actually include hwloc v1.3.2 in the OMPI v1.6 series.

Can you download and try that on your machines?

      http://www.open-mpi.org/software/hwloc/v1.3/

In particular, try the hwloc-bind executable (outside of OMPI) and see if binding works properly on your machines. I typically run a test script when I'm testing binding:

------
[12:59] svbu-mpi059:~/mpi % lstopo --no-io
Machine (64GB)
  NUMANode L#0 (P#0 32GB) + Socket L#0 + L3 L#0 (20MB)
    L2 L#0 (256KB) + L1 L#0 (32KB) + Core L#0
      PU L#0 (P#0)
      PU L#1 (P#16)
    L2 L#1 (256KB) + L1 L#1 (32KB) + Core L#1
      PU L#2 (P#1)
      PU L#3 (P#17)
    L2 L#2 (256KB) + L1 L#2 (32KB) + Core L#2
      PU L#4 (P#2)
      PU L#5 (P#18)
    L2 L#3 (256KB) + L1 L#3 (32KB) + Core L#3
      PU L#6 (P#3)
      PU L#7 (P#19)
    L2 L#4 (256KB) + L1 L#4 (32KB) + Core L#4
      PU L#8 (P#4)
      PU L#9 (P#20)
    L2 L#5 (256KB) + L1 L#5 (32KB) + Core L#5
      PU L#10 (P#5)
      PU L#11 (P#21)
    L2 L#6 (256KB) + L1 L#6 (32KB) + Core L#6
      PU L#12 (P#6)
      PU L#13 (P#22)
    L2 L#7 (256KB) + L1 L#7 (32KB) + Core L#7
      PU L#14 (P#7)
      PU L#15 (P#23)
  NUMANode L#1 (P#1 32GB) + Socket L#1 + L3 L#1 (20MB)
    L2 L#8 (256KB) + L1 L#8 (32KB) + Core L#8
      PU L#16 (P#8)
      PU L#17 (P#24)
    L2 L#9 (256KB) + L1 L#9 (32KB) + Core L#9
      PU L#18 (P#9)
      PU L#19 (P#25)
    L2 L#10 (256KB) + L1 L#10 (32KB) + Core L#10
      PU L#20 (P#10)
      PU L#21 (P#26)
    L2 L#11 (256KB) + L1 L#11 (32KB) + Core L#11
      PU L#22 (P#11)
      PU L#23 (P#27)
    L2 L#12 (256KB) + L1 L#12 (32KB) + Core L#12
      PU L#24 (P#12)
      PU L#25 (P#28)
    L2 L#13 (256KB) + L1 L#13 (32KB) + Core L#13
      PU L#26 (P#13)
      PU L#27 (P#29)
    L2 L#14 (256KB) + L1 L#14 (32KB) + Core L#14
      PU L#28 (P#14)
      PU L#29 (P#30)
    L2 L#15 (256KB) + L1 L#15 (32KB) + Core L#15
      PU L#30 (P#15)
      PU L#31 (P#31)
[12:59] svbu-mpi059:~/mpi % hwloc-bind socket:1.core:5 -l ./report-bindings.sh
MCW rank (svbu-mpi059): Socket:1.Core:5.PU:13 Socket:1.Core:5.PU:29
[13:00] svbu-mpi059:~/mpi % cat report-bindings.sh
#!/bin/sh

bitmap=`hwloc-bind --get -p`
friendly=`hwloc-calc -p -H socket.core.pu $bitmap`

echo "MCW rank $OMPI_COMM_WORLD_RANK (`hostname`): $friendly"
exit 0
[13:00] svbu-mpi059:~/mpi %
------
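If you want OMPI_COMM_WORLD_RANK to be filled in, you can also launch the same script under mpiexec; something like this (illustrative, adjust -np and the binding option as needed):

------
# illustrative: run the script as a 2-process MPI job with core binding
mpiexec -np 2 -bind-to-core -report-bindings ./report-bindings.sh
------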

Try just running hwloc-bind to bind yourself to some logical location, run my report-bindings.sh script, and see if the physical indexes it outputs are correct.
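For example, something like this (the socket/core numbers are just an illustration; pick ones that exist on your machine) should print physical indexes that match the P# values in the lstopo output above:

------
# bind to logical socket 0, core 2, then report the physical indexes we landed on
hwloc-bind socket:0.core:2 -l ./report-bindings.sh
------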

On Sep 10, 2012, at 7:34 AM, Siegmar Gross wrote:

> Hi,
>
>>> are the following outputs helpful to find the error with
>>> a rankfile on Solaris?
>>
>> If you can't bind on the new Solaris machine, then the rankfile
>> won't do you any good. It looks like we are getting the incorrect
>> number of cores on that machine - is it possible that it has
>> hardware threads, and doesn't report "cores"? Can you download
>> and run a copy of lstopo to check the output? You get that from
>> the hwloc folks:
>>
>> http://www.open-mpi.org/software/hwloc/v1.5/
>
> I downloaded and installed the package on our machines. Perhaps it is
> easier to detect the error if you have more information. Therefore I
> provide the different hardware architectures of all machines on which
> a simple program breaks if I try to bind processes to sockets or cores.
>
> I tried the following five commands, with "h" being one of "tyr", "rs0",
> "linpc0", "linpc1", "linpc2", "linpc4", "sunpc0", "sunpc1",
> "sunpc2", or "sunpc4", from a shell script that I started on
> my local machine ("tyr"). "works on" means that the small program
> (MPI_Init, printf, MPI_Finalize) didn't break. I didn't check whether
> the layout of the processes was correct.
>
>
> mpiexec -report-bindings -np 4 -host h init_finalize
>
> works on: tyr, rs0, linpc0, linpc1, linpc2, linpc4, sunpc0, sunpc1,
> sunpc2, sunpc4
> breaks on: -
>
>
> mpiexec -report-bindings -np 4 -host h -bind-to-core -bycore init_finalize
>
> works on: linpc2, sunpc1
> breaks on: tyr, rs0, linpc0, linpc1, linpc4, sunpc0, sunpc2, sunpc4
>
>
> mpiexec -report-bindings -np 4 -host h -bind-to-core -bysocket init_finalize
>
> works on: linpc2, sunpc1
> breaks on: tyr, rs0, linpc0, linpc1, linpc4, sunpc0, sunpc2, sunpc4
>
>
> mpiexec -report-bindings -np 4 -host h -bind-to-socket -bycore init_finalize
>
> works on: tyr, linpc1, linpc2, sunpc1, sunpc2
> breaks on: rs0, linpc0, linpc4, sunpc0, sunpc4
>
>
> mpiexec -report-bindings -np 4 -host h -bind-to-socket -bysocket init_finalize
>
> works on: tyr, linpc1, linpc2, sunpc1, sunpc2
> breaks on: rs0, linpc0, linpc4, sunpc0, sunpc4
>
>
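(A minimal sketch of the kind of driver script described above; the host list, the five option combinations, and the program name come from the text, while the loop structure and failure reporting are assumptions for illustration.)

------
#!/bin/sh
# Sketch only: run init_finalize with each binding variant on each host
# and flag the combinations that break.
for h in tyr rs0 linpc0 linpc1 linpc2 linpc4 sunpc0 sunpc1 sunpc2 sunpc4; do
  for opts in "" "-bind-to-core -bycore" "-bind-to-core -bysocket" \
              "-bind-to-socket -bycore" "-bind-to-socket -bysocket"; do
    mpiexec -report-bindings -np 4 -host $h $opts init_finalize \
      || echo "breaks on: $h ($opts)"
  done
done
------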
>
> "lstopo" shows the following hardware configurations for the above
> machines. The first line always shows the installed architecture.
> "lstopo" does a good job as far as I can see it.
>
> tyr:
> ----
>
> UltraSPARC-IIIi, 2 single core processors, no hardware threads
>
> tyr fd1026 183 lstopo
> Machine (4096MB)
> NUMANode L#0 (P#2 2048MB) + Socket L#0 + Core L#0 + PU L#0 (P#0)
> NUMANode L#1 (P#1 2048MB) + Socket L#1 + Core L#1 + PU L#1 (P#1)
>
> tyr fd1026 116 psrinfo -pv
> The physical processor has 1 virtual processor (0)
> UltraSPARC-IIIi (portid 0 impl 0x16 ver 0x34 clock 1600 MHz)
> The physical processor has 1 virtual processor (1)
> UltraSPARC-IIIi (portid 1 impl 0x16 ver 0x34 clock 1600 MHz)
>
>
> rs0, rs1:
> ---------
>
> SPARC64-VII, 2 quad-core processors, 2 hardware threads / core
>
> rs0 fd1026 105 lstopo
> Machine (32GB) + NUMANode L#0 (P#1 32GB)
>   Socket L#0
>     Core L#0
>       PU L#0 (P#0)
>       PU L#1 (P#1)
>     Core L#1
>       PU L#2 (P#2)
>       PU L#3 (P#3)
>     Core L#2
>       PU L#4 (P#4)
>       PU L#5 (P#5)
>     Core L#3
>       PU L#6 (P#6)
>       PU L#7 (P#7)
>   Socket L#1
>     Core L#4
>       PU L#8 (P#8)
>       PU L#9 (P#9)
>     Core L#5
>       PU L#10 (P#10)
>       PU L#11 (P#11)
>     Core L#6
>       PU L#12 (P#12)
>       PU L#13 (P#13)
>     Core L#7
>       PU L#14 (P#14)
>       PU L#15 (P#15)
>
> tyr fd1026 117 ssh rs0 psrinfo -pv
> The physical processor has 8 virtual processors (0-7)
> SPARC64-VII (portid 1024 impl 0x7 ver 0x91 clock 2400 MHz)
> The physical processor has 8 virtual processors (8-15)
> SPARC64-VII (portid 1032 impl 0x7 ver 0x91 clock 2400 MHz)
>
>
> linpc0, linpc3:
> ---------------
>
> AMD Athlon64 X2, 1 dual-core processor, no hardware threads
>
> linpc0 fd1026 102 lstopo
> Machine (4023MB) + Socket L#0
> L2 L#0 (512KB) + L1d L#0 (64KB) + L1i L#0 (64KB) + Core L#0 + PU L#0 (P#0)
> L2 L#1 (512KB) + L1d L#1 (64KB) + L1i L#1 (64KB) + Core L#1 + PU L#1 (P#1)
>
>
> It is strange that openSuSE-Linux-12.1 thinks that two
> dual-core processors are available although the machines
> are only equipped with one processor.
>
> linpc0 fd1026 104 cat /proc/cpuinfo | grep -e processor -e "cpu core"
> processor : 0
> cpu cores : 2
> processor : 1
> cpu cores : 2
>
>
> linpc1:
> -------
>
> Intel Xeon, 2 single core processors, no hardware threads
>
> linpc1 fd1026 104 lstopo
> Machine (3829MB)
> Socket L#0 + Core L#0 + PU L#0 (P#0)
> Socket L#1 + Core L#1 + PU L#1 (P#1)
>
> tyr fd1026 118 ssh linpc1 cat /proc/cpuinfo | grep -e processor -e "cpu core"
> processor : 0
> cpu cores : 1
> processor : 1
> cpu cores : 1
>
>
> linpc2:
> -------
>
> AMD Opteron 280, 2 dual-core processors, no hardware threads
>
> linpc2 fd1026 103 lstopo
> Machine (8190MB)
>   NUMANode L#0 (P#0 4094MB) + Socket L#0
>     L2 L#0 (1024KB) + L1d L#0 (64KB) + L1i L#0 (64KB) + Core L#0 + PU L#0 (P#0)
>     L2 L#1 (1024KB) + L1d L#1 (64KB) + L1i L#1 (64KB) + Core L#1 + PU L#1 (P#1)
>   NUMANode L#1 (P#1 4096MB) + Socket L#1
>     L2 L#2 (1024KB) + L1d L#2 (64KB) + L1i L#2 (64KB) + Core L#2 + PU L#2 (P#2)
>     L2 L#3 (1024KB) + L1d L#3 (64KB) + L1i L#3 (64KB) + Core L#3 + PU L#3 (P#3)
>
> It is strange that openSuSE-Linux-12.1 thinks that four
> dual-core processors are available although the machine
> is only equipped with two processors.
>
> linpc2 fd1026 104 cat /proc/cpuinfo | grep -e processor -e "cpu core"
> processor : 0
> cpu cores : 2
> processor : 1
> cpu cores : 2
> processor : 2
> cpu cores : 2
> processor : 3
> cpu cores : 2
>
>
>
> linpc4:
> -------
>
> AMD Opteron 1218, 1 dual-core processor, no hardware threads
>
> linpc4 fd1026 100 lstopo
> Machine (4024MB) + Socket L#0
> L2 L#0 (1024KB) + L1d L#0 (64KB) + L1i L#0 (64KB) + Core L#0 + PU L#0 (P#0)
> L2 L#1 (1024KB) + L1d L#1 (64KB) + L1i L#1 (64KB) + Core L#1 + PU L#1 (P#1)
>
> It is strange that openSuSE-Linux-12.1 thinks that two
> dual-core processors are available although the machine
> is only equipped with one processor.
>
> tyr fd1026 230 ssh linpc4 cat /proc/cpuinfo | grep -e processor -e "cpu core"
> processor : 0
> cpu cores : 2
> processor : 1
> cpu cores : 2
>
>
>
> sunpc0, sunpc3:
> ---------------
>
> AMD Athlon64 X2, 1 dual-core processor, no hardware threads
>
> sunpc0 fd1026 104 lstopo
> Machine (4094MB) + NUMANode L#0 (P#0 4094MB) + Socket L#0
> Core L#0 + PU L#0 (P#0)
> Core L#1 + PU L#1 (P#1)
>
> tyr fd1026 111 ssh sunpc0 psrinfo -pv
> The physical processor has 2 virtual processors (0 1)
> x86 (chipid 0x0 AuthenticAMD family 15 model 43 step 1 clock 2000 MHz)
> AMD Athlon(tm) 64 X2 Dual Core Processor 3800+
>
>
> sunpc1:
> -------
>
> AMD Opteron 280, 2 dual-core processors, no hardware threads
>
> sunpc1 fd1026 104 lstopo
> Machine (8191MB)
>   NUMANode L#0 (P#1 4095MB) + Socket L#0
>     Core L#0 + PU L#0 (P#0)
>     Core L#1 + PU L#1 (P#1)
>   NUMANode L#1 (P#2 4096MB) + Socket L#1
>     Core L#2 + PU L#2 (P#2)
>     Core L#3 + PU L#3 (P#3)
>
> tyr fd1026 112 ssh sunpc1 psrinfo -pv
> The physical processor has 2 virtual processors (0 1)
> x86 (chipid 0x0 AuthenticAMD family 15 model 33 step 2 clock 2411 MHz)
> Dual Core AMD Opteron(tm) Processor 280
> The physical processor has 2 virtual processors (2 3)
> x86 (chipid 0x1 AuthenticAMD family 15 model 33 step 2 clock 2411 MHz)
> Dual Core AMD Opteron(tm) Processor 280
>
>
> sunpc2:
> -------
>
> Intel Xeon, 2 single core processors, no hardware threads
>
> sunpc2 fd1026 104 lstopo
> Machine (3904MB) + NUMANode L#0 (P#0 3904MB)
> Socket L#0 + Core L#0 + PU L#0 (P#0)
> Socket L#1 + Core L#1 + PU L#1 (P#1)
>
> tyr fd1026 114 ssh sunpc2 psrinfo -pv
> The physical processor has 1 virtual processor (0)
> x86 (chipid 0x0 GenuineIntel family 15 model 2 step 9 clock 2791 MHz)
> Intel(r) Xeon(tm) CPU 2.80GHz
> The physical processor has 1 virtual processor (1)
> x86 (chipid 0x3 GenuineIntel family 15 model 2 step 9 clock 2791 MHz)
> Intel(r) Xeon(tm) CPU 2.80GHz
>
>
> sunpc4:
> -------
>
> AMD Opteron 1218, 1 dual-core processor, no hardware threads
>
> sunpc4 fd1026 104 lstopo
> Machine (4096MB) + NUMANode L#0 (P#0 4096MB) + Socket L#0
> Core L#0 + PU L#0 (P#0)
> Core L#1 + PU L#1 (P#1)
>
> tyr fd1026 115 ssh sunpc4 psrinfo -pv
> The physical processor has 2 virtual processors (0 1)
> x86 (chipid 0x0 AuthenticAMD family 15 model 67 step 2 clock 2613 MHz)
> Dual-Core AMD Opteron(tm) Processor 1218
>
>
>
>
> Among others I got the following error messages (I can provide
> the complete file if you are interested in it).
>
> ##################
> ##################
> mpiexec -report-bindings -np 4 -host tyr -bind-to-core -bycore init_finalize
> [tyr.informatik.hs-fulda.de:23208] [[30908,0],0] odls:default:fork binding child
> [[30908,1],2] to cpus 0004
> --------------------------------------------------------------------------
> An attempt to set processor affinity has failed - please check to
> ensure that your system supports such functionality. If so, then
> this is probably something that should be reported to the OMPI developers.
> --------------------------------------------------------------------------
> [tyr.informatik.hs-fulda.de:23208] [[30908,0],0] odls:default:fork binding child
> [[30908,1],0] to cpus 0001
> [tyr.informatik.hs-fulda.de:23208] [[30908,0],0] odls:default:fork binding child
> [[30908,1],1] to cpus 0002
> --------------------------------------------------------------------------
> mpiexec was unable to start the specified application as it encountered an error
> on node tyr.informatik.hs-fulda.de. More information may be available above.
> --------------------------------------------------------------------------
> 4 total processes failed to start
>
>
> ##################
> ##################
> mpiexec -report-bindings -np 4 -host tyr -bind-to-core -bysocket init_finalize
> --------------------------------------------------------------------------
> An invalid physical processor ID was returned when attempting to bind
> an MPI process to a unique processor.
>
> This usually means that you requested binding to more processors than
> exist (e.g., trying to bind N MPI processes to M processors, where N >
> M). Double check that you have enough unique processors for all the
> MPI processes that you are launching on this host.
>
> You job will now abort.
> --------------------------------------------------------------------------
> [tyr.informatik.hs-fulda.de:23215] [[30907,0],0] odls:default:fork binding child
> [[30907,1],0] to socket 0 cpus 0001
> [tyr.informatik.hs-fulda.de:23215] [[30907,0],0] odls:default:fork binding child
> [[30907,1],1] to socket 1 cpus 0002
> --------------------------------------------------------------------------
> mpiexec was unable to start the specified application as it encountered an error
> on node tyr.informatik.hs-fulda.de. More information may be available above.
> --------------------------------------------------------------------------
> 4 total processes failed to start
>
>
> ##################
> ##################
> mpiexec -report-bindings -np 4 -host rs0 -bind-to-core -bycore init_finalize
> --------------------------------------------------------------------------
> An attempt to set processor affinity has failed - please check to
> ensure that your system supports such functionality. If so, then
> this is probably something that should be reported to the OMPI developers.
> --------------------------------------------------------------------------
> [rs0.informatik.hs-fulda.de:05715] [[30936,0],1] odls:default:fork binding child
> [[30936,1],0] to cpus 0001
> --------------------------------------------------------------------------
> mpiexec was unable to start the specified application as it encountered an
> error:
>
> Error name: Resource temporarily unavailable
> Node: rs0
>
> when attempting to start process rank 0.
> --------------------------------------------------------------------------
> 4 total processes failed to start
>
>
> ##################
> ##################
> mpiexec -report-bindings -np 4 -host rs0 -bind-to-core -bysocket init_finalize
> --------------------------------------------------------------------------
> An attempt to set processor affinity has failed - please check to
> ensure that your system supports such functionality. If so, then
> this is probably something that should be reported to the OMPI developers.
> --------------------------------------------------------------------------
> [rs0.informatik.hs-fulda.de:05743] [[30916,0],1] odls:default:fork binding child
> [[30916,1],0] to socket 0 cpus 0001
> --------------------------------------------------------------------------
> mpiexec was unable to start the specified application as it encountered an
> error:
>
> Error name: Resource temporarily unavailable
> Node: rs0
>
> when attempting to start process rank 0.
> --------------------------------------------------------------------------
> 4 total processes failed to start
>
>
> ##################
> ##################
> mpiexec -report-bindings -np 4 -host rs0 -bind-to-socket -bycore init_finalize
> --------------------------------------------------------------------------
> An attempt to set processor affinity has failed - please check to
> ensure that your system supports such functionality. If so, then
> this is probably something that should be reported to the OMPI developers.
> --------------------------------------------------------------------------
> [rs0.informatik.hs-fulda.de:05771] [[30912,0],1] odls:default:fork binding child
> [[30912,1],0] to socket 0 cpus 0055
> --------------------------------------------------------------------------
> mpiexec was unable to start the specified application as it encountered an
> error:
>
> Error name: Resource temporarily unavailable
> Node: rs0
>
> when attempting to start process rank 0.
> --------------------------------------------------------------------------
> 4 total processes failed to start
>
>
> ##################
> ##################
> mpiexec -report-bindings -np 4 -host rs0 -bind-to-socket -bysocket init_finalize
> --------------------------------------------------------------------------
> An attempt to set processor affinity has failed - please check to
> ensure that your system supports such functionality. If so, then
> this is probably something that should be reported to the OMPI developers.
> --------------------------------------------------------------------------
> [rs0.informatik.hs-fulda.de:05799] [[30924,0],1] odls:default:fork binding child
> [[30924,1],0] to socket 0 cpus 0055
> --------------------------------------------------------------------------
> mpiexec was unable to start the specified application as it encountered an
> error:
>
> Error name: Resource temporarily unavailable
> Node: rs0
>
> when attempting to start process rank 0.
> --------------------------------------------------------------------------
> 4 total processes failed to start
>
>
> ##################
> ##################
> mpiexec -report-bindings -np 4 -host linpc0 -bind-to-core -bycore init_finalize
> --------------------------------------------------------------------------
> An attempt to set processor affinity has failed - please check to
> ensure that your system supports such functionality. If so, then
> this is probably something that should be reported to the OMPI developers.
> --------------------------------------------------------------------------
> [linpc0:02275] [[30964,0],1] odls:default:fork binding child [[30964,1],0] to
> cpus 0001
> [linpc0:02275] [[30964,0],1] odls:default:fork binding child [[30964,1],1] to
> cpus 0002
> [linpc0:02275] [[30964,0],1] odls:default:fork binding child [[30964,1],2] to
> cpus 0004
> --------------------------------------------------------------------------
> mpiexec was unable to start the specified application as it encountered an error
> on node linpc0. More information may be available above.
> --------------------------------------------------------------------------
> 4 total processes failed to start
>
>
> ##################
> ##################
> mpiexec -report-bindings -np 4 -host linpc0 -bind-to-core -bysocket
> init_finalize
> --------------------------------------------------------------------------
> An invalid physical processor ID was returned when attempting to bind
> an MPI process to a unique processor.
>
> This usually means that you requested binding to more processors than
> exist (e.g., trying to bind N MPI processes to M processors, where N >
> M). Double check that you have enough unique processors for all the
> MPI processes that you are launching on this host.
>
> You job will now abort.
> --------------------------------------------------------------------------
> [linpc0:02326] [[30960,0],1] odls:default:fork binding child [[30960,1],0] to
> socket 0 cpus 0001
> [linpc0:02326] [[30960,0],1] odls:default:fork binding child [[30960,1],1] to
> socket 0 cpus 0002
> --------------------------------------------------------------------------
> mpiexec was unable to start the specified application as it encountered an error
> on node linpc0. More information may be available above.
> --------------------------------------------------------------------------
> 4 total processes failed to start
>
>
> ##################
> ##################
> mpiexec -report-bindings -np 4 -host linpc0 -bind-to-socket -bycore
> init_finalize
> --------------------------------------------------------------------------
> Unable to bind to socket 0 on node linpc0.
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> mpiexec was unable to start the specified application as it encountered an
> error:
>
> Error name: Fatal
> Node: linpc0
>
> when attempting to start process rank 0.
> --------------------------------------------------------------------------
> 4 total processes failed to start
>
>
> ##################
> ##################
> mpiexec -report-bindings -np 4 -host linpc0 -bind-to-socket -bysocket
> init_finalize
> --------------------------------------------------------------------------
> Unable to bind to socket 0 on node linpc0.
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> mpiexec was unable to start the specified application as it encountered an
> error:
>
> Error name: Fatal
> Node: linpc0
>
> when attempting to start process rank 0.
> --------------------------------------------------------------------------
> 4 total processes failed to start
>
>
>
> Hopefully this helps you track down the error. Thank you very much
> for your help in advance.
>
>
> Kind regards
>
> Siegmar
>
>
>
>>> I wrapped long lines so that they
>>> are easier to read. Have you had time to look at the
>>> segmentation fault with a rankfile which I reported in my
>>> last email (see below)?
>>
>> I'm afraid not - been too busy lately. I'd suggest first focusing
>> on getting binding to work.
>>
>>>
>>> "tyr" is a two processor single core machine.
>>>
>>> tyr fd1026 116 mpiexec -report-bindings -np 4 \
>>> -bind-to-socket -bycore rank_size
>>> [tyr.informatik.hs-fulda.de:18614] [[27298,0],0] odls:default:
>>> fork binding child [[27298,1],0] to socket 0 cpus 0001
>>> [tyr.informatik.hs-fulda.de:18614] [[27298,0],0] odls:default:
>>> fork binding child [[27298,1],1] to socket 1 cpus 0002
>>> [tyr.informatik.hs-fulda.de:18614] [[27298,0],0] odls:default:
>>> fork binding child [[27298,1],2] to socket 0 cpus 0001
>>> [tyr.informatik.hs-fulda.de:18614] [[27298,0],0] odls:default:
>>> fork binding child [[27298,1],3] to socket 1 cpus 0002
>>> I'm process 0 of 4 ...
>>>
>>>
>>> tyr fd1026 121 mpiexec -report-bindings -np 4 \
>>> -bind-to-socket -bysocket rank_size
>>> [tyr.informatik.hs-fulda.de:18656] [[27380,0],0] odls:default:
>>> fork binding child [[27380,1],0] to socket 0 cpus 0001
>>> [tyr.informatik.hs-fulda.de:18656] [[27380,0],0] odls:default:
>>> fork binding child [[27380,1],1] to socket 1 cpus 0002
>>> [tyr.informatik.hs-fulda.de:18656] [[27380,0],0] odls:default:
>>> fork binding child [[27380,1],2] to socket 0 cpus 0001
>>> [tyr.informatik.hs-fulda.de:18656] [[27380,0],0] odls:default:
>>> fork binding child [[27380,1],3] to socket 1 cpus 0002
>>> I'm process 0 of 4 ...
>>>
>>>
>>> tyr fd1026 117 mpiexec -report-bindings -np 4 \
>>> -bind-to-core -bycore rank_size
>>> [tyr.informatik.hs-fulda.de:18623] [[27307,0],0] odls:default:
>>> fork binding child [[27307,1],2] to cpus 0004
>>> ------------------------------------------------------------------
>>> An attempt to set processor affinity has failed - please check to
>>> ensure that your system supports such functionality. If so, then
>>> this is probably something that should be reported to the OMPI
>>> developers.
>>> ------------------------------------------------------------------
>>> [tyr.informatik.hs-fulda.de:18623] [[27307,0],0] odls:default:
>>> fork binding child [[27307,1],0] to cpus 0001
>>> [tyr.informatik.hs-fulda.de:18623] [[27307,0],0] odls:default:
>>> fork binding child [[27307,1],1] to cpus 0002
>>> ------------------------------------------------------------------
>>> mpiexec was unable to start the specified application
>>> as it encountered an error
>>> on node tyr.informatik.hs-fulda.de. More information may be
>>> available above.
>>> ------------------------------------------------------------------
>>> 4 total processes failed to start
>>>
>>>
>>>
>>> tyr fd1026 118 mpiexec -report-bindings -np 4 \
>>> -bind-to-core -bysocket rank_size
>>> ------------------------------------------------------------------
>>> An invalid physical processor ID was returned when attempting to
>>> bind
>>> an MPI process to a unique processor.
>>>
>>> This usually means that you requested binding to more processors
>>> than
>>>
>>> exist (e.g., trying to bind N MPI processes to M processors,
>>> where N >
>>> M). Double check that you have enough unique processors for
>>> all the
>>> MPI processes that you are launching on this host.
>>>
>>> You job will now abort.
>>> ------------------------------------------------------------------
>>> [tyr.informatik.hs-fulda.de:18631] [[27347,0],0] odls:default:
>>> fork binding child [[27347,1],0] to socket 0 cpus 0001
>>> [tyr.informatik.hs-fulda.de:18631] [[27347,0],0] odls:default:
>>> fork binding child [[27347,1],1] to socket 1 cpus 0002
>>> ------------------------------------------------------------------
>>> mpiexec was unable to start the specified application as it
>>> encountered an error
>>> on node tyr.informatik.hs-fulda.de. More information may be
>>> available above.
>>> ------------------------------------------------------------------
>>> 4 total processes failed to start
>>> tyr fd1026 119
>>>
>>>
>>>
>>> "linpc3" and "linpc4" are two processor dual core machines.
>>>
>>> linpc4 fd1026 102 mpiexec -report-bindings -host linpc3,linpc4 \
>>> -np 4 -bind-to-core -bycore rank_size
>>> [linpc4:16842] [[40914,0],0] odls:default:
>>> fork binding child [[40914,1],1] to cpus 0001
>>> [linpc4:16842] [[40914,0],0] odls:default:
>>> fork binding child [[40914,1],3] to cpus 0002
>>> [linpc3:31384] [[40914,0],1] odls:default:
>>> fork binding child [[40914,1],0] to cpus 0001
>>> [linpc3:31384] [[40914,0],1] odls:default:
>>> fork binding child [[40914,1],2] to cpus 0002
>>> I'm process 1 of 4 ...
>>>
>>>
>>> linpc4 fd1026 102 mpiexec -report-bindings -host linpc3,linpc4 \
>>> -np 4 -bind-to-core -bysocket rank_size
>>> [linpc4:16846] [[40918,0],0] odls:default:
>>> fork binding child [[40918,1],1] to socket 0 cpus 0001
>>> [linpc4:16846] [[40918,0],0] odls:default:
>>> fork binding child [[40918,1],3] to socket 0 cpus 0002
>>> [linpc3:31435] [[40918,0],1] odls:default:
>>> fork binding child [[40918,1],0] to socket 0 cpus 0001
>>> [linpc3:31435] [[40918,0],1] odls:default:
>>> fork binding child [[40918,1],2] to socket 0 cpus 0002
>>> I'm process 1 of 4 ...
>>>
>>>
>>>
>>>
>>> linpc4 fd1026 104 mpiexec -report-bindings -host linpc3,linpc4 \
>>> -np 4 -bind-to-socket -bycore rank_size
>>> ------------------------------------------------------------------
>>> Unable to bind to socket 0 on node linpc3.
>>> ------------------------------------------------------------------
>>> ------------------------------------------------------------------
>>> Unable to bind to socket 0 on node linpc4.
>>> ------------------------------------------------------------------
>>> ------------------------------------------------------------------
>>> mpiexec was unable to start the specified application as it
>>> encountered an error:
>>>
>>> Error name: Fatal
>>> Node: linpc4
>>>
>>> when attempting to start process rank 1.
>>> ------------------------------------------------------------------
>>> 4 total processes failed to start
>>> linpc4 fd1026 105
>>>
>>>
>>> linpc4 fd1026 105 mpiexec -report-bindings -host linpc3,linpc4 \
>>> -np 4 -bind-to-socket -bysocket rank_size
>>> ------------------------------------------------------------------
>>> Unable to bind to socket 0 on node linpc4.
>>> ------------------------------------------------------------------
>>> ------------------------------------------------------------------
>>> Unable to bind to socket 0 on node linpc3.
>>> ------------------------------------------------------------------
>>> ------------------------------------------------------------------
>>> mpiexec was unable to start the specified application as it
>>> encountered an error:
>>>
>>> Error name: Fatal
>>> Node: linpc4
>>>
>>> when attempting to start process rank 1.
>>> --------------------------------------------------------------------------
>>> 4 total processes failed to start
>>>
>>>
>>> It's interesting that commands that work on Solaris fail on Linux
>>> and vice versa.
>>>
>>>
>>> Kind regards
>>>
>>> Siegmar
>>>
>>>>> I couldn't really say for certain - I don't see anything obviously
>>>>> wrong with your syntax, and the code appears to be working or else
>>>>> it would fail on the other nodes as well. The fact that it fails
>>>>> solely on that machine seems suspect.
>>>>>
>>>>> Set aside the rankfile for the moment and try to just bind to cores
>>>>> on that machine, something like:
>>>>>
>>>>> mpiexec --report-bindings -bind-to-core
>>>>> -host rs0.informatik.hs-fulda.de -n 2 rank_size
>>>>>
>>>>> If that doesn't work, then the problem isn't with rankfile
>>>>
>>>> It doesn't work but I found out something else as you can see below.
>>>> I get a segmentation fault for some rankfiles.
>>>>
>>>>
>>>> tyr small_prog 110 mpiexec --report-bindings -bind-to-core
>>>> -host rs0.informatik.hs-fulda.de -n 2 rank_size
>>>> --------------------------------------------------------------------------
>>>> An attempt to set processor affinity has failed - please check to
>>>> ensure that your system supports such functionality. If so, then
>>>> this is probably something that should be reported to the OMPI developers.
>>>> --------------------------------------------------------------------------
>>>> [rs0.informatik.hs-fulda.de:14695] [[30561,0],1] odls:default:
>>>> fork binding child [[30561,1],0] to cpus 0001
>>>> --------------------------------------------------------------------------
>>>> mpiexec was unable to start the specified application as it
>>>> encountered an error:
>>>>
>>>> Error name: Resource temporarily unavailable
>>>> Node: rs0.informatik.hs-fulda.de
>>>>
>>>> when attempting to start process rank 0.
>>>> --------------------------------------------------------------------------
>>>> 2 total processes failed to start
>>>> tyr small_prog 111
>>>>
>>>>
>>>>
>>>>
>>>> Perhaps I have a hint for the error on Solaris Sparc. I use the
>>>> following rankfile to keep everything simple.
>>>>
>>>> rank 0=tyr.informatik.hs-fulda.de slot=0:0
>>>> rank 1=linpc0.informatik.hs-fulda.de slot=0:0
>>>> rank 2=linpc1.informatik.hs-fulda.de slot=0:0
>>>> #rank 3=linpc2.informatik.hs-fulda.de slot=0:0
>>>> rank 4=linpc3.informatik.hs-fulda.de slot=0:0
>>>> rank 5=linpc4.informatik.hs-fulda.de slot=0:0
>>>> rank 6=sunpc0.informatik.hs-fulda.de slot=0:0
>>>> rank 7=sunpc1.informatik.hs-fulda.de slot=0:0
>>>> rank 8=sunpc2.informatik.hs-fulda.de slot=0:0
>>>> rank 9=sunpc3.informatik.hs-fulda.de slot=0:0
>>>> rank 10=sunpc4.informatik.hs-fulda.de slot=0:0
>>>>
>>>> When I execute "mpiexec -report-bindings -rf my_rankfile rank_size"
>>>> on a Linux-x86_64 or Solaris-10-x86_64 machine everything works fine.
>>>>
>>>> linpc4 small_prog 104 mpiexec -report-bindings -rf my_rankfile rank_size
>>>> [linpc4:08018] [[49482,0],0] odls:default:fork binding child
>>>> [[49482,1],5] to slot_list 0:0
>>>> [linpc3:22030] [[49482,0],4] odls:default:fork binding child
>>>> [[49482,1],4] to slot_list 0:0
>>>> [linpc0:12887] [[49482,0],2] odls:default:fork binding child
>>>> [[49482,1],1] to slot_list 0:0
>>>> [linpc1:08323] [[49482,0],3] odls:default:fork binding child
>>>> [[49482,1],2] to slot_list 0:0
>>>> [sunpc1:17786] [[49482,0],6] odls:default:fork binding child
>>>> [[49482,1],7] to slot_list 0:0
>>>> [sunpc3.informatik.hs-fulda.de:08482] [[49482,0],8] odls:default:fork
>>>> binding child [[49482,1],9] to slot_list 0:0
>>>> [sunpc0.informatik.hs-fulda.de:11568] [[49482,0],5] odls:default:fork
>>>> binding child [[49482,1],6] to slot_list 0:0
>>>> [tyr.informatik.hs-fulda.de:21484] [[49482,0],1] odls:default:fork
>>>> binding child [[49482,1],0] to slot_list 0:0
>>>> [sunpc2.informatik.hs-fulda.de:28638] [[49482,0],7] odls:default:fork
>>>> binding child [[49482,1],8] to slot_list 0:0
>>>> ...
>>>>
>>>>
>>>>
>>>> I get a segmentation fault when I run it on my local machine
>>>> (Solaris Sparc).
>>>>
>>>> tyr small_prog 141 mpiexec -report-bindings -rf my_rankfile rank_size
>>>> [tyr.informatik.hs-fulda.de:21421] [[29113,0],0] ORTE_ERROR_LOG:
>>>> Data unpack would read past end of buffer in file
>>>> ../../../../openmpi-1.6/orte/mca/odls/base/odls_base_default_fns.c
>>>> at line 927
>>>> [tyr:21421] *** Process received signal ***
>>>> [tyr:21421] Signal: Segmentation Fault (11)
>>>> [tyr:21421] Signal code: Address not mapped (1)
>>>> [tyr:21421] Failing at address: 5ba
>>>> /export2/prog/SunOS_sparc/openmpi-1.6_32_cc/lib/libopen-rte.so.4.0.0:0x15d3ec
>>>> /lib/libc.so.1:0xcad04
>>>> /lib/libc.so.1:0xbf3b4
>>>> /lib/libc.so.1:0xbf59c
>>>> /lib/libc.so.1:0x58bd0 [ Signal 11 (SEGV)]
>>>> /lib/libc.so.1:free+0x24
>>>> /export2/prog/SunOS_sparc/openmpi-1.6_32_cc/lib/libopen-rte.so.4.0.0:
>>>> orte_odls_base_default_construct_child_list+0x1234
>>>> /export2/prog/SunOS_sparc/openmpi-1.6_32_cc/lib/openmpi/
>>>> mca_odls_default.so:0x90b8
>>>> /export2/prog/SunOS_sparc/openmpi-1.6_32_cc/lib/libopen-rte.so.4.0.0:0x5e8d4
>>>> /export2/prog/SunOS_sparc/openmpi-1.6_32_cc/lib/libopen-rte.so.4.0.0:
>>>> orte_daemon_cmd_processor+0x328
>>>> /export2/prog/SunOS_sparc/openmpi-1.6_32_cc/lib/libopen-rte.so.4.0.0:0x12e324
>>>> /export2/prog/SunOS_sparc/openmpi-1.6_32_cc/lib/libopen-rte.so.4.0.0:
>>>> opal_event_base_loop+0x228
>>>> /export2/prog/SunOS_sparc/openmpi-1.6_32_cc/lib/libopen-rte.so.4.0.0:
>>>> opal_progress+0xec
>>>> /export2/prog/SunOS_sparc/openmpi-1.6_32_cc/lib/libopen-rte.so.4.0.0:
>>>> orte_plm_base_report_launched+0x1c4
>>>> /export2/prog/SunOS_sparc/openmpi-1.6_32_cc/lib/libopen-rte.so.4.0.0:
>>>> orte_plm_base_launch_apps+0x318
>>>> /export2/prog/SunOS_sparc/openmpi-1.6_32_cc/lib/openmpi/mca_plm_rsh.so:
>>>> orte_plm_rsh_launch+0xac4
>>>> /export2/prog/SunOS_sparc/openmpi-1.6_32_cc/bin/orterun:orterun+0x16a8
>>>> /export2/prog/SunOS_sparc/openmpi-1.6_32_cc/bin/orterun:main+0x24
>>>> /export2/prog/SunOS_sparc/openmpi-1.6_32_cc/bin/orterun:_start+0xd8
>>>> [tyr:21421] *** End of error message ***
>>>> Segmentation fault
>>>> tyr small_prog 142
>>>>
>>>>
>>>> The funny thing is that I get a segmentation fault on the Linux
>>>> machine as well if I change my rankfile in the following way.
>>>>
>>>> rank 0=tyr.informatik.hs-fulda.de slot=0:0
>>>> rank 1=linpc0.informatik.hs-fulda.de slot=0:0
>>>> #rank 2=linpc1.informatik.hs-fulda.de slot=0:0
>>>> #rank 3=linpc2.informatik.hs-fulda.de slot=0:0
>>>> #rank 4=linpc3.informatik.hs-fulda.de slot=0:0
>>>> rank 5=linpc4.informatik.hs-fulda.de slot=0:0
>>>> rank 6=sunpc0.informatik.hs-fulda.de slot=0:0
>>>> #rank 7=sunpc1.informatik.hs-fulda.de slot=0:0
>>>> #rank 8=sunpc2.informatik.hs-fulda.de slot=0:0
>>>> #rank 9=sunpc3.informatik.hs-fulda.de slot=0:0
>>>> rank 10=sunpc4.informatik.hs-fulda.de slot=0:0
>>>>
>>>>
>>>> linpc4 small_prog 107 mpiexec -report-bindings -rf my_rankfile rank_size
>>>> [linpc4:08402] [[65226,0],0] ORTE_ERROR_LOG: Data unpack would
>>>> read past end of buffer in file
>>>> ../../../../openmpi-1.6/orte/mca/odls/base/odls_base_default_fns.c
>>>> at line 927
>>>> [linpc4:08402] *** Process received signal ***
>>>> [linpc4:08402] Signal: Segmentation fault (11)
>>>> [linpc4:08402] Signal code: Address not mapped (1)
>>>> [linpc4:08402] Failing at address: 0x5f32fffc
>>>> [linpc4:08402] [ 0] [0xffffe410]
>>>> [linpc4:08402] [ 1] /usr/local/openmpi-1.6_32_cc/lib/openmpi/
>>>> mca_odls_default.so(+0x4023) [0xf73ec023]
>>>> [linpc4:08402] [ 2] /usr/local/openmpi-1.6_32_cc/lib/
>>>> libopen-rte.so.4(+0x42b91) [0xf7667b91]
>>>> [linpc4:08402] [ 3] /usr/local/openmpi-1.6_32_cc/lib/
>>>> libopen-rte.so.4(orte_daemon_cmd_processor+0x313) [0xf76655c3]
>>>> [linpc4:08402] [ 4] /usr/local/openmpi-1.6_32_cc/lib/
>>>> libopen-rte.so.4(+0x8f366) [0xf76b4366]
>>>> [linpc4:08402] [ 5] /usr/local/openmpi-1.6_32_cc/lib/
>>>> libopen-rte.so.4(opal_event_base_loop+0x18c) [0xf76b46bc]
>>>> [linpc4:08402] [ 6] /usr/local/openmpi-1.6_32_cc/lib/
>>>> libopen-rte.so.4(opal_event_loop+0x26) [0xf76b4526]
>>>> [linpc4:08402] [ 7] /usr/local/openmpi-1.6_32_cc/lib/
>>>> libopen-rte.so.4(opal_progress+0xba) [0xf769303a]
>>>> [linpc4:08402] [ 8] /usr/local/openmpi-1.6_32_cc/lib/
>>>> libopen-rte.so.4(orte_plm_base_report_launched+0x13f) [0xf767d62f]
>>>> [linpc4:08402] [ 9] /usr/local/openmpi-1.6_32_cc/lib/
>>>> libopen-rte.so.4(orte_plm_base_launch_apps+0x1b7) [0xf767bf27]
>>>> [linpc4:08402] [10] /usr/local/openmpi-1.6_32_cc/lib/openmpi/
>>>> mca_plm_rsh.so(orte_plm_rsh_launch+0xb2d) [0xf74228fd]
>>>> [linpc4:08402] [11] mpiexec(orterun+0x102f) [0x804e7bf]
>>>> [linpc4:08402] [12] mpiexec(main+0x13) [0x804c273]
>>>> [linpc4:08402] [13] /lib/libc.so.6(__libc_start_main+0xf3) [0xf745e003]
>>>> [linpc4:08402] *** End of error message ***
>>>> Segmentation fault
>>>> linpc4 small_prog 107
>>>>
>>>>
>>>> Hopefully this information helps to fix the problem.
>>>>
>>>>
>>>> Kind regards
>>>>
>>>> Siegmar
>>>>
>>>>
>>>>
>>>>
>>>>> On Sep 5, 2012, at 5:50 AM, Siegmar Gross <Siegmar.Gross_at_[hidden]> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I'm new to rankfiles, so I played a little bit with different
>>>>>> options. I thought that the following entry would be similar to an
>>>>>> entry in an appfile and that MPI could place the process with rank 0
>>>>>> on any core of any processor.
>>>>>>
>>>>>> rank 0=tyr.informatik.hs-fulda.de
>>>>>>
>>>>>> Unfortunately it's not allowed and I got an error. Can somebody add
>>>>>> the missing help to the file?
>>>>>>
>>>>>>
>>>>>> tyr small_prog 126 mpiexec -rf my_rankfile -report-bindings rank_size
>>>>>>
>>>>>> --------------------------------------------------------------------------
>>>>>> Sorry! You were supposed to get help about:
>>>>>> no-slot-list
>>>>>> from the file:
>>>>>> help-rmaps_rank_file.txt
>>>>>> But I couldn't find that topic in the file. Sorry!
>>>>>>
>>>>>> --------------------------------------------------------------------------
>>>>>>
>>>>>>
>>>>>> As you can see below I could use a rankfile on my old local machine
>>>>>> (Sun Ultra 45) but not on our "new" one (Sun Server M4000). Today I
>>>>>> logged into the machine via ssh and tried the same command once more
>>>>>> as a local user without success. It's more or less the same error as
>>>>>> before when I tried to bind the process to a remote machine.
>>>>>>
>>>>>> rs0 small_prog 118 mpiexec -rf my_rankfile -report-bindings rank_size
>>>>>> [rs0.informatik.hs-fulda.de:13745] [[19734,0],0] odls:default:fork
>>>>>> binding child [[19734,1],0] to slot_list 0:0
>>>>>>
>>>>>> --------------------------------------------------------------------------
>>>>>> We were unable to successfully process/set the requested processor
>>>>>> affinity settings:
>>>>>>
>>>>>> Specified slot list: 0:0
>>>>>> Error: Cross-device link
>>>>>>
>>>>>> This could mean that a non-existent processor was specified, or
>>>>>> that the specification had improper syntax.
>>>>>>
>>>>>> --------------------------------------------------------------------------
>>>>>> --------------------------------------------------------------------------
>>>>>> mpiexec was unable to start the specified application as it encountered
>>>>>> an error:
>>>>>>
>>>>>> Error name: No such file or directory
>>>>>> Node: rs0.informatik.hs-fulda.de
>>>>>>
>>>>>> when attempting to start process rank 0.
>>>>>>
>>>>>> --------------------------------------------------------------------------
>>>>>> rs0 small_prog 119
>>>>>>
>>>>>>
>>>>>> The application is available.
>>>>>>
>>>>>> rs0 small_prog 119 which rank_size
>>>>>> /home/fd1026/SunOS/sparc/bin/rank_size
>>>>>>
>>>>>>
>>>>>> Is it a problem in the Open MPI implementation or in my rankfile?
>>>>>> How can I find out which sockets and cores per socket are
>>>>>> available so that I can use correct values in my rankfile?
>>>>>> In lam-mpi I had a command "lamnodes" which I could use to get
>>>>>> such information. Thank you very much for any help in advance.
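(One way to check this with the hwloc tools already used in this thread; the exact flags below are illustrative and not taken from the original mail.)

------
lstopo --no-io       # logical tree of sockets, cores and PUs (L# indexes)
lstopo -p --no-io    # the same tree, but showing the physical (P#) indexes
------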
>>>>>>
>>>>>>
>>>>>> Kind regards
>>>>>>
>>>>>> Siegmar
>>>>>>
>>>>>>
>>>>>>
>>>>>>>> Are *all* the machines Sparc? Or just the 3rd one (rs0)?
>>>>>>>
>>>>>>> Yes, both machines are Sparc. I tried first in a homogeneous
>>>>>>> environment.
>>>>>>>
>>>>>>> tyr fd1026 106 psrinfo -v
>>>>>>> Status of virtual processor 0 as of: 09/04/2012 07:32:14
>>>>>>> on-line since 08/31/2012 15:44:42.
>>>>>>> The sparcv9 processor operates at 1600 MHz,
>>>>>>> and has a sparcv9 floating point processor.
>>>>>>> Status of virtual processor 1 as of: 09/04/2012 07:32:14
>>>>>>> on-line since 08/31/2012 15:44:39.
>>>>>>> The sparcv9 processor operates at 1600 MHz,
>>>>>>> and has a sparcv9 floating point processor.
>>>>>>> tyr fd1026 107
>>>>>>>
>>>>>>> My local machine (tyr) is a dual processor machine and the
>>>>>>> other one is equipped with two quad-core processors each
>>>>>>> capable of running two hardware threads.
>>>>>>>
>>>>>>>
>>>>>>> Kind regards
>>>>>>>
>>>>>>> Siegmar
>>>>>>>
>>>>>>>
>>>>>>>> On Sep 3, 2012, at 12:43 PM, Siegmar Gross <Siegmar.Gross_at_[hidden]> wrote:
>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> the man page for "mpiexec" shows the following:
>>>>>>>>>
>>>>>>>>> cat myrankfile
>>>>>>>>> rank 0=aa slot=1:0-2
>>>>>>>>> rank 1=bb slot=0:0,1
>>>>>>>>> rank 2=cc slot=1-2
>>>>>>>>> mpirun -H aa,bb,cc,dd -rf myrankfile ./a.out
>>>>>>>>>
>>>>>>>>> So that
>>>>>>>>>
>>>>>>>>> Rank 0 runs on node aa, bound to socket 1, cores 0-2.
>>>>>>>>> Rank 1 runs on node bb, bound to socket 0, cores 0 and 1.
>>>>>>>>> Rank 2 runs on node cc, bound to cores 1 and 2.
>>>>>>>>>
>>>>>>>>> Does it mean that the process with rank 0 should be bound to
>>>>>>>>> core 0, 1, or 2 of socket 1?
>>>>>>>>>
>>>>>>>>> I tried to use a rankfile and have a problem. My rankfile contains
>>>>>>>>> the following lines.
>>>>>>>>>
>>>>>>>>> rank 0=tyr.informatik.hs-fulda.de slot=0:0
>>>>>>>>> rank 1=tyr.informatik.hs-fulda.de slot=1:0
>>>>>>>>> #rank 2=rs0.informatik.hs-fulda.de slot=0:0
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Everything is fine if I use the file with just my local machine
>>>>>>>>> (the first two lines).
>>>>>>>>>
>>>>>>>>> tyr small_prog 115 mpiexec -report-bindings -rf my_rankfile rank_size
>>>>>>>>> [tyr.informatik.hs-fulda.de:01133] [[9849,0],0]
>>>>>>>>> odls:default:fork binding child [[9849,1],0] to slot_list 0:0
>>>>>>>>> [tyr.informatik.hs-fulda.de:01133] [[9849,0],0]
>>>>>>>>> odls:default:fork binding child [[9849,1],1] to slot_list 1:0
>>>>>>>>> I'm process 0 of 2 available processes running on tyr.informatik.hs-fulda.de.
>>>>>>>>> MPI standard 2.1 is supported.
>>>>>>>>> I'm process 1 of 2 available processes running on tyr.informatik.hs-fulda.de.
>>>>>>>>> MPI standard 2.1 is supported.
>>>>>>>>> tyr small_prog 116
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I can also change the socket number and the processes will be attached
>>>>>>>>> to the correct cores. Unfortunately it doesn't work if I add
>>>>>>>>> another machine (third line).
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> tyr small_prog 112 mpiexec -report-bindings -rf my_rankfile rank_size
>>>>>>>>>
>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>> We were unable to successfully process/set the requested processor
>>>>>>>>> affinity settings:
>>>>>>>>>
>>>>>>>>> Specified slot list: 0:0
>>>>>>>>> Error: Cross-device link
>>>>>>>>>
>>>>>>>>> This could mean that a non-existent processor was specified, or
>>>>>>>>> that the specification had improper syntax.
>>>>>>>>>
>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>> [tyr.informatik.hs-fulda.de:01520] [[10212,0],0]
>>>>>>>>> odls:default:fork binding child [[10212,1],0] to slot_list 0:0
>>>>>>>>> [tyr.informatik.hs-fulda.de:01520] [[10212,0],0]
>>>>>>>>> odls:default:fork binding child [[10212,1],1] to slot_list 1:0
>>>>>>>>> [rs0.informatik.hs-fulda.de:12047] [[10212,0],1]
>>>>>>>>> odls:default:fork binding child [[10212,1],2] to slot_list 0:0
>>>>>>>>> [tyr.informatik.hs-fulda.de:01520] [[10212,0],0]
>>>>>>>>> ORTE_ERROR_LOG: A message is attempting to be sent to a process
>>>>>>>>> whose contact information is unknown in file
>>>>>>>>> ../../../../../openmpi-1.6/orte/mca/rml/oob/rml_oob_send.c at line 145
>>>>>>>>> [tyr.informatik.hs-fulda.de:01520] [[10212,0],0] attempted to send
>>>>>>>>> to [[10212,1],0]: tag 20
>>>>>>>>> [tyr.informatik.hs-fulda.de:01520] [[10212,0],0] ORTE_ERROR_LOG:
>>>>>>>>> A message is attempting to be sent to a process whose contact
>>>>>>>>> information is unknown in file
>>>>>>>>> ../../../../openmpi-1.6/orte/mca/odls/base/odls_base_default_fns.c
>>>>>>>>> at line 2501
>>>>>>>>>
>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>> mpiexec was unable to start the specified application as it
>>>>>>>>> encountered an error:
>>>>>>>>>
>>>>>>>>> Error name: Error 0
>>>>>>>>> Node: rs0.informatik.hs-fulda.de
>>>>>>>>>
>>>>>>>>> when attempting to start process rank 2.
>>>>>>>>>
>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>> tyr small_prog 113
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> The other machine has two 8 core processors.
>>>>>>>>>
>>>>>>>>> tyr small_prog 121 ssh rs0 psrinfo -v
>>>>>>>>> Status of virtual processor 0 as of: 09/03/2012 19:51:15
>>>>>>>>> on-line since 07/26/2012 15:03:14.
>>>>>>>>> The sparcv9 processor operates at 2400 MHz,
>>>>>>>>> and has a sparcv9 floating point processor.
>>>>>>>>> Status of virtual processor 1 as of: 09/03/2012 19:51:15
>>>>>>>>> ...
>>>>>>>>> Status of virtual processor 15 as of: 09/03/2012 19:51:15
>>>>>>>>> on-line since 07/26/2012 15:03:16.
>>>>>>>>> The sparcv9 processor operates at 2400 MHz,
>>>>>>>>> and has a sparcv9 floating point processor.
>>>>>>>>> tyr small_prog 122
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Is it necessary to specify another option on the command line or
>>>>>>>>> is my rankfile faulty? Thank you very much for any suggestions in
>>>>>>>>> advance.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Kind regards
>>>>>>>>>
>>>>>>>>> Siegmar
>>>>>>>>>
>>>>>>>>>

-- 
Jeff Squyres
jsquyres_at_[hidden]
For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/