Hi,
> > are the following outputs helpful to find the error with
> > a rankfile on Solaris?
>
> If you can't bind on the new Solaris machine, then the rankfile
> won't do you any good. It looks like we are getting the incorrect
> number of cores on that machine - is it possible that it has
> hardware threads, and doesn't report "cores"? Can you download
> and run a copy of lstopo to check the output? You get that from
> the hwloc folks:
>
> http://www.open-mpi.org/software/hwloc/v1.5/
I downloaded and installed the package on our machines. Perhaps it is
easier to detect the error if you have more information. Therefore I
provide the different hardware architectures of all machines on which
a simple program breaks if I try to bind processes to sockets or cores.
I tried the following five commands with "h" being one of "tyr", "rs0",
"linpc0", "linpc1", "linpc2", "linpc4", "sunpc0", "sunpc1",
"sunpc2", or "sunpc4" in a shell script file which I started on
my local machine ("tyr"). "works on" means that the small program
(MPI_Init, printf, MPI_Finalize) didn't break. I didn't check if
the layout of the processes was correct.
mpiexec -report-bindings -np 4 -host h init_finalize
works on: tyr, rs0, linpc0, linpc1, linpc2, linpc4, sunpc0, sunpc1,
sunpc2, sunpc4
breaks on: -
mpiexec -report-bindings -np 4 -host h -bind-to-core -bycore init_finalize
works on: linpc2, sunpc1
breaks on: tyr, rs0, linpc0, linpc1, linpc4, sunpc0, sunpc2, sunpc4
mpiexec -report-bindings -np 4 -host h -bind-to-core -bysocket init_finalize
works on: linpc2, sunpc1
breaks on: tyr, rs0, linpc0, linpc1, linpc4, sunpc0, sunpc2, sunpc4
mpiexec -report-bindings -np 4 -host h -bind-to-socket -bycore init_finalize
works on: tyr, linpc1, linpc2, sunpc1, sunpc2
breaks on: rs0, linpc0, linpc4, sunpc0, sunpc4
mpiexec -report-bindings -np 4 -host h -bind-to-socket -bysocket init_finalize
works on: tyr, linpc1, linpc2, sunpc1, sunpc2
breaks on: rs0, linpc0, linpc4, sunpc0, sunpc4
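The test matrix above (ten hosts, five option sets) can be regenerated mechanically. The following is only a sketch in Python, not the shell script that was actually used (that script is not shown here). The function name build_commands and the dry-run printing are assumptions made for illustration; the hosts, options, and program name are taken from the commands listed above.

```python
# Sketch: rebuild the 10 x 5 matrix of mpiexec command lines (dry run only).
HOSTS = ["tyr", "rs0", "linpc0", "linpc1", "linpc2", "linpc4",
         "sunpc0", "sunpc1", "sunpc2", "sunpc4"]

# The five option sets from the list above; the first one requests no binding.
BINDING_OPTIONS = [
    [],
    ["-bind-to-core", "-bycore"],
    ["-bind-to-core", "-bysocket"],
    ["-bind-to-socket", "-bycore"],
    ["-bind-to-socket", "-bysocket"],
]


def build_commands(hosts=HOSTS, program="init_finalize", nprocs=4):
    """Return one argument list per (host, option set) pair."""
    commands = []
    for host in hosts:
        for options in BINDING_OPTIONS:
            commands.append(
                ["mpiexec", "-report-bindings", "-np", str(nprocs),
                 "-host", host, *options, program]
            )
    return commands


if __name__ == "__main__":
    # Print the commands instead of running them, so the sketch is
    # harmless on a machine without Open MPI.
    for argv in build_commands():
        print(" ".join(argv))
```

Each argument list could be handed to subprocess.run() to execute the test and record the exit status per host.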
"lstopo" shows the following hardware configurations for the above
machines. The first line always shows the installed architecture.
"lstopo" does a good job as far as I can see.
tyr:
----
UltraSPARC-IIIi, 2 single core processors, no hardware threads
tyr fd1026 183 lstopo
Machine (4096MB)
NUMANode L#0 (P#2 2048MB) + Socket L#0 + Core L#0 + PU L#0 (P#0)
NUMANode L#1 (P#1 2048MB) + Socket L#1 + Core L#1 + PU L#1 (P#1)
tyr fd1026 116 psrinfo -pv
The physical processor has 1 virtual processor (0)
UltraSPARC-IIIi (portid 0 impl 0x16 ver 0x34 clock 1600 MHz)
The physical processor has 1 virtual processor (1)
UltraSPARC-IIIi (portid 1 impl 0x16 ver 0x34 clock 1600 MHz)
rs0, rs1:
---------
SPARC64-VII, 2 quad-core processors, 2 hardware threads / core
rs0 fd1026 105 lstopo
Machine (32GB) + NUMANode L#0 (P#1 32GB)
Socket L#0
Core L#0
PU L#0 (P#0)
PU L#1 (P#1)
Core L#1
PU L#2 (P#2)
PU L#3 (P#3)
Core L#2
PU L#4 (P#4)
PU L#5 (P#5)
Core L#3
PU L#6 (P#6)
PU L#7 (P#7)
Socket L#1
Core L#4
PU L#8 (P#8)
PU L#9 (P#9)
Core L#5
PU L#10 (P#10)
PU L#11 (P#11)
Core L#6
PU L#12 (P#12)
PU L#13 (P#13)
Core L#7
PU L#14 (P#14)
PU L#15 (P#15)
tyr fd1026 117 ssh rs0 psrinfo -pv
The physical processor has 8 virtual processors (0-7)
SPARC64-VII (portid 1024 impl 0x7 ver 0x91 clock 2400 MHz)
The physical processor has 8 virtual processors (8-15)
SPARC64-VII (portid 1032 impl 0x7 ver 0x91 clock 2400 MHz)
linpc0, linpc3:
---------------
AMD Athlon64 X2, 1 dual-core processor, no hardware threads
linpc0 fd1026 102 lstopo
Machine (4023MB) + Socket L#0
L2 L#0 (512KB) + L1d L#0 (64KB) + L1i L#0 (64KB) + Core L#0 + PU L#0 (P#0)
L2 L#1 (512KB) + L1d L#1 (64KB) + L1i L#1 (64KB) + Core L#1 + PU L#1 (P#1)
It is strange that openSuSE-Linux-12.1 thinks that two
dual-core processors are available although the machines
are only equipped with one processor.
linpc0 fd1026 104 cat /proc/cpuinfo | grep -e processor -e "cpu core"
processor : 0
cpu cores : 2
processor : 1
cpu cores : 2
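A note on reading this output: in /proc/cpuinfo every "processor" entry is one logical CPU, and "cpu cores" is the number of cores in the physical package that this logical CPU belongs to. Two entries with "cpu cores : 2" are therefore consistent with a single dual-core processor. Packages are distinguished by the "physical id" field, which the grep above does not select. The sketch below counts packages from such text. The sample input is hypothetical (its "physical id" lines were not captured on these machines), and the function name summarize_cpuinfo is made up for illustration.

```python
# Sketch: summarize /proc/cpuinfo text by logical CPUs and physical packages.
def summarize_cpuinfo(text):
    logical_cpus = 0
    package_ids = set()
    cores_per_package = set()
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # skip blank lines and lines without "key : value"
        key, value = key.strip(), value.strip()
        if key == "processor":
            logical_cpus += 1          # one entry per logical CPU
        elif key == "physical id":
            package_ids.add(value)     # one distinct id per socket
        elif key == "cpu cores":
            cores_per_package.add(int(value))
    return {
        "logical_cpus": logical_cpus,
        "packages": len(package_ids),
        "cores_per_package": sorted(cores_per_package),
    }


# Hypothetical sample: the "physical id" values are assumed, not measured.
SAMPLE = """processor : 0
physical id : 0
cpu cores : 2
processor : 1
physical id : 0
cpu cores : 2
"""

if __name__ == "__main__":
    print(summarize_cpuinfo(SAMPLE))
```

Running grep with an additional -e "physical id" on the real machines would show whether the kernel reports one or two packages.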
linpc1:
-------
Intel Xeon, 2 single core processors, no hardware threads
linpc1 fd1026 104 lstopo
Machine (3829MB)
Socket L#0 + Core L#0 + PU L#0 (P#0)
Socket L#1 + Core L#1 + PU L#1 (P#1)
tyr fd1026 118 ssh linpc1 cat /proc/cpuinfo | grep -e processor -e "cpu core"
processor : 0
cpu cores : 1
processor : 1
cpu cores : 1
linpc2:
-------
AMD Opteron 280, 2 dual-core processors, no hardware threads
linpc2 fd1026 103 lstopo
Machine (8190MB)
NUMANode L#0 (P#0 4094MB) + Socket L#0
L2 L#0 (1024KB) + L1d L#0 (64KB) + L1i L#0 (64KB) + Core L#0 + PU L#0 (P#0)
L2 L#1 (1024KB) + L1d L#1 (64KB) + L1i L#1 (64KB) + Core L#1 + PU L#1 (P#1)
NUMANode L#1 (P#1 4096MB) + Socket L#1
L2 L#2 (1024KB) + L1d L#2 (64KB) + L1i L#2 (64KB) + Core L#2 + PU L#2 (P#2)
L2 L#3 (1024KB) + L1d L#3 (64KB) + L1i L#3 (64KB) + Core L#3 + PU L#3 (P#3)
It is strange that openSuSE-Linux-12.1 thinks that four
dual-core processors are available although the machine
is only equipped with two processors.
linpc2 fd1026 104 cat /proc/cpuinfo | grep -e processor -e "cpu core"
processor : 0
cpu cores : 2
processor : 1
cpu cores : 2
processor : 2
cpu cores : 2
processor : 3
cpu cores : 2
linpc4:
-------
AMD Opteron 1218, 1 dual-core processor, no hardware threads
linpc4 fd1026 100 lstopo
Machine (4024MB) + Socket L#0
L2 L#0 (1024KB) + L1d L#0 (64KB) + L1i L#0 (64KB) + Core L#0 + PU L#0 (P#0)
L2 L#1 (1024KB) + L1d L#1 (64KB) + L1i L#1 (64KB) + Core L#1 + PU L#1 (P#1)
It is strange that openSuSE-Linux-12.1 thinks that two
dual-core processors are available although the machine
is only equipped with one processor.
tyr fd1026 230 ssh linpc4 cat /proc/cpuinfo | grep -e processor -e "cpu core"
processor : 0
cpu cores : 2
processor : 1
cpu cores : 2
sunpc0, sunpc3:
---------------
AMD Athlon64 X2, 1 dual-core processor, no hardware threads
sunpc0 fd1026 104 lstopo
Machine (4094MB) + NUMANode L#0 (P#0 4094MB) + Socket L#0
Core L#0 + PU L#0 (P#0)
Core L#1 + PU L#1 (P#1)
tyr fd1026 111 ssh sunpc0 psrinfo -pv
The physical processor has 2 virtual processors (0 1)
x86 (chipid 0x0 AuthenticAMD family 15 model 43 step 1 clock 2000 MHz)
AMD Athlon(tm) 64 X2 Dual Core Processor 3800+
sunpc1:
-------
AMD Opteron 280, 2 dual-core processors, no hardware threads
sunpc1 fd1026 104 lstopo
Machine (8191MB)
NUMANode L#0 (P#1 4095MB) + Socket L#0
Core L#0 + PU L#0 (P#0)
Core L#1 + PU L#1 (P#1)
NUMANode L#1 (P#2 4096MB) + Socket L#1
Core L#2 + PU L#2 (P#2)
Core L#3 + PU L#3 (P#3)
tyr fd1026 112 ssh sunpc1 psrinfo -pv
The physical processor has 2 virtual processors (0 1)
x86 (chipid 0x0 AuthenticAMD family 15 model 33 step 2 clock 2411 MHz)
Dual Core AMD Opteron(tm) Processor 280
The physical processor has 2 virtual processors (2 3)
x86 (chipid 0x1 AuthenticAMD family 15 model 33 step 2 clock 2411 MHz)
Dual Core AMD Opteron(tm) Processor 280
sunpc2:
-------
Intel Xeon, 2 single core processors, no hardware threads
sunpc2 fd1026 104 lstopo
Machine (3904MB) + NUMANode L#0 (P#0 3904MB)
Socket L#0 + Core L#0 + PU L#0 (P#0)
Socket L#1 + Core L#1 + PU L#1 (P#1)
tyr fd1026 114 ssh sunpc2 psrinfo -pv
The physical processor has 1 virtual processor (0)
x86 (chipid 0x0 GenuineIntel family 15 model 2 step 9 clock 2791 MHz)
Intel(r) Xeon(tm) CPU 2.80GHz
The physical processor has 1 virtual processor (1)
x86 (chipid 0x3 GenuineIntel family 15 model 2 step 9 clock 2791 MHz)
Intel(r) Xeon(tm) CPU 2.80GHz
sunpc4:
-------
AMD Opteron 1218, 1 dual-core processor, no hardware threads
sunpc4 fd1026 104 lstopo
Machine (4096MB) + NUMANode L#0 (P#0 4096MB) + Socket L#0
Core L#0 + PU L#0 (P#0)
Core L#1 + PU L#1 (P#1)
tyr fd1026 115 ssh sunpc4 psrinfo -pv
The physical processor has 2 virtual processors (0 1)
x86 (chipid 0x0 AuthenticAMD family 15 model 67 step 2 clock 2613 MHz)
Dual-Core AMD Opteron(tm) Processor 1218
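To compare the machines above at a glance, the object counts can be pulled out of the lstopo text output. This is a minimal sketch that assumes the text format shown above (object name followed by "L#" and a logical index); it does not use the hwloc API, and the names OBJECT_PATTERN and count_topology_objects are made up for illustration.

```python
# Sketch: count NUMANode/Socket/Core/PU objects in lstopo text output.
import re
from collections import Counter

# Matches e.g. "Socket L#0" or "PU L#15"; cache objects (L2, L1d, L1i)
# are deliberately not counted.
OBJECT_PATTERN = re.compile(r"\b(NUMANode|Socket|Core|PU) L#\d+")


def count_topology_objects(lstopo_text):
    """Return a Counter mapping object type to number of occurrences."""
    return Counter(m.group(1) for m in OBJECT_PATTERN.finditer(lstopo_text))


# Output copied from the "tyr" and "linpc0" listings above.
TYR = """Machine (4096MB)
NUMANode L#0 (P#2 2048MB) + Socket L#0 + Core L#0 + PU L#0 (P#0)
NUMANode L#1 (P#1 2048MB) + Socket L#1 + Core L#1 + PU L#1 (P#1)
"""

LINPC0 = """Machine (4023MB) + Socket L#0
L2 L#0 (512KB) + L1d L#0 (64KB) + L1i L#0 (64KB) + Core L#0 + PU L#0 (P#0)
L2 L#1 (512KB) + L1d L#1 (64KB) + L1i L#1 (64KB) + Core L#1 + PU L#1 (P#1)
"""

if __name__ == "__main__":
    for name, text in (("tyr", TYR), ("linpc0", LINPC0)):
        print(name, dict(count_topology_objects(text)))
```

With these counts it is easy to see that every machine except rs0 and the two Opteron 280 machines (linpc2, sunpc1) offers only two PUs, while all tests were started with -np 4.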
Among others, I got the following error messages (I can provide
the complete file if you are interested in it).
##################
##################
mpiexec -report-bindings -np 4 -host tyr -bind-to-core -bycore init_finalize
[tyr.informatik.hs-fulda.de:23208] [[30908,0],0] odls:default:fork binding child
[[30908,1],2] to cpus 0004
--------------------------------------------------------------------------
An attempt to set processor affinity has failed - please check to
ensure that your system supports such functionality. If so, then
this is probably something that should be reported to the OMPI developers.
--------------------------------------------------------------------------
[tyr.informatik.hs-fulda.de:23208] [[30908,0],0] odls:default:fork binding child
[[30908,1],0] to cpus 0001
[tyr.informatik.hs-fulda.de:23208] [[30908,0],0] odls:default:fork binding child
[[30908,1],1] to cpus 0002
--------------------------------------------------------------------------
mpiexec was unable to start the specified application as it encountered an error
on node tyr.informatik.hs-fulda.de. More information may be available above.
--------------------------------------------------------------------------
4 total processes failed to start
##################
##################
mpiexec -report-bindings -np 4 -host tyr -bind-to-core -bysocket init_finalize
--------------------------------------------------------------------------
An invalid physical processor ID was returned when attempting to bind
an MPI process to a unique processor.
This usually means that you requested binding to more processors than
exist (e.g., trying to bind N MPI processes to M processors, where N >
M). Double check that you have enough unique processors for all the
MPI processes that you are launching on this host.
You job will now abort.
--------------------------------------------------------------------------
[tyr.informatik.hs-fulda.de:23215] [[30907,0],0] odls:default:fork binding child
[[30907,1],0] to socket 0 cpus 0001
[tyr.informatik.hs-fulda.de:23215] [[30907,0],0] odls:default:fork binding child
[[30907,1],1] to socket 1 cpus 0002
--------------------------------------------------------------------------
mpiexec was unable to start the specified application as it encountered an error
on node tyr.informatik.hs-fulda.de. More information may be available above.
--------------------------------------------------------------------------
4 total processes failed to start
##################
##################
mpiexec -report-bindings -np 4 -host rs0 -bind-to-core -bycore init_finalize
--------------------------------------------------------------------------
An attempt to set processor affinity has failed - please check to
ensure that your system supports such functionality. If so, then
this is probably something that should be reported to the OMPI developers.
--------------------------------------------------------------------------
[rs0.informatik.hs-fulda.de:05715] [[30936,0],1] odls:default:fork binding child
[[30936,1],0] to cpus 0001
--------------------------------------------------------------------------
mpiexec was unable to start the specified application as it encountered an
error:
Error name: Resource temporarily unavailable
Node: rs0
when attempting to start process rank 0.
--------------------------------------------------------------------------
4 total processes failed to start
##################
##################
mpiexec -report-bindings -np 4 -host rs0 -bind-to-core -bysocket init_finalize
--------------------------------------------------------------------------
An attempt to set processor affinity has failed - please check to
ensure that your system supports such functionality. If so, then
this is probably something that should be reported to the OMPI developers.
--------------------------------------------------------------------------
[rs0.informatik.hs-fulda.de:05743] [[30916,0],1] odls:default:fork binding child
[[30916,1],0] to socket 0 cpus 0001
--------------------------------------------------------------------------
mpiexec was unable to start the specified application as it encountered an
error:
Error name: Resource temporarily unavailable
Node: rs0
when attempting to start process rank 0.
--------------------------------------------------------------------------
4 total processes failed to start
##################
##################
mpiexec -report-bindings -np 4 -host rs0 -bind-to-socket -bycore init_finalize
--------------------------------------------------------------------------
An attempt to set processor affinity has failed - please check to
ensure that your system supports such functionality. If so, then
this is probably something that should be reported to the OMPI developers.
--------------------------------------------------------------------------
[rs0.informatik.hs-fulda.de:05771] [[30912,0],1] odls:default:fork binding child
[[30912,1],0] to socket 0 cpus 0055
--------------------------------------------------------------------------
mpiexec was unable to start the specified application as it encountered an
error:
Error name: Resource temporarily unavailable
Node: rs0
when attempting to start process rank 0.
--------------------------------------------------------------------------
4 total processes failed to start
##################
##################
mpiexec -report-bindings -np 4 -host rs0 -bind-to-socket -bysocket init_finalize
--------------------------------------------------------------------------
An attempt to set processor affinity has failed - please check to
ensure that your system supports such functionality. If so, then
this is probably something that should be reported to the OMPI developers.
--------------------------------------------------------------------------
[rs0.informatik.hs-fulda.de:05799] [[30924,0],1] odls:default:fork binding child
[[30924,1],0] to socket 0 cpus 0055
--------------------------------------------------------------------------
mpiexec was unable to start the specified application as it encountered an
error:
Error name: Resource temporarily unavailable
Node: rs0
when attempting to start process rank 0.
--------------------------------------------------------------------------
4 total processes failed to start
##################
##################
mpiexec -report-bindings -np 4 -host linpc0 -bind-to-core -bycore init_finalize
--------------------------------------------------------------------------
An attempt to set processor affinity has failed - please check to
ensure that your system supports such functionality. If so, then
this is probably something that should be reported to the OMPI developers.
--------------------------------------------------------------------------
[linpc0:02275] [[30964,0],1] odls:default:fork binding child [[30964,1],0] to
cpus 0001
[linpc0:02275] [[30964,0],1] odls:default:fork binding child [[30964,1],1] to
cpus 0002
[linpc0:02275] [[30964,0],1] odls:default:fork binding child [[30964,1],2] to
cpus 0004
--------------------------------------------------------------------------
mpiexec was unable to start the specified application as it encountered an error
on node linpc0. More information may be available above.
--------------------------------------------------------------------------
4 total processes failed to start
##################
##################
mpiexec -report-bindings -np 4 -host linpc0 -bind-to-core -bysocket init_finalize
--------------------------------------------------------------------------
An invalid physical processor ID was returned when attempting to bind
an MPI process to a unique processor.
This usually means that you requested binding to more processors than
exist (e.g., trying to bind N MPI processes to M processors, where N >
M). Double check that you have enough unique processors for all the
MPI processes that you are launching on this host.
You job will now abort.
--------------------------------------------------------------------------
[linpc0:02326] [[30960,0],1] odls:default:fork binding child [[30960,1],0] to
socket 0 cpus 0001
[linpc0:02326] [[30960,0],1] odls:default:fork binding child [[30960,1],1] to
socket 0 cpus 0002
--------------------------------------------------------------------------
mpiexec was unable to start the specified application as it encountered an error
on node linpc0. More information may be available above.
--------------------------------------------------------------------------
4 total processes failed to start
##################
##################
mpiexec -report-bindings -np 4 -host linpc0 -bind-to-socket -bycore init_finalize
--------------------------------------------------------------------------
Unable to bind to socket 0 on node linpc0.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpiexec was unable to start the specified application as it encountered an
error:
Error name: Fatal
Node: linpc0
when attempting to start process rank 0.
--------------------------------------------------------------------------
4 total processes failed to start
##################
##################
mpiexec -report-bindings -np 4 -host linpc0 -bind-to-socket -bysocket init_finalize
--------------------------------------------------------------------------
Unable to bind to socket 0 on node linpc0.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpiexec was unable to start the specified application as it encountered an
error:
Error name: Fatal
Node: linpc0
when attempting to start process rank 0.
--------------------------------------------------------------------------
4 total processes failed to start
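The "cpus" values in the -report-bindings lines above look like hexadecimal bitmasks of processor indices. That reading is an assumption and has not been checked against the Open MPI source. Under it, "cpus 0004" selects processor 2, which lstopo does not list on tyr or linpc0 (only P#0 and P#1 exist there), and "cpus 0055" on rs0 selects processors 0, 2, 4 and 6. The sketch below decodes such a mask; the function names decode_cpu_mask and cpus_outside are made up for illustration.

```python
# Sketch: decode a "cpus XXXX" value, assuming it is a hexadecimal bitmask
# in which bit i stands for processor i.
def decode_cpu_mask(mask_text):
    """Return the sorted list of processor indices set in the mask."""
    mask = int(mask_text, 16)
    return [bit for bit in range(mask.bit_length()) if (mask >> bit) & 1]


def cpus_outside(mask_text, available_cpus):
    """Return the processors named by the mask that the host does not have."""
    return [cpu for cpu in decode_cpu_mask(mask_text)
            if cpu not in available_cpus]


if __name__ == "__main__":
    tyr_pus = {0, 1}  # P#0 and P#1 from the lstopo output for tyr
    for mask in ("0001", "0002", "0004", "0055"):
        print(mask, decode_cpu_mask(mask), cpus_outside(mask, tyr_pus))
```

If the assumption holds, the failures with -bind-to-core on the two-PU machines match the hint in the error text: four processes are being bound to more unique processors than the host provides.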
Hopefully this helps to track down the error. Thank you very much
in advance for your help.
Kind regards
Siegmar
> > I wrapped long lines so that they
> > are easier to read. Have you had time to look at the
> > segmentation fault with a rankfile which I reported in my
> > last email (see below)?
>
> I'm afraid not - been too busy lately. I'd suggest first focusing
> on getting binding to work.
>
> >
> > "tyr" is a two processor single core machine.
> >
> > tyr fd1026 116 mpiexec -report-bindings -np 4 \
> > -bind-to-socket -bycore rank_size
> > [tyr.informatik.hs-fulda.de:18614] [[27298,0],0] odls:default:
> > fork binding child [[27298,1],0] to socket 0 cpus 0001
> > [tyr.informatik.hs-fulda.de:18614] [[27298,0],0] odls:default:
> > fork binding child [[27298,1],1] to socket 1 cpus 0002
> > [tyr.informatik.hs-fulda.de:18614] [[27298,0],0] odls:default:
> > fork binding child [[27298,1],2] to socket 0 cpus 0001
> > [tyr.informatik.hs-fulda.de:18614] [[27298,0],0] odls:default:
> > fork binding child [[27298,1],3] to socket 1 cpus 0002
> > I'm process 0 of 4 ...
> >
> >
> > tyr fd1026 121 mpiexec -report-bindings -np 4 \
> > -bind-to-socket -bysocket rank_size
> > [tyr.informatik.hs-fulda.de:18656] [[27380,0],0] odls:default:
> > fork binding child [[27380,1],0] to socket 0 cpus 0001
> > [tyr.informatik.hs-fulda.de:18656] [[27380,0],0] odls:default:
> > fork binding child [[27380,1],1] to socket 1 cpus 0002
> > [tyr.informatik.hs-fulda.de:18656] [[27380,0],0] odls:default:
> > fork binding child [[27380,1],2] to socket 0 cpus 0001
> > [tyr.informatik.hs-fulda.de:18656] [[27380,0],0] odls:default:
> > fork binding child [[27380,1],3] to socket 1 cpus 0002
> > I'm process 0 of 4 ...
> >
> >
> > tyr fd1026 117 mpiexec -report-bindings -np 4 \
> > -bind-to-core -bycore rank_size
> > [tyr.informatik.hs-fulda.de:18623] [[27307,0],0] odls:default:
> > fork binding child [[27307,1],2] to cpus 0004
> > ------------------------------------------------------------------
> > An attempt to set processor affinity has failed - please check to
> > ensure that your system supports such functionality. If so, then
> > this is probably something that should be reported to the OMPI
> > developers.
> > ------------------------------------------------------------------
> > [tyr.informatik.hs-fulda.de:18623] [[27307,0],0] odls:default:
> > fork binding child [[27307,1],0] to cpus 0001
> > [tyr.informatik.hs-fulda.de:18623] [[27307,0],0] odls:default:
> > fork binding child [[27307,1],1] to cpus 0002
> > ------------------------------------------------------------------
> > mpiexec was unable to start the specified application
> > as it encountered an error
> > on node tyr.informatik.hs-fulda.de. More information may be
> > available above.
> > ------------------------------------------------------------------
> > 4 total processes failed to start
> >
> >
> >
> > tyr fd1026 118 mpiexec -report-bindings -np 4 \
> > -bind-to-core -bysocket rank_size
> > ------------------------------------------------------------------
> > An invalid physical processor ID was returned when attempting to
> > bind
> > an MPI process to a unique processor.
> >
> > This usually means that you requested binding to more processors
> > than
> >
> > exist (e.g., trying to bind N MPI processes to M processors,
> > where N >
> > M). Double check that you have enough unique processors for
> > all the
> > MPI processes that you are launching on this host.
> >
> > You job will now abort.
> > ------------------------------------------------------------------
> > [tyr.informatik.hs-fulda.de:18631] [[27347,0],0] odls:default:
> > fork binding child [[27347,1],0] to socket 0 cpus 0001
> > [tyr.informatik.hs-fulda.de:18631] [[27347,0],0] odls:default:
> > fork binding child [[27347,1],1] to socket 1 cpus 0002
> > ------------------------------------------------------------------
> > mpiexec was unable to start the specified application as it
> > encountered an error
> > on node tyr.informatik.hs-fulda.de. More information may be
> > available above.
> > ------------------------------------------------------------------
> > 4 total processes failed to start
> > tyr fd1026 119
> >
> >
> >
> > "linpc3" and "linpc4" are two processor dual core machines.
> >
> > linpc4 fd1026 102 mpiexec -report-bindings -host linpc3,linpc4 \
> > -np 4 -bind-to-core -bycore rank_size
> > [linpc4:16842] [[40914,0],0] odls:default:
> > fork binding child [[40914,1],1] to cpus 0001
> > [linpc4:16842] [[40914,0],0] odls:default:
> > fork binding child [[40914,1],3] to cpus 0002
> > [linpc3:31384] [[40914,0],1] odls:default:
> > fork binding child [[40914,1],0] to cpus 0001
> > [linpc3:31384] [[40914,0],1] odls:default:
> > fork binding child [[40914,1],2] to cpus 0002
> > I'm process 1 of 4 ...
> >
> >
> > linpc4 fd1026 102 mpiexec -report-bindings -host linpc3,linpc4 \
> > -np 4 -bind-to-core -bysocket rank_size
> > [linpc4:16846] [[40918,0],0] odls:default:
> > fork binding child [[40918,1],1] to socket 0 cpus 0001
> > [linpc4:16846] [[40918,0],0] odls:default:
> > fork binding child [[40918,1],3] to socket 0 cpus 0002
> > [linpc3:31435] [[40918,0],1] odls:default:
> > fork binding child [[40918,1],0] to socket 0 cpus 0001
> > [linpc3:31435] [[40918,0],1] odls:default:
> > fork binding child [[40918,1],2] to socket 0 cpus 0002
> > I'm process 1 of 4 ...
> >
> >
> >
> >
> > linpc4 fd1026 104 mpiexec -report-bindings -host linpc3,linpc4 \
> > -np 4 -bind-to-socket -bycore rank_size
> > ------------------------------------------------------------------
> > Unable to bind to socket 0 on node linpc3.
> > ------------------------------------------------------------------
> > ------------------------------------------------------------------
> > Unable to bind to socket 0 on node linpc4.
> > ------------------------------------------------------------------
> > ------------------------------------------------------------------
> > mpiexec was unable to start the specified application as it
> > encountered an error:
> >
> > Error name: Fatal
> > Node: linpc4
> >
> > when attempting to start process rank 1.
> > ------------------------------------------------------------------
> > 4 total processes failed to start
> > linpc4 fd1026 105
> >
> >
> > linpc4 fd1026 105 mpiexec -report-bindings -host linpc3,linpc4 \
> > -np 4 -bind-to-socket -bysocket rank_size
> > ------------------------------------------------------------------
> > Unable to bind to socket 0 on node linpc4.
> > ------------------------------------------------------------------
> > ------------------------------------------------------------------
> > Unable to bind to socket 0 on node linpc3.
> > ------------------------------------------------------------------
> > ------------------------------------------------------------------
> > mpiexec was unable to start the specified application as it
> > encountered an error:
> >
> > Error name: Fatal
> > Node: linpc4
> >
> > when attempting to start process rank 1.
> > --------------------------------------------------------------------------
> > 4 total processes failed to start
> >
> >
> > It's interesting that commands that work on Solaris fail on Linux
> > and vice versa.
> >
> >
> > Kind regards
> >
> > Siegmar
> >
> >>> I couldn't really say for certain - I don't see anything obviously
> >>> wrong with your syntax, and the code appears to be working or else
> >>> it would fail on the other nodes as well. The fact that it fails
> >>> solely on that machine seems suspect.
> >>>
> >>> Set aside the rankfile for the moment and try to just bind to cores
> >>> on that machine, something like:
> >>>
> >>> mpiexec --report-bindings -bind-to-core
> >>> -host rs0.informatik.hs-fulda.de -n 2 rank_size
> >>>
> >>> If that doesn't work, then the problem isn't with rankfile
> >>
> >> It doesn't work but I found out something else as you can see below.
> >> I get a segmentation fault for some rankfiles.
> >>
> >>
> >> tyr small_prog 110 mpiexec --report-bindings -bind-to-core
> >> -host rs0.informatik.hs-fulda.de -n 2 rank_size
> >> --------------------------------------------------------------------------
> >> An attempt to set processor affinity has failed - please check to
> >> ensure that your system supports such functionality. If so, then
> >> this is probably something that should be reported to the OMPI developers.
> >> --------------------------------------------------------------------------
> >> [rs0.informatik.hs-fulda.de:14695] [[30561,0],1] odls:default:
> >> fork binding child [[30561,1],0] to cpus 0001
> >> --------------------------------------------------------------------------
> >> mpiexec was unable to start the specified application as it
> >> encountered an error:
> >>
> >> Error name: Resource temporarily unavailable
> >> Node: rs0.informatik.hs-fulda.de
> >>
> >> when attempting to start process rank 0.
> >> --------------------------------------------------------------------------
> >> 2 total processes failed to start
> >> tyr small_prog 111
> >>
> >>
> >>
> >>
> >> Perhaps I have a hint for the error on Solaris Sparc. I use the
> >> following rankfile to keep everything simple.
> >>
> >> rank 0=tyr.informatik.hs-fulda.de slot=0:0
> >> rank 1=linpc0.informatik.hs-fulda.de slot=0:0
> >> rank 2=linpc1.informatik.hs-fulda.de slot=0:0
> >> #rank 3=linpc2.informatik.hs-fulda.de slot=0:0
> >> rank 4=linpc3.informatik.hs-fulda.de slot=0:0
> >> rank 5=linpc4.informatik.hs-fulda.de slot=0:0
> >> rank 6=sunpc0.informatik.hs-fulda.de slot=0:0
> >> rank 7=sunpc1.informatik.hs-fulda.de slot=0:0
> >> rank 8=sunpc2.informatik.hs-fulda.de slot=0:0
> >> rank 9=sunpc3.informatik.hs-fulda.de slot=0:0
> >> rank 10=sunpc4.informatik.hs-fulda.de slot=0:0
> >>
> >> When I execute "mpiexec -report-bindings -rf my_rankfile rank_size"
> >> on a Linux-x86_64 or Solaris-10-x86_64 machine everything works fine.
> >>
> >> linpc4 small_prog 104 mpiexec -report-bindings -rf my_rankfile rank_size
> >> [linpc4:08018] [[49482,0],0] odls:default:fork binding child
> >> [[49482,1],5] to slot_list 0:0
> >> [linpc3:22030] [[49482,0],4] odls:default:fork binding child
> >> [[49482,1],4] to slot_list 0:0
> >> [linpc0:12887] [[49482,0],2] odls:default:fork binding child
> >> [[49482,1],1] to slot_list 0:0
> >> [linpc1:08323] [[49482,0],3] odls:default:fork binding child
> >> [[49482,1],2] to slot_list 0:0
> >> [sunpc1:17786] [[49482,0],6] odls:default:fork binding child
> >> [[49482,1],7] to slot_list 0:0
> >> [sunpc3.informatik.hs-fulda.de:08482] [[49482,0],8] odls:default:fork
> >> binding child [[49482,1],9] to slot_list 0:0
> >> [sunpc0.informatik.hs-fulda.de:11568] [[49482,0],5] odls:default:fork
> >> binding child [[49482,1],6] to slot_list 0:0
> >> [tyr.informatik.hs-fulda.de:21484] [[49482,0],1] odls:default:fork
> >> binding child [[49482,1],0] to slot_list 0:0
> >> [sunpc2.informatik.hs-fulda.de:28638] [[49482,0],7] odls:default:fork
> >> binding child [[49482,1],8] to slot_list 0:0
> >> ...
> >>
> >>
> >>
> >> I get a segmentation fault when I run it on my local machine
> >> (Solaris Sparc).
> >>
> >> tyr small_prog 141 mpiexec -report-bindings -rf my_rankfile rank_size
> >> [tyr.informatik.hs-fulda.de:21421] [[29113,0],0] ORTE_ERROR_LOG:
> >> Data unpack would read past end of buffer in file
> >> ../../../../openmpi-1.6/orte/mca/odls/base/odls_base_default_fns.c
> >> at line 927
> >> [tyr:21421] *** Process received signal ***
> >> [tyr:21421] Signal: Segmentation Fault (11)
> >> [tyr:21421] Signal code: Address not mapped (1)
> >> [tyr:21421] Failing at address: 5ba
> >> /export2/prog/SunOS_sparc/openmpi-1.6_32_cc/lib/libopen-rte.so.4.0.0:0x15d3ec
> >> /lib/libc.so.1:0xcad04
> >> /lib/libc.so.1:0xbf3b4
> >> /lib/libc.so.1:0xbf59c
> >> /lib/libc.so.1:0x58bd0 [ Signal 11 (SEGV)]
> >> /lib/libc.so.1:free+0x24
> >> /export2/prog/SunOS_sparc/openmpi-1.6_32_cc/lib/libopen-rte.so.4.0.0:
> >> orte_odls_base_default_construct_child_list+0x1234
> >> /export2/prog/SunOS_sparc/openmpi-1.6_32_cc/lib/openmpi/
> >> mca_odls_default.so:0x90b8
> >> /export2/prog/SunOS_sparc/openmpi-1.6_32_cc/lib/libopen-rte.so.4.0.0:0x5e8d4
> >> /export2/prog/SunOS_sparc/openmpi-1.6_32_cc/lib/libopen-rte.so.4.0.0:
> >> orte_daemon_cmd_processor+0x328
> >> /export2/prog/SunOS_sparc/openmpi-1.6_32_cc/lib/libopen-rte.so.4.0.0:0x12e324
> >> /export2/prog/SunOS_sparc/openmpi-1.6_32_cc/lib/libopen-rte.so.4.0.0:
> >> opal_event_base_loop+0x228
> >> /export2/prog/SunOS_sparc/openmpi-1.6_32_cc/lib/libopen-rte.so.4.0.0:
> >> opal_progress+0xec
> >> /export2/prog/SunOS_sparc/openmpi-1.6_32_cc/lib/libopen-rte.so.4.0.0:
> >> orte_plm_base_report_launched+0x1c4
> >> /export2/prog/SunOS_sparc/openmpi-1.6_32_cc/lib/libopen-rte.so.4.0.0:
> >> orte_plm_base_launch_apps+0x318
> >> /export2/prog/SunOS_sparc/openmpi-1.6_32_cc/lib/openmpi/mca_plm_rsh.so:
> >> orte_plm_rsh_launch+0xac4
> >> /export2/prog/SunOS_sparc/openmpi-1.6_32_cc/bin/orterun:orterun+0x16a8
> >> /export2/prog/SunOS_sparc/openmpi-1.6_32_cc/bin/orterun:main+0x24
> >> /export2/prog/SunOS_sparc/openmpi-1.6_32_cc/bin/orterun:_start+0xd8
> >> [tyr:21421] *** End of error message ***
> >> Segmentation fault
> >> tyr small_prog 142
> >>
> >>
> >> The funny thing is that I get a segmentation fault on the Linux
> >> machine as well if I change my rankfile in the following way.
> >>
> >> rank 0=tyr.informatik.hs-fulda.de slot=0:0
> >> rank 1=linpc0.informatik.hs-fulda.de slot=0:0
> >> #rank 2=linpc1.informatik.hs-fulda.de slot=0:0
> >> #rank 3=linpc2.informatik.hs-fulda.de slot=0:0
> >> #rank 4=linpc3.informatik.hs-fulda.de slot=0:0
> >> rank 5=linpc4.informatik.hs-fulda.de slot=0:0
> >> rank 6=sunpc0.informatik.hs-fulda.de slot=0:0
> >> #rank 7=sunpc1.informatik.hs-fulda.de slot=0:0
> >> #rank 8=sunpc2.informatik.hs-fulda.de slot=0:0
> >> #rank 9=sunpc3.informatik.hs-fulda.de slot=0:0
> >> rank 10=sunpc4.informatik.hs-fulda.de slot=0:0
> >>
> >>
> >> linpc4 small_prog 107 mpiexec -report-bindings -rf my_rankfile rank_size
> >> [linpc4:08402] [[65226,0],0] ORTE_ERROR_LOG: Data unpack would
> >> read past end of buffer in file
> >> ../../../../openmpi-1.6/orte/mca/odls/base/odls_base_default_fns.c
> >> at line 927
> >> [linpc4:08402] *** Process received signal ***
> >> [linpc4:08402] Signal: Segmentation fault (11)
> >> [linpc4:08402] Signal code: Address not mapped (1)
> >> [linpc4:08402] Failing at address: 0x5f32fffc
> >> [linpc4:08402] [ 0] [0xffffe410]
> >> [linpc4:08402] [ 1] /usr/local/openmpi-1.6_32_cc/lib/openmpi/
> >> mca_odls_default.so(+0x4023) [0xf73ec023]
> >> [linpc4:08402] [ 2] /usr/local/openmpi-1.6_32_cc/lib/
> >> libopen-rte.so.4(+0x42b91) [0xf7667b91]
> >> [linpc4:08402] [ 3] /usr/local/openmpi-1.6_32_cc/lib/
> >> libopen-rte.so.4(orte_daemon_cmd_processor+0x313) [0xf76655c3]
> >> [linpc4:08402] [ 4] /usr/local/openmpi-1.6_32_cc/lib/
> >> libopen-rte.so.4(+0x8f366) [0xf76b4366]
> >> [linpc4:08402] [ 5] /usr/local/openmpi-1.6_32_cc/lib/
> >> libopen-rte.so.4(opal_event_base_loop+0x18c) [0xf76b46bc]
> >> [linpc4:08402] [ 6] /usr/local/openmpi-1.6_32_cc/lib/
> >> libopen-rte.so.4(opal_event_loop+0x26) [0xf76b4526]
> >> [linpc4:08402] [ 7] /usr/local/openmpi-1.6_32_cc/lib/
> >> libopen-rte.so.4(opal_progress+0xba) [0xf769303a]
> >> [linpc4:08402] [ 8] /usr/local/openmpi-1.6_32_cc/lib/
> >> libopen-rte.so.4(orte_plm_base_report_launched+0x13f) [0xf767d62f]
> >> [linpc4:08402] [ 9] /usr/local/openmpi-1.6_32_cc/lib/
> >> libopen-rte.so.4(orte_plm_base_launch_apps+0x1b7) [0xf767bf27]
> >> [linpc4:08402] [10] /usr/local/openmpi-1.6_32_cc/lib/openmpi/
> >> mca_plm_rsh.so(orte_plm_rsh_launch+0xb2d) [0xf74228fd]
> >> [linpc4:08402] [11] mpiexec(orterun+0x102f) [0x804e7bf]
> >> [linpc4:08402] [12] mpiexec(main+0x13) [0x804c273]
> >> [linpc4:08402] [13] /lib/libc.so.6(__libc_start_main+0xf3) [0xf745e003]
> >> [linpc4:08402] *** End of error message ***
> >> Segmentation fault
> >> linpc4 small_prog 107
> >>
> >>
> >> Hopefully this information helps to fix the problem.
> >>
> >>
> >> Kind regards
> >>
> >> Siegmar
> >>
> >>
> >>
> >>
> >>> On Sep 5, 2012, at 5:50 AM, Siegmar Gross <Siegmar.Gross_at_[hidden]> wrote:
> >>>
> >>>> Hi,
> >>>>
> >>>> I'm new to rankfiles, so I experimented a little with different
> >>>> options. I thought that the following entry would be similar to an
> >>>> entry in an appfile and that MPI could place the process with rank 0
> >>>> on any core of any processor.
> >>>>
> >>>> rank 0=tyr.informatik.hs-fulda.de
> >>>>
> >>>> Unfortunately it's not allowed and I got an error. Can somebody add
> >>>> the missing help to the file?
> >>>>
> >>>>
> >>>> tyr small_prog 126 mpiexec -rf my_rankfile -report-bindings rank_size
> >>>> --------------------------------------------------------------------------
> >>>> Sorry! You were supposed to get help about:
> >>>> no-slot-list
> >>>> from the file:
> >>>> help-rmaps_rank_file.txt
> >>>> But I couldn't find that topic in the file. Sorry!
> >>>> --------------------------------------------------------------------------
> >>>>
> >>>>
> >>>> As you can see below I could use a rankfile on my old local machine
> >>>> (Sun Ultra 45) but not on our "new" one (Sun Server M4000). Today I
> >>>> logged into the machine via ssh and tried the same command once more
> >>>> as a local user without success. It's more or less the same error as
> >>>> before when I tried to bind the process to a remote machine.
> >>>>
> >>>> rs0 small_prog 118 mpiexec -rf my_rankfile -report-bindings rank_size
> >>>> [rs0.informatik.hs-fulda.de:13745] [[19734,0],0] odls:default:fork
> >>>> binding child [[19734,1],0] to slot_list 0:0
> >>>> --------------------------------------------------------------------------
> >>>> We were unable to successfully process/set the requested processor
> >>>> affinity settings:
> >>>>
> >>>> Specified slot list: 0:0
> >>>> Error: Cross-device link
> >>>>
> >>>> This could mean that a non-existent processor was specified, or
> >>>> that the specification had improper syntax.
> >>>> --------------------------------------------------------------------------
> >>>> --------------------------------------------------------------------------
> >>>> mpiexec was unable to start the specified application as it
> >>>> encountered an error:
> >>>>
> >>>> Error name: No such file or directory
> >>>> Node: rs0.informatik.hs-fulda.de
> >>>>
> >>>> when attempting to start process rank 0.
> >>>> --------------------------------------------------------------------------
> >>>> rs0 small_prog 119
> >>>>
> >>>>
> >>>> The application is available.
> >>>>
> >>>> rs0 small_prog 119 which rank_size
> >>>> /home/fd1026/SunOS/sparc/bin/rank_size
> >>>>
> >>>>
> >>>> Is it a problem in the Open MPI implementation or in my rankfile?
> >>>> How can I request which sockets and cores per socket are
> >>>> available so that I can use correct values in my rankfile?
> >>>> In lam-mpi I had a command "lamnodes" which I could use to get
> >>>> such information. Thank you very much for any help in advance.
> >>>>
> >>>>
> >>>> Kind regards
> >>>>
> >>>> Siegmar
> >>>>
> >>>>
> >>>>
> >>>>>> Are *all* the machines Sparc? Or just the 3rd one (rs0)?
> >>>>>
> >>>>> Yes, both machines are Sparc. I tried first in a homogeneous
> >>>>> environment.
> >>>>>
> >>>>> tyr fd1026 106 psrinfo -v
> >>>>> Status of virtual processor 0 as of: 09/04/2012 07:32:14
> >>>>> on-line since 08/31/2012 15:44:42.
> >>>>> The sparcv9 processor operates at 1600 MHz,
> >>>>> and has a sparcv9 floating point processor.
> >>>>> Status of virtual processor 1 as of: 09/04/2012 07:32:14
> >>>>> on-line since 08/31/2012 15:44:39.
> >>>>> The sparcv9 processor operates at 1600 MHz,
> >>>>> and has a sparcv9 floating point processor.
> >>>>> tyr fd1026 107
> >>>>>
> >>>>> My local machine (tyr) is a dual processor machine and the
> >>>>> other one is equipped with two quad-core processors each
> >>>>> capable of running two hardware threads.
> >>>>>
> >>>>>
> >>>>> Kind regards
> >>>>>
> >>>>> Siegmar
> >>>>>
> >>>>>
> >>>>>> On Sep 3, 2012, at 12:43 PM, Siegmar Gross <Siegmar.Gross_at_[hidden]> wrote:
> >>>>>>
> >>>>>>> Hi,
> >>>>>>>
> >>>>>>> the man page for "mpiexec" shows the following:
> >>>>>>>
> >>>>>>> cat myrankfile
> >>>>>>> rank 0=aa slot=1:0-2
> >>>>>>> rank 1=bb slot=0:0,1
> >>>>>>> rank 2=cc slot=1-2
> >>>>>>> mpirun -H aa,bb,cc,dd -rf myrankfile ./a.out
> >>>>>>>
> >>>>>>> So that
> >>>>>>>
> >>>>>>> Rank 0 runs on node aa, bound to socket 1, cores 0-2.
> >>>>>>> Rank 1 runs on node bb, bound to socket 0, cores 0 and 1.
> >>>>>>> Rank 2 runs on node cc, bound to cores 1 and 2.
> >>>>>>>
> >>>>>>> Does it mean that the process with rank 0 should be bound to
> >>>>>>> core 0, 1, or 2 of socket 1?
> >>>>>>>
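To make the three slot forms from the man page example above concrete, here is a minimal Python sketch that splits each rankfile line into rank, host, socket and cores. It assumes only the forms shown in that example (`socket:cores` and a bare core list, with ranges and comma lists); it is not Open MPI's own parser, and `parse_rankfile` and `expand` are hypothetical names. Because the error reported later in this thread refers to a `no-slot-list` topic, the sketch also rejects entries that have no `slot=` part.

```python
import re

# Matches lines such as "rank 0=aa slot=1:0-2". The slot part is optional in
# the pattern only so that a missing slot list can be reported explicitly.
LINE_RE = re.compile(r"^rank\s+(\d+)\s*=\s*(\S+)(?:\s+slot\s*=\s*(\S+))?$")

def expand(spec):
    """Expand a core list such as '0-2' or '0,1' into a list of integers."""
    cores = []
    for part in spec.split(","):
        if "-" in part:
            low, high = part.split("-")
            cores.extend(range(int(low), int(high) + 1))
        else:
            cores.append(int(part))
    return cores

def parse_rankfile(text):
    """Parse rankfile text into a list of dicts, one per active entry."""
    entries = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and commented-out entries
        m = LINE_RE.match(line)
        if m is None:
            raise ValueError("unrecognized rankfile line: " + line)
        rank, host, slot = m.groups()
        if slot is None:
            raise ValueError("no slot list for rank " + rank)
        if ":" in slot:
            socket, cores = slot.split(":", 1)
            socket = int(socket)
        else:
            socket, cores = None, slot  # bare core list, no socket given
        entries.append({"rank": int(rank), "host": host,
                        "socket": socket, "cores": expand(cores)})
    return entries
```

For the man page example, rank 0 comes out as host `aa`, socket 1, cores 0, 1, 2, and rank 2 as host `cc` with no socket and cores 1 and 2.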
> >>>>>>> I tried to use a rankfile and have a problem. My rankfile contains
> >>>>>>> the following lines.
> >>>>>>>
> >>>>>>> rank 0=tyr.informatik.hs-fulda.de slot=0:0
> >>>>>>> rank 1=tyr.informatik.hs-fulda.de slot=1:0
> >>>>>>> #rank 2=rs0.informatik.hs-fulda.de slot=0:0
> >>>>>>>
> >>>>>>>
> >>>>>>> Everything is fine if I use the file with just my local machine
> >>>>>>> (the first two lines).
> >>>>>>>
> >>>>>>> tyr small_prog 115 mpiexec -report-bindings -rf my_rankfile rank_size
> >>>>>>> [tyr.informatik.hs-fulda.de:01133] [[9849,0],0]
> >>>>>>> odls:default:fork binding child [[9849,1],0] to slot_list 0:0
> >>>>>>> [tyr.informatik.hs-fulda.de:01133] [[9849,0],0]
> >>>>>>> odls:default:fork binding child [[9849,1],1] to slot_list 1:0
> >>>>>>> I'm process 0 of 2 available processes running on tyr.informatik.hs-fulda.de.
> >>>>>>> MPI standard 2.1 is supported.
> >>>>>>> I'm process 1 of 2 available processes running on tyr.informatik.hs-fulda.de.
> >>>>>>> MPI standard 2.1 is supported.
> >>>>>>> tyr small_prog 116
> >>>>>>>
> >>>>>>>
> >>>>>>> I can also change the socket number and the processes will be attached
> >>>>>>> to the correct cores. Unfortunately it doesn't work if I add one
> >>>>>>> other machine (third line).
> >>>>>>>
> >>>>>>>
> >>>>>>> tyr small_prog 112 mpiexec -report-bindings -rf my_rankfile rank_size
> >>>>>>> --------------------------------------------------------------------------
> >>>>>>> We were unable to successfully process/set the requested processor
> >>>>>>> affinity settings:
> >>>>>>>
> >>>>>>> Specified slot list: 0:0
> >>>>>>> Error: Cross-device link
> >>>>>>>
> >>>>>>> This could mean that a non-existent processor was specified, or
> >>>>>>> that the specification had improper syntax.
> >>>>>>> --------------------------------------------------------------------------
> >>>>>>> [tyr.informatik.hs-fulda.de:01520] [[10212,0],0]
> >>>>>>> odls:default:fork binding child [[10212,1],0] to slot_list 0:0
> >>>>>>> [tyr.informatik.hs-fulda.de:01520] [[10212,0],0]
> >>>>>>> odls:default:fork binding child [[10212,1],1] to slot_list 1:0
> >>>>>>> [rs0.informatik.hs-fulda.de:12047] [[10212,0],1]
> >>>>>>> odls:default:fork binding child [[10212,1],2] to slot_list 0:0
> >>>>>>> [tyr.informatik.hs-fulda.de:01520] [[10212,0],0]
> >>>>>>> ORTE_ERROR_LOG: A message is attempting to be sent to a process
> >>>>>>> whose contact information is unknown in file
> >>>>>>> ../../../../../openmpi-1.6/orte/mca/rml/oob/rml_oob_send.c at line 145
> >>>>>>> [tyr.informatik.hs-fulda.de:01520] [[10212,0],0] attempted to send
> >>>>>>> to [[10212,1],0]: tag 20
> >>>>>>> [tyr.informatik.hs-fulda.de:01520] [[10212,0],0] ORTE_ERROR_LOG:
> >>>>>>> A message is attempting to be sent to a process whose contact
> >>>>>>> information is unknown in file
> >>>>>>> ../../../../openmpi-1.6/orte/mca/odls/base/odls_base_default_fns.c
> >>>>>>> at line 2501
> >>>>>>> --------------------------------------------------------------------------
> >>>>>>> mpiexec was unable to start the specified application as it
> >>>>>>> encountered an error:
> >>>>>>>
> >>>>>>> Error name: Error 0
> >>>>>>> Node: rs0.informatik.hs-fulda.de
> >>>>>>>
> >>>>>>> when attempting to start process rank 2.
> >>>>>>> --------------------------------------------------------------------------
> >>>>>>> tyr small_prog 113
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> The other machine has two 8 core processors.
> >>>>>>>
> >>>>>>> tyr small_prog 121 ssh rs0 psrinfo -v
> >>>>>>> Status of virtual processor 0 as of: 09/03/2012 19:51:15
> >>>>>>> on-line since 07/26/2012 15:03:14.
> >>>>>>> The sparcv9 processor operates at 2400 MHz,
> >>>>>>> and has a sparcv9 floating point processor.
> >>>>>>> Status of virtual processor 1 as of: 09/03/2012 19:51:15
> >>>>>>> ...
> >>>>>>> Status of virtual processor 15 as of: 09/03/2012 19:51:15
> >>>>>>> on-line since 07/26/2012 15:03:16.
> >>>>>>> The sparcv9 processor operates at 2400 MHz,
> >>>>>>> and has a sparcv9 floating point processor.
> >>>>>>> tyr small_prog 122
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> Is it necessary to specify another option on the command line or
> >>>>>>> is my rankfile faulty? Thank you very much for any suggestions in
> >>>>>>> advance.
> >>>>>>>
> >>>>>>>
> >>>>>>> Kind regards
> >>>>>>>
> >>>>>>> Siegmar
> >>>>>>>
> >>>>>>>
> >>>>>>> _______________________________________________
> >>>>>>> users mailing list
> >>>>>>> users_at_[hidden]
> >>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
> >>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>>
> >
>
>