
Open MPI User's Mailing List Archives


Subject: [OMPI users] some problems with openmpi-1.9a1r30100
From: Siegmar Gross (Siegmar.Gross_at_[hidden])
Date: 2014-01-01 04:48:08


Hi,

yesterday I installed openmpi-1.9a1r30100 on "Solaris 10 x86_64",
"Solaris 10 Sparc", and "openSUSE Linux 12.1 x86_64" with Sun C
5.12. First of all the good news: "configure", "make", "make
install", and "make check" completed without errors, i.e., "make
check" no longer produces a "SIGBUS Error" on "Solaris Sparc" and
no longer blocks in or after "opal_path_nfs" on Linux. I reported
both problems before. Thank you very much to everybody who solved
these problems.

Unfortunately I still get a "SIGBUS Error" on "Solaris Sparc"
for "ompi_info -a".

tyr openmpi-1.9 99 ompi_info | grep MPI:
                Open MPI: 1.9a1r30100
tyr openmpi-1.9 100 ompi_info -a |& grep Signal
[tyr:09699] Signal: Bus Error (10)
[tyr:09699] Signal code: Invalid address alignment (1)
.../openmpi-1.9_64_cc/lib64/libopen-pal.so.0.0.0:0x1321b8
  [ Signal 2099900312 (?)]
Bus error
tyr openmpi-1.9 101

I can compile and run a small MPI program without "SIGBUS Error".
Jeff, thank you very much for solving this problem.

tyr small_prog 110 mpicc init_finalize.c
tyr small_prog 111 mpiexec -np 1 a.out
Hello!
tyr small_prog 112

"make install" didn't install the Javadoc documentation for the
new Java interface. Is it necessary to install it in a separate
step?

tyr small_prog 118 ls -l /usr/local/openmpi-1.9_64_cc/share/
total 6
drwxr-xr-x 5 root root 512 Dec 31 12:03 man
drwxr-xr-x 3 root root 3584 Dec 31 12:05 openmpi
drwxr-xr-x 3 root root 512 Dec 31 12:04 vampirtrace
tyr small_prog 119

In the past I could run a small program in a real heterogeneous
system with little-endian (sunpc1, linpc1) and big-endian (rs0,
tyr) machines.

tyr small_prog 101 ompi_info | grep MPI:
                Open MPI: 1.6.6a1r29175
tyr small_prog 102 mpiexec -np 3 -host rs0,sunpc1,linpc1 rank_size
I'm process 1 of 3 available processes running on sunpc1.
MPI standard 2.1 is supported.
I'm process 0 of 3 available processes running on rs0.informatik.hs-fulda.de.
MPI standard 2.1 is supported.
I'm process 2 of 3 available processes running on linpc1.
MPI standard 2.1 is supported.
tyr small_prog 103

Now I get no output at all.

tyr small_prog 130 ompi_info | grep MPI:
                Open MPI: 1.9a1r30100
tyr small_prog 131 mpiexec -np 3 -host rs0,sunpc1,linpc1 rank_size
tyr small_prog 132 mpiexec -np 3 -host rs0,sunpc1,linpc1 \
  --hetero-nodes --hetero-apps rank_size
tyr small_prog 133

Perhaps this behaviour is intended, because Open MPI doesn't
support little- and big-endian machines in the same cluster or
virtual computer (LAM/MPI is the only implementation I know that
works in such an environment). On the other hand: does it make
sense for the command to terminate without any output if it
doesn't work (even though "mpiexec" returns "1")?

Nevertheless I have another problem with my small program.

tyr small_prog 158 uname -p
sparc
tyr small_prog 159 ssh rs0 uname -p
sparc

tyr small_prog 160 mpiexec rank_size
I'm process 0 of 1 available processes running on tyr.informatik.hs-fulda.de.
MPI standard 2.2 is supported.

tyr small_prog 161 ssh rs0 mpiexec rank_size
I'm process 0 of 1 available processes running on rs0.informatik.hs-fulda.de.
MPI standard 2.2 is supported.

tyr small_prog 162 mpiexec -np 2 -host tyr,rs0 rank_size
tyr small_prog 163 echo $status
1
tyr small_prog 164

The command works as expected on little-endian machines.

linpc1 small_prog 93 mpiexec -np 2 -host linpc1,sunpc1 rank_size
I'm process 0 of 2 available processes running on linpc1.
MPI standard 2.2 is supported.
I'm process 1 of 2 available processes running on sunpc1.
MPI standard 2.2 is supported.
linpc1 small_prog 94

Next I tried process binding.

rf_linpc:
---------
rank 0=linpc1 slot=0:0,1;1:0,1

rf_linpc_linpc:
---------------

rank 0=linpc0 slot=0:0-1;1:0-1
rank 1=linpc1 slot=0:0-1
rank 2=linpc1 slot=1:0
rank 3=linpc1 slot=1:1

rf_linpc_linpc_comma:
---------------------

rank 0=linpc0 slot=0:0,1;1:0,1
rank 1=linpc1 slot=0:0,1
rank 2=linpc1 slot=1:0
rank 3=linpc1 slot=1:1

linpc1 openmpi_1.7.x_or_newer 103 mpiexec -report-bindings -np 1 \
  -rf rf_linpc hostname
[linpc1:08461] MCW rank 0 bound to socket 0[core 0[hwt 0]],
  socket 0[core 1[hwt 0]], socket 1[core 2[hwt 0]],
  socket 1[core 3[hwt 0]]: [B/B][B/B]
linpc1
linpc1 openmpi_1.7.x_or_newer 104

That's the output I expected, but the following commands don't
produce the expected output.

linpc1 openmpi_1.7.x_or_newer 105 mpiexec -report-bindings -np 4 \
  -rf rf_linpc_linpc hostname
--------------------------------------------------------------------------
There are not enough slots available in the system to satisfy the 2 slots
that were requested by the application:
  hostname

Either request fewer slots for your application, or make more slots available
for use.
--------------------------------------------------------------------------
linpc1 openmpi_1.7.x_or_newer 106

linpc1 openmpi_1.7.x_or_newer 110 mpiexec -report-bindings -np 4 \
  -rf rf_linpc_linpc_comma hostname
--------------------------------------------------------------------------
There are not enough slots available in the system to satisfy the 2 slots
that were requested by the application:
  hostname

Either request fewer slots for your application, or make more slots available
for use.
--------------------------------------------------------------------------
linpc1 openmpi_1.7.x_or_newer 111

It works well with Open MPI 1.6.x (a similar rankfile, but using
"," to separate sockets because of the different syntax).
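For reference, the corresponding rankfile in 1.6 syntax would look
like the following (sockets separated by "," instead of ";"; this
matches the slot lists that "-report-bindings" prints):

rf_linpc_linpc (1.6 syntax):
----------------------------

rank 0=linpc0 slot=0:0-1,1:0-1
rank 1=linpc1 slot=0:0-1
rank 2=linpc1 slot=1:0
rank 3=linpc1 slot=1:1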

linpc1 openmpi_1.6.x 109 mpiexec -report-bindings -np 4 \
  -rf rf_linpc_linpc hostname
[linpc1:08675] MCW rank 1 bound to socket 0[core 0-1]:
  [B B][. .] (slot list 0:0-1)
[linpc1:08675] MCW rank 2 bound to socket 1[core 0]:
  [. .][B .] (slot list 1:0)
[linpc1:08675] MCW rank 3 bound to socket 1[core 1]:
  [. .][. B] (slot list 1:1)
linpc1
linpc1
[linpc0:00677] MCW rank 0 bound to socket 0[core 0-1]
  socket 1[core 0-1]: [B B][B B] (slot list 0:0-1,1:0-1)
linpc0
linpc1
linpc1 openmpi_1.6.x 110

Open MPI 1.6.x supports even little and big endian machines for
this simple command.

linpc1 openmpi_1.6.x 112 ompi_info | grep MPI:
                Open MPI: 1.6.6a1r29175
linpc1 openmpi_1.6.x 113 mpiexec -report-bindings -np 4 \
  -rf rf_linpc_sunpc_tyr hostname
[linpc1:08697] MCW rank 1 bound to socket 0[core 0-1]:
  [B B][. .] (slot list 0:0-1)
[linpc0:00758] MCW rank 0 bound to socket 0[core 0-1]
  socket 1[core 0-1]: [B B][B B] (slot list 0:0-1,1:0-1)
linpc0
linpc1
tyr.informatik.hs-fulda.de
[tyr.informatik.hs-fulda.de:10286] MCW rank 3 bound to
  socket 1[core 0]: [.][B] (slot list 1:0)
[sunpc1:21136] MCW rank 2 bound to socket 1[core 0]:
  [. .][B .] (slot list 1:0)
sunpc1
linpc1 openmpi_1.6.x 114

The option "-bycore" is no longer available. Is this intended?

linpc1 openmpi_1.7.x_or_newer 111 mpiexec -report-bindings -np 4 \
  -host linpc0,linpc1,sunpc0,sunpc1 -cpus-per-proc 4 -bycore \
  -bind-to-core hostname
mpiexec: Error: unknown option "-bycore"
Type 'mpiexec --help' for usage.
linpc1 openmpi_1.7.x_or_newer 112

linpc1 openmpi_1.7.x_or_newer 112 mpiexec -report-bindings \
  -np 4 -host linpc0,linpc1,sunpc0,sunpc1 -cpus-per-proc 4 \
  -bind-to-core hostname
--------------------------------------------------------------------------
A request was made to bind to that would result in binding more
processes than cpus on a resource:

   Bind to: CORE
   Node: linpc0
   #processes: 2
   #cpus: 1

You can override this protection by adding the "overload-allowed"
option to your binding directive.
--------------------------------------------------------------------------
linpc1 openmpi_1.7.x_or_newer 113

It worked with Open MPI 1.6.x.

linpc1 openmpi_1.6.x 105 mpiexec -report-bindings -np 4 \
  -host linpc0,linpc1,sunpc0,sunpc1 -cpus-per-proc 4 -bycore \
  -bind-to-core hostname
[linpc1:09465] MCW rank 1 bound to socket 0[core 0-1] socket 1[core 0-1]:
  [B B][B B]
linpc1
[linpc0:01036] MCW rank 0 bound to socket 0[core 0-1] socket 1[core 0-1]:
  [B B][B B]
linpc0
[sunpc0:03796] MCW rank 2 bound to socket 0[core 0-1] socket 1[core 0-1]:
  [B B][B B]
sunpc0
[sunpc1:21335] MCW rank 3 bound to socket 0[core 0-1] socket 1[core 0-1]:
  [B B][B B]
sunpc1
linpc1 openmpi_1.6.x 106

Has the syntax changed once more, so that I can get the expected
bindings with different command line options, or is it a problem
in Open MPI 1.9.x?

I have similar problems with Java.

tyr java 197 mpiexec -np 4 java BcastIntArrayMain

Process 0 running on tyr.informatik.hs-fulda.de.
  intValues[0]: 0 intValues[1]: 11 intValues[2]: 22 intValues[3]: 33

Process 1 running on tyr.informatik.hs-fulda.de.
  intValues[0]: 0 intValues[1]: 11 intValues[2]: 22 intValues[3]: 33

Process 2 running on tyr.informatik.hs-fulda.de.
  intValues[0]: 0 intValues[1]: 11 intValues[2]: 22 intValues[3]: 33

Process 3 running on tyr.informatik.hs-fulda.de.
  intValues[0]: 0 intValues[1]: 11 intValues[2]: 22 intValues[3]: 33

tyr java 198 mpiexec -np 4 -host rs0,tyr java BcastIntArrayMain
tyr java 199 echo $status
1
tyr java 200

Why? Both machines are big-endian. By the way, I have similar
problems with openmpi-1.7.x, where Java isn't available at the
moment, as I reported before.

tyr small_prog 103 ompi_info | grep MPI:
                Open MPI: 1.7.4rc2r30094

tyr small_prog 104 ompi_info -a |& grep Signal
[tyr:10441] Signal: Bus Error (10)
[tyr:10441] Signal code: Invalid address alignment (1)
.../openmpi-1.7.4_64_cc/lib64/libopen-pal.so.6.1.0:0x137af8
  [ Signal 2099922960 (?)]
Bus error
tyr small_prog 105

tyr small_prog 105 mpicc init_finalize.c
tyr small_prog 106 mpiexec -np 1 a.out
Hello!
tyr small_prog 107

tyr small_prog 107 mpiexec -np 3 -host rs0,sunpc1,linpc1 rank_size
tyr small_prog 108 mpiexec -np 3 -host rs0,sunpc1,linpc1 \
  --hetero-nodes --hetero-apps rank_size
tyr small_prog 109

and so on

I'm sorry that I keep causing trouble, but on the other hand I
would be very grateful if somebody could solve these problems.
Thank you very much in advance for any help.

Kind regards

Siegmar