Open MPI User's Mailing List Archives

Subject: [OMPI users] segmentation fault / bus error in openmpi-1.9a1r27342 (Solaris, Oracle C)
From: Siegmar Gross (Siegmar.Gross_at_[hidden])
Date: 2012-09-17 07:34:59


Hi,

I have installed openmpi-1.9a1r27342 on Solaris 10 with Oracle
Solaris Studio compiler 12.3.

rs0 fd1026 106 mpicc -showme
cc -I/usr/local/openmpi-1.9_64_cc/include -mt -m64 \
   -L/usr/local/openmpi-1.9_64_cc/lib64 -lmpi -lpicl -lm -lkstat \
   -llgrp -lsocket -lnsl -lrt -lm
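
In case it helps with reproducing: a minimal MPI test program
(plain MPI C, nothing machine-specific) that can be built with this
wrapper looks like this:

  #include <stdio.h>
  #include <mpi.h>

  int main (int argc, char *argv[])
  {
    int rank, size;

    MPI_Init (&argc, &argv);
    MPI_Comm_rank (MPI_COMM_WORLD, &rank);
    MPI_Comm_size (MPI_COMM_WORLD, &size);
    printf ("Process %d of %d\n", rank, size);
    MPI_Finalize ();
    return 0;
  }

The traces below point into Open MPI's launch/binding code
(opal_hwloc_base_cset2str, called from the odls framework), so even a
non-MPI program like "date" is enough to trigger the crashes.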

I can run the following command.

rs0 fd1026 107 mpiexec -report-bindings -np 2 -bind-to hwthread date
[rs0.informatik.hs-fulda.de:19704] MCW rank 0 bound to :
  [B./../../..][../../../..]
[rs0.informatik.hs-fulda.de:19704] MCW rank 1 bound to :
  [../B./../..][../../../..]
Mon Sep 17 13:07:34 CEST 2012
Mon Sep 17 13:07:34 CEST 2012

I get a segmentation fault if I increase the number of processes to 3.

rs0 fd1026 108 mpiexec -report-bindings -np 3 -bind-to hwthread date
--------------------------------------------------------------------------
mpiexec noticed that process rank 0 with PID 19711 on node
  rs0.informatik.hs-fulda.de exited on signal 11 (Segmentation Fault).
--------------------------------------------------------------------------
[rs0:19713] *** Process received signal ***
[rs0:19713] Signal: Segmentation Fault (11)
[rs0:19713] Signal code: Invalid permissions (2)
[rs0:19713] Failing at address: 1000002e8
/usr/local/openmpi-1.9_64_cc/lib64/libopen-rte.so.0.0.0:0x282640
/lib/sparcv9/libc.so.1:0xd8684
/lib/sparcv9/libc.so.1:0xcc1f8
/lib/sparcv9/libc.so.1:0xcc404
/usr/local/openmpi-1.9_64_cc/lib64/libopen-rte.so.0.0.0:0x2c1488 [ Signal 11 (SEGV)]
/usr/local/openmpi-1.9_64_cc/lib64/libopen-rte.so.0.0.0:opal_hwloc_base_cset2str+0x28
/usr/local/openmpi-1.9_64_cc/lib64/openmpi/mca_odls_default.so:0xab00
/usr/local/openmpi-1.9_64_cc/lib64/openmpi/mca_odls_default.so:0xb7e4
/usr/local/openmpi-1.9_64_cc/lib64/libopen-rte.so.0.0.0:orte_odls_base_default_launch_local+0xa20
/usr/local/openmpi-1.9_64_cc/lib64/libopen-rte.so.0.0.0:0x2997f4
/usr/local/openmpi-1.9_64_cc/lib64/libopen-rte.so.0.0.0:0x299a20
/usr/local/openmpi-1.9_64_cc/lib64/libopen-rte.so.0.0.0:opal_libevent2019_event_base_loop+0x1e8
/usr/local/openmpi-1.9_64_cc/bin/orterun:orterun+0x1920
/usr/local/openmpi-1.9_64_cc/bin/orterun:main+0x24
/usr/local/openmpi-1.9_64_cc/bin/orterun:_start+0x12c
[rs0:19713] *** End of error message ***
...
(same output for the other two processes)
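
The raw addresses alone are probably not very helpful. If it is
useful, I can try to produce a symbolic trace from a core file with
the Solaris tools, roughly like this (just a sketch; the resulting
core file name depends on the coreadm pattern, and <pid> is a
placeholder):

  coreadm -p core.%f.%p $$
  mpiexec -report-bindings -np 3 -bind-to hwthread date
  pstack core.orterun.<pid>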

If I add "-bynode" I get a bus error.

rs0 fd1026 110 mpiexec -report-bindings -np 2 -bynode -bind-to hwthread date
--------------------------------------------------------------------------
mpiexec noticed that process rank 0 with PID 19724 on node
  rs0.informatik.hs-fulda.de exited on signal 10 (Bus Error).
--------------------------------------------------------------------------
[rs0:19724] *** Process received signal ***
[rs0:19724] Signal: Bus Error (10)
[rs0:19724] Signal code: Invalid address alignment (1)
[rs0:19724] Failing at address: 1
/usr/local/openmpi-1.9_64_cc/lib64/libopen-rte.so.0.0.0:0x282640
/lib/sparcv9/libc.so.1:0xd8684
/lib/sparcv9/libc.so.1:0xcc1f8
/lib/sparcv9/libc.so.1:0xcc404
/usr/local/openmpi-1.9_64_cc/lib64/libopen-rte.so.0.0.0:0x2c147c [ Signal 10 (BUS)]
/usr/local/openmpi-1.9_64_cc/lib64/libopen-rte.so.0.0.0:opal_hwloc_base_cset2str+0x28
/usr/local/openmpi-1.9_64_cc/lib64/openmpi/mca_odls_default.so:0xab00
/usr/local/openmpi-1.9_64_cc/lib64/openmpi/mca_odls_default.so:0xb7e4
/usr/local/openmpi-1.9_64_cc/lib64/libopen-rte.so.0.0.0:orte_odls_base_default_launch_local+0xa20
/usr/local/openmpi-1.9_64_cc/lib64/libopen-rte.so.0.0.0:0x2997f4
/usr/local/openmpi-1.9_64_cc/lib64/libopen-rte.so.0.0.0:0x299a20
/usr/local/openmpi-1.9_64_cc/lib64/libopen-rte.so.0.0.0:opal_libevent2019_event_base_loop+0x1e8
/usr/local/openmpi-1.9_64_cc/bin/orterun:orterun+0x1920
/usr/local/openmpi-1.9_64_cc/bin/orterun:main+0x24
/usr/local/openmpi-1.9_64_cc/bin/orterun:_start+0x12c
[rs0:19724] *** End of error message ***
...
(same output for the other process)
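
The signal code "Invalid address alignment" is plausible on SPARC:
unlike x86, SPARC traps misaligned loads and stores with SIGBUS, so a
bad or unaligned pointer shows up as a bus error rather than a
segmentation fault (note that the failing address above is 1, which
is not aligned for anything wider than a byte). A tiny illustration
of the failure class (not Open MPI code):

  int main (void)
  {
    char buf[16];
    double *p = (double *)(buf + 1);  /* not 8-byte aligned */

    *p = 1.0;  /* SIGBUS (invalid address alignment) on SPARC;
                  x86 tolerates the misaligned store */
    return 0;
  }

Compiled without optimization, this dies with SIGBUS on SPARC. Since
both backtraces go through opal_hwloc_base_cset2str+0x28, the
segmentation fault and the bus error may well come from the same bad
pointer, which just happens to be aligned in one case and not in the
other.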

I get a segmentation fault for the following commands.

mpiexec -report-bindings -np 2 -map-by slot -bind-to hwthread date
mpiexec -report-bindings -np 2 -map-by numa -bind-to hwthread date
mpiexec -report-bindings -np 2 -map-by node -bind-to hwthread date

I get a bus error for the following command.

mpiexec -report-bindings -np 2 -map-by socket -bind-to hwthread date

The following commands work.

rs0 fd1026 120 mpiexec -report-bindings -np 2 -map-by hwthread -bind-to hwthread date
[rs0.informatik.hs-fulda.de:19788] MCW rank 0 bound to : [B./../../..][../../../..]
[rs0.informatik.hs-fulda.de:19788] MCW rank 1 bound to : [.B/../../..][../../../..]
Mon Sep 17 13:20:30 CEST 2012
Mon Sep 17 13:20:30 CEST 2012

rs0 fd1026 121 mpiexec -report-bindings -np 2 -map-by core -bind-to hwthread date
[rs0.informatik.hs-fulda.de:19793] MCW rank 0 bound to : [B./../../..][../../../..]
[rs0.informatik.hs-fulda.de:19793] MCW rank 1 bound to : [../B./../..][../../../..]
Mon Sep 17 13:21:06 CEST 2012
Mon Sep 17 13:21:06 CEST 2012

I think that the following output is correct, because I have a Sun M4000
server with two quad-core processors, each core supporting two hardware
threads.

rs0 fd1026 124 mpiexec -report-bindings -np 2 -map-by board -bind-to hwthread date
--------------------------------------------------------------------------
The specified mapping policy is not recognized:

  Policy: BYBOARD

Please check for a typo or ensure that the option is a supported
one.
--------------------------------------------------------------------------
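
To check that the hardware topology itself is detected correctly, a
small hwloc program can count the processing units (a sketch against
the standard hwloc API; it needs a separate hwloc installation,
because Open MPI's embedded copy does not install the headers). On
this machine it should report 16 (2 sockets x 4 cores x 2 hardware
threads):

  #include <stdio.h>
  #include <hwloc.h>

  int main (void)
  {
    hwloc_topology_t topology;
    int npus;

    hwloc_topology_init (&topology);
    hwloc_topology_load (topology);
    /* PUs are hwloc's processing units, i.e. hardware threads */
    npus = hwloc_get_nbobjs_by_type (topology, HWLOC_OBJ_PU);
    printf ("hardware threads: %d\n", npus);
    hwloc_topology_destroy (topology);
    return 0;
  }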

In my opinion I should be able to start and bind up to 16 processes
(2 sockets x 4 cores x 2 hardware threads = 16) if I map and bind to
hwthreads, shouldn't I? Thank you very much in advance for any help.

Kind regards

Siegmar