Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] segmentation fault / bus error in openmpi-1.9a1r27342 (Solaris, Oracle C)
From: Siegmar Gross (Siegmar.Gross_at_[hidden])
Date: 2012-09-17 08:24:33


Hi,

I just found out that I get no segmentation fault or bus error if I
add "-display-devel-map" to the commands.

rs0 fd1026 110 mpiexec -report-bindings -np 3 -bind-to hwthread -display-devel-map date

 Mapper requested: NULL Last mapper: round_robin Mapping policy: BYSLOT Ranking policy: SLOT Binding policy:
HWTHREAD[HWTHREAD] Cpu set: NULL PPR: NULL
        Num new daemons: 0 New daemon starting vpid INVALID
        Num nodes: 1

 Data for node: rs0.informatik.hs-fulda.de Launch id: -1 State: 2
        Daemon: [[10411,0],0] Daemon launched: True
        Num slots: 1 Slots in use: 1 Oversubscribed: TRUE
        Num slots allocated: 1 Max slots: 0
        Username on node: NULL
        Num procs: 3 Next node_rank: 3
        Data for proc: [[10411,1],0]
                Pid: 0 Local rank: 0 Node rank: 0 App rank: 0
                State: INITIALIZED Restarts: 0 App_context: 0 Locale: 0-15 Binding: 0[0]
        Data for proc: [[10411,1],1]
                Pid: 0 Local rank: 1 Node rank: 1 App rank: 1
                State: INITIALIZED Restarts: 0 App_context: 0 Locale: 0-15 Binding: 2[2]
        Data for proc: [[10411,1],2]
                Pid: 0 Local rank: 2 Node rank: 2 App rank: 2
                State: INITIALIZED Restarts: 0 App_context: 0 Locale: 0-15 Binding: 4[4]
[rs0.informatik.hs-fulda.de:20492] MCW rank 0 bound to : [B./../../..][../../../..]
[rs0.informatik.hs-fulda.de:20492] MCW rank 1 bound to : [../B./../..][../../../..]
[rs0.informatik.hs-fulda.de:20492] MCW rank 2 bound to : [../../B./..][../../../..]
Mon Sep 17 14:20:50 CEST 2012
Mon Sep 17 14:20:50 CEST 2012
Mon Sep 17 14:20:50 CEST 2012

rs0 fd1026 111 mpiexec -report-bindings -np 2 -bynode -bind-to hwthread -display-devel-map date

 Mapper requested: NULL Last mapper: round_robin Mapping policy: BYNODE Ranking policy: NODE Binding policy:
HWTHREAD[HWTHREAD] Cpu set: NULL PPR: NULL
        Num new daemons: 0 New daemon starting vpid INVALID
        Num nodes: 1

 Data for node: rs0.informatik.hs-fulda.de Launch id: -1 State: 2
        Daemon: [[10417,0],0] Daemon launched: True
        Num slots: 1 Slots in use: 1 Oversubscribed: TRUE
        Num slots allocated: 1 Max slots: 0
        Username on node: NULL
        Num procs: 2 Next node_rank: 2
        Data for proc: [[10417,1],0]
                Pid: 0 Local rank: 0 Node rank: 0 App rank: 0
                State: INITIALIZED Restarts: 0 App_context: 0 Locale: 0-15 Binding: 0[0]
        Data for proc: [[10417,1],1]
                Pid: 0 Local rank: 1 Node rank: 1 App rank: 1
                State: INITIALIZED Restarts: 0 App_context: 0 Locale: 0-15 Binding: 2[2]
[rs0.informatik.hs-fulda.de:20502] MCW rank 0 bound to : [B./../../..][../../../..]
[rs0.informatik.hs-fulda.de:20502] MCW rank 1 bound to : [../B./../..][../../../..]
Mon Sep 17 14:22:10 CEST 2012
Mon Sep 17 14:22:10 CEST 2012

Any ideas why an additional option "solves" the problem?

Kind regards

Siegmar

> I have installed openmpi-1.9a1r27342 on Solaris 10 with Oracle
> Solaris Studio compiler 12.3.
>
> rs0 fd1026 106 mpicc -showme
> cc -I/usr/local/openmpi-1.9_64_cc/include -mt -m64 \
> -L/usr/local/openmpi-1.9_64_cc/lib64 -lmpi -lpicl -lm -lkstat \
> -llgrp -lsocket -lnsl -lrt -lm
>
> I can run the following command.
>
> rs0 fd1026 107 mpiexec -report-bindings -np 2 -bind-to hwthread date
> [rs0.informatik.hs-fulda.de:19704] MCW rank 0 bound to :
> [B./../../..][../../../..]
> [rs0.informatik.hs-fulda.de:19704] MCW rank 1 bound to :
> [../B./../..][../../../..]
> Mon Sep 17 13:07:34 CEST 2012
> Mon Sep 17 13:07:34 CEST 2012
>
> I get a segmentation fault if I increase the number of processes to 3.
>
> rs0 fd1026 108 mpiexec -report-bindings -np 3 -bind-to hwthread date
> --------------------------------------------------------------------------
> mpiexec noticed that process rank 0 with PID 19711 on node
> rs0.informatik.hs-fulda.de exited on signal 11 (Segmentation Fault).
> --------------------------------------------------------------------------
> [rs0:19713] *** Process received signal ***
> [rs0:19713] Signal: Segmentation Fault (11)
> [rs0:19713] Signal code: Invalid permissions (2)
> [rs0:19713] Failing at address: 1000002e8
> /usr/local/openmpi-1.9_64_cc/lib64/libopen-rte.so.0.0.0:0x282640
> /lib/sparcv9/libc.so.1:0xd8684
> /lib/sparcv9/libc.so.1:0xcc1f8
> /lib/sparcv9/libc.so.1:0xcc404
> /usr/local/openmpi-1.9_64_cc/lib64/libopen-rte.so.0.0.0:0x2c1488 [ Signal 11 (SEGV)]
> /usr/local/openmpi-1.9_64_cc/lib64/libopen-rte.so.0.0.0:opal_hwloc_base_cset2str+0x28
> /usr/local/openmpi-1.9_64_cc/lib64/openmpi/mca_odls_default.so:0xab00
> /usr/local/openmpi-1.9_64_cc/lib64/openmpi/mca_odls_default.so:0xb7e4
> /usr/local/openmpi-1.9_64_cc/lib64/libopen-rte.so.0.0.0:orte_odls_base_default_launch_local+0xa20
> /usr/local/openmpi-1.9_64_cc/lib64/libopen-rte.so.0.0.0:0x2997f4
> /usr/local/openmpi-1.9_64_cc/lib64/libopen-rte.so.0.0.0:0x299a20
> /usr/local/openmpi-1.9_64_cc/lib64/libopen-rte.so.0.0.0:opal_libevent2019_event_base_loop+0x1e8
> /usr/local/openmpi-1.9_64_cc/bin/orterun:orterun+0x1920
> /usr/local/openmpi-1.9_64_cc/bin/orterun:main+0x24
> /usr/local/openmpi-1.9_64_cc/bin/orterun:_start+0x12c
> [rs0:19713] *** End of error message ***
> ...
> (same output for the other two processes)
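
As an aside, both backtraces end in opal_hwloc_base_cset2str, which
(judging by its name) converts an hwloc cpuset into a printable string
for output such as the -report-bindings lines. A rough, hypothetical
illustration of that kind of conversion, using only the public hwloc
bitmap API rather than Open MPI's internal code:

  #include <stdio.h>
  #include <hwloc.h>

  int main(void)
  {
      hwloc_topology_t topo;
      char buf[128];

      hwloc_topology_init(&topo);
      hwloc_topology_load(topo);

      /* cpuset of the first PU (hardware thread) on this machine */
      hwloc_obj_t pu = hwloc_get_obj_by_type(topo, HWLOC_OBJ_PU, 0);
      if (pu != NULL) {
          /* render the cpuset as a human-readable list, e.g. "0" */
          hwloc_bitmap_list_snprintf(buf, sizeof(buf), pu->cpuset);
          printf("PU 0 cpuset: %s\n", buf);
      }

      hwloc_topology_destroy(topo);
      return 0;
  }

A crash inside such a routine usually points at the cpuset or buffer it
was handed rather than at the string conversion itself.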
>
>
> If I add "-bynode" I get a bus error.
>
> rs0 fd1026 110 mpiexec -report-bindings -np 2 -bynode -bind-to hwthread date
> --------------------------------------------------------------------------
> mpiexec noticed that process rank 0 with PID 19724 on node
> rs0.informatik.hs-fulda.de exited on signal 10 (Bus Error).
> --------------------------------------------------------------------------
> [rs0:19724] *** Process received signal ***
> [rs0:19724] Signal: Bus Error (10)
> [rs0:19724] Signal code: Invalid address alignment (1)
> [rs0:19724] Failing at address: 1
> /usr/local/openmpi-1.9_64_cc/lib64/libopen-rte.so.0.0.0:0x282640
> /lib/sparcv9/libc.so.1:0xd8684
> /lib/sparcv9/libc.so.1:0xcc1f8
> /lib/sparcv9/libc.so.1:0xcc404
> /usr/local/openmpi-1.9_64_cc/lib64/libopen-rte.so.0.0.0:0x2c147c [ Signal 10 (BUS)]
> /usr/local/openmpi-1.9_64_cc/lib64/libopen-rte.so.0.0.0:opal_hwloc_base_cset2str+0x28
> /usr/local/openmpi-1.9_64_cc/lib64/openmpi/mca_odls_default.so:0xab00
> /usr/local/openmpi-1.9_64_cc/lib64/openmpi/mca_odls_default.so:0xb7e4
> /usr/local/openmpi-1.9_64_cc/lib64/libopen-rte.so.0.0.0:orte_odls_base_default_launch_local+0xa20
> /usr/local/openmpi-1.9_64_cc/lib64/libopen-rte.so.0.0.0:0x2997f4
> /usr/local/openmpi-1.9_64_cc/lib64/libopen-rte.so.0.0.0:0x299a20
> /usr/local/openmpi-1.9_64_cc/lib64/libopen-rte.so.0.0.0:opal_libevent2019_event_base_loop+0x1e8
> /usr/local/openmpi-1.9_64_cc/bin/orterun:orterun+0x1920
> /usr/local/openmpi-1.9_64_cc/bin/orterun:main+0x24
> /usr/local/openmpi-1.9_64_cc/bin/orterun:_start+0x12c
> [rs0:19724] *** End of error message ***
> ...
> (same output for the other two processes)
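
Another detail worth noting: the segmentation fault above reports the
signal code "Invalid permissions", while this run reports "Invalid
address alignment". On SPARC a load through a misaligned pointer raises
SIGBUS, so both failures may well come from the same bad pointer that
just happens to land on differently aligned garbage. A minimal, purely
illustrative snippet (not Open MPI code) that produces exactly this
kind of SIGBUS on SPARC:

  #include <stdio.h>

  int main(void)
  {
      char buf[16] = { 0 };
      /* Deliberately misaligned pointer: buf + 1 is not 8-byte aligned.
       * On SPARC the load below dies with SIGBUS ("invalid address
       * alignment"); on x86 the same load usually succeeds, so the bug
       * would stay hidden there. */
      long *p = (long *)(buf + 1);
      printf("%ld\n", *p);
      return 0;
  }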
>
>
> I get a segmentation fault for the following commands.
>
> mpiexec -report-bindings -np 2 -map-by slot -bind-to hwthread date
> mpiexec -report-bindings -np 2 -map-by numa -bind-to hwthread date
> mpiexec -report-bindings -np 2 -map-by node -bind-to hwthread date
>
>
> I get a bus error for the following command.
>
> mpiexec -report-bindings -np 2 -map-by socket -bind-to hwthread date
>
>
> The following commands work.
>
> rs0 fd1026 120 mpiexec -report-bindings -np 2 -map-by hwthread -bind-to hwthread date
> [rs0.informatik.hs-fulda.de:19788] MCW rank 0 bound to : [B./../../..][../../../..]
> [rs0.informatik.hs-fulda.de:19788] MCW rank 1 bound to : [.B/../../..][../../../..]
> Mon Sep 17 13:20:30 CEST 2012
> Mon Sep 17 13:20:30 CEST 2012
>
> rs0 fd1026 121 mpiexec -report-bindings -np 2 -map-by core -bind-to hwthread date
> [rs0.informatik.hs-fulda.de:19793] MCW rank 0 bound to : [B./../../..][../../../..]
> [rs0.informatik.hs-fulda.de:19793] MCW rank 1 bound to : [../B./../..][../../../..]
> Mon Sep 17 13:21:06 CEST 2012
> Mon Sep 17 13:21:06 CEST 2012
>
>
> I think that the following output is correct because I have a Sun M4000
> server with two quad-core processors, each core supporting two hardware
> threads.
>
> rs0 fd1026 124 mpiexec -report-bindings -np 2 -map-by board -bind-to hwthread date
> --------------------------------------------------------------------------
> The specified mapping policy is not recognized:
>
> Policy: BYBOARD
>
> Please check for a typo or ensure that the option is a supported
> one.
> --------------------------------------------------------------------------
>
>
> In my opinion I should be able to start and bind up to 16 processes
> if I map and bind to hwthreads, shouldn't I? Thank you very much for
> any help in advance.
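
For what it is worth, the expectation of 16 bindable hardware threads
(2 sockets x 4 cores x 2 threads per core) is easy to double-check
directly with hwloc, which Open MPI uses for its topology handling. A
small sketch using only standard hwloc calls (hwloc 1.x naming; the
counts are of course machine-dependent):

  #include <stdio.h>
  #include <hwloc.h>

  int main(void)
  {
      hwloc_topology_t topo;

      hwloc_topology_init(&topo);
      hwloc_topology_load(topo);

      /* count sockets, cores and PUs (hardware threads) */
      printf("sockets: %d\n",
             hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_SOCKET));
      printf("cores:   %d\n",
             hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_CORE));
      printf("PUs:     %d\n",
             hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_PU));

      hwloc_topology_destroy(topo);
      return 0;
  }

If this reports 16 PUs, then mapping and binding 16 processes by
hwthread should indeed be possible, and the crashes above would point
at a problem in the runtime rather than at a lack of resources.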
>
>
> Kind regards
>
> Siegmar
>
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users