Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] segmentation fault / bus error in openmpi-1.9a1r27342(Solaris, Oracle C)
From: Ralph Castain (rhc_at_[hidden])
Date: 2012-09-19 23:06:03


Well, unfortunately it all works fine for me on a Linux x86_64 box. Every command works without issue, though I think my -map-by node -bind-to hwthread doesn't quite generate the pattern I think it should (will have to look more closely at that one).
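
For comparing binding patterns between runs, a small bash helper can decode the -report-bindings maps quoted below. This is only a sketch under an assumed reading of the map: each [...] group is a socket, "/" separates cores, each remaining character is one hardware thread, and "B" marks a thread the rank is bound to. That reading is consistent with the two quad-core, two-thread processors described in the report, but it is not taken from the Open MPI documentation. The function names are made up for illustration.

```shell
#!/bin/bash
# Decode an mpiexec -report-bindings map such as [B./../../..][../../../..].
# Assumed reading (not from the Open MPI docs): [] = socket, / = core
# separator, one character per hardware thread, B = bound.
# count_hwthreads and bound_hwthreads are hypothetical helper names.

count_hwthreads() {
    # Keep only the per-hwthread characters, then count them.
    local slots=${1//[^B.]/}
    echo "${#slots}"
}

bound_hwthreads() {
    # Print the zero-based indices of the hwthreads marked B.
    local slots=${1//[^B.]/}
    local i out=""
    for ((i = 0; i < ${#slots}; i++)); do
        if [[ ${slots:i:1} == B ]]; then
            out+="$i "
        fi
    done
    echo "${out% }"
}

count_hwthreads '[B./../../..][../../../..]'    # prints 16
bound_hwthreads '[../B./../..][../../../..]'    # prints 2
```

The index printed for the second map matches the "Binding: 2[2]" entry that -display-devel-map shows for rank 1 in the quoted output.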

Outside of that, all works fine. I'm not sure what would be causing the bus errors, nor why changing options would make a difference - most likely cause is that some memory corruption occurs because of the hwloc issues, but that's just a guess.
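
To see in one pass which options trigger a failure, the mapping policies from the report can be swept in a loop. The sketch below uses only the mpiexec options that appear in the quoted commands. The MPIEXEC variable and the function name are assumptions for illustration, and the loop only prints whatever exit status mpiexec returns, without assuming how that status encodes a signal.

```shell
#!/bin/bash
# Run the same command under each -map-by policy tried in the report and
# print mpiexec's exit status for each one.
# MPIEXEC and sweep_map_by are hypothetical names, not part of Open MPI.
MPIEXEC=${MPIEXEC:-mpiexec}

sweep_map_by() {
    local policy status
    for policy in slot numa node socket hwthread core; do
        status=0
        # Output is discarded here; drop the redirection to see the bindings.
        "$MPIEXEC" -report-bindings -np 2 -map-by "$policy" \
            -bind-to hwthread date >/dev/null 2>&1 || status=$?
        echo "map-by $policy: exit status $status"
    done
}

# Only run the sweep when the launcher is actually installed.
if command -v "$MPIEXEC" >/dev/null 2>&1; then
    sweep_map_by
fi
```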

If you want to put gdb on the core dump and see where it breaks, I could take a look. However, I'm not sure I can solve the basic Solaris issue you're encountering.
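
One way to get that backtrace without an interactive session is gdb's batch mode. The sketch below assumes the core file was written by orterun (the frames in the quoted backtrace end in orterun:main) and that it sits in the current directory as ./core. Both the core location and the function name are assumptions, so adjust them to the local setup.

```shell
#!/bin/bash
# Print a backtrace from a core file non-interactively.
# ORTERUN is the path shown in the quoted backtrace; CORE is an assumed
# location. core_backtrace is a hypothetical helper name.
ORTERUN=${ORTERUN:-/usr/local/openmpi-1.9_64_cc/bin/orterun}
CORE=${CORE:-./core}

core_backtrace() {
    local exe=$1 core=$2
    if [ ! -f "$core" ]; then
        echo "no core file at $core (check 'ulimit -c')" >&2
        return 1
    fi
    # -batch exits after running the -ex commands; "bt" prints the call stack.
    gdb -batch -ex bt "$exe" "$core"
}

core_backtrace "$ORTERUN" "$CORE" || true
```

If no core file appears at all, core dumps may be disabled for the shell; "ulimit -c unlimited" before running mpiexec lifts the size limit for that session.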

On Sep 17, 2012, at 5:24 AM, Siegmar Gross <Siegmar.Gross_at_[hidden]> wrote:

> Hi,
>
> I just found out that I get no segmentation fault or bus error if I
> add "-display-devel-map" to the commands.
>
> rs0 fd1026 110 mpiexec -report-bindings -np 3 -bind-to hwthread -display-devel-map date
>
> Mapper requested: NULL Last mapper: round_robin Mapping policy: BYSLOT Ranking policy: SLOT Binding policy:
> HWTHREAD[HWTHREAD] Cpu set: NULL PPR: NULL
> Num new daemons: 0 New daemon starting vpid INVALID
> Num nodes: 1
>
> Data for node: rs0.informatik.hs-fulda.de Launch id: -1 State: 2
> Daemon: [[10411,0],0] Daemon launched: True
> Num slots: 1 Slots in use: 1 Oversubscribed: TRUE
> Num slots allocated: 1 Max slots: 0
> Username on node: NULL
> Num procs: 3 Next node_rank: 3
> Data for proc: [[10411,1],0]
> Pid: 0 Local rank: 0 Node rank: 0 App rank: 0
> State: INITIALIZED Restarts: 0 App_context: 0 Locale: 0-15 Binding: 0[0]
> Data for proc: [[10411,1],1]
> Pid: 0 Local rank: 1 Node rank: 1 App rank: 1
> State: INITIALIZED Restarts: 0 App_context: 0 Locale: 0-15 Binding: 2[2]
> Data for proc: [[10411,1],2]
> Pid: 0 Local rank: 2 Node rank: 2 App rank: 2
> State: INITIALIZED Restarts: 0 App_context: 0 Locale: 0-15 Binding: 4[4]
> [rs0.informatik.hs-fulda.de:20492] MCW rank 0 bound to : [B./../../..][../../../..]
> [rs0.informatik.hs-fulda.de:20492] MCW rank 1 bound to : [../B./../..][../../../..]
> [rs0.informatik.hs-fulda.de:20492] MCW rank 2 bound to : [../../B./..][../../../..]
> Mon Sep 17 14:20:50 CEST 2012
> Mon Sep 17 14:20:50 CEST 2012
> Mon Sep 17 14:20:50 CEST 2012
>
>
>
> rs0 fd1026 111 mpiexec -report-bindings -np 2 -bynode -bind-to hwthread -display-devel-map date
>
> Mapper requested: NULL Last mapper: round_robin Mapping policy: BYNODE Ranking policy: NODE Binding policy:
> HWTHREAD[HWTHREAD] Cpu set: NULL PPR: NULL
> Num new daemons: 0 New daemon starting vpid INVALID
> Num nodes: 1
>
> Data for node: rs0.informatik.hs-fulda.de Launch id: -1 State: 2
> Daemon: [[10417,0],0] Daemon launched: True
> Num slots: 1 Slots in use: 1 Oversubscribed: TRUE
> Num slots allocated: 1 Max slots: 0
> Username on node: NULL
> Num procs: 2 Next node_rank: 2
> Data for proc: [[10417,1],0]
> Pid: 0 Local rank: 0 Node rank: 0 App rank: 0
> State: INITIALIZED Restarts: 0 App_context: 0 Locale: 0-15 Binding: 0[0]
> Data for proc: [[10417,1],1]
> Pid: 0 Local rank: 1 Node rank: 1 App rank: 1
> State: INITIALIZED Restarts: 0 App_context: 0 Locale: 0-15 Binding: 2[2]
> [rs0.informatik.hs-fulda.de:20502] MCW rank 0 bound to : [B./../../..][../../../..]
> [rs0.informatik.hs-fulda.de:20502] MCW rank 1 bound to : [../B./../..][../../../..]
> Mon Sep 17 14:22:10 CEST 2012
> Mon Sep 17 14:22:10 CEST 2012
>
>
> Any ideas why an additional option "solves" the problem?
>
>
> Kind regards
>
> Siegmar
>
>
>
>> I have installed openmpi-1.9a1r27342 on Solaris 10 with Oracle
>> Solaris Studio compiler 12.3.
>>
>> rs0 fd1026 106 mpicc -showme
>> cc -I/usr/local/openmpi-1.9_64_cc/include -mt -m64 \
>> -L/usr/local/openmpi-1.9_64_cc/lib64 -lmpi -lpicl -lm -lkstat \
>> -llgrp -lsocket -lnsl -lrt -lm
>>
>> I can run the following command.
>>
>> rs0 fd1026 107 mpiexec -report-bindings -np 2 -bind-to hwthread date
>> [rs0.informatik.hs-fulda.de:19704] MCW rank 0 bound to :
>> [B./../../..][../../../..]
>> [rs0.informatik.hs-fulda.de:19704] MCW rank 1 bound to :
>> [../B./../..][../../../..]
>> Mon Sep 17 13:07:34 CEST 2012
>> Mon Sep 17 13:07:34 CEST 2012
>>
>> I get a segmentation fault if I increase the number of processes to 3.
>>
>> rs0 fd1026 108 mpiexec -report-bindings -np 3 -bind-to hwthread date
>> --------------------------------------------------------------------------
>> mpiexec noticed that process rank 0 with PID 19711 on node
>> rs0.informatik.hs-fulda.de exited on signal 11 (Segmentation Fault).
>> --------------------------------------------------------------------------
>> [rs0:19713] *** Process received signal ***
>> [rs0:19713] Signal: Segmentation Fault (11)
>> [rs0:19713] Signal code: Invalid permissions (2)
>> [rs0:19713] Failing at address: 1000002e8
>> /usr/local/openmpi-1.9_64_cc/lib64/libopen-rte.so.0.0.0:0x282640
>> /lib/sparcv9/libc.so.1:0xd8684
>> /lib/sparcv9/libc.so.1:0xcc1f8
>> /lib/sparcv9/libc.so.1:0xcc404
>> /usr/local/openmpi-1.9_64_cc/lib64/libopen-rte.so.0.0.0:0x2c1488 [ Signal 11 (SEGV)]
>> /usr/local/openmpi-1.9_64_cc/lib64/libopen-rte.so.0.0.0:opal_hwloc_base_cset2str+0x28
>> /usr/local/openmpi-1.9_64_cc/lib64/openmpi/mca_odls_default.so:0xab00
>> /usr/local/openmpi-1.9_64_cc/lib64/openmpi/mca_odls_default.so:0xb7e4
>> /usr/local/openmpi-1.9_64_cc/lib64/libopen-rte.so.0.0.0:orte_odls_base_default_launch_local+0xa20
>> /usr/local/openmpi-1.9_64_cc/lib64/libopen-rte.so.0.0.0:0x2997f4
>> /usr/local/openmpi-1.9_64_cc/lib64/libopen-rte.so.0.0.0:0x299a20
>> /usr/local/openmpi-1.9_64_cc/lib64/libopen-rte.so.0.0.0:opal_libevent2019_event_base_loop+0x1e8
>> /usr/local/openmpi-1.9_64_cc/bin/orterun:orterun+0x1920
>> /usr/local/openmpi-1.9_64_cc/bin/orterun:main+0x24
>> /usr/local/openmpi-1.9_64_cc/bin/orterun:_start+0x12c
>> [rs0:19713] *** End of error message ***
>> ...
>> (same output for the other two processes)
>>
>>
>> If I add "-bynode" I get a bus error.
>>
>> rs0 fd1026 110 mpiexec -report-bindings -np 2 -bynode -bind-to hwthread date
>> --------------------------------------------------------------------------
>> mpiexec noticed that process rank 0 with PID 19724 on node
>> rs0.informatik.hs-fulda.de exited on signal 10 (Bus Error).
>> --------------------------------------------------------------------------
>> [rs0:19724] *** Process received signal ***
>> [rs0:19724] Signal: Bus Error (10)
>> [rs0:19724] Signal code: Invalid address alignment (1)
>> [rs0:19724] Failing at address: 1
>> /usr/local/openmpi-1.9_64_cc/lib64/libopen-rte.so.0.0.0:0x282640
>> /lib/sparcv9/libc.so.1:0xd8684
>> /lib/sparcv9/libc.so.1:0xcc1f8
>> /lib/sparcv9/libc.so.1:0xcc404
>> /usr/local/openmpi-1.9_64_cc/lib64/libopen-rte.so.0.0.0:0x2c147c [ Signal 10 (BUS)]
>> /usr/local/openmpi-1.9_64_cc/lib64/libopen-rte.so.0.0.0:opal_hwloc_base_cset2str+0x28
>> /usr/local/openmpi-1.9_64_cc/lib64/openmpi/mca_odls_default.so:0xab00
>> /usr/local/openmpi-1.9_64_cc/lib64/openmpi/mca_odls_default.so:0xb7e4
>> /usr/local/openmpi-1.9_64_cc/lib64/libopen-rte.so.0.0.0:orte_odls_base_default_launch_local+0xa20
>> /usr/local/openmpi-1.9_64_cc/lib64/libopen-rte.so.0.0.0:0x2997f4
>> /usr/local/openmpi-1.9_64_cc/lib64/libopen-rte.so.0.0.0:0x299a20
>> /usr/local/openmpi-1.9_64_cc/lib64/libopen-rte.so.0.0.0:opal_libevent2019_event_base_loop+0x1e8
>> /usr/local/openmpi-1.9_64_cc/bin/orterun:orterun+0x1920
>> /usr/local/openmpi-1.9_64_cc/bin/orterun:main+0x24
>> /usr/local/openmpi-1.9_64_cc/bin/orterun:_start+0x12c
>> [rs0:19724] *** End of error message ***
>> ...
>> (same output for the other two processes)
>>
>>
>> I get a segmentation fault for the following commands.
>>
>> mpiexec -report-bindings -np 2 -map-by slot -bind-to hwthread date
>> mpiexec -report-bindings -np 2 -map-by numa -bind-to hwthread date
>> mpiexec -report-bindings -np 2 -map-by node -bind-to hwthread date
>>
>>
>> I get a bus error for the following command.
>>
>> mpiexec -report-bindings -np 2 -map-by socket -bind-to hwthread date
>>
>>
>> The following commands work.
>>
>> rs0 fd1026 120 mpiexec -report-bindings -np 2 -map-by hwthread -bind-to hwthread date
>> [rs0.informatik.hs-fulda.de:19788] MCW rank 0 bound to : [B./../../..][../../../..]
>> [rs0.informatik.hs-fulda.de:19788] MCW rank 1 bound to : [.B/../../..][../../../..]
>> Mon Sep 17 13:20:30 CEST 2012
>> Mon Sep 17 13:20:30 CEST 2012
>>
>> rs0 fd1026 121 mpiexec -report-bindings -np 2 -map-by core -bind-to hwthread date
>> [rs0.informatik.hs-fulda.de:19793] MCW rank 0 bound to : [B./../../..][../../../..]
>> [rs0.informatik.hs-fulda.de:19793] MCW rank 1 bound to : [../B./../..][../../../..]
>> Mon Sep 17 13:21:06 CEST 2012
>> Mon Sep 17 13:21:06 CEST 2012
>>
>>
>> I think that the following output is correct because I have a Sun M4000
>> server with two quad-core processors each supporting two hardware-threads.
>>
>> rs0 fd1026 124 mpiexec -report-bindings -np 2 -map-by board -bind-to hwthread date
>> --------------------------------------------------------------------------
>> The specified mapping policy is not recognized:
>>
>> Policy: BYBOARD
>>
>> Please check for a typo or ensure that the option is a supported
>> one.
>> --------------------------------------------------------------------------
>>
>>
>> In my opinion I should be able to start and bind up to 16 processes
>> if I map and bind to hwthreads, or not? Thank you very much in
>> advance for any help.
>>
>>
>> Kind regards
>>
>> Siegmar
>>
>> _______________________________________________
>> users mailing list
>> users_at_[hidden]
>> http://www.open-mpi.org/mailman/listinfo.cgi/users