Open MPI Development Mailing List Archives

From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2006-10-13 16:19:50


Karl --

Yikes. This looks like an alignment or memory write-ordering error. I
have a dim recollection of making some fixes for this, but I am on a
plane at the moment and cannot check the SVN logs.
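
To make the alignment guess easier to test, here is a minimal Python
sketch (not Open MPI code) that checks how the two addresses from the
"Failing at addr:" lines quoted below are aligned. The helper names
is_aligned and largest_alignment are made up for this illustration.

```python
# Failing addresses copied from the two "Failing at addr:" lines quoted below.
FAILING_ADDRS = [0xFDFF8018, 0x2807000]


def is_aligned(addr: int, alignment: int) -> bool:
    """Return True if addr is a multiple of alignment (a power of two)."""
    if alignment <= 0 or alignment & (alignment - 1):
        raise ValueError("alignment must be a positive power of two")
    return addr & (alignment - 1) == 0


def largest_alignment(addr: int, limit: int = 4096) -> int:
    """Return the largest power of two, capped at limit, that divides addr."""
    alignment = 1
    while alignment < limit and is_aligned(addr, alignment * 2):
        alignment *= 2
    return alignment


if __name__ == "__main__":
    for addr in FAILING_ADDRS:
        print(f"{addr:#010x}: aligned to {largest_alignment(addr)} bytes")
```

For these two values it reports 8 bytes for 0xfdff8018 and 4096 bytes
for 0x2807000, so neither address is itself oddly aligned for a 4- or
8-byte access. That is only a hint: it does not tell us what the
faulting instruction was, and it does not rule out a write-ordering
problem.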

Could you try the latest 1.1.2 release candidate and see whether the
problem still occurs? It is available from the general download page on
the web site.

Thanks!

On Oct 7, 2006, at 7:34 PM, Karl Dockendorf wrote:

> I just (yesterday) made the move from LAM/MPI to Open MPI. The
> configure / compile / install went smoothly (version 1.1.1).
> However, after recompiling my source and executing it, it usually
> crashes in MPI_INIT. The crash seems to come from the same place
> MOST of the time, and it usually spits out a message like this:
>
> Signal:10 info.si_errno:0(Unknown error: 0) si_code:1(BUS_ADRALN)
> Failing at addr:0xfdff8018
> *** End of error message ***
> Signal:10 info.si_errno:0(Unknown error: 0) si_code:1(BUS_ADRALN)
> Failing at addr:0x2807000
> *** End of error message ***
>
> The test system (before moving back to the cluster) is a G4
> PowerBook running Mac OS X 10.4.8 (not using Xgrid at the moment).
> I'm oversubscribing it (2 processes; it knows there is only one
> processor). Attached is the config info from the install. Listed
> below is what seems to be the crash point, in the
> mca_bml_r2_progress function. Any help is much appreciated.
>
> Karl
>
> CRASH 1:
> Command: nm
> Path: /Users/karl/programs/nm/build/Release/nm
> Parent: orted [830]
>
> Version: ??? (???)
>
> PID: 834
> Thread: 0
>
> Exception: EXC_BAD_ACCESS (0x0001)
> Codes: KERN_INVALID_ADDRESS (0x0001) at 0xfdff8018
>
> Thread 0 Crashed:
> 0 mca_btl_sm.so 0x003abbec mca_btl_sm_component_progress + 3164
> 1 mca_bml_r2.so 0x003a0d38 mca_bml_r2_progress + 88
> 2 libopal.0.dylib 0x0032309c opal_progress + 236
> 3 mca_oob_tcp.so 0x00024f14 mca_oob_tcp_msg_wait + 52
> 4 mca_oob_tcp.so 0x0002a0a8 mca_oob_tcp_recv + 1128
> 5 liborte.0.dylib 0x002f07b0 mca_oob_recv_packed + 80
> 6 mca_gpr_proxy.so 0x00059bd4 orte_gpr_proxy_put + 804
> 7 liborte.0.dylib 0x00304318 orte_soh_base_set_proc_soh + 968
> 8 libmpi.0.dylib 0x00222d88 ompi_mpi_init + 1816
> 9 libmpi.0.dylib 0x00248b50 MPI_Init + 240
> 10 nm 0x00002e60 init_model + 48
> 11 nm 0x00002c70 main + 48
> 12 nm 0x00002494 _start + 340 (crt.c:272)
> 13 nm 0x0000233c start + 60
>
> Thread 0 crashed with PPC Thread State 64:
> srr0: 0x00000000003abbec srr1: 0x000000000200f930 vrsave: 0x0000000000000000
> cr: 0x28004222 xer: 0x0000000000000004 lr: 0x00000000003aafa0 ctr: 0x00000000003aaf90
> r0: 0x0000000000000000 r1: 0x00000000bfffe8d0 r2: 0x00000000fdff8000 r3: 0x0000000000000001
> r4: 0x0000000000049814 r5: 0x00000000bfffe888 r6: 0x0000000000000000 r7: 0x00000000fdff8000
> r8: 0x0000000000000004 r9: 0x00000000004177e0 r10: 0x0000000000000004 r11: 0x0000000000000000
> r12: 0x00000000003aaf90 r13: 0x00000000fffffffe r14: 0x00000000003ad004 r15: 0x00000000003441e8
> r16: 0x00000000003ad8c4 r17: 0x0000000000000004 r18: 0x0000000000000000 r19: 0x0000000000000000
> r20: 0x0000000000000014 r21: 0x0000000000000000 r22: 0x00000000003ae0c4 r23: 0x0000000000000001
> r24: 0x0000000000000000 r25: 0x0000000000000004 r26: 0x0000000000029c50 r27: 0x0000000000000000
> r28: 0x0000000000000000 r29: 0x0000000000000001 r30: 0x0000000000000000 r31: 0x00000000003aafa0
>
>
>
> CRASH 2:
> Command: nm
> Path: /Users/karl/programs/nm/build/Release/nm
> Parent: orted [830]
>
> Version: ??? (???)
>
> PID: 832
> Thread: 0
>
> Exception: EXC_BAD_ACCESS (0x0001)
> Codes: KERN_PROTECTION_FAILURE (0x0002) at 0x00000000
>
> Thread 0 Crashed:
> 0 <<00000000>> 0x00000000 0 + 0
> 1 mca_bml_r2.so 0x003a0d38 mca_bml_r2_progress + 88
> 2 libopal.0.dylib 0x0032309c opal_progress + 236
> 3 mca_oob_tcp.so 0x00024f14 mca_oob_tcp_msg_wait + 52
> 4 mca_oob_tcp.so 0x0002a0a8 mca_oob_tcp_recv + 1128
> 5 liborte.0.dylib 0x002f07b0 mca_oob_recv_packed + 80
> 6 mca_gpr_proxy.so 0x00059bd4 orte_gpr_proxy_put + 804
> 7 liborte.0.dylib 0x00304318 orte_soh_base_set_proc_soh + 968
> 8 libmpi.0.dylib 0x00222d88 ompi_mpi_init + 1816
> 9 libmpi.0.dylib 0x00248b50 MPI_Init + 240
> 10 nm 0x00002e60 init_model + 48
> 11 nm 0x00002c70 main + 48
> 12 nm 0x00002494 _start + 340 (crt.c:272)
> 13 nm 0x0000233c start + 60
>
> Thread 0 crashed with PPC Thread State 64:
> srr0: 0x0000000000000000 srr1: 0x000000004000d930 vrsave: 0x0000000000000000
> cr: 0x28004222 xer: 0x0000000000000004 lr: 0x00000000003abe5c ctr: 0x0000000000000000
> r0: 0x0000000000000000 r1: 0x00000000bfffe8d0 r2: 0x0000000002008000 r3: 0x00000000003ad864
> r4: 0x0000000000000000 r5: 0x0000000002008000 r6: 0x0000000000000000 r7: 0x0000000002008000
> r8: 0x00000000003ad8c4 r9: 0x00000000004177e0 r10: 0x0000000000000000 r11: 0x0000000000000000
> r12: 0x0000000000000000 r13: 0x00000000fffffffe r14: 0x00000000003ad004 r15: 0x00000000003441e8
> r16: 0x00000000003ad8c4 r17: 0x0000000000000000 r18: 0x0000000000000000 r19: 0x0000000000000000
> r20: 0x0000000000000000 r21: 0x0000000000000000 r22: 0x00000000003ae0c4 r23: 0x00000000003441e8
> r24: 0x0000000000000000 r25: 0x0000000002008000 r26: 0x00000000003ae0c4 r27: 0x0000000000000001
> r28: 0x0000000000000004 r29: 0x0000000000000001 r30: 0x0000000000000000 r31: 0x00000000003aafa0
>
>
>
>
> CRASH 3:
> Command: nm
> Path: /Users/karl/programs/nm/build/Debug/nm
> Parent: orted [1790]
>
> Version: ??? (???)
>
> PID: 1794
> Thread: 0
>
> Exception: EXC_BAD_ACCESS (0x0001)
> Codes: KERN_INVALID_ADDRESS (0x0001) at 0xfdff8018
>
> Thread 0 Crashed:
> 0 mca_btl_sm.so 0x003bcbec mca_btl_sm_component_progress + 3164
> 1 mca_bml_r2.so 0x003b1d38 mca_bml_r2_progress + 88
> 2 libopal.0.dylib 0x0032309c opal_progress + 236
> 3 mca_oob_tcp.so 0x00055f14 mca_oob_tcp_msg_wait + 52
> 4 mca_oob_tcp.so 0x0005b0a8 mca_oob_tcp_recv + 1128
> 5 liborte.0.dylib 0x002f07b0 mca_oob_recv_packed + 80
> 6 mca_gpr_proxy.so 0x00068bd4 orte_gpr_proxy_put + 804
> 7 liborte.0.dylib 0x00304318 orte_soh_base_set_proc_soh + 968
> 8 libmpi.0.dylib 0x00222d88 ompi_mpi_init + 1816
> 9 libmpi.0.dylib 0x00248b50 MPI_Init + 240
> 10 nm 0x000028fc init_model + 80 (model.c:16)
> 11 nm 0x00002644 main + 72 (main.c:16)
> 12 nm 0x00001e54 _start + 340 (crt.c:272)
> 13 nm 0x00001cfc start + 60
>
> Thread 0 crashed with PPC Thread State 64:
> srr0: 0x00000000003bcbec srr1: 0x000000000200f930 vrsave: 0x0000000000000000
> cr: 0x28004222 xer: 0x0000000000000004 lr: 0x00000000003bbfa0 ctr: 0x00000000003bbf90
> r0: 0x0000000000000000 r1: 0x00000000bfffe8f0 r2: 0x00000000fdff8000 r3: 0x0000000000000001
> r4: 0x0000000000049814 r5: 0x00000000bfffe8a8 r6: 0x0000000000000000 r7: 0x00000000fdff8000
> r8: 0x0000000000000004 r9: 0x00000000004177d0 r10: 0x0000000000000004 r11: 0x0000000000000000
> r12: 0x00000000003bbf90 r13: 0x00000000fffffffe r14: 0x00000000003be004 r15: 0x00000000003441e8
> r16: 0x00000000003be8c4 r17: 0x0000000000000004 r18: 0x0000000000000000 r19: 0x0000000000000000
> r20: 0x0000000000000014 r21: 0x0000000000000000 r22: 0x00000000003bf0c4 r23: 0x0000000000000001
> r24: 0x0000000000000000 r25: 0x0000000000000004 r26: 0x000000000005ac50 r27: 0x0000000000000000
> r28: 0x0000000000000000 r29: 0x0000000000000001 r30: 0x0000000000000000 r31: 0x00000000003bbfa0
>
>
>
> CRASH 4:
> Command: nm
> Path: /Users/karl/programs/nm/build/Debug/nm
> Parent: orted [1790]
>
> Version: ??? (???)
>
> PID: 1792
> Thread: 0
>
> Exception: EXC_BAD_ACCESS (0x0001)
> Codes: KERN_PROTECTION_FAILURE (0x0002) at 0x00000000
>
> Thread 0 Crashed:
> 0 <<00000000>> 0x00000000 0 + 0
> 1 mca_bml_r2.so 0x003b1d38 mca_bml_r2_progress + 88
> 2 libopal.0.dylib 0x0032309c opal_progress + 236
> 3 mca_oob_tcp.so 0x00055f14 mca_oob_tcp_msg_wait + 52
> 4 mca_oob_tcp.so 0x0005b0a8 mca_oob_tcp_recv + 1128
> 5 liborte.0.dylib 0x002f07b0 mca_oob_recv_packed + 80
> 6 mca_gpr_proxy.so 0x00068bd4 orte_gpr_proxy_put + 804
> 7 liborte.0.dylib 0x00304318 orte_soh_base_set_proc_soh + 968
> 8 libmpi.0.dylib 0x00222d88 ompi_mpi_init + 1816
> 9 libmpi.0.dylib 0x00248b50 MPI_Init + 240
> 10 nm 0x000028fc init_model + 80 (model.c:16)
> 11 nm 0x00002644 main + 72 (main.c:16)
> 12 nm 0x00001e54 _start + 340 (crt.c:272)
> 13 nm 0x00001cfc start + 60
>
> Thread 0 crashed with PPC Thread State 64:
> srr0: 0x0000000000000000 srr1: 0x000000004000d930 vrsave: 0x0000000000000000
> cr: 0x28004222 xer: 0x0000000000000004 lr: 0x00000000003bce5c ctr: 0x0000000000000000
> r0: 0x0000000000000000 r1: 0x00000000bfffe8f0 r2: 0x0000000002008000 r3: 0x00000000003be864
> r4: 0x0000000000000000 r5: 0x0000000002008000 r6: 0x0000000000000000 r7: 0x0000000002008000
> r8: 0x00000000003be8c4 r9: 0x00000000004177d0 r10: 0x0000000000000000 r11: 0x0000000000000000
> r12: 0x0000000000000000 r13: 0x00000000fffffffe r14: 0x00000000003be004 r15: 0x00000000003441e8
> r16: 0x00000000003be8c4 r17: 0x0000000000000000 r18: 0x0000000000000000 r19: 0x0000000000000000
> r20: 0x0000000000000000 r21: 0x0000000000000000 r22: 0x00000000003bf0c4 r23: 0x00000000003441e8
> r24: 0x0000000000000000 r25: 0x0000000002008000 r26: 0x00000000003bf0c4 r27: 0x0000000000000001
> r28: 0x0000000000000004 r29: 0x0000000000000001 r30: 0x0000000000000000 r31: 0x00000000003bbfa0
>
>
>
>
>
>
> <info.tar.gz>
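
Reading the four reports above side by side, a little address
arithmetic suggests they share one origin. The sketch below (Python,
using only values copied from the reports; the dictionary layout and
the summarize helper are hypothetical) subtracts the frame offset from
the frame 0 address to get the start of mca_btl_sm_component_progress,
then expresses the other values relative to it. It assumes that lr in
crashes 2 and 4 still holds the return address of the call that jumped
to 0x00000000.

```python
# Values copied from the crash reports quoted above.
# "release" = crashes 1 and 2 (Release build of nm),
# "debug"   = crashes 3 and 4 (Debug build of nm).
REPORTS = {
    "release": {
        "frame_pc": 0x003ABBEC,      # crash 1, frame 0 address
        "frame_offset": 3164,        # crash 1, "+ 3164"
        "fault_addr": 0xFDFF8018,    # crash 1, KERN_INVALID_ADDRESS
        "r2_fault": 0xFDFF8000,      # crash 1, r2
        "r2_null": 0x02008000,       # crash 2, r2
        "lr_null": 0x003ABE5C,       # crash 2, lr
    },
    "debug": {
        "frame_pc": 0x003BCBEC,      # crash 3, frame 0 address
        "frame_offset": 3164,        # crash 3, "+ 3164"
        "fault_addr": 0xFDFF8018,    # crash 3, KERN_INVALID_ADDRESS
        "r2_fault": 0xFDFF8000,      # crash 3, r2
        "r2_null": 0x02008000,       # crash 4, r2
        "lr_null": 0x003BCE5C,       # crash 4, lr
    },
}


def summarize(report: dict) -> dict:
    """Derive relative offsets from one pair of crash reports."""
    base = report["frame_pc"] - report["frame_offset"]
    return {
        # Start of mca_btl_sm_component_progress in this run.
        "function_base": base,
        # Distance of the bad access from the pointer held in r2.
        "fault_offset_from_r2": report["fault_addr"] - report["r2_fault"],
        # Where lr points, relative to the function start (assumes lr is
        # the return address of the branch to address 0).
        "lr_offset_in_function": report["lr_null"] - base,
        # True if the two r2 values cancel out in 32-bit arithmetic.
        "r2_values_negate": (report["r2_fault"] + report["r2_null"]) % 2**32 == 0,
    }


if __name__ == "__main__":
    for name, report in REPORTS.items():
        result = summarize(report)
        print(name, {k: hex(v) if not isinstance(v, bool) else v
                     for k, v in result.items()})
```

In both runs the function start works out to the value held in ctr and
r12 of the faulting crash (0x3aaf90 and 0x3bbf90), the bad access sits
0x18 bytes past r2, and lr points 3788 bytes past the start of that
same function. The r2 values 0xfdff8000 and 0x02008000 also sum to zero
modulo 2^32, which may or may not be a coincidence. None of this
identifies the bug, but it is consistent with all four crashes coming
from the shared-memory BTL progress function.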

-- 
Jeff Squyres
Server Virtualization Business Unit
Cisco Systems