Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] 1.4.3rc1 over MX
From: Scott Atchley (atchley_at_[hidden])
Date: 2010-09-01 09:10:22


Jeff,

I posted a patch for this on the ticket.

Scott

On Aug 26, 2010, at 10:10 AM, Scott Atchley wrote:

> Hi all,
>
> I compiled 1.4.3rc1 with MX 1.2.12 on RHEL 5.4 (2.6.18-164.el5). It does not like the memory manager and MX. Compiling using --without-memory-manager works fine. The output below is from the default configure (i.e. --with-memory-manager).
>
> Note, I still see unusual latencies for some tests (reduce-scatter, allgather, etc.) when using the BTL. I do not see them with the MTL. An example of BTL latencies from reduce-scatter is:
>
> 256 1000 7.01 7.01 7.01
> 512 1000 7.56 7.56 7.56
> 1024 1000 8.58 8.58 8.58
> 2048 1000 10.36 10.36 10.36
> 4096 1000 14.49 14.49 14.49
> 8192 1000 5180.16 5180.57 5180.36
> 16384 1000 94.96 94.97 94.96
> 32768 1000 4676.30 4676.68 4676.49
> 65536 640 4625.85 4626.23 4626.04
> 131072 320 243.43 243.46 243.45
> 262144 160 425.56 425.66 425.61
>
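The irregular rows in the reduce-scatter numbers above can be picked out mechanically. The following is only a rough sketch: it assumes the five columns are message size in bytes, repetitions, and t_min, t_max, t_avg in microseconds (the usual IMB layout, not stated in the message), and it uses a made-up heuristic, flagging any size whose average latency is more than `factor` times the lowest average seen at a larger size. The names `parse_rows` and `flag_outliers` are hypothetical.

```python
# Rows copied from the quoted reduce-scatter output.
# Assumed columns: bytes, repetitions, t_min, t_max, t_avg (usec).
IMB_ROWS = """\
256 1000 7.01 7.01 7.01
512 1000 7.56 7.56 7.56
1024 1000 8.58 8.58 8.58
2048 1000 10.36 10.36 10.36
4096 1000 14.49 14.49 14.49
8192 1000 5180.16 5180.57 5180.36
16384 1000 94.96 94.97 94.96
32768 1000 4676.30 4676.68 4676.49
65536 640 4625.85 4626.23 4626.04
131072 320 243.43 243.46 243.45
262144 160 425.56 425.66 425.61
"""


def parse_rows(text):
    """Turn whitespace-separated benchmark rows into typed tuples."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 5:
            continue  # skip anything that is not a data row
        size, reps = int(fields[0]), int(fields[1])
        t_min, t_max, t_avg = (float(f) for f in fields[2:])
        rows.append((size, reps, t_min, t_max, t_avg))
    return rows


def flag_outliers(rows, factor=2.0):
    """Return sizes whose t_avg exceeds factor * the smallest t_avg of any larger size.

    Latency is expected to grow with message size, so a small message that is
    much slower than some larger message is treated as suspicious.
    """
    flagged = []
    for i, (size, _reps, _t_min, _t_max, t_avg) in enumerate(rows):
        later = [row[4] for row in rows[i + 1:]]
        if later and t_avg > factor * min(later):
            flagged.append(size)
    return flagged


if __name__ == "__main__":
    print(flag_outliers(parse_rows(IMB_ROWS)))
```

With the default `factor` this flags the 8192, 32768 and 65536 byte rows, which are the ones that stand out by eye. The threshold is arbitrary and would need tuning for other data.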
> Scott
>
> % mpirun -hostfile hosts -np 2 ./IMB-MPI1.ompi-1.4.3rc1 pingpong
> [rain16:22509] *** Process received signal ***
> [rain16:22509] Signal: Segmentation fault (11)
> [rain16:22509] Signal code: Address not mapped (1)
> [rain16:22509] Failing at address: 0x2c0
> [rain15:24145] *** Process received signal ***
> [rain15:24145] Signal: Segmentation fault (11)
> [rain15:24145] Signal code: Address not mapped (1)
> [rain15:24145] Failing at address: 0x25a0
> --------------------------------------------------------------------------
> mpirun noticed that process rank 1 with PID 22509 on node rain16 exited on signal 11 (Segmentation fault).
> --------------------------------------------------------------------------
>
> gdb shows:
>
> #0 0x0000003d084075c8 in ?? () from /lib64/libgcc_s.so.1
> (gdb) bt
> #0 0x0000003d084075c8 in ?? () from /lib64/libgcc_s.so.1
> #1 0x0000003d0840882b in _Unwind_Backtrace () from /lib64/libgcc_s.so.1
> #2 0x0000003d060e5eb8 in backtrace () from /lib64/libc.so.6
> #3 0x00002af68e7a47de in opal_backtrace_buffer ()
> from /nfs/home/atchley/projects/openmpi-1.4.3rc1/build/rain/lib/libopen-pal.so.0
> #4 0x00002af68e7a24ce in show_stackframe ()
> from /nfs/home/atchley/projects/openmpi-1.4.3rc1/build/rain/lib/libopen-pal.so.0
> #5 <signal handler called>
> #6 0x00000000000002c0 in ?? ()
> #7 0x00002af690520640 in mca_mpool_fake_release_memory ()
> from /nfs/home/atchley/projects/openmpi-1.4.3rc1/build/rain/lib/openmpi/mca_mpool_fake.so
> #8 0x00002af68e2f49ce in mca_mpool_base_mem_cb ()
> from /nfs/home/atchley/projects/openmpi-1.4.3rc1/build/rain/lib/libmpi.so.0
> #9 0x00002af68e78347b in opal_mem_hooks_release_hook ()
> from /nfs/home/atchley/projects/openmpi-1.4.3rc1/build/rain/lib/libopen-pal.so.0
> #10 0x00002af68e7a791f in opal_mem_free_ptmalloc2_munmap ()
> from /nfs/home/atchley/projects/openmpi-1.4.3rc1/build/rain/lib/libopen-pal.so.0
> #11 0x00002af68e7ac2b1 in opal_memory_ptmalloc2_free_hook ()
> from /nfs/home/atchley/projects/openmpi-1.4.3rc1/build/rain/lib/libopen-pal.so.0
> #12 0x0000003d060727c1 in free () from /lib64/libc.so.6
> #13 0x00002af69197aaad in mx__rl_fini (rl=0xab5f928)
> at ../../../libmyriexpress/userspace/../mx__request.c:102
> #14 0x00002af69196924d in mx_close_endpoint (endpoint=0xab5f820)
> at ../../../libmyriexpress/userspace/../mx_close_endpoint.c:124
> #15 0x00002af69155e3dc in ompi_mtl_mx_finalize ()
> from /nfs/home/atchley/projects/openmpi-1.4.3rc1/build/rain/lib/openmpi/mca_mtl_mx.so
> #16 0x00002af68e2f87e0 in mca_pml_base_select ()
> from /nfs/home/atchley/projects/openmpi-1.4.3rc1/build/rain/lib/libmpi.so.0
> #17 0x00002af68e2bcf40 in ompi_mpi_init ()
> from /nfs/home/atchley/projects/openmpi-1.4.3rc1/build/rain/lib/libmpi.so.0
> #18 0x00002af68e2da2b1 in PMPI_Init_thread ()
> from /nfs/home/atchley/projects/openmpi-1.4.3rc1/build/rain/lib/libmpi.so.0
> #19 0x0000000000403359 in main ()
>
>
> If I tell it to use BTLs only it changes to:
>
> % mpirun -mca pml ob1 -hostfile hosts -np 2 ./IMB-MPI1.ompi-1.4.3rc1 pingpong
> [rain16:22552] *** Process received signal ***
> [rain15:24195] *** Process received signal ***
> [rain15:24195] Signal: Segmentation fault (11)
> [rain15:24195] Signal code: Address not mapped (1)
> [rain15:24195] Failing at address: 0x290
> [rain16:22552] Signal: Segmentation fault (11)
> [rain16:22552] Signal code: Address not mapped (1)
> [rain16:22552] Failing at address: 0x290
> --------------------------------------------------------------------------
> mpirun noticed that process rank 1 with PID 22552 on node rain16 exited on signal 11 (Segmentation fault).
> --------------------------------------------------------------------------
>
> gdb shows:
>
> #0 0x0000003d084075c8 in ?? () from /lib64/libgcc_s.so.1
> #1 0x0000003d0840882b in _Unwind_Backtrace () from /lib64/libgcc_s.so.1
> #2 0x0000003d060e5eb8 in backtrace () from /lib64/libc.so.6
> #3 0x00002b8310ee17de in opal_backtrace_buffer ()
> from /nfs/home/atchley/projects/openmpi-1.4.3rc1/build/rain/lib/libopen-pal.so.0
> #4 0x00002b8310edf4ce in show_stackframe ()
> from /nfs/home/atchley/projects/openmpi-1.4.3rc1/build/rain/lib/libopen-pal.so.0
> #5 <signal handler called>
> #6 0x0000000000000290 in ?? ()
> #7 0x00002b8312c5d640 in mca_mpool_fake_release_memory ()
> from /nfs/home/atchley/projects/openmpi-1.4.3rc1/build/rain/lib/openmpi/mca_mpool_fake.so
> #8 0x00002b8310a319ce in mca_mpool_base_mem_cb ()
> from /nfs/home/atchley/projects/openmpi-1.4.3rc1/build/rain/lib/libmpi.so.0
> #9 0x00002b8310ec047b in opal_mem_hooks_release_hook ()
> from /nfs/home/atchley/projects/openmpi-1.4.3rc1/build/rain/lib/libopen-pal.so.0
> #10 0x00002b8310ee5195 in sYSTRIm ()
> from /nfs/home/atchley/projects/openmpi-1.4.3rc1/build/rain/lib/libopen-pal.so.0
> #11 0x00002b8310ee92da in opal_memory_ptmalloc2_free_hook ()
> from /nfs/home/atchley/projects/openmpi-1.4.3rc1/build/rain/lib/libopen-pal.so.0
> #12 0x0000003d060727c1 in free () from /lib64/libc.so.6
> #13 0x0000003d060960bd in closedir () from /lib64/libc.so.6
> #14 0x00002b8310ec7cc9 in foreachfile_callback ()
> from /nfs/home/atchley/projects/openmpi-1.4.3rc1/build/rain/lib/libopen-pal.so.0
> #15 0x00002b8310ec797a in foreach_dirinpath ()
> from /nfs/home/atchley/projects/openmpi-1.4.3rc1/build/rain/lib/libopen-pal.so.0
> #16 0x00002b8310ec7a1e in lt_dlforeachfile ()
> from /nfs/home/atchley/projects/openmpi-1.4.3rc1/build/rain/lib/libopen-pal.so.0
> #17 0x00002b8310ecf2a5 in mca_base_component_find ()
> from /nfs/home/atchley/projects/openmpi-1.4.3rc1/build/rain/lib/libopen-pal.so.0
> #18 0x00002b8310ecfc75 in mca_base_components_open ()
> from /nfs/home/atchley/projects/openmpi-1.4.3rc1/build/rain/lib/libopen-pal.so.0
> #19 0x00002b8310a2eb46 in ompi_dpm_base_open ()
> from /nfs/home/atchley/projects/openmpi-1.4.3rc1/build/rain/lib/libmpi.so.0
> #20 0x00002b83109fa3c2 in ompi_mpi_init ()
> from /nfs/home/atchley/projects/openmpi-1.4.3rc1/build/rain/lib/libmpi.so.0
> #21 0x00002b8310a172b1 in PMPI_Init_thread ()
> from /nfs/home/atchley/projects/openmpi-1.4.3rc1/build/rain/lib/libmpi.so.0
> #22 0x0000000000403359 in main ()
>
>
> Lastly, with just the MTL:
>
> % mpirun -mca pml cm -hostfile hosts -np 2 ./IMB-MPI1.ompi-1.4.3rc1 pingpong
> [rain16:22607] *** Process received signal ***
> [rain15:24247] *** Process received signal ***
> [rain15:24247] Signal: Segmentation fault (11)
> [rain15:24247] Signal code: Address not mapped (1)
> [rain15:24247] Failing at address: 0x38e0
> [rain16:22607] Signal: Segmentation fault (11)
> [rain16:22607] Signal code: Address not mapped (1)
> [rain16:22607] Failing at address: 0x38e0
> --------------------------------------------------------------------------
> mpirun noticed that process rank 1 with PID 22607 on node rain16 exited on signal 11 (Segmentation fault).
> --------------------------------------------------------------------------
>
>
> gdb shows:
>
> #0 0x0000003d084075c8 in ?? () from /lib64/libgcc_s.so.1
> #1 0x0000003d0840882b in _Unwind_Backtrace () from /lib64/libgcc_s.so.1
> #2 0x0000003d060e5eb8 in backtrace () from /lib64/libc.so.6
> #3 0x00002afa78ae87de in opal_backtrace_buffer ()
> from /nfs/home/atchley/projects/openmpi-1.4.3rc1/build/rain/lib/libopen-pal.so.0
> #4 0x00002afa78ae64ce in show_stackframe ()
> from /nfs/home/atchley/projects/openmpi-1.4.3rc1/build/rain/lib/libopen-pal.so.0
> #5 <signal handler called>
> #6 0x00000000000038e0 in ?? ()
> #7 0x00002afa7a864640 in mca_mpool_fake_release_memory ()
> from /nfs/home/atchley/projects/openmpi-1.4.3rc1/build/rain/lib/openmpi/mca_mpool_fake.so
> #8 0x00002afa786389ce in mca_mpool_base_mem_cb ()
> from /nfs/home/atchley/projects/openmpi-1.4.3rc1/build/rain/lib/libmpi.so.0
> #9 0x00002afa78ac747b in opal_mem_hooks_release_hook ()
> from /nfs/home/atchley/projects/openmpi-1.4.3rc1/build/rain/lib/libopen-pal.so.0
> #10 0x00002afa78aec195 in sYSTRIm ()
> from /nfs/home/atchley/projects/openmpi-1.4.3rc1/build/rain/lib/libopen-pal.so.0
> #11 0x00002afa78af02da in opal_memory_ptmalloc2_free_hook ()
> from /nfs/home/atchley/projects/openmpi-1.4.3rc1/build/rain/lib/libopen-pal.so.0
> #12 0x0000003d060727c1 in free () from /lib64/libc.so.6
> #13 0x00002afa78acec45 in foreachfile_callback ()
> from /nfs/home/atchley/projects/openmpi-1.4.3rc1/build/rain/lib/libopen-pal.so.0
> #14 0x00002afa78ace97a in foreach_dirinpath ()
> from /nfs/home/atchley/projects/openmpi-1.4.3rc1/build/rain/lib/libopen-pal.so.0
> #15 0x00002afa78acea1e in lt_dlforeachfile ()
> from /nfs/home/atchley/projects/openmpi-1.4.3rc1/build/rain/lib/libopen-pal.so.0
> #16 0x00002afa78ad62a5 in mca_base_component_find ()
> from /nfs/home/atchley/projects/openmpi-1.4.3rc1/build/rain/lib/libopen-pal.so.0
> #17 0x00002afa78ad6c75 in mca_base_components_open ()
> from /nfs/home/atchley/projects/openmpi-1.4.3rc1/build/rain/lib/libopen-pal.so.0
> #18 0x00002afa7863ca26 in ompi_pubsub_base_open ()
> from /nfs/home/atchley/projects/openmpi-1.4.3rc1/build/rain/lib/libmpi.so.0
> #19 0x00002afa78601394 in ompi_mpi_init ()
> from /nfs/home/atchley/projects/openmpi-1.4.3rc1/build/rain/lib/libmpi.so.0
> #20 0x00002afa7861e2b1 in PMPI_Init_thread ()
> from /nfs/home/atchley/projects/openmpi-1.4.3rc1/build/rain/lib/libmpi.so.0
> #21 0x0000000000403359 in main ()
>
>
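The three backtraces above can be compared by extracting the function names from the frames between free() and the bad jump, then keeping the names that appear in every trace. This is a minimal sketch rather than a diagnosis: the frame excerpts are copied from the gdb output above (frames #6 to #12 of each run, without the "from ..." continuation lines), and `frame_functions` and `common_frames` are hypothetical helper names.

```python
import re

# Matches gdb frame lines such as "#7 0x00002af690520640 in some_function ()".
FRAME_RE = re.compile(r"^#(\d+)\s+0x[0-9a-f]+ in (\S+) \(")

# Frames #6..#12 from the default, "-mca pml ob1" and "-mca pml cm" runs.
TRACES = [
    """\
#6 0x00000000000002c0 in ?? ()
#7 0x00002af690520640 in mca_mpool_fake_release_memory ()
#8 0x00002af68e2f49ce in mca_mpool_base_mem_cb ()
#9 0x00002af68e78347b in opal_mem_hooks_release_hook ()
#10 0x00002af68e7a791f in opal_mem_free_ptmalloc2_munmap ()
#11 0x00002af68e7ac2b1 in opal_memory_ptmalloc2_free_hook ()
#12 0x0000003d060727c1 in free () from /lib64/libc.so.6
""",
    """\
#6 0x0000000000000290 in ?? ()
#7 0x00002b8312c5d640 in mca_mpool_fake_release_memory ()
#8 0x00002b8310a319ce in mca_mpool_base_mem_cb ()
#9 0x00002b8310ec047b in opal_mem_hooks_release_hook ()
#10 0x00002b8310ee5195 in sYSTRIm ()
#11 0x00002b8310ee92da in opal_memory_ptmalloc2_free_hook ()
#12 0x0000003d060727c1 in free () from /lib64/libc.so.6
""",
    """\
#6 0x00000000000038e0 in ?? ()
#7 0x00002afa7a864640 in mca_mpool_fake_release_memory ()
#8 0x00002afa786389ce in mca_mpool_base_mem_cb ()
#9 0x00002afa78ac747b in opal_mem_hooks_release_hook ()
#10 0x00002afa78aec195 in sYSTRIm ()
#11 0x00002afa78af02da in opal_memory_ptmalloc2_free_hook ()
#12 0x0000003d060727c1 in free () from /lib64/libc.so.6
""",
]


def frame_functions(trace):
    """Return the resolved function names of a trace, innermost frame first."""
    names = []
    for line in trace.splitlines():
        match = FRAME_RE.match(line.strip())
        if match and match.group(2) != "??":  # "??" means gdb found no symbol
            names.append(match.group(2))
    return names


def common_frames(traces):
    """Return functions present in every trace, in the order of the first trace."""
    per_trace = [frame_functions(t) for t in traces]
    shared = set(per_trace[0]).intersection(*map(set, per_trace[1:]))
    return [name for name in per_trace[0] if name in shared]


if __name__ == "__main__":
    for name in common_frames(TRACES):
        print(name)
```

On these excerpts the shared path is free(), opal_memory_ptmalloc2_free_hook, opal_mem_hooks_release_hook, mca_mpool_base_mem_cb and mca_mpool_fake_release_memory, after which each run jumps to an unmapped address. That is consistent with the observation that a build using --without-memory-manager works.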