
Open MPI Development Mailing List Archives


Subject: [OMPI devel] 1.4.3rc1 over MX
From: Scott Atchley (atchley_at_[hidden])
Date: 2010-08-26 10:10:24


Hi all,

I compiled 1.4.3rc1 with MX 1.2.12 on RHEL 5.4 (2.6.18-164.el5). It segfaults during MPI initialization when built with both the memory manager and MX. Compiling with --without-memory-manager works fine. The output below is from the default configure (i.e. --with-memory-manager).

Note, I still see unusual latencies for some tests (reduce-scatter, allgather, etc.) when using the BTL. I do not see them with the MTL. An example of BTL latencies from reduce-scatter is:

       #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
          256         1000         7.01         7.01         7.01
          512         1000         7.56         7.56         7.56
         1024         1000         8.58         8.58         8.58
         2048         1000        10.36        10.36        10.36
         4096         1000        14.49        14.49        14.49
         8192         1000      5180.16      5180.57      5180.36
        16384         1000        94.96        94.97        94.96
        32768         1000      4676.30      4676.68      4676.49
        65536          640      4625.85      4626.23      4626.04
       131072          320       243.43       243.46       243.45
       262144          160       425.56       425.66       425.61
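
To make the unusual rows easier to pick out, here is a minimal Python sketch that flags message sizes whose latency is far above that of some larger message. It assumes the columns follow the usual IMB layout (bytes, repetitions, t_min, t_max, t_avg, times in microseconds). The helper names parse_rows and flag_spikes and the factor of 10 are illustrative choices, not part of IMB or Open MPI.

```python
# Reduce-scatter rows copied from the output above.
# Assumed column order (usual IMB layout): bytes, repetitions,
# t_min, t_max, t_avg, with times in microseconds.
ROWS = """\
256 1000 7.01 7.01 7.01
512 1000 7.56 7.56 7.56
1024 1000 8.58 8.58 8.58
2048 1000 10.36 10.36 10.36
4096 1000 14.49 14.49 14.49
8192 1000 5180.16 5180.57 5180.36
16384 1000 94.96 94.97 94.96
32768 1000 4676.30 4676.68 4676.49
65536 640 4625.85 4626.23 4626.04
131072 320 243.43 243.46 243.45
262144 160 425.56 425.66 425.61
"""


def parse_rows(text):
    """Parse whitespace-separated rows into (bytes, t_avg_usec) pairs."""
    pairs = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 5:
            continue  # skip blank or malformed lines
        pairs.append((int(fields[0]), float(fields[4])))
    return pairs


def flag_spikes(pairs, factor=10.0):
    """Return sizes whose latency exceeds `factor` times the lowest
    latency observed at any larger message size (hypothetical heuristic)."""
    flagged = []
    for i, (size, t_avg) in enumerate(pairs):
        later = [t for _, t in pairs[i + 1:]]
        if later and t_avg > factor * min(later):
            flagged.append(size)
    return flagged


if __name__ == "__main__":
    print(flag_spikes(parse_rows(ROWS)))
```

On the rows above this prints [8192, 32768, 65536]: the three sizes with latencies in the 4600 to 5200 usec range, while 16384 bytes takes under 100 usec.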

Scott

% mpirun -hostfile hosts -np 2 ./IMB-MPI1.ompi-1.4.3rc1 pingpong
[rain16:22509] *** Process received signal ***
[rain16:22509] Signal: Segmentation fault (11)
[rain16:22509] Signal code: Address not mapped (1)
[rain16:22509] Failing at address: 0x2c0
[rain15:24145] *** Process received signal ***
[rain15:24145] Signal: Segmentation fault (11)
[rain15:24145] Signal code: Address not mapped (1)
[rain15:24145] Failing at address: 0x25a0
--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 22509 on node rain16 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

gdb shows:

#0 0x0000003d084075c8 in ?? () from /lib64/libgcc_s.so.1
(gdb) bt
#0 0x0000003d084075c8 in ?? () from /lib64/libgcc_s.so.1
#1 0x0000003d0840882b in _Unwind_Backtrace () from /lib64/libgcc_s.so.1
#2 0x0000003d060e5eb8 in backtrace () from /lib64/libc.so.6
#3 0x00002af68e7a47de in opal_backtrace_buffer ()
   from /nfs/home/atchley/projects/openmpi-1.4.3rc1/build/rain/lib/libopen-pal.so.0
#4 0x00002af68e7a24ce in show_stackframe ()
   from /nfs/home/atchley/projects/openmpi-1.4.3rc1/build/rain/lib/libopen-pal.so.0
#5 <signal handler called>
#6 0x00000000000002c0 in ?? ()
#7 0x00002af690520640 in mca_mpool_fake_release_memory ()
   from /nfs/home/atchley/projects/openmpi-1.4.3rc1/build/rain/lib/openmpi/mca_mpool_fake.so
#8 0x00002af68e2f49ce in mca_mpool_base_mem_cb ()
   from /nfs/home/atchley/projects/openmpi-1.4.3rc1/build/rain/lib/libmpi.so.0
#9 0x00002af68e78347b in opal_mem_hooks_release_hook ()
   from /nfs/home/atchley/projects/openmpi-1.4.3rc1/build/rain/lib/libopen-pal.so.0
#10 0x00002af68e7a791f in opal_mem_free_ptmalloc2_munmap ()
   from /nfs/home/atchley/projects/openmpi-1.4.3rc1/build/rain/lib/libopen-pal.so.0
#11 0x00002af68e7ac2b1 in opal_memory_ptmalloc2_free_hook ()
   from /nfs/home/atchley/projects/openmpi-1.4.3rc1/build/rain/lib/libopen-pal.so.0
#12 0x0000003d060727c1 in free () from /lib64/libc.so.6
#13 0x00002af69197aaad in mx__rl_fini (rl=0xab5f928)
    at ../../../libmyriexpress/userspace/../mx__request.c:102
#14 0x00002af69196924d in mx_close_endpoint (endpoint=0xab5f820)
    at ../../../libmyriexpress/userspace/../mx_close_endpoint.c:124
#15 0x00002af69155e3dc in ompi_mtl_mx_finalize ()
   from /nfs/home/atchley/projects/openmpi-1.4.3rc1/build/rain/lib/openmpi/mca_mtl_mx.so
#16 0x00002af68e2f87e0 in mca_pml_base_select ()
   from /nfs/home/atchley/projects/openmpi-1.4.3rc1/build/rain/lib/libmpi.so.0
#17 0x00002af68e2bcf40 in ompi_mpi_init ()
   from /nfs/home/atchley/projects/openmpi-1.4.3rc1/build/rain/lib/libmpi.so.0
#18 0x00002af68e2da2b1 in PMPI_Init_thread ()
   from /nfs/home/atchley/projects/openmpi-1.4.3rc1/build/rain/lib/libmpi.so.0
#19 0x0000000000403359 in main ()

If I tell it to use only the BTLs, the backtrace changes to:

% mpirun -mca pml ob1 -hostfile hosts -np 2 ./IMB-MPI1.ompi-1.4.3rc1 pingpong
[rain16:22552] *** Process received signal ***
[rain15:24195] *** Process received signal ***
[rain15:24195] Signal: Segmentation fault (11)
[rain15:24195] Signal code: Address not mapped (1)
[rain15:24195] Failing at address: 0x290
[rain16:22552] Signal: Segmentation fault (11)
[rain16:22552] Signal code: Address not mapped (1)
[rain16:22552] Failing at address: 0x290
--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 22552 on node rain16 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

gdb shows:

#0 0x0000003d084075c8 in ?? () from /lib64/libgcc_s.so.1
#1 0x0000003d0840882b in _Unwind_Backtrace () from /lib64/libgcc_s.so.1
#2 0x0000003d060e5eb8 in backtrace () from /lib64/libc.so.6
#3 0x00002b8310ee17de in opal_backtrace_buffer ()
   from /nfs/home/atchley/projects/openmpi-1.4.3rc1/build/rain/lib/libopen-pal.so.0
#4 0x00002b8310edf4ce in show_stackframe ()
   from /nfs/home/atchley/projects/openmpi-1.4.3rc1/build/rain/lib/libopen-pal.so.0
#5 <signal handler called>
#6 0x0000000000000290 in ?? ()
#7 0x00002b8312c5d640 in mca_mpool_fake_release_memory ()
   from /nfs/home/atchley/projects/openmpi-1.4.3rc1/build/rain/lib/openmpi/mca_mpool_fake.so
#8 0x00002b8310a319ce in mca_mpool_base_mem_cb ()
   from /nfs/home/atchley/projects/openmpi-1.4.3rc1/build/rain/lib/libmpi.so.0
#9 0x00002b8310ec047b in opal_mem_hooks_release_hook ()
   from /nfs/home/atchley/projects/openmpi-1.4.3rc1/build/rain/lib/libopen-pal.so.0
#10 0x00002b8310ee5195 in sYSTRIm ()
   from /nfs/home/atchley/projects/openmpi-1.4.3rc1/build/rain/lib/libopen-pal.so.0
#11 0x00002b8310ee92da in opal_memory_ptmalloc2_free_hook ()
   from /nfs/home/atchley/projects/openmpi-1.4.3rc1/build/rain/lib/libopen-pal.so.0
#12 0x0000003d060727c1 in free () from /lib64/libc.so.6
#13 0x0000003d060960bd in closedir () from /lib64/libc.so.6
#14 0x00002b8310ec7cc9 in foreachfile_callback ()
   from /nfs/home/atchley/projects/openmpi-1.4.3rc1/build/rain/lib/libopen-pal.so.0
#15 0x00002b8310ec797a in foreach_dirinpath ()
   from /nfs/home/atchley/projects/openmpi-1.4.3rc1/build/rain/lib/libopen-pal.so.0
#16 0x00002b8310ec7a1e in lt_dlforeachfile ()
   from /nfs/home/atchley/projects/openmpi-1.4.3rc1/build/rain/lib/libopen-pal.so.0
#17 0x00002b8310ecf2a5 in mca_base_component_find ()
   from /nfs/home/atchley/projects/openmpi-1.4.3rc1/build/rain/lib/libopen-pal.so.0
#18 0x00002b8310ecfc75 in mca_base_components_open ()
   from /nfs/home/atchley/projects/openmpi-1.4.3rc1/build/rain/lib/libopen-pal.so.0
#19 0x00002b8310a2eb46 in ompi_dpm_base_open ()
   from /nfs/home/atchley/projects/openmpi-1.4.3rc1/build/rain/lib/libmpi.so.0
#20 0x00002b83109fa3c2 in ompi_mpi_init ()
   from /nfs/home/atchley/projects/openmpi-1.4.3rc1/build/rain/lib/libmpi.so.0
#21 0x00002b8310a172b1 in PMPI_Init_thread ()
   from /nfs/home/atchley/projects/openmpi-1.4.3rc1/build/rain/lib/libmpi.so.0
#22 0x0000000000403359 in main ()

Lastly, with just the MTL:

% mpirun -mca pml cm -hostfile hosts -np 2 ./IMB-MPI1.ompi-1.4.3rc1 pingpong
[rain16:22607] *** Process received signal ***
[rain15:24247] *** Process received signal ***
[rain15:24247] Signal: Segmentation fault (11)
[rain15:24247] Signal code: Address not mapped (1)
[rain15:24247] Failing at address: 0x38e0
[rain16:22607] Signal: Segmentation fault (11)
[rain16:22607] Signal code: Address not mapped (1)
[rain16:22607] Failing at address: 0x38e0
--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 22607 on node rain16 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

gdb shows:

#0 0x0000003d084075c8 in ?? () from /lib64/libgcc_s.so.1
#1 0x0000003d0840882b in _Unwind_Backtrace () from /lib64/libgcc_s.so.1
#2 0x0000003d060e5eb8 in backtrace () from /lib64/libc.so.6
#3 0x00002afa78ae87de in opal_backtrace_buffer ()
   from /nfs/home/atchley/projects/openmpi-1.4.3rc1/build/rain/lib/libopen-pal.so.0
#4 0x00002afa78ae64ce in show_stackframe ()
   from /nfs/home/atchley/projects/openmpi-1.4.3rc1/build/rain/lib/libopen-pal.so.0
#5 <signal handler called>
#6 0x00000000000038e0 in ?? ()
#7 0x00002afa7a864640 in mca_mpool_fake_release_memory ()
   from /nfs/home/atchley/projects/openmpi-1.4.3rc1/build/rain/lib/openmpi/mca_mpool_fake.so
#8 0x00002afa786389ce in mca_mpool_base_mem_cb ()
   from /nfs/home/atchley/projects/openmpi-1.4.3rc1/build/rain/lib/libmpi.so.0
#9 0x00002afa78ac747b in opal_mem_hooks_release_hook ()
   from /nfs/home/atchley/projects/openmpi-1.4.3rc1/build/rain/lib/libopen-pal.so.0
#10 0x00002afa78aec195 in sYSTRIm ()
   from /nfs/home/atchley/projects/openmpi-1.4.3rc1/build/rain/lib/libopen-pal.so.0
#11 0x00002afa78af02da in opal_memory_ptmalloc2_free_hook ()
   from /nfs/home/atchley/projects/openmpi-1.4.3rc1/build/rain/lib/libopen-pal.so.0
#12 0x0000003d060727c1 in free () from /lib64/libc.so.6
#13 0x00002afa78acec45 in foreachfile_callback ()
   from /nfs/home/atchley/projects/openmpi-1.4.3rc1/build/rain/lib/libopen-pal.so.0
#14 0x00002afa78ace97a in foreach_dirinpath ()
   from /nfs/home/atchley/projects/openmpi-1.4.3rc1/build/rain/lib/libopen-pal.so.0
#15 0x00002afa78acea1e in lt_dlforeachfile ()
   from /nfs/home/atchley/projects/openmpi-1.4.3rc1/build/rain/lib/libopen-pal.so.0
#16 0x00002afa78ad62a5 in mca_base_component_find ()
   from /nfs/home/atchley/projects/openmpi-1.4.3rc1/build/rain/lib/libopen-pal.so.0
#17 0x00002afa78ad6c75 in mca_base_components_open ()
   from /nfs/home/atchley/projects/openmpi-1.4.3rc1/build/rain/lib/libopen-pal.so.0
#18 0x00002afa7863ca26 in ompi_pubsub_base_open ()
   from /nfs/home/atchley/projects/openmpi-1.4.3rc1/build/rain/lib/libmpi.so.0
#19 0x00002afa78601394 in ompi_mpi_init ()
   from /nfs/home/atchley/projects/openmpi-1.4.3rc1/build/rain/lib/libmpi.so.0
#20 0x00002afa7861e2b1 in PMPI_Init_thread ()
   from /nfs/home/atchley/projects/openmpi-1.4.3rc1/build/rain/lib/libmpi.so.0
#21 0x0000000000403359 in main ()
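
The three backtraces can be compared mechanically. The following Python sketch extracts the function name from each gdb frame line and lists the functions common to all three traces. The frame lists are abbreviated from the output above (addresses, arguments and "from" lines omitted, starting at frame #6). The names frame_functions and common_frames are hypothetical helpers, not gdb or Open MPI APIs.

```python
import re

# Matches gdb frame lines such as "#7 0x00002af6... in name ()" or "#7 name ()".
FRAME_RE = re.compile(r"^#(\d+)\s+(?:0x[0-9a-f]+ in )?(\S+) \(")

# Frames #6 and below from each run, abbreviated from the traces above.
DEFAULT = """#6 ?? ()
#7 mca_mpool_fake_release_memory ()
#8 mca_mpool_base_mem_cb ()
#9 opal_mem_hooks_release_hook ()
#10 opal_mem_free_ptmalloc2_munmap ()
#11 opal_memory_ptmalloc2_free_hook ()
#12 free ()
#13 mx__rl_fini ()
#14 mx_close_endpoint ()
#15 ompi_mtl_mx_finalize ()
#16 mca_pml_base_select ()
#17 ompi_mpi_init ()
#18 PMPI_Init_thread ()
#19 main ()"""

# Frames shared by the "-mca pml ob1" and "-mca pml cm" runs.
HOOK = ["?? ()", "mca_mpool_fake_release_memory ()", "mca_mpool_base_mem_cb ()",
        "opal_mem_hooks_release_hook ()", "sYSTRIm ()",
        "opal_memory_ptmalloc2_free_hook ()", "free ()"]
DLOPEN = ["foreachfile_callback ()", "foreach_dirinpath ()", "lt_dlforeachfile ()",
          "mca_base_component_find ()", "mca_base_components_open ()"]
INIT = ["ompi_mpi_init ()", "PMPI_Init_thread ()", "main ()"]


def numbered(frames, start=6):
    """Render a list of frame bodies as gdb-style '#N body' lines."""
    return "\n".join(f"#{start + i} {f}" for i, f in enumerate(frames))


OB1 = numbered(HOOK + ["closedir ()"] + DLOPEN + ["ompi_dpm_base_open ()"] + INIT)
CM = numbered(HOOK + DLOPEN + ["ompi_pubsub_base_open ()"] + INIT)


def frame_functions(backtrace):
    """Return the function names of a gdb backtrace, innermost frame first."""
    names = []
    for line in backtrace.splitlines():
        match = FRAME_RE.match(line.strip())
        if match:
            names.append(match.group(2))
    return names


def common_frames(*backtraces):
    """Functions present in every backtrace, in the order of the first one."""
    first, *rest = [frame_functions(bt) for bt in backtraces]
    return [name for name in first if all(name in other for other in rest)]


if __name__ == "__main__":
    for name in common_frames(DEFAULT, OB1, CM):
        print(name)
```

All three runs share the path free, opal_memory_ptmalloc2_free_hook, opal_mem_hooks_release_hook, mca_mpool_base_mem_cb, mca_mpool_fake_release_memory, and then an unresolved frame #6. In each run the frame #6 address equals the reported failing address (0x2c0, 0x290, 0x38e0), which would be consistent with a call through an invalid function pointer from the release callback, though the traces alone do not confirm that.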