Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] MPI_T SEGV on DSO
From: KAWASHIMA Takahiro (rivis.kawashima_at_[hidden])
Date: 2014-07-29 19:17:15


Nathan,

Thanks for your response.

Yes. The output in my previous mail was produced with that code
already uncommented. I have now also pulled the latest varList
source, in which the section you mentioned is uncommented, but
the result was the same.

If MPI_T_cvar_get_info is supposed to return MPI_T_ERR_INVALID_INDEX
for variables of unloaded components, then the problem is that it
does not always return MPI_T_ERR_INVALID_INDEX.

I ran varList under GDB and found that MPI_T_cvar_get_info returns
MPI_T_ERR_INVALID_INDEX for shmem_sysv_priority (which is correct),
but it returns MPI_SUCCESS for shmem_sysv_major_version.
The difference is in the mbv_flags values. At the time of the
MPI_T_cvar_get_info call, mbv_flags is 0x44 for shmem_sysv_priority,
so the mca_base_var_get function in opal/mca/base/mca_base_var.c
returns OPAL_ERR_NOT_FOUND. For shmem_sysv_major_version, however,
mbv_flags is 0x10003, so mca_base_var_get returns OPAL_SUCCESS.

Are control variables for unloaded components not being
deregistered completely?

I can track this down further when I have time.

My environment:
  OS: Debian GNU/Linux wheezy
  CPU: x86_64
  Run: mpiexec -n 1 varList
  Open MPI source: trunk r32338 (almost latest)
  Open MPI configure:
    enable_picky=yes
    enable_debug=yes
    enable_mem_debug=yes
    enable_mem_profile=yes
    enable_memchecker=no
    enable_mca_no_build=btl-elan,btl-gm,btl-mx,btl-ofud,btl-portals,btl-sctp,btl-template,btl-udapl,common-mx,common-portals,ess-alps,ess-cnos,ess-lsf,ess-portals_utcp,ess-singleton,ess-slurm,grpcomm-cnos,mpool-fake,mtl,notifier,plm-alps,plm-ccp,plm-lsf,plm-process,plm-slurm,plm-submit,plm-tm,plm-xgrid,pml-cm,pml-csum,pml-example,pml-v,ras
    enable_contrib_no_build=vt
    enable_mpi_cxx=no
    enable_mpi_f77=no
    enable_mpi_f90=no
    enable_ipv6=no
    enable_mpi_io=no
    with_devel_headers=no
    with_wrapper_cflags=-g
    with_wrapper_cxxflags=-g
    with_wrapper_fflags=-g
    with_wrapper_fcflags=-g

Regards,
KAWASHIMA Takahiro

> The problem is the code in question does not check the return code of
> MPI_T_cvar_handle_alloc. We are returning an error and they still try
> to use the handle (which is stale). Uncomment this section of the code:
>
>
> //if (MPI_T_ERR_INVALID_INDEX == err)// { NOTE TZI: This variable is not recognized by Mvapich. It is OpenMPI specific.
> // continue;
>
>
> Note that MPI_T_ERR_INVALID_INDEX is in the MPI-3 standard but mvapich
> must not have implemented it (and thus should not claim to be MPI 3.0).
>
> -Nathan
>
> On Wed, Jul 30, 2014 at 12:04:55AM +0900, KAWASHIMA Takahiro wrote:
> > Hi,
> >
> > I encountered the same SEGV reported on the users list when
> > running the varList program.
> >
> > http://www.open-mpi.org/community/lists/users/2014/07/24792.php
> >
> > mpiexec -n 1 ./varList:
> > ----------------------------------------------------------------
> > ... snip ...
> > event U/D-2 CHAR n/a ALL
> > event_base_verbose D/D-8 INT n/a LOCAL 0
> > event_libevent2021_event_include U/A-3 CHAR n/a LOCAL poll
> > opal_event_include U/A-3 CHAR n/a LOCAL poll
> > event_libevent2021_major_version D/A-9 INT n/a UNKNOWN 1
> > event_libevent2021_minor_version D/A-9 INT n/a UNKNOWN 9
> > event_libevent2021_release_version D/A-9 INT n/a UNKNOWN 0
> > shmem U/D-2 CHAR n/a ALL
> > shmem_base_verbose D/D-8 INT n/a LOCAL 0
> > shmem_base_RUNTIME_QUERY_hint D/A-9 CHAR n/a ALL-EQ
> > shmem_mmap_priority U/A-3 INT n/a ALL 50
> > shmem_mmap_enable_nfs_warning D/A-9 INT n/a LOCAL true
> > shmem_mmap_relocate_backing_file D/A-9 INT n/a ALL 0
> > shmem_mmap_backing_file_base_dir D/A-9 CHAR n/a ALL /dev/shm
> > shmem_mmap_major_version D/A-9 INT n/a UNKNOWN 1
> > shmem_mmap_minor_version D/A-9 INT n/a UNKNOWN 9
> > shmem_mmap_release_version D/A-9 INT n/a UNKNOWN 0
> > shmem_posix_major_version D/A-9 INT n/a UNKNOWN 1201644720
> > shmem_posix_minor_version D/A-9 INT n/a UNKNOWN 32756
> > shmem_posix_release_version D/A-9 INT n/a UNKNOWN 6
> > [ppc:12688] *** Process received signal ***
> > [ppc:12688] Signal: Segmentation fault (11)
> > [ppc:12688] Signal code: Invalid permissions (2)
> > [ppc:12688] Failing at address: 0x7ff4479f83d8
> > [ppc:12688] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x325c0)[0x7ff4493015c0]
> > [ppc:12688] [ 1] /home/rivis/opt/openmpi-trunk-debug/lib/libmpi.so.0(PMPI_T_cvar_read+0xbc)[0x7ff44970abb7]
> > [ppc:12688] [ 2] ./varlist(list_cvars+0x56a)[0x4029bc]
> > [ppc:12688] [ 3] ./varlist(main+0x42b)[0x403598]
> > [ppc:12688] [ 4] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xfd)[0x7ff4492edeed]
> > [ppc:12688] [ 5] ./varlist[0x4016c9]
> > [ppc:12688] *** End of error message ***
> > ----------------------------------------------------------------
> >
> > I tracked this error down and found that it seems to be related to DSO.
> >
> > The error occurs when accessing value->intval for the
> > control variable shmem_sysv_major_version in MPI_T_cvar_read.
> >
> > https://svn.open-mpi.org/trac/ompi/browser/trunk/ompi/mpi/tool/cvar_read.c
> >
> > The 'value' was obtained from mca_base_var_get_value and points to
> > mca_shmem_sysv_component.super.base_version.mca_component_major_version,
> > which was dlclose'd in MPI_INIT because of DSO.
> > (The mmap component is selected in my environment.)
> >
> > The abnormal shmem_posix_{major,minor,release}_version values in
> > my output above have the same cause. A SEGV occurs if the memory
> > has already been returned to the kernel, and abnormal values are
> > printed if it has not.
> >
> > So this SEGV doesn't occur if I configure Open MPI with the
> > --disable-dlopen option. I think this is why Nathan
> > doesn't see this error.