Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] MPI_T SEGV on DSO
From: Nathan Hjelm (hjelmn_at_[hidden])
Date: 2014-07-30 11:45:20


Yup, just noticed that. All component variables should be registered
with mca_base_component_var_register, but the version variables were
registered with the generic register function. The code in question is
the oldest part of the MCA rewrite so it probably was missed when the
component variable register function was added. Fixing now.

-Nathan

On Thu, Jul 31, 2014 at 12:40:55AM +0900, KAWASHIMA Takahiro wrote:
> Nathan,
>
> The difference seems to be in the flags set at registration.
>
> Normal MCA variables such as shmem_sysv_priority have the
> MCA_BASE_VAR_FLAG_DWG flag, so they are deregistered through
> mca_base_var_group_deregister in mca_base_component_unload.
>
> But shmem_sysv_major_version doesn't have the flag.
>
> Regards,
> KAWASHIMA Takahiro
>
> > This is odd. The variable in question is registered by the MCA itself. I
> > will take a look and see if I can determine why it isn't being
> > deregistered correctly when the rest of the component's parameters are.
> >
> > -Nathan
> >
> > On Wed, Jul 30, 2014 at 08:17:15AM +0900, KAWASHIMA Takahiro wrote:
> > > Nathan,
> > >
> > > Thanks for your response.
> > >
> > > Yes. My previous mail showed the result with that code uncommented.
> > > I have now also pulled the latest varList source, which has the
> > > section you mentioned uncommented, but the result was the same.
> > >
> > > If MPI_T_cvar_get_info should return MPI_T_ERR_INVALID_INDEX
> > > for variables of unloaded components, then its failure to return
> > > MPI_T_ERR_INVALID_INDEX is the problem.
> > >
> > > I ran varList under GDB and found that MPI_T_cvar_get_info returns
> > > MPI_T_ERR_INVALID_INDEX for shmem_sysv_priority (this is sane).
> > > But it returns MPI_SUCCESS for shmem_sysv_major_version.
> > > The difference is in the mbv_flags values. mbv_flags is 0x44 for
> > > shmem_sysv_priority at the MPI_T_cvar_get_info call, so the
> > > mca_base_var_get function in opal/mca/base/mca_base_var.c
> > > returns OPAL_ERR_NOT_FOUND. But mbv_flags is 0x10003 for
> > > shmem_sysv_major_version, so the mca_base_var_get function
> > > returns OPAL_SUCCESS.
> > >
> > > Are control variables for unloaded components not being
> > > deregistered completely?
> > >
> > > I can track it down further when I have time.
> > >
> > > My environment:
> > > OS: Debian GNU/Linux wheezy
> > > CPU: x86_64
> > > Run: mpiexec -n 1 varList
> > > Open MPI source: trunk r32338 (almost latest)
> > > Open MPI configure:
> > > enable_picky=yes
> > > enable_debug=yes
> > > enable_mem_debug=yes
> > > enable_mem_profile=yes
> > > enable_memchecker=no
> > > enable_mca_no_build=btl-elan,btl-gm,btl-mx,btl-ofud,btl-portals,btl-sctp,btl-template,btl-udapl,common-mx,common-portals,ess-alps,ess-cnos,ess-lsf,ess-portals_utcp,ess-singleton,ess-slurm,grpcomm-cnos,mpool-fake,mtl,notifier,plm-alps,plm-ccp,plm-lsf,plm-process,plm-slurm,plm-submit,plm-tm,plm-xgrid,pml-cm,pml-csum,pml-example,pml-v,ras
> > > enable_contrib_no_build=vt
> > > enable_mpi_cxx=no
> > > enable_mpi_f77=no
> > > enable_mpi_f90=no
> > > enable_ipv6=no
> > > enable_mpi_io=no
> > > with_devel_headers=no
> > > with_wrapper_cflags=-g
> > > with_wrapper_cxxflags=-g
> > > with_wrapper_fflags=-g
> > > with_wrapper_fcflags=-g
> > >
> > > Regards,
> > > KAWASHIMA Takahiro
> > >
> > > > The problem is the code in question does not check the return code of
> > > > MPI_T_cvar_handle_alloc. We are returning an error and they still try
> > > > to use the handle (which is stale). Uncomment this section of the code:
> > > >
> > > >
> > > > //if (MPI_T_ERR_INVALID_INDEX == err)// { NOTE TZI: This variable is not recognized by Mvapich. It is OpenMPI specific.
> > > > // continue;
> > > >
> > > >
> > > > Note that MPI_T_ERR_INVALID_INDEX is in the MPI-3 standard but mvapich
> > > > must not have implemented it (and thus should not claim to be MPI 3.0).
> > > >
> > > > -Nathan
> > > >
> > > > On Wed, Jul 30, 2014 at 12:04:55AM +0900, KAWASHIMA Takahiro wrote:
> > > > > Hi,
> > > > >
> > > > > I encountered the same SEGV reported on the users list when
> > > > > running the varList program.
> > > > >
> > > > > http://www.open-mpi.org/community/lists/users/2014/07/24792.php
> > > > >
> > > > > mpiexec -n 1 ./varList:
> > > > > ----------------------------------------------------------------
> > > > > ... snip ...
> > > > > event U/D-2 CHAR n/a ALL
> > > > > event_base_verbose D/D-8 INT n/a LOCAL 0
> > > > > event_libevent2021_event_include U/A-3 CHAR n/a LOCAL poll
> > > > > opal_event_include U/A-3 CHAR n/a LOCAL poll
> > > > > event_libevent2021_major_version D/A-9 INT n/a UNKNOWN 1
> > > > > event_libevent2021_minor_version D/A-9 INT n/a UNKNOWN 9
> > > > > event_libevent2021_release_version D/A-9 INT n/a UNKNOWN 0
> > > > > shmem U/D-2 CHAR n/a ALL
> > > > > shmem_base_verbose D/D-8 INT n/a LOCAL 0
> > > > > shmem_base_RUNTIME_QUERY_hint D/A-9 CHAR n/a ALL-EQ
> > > > > shmem_mmap_priority U/A-3 INT n/a ALL 50
> > > > > shmem_mmap_enable_nfs_warning D/A-9 INT n/a LOCAL true
> > > > > shmem_mmap_relocate_backing_file D/A-9 INT n/a ALL 0
> > > > > shmem_mmap_backing_file_base_dir D/A-9 CHAR n/a ALL /dev/shm
> > > > > shmem_mmap_major_version D/A-9 INT n/a UNKNOWN 1
> > > > > shmem_mmap_minor_version D/A-9 INT n/a UNKNOWN 9
> > > > > shmem_mmap_release_version D/A-9 INT n/a UNKNOWN 0
> > > > > shmem_posix_major_version D/A-9 INT n/a UNKNOWN 1201644720
> > > > > shmem_posix_minor_version D/A-9 INT n/a UNKNOWN 32756
> > > > > shmem_posix_release_version D/A-9 INT n/a UNKNOWN 6
> > > > > [ppc:12688] *** Process received signal ***
> > > > > [ppc:12688] Signal: Segmentation fault (11)
> > > > > [ppc:12688] Signal code: Invalid permissions (2)
> > > > > [ppc:12688] Failing at address: 0x7ff4479f83d8
> > > > > [ppc:12688] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x325c0)[0x7ff4493015c0]
> > > > > [ppc:12688] [ 1] /home/rivis/opt/openmpi-trunk-debug/lib/libmpi.so.0(PMPI_T_cvar_read+0xbc)[0x7ff44970abb7]
> > > > > [ppc:12688] [ 2] ./varlist(list_cvars+0x56a)[0x4029bc]
> > > > > [ppc:12688] [ 3] ./varlist(main+0x42b)[0x403598]
> > > > > [ppc:12688] [ 4] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xfd)[0x7ff4492edeed]
> > > > > [ppc:12688] [ 5] ./varlist[0x4016c9]
> > > > > [ppc:12688] *** End of error message ***
> > > > > ----------------------------------------------------------------
> > > > >
> > > > > I tracked this error down and found that it seems to be related to DSO.
> > > > >
> > > > > The error occurs when accessing value->intval for the
> > > > > control variable shmem_sysv_major_version in MPI_T_cvar_read.
> > > > >
> > > > > https://svn.open-mpi.org/trac/ompi/browser/trunk/ompi/mpi/tool/cvar_read.c
> > > > >
> > > > > The 'value' was obtained by mca_base_var_get_value and it points to
> > > > > mca_shmem_sysv_component.super.base_version.mca_component_major_version,
> > > > > which was dlclose'd in MPI_INIT in a DSO build.
> > > > > (the mmap component is selected in my environment)
> > > > >
> > > > > The abnormal shmem_posix_{major,minor,release}_version values in
> > > > > my output above have the same cause. A SEGV occurs if the memory
> > > > > has been returned to the kernel, and abnormal values are printed
> > > > > if it has not been yet.
> > > > >
> > > > > So this SEGV doesn't occur if I configure Open MPI with the
> > > > > --disable-dlopen option. I think that is the reason why Nathan
> > > > > doesn't see this error.
> _______________________________________________
> devel mailing list
> devel_at_[hidden]
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: http://www.open-mpi.org/community/lists/devel/2014/07/15361.php


