Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] Question regarding recently common shared-memory component
From: Samuel K. Gutierrez (samuel_at_[hidden])
Date: 2010-09-20 18:24:06


Hi Ananda,

This issue should be resolved in r23781. Please let me know if it is
not.

Thanks!

--
Samuel K. Gutierrez
Los Alamos National Laboratory
On Sep 20, 2010, at 11:26 AM, <ananda.mudar_at_[hidden]> wrote:
> I used the following options to build:
> ./configure CC=/usr/bin/gcc CXX=/usr/bin/c++ F77=/usr/bin/gfortran \
>   FC=/usr/bin/gfortran --prefix /users/amudar/openmpi-1.7 \
>   --with-tm=/usr/local/pbs --with-openib --with-threads=posix \
>   --enable-mpi-thread-multiple --enable-ft-thread --enable-debug \
>   --with-ft=cr --with-blcr=/usr/blcr --with-blcr-libdir=/usr/blcr/lib
>
> Also, please note that this is with the r23756 build.
>
> Let me know if you need any other information.
>
> Thanks
> Ananda
> Let me take a look at it. How did you configure your build?
> Thanks,
>
> --
> Samuel K. Gutierrez
> Los Alamos National Laboratory
> On Sep 20, 2010, at 10:14 AM, <ananda.mudar_at_[hidden]> wrote:
> > Hi
> >
> > I believe the new common shared memory component was committed to
> > the trunk sometime in late August. I had not tried this trunk
> > version until last week, and I have seen a problem with this
> > component related to checkpoint functionality: I am not able to
> > checkpoint any program with the latest trunk version. Am I missing
> > something here? Should I be using any other options to enable
> > checkpoint functionality for the shared memory component?
> >
> > However, if I disable the shared memory component and use only self,
> > tcp, and openib (--mca btl self,tcp,openib), I can checkpoint
> > successfully.
> >
> > These are the options I used with mpirun:
> >
> > mpirun -am ft-enable-cr --mca opal_cr_enable_timer 1 \
> >   --mca sstore_stage_global_is_shared 1 \
> >   --mca sstore_base_global_snapshot_dir /scratch/hpl005/UIT_test/amudar/FWI \
> >   --mca mpi_paffinity_alone 1 -np 32 -hostfile hostfile-32 ../hellompi
> >
> > Please note that hellompi is a very simple program without any
> > collective calls. When I issue a checkpoint, this program fails with
> > the following messages:
> >
> > [hplcnlj158:13937] Signal: Segmentation fault (11)
> > [hplcnlj158:13937] Signal code: Address not mapped (1)
> > [hplcnlj158:13937] Failing at address: 0x2aaa00000001
> > [hplcnlj158:13937] [ 0] /lib64/libpthread.so.0 [0x2b4019a064c0]
> > [hplcnlj158:13937] [ 1] /users/amudar/openmpi-1.7/lib/libmca_common_sm.so.0(mca_common_sm_param_register+0x262) [0x2aaaad96628a]
> > [hplcnlj158:13937] [ 2] /users/amudar/openmpi-1.7/lib/openmpi/mca_btl_sm.so [0x2aaaaf0a55e8]
> > [hplcnlj158:13937] [ 3] /users/amudar/openmpi-1.7/lib/libmpi.so.0 [0x2b4018c3c11b]
> > [hplcnlj158:13937] [ 4] /users/amudar/openmpi-1.7/lib/libmpi.so.0(mca_base_components_open+0x3ef) [0x2b4018c3b70b]
> > [hplcnlj158:13937] [ 5] /users/amudar/openmpi-1.7/lib/libmpi.so.0(mca_btl_base_open+0xfd) [0x2b4018b620fe]
> > [hplcnlj158:13937] [ 6] /users/amudar/openmpi-1.7/lib/openmpi/mca_bml_r2.so [0x2aaaadd9e4fb]
> > [hplcnlj158:13937] [ 7] /users/amudar/openmpi-1.7/lib/openmpi/mca_pml_ob1.so [0x2aaaae5fa429]
> > [hplcnlj158:13937] [ 8] /users/amudar/openmpi-1.7/lib/openmpi/mca_pml_crcpw.so [0x2aaaadfadce6]
> > [hplcnlj158:13937] [ 9] /users/amudar/openmpi-1.7/lib/libmpi.so.0 [0x2b4018b01a0d]
> > [hplcnlj158:13937] [10] /users/amudar/openmpi-1.7/lib/libmpi.so.0(ompi_cr_coord+0xc0) [0x2b4018b017ba]
> > [hplcnlj158:13937] [11] /users/amudar/openmpi-1.7/lib/libmpi.so.0(opal_cr_inc_core_recover+0xed) [0x2b4018c0efab]
> > [hplcnlj158:13937] [12] /users/amudar/openmpi-1.7/lib/openmpi/mca_snapc_full.so [0x2aaaabd280fc]
> > [hplcnlj158:13937] [13] /users/amudar/openmpi-1.7/lib/libmpi.so.0(opal_cr_test_if_checkpoint_ready+0x11b) [0x2b4018c0ecd3]
> > [hplcnlj158:13937] [14] /users/amudar/openmpi-1.7/lib/libmpi.so.0 [0x2b4018c0f6e7]
> > [hplcnlj158:13937] [15] /lib64/libpthread.so.0 [0x2b40199fe367]
> > [hplcnlj158:13937] [16] /lib64/libc.so.6(clone+0x6d) [0x2b4019ce5f7d]
> > [hplcnlj158:13937] *** End of error message ***
> > [hplcnlj161:00637] *** Process received signal ***
> > [hplcnlj161:00637] Signal: Segmentation fault (11)
> > [hplcnlj161:00637] Signal code: Address not mapped (1)
> > [hplcnlj161:00637] Failing at address: 0x2aaa00000001
> > [hplcnlj161:00649] *** Process received signal ***
> > [hplcnlj161:00649] Signal: Segmentation fault (11)
> > [hplcnlj161:00649] Signal code: Address not mapped (1)
> > [hplcnlj161:00649] Failing at address: 0x2aaa00000001
> > /users/amudar/Fix_for_pidinuse/cr_restart: line 5: 14012 Segmentation fault      /usr/blcr/bin/cr_restart --no-restore-pid "$@"
> > [hplcnlj161:00643] *** Process received signal ***
> > [hplcnlj161:00643] Signal: Segmentation fault (11)
> > [hplcnlj161:00643] Signal code: Address not mapped (1)
> > [hplcnlj161:00643] Failing at address: 0x2aaa00000001
> > [hplcnlj161:00640] *** Process received signal ***
> > [hplcnlj161:00640] Signal: Segmentation fault (11)
> > [hplcnlj161:00640] Signal code: Address not mapped (1)
> > [hplcnlj161:00640] Failing at address: 0x2aaa00000001
> > [hplcnlj161:00636] *** Process received signal ***
> > [hplcnlj161:00652] *** Process received signal ***
> > [hplcnlj161:00652] Signal: Segmentation fault (11)
> > [hplcnlj161:00652] Signal code: Address not mapped (1)
> > [hplcnlj161:00652] Failing at address: 0x2aaa00000001
> > [hplcnlj161:00636] Signal: Segmentation fault (11)
> > [hplcnlj161:00636] Signal code: Address not mapped (1)
> > [hplcnlj161:00636] Failing at address: 0x2aaa00000001
> > [hplcnlj161:00637] [ 0] /lib64/libpthread.so.0 [0x2b86c74694c0]
> > [hplcnlj161:00637] [ 1] /users/amudar/openmpi-1.7/lib/libmca_common_sm.so.0(mca_common_sm_param_register+0x262) [0x2aaaad96628a]
> > [hplcnlj161:00637] [ 2] /users/amudar/openmpi-1.7/lib/openmpi/mca_btl_sm.so [0x2aaaaf0a55e8]
> > [hplcnlj161:00637] [ 3] /users/amudar/openmpi-1.7/lib/libmpi.so.0 [0x2b86c669f11b]
> > [hplcnlj161:00637] [ 4] /users/amudar/openmpi-1.7/lib/libmpi.so.0(mca_base_components_open+0x3ef) [0x2b86c669e70b]
> > [hplcnlj161:00637] [ 5] /users/amudar/openmpi-1.7/lib/libmpi.so.0(mca_btl_base_open+0xfd) [0x2b86c65c50fe]
> > [hplcnlj161:00637] [ 6] /users/amudar/openmpi-1.7/lib/openmpi/mca_bml_r2.so [0x2aaaadd9e4fb]
> > [hplcnlj161:00637] [ 7] /users/amudar/openmpi-1.7/lib/openmpi/mca_pml_ob1.so [0x2aaaae5fa429]
> > [hplcnlj161:00637] [ 8] /users/amudar/openmpi-1.7/lib/openmpi/mca_pml_crcpw.so [0x2aaaadfadce6]
> > [hplcnlj161:00637] [ 9] /users/amudar/openmpi-1.7/lib/libmpi.so.0 [0x2b86c6564a0d]
> > [hplcnlj161:00637] [10] /users/amudar/openmpi-1.7/lib/libmpi.so.0(ompi_cr_coord+0xc0) [0x2b86c65647ba]
> > [hplcnlj161:00637] [11] /users/amudar/openmpi-1.7/lib/libmpi.so.0(opal_cr_inc_core_recover+0xed) [0x2b86c6671fab]
> > [hplcnlj161:00637] [12] /users/amudar/openmpi-1.7/lib/openmpi/mca_snapc_full.so [0x2aaaabd280fc]
> > [hplcnlj161:00637] [13] /users/amudar/openmpi-1.7/lib/libmpi.so.0(opal_cr_test_if_checkpoint_ready+0x11b) [0x2b86c6671cd3]
> > [hplcnlj161:00637] [14] /users/amudar/openmpi-1.7/lib/libmpi.so.0 [0x2b86c66726e7]
> > [hplcnlj161:00637] [15] /lib64/libpthread.so.0 [0x2b86c7461367]
> > [hplcnlj161:00637] [16] /lib64/libc.so.6(clone+0x6d) [0x2b86c7748f7d]
> > [hplcnlj161:00637] *** End of error message ***
> > [hplcnlj161:00649] [ 0] /lib64/libpthread.so.0 [0x2b7bfa6204c0]
> > [hplcnlj161:00649] [ 1] /users/amudar/openmpi-1.7/lib/libmca_common_sm.so.0(mca_common_sm_param_register+0x262) [0x2aaaad96628a]
> > [hplcnlj161:00649] [ 2] /users/amudar/openmpi-1.7/lib/openmpi/mca_btl_sm.so [0x2aaaaf0a55e8]
> > [hplcnlj161:00649] [ 3] /users/amudar/openmpi-1.7/lib/libmpi.so.0 [0x2b7bf985611b]
> > [hplcnlj161:00649] [ 4] /users/amudar/openmpi-1.7/lib/libmpi.so.0(mca_base_components_open+0x3ef) [0x2b7bf985570b]
> > [hplcnlj161:00649] [ 5] /users/amudar/openmpi-1.7/lib/libmpi.so.0(mca_btl_base_open+0xfd) [0x2b7bf977c0fe]
> > [hplcnlj161:00649] [ 6] /users/amudar/openmpi-1.7/lib/openmpi/mca_bml_r2.so [0x2aaaadd9e4fb]
> > [hplcnlj161:00649] [ 7] /users/amudar/openmpi-1.7/lib/openmpi/mca_pml_ob1.so [0x2aaaae5fa429]
> > [hplcnlj161:00649] [ 8] /users/amudar/openmpi-1.7/lib/openmpi/mca_pml_crcpw.so [0x2aaaadfadce6]
> > [hplcnlj161:00649] [ 9] /users/amudar/openmpi-1.7/lib/libmpi.so.0 [0x2b7bf971ba0d]
> > [hplcnlj161:00649] [10] /users/amudar/openmpi-1.7/lib/libmpi.so.0(ompi_cr_coord+0xc0) [0x2b7bf971b7ba]
> > [hplcnlj161:00649] [11] /users/amudar/openmpi-1.7/lib/libmpi.so.0(opal_cr_inc_core_recover+0xed) [0x2b7bf9828fab]
> > [hplcnlj161:00649] [12] /users/amudar/openmpi-1.7/lib/openmpi/mca_snapc_full.so [0x2aaaabd280fc]
> > [hplcnlj161:00649] [13] /users/amudar/openmpi-1.7/lib/libmpi.so.0(opal_cr_test_if_checkpoint_ready+0x11b) [0x2b7bf9828cd3]
> > [hplcnlj161:00649] [14] /users/amudar/openmpi-1.7/lib/libmpi.so.0 [0x2b7bf98296e7]
> > [hplcnlj161:00649] [15] /lib64/libpthread.so.0 [0x2b7bfa618367]
> > [hplcnlj161:00649] [16] /lib64/libc.so.6(clone+0x6d) [0x2b7bfa8fff7d]
> > [hplcnlj161:00649] *** End of error message ***
> >
> >
> > Thanks
> > Ananda
> >
> > Ananda B Mudar, PMP
> > Senior Technical Architect
> > Wipro Technologies
> > Ph: 972 765 8093
> > ananda.mudar_at_[hidden]
> >
> > Please do not print this email unless it is absolutely necessary.
> >
> > The information contained in this electronic message and any
> > attachments to this message are intended for the exclusive use of
> > the addressee(s) and may contain proprietary, confidential or
> > privileged information. If you are not the intended recipient, you
> > should not disseminate, distribute or copy this e-mail. Please
> > notify the sender immediately and destroy all copies of this message
> > and any attachments.
> >
> > WARNING: Computer viruses can be transmitted via email. The
> > recipient should check this email and any attachments for the
> > presence of viruses. The company accepts no liability for any damage
> > caused by any virus transmitted by this email.
> >
> > www.wipro.com
> >
> > _______________________________________________
> > devel mailing list
> > devel_at_[hidden]
> > http://www.open-mpi.org/mailman/listinfo.cgi/devel
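
For anyone still on a trunk build older than r23781, the workaround mentioned in the original report (leaving the shared-memory BTL out of the BTL list) can be sketched roughly as below. This is only a dry-run illustration: the variable names (BTL_LIST, NP, HOSTFILE, APP, CMD) are made up for this sketch, the values are copied from the report, and the other --mca options from the original command are left out for brevity. The script only prints the command; it does not run it.

```shell
#!/bin/sh
# Dry-run sketch: assemble the mpirun command from the report with the
# shared-memory (sm) BTL left out of the BTL list. Nothing is launched;
# the command line is only printed so it can be reviewed first.
# All variable names below are illustrative, not part of Open MPI.

BTL_LIST="self,tcp,openib"   # BTLs named in the report; sm is deliberately omitted
NP=32                        # process count used in the report
HOSTFILE="hostfile-32"       # hostfile name used in the report
APP="../hellompi"            # test program used in the report

# -am ft-enable-cr and --mca btl are the options quoted in the report.
CMD="mpirun -am ft-enable-cr --mca btl ${BTL_LIST} -np ${NP} -hostfile ${HOSTFILE} ${APP}"

echo "${CMD}"
```

Whether this is still needed depends on the revision in use; per the reply at the top of the thread, the segfault should be resolved in r23781.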