Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] Question regarding recently common shared-memory component
From: Samuel K. Gutierrez (samuel_at_[hidden])
Date: 2010-09-20 12:31:33


Let me take a look at it. How did you configure your build?

Thanks,

--
Samuel K. Gutierrez
Los Alamos National Laboratory

On Sep 20, 2010, at 10:14 AM, <ananda.mudar_at_[hidden]> wrote:
> Hi,
>
> I believe the new common shared-memory component was committed to
> the trunk sometime in the latter part of August. I had not tried
> this trunk version until last week, and I have seen a problem
> with this component, specifically related to checkpoint
> functionality: I am not able to checkpoint any program with the
> latest trunk version. Am I missing something here? Should I be using
> any other options to enable checkpoint functionality for the
> shared-memory component?
>
> However, if I disable the shared-memory component and use only self, tcp,
> and openib (--mca btl self,tcp,openib), I can checkpoint
> successfully!
>
> Following are the options I have used with mpirun:
>
> mpirun -am ft-enable-cr --mca opal_cr_enable_timer 1 \
>     --mca sstore_stage_global_is_shared 1 \
>     --mca sstore_base_global_snapshot_dir /scratch/hpl005/UIT_test/amudar/FWI \
>     --mca mpi_paffinity_alone 1 -np 32 -hostfile hostfile-32 ../hellompi
>
> Please note that hellompi is a very simple program without any
> collective calls. When I issue a checkpoint, the program fails with
> the following messages:
>
> [hplcnlj158:13937] Signal: Segmentation fault (11)
> [hplcnlj158:13937] Signal code: Address not mapped (1)
> [hplcnlj158:13937] Failing at address: 0x2aaa00000001
> [hplcnlj158:13937] [ 0] /lib64/libpthread.so.0 [0x2b4019a064c0]
> [hplcnlj158:13937] [ 1] /users/amudar/openmpi-1.7/lib/libmca_common_sm.so.0(mca_common_sm_param_register+0x262) [0x2aaaad96628a]
> [hplcnlj158:13937] [ 2] /users/amudar/openmpi-1.7/lib/openmpi/mca_btl_sm.so [0x2aaaaf0a55e8]
> [hplcnlj158:13937] [ 3] /users/amudar/openmpi-1.7/lib/libmpi.so.0 [0x2b4018c3c11b]
> [hplcnlj158:13937] [ 4] /users/amudar/openmpi-1.7/lib/libmpi.so.0(mca_base_components_open+0x3ef) [0x2b4018c3b70b]
> [hplcnlj158:13937] [ 5] /users/amudar/openmpi-1.7/lib/libmpi.so.0(mca_btl_base_open+0xfd) [0x2b4018b620fe]
> [hplcnlj158:13937] [ 6] /users/amudar/openmpi-1.7/lib/openmpi/mca_bml_r2.so [0x2aaaadd9e4fb]
> [hplcnlj158:13937] [ 7] /users/amudar/openmpi-1.7/lib/openmpi/mca_pml_ob1.so [0x2aaaae5fa429]
> [hplcnlj158:13937] [ 8] /users/amudar/openmpi-1.7/lib/openmpi/mca_pml_crcpw.so [0x2aaaadfadce6]
> [hplcnlj158:13937] [ 9] /users/amudar/openmpi-1.7/lib/libmpi.so.0 [0x2b4018b01a0d]
> [hplcnlj158:13937] [10] /users/amudar/openmpi-1.7/lib/libmpi.so.0(ompi_cr_coord+0xc0) [0x2b4018b017ba]
> [hplcnlj158:13937] [11] /users/amudar/openmpi-1.7/lib/libmpi.so.0(opal_cr_inc_core_recover+0xed) [0x2b4018c0efab]
> [hplcnlj158:13937] [12] /users/amudar/openmpi-1.7/lib/openmpi/mca_snapc_full.so [0x2aaaabd280fc]
> [hplcnlj158:13937] [13] /users/amudar/openmpi-1.7/lib/libmpi.so.0(opal_cr_test_if_checkpoint_ready+0x11b) [0x2b4018c0ecd3]
> [hplcnlj158:13937] [14] /users/amudar/openmpi-1.7/lib/libmpi.so.0 [0x2b4018c0f6e7]
> [hplcnlj158:13937] [15] /lib64/libpthread.so.0 [0x2b40199fe367]
> [hplcnlj158:13937] [16] /lib64/libc.so.6(clone+0x6d) [0x2b4019ce5f7d]
> [hplcnlj158:13937] *** End of error message ***
> [hplcnlj161:00637] *** Process received signal ***
> [hplcnlj161:00637] Signal: Segmentation fault (11)
> [hplcnlj161:00637] Signal code: Address not mapped (1)
> [hplcnlj161:00637] Failing at address: 0x2aaa00000001
> [hplcnlj161:00649] *** Process received signal ***
> [hplcnlj161:00649] Signal: Segmentation fault (11)
> [hplcnlj161:00649] Signal code: Address not mapped (1)
> [hplcnlj161:00649] Failing at address: 0x2aaa00000001
> /users/amudar/Fix_for_pidinuse/cr_restart: line 5: 14012 Segmentation fault      /usr/blcr/bin/cr_restart --no-restore-pid "$@"
> [hplcnlj161:00643] *** Process received signal ***
> [hplcnlj161:00643] Signal: Segmentation fault (11)
> [hplcnlj161:00643] Signal code: Address not mapped (1)
> [hplcnlj161:00643] Failing at address: 0x2aaa00000001
> [hplcnlj161:00640] *** Process received signal ***
> [hplcnlj161:00640] Signal: Segmentation fault (11)
> [hplcnlj161:00640] Signal code: Address not mapped (1)
> [hplcnlj161:00640] Failing at address: 0x2aaa00000001
> [hplcnlj161:00636] *** Process received signal ***
> [hplcnlj161:00652] *** Process received signal ***
> [hplcnlj161:00652] Signal: Segmentation fault (11)
> [hplcnlj161:00652] Signal code: Address not mapped (1)
> [hplcnlj161:00652] Failing at address: 0x2aaa00000001
> [hplcnlj161:00636] Signal: Segmentation fault (11)
> [hplcnlj161:00636] Signal code: Address not mapped (1)
> [hplcnlj161:00636] Failing at address: 0x2aaa00000001
> [hplcnlj161:00637] [ 0] /lib64/libpthread.so.0 [0x2b86c74694c0]
> [hplcnlj161:00637] [ 1] /users/amudar/openmpi-1.7/lib/libmca_common_sm.so.0(mca_common_sm_param_register+0x262) [0x2aaaad96628a]
> [hplcnlj161:00637] [ 2] /users/amudar/openmpi-1.7/lib/openmpi/mca_btl_sm.so [0x2aaaaf0a55e8]
> [hplcnlj161:00637] [ 3] /users/amudar/openmpi-1.7/lib/libmpi.so.0 [0x2b86c669f11b]
> [hplcnlj161:00637] [ 4] /users/amudar/openmpi-1.7/lib/libmpi.so.0(mca_base_components_open+0x3ef) [0x2b86c669e70b]
> [hplcnlj161:00637] [ 5] /users/amudar/openmpi-1.7/lib/libmpi.so.0(mca_btl_base_open+0xfd) [0x2b86c65c50fe]
> [hplcnlj161:00637] [ 6] /users/amudar/openmpi-1.7/lib/openmpi/mca_bml_r2.so [0x2aaaadd9e4fb]
> [hplcnlj161:00637] [ 7] /users/amudar/openmpi-1.7/lib/openmpi/mca_pml_ob1.so [0x2aaaae5fa429]
> [hplcnlj161:00637] [ 8] /users/amudar/openmpi-1.7/lib/openmpi/mca_pml_crcpw.so [0x2aaaadfadce6]
> [hplcnlj161:00637] [ 9] /users/amudar/openmpi-1.7/lib/libmpi.so.0 [0x2b86c6564a0d]
> [hplcnlj161:00637] [10] /users/amudar/openmpi-1.7/lib/libmpi.so.0(ompi_cr_coord+0xc0) [0x2b86c65647ba]
> [hplcnlj161:00637] [11] /users/amudar/openmpi-1.7/lib/libmpi.so.0(opal_cr_inc_core_recover+0xed) [0x2b86c6671fab]
> [hplcnlj161:00637] [12] /users/amudar/openmpi-1.7/lib/openmpi/mca_snapc_full.so [0x2aaaabd280fc]
> [hplcnlj161:00637] [13] /users/amudar/openmpi-1.7/lib/libmpi.so.0(opal_cr_test_if_checkpoint_ready+0x11b) [0x2b86c6671cd3]
> [hplcnlj161:00637] [14] /users/amudar/openmpi-1.7/lib/libmpi.so.0 [0x2b86c66726e7]
> [hplcnlj161:00637] [15] /lib64/libpthread.so.0 [0x2b86c7461367]
> [hplcnlj161:00637] [16] /lib64/libc.so.6(clone+0x6d) [0x2b86c7748f7d]
> [hplcnlj161:00637] *** End of error message ***
> [hplcnlj161:00649] [ 0] /lib64/libpthread.so.0 [0x2b7bfa6204c0]
> [hplcnlj161:00649] [ 1] /users/amudar/openmpi-1.7/lib/libmca_common_sm.so.0(mca_common_sm_param_register+0x262) [0x2aaaad96628a]
> [hplcnlj161:00649] [ 2] /users/amudar/openmpi-1.7/lib/openmpi/mca_btl_sm.so [0x2aaaaf0a55e8]
> [hplcnlj161:00649] [ 3] /users/amudar/openmpi-1.7/lib/libmpi.so.0 [0x2b7bf985611b]
> [hplcnlj161:00649] [ 4] /users/amudar/openmpi-1.7/lib/libmpi.so.0(mca_base_components_open+0x3ef) [0x2b7bf985570b]
> [hplcnlj161:00649] [ 5] /users/amudar/openmpi-1.7/lib/libmpi.so.0(mca_btl_base_open+0xfd) [0x2b7bf977c0fe]
> [hplcnlj161:00649] [ 6] /users/amudar/openmpi-1.7/lib/openmpi/mca_bml_r2.so [0x2aaaadd9e4fb]
> [hplcnlj161:00649] [ 7] /users/amudar/openmpi-1.7/lib/openmpi/mca_pml_ob1.so [0x2aaaae5fa429]
> [hplcnlj161:00649] [ 8] /users/amudar/openmpi-1.7/lib/openmpi/mca_pml_crcpw.so [0x2aaaadfadce6]
> [hplcnlj161:00649] [ 9] /users/amudar/openmpi-1.7/lib/libmpi.so.0 [0x2b7bf971ba0d]
> [hplcnlj161:00649] [10] /users/amudar/openmpi-1.7/lib/libmpi.so.0(ompi_cr_coord+0xc0) [0x2b7bf971b7ba]
> [hplcnlj161:00649] [11] /users/amudar/openmpi-1.7/lib/libmpi.so.0(opal_cr_inc_core_recover+0xed) [0x2b7bf9828fab]
> [hplcnlj161:00649] [12] /users/amudar/openmpi-1.7/lib/openmpi/mca_snapc_full.so [0x2aaaabd280fc]
> [hplcnlj161:00649] [13] /users/amudar/openmpi-1.7/lib/libmpi.so.0(opal_cr_test_if_checkpoint_ready+0x11b) [0x2b7bf9828cd3]
> [hplcnlj161:00649] [14] /users/amudar/openmpi-1.7/lib/libmpi.so.0 [0x2b7bf98296e7]
> [hplcnlj161:00649] [15] /lib64/libpthread.so.0 [0x2b7bfa618367]
> [hplcnlj161:00649] [16] /lib64/libc.so.6(clone+0x6d) [0x2b7bfa8fff7d]
> [hplcnlj161:00649] *** End of error message ***
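
When reading a backtrace like the one above, it can help to list only the frames that carry a symbol name, grouped by process. The following is a minimal sketch, assuming each frame sits on a single line (wrapped lines rejoined). The function name `named_frames`, the regular expression, and the `SAMPLE` excerpt are illustrative and not part of Open MPI; the sample lines are copied from the first process's trace.

```python
import re
from collections import defaultdict

# Matches one backtrace frame, for example:
# [host:pid] [ 4] /path/libmpi.so.0(mca_base_components_open+0x3ef) [0x2b40...]
FRAME_RE = re.compile(
    r"\[(?P<host>[^:\]]+):(?P<pid>\d+)\]\s+"      # [host:pid]
    r"\[\s*(?P<index>\d+)\]\s+"                   # [ N] frame index
    r"(?P<lib>\S+?)"                              # library path
    r"(?:\((?P<symbol>\w+)\+0x[0-9a-fA-F]+\))?"   # optional (symbol+0xoffset)
    r"\s+\[0x[0-9a-fA-F]+\]"                      # [0xaddress]
)


def named_frames(log_text):
    """Return {(host, pid): [(index, symbol), ...]} for frames that have a symbol."""
    frames = defaultdict(list)
    for line in log_text.splitlines():
        # search() rather than match() so a leading "> " quote prefix is tolerated
        m = FRAME_RE.search(line)
        if m and m.group("symbol"):
            key = (m.group("host"), int(m.group("pid")))
            frames[key].append((int(m.group("index")), m.group("symbol")))
    return dict(frames)


# Excerpt of the trace from process 13937 on hplcnlj158, one frame per line.
SAMPLE = """\
[hplcnlj158:13937] [ 0] /lib64/libpthread.so.0 [0x2b4019a064c0]
[hplcnlj158:13937] [ 1] /users/amudar/openmpi-1.7/lib/libmca_common_sm.so.0(mca_common_sm_param_register+0x262) [0x2aaaad96628a]
[hplcnlj158:13937] [ 2] /users/amudar/openmpi-1.7/lib/openmpi/mca_btl_sm.so [0x2aaaaf0a55e8]
[hplcnlj158:13937] [ 4] /users/amudar/openmpi-1.7/lib/libmpi.so.0(mca_base_components_open+0x3ef) [0x2b4018c3b70b]
[hplcnlj158:13937] [10] /users/amudar/openmpi-1.7/lib/libmpi.so.0(ompi_cr_coord+0xc0) [0x2b4018b017ba]
"""

if __name__ == "__main__":
    for (host, pid), frames in named_frames(SAMPLE).items():
        for index, symbol in frames:
            print(f"{host}:{pid} frame {index}: {symbol}")
```

Frames without a symbol (such as the `mca_btl_sm.so` entry) are skipped on purpose, so the output shows only the named call path leading to the fault.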
>
>
> Thanks
> Ananda
>
> Ananda B Mudar, PMP
> Senior Technical Architect
> Wipro Technologies
> Ph: 972 765 8093
> ananda.mudar_at_[hidden]