Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] Question regarding recently common shared-memory component
From: Samuel K. Gutierrez (samuel_at_[hidden])
Date: 2010-09-21 11:44:11


Hi,

Just to be clear - do you see similar checkpoint performance
differences in 1.5rc6 and 1.4.2 with and without shared memory enabled?
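
If it helps to make the comparison concrete, here is a rough sketch for timing the same command under several configurations and reporting the median wall-clock time. The helper names (time_command, compare) are made up for illustration, and wall-clock time around the checkpoint command is only an approximation of the image creation time reported by the internal timer.

```python
import statistics
import subprocess
import sys
import time


def time_command(argv, repeats=3):
    """Run argv `repeats` times; return each wall-clock duration in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        # check=True raises CalledProcessError if the command fails,
        # so a crashed run is never counted as a valid timing.
        subprocess.run(argv, check=True, stdout=subprocess.DEVNULL)
        durations.append(time.perf_counter() - start)
    return durations


def compare(commands, repeats=3):
    """Map each configuration label to the median duration of its command."""
    return {
        label: statistics.median(time_command(argv, repeats))
        for label, argv in commands.items()
    }


def report(results, out=sys.stdout):
    """Print one line per configuration, fastest first."""
    for label, seconds in sorted(results.items(), key=lambda kv: kv[1]):
        print(f"{label:>12}: {seconds:.3f} s", file=out)
```

As an assumed usage, with one job started with the default BTLs and another with --mca btl self,tcp,openib, something like compare({"sm": ["ompi-checkpoint", pid_sm], "no-sm": ["ompi-checkpoint", pid_nosm]}) could be run against each release, where pid_sm and pid_nosm are placeholder strings holding the PIDs of the two mpirun processes.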

Thanks,

--
Samuel K. Gutierrez
Los Alamos National Laboratory

On Sep 21, 2010, at 9:35 AM, <ananda.mudar_at_[hidden]> wrote:
> Hello Samuel,
>
> This problem seems to be resolved after I moved to r23781. However,
> I see another discrepancy: the time to create the checkpoint image
> differs depending on whether shared memory is disabled
> (--mca btl self,tcp,openib) or enabled. For this simple program,
> creating the checkpoint image takes about 0.4 seconds with shared
> memory disabled, but close to 6.5 seconds with the shared memory
> component in use. I have not seen this behavior before. Do I have to
> tune any other parameter to reduce the time?
>
> Thanks
> Ananda
>
> Hi Ananda,
>
> This issue should be resolved in r23781. Please let me know if it is
> not.
>
> Thanks!
>
> --
> Samuel K. Gutierrez
> Los Alamos National Laboratory
> On Sep 20, 2010, at 11:26 AM, <ananda.mudar_at_[hidden]> wrote:
> > I have used the following options to build:
> > ./configure CC=/usr/bin/gcc CXX=/usr/bin/c++ F77=/usr/bin/gfortran
> > FC=/usr/bin/gfortran --prefix /users/amudar/openmpi-1.7
> > --with-tm=/usr/local/pbs --with-openib --with-threads=posix
> > --enable-mpi-thread-multiple --enable-ft-thread --enable-debug
> > --with-ft=cr --with-blcr=/usr/blcr --with-blcr-libdir=/usr/blcr/lib
> >
> > Also, please note that this is with the r23756 build.
> >
> > Let me know if you need any other information.
> >
> > Thanks
> > Ananda
> > Let me take a look at it. How did you configure your build?
> > Thanks,
> >
> > --
> > Samuel K. Gutierrez
> > Los Alamos National Laboratory
> > On Sep 20, 2010, at 10:14 AM, <ananda.mudar_at_[hidden]> wrote:
> > > Hi
> > >
> > > I believe the new common shared memory component was committed to
> > > the trunk sometime toward the end of August. I had not tried this
> > > trunk version until last week, and I have seen a discrepancy with
> > > this component specifically related to checkpoint functionality. I
> > > am not able to checkpoint any program with the latest trunk
> > > version. Am I missing something here? Should I be using any other
> > > options to enable checkpoint functionality for the shared memory
> > > component?
> > >
> > > However, if I disable the shared memory component and use only
> > > self, tcp, and openib (--mca btl self,tcp,openib), I can checkpoint
> > > successfully!
> > >
> > > Following are the options I have used with mpirun:
> > >
> > > mpirun -am ft-enable-cr --mca opal_cr_enable_timer 1 --mca
> > > sstore_stage_global_is_shared 1 --mca
> > > sstore_base_global_snapshot_dir /scratch/hpl005/UIT_test/amudar/FWI
> > > --mca mpi_paffinity_alone 1 -np 32 -hostfile hostfile-32 ../hellompi
> > >
> > > Please note that hellompi is a very simple program without any
> > > collective calls. When I issue a checkpoint, this program fails
> > > with the following messages:
> > >
> > > [hplcnlj158:13937] Signal: Segmentation fault (11)
> > > [hplcnlj158:13937] Signal code: Address not mapped (1)
> > > [hplcnlj158:13937] Failing at address: 0x2aaa00000001
> > > [hplcnlj158:13937] [ 0] /lib64/libpthread.so.0 [0x2b4019a064c0]
> > > [hplcnlj158:13937] [ 1] /users/amudar/openmpi-1.7/lib/libmca_common_sm.so.0(mca_common_sm_param_register+0x262) [0x2aaaad96628a]
> > > [hplcnlj158:13937] [ 2] /users/amudar/openmpi-1.7/lib/openmpi/mca_btl_sm.so [0x2aaaaf0a55e8]
> > > [hplcnlj158:13937] [ 3] /users/amudar/openmpi-1.7/lib/libmpi.so.0 [0x2b4018c3c11b]
> > > [hplcnlj158:13937] [ 4] /users/amudar/openmpi-1.7/lib/libmpi.so.0(mca_base_components_open+0x3ef) [0x2b4018c3b70b]
> > > [hplcnlj158:13937] [ 5] /users/amudar/openmpi-1.7/lib/libmpi.so.0(mca_btl_base_open+0xfd) [0x2b4018b620fe]
> > > [hplcnlj158:13937] [ 6] /users/amudar/openmpi-1.7/lib/openmpi/mca_bml_r2.so [0x2aaaadd9e4fb]
> > > [hplcnlj158:13937] [ 7] /users/amudar/openmpi-1.7/lib/openmpi/mca_pml_ob1.so [0x2aaaae5fa429]
> > > [hplcnlj158:13937] [ 8] /users/amudar/openmpi-1.7/lib/openmpi/mca_pml_crcpw.so [0x2aaaadfadce6]
> > > [hplcnlj158:13937] [ 9] /users/amudar/openmpi-1.7/lib/libmpi.so.0 [0x2b4018b01a0d]
> > > [hplcnlj158:13937] [10] /users/amudar/openmpi-1.7/lib/libmpi.so.0(ompi_cr_coord+0xc0) [0x2b4018b017ba]
> > > [hplcnlj158:13937] [11] /users/amudar/openmpi-1.7/lib/libmpi.so.0(opal_cr_inc_core_recover+0xed) [0x2b4018c0efab]
> > > [hplcnlj158:13937] [12] /users/amudar/openmpi-1.7/lib/openmpi/mca_snapc_full.so [0x2aaaabd280fc]
> > > [hplcnlj158:13937] [13] /users/amudar/openmpi-1.7/lib/libmpi.so.0(opal_cr_test_if_checkpoint_ready+0x11b) [0x2b4018c0ecd3]
> > > [hplcnlj158:13937] [14] /users/amudar/openmpi-1.7/lib/libmpi.so.0 [0x2b4018c0f6e7]
> > > [hplcnlj158:13937] [15] /lib64/libpthread.so.0 [0x2b40199fe367]
> > > [hplcnlj158:13937] [16] /lib64/libc.so.6(clone+0x6d) [0x2b4019ce5f7d]
> > > [hplcnlj158:13937] *** End of error message ***
> > > [hplcnlj161:00637] *** Process received signal ***
> > > [hplcnlj161:00637] Signal: Segmentation fault (11)
> > > [hplcnlj161:00637] Signal code: Address not mapped (1)
> > > [hplcnlj161:00637] Failing at address: 0x2aaa00000001
> > > [hplcnlj161:00649] *** Process received signal ***
> > > [hplcnlj161:00649] Signal: Segmentation fault (11)
> > > [hplcnlj161:00649] Signal code: Address not mapped (1)
> > > [hplcnlj161:00649] Failing at address: 0x2aaa00000001
> > > /users/amudar/Fix_for_pidinuse/cr_restart: line 5: 14012 Segmentation fault      /usr/blcr/bin/cr_restart --no-restore-pid "$@"
> > > [hplcnlj161:00643] *** Process received signal ***
> > > [hplcnlj161:00643] Signal: Segmentation fault (11)
> > > [hplcnlj161:00643] Signal code: Address not mapped (1)
> > > [hplcnlj161:00643] Failing at address: 0x2aaa00000001
> > > [hplcnlj161:00640] *** Process received signal ***
> > > [hplcnlj161:00640] Signal: Segmentation fault (11)
> > > [hplcnlj161:00640] Signal code: Address not mapped (1)
> > > [hplcnlj161:00640] Failing at address: 0x2aaa00000001
> > > [hplcnlj161:00636] *** Process received signal ***
> > > [hplcnlj161:00652] *** Process received signal ***
> > > [hplcnlj161:00652] Signal: Segmentation fault (11)
> > > [hplcnlj161:00652] Signal code: Address not mapped (1)
> > > [hplcnlj161:00652] Failing at address: 0x2aaa00000001
> > > [hplcnlj161:00636] Signal: Segmentation fault (11)
> > > [hplcnlj161:00636] Signal code: Address not mapped (1)
> > > [hplcnlj161:00636] Failing at address: 0x2aaa00000001
> > > [hplcnlj161:00637] [ 0] /lib64/libpthread.so.0 [0x2b86c74694c0]
> > > [hplcnlj161:00637] [ 1] /users/amudar/openmpi-1.7/lib/libmca_common_sm.so.0(mca_common_sm_param_register+0x262) [0x2aaaad96628a]
> > > [hplcnlj161:00637] [ 2] /users/amudar/openmpi-1.7/lib/openmpi/mca_btl_sm.so [0x2aaaaf0a55e8]
> > > [hplcnlj161:00637] [ 3] /users/amudar/openmpi-1.7/lib/libmpi.so.0 [0x2b86c669f11b]
> > > [hplcnlj161:00637] [ 4] /users/amudar/openmpi-1.7/lib/libmpi.so.0(mca_base_components_open+0x3ef) [0x2b86c669e70b]
> > > [hplcnlj161:00637] [ 5] /users/amudar/openmpi-1.7/lib/libmpi.so.0(mca_btl_base_open+0xfd) [0x2b86c65c50fe]
> > > [hplcnlj161:00637] [ 6] /users/amudar/openmpi-1.7/lib/openmpi/mca_bml_r2.so [0x2aaaadd9e4fb]
> > > [hplcnlj161:00637] [ 7] /users/amudar/openmpi-1.7/lib/openmpi/mca_pml_ob1.so [0x2aaaae5fa429]
> > > [hplcnlj161:00637] [ 8] /users/amudar/openmpi-1.7/lib/openmpi/mca_pml_crcpw.so [0x2aaaadfadce6]
> > > [hplcnlj161:00637] [ 9] /users/amudar/openmpi-1.7/lib/libmpi.so.0 [0x2b86c6564a0d]
> > > [hplcnlj161:00637] [10] /users/amudar/openmpi-1.7/lib/libmpi.so.0(ompi_cr_coord+0xc0) [0x2b86c65647ba]
> > > [hplcnlj161:00637] [11] /users/amudar/openmpi-1.7/lib/libmpi.so.0(opal_cr_inc_core_recover+0xed) [0x2b86c6671fab]
> > > [hplcnlj161:00637] [12] /users/amudar/openmpi-1.7/lib/openmpi/mca_snapc_full.so [0x2aaaabd280fc]
> > > [hplcnlj161:00637] [13] /users/amudar/openmpi-1.7/lib/libmpi.so.0(opal_cr_test_if_checkpoint_ready+0x11b) [0x2b86c6671cd3]
> > > [hplcnlj161:00637] [14] /users/amudar/openmpi-1.7/lib/libmpi.so.0 [0x2b86c66726e7]
> > > [hplcnlj161:00637] [15] /lib64/libpthread.so.0 [0x2b86c7461367]
> > > [hplcnlj161:00637] [16] /lib64/libc.so.6(clone+0x6d) [0x2b86c7748f7d]
> > > [hplcnlj161:00637] *** End of error message ***
> > > [hplcnlj161:00649] [ 0] /lib64/libpthread.so.0 [0x2b7bfa6204c0]
> > > [hplcnlj161:00649] [ 1] /users/amudar/openmpi-1.7/lib/libmca_common_sm.so.0(mca_common_sm_param_register+0x262) [0x2aaaad96628a]
> > > [hplcnlj161:00649] [ 2] /users/amudar/openmpi-1.7/lib/openmpi/mca_btl_sm.so [0x2aaaaf0a55e8]
> > > [hplcnlj161:00649] [ 3] /users/amudar/openmpi-1.7/lib/libmpi.so.0 [0x2b7bf985611b]
> > > [hplcnlj161:00649] [ 4] /users/amudar/openmpi-1.7/lib/libmpi.so.0(mca_base_components_open+0x3ef) [0x2b7bf985570b]
> > > [hplcnlj161:00649] [ 5] /users/amudar/openmpi-1.7/lib/libmpi.so.0(mca_btl_base_open+0xfd) [0x2b7bf977c0fe]
> > > [hplcnlj161:00649] [ 6] /users/amudar/openmpi-1.7/lib/openmpi/mca_bml_r2.so [0x2aaaadd9e4fb]
> > > [hplcnlj161:00649] [ 7] /users/amudar/openmpi-1.7/lib/openmpi/mca_pml_ob1.so [0x2aaaae5fa429]
> > > [hplcnlj161:00649] [ 8] /users/amudar/openmpi-1.7/lib/openmpi/mca_pml_crcpw.so [0x2aaaadfadce6]
> > > [hplcnlj161:00649] [ 9] /users/amudar/openmpi-1.7/lib/libmpi.so.0 [0x2b7bf971ba0d]
> > > [hplcnlj161:00649] [10] /users/amudar/openmpi-1.7/lib/libmpi.so.0(ompi_cr_coord+0xc0) [0x2b7bf971b7ba]
> > > [hplcnlj161:00649] [11] /users/amudar/openmpi-1.7/lib/libmpi.so.0(opal_cr_inc_core_recover+0xed) [0x2b7bf9828fab]
> > > [hplcnlj161:00649] [12] /users/amudar/openmpi-1.7/lib/openmpi/mca_snapc_full.so [0x2aaaabd280fc]
> > > [hplcnlj161:00649] [13] /users/amudar/openmpi-1.7/lib/libmpi.so.0(opal_cr_test_if_checkpoint_ready+0x11b) [0x2b7bf9828cd3]
> > > [hplcnlj161:00649] [14] /users/amudar/openmpi-1.7/lib/libmpi.so.0 [0x2b7bf98296e7]
> > > [hplcnlj161:00649] [15] /lib64/libpthread.so.0 [0x2b7bfa618367]
> > > [hplcnlj161:00649] [16] /lib64/libc.so.6(clone+0x6d) [0x2b7bfa8fff7d]
> > > [hplcnlj161:00649] *** End of error message ***
> > >
> > >
> > > Thanks
> > > Ananda
> > >
> > > Ananda B Mudar, PMP
> > > Senior Technical Architect
> > > Wipro Technologies
> > > Ph: 972 765 8093
> > > ananda.mudar_at_[hidden]
> > >
> > > Please do not print this email unless it is absolutely necessary.
> > >
> > > The information contained in this electronic message and any
> > > attachments to this message are intended for the exclusive use of
> > > the addressee(s) and may contain proprietary, confidential or
> > > privileged information. If you are not the intended recipient, you
> > > should not disseminate, distribute or copy this e-mail. Please
> > > notify the sender immediately and destroy all copies of this
> > > message and any attachments.
> > >
> > > WARNING: Computer viruses can be transmitted via email. The
> > > recipient should check this email and any attachments for the
> > > presence of viruses. The company accepts no liability for any
> > > damage caused by any virus transmitted by this email.
> > >
> > > www.wipro.com
> > >
> > > _______________________________________________
> > > devel mailing list
> > > devel_at_[hidden]
> > > http://www.open-mpi.org/mailman/listinfo.cgi/devel
>
>
> Ananda B Mudar, PMP
> Senior Technical Architect
> Wipro Technologies
> Ph: 972 765 8093
> ananda.mudar_at_[hidden]