Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] core from today
From: Marcin Skoczylas (Marcin.Skoczylas_at_[hidden])
Date: 2007-11-14 06:23:52


Terry,

Linux Slackware, with a rather outdated kernel 2.6.15 and gcc 3.4.6.
Yes, that run was slightly oversubscribed - just to perform sanity tests
locally. Obviously, I have changed that part and now send the data in bigger
chunks (I think that should be faster anyway), and on a cluster machine that
is not oversubscribed this does not appear. I was just curious about the core.
Thanks!

greets, Marcin

Terry Dontje wrote:
> Marcin,
>
> A couple questions:
>
> What OS are you running on?
> Did you run this job oversubscribed, that is more processes than there
> are cpus?
>
> I've found with oversubscribed jobs that the SM BTL makes recursive calls to
> opal_progress, and that the yield within opal_progress (intended to give up
> the CPU to others) doesn't always work on all OSes.
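>
> (A sketch for reference, not from the original thread: when a node is
> knowingly oversubscribed, the yield behaviour can be requested explicitly
> through the mpi_yield_when_idle MCA parameter; the application name below
> is just a placeholder.)
>
>     mpirun --mca mpi_yield_when_idle 1 -np 12 ./my_app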
>
> --td
>
>
>> ----------------------------------------------------------------------
>>
>> Message: 1
>> Date: Tue, 13 Nov 2007 12:26:43 +0100
>> From: Marcin Skoczylas <Marcin.Skoczylas_at_[hidden]>
>> Subject: [OMPI users] core from today
>> To: Open MPI Users <users_at_[hidden]>
>> Message-ID: <473989F3.8070808_at_[hidden]>
>> Content-Type: text/plain; charset=ISO-8859-2; format=flowed
>>
>> OpenMPI 1.2.4
>>
>> mpirun noticed that job rank 0 with PID 19021 on node pc801 exited on
>> signal 15 (Terminated).
>> 11 additional processes aborted (not shown)
>>
>> (gdb) bt
>> #0 0x411b776c in mca_pml_ob1_recv_frag_match () from
>> /usr/local/openmpi//lib/openmpi/mca_pml_ob1.so
>> #1 0x411ce010 in mca_btl_sm_component_progress () from
>> /usr/local/openmpi//lib/openmpi/mca_btl_sm.so
>> #2 0x411c2df9 in mca_bml_r2_progress () from
>> /usr/local/openmpi//lib/openmpi/mca_bml_r2.so
>> #3 0x404fb549 in opal_progress () from
>> /usr/local/openmpi/lib/libopen-pal.so.0
>> #4 0x411b87cb in mca_pml_ob1_recv_frag_match () from
>> /usr/local/openmpi//lib/openmpi/mca_pml_ob1.so
>> #5 0x411ce010 in mca_btl_sm_component_progress () from
>> /usr/local/openmpi//lib/openmpi/mca_btl_sm.so
>> #6 0x411c2df9 in mca_bml_r2_progress () from
>> /usr/local/openmpi//lib/openmpi/mca_bml_r2.so
>> #7 0x404fb549 in opal_progress () from
>> /usr/local/openmpi/lib/libopen-pal.so.0
>> (... the same four-frame cycle - mca_pml_ob1_recv_frag_match,
>> mca_btl_sm_component_progress, mca_bml_r2_progress, opal_progress -
>> repeats for thousands of frames ...)
>> #19661 0x411ce010 in mca_btl_sm_component_progress () from
>> /usr/local/openmpi//lib/openmpi/mca_btl_sm.so
>> #19662 0x411c2df9 in mca_bml_r2_progress () from
>> /usr/local/openmpi//lib/openmpi/mca_bml_r2.so
>> #19663 0x404fb549 in opal_progress () from
>> /usr/local/openmpi/lib/libopen-pal.so.0
>> #19664 0x411b87cb in mca_pml_ob1_recv_frag_match () from
>> /usr/local/openmpi//lib/openmpi/mca_pml_ob1.so
>> #19665 0x411ce010 in mca_btl_sm_component_progress () from
>> /usr/local/openmpi//lib/openmpi/mca_btl_sm.so
>> #19666 0x411c2df9 in mca_bml_r2_progress () from
>> /usr/local/openmpi//lib/openmpi/mca_bml_r2.so
>> #19667 0x404fb549 in opal_progress () from
>> /usr/local/openmpi/lib/libopen-pal.so.0
>> #19668 0x400d9bb5 in ompi_request_wait_all () from
>> /usr/local/openmpi/lib/libmpi.so.0
>> #19669 0x411f57a3 in ompi_coll_tuned_bcast_intra_generic () from
>> /usr/local/openmpi//lib/openmpi/mca_coll_tuned.so
>> #19670 0x411f5e55 in ompi_coll_tuned_bcast_intra_binomial () from
>> /usr/local/openmpi//lib/openmpi/mca_coll_tuned.so
>> #19671 0x411efb3f in ompi_coll_tuned_bcast_intra_dec_fixed () from
>> /usr/local/openmpi//lib/openmpi/mca_coll_tuned.so
>> #19672 0x400ee239 in PMPI_Bcast () from /usr/local/openmpi/lib/libmpi.so.0
>> #19673 0x081009a3 in CProcessing::postProcessWorker (this=0x843a3c8) at
>> CProcessing.cpp:403
>> #19674 0x081042ee in CInputSetMap::postProcessWorker (this=0x843a260) at
>> CInputSetMap.cpp:554
>> #19675 0x0812f0f5 in CInputSetMap::processWorker (this=0x843a3f8) at
>> CInputSetMap.cpp:580
>> #19676 0x080b0945 in CLS_WorkerStart () at CLS_WorkerStartup.cpp:11
>> #19677 0x080ac2e9 in CLS_Worker () at CLS_Worker.cpp:44
>> #19678 0x0813706f in main (argc=1, argv=0xbfae84d4) at SYS_Main.cpp:201
>>
>> Seems like an endless recursive loop to me...
>> Unfortunately I have to spread one double per MPI_Bcast (not a whole
>> vector, for example), as the behavior later on needs such an approach (don't
>> ask why). I commented out everything that could be dangerous; in fact I'm
>> just spreading the data now, and this is enough to crash... it appears only
>> on a big input set, and the whole code works perfectly on smaller datasets.
>>
>> code:
>>
>> HEAD:
>> for (i = 0; i < numAlphaSets; i++)
>> {
>>     CAlphaSet *alphaSet = *alphaSetIterator;
>>     for (cols = 0; cols < numCols; cols++)
>>     {
>>         double alpha = alphaSet->alpha[cols-1];
>>         MPI_Bcast(&alpha, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);
>>     }
>>     *alphaSetIterator++;
>> }
>>
>> WORKER:
>> double alpha;
>> for (i = 0; i < numAlphaSets; i++)
>> {
>>     for (cols = 0; cols < numCols; cols++)
>>     {
>>         MPI_Bcast(&alpha, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);
>>         // do something with alpha, commented out for debug
>>     }
>> }
>>
>> I am trying to spread around 820,000 MPI_DOUBLEs that way. Obviously, I will
>> rewrite this to send the data in bigger chunks and split them up on the
>> workers, but it seems strange anyway... could it be some buffer issue, or...?
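>>
>> (A minimal sketch of the chunked variant, purely illustrative and not the
>> actual code from this thread; it assumes the numCols doubles of one alpha
>> set can be packed into a contiguous buffer, and uses std::vector from
>> <vector>:)
>>
>> HEAD:
>> std::vector<double> buf(numCols);
>> for (i = 0; i < numAlphaSets; i++)
>> {
>>     CAlphaSet *alphaSet = *alphaSetIterator;
>>     for (cols = 0; cols < numCols; cols++)
>>         buf[cols] = alphaSet->alpha[cols];   // pack one set (indexing assumed)
>>     // one broadcast per set instead of one broadcast per double
>>     MPI_Bcast(&buf[0], numCols, MPI_DOUBLE, 0, MPI_COMM_WORLD);
>>     alphaSetIterator++;
>> }
>>
>> WORKER:
>> std::vector<double> buf(numCols);
>> for (i = 0; i < numAlphaSets; i++)
>> {
>>     // receive the whole set at once, then use buf[0..numCols-1]
>>     MPI_Bcast(&buf[0], numCols, MPI_DOUBLE, 0, MPI_COMM_WORLD);
>> }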
>>
>> greets, Marcin
>>
>>
>>
>
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
>