Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Checkpointing hangs with OpenMPI-1.3.1
From: Josh Hursey (jjhursey_at_[hidden])
Date: 2009-04-28 08:27:30


On Apr 28, 2009, at 7:27 AM, neeraj_at_[hidden] wrote:

> Hi Josh,
>
> Thanks for your reply. Actually, the reason for the hang was the
> BLCR library missing from LD_LIBRARY_PATH.
>
> After setting it correctly, checkpointing works, but as you
> mentioned before, the datatype error comes along with it, and hence
> restart is not working.
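>
> For reference, the fix on my side was along these lines (the BLCR
> prefix below is just an example; use your actual install path):
>
>     # make the BLCR shared libraries visible to Open MPI at runtime
>     export LD_LIBRARY_PATH=/usr/local/blcr/lib:$LD_LIBRARY_PATH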

Thanks for letting me know. I was worried that I was missing
something since I could not reproduce the hang.

>
> a) The errors I am getting are:
>
> ----------------------------------------------------------------------
> [n0:12674] *** An error occurred in MPI_Barrier
> [n0:12674] *** on communicator MPI_COMM_WORLD
> [n0:12674] *** MPI_ERR_BUFFER: invalid buffer pointer
> [n0:12674] *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
> [n0:12674] crcp:bkmrk: drain_message_copy_remove(): Datatype copy failed (1)
> [n0:12674] crcp:bkmrk: drain_check_recv(): Datatype copy failed (1)
> [n0:12674] crcp:bkmrk: pml_recv(): Failed trying to find a drained
>     message. ---------- This should never happen ----------
>     (crcp_bkmrk_pml.c:2411)
> ----------------------------------------------------------------------
>
>

Yeah, this is the error that I fixed yesterday. The next release of
Open MPI will include the fix. If you need it before then, you can
download the current trunk or (soon) the v1.3 branch from SVN. The
following ticket will let you know when the patch is committed to the
v1.3 branch:
   https://svn.open-mpi.org/trac/ompi/ticket/1899
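
For reference, checking out the development trunk looks something
like this (assuming the usual layout of our SVN repository; adjust
the path if you want the v1.3 branch instead once the patch lands):

   svn co http://svn.open-mpi.org/svn/ompi/trunk ompi-trunk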

>
> b) The other error I am getting while checkpointing is a
> segmentation fault. This is independent of the previous error, and
> the scenario is a bit different.
> I have two nodes, one with InfiniBand and the other with TCP. I am
> running a simple mpirun with no options for selecting/deselecting
> BTLs, leaving Open MPI to decide at runtime.
>
> The error I am getting is as follows:
>
> ----------------------------------------------------------------------
> [n5:29211] *** Process received signal ***
> [n5:29211] Signal: Segmentation fault (11)
> [n5:29211] Signal code: Address not mapped (1)
> [n5:29211] Failing at address: (nil)
> [n5:29211] [ 0] /lib64/tls/libpthread.so.0 [0x399e80c4f0]
> [n5:29211] [ 1] /usr/lib64/libibverbs.so.1(ibv_destroy_srq+0) [0x3496108fb0]
> [n5:29211] [ 2] /home/syssoft/alap/fts/openmpi-1.3.1/ft_thread_install/lib/openmpi/mca_btl_openib.so [0x2a979369ed]
> [n5:29211] [ 3] /home/syssoft/alap/fts/openmpi-1.3.1/ft_thread_install/lib/openmpi/mca_btl_openib.so [0x2a979376b5]
> [n5:29211] [ 4] /home/syssoft/alap/fts/openmpi-1.3.1/ft_thread_install/lib/openmpi/mca_bml_r2.so [0x2a97186a33]
> [n5:29211] [ 5] /home/syssoft/alap/fts/openmpi-1.3.1/ft_thread_install/lib/openmpi/mca_pml_ob1.so [0x2a96f68e5d]
> [n5:29211] [ 6] /home/syssoft/alap/fts/openmpi-1.3.1/ft_thread_install/lib/openmpi/mca_pml_crcpw.so [0x2a96e638d5]
> [n5:29211] [ 7] /home/syssoft/alap/fts/openmpi-1.3.1/ft_thread_install/lib/libmpi.so.0(ompi_cr_coord+0x127) [0x2a95591127]
> [n5:29211] [ 8] /home/syssoft/alap/fts/openmpi-1.3.1/ft_thread_install/lib/libopen-pal.so.0(opal_cr_inc_core+0x33) [0x2a95858403]
> [n5:29211] [ 9] /home/syssoft/alap/fts/openmpi-1.3.1/ft_thread_install/lib/openmpi/mca_snapc_full.so [0x2a965432b1]
> [n5:29211] [10] /home/syssoft/alap/fts/openmpi-1.3.1/ft_thread_install/lib/libopen-pal.so.0(opal_cr_test_if_checkpoint_ready+0x52) [0x2a95857662]
> [n5:29211] [11] /home/syssoft/alap/fts/openmpi-1.3.1/ft_thread_install/lib/libmpi.so.0 [0x2a9558e13b]
> [n5:29211] [12] /home/syssoft/alap/fts/openmpi-1.3.1/ft_thread_install/lib/openmpi/mca_coll_tuned.so [0x2a98421c12]
> [n5:29211] [13] /home/syssoft/alap/fts/openmpi-1.3.1/ft_thread_install/lib/openmpi/mca_coll_tuned.so [0x2a9842a00e]
> [n5:29211] [14] /home/syssoft/alap/fts/openmpi-1.3.1/ft_thread_install/lib/libmpi.so.0(PMPI_Barrier+0x140) [0x2a955a4af0]
> [n5:29211] [15] ./a.out(main+0x64) [0x4009bc]
> [n5:29211] [16] /lib64/tls/libc.so.6(__libc_start_main+0xdb) [0x399e11c3fb]
> [n5:29211] [17] ./a.out [0x4008ca]
> [n5:29211] *** End of error message ***
> --------------------------------------------------------------------------
> mpirun noticed that process rank 1 with PID 29211 on node n5 exited on
> signal 11 (Segmentation fault).
> --------------------------------------------------------------------------
> ----------------------------------------------------------------------
>
>
> This error goes away if I force mpirun to use TCP for communication
> using MCA parameters, and then error (a) starts appearing, which is
> related to datatype handling during checkpoint.
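>
> For reference, forcing TCP amounts to adding the btl MCA parameter
> to the same command line from my original mail, e.g.:
>
>     mpirun -np 2 -H n5 -am ft-enable-cr --mca btl tcp,self a.out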

I believe Pasha fixed the segfault in r20872. However, that was after
the release of v1.3.1, so the fix should be in v1.3.2. Can you try
the v1.3.2 release or the current SVN to confirm that it fixes the
segfault problem?

Best,
Josh

>
>
> Regards
>
> Neeraj Chourasia
> Member of Technical Staff
> Computational Research Laboratories Limited
> (A wholly Owned Subsidiary of TATA SONS Ltd)
> P: +91.9225520634
>
>
>
>
>
> Josh Hursey <jjhursey_at_[hidden]>
> Sent by: users-bounces_at_[hidden]
> 04/28/2009 12:34 AM
> Please respond to
> Open MPI Users <users_at_[hidden]>
>
>
> To
> Open MPI Users <users_at_[hidden]>
> cc
>
> Subject
> Re: [OMPI users] Checkpointing hangs with OpenMPI-1.3.1
>
>
> I still have not been able to reproduce the hang, but I'm still
> looking into it.
>
> I did commit a fix for the datatype copy error that I mentioned
> (r21080 in the Open MPI trunk, and it is in the pipeline for v1.3).
>
> Can you put in a print statement before MPI_Finalize, then try the
> program again? I am wondering if the problem is not with
> MPI_Barrier, but with MPI_Finalize. I wonder if one (or both) of the
> processes enter MPI_Finalize while a checkpoint is occurring.
> Unfortunately, I have not tested the MPI_Finalize scenario in a long
> time, but will put that on my todo list.
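>
> Something like this sketch against your test program would tell us
> which call we are stuck in (the fflush is there so the line is not
> lost in the stdio buffer if the process hangs or dies):
>
>     printf("\nrank %d: entering MPI_Finalize\n", my_rank);
>     fflush(stdout); /* push the line out before a possible hang */
>     MPI_Finalize();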
>
> Cheers,
> Josh
>
> On Apr 27, 2009, at 9:48 AM, Josh Hursey wrote:
>
>> Sorry for the long delay to respond.
>>
>> It is a bit odd that the hang does not occur when running on only
>> one host. I suspect that is more due to timing than anything else.
>>
>> I am not able to reproduce the hang at the moment, but I do get an
>> occasional datatype copy error which could be symptomatic of a
>> related problem. I'll dig into this a bit more this week and let you
>> know when I have a fix and if I can reproduce the hang.
>>
>> Thanks for the bug report.
>>
>> Cheers,
>> Josh
>>
>> On Apr 10, 2009, at 4:56 AM, neeraj_at_[hidden] wrote:
>>
>>>
>>> Dear All,
>>>
>>> I am trying to checkpoint a test application using openmpi-1.3.1,
>>> but it fails to do so when running multiple processes on different
>>> nodes.
>>>
>>> Checkpointing runs fine if the processes run on the same node as
>>> the mpirun process, but the moment I launch an MPI process from a
>>> different node, it hangs.
>>>
>>> For example:
>>> mpirun -np 2 ./test (will checkpoint fine using ompi-checkpoint -v
>>> <mpirun_pid>)
>>> but
>>> mpirun -np 2 -H host1 ./test (checkpointing will hang)
>>>
>>> Similarly,
>>> mpirun -np 2 -H localhost,host1 ./test would still hang while
>>> checkpointing.
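>>>
>>> (Restart is done with ompi-restart, passing the snapshot reference
>>> that ompi-checkpoint prints, e.g. something like:
>>>
>>> ompi-restart ompi_global_snapshot_<mpirun_pid>.ckpt
>>> )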
>>>
>>> Please find the output which comes while checkpointing
>>>
>>> ----------------------------------------------------------------------
>>> [n0:01596] orte_checkpoint: Checkpointing...
>>> [n0:01596] PID 1514
>>> [n0:01596] Connected to Mpirun [[11946,0],0]
>>> [n0:01596] orte_checkpoint: notify_hnp: Contact Head Node Process PID 1514
>>> [n0:01596] orte_checkpoint: notify_hnp: Requested a checkpoint of jobid [INVALID]
>>> [n0:01596] orte_checkpoint: hnp_receiver: Receive a command message.
>>> [n0:01596] orte_checkpoint: hnp_receiver: Status Update.
>>> [n0:01596] Requested - Global Snapshot Reference: (null)
>>> [n0:01596] orte_checkpoint: hnp_receiver: Receive a command message.
>>> [n0:01596] orte_checkpoint: hnp_receiver: Status Update.
>>> [n0:01596] Pending - Global Snapshot Reference: (null)
>>> [n0:01596] orte_checkpoint: hnp_receiver: Receive a command message.
>>> [n0:01596] orte_checkpoint: hnp_receiver: Status Update.
>>> [n0:01596] Running - Global Snapshot Reference: (null)
>>>
>>> Note: It hangs here
>>>
>>> ----------------------------------------------------------------------
>>>
>>> The command used to launch the program is:
>>>
>>> /usr/local/openmpi-1.3.1/install/bin/mpirun -np 2 -H n5 -am ft-enable-cr --mca btl tcp,self a.out
>>>
>>> And the dummy program is pretty simple, as follows:
>>>
>>> #include <stdio.h>
>>> #include <mpi.h>
>>>
>>> #define LIMIT 10000000
>>>
>>> int main(int argc, char **argv)
>>> {
>>>     int i;
>>>     int my_rank; /* Rank of this process */
>>>     int np;      /* Number of processes */
>>>
>>>     MPI_Init(&argc, &argv);
>>>     MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
>>>     MPI_Comm_size(MPI_COMM_WORLD, &np);
>>>
>>>     /* Loop over many barriers so there is plenty of time to take
>>>        a checkpoint while the job is running. */
>>>     for (i = 0; i <= LIMIT; i++)
>>>     {
>>>         printf("\n HELLO %d", i);
>>>         /* sleep(10); */
>>>         MPI_Barrier(MPI_COMM_WORLD);
>>>     }
>>>     MPI_Finalize();
>>>     return 0;
>>> }
>>>
>>>
>>>
>>> Let me know what the error could be. I feel there is an error in
>>> the MPI process coordination.
>>>
>>> Regards
>>>
>>>
>>> Neeraj Chourasia
>>> Member of Technical Staff
>>> Computational Research Laboratories Limited
>>> (A wholly Owned Subsidiary of TATA SONS Ltd)
>>> P: +91.9890003757
>>>
>>>
>
>
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users