Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] orte-checkpoint hangs
From: Josh Hursey (jjhursey_at_[hidden])
Date: 2010-02-25 15:04:15


On Feb 10, 2010, at 9:45 AM, Addepalli, Srirangam V wrote:

> I am trying to test orte-checkpoint with an MPI job. However, it hangs for all jobs. This is how the job is started:
> mpirun -np 8 -mca ft-enable cr /apps/nwchem-5.1.1/bin/LINUX64/nwchem siosi6.nw

This might be the problem, if it wasn't a typo. The command line flag is "-am ft-enable-cr" not "-mca ft-enable cr". The former activates a set of MCA parameters (in the AMCA file 'ft-enable-cr'). The latter should be ignored by the MCA system.
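
For reference, a minimal sketch of the corrected launch line, assuming the same process count, binary path, and input file as in the original report. The variable name launch_cmd is only illustrative, and the command is printed rather than executed so it can be checked before running it on a build with checkpoint/restart support:

```shell
# Corrected launch line: "-am ft-enable-cr" replaces "-mca ft-enable cr".
# The binary path and input file are copied from the original report.
launch_cmd="mpirun -np 8 -am ft-enable-cr /apps/nwchem-5.1.1/bin/LINUX64/nwchem siosi6.nw"

# Print the command for inspection instead of running it here.
echo "$launch_cmd"
```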

Give that a try and let us know if the behavior changes.

-- Josh

> From another terminal I try orte-checkpoint:
>
> ompi-checkpoint -v --term 9338
> [compute-19-12.local:09377] orte_checkpoint: Checkpointing...
> [compute-19-12.local:09377] PID 9338
> [compute-19-12.local:09377] Connected to Mpirun [[5009,0],0]
> [compute-19-12.local:09377] Terminating after checkpoint
> [compute-19-12.local:09377] orte_checkpoint: notify_hnp: Contact Head Node Process PID 9338
> [compute-19-12.local:09377] orte_checkpoint: notify_hnp: Requested a checkpoint of jobid [INVALID]
> [compute-19-12.local:09377] orte_checkpoint: hnp_receiver: Receive a command message.
> [compute-19-12.local:09377] orte_checkpoint: hnp_receiver: Status Update.
>
>
> Is there any way to debug the issue and get more information or log messages?
>
> Rangam