
Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] ompi-clean on single executable
From: Nicolas Deladerriere (nicolas.deladerriere_at_[hidden])
Date: 2012-10-26 07:14:35


Thanks, all, for your comments.

Ralph,

What I was initially looking for is a tool (or an option of orte-clean) that
cleans up the mess you are talking about, but only the mess that has been
created by a single mpirun command. As far as I understand, orte-clean cleans
up all the mess on a node associated with every Open MPI process that has run
(or is currently running) there.

According to Rolf's comment, mpirun does not usually leave any zombie
processes behind, so the effect of orte-clean seems limited. But since the
tool exists, I was wondering: does it still do anything useful?
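
For illustration, what I am after is roughly a per-job equivalent of the
following sketch (the PID file path is just a placeholder, and as mentioned
earlier in the thread, I cannot modify our submission system to do this):

% mpirun -np 2 myexec1 &
% echo $! > /tmp/myexec1.pid            # remember this job's mpirun PID
...
% kill -TERM "$(cat /tmp/myexec1.pid)"  # later: terminate only this job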

Cheers,
Nicolas

2012/10/25 Ralph Castain <rhc_at_[hidden]>

> Okay, now I'm confused. If all you want to do is cleanly "kill" a running
> OMPI job, then why not just issue
>
> $ kill -TERM <pid-for-that-mpirun>
>
> This will cause mpirun to order the clean termination of all remote procs
> within that execution, and then cleanly terminate itself. No tool we create
> could do it any better.
>
> Is there an issue with doing so?
>
> orte-clean was intended to clean up the mess if/when the above method
> doesn't work - i.e., when you have to "kill -KILL" mpirun, which forcibly
> kills mpirun but might leave zombie orteds on the remote nodes.
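>
> For illustration, the full escalation might look like this (with the PID
> found however you like, e.g., with ps):
>
> $ kill -TERM <pid-of-mpirun>   # ask mpirun to terminate the job cleanly
> $ kill -KILL <pid-of-mpirun>   # only if that hangs: force-kill mpirun...
> $ orte-clean -v                # ...then sweep up leftover session dirs
>                                # and zombie orteds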
>
>
> On Oct 24, 2012, at 10:39 AM, Jeff Squyres <jsquyres_at_[hidden]> wrote:
>
> > Or perhaps it could be cloned, renamed to orte-kill, and modified to
> > kill one (or more) specific job(s). That would be POSIX-like ("kill"
> > vs. "clean").
> >
> >
> > On Oct 24, 2012, at 1:32 PM, Rolf vandeVaart wrote:
> >
> >> And just to give a little context: ompi-clean was initially created to
> >> "clean up" a node, not to clean up a specific job. It was for the case
> >> where MPI jobs would leave some files behind or leave some processes
> >> running. (I do not believe this happens much at all anymore.) But, as was
> >> said, there is no reason it could not be modified.
> >>
> >>> -----Original Message-----
> >>> From: users-bounces_at_[hidden] [mailto:users-bounces_at_[hidden]]
> >>> On Behalf Of Jeff Squyres
> >>> Sent: Wednesday, October 24, 2012 12:56 PM
> >>> To: Open MPI Users
> >>> Subject: Re: [OMPI users] ompi-clean on single executable
> >>>
> >>> ...but patches would be greatly appreciated. :-)
> >>>
> >>> On Oct 24, 2012, at 12:24 PM, Ralph Castain wrote:
> >>>
> >>>> All things are possible, including what you describe. Not sure when we
> >>>> would get to it, though.
> >>>>
> >>>>
> >>>> On Oct 24, 2012, at 4:01 AM, Nicolas Deladerriere
> >>>> <nicolas.deladerriere_at_[hidden]> wrote:
> >>>>
> >>>>> Reuti,
> >>>>>
> >>>>> The problem I am facing is only a small part of our production
> >>>>> system, and I cannot modify our mpirun submission system. This is why
> >>>>> I am looking for a solution based only on ompi-clean or the mpirun
> >>>>> command options.
> >>>>>
> >>>>> Thanks,
> >>>>> Nicolas
> >>>>>
> >>>>> 2012/10/24, Reuti <reuti_at_[hidden]>:
> >>>>>> On 24.10.2012 at 11:33, Nicolas Deladerriere wrote:
> >>>>>>
> >>>>>>> Reuti,
> >>>>>>>
> >>>>>>> Thanks for your comments,
> >>>>>>>
> >>>>>>> In our case, we are currently running different mpirun commands on
> >>>>>>> clusters sharing the same frontend. Basically, we use a wrapper to
> >>>>>>> run the mpirun command and to run an ompi-clean command to clean up
> >>>>>>> the MPI job if required.
> >>>>>>> Using ompi-clean like this just kills all the other MPI jobs running
> >>>>>>> on the same frontend. I cannot use a queuing system
> >>>>>>
> >>>>>> Why? Using it on a single machine was only one possible setup. Its
> >>>>>> purpose is to distribute jobs to slave hosts. If you already have one
> >>>>>> frontend as the login machine, it fits perfectly: the qmaster (in the
> >>>>>> case of SGE) can run there and the execds on the nodes.
> >>>>>>
> >>>>>> -- Reuti
> >>>>>>
> >>>>>>
> >>>>>>> as you have suggested. This is why I was wondering about an option
> >>>>>>> or some other solution for the ompi-clean command that avoids this
> >>>>>>> blanket cleanup of all MPI jobs.
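> >>>>>>>
> >>>>>>> For illustration only, a per-job wrapper along these lines would
> >>>>>>> avoid the blanket cleanup (a sketch, not our actual production
> >>>>>>> wrapper, which I cannot modify):
> >>>>>>>
> >>>>>>> #!/bin/sh
> >>>>>>> # run one MPI job; on interrupt, signal only this job's mpirun
> >>>>>>> mpirun -np 2 "$@" &
> >>>>>>> pid=$!
> >>>>>>> trap 'kill -TERM "$pid"' INT TERM
> >>>>>>> wait "$pid"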
> >>>>>>>
> >>>>>>> Cheers
> >>>>>>> Nicolas
> >>>>>>>
> >>>>>>> 2012/10/24, Reuti <reuti_at_[hidden]>:
> >>>>>>>> Hi,
> >>>>>>>>
> >>>>>>>> On 24.10.2012 at 09:36, Nicolas Deladerriere wrote:
> >>>>>>>>
> >>>>>>>>> I am having an issue running ompi-clean: it cleans up (this is
> >>>>>>>>> normal) the session associated with a user, which means it kills
> >>>>>>>>> all running jobs associated with this session (this is also
> >>>>>>>>> normal). But I would like to be able to clean up the session
> >>>>>>>>> associated with a job (not a user).
> >>>>>>>>>
> >>>>>>>>> Here is my point:
> >>>>>>>>>
> >>>>>>>>> I am running two executables:
> >>>>>>>>>
> >>>>>>>>> % mpirun -np 2 myexec1
> >>>>>>>>> --> run with PID 2399 ...
> >>>>>>>>> % mpirun -np 2 myexec2
> >>>>>>>>> --> run with PID 2402 ...
> >>>>>>>>>
> >>>>>>>>> When I run orte-clean, I get this result:
> >>>>>>>>> % orte-clean -v
> >>>>>>>>> orte-clean: cleaning session dir tree
> >>>>>>>>> openmpi-sessions-ndelader_at_myhost_0
> >>>>>>>>> orte-clean: killing any lingering procs
> >>>>>>>>> orte-clean: found potential rogue orterun process
> >>>>>>>>> (pid=2399,user=ndelader), sending SIGKILL...
> >>>>>>>>> orte-clean: found potential rogue orterun process
> >>>>>>>>> (pid=2402,user=ndelader), sending SIGKILL...
> >>>>>>>>>
> >>>>>>>>> Which means that both jobs have been killed :-( Basically, I
> >>>>>>>>> would like to run orte-clean with an executable name, a PID, or
> >>>>>>>>> whatever else identifies which job I want to stop and clean. It
> >>>>>>>>> seems I would need to create one Open MPI session per job. Does
> >>>>>>>>> that make sense? And I would like to be able to run something like
> >>>>>>>>> the following command and get the following result:
> >>>>>>>>>
> >>>>>>>>> % orte-clean -v myexec1
> >>>>>>>>> orte-clean: cleaning session dir tree
> >>>>>>>>> openmpi-sessions-ndelader_at_myhost_0
> >>>>>>>>> orte-clean: killing any lingering procs
> >>>>>>>>> orte-clean: found potential rogue orterun process
> >>>>>>>>> (pid=2399,user=ndelader), sending SIGKILL...
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> Does that make sense? Is there a way to perform this kind of
> >>>>>>>>> selection in the cleaning process?
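> >>>>>>>>>
> >>>>>>>>> In the meantime, a rough approximation of that selection, assuming
> >>>>>>>>> the executable name is unique on the node, might be:
> >>>>>>>>>
> >>>>>>>>> % pkill -TERM -f 'mpirun.*myexec1'   # signal only the mpirun whose
> >>>>>>>>>                                      # command line mentions myexec1
> >>>>>>>>>
> >>>>>>>>> but that does not clean up the session directories, which is why a
> >>>>>>>>> per-job orte-clean would be nicer.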
> >>>>>>>>
> >>>>>>>> How many jobs are you starting on how many nodes at a time? This
> >>>>>>>> requirement could be a good reason to start using a queuing system,
> >>>>>>>> where you can remove jobs individually and also serialize your
> >>>>>>>> workflow. In fact, we also use GridEngine locally on workstations
> >>>>>>>> for this purpose.
> >>>>>>>>
> >>>>>>>> -- Reuti