Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] OpenMPI 1.3 and SGE 6.2u1
From: Malone, Scott (Scott.Malone_at_[hidden])
Date: 2009-03-19 12:19:05


Since I'm new to Open MPI, I want to make sure that I understand this. When the job starts, the orted is daemonized, and because of this the daemons are not bound to the sge_shepherd on each node. This results in the loss of accounting for those processes. I guess that when I start mpirun with debugging, the orted is no longer daemonized and stays attached to the sge_shepherd? If this is true, is there any way to start the orted non-daemonized without turning on debugging until 1.3.2 is available?
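
One way to check this on a compute node is to walk the parent chain of each orted and see whether an sge_shepherd appears in it. The following is only a minimal sketch, assuming a Linux /proc filesystem and assuming the shepherd's command name shows up as "sge_shepherd"; the helper names read_proc_table and has_ancestor are made up for illustration and are not part of Open MPI or SGE.

```python
import os


def read_proc_table(proc_root="/proc"):
    """Build {pid: (ppid, command_name)} from Linux /proc/<pid>/stat files."""
    table = {}
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "stat")) as fh:
                stat = fh.read()
        except OSError:
            continue  # the process exited while we were scanning
        # Layout is "pid (comm) state ppid ..."; comm may itself contain
        # spaces or parentheses, so split on the first "(" and last ")".
        lpar, rpar = stat.index("("), stat.rindex(")")
        comm = stat[lpar + 1:rpar]
        ppid = int(stat[rpar + 2:].split()[1])
        table[int(entry)] = (ppid, comm)
    return table


def has_ancestor(pid, ancestor_name, table):
    """Return True if any ancestor of pid has the given command name."""
    seen = set()
    while pid in table and pid not in seen:
        seen.add(pid)  # guard against a malformed, cyclic table
        ppid, _ = table[pid]
        if ppid in table and table[ppid][1] == ancestor_name:
            return True
        pid = ppid
    return False
```

On a node, has_ancestor(pid, "sge_shepherd", read_proc_table()) could then be called for each orted pid. A daemonized orted is normally reparented to pid 1, so the walk would end there without ever meeting a shepherd.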

Thanks!

Scott Malone
Manager, High Performance Computing Facility
Information Sciences - Research Informatics
St. Jude Children's Research Hospital
332 North Lauderdale
Memphis, TN 38105
901.495.4947
scott.malone_at_[hidden]
 

> -----Original Message-----
> From: users-bounces_at_[hidden] [mailto:users-bounces_at_[hidden]] On
> Behalf Of Reuti
> Sent: Thursday, March 19, 2009 10:32 AM
> To: Open MPI Users
> Subject: Re: [OMPI users] OpenMPI 1.3 and SGE 6.2u1
>
> Hi,
>
> On 19.03.2009 at 16:07, Malone, Scott wrote:
>
> > I am having two problems with the integration of OpenMPI 1.3 and SGE
> > 6.2u1, both of which are new to us. The troubles are getting jobs
> > to suspend/resume and collecting CPU time correctly.
> >
> > For suspend/resume I have added the following to my mpirun command:
> >
> > --mca orte_forward_job_control 1 --mca plm_rsh_daemonize_qrsh 1
> >
> Why? In 1.3 the orted already daemonizes because of a bug, and I
> have only found daemonizing the orted to be necessary for the
> notify feature.
>
> > and adjusted the suspend_method for the queue that it's running
> > in. I have not gotten it to place any process into the T state.
> > Although this is not a huge problem, I hope to have this working in
> > the future.
> >
> > My main problem is getting the CPU time correct. On a multi-CPU
> > job, only the master node shows the correct CPU time for its
> > process; the others are very short, and I am not sure what they
> > are measuring (I believe startup time). Here's an example:
> >
> When the orted processes daemonize, they are no longer bound to the
> sge_shepherd. As a result, no one is tracking their accounting on
> the nodes. AFAIK this will be fixed in 1.3.2, so that the daemons
> stay bound to a running sge_shepherd.
>
> If you need the -notify feature and correct accounting, you will need
> to wait until the qrsh_starter in SGE is fixed so that it does not
> exit when it receives a USR1/USR2 signal.
>
> -- Reuti
>
> >
> > cpu 0.360
> > cpu 0.480
> > cpu 0.470
> > cpu 0.490
> > cpu 0.530
> > cpu 0.470
> > cpu 0.680
> > cpu 464.305
> >
> > From watching the runs, that time is close to the wall-clock
> > time and matches what I see for that single process. I have
> > gotten it to report what I believe are correct values, but only by
> > adding the --debug-daemons option to our mpirun command. With that
> > I get the following:
> >
> > cpu 73.146
> > cpu 72.982
> > cpu 73.381
> > cpu 73.142
> > cpu 73.029
> > cpu 73.183
> > cpu 73.117
> > cpu 73.265
> > cpu 73.236
> >
> > I have noticed that when I get the correct CPU time, qrsh
> > processes start up (my understanding is that these are what start
> > the processes on the remote machines) and stay running until
> > the job is finished. When I don't get the correct CPU time, I see
> > the qrsh processes start on the master node but die off once they
> > have started the processes on the remote nodes. The PE
> > configuration looks like the following:
> >
> > pe_name            orte
> > slots              560
> > user_lists         NONE
> > xuser_lists        NONE
> > start_proc_args    /bin/true
> > stop_proc_args     /bin/true
> > allocation_rule    $round_robin
> > control_slaves     TRUE
> > job_is_first_task  FALSE
> > urgency_slots      min
> > accounting_summary FALSE
> >
> > Please let me know if I can provide any more information to help
> > figure this out.
> >
> > Thanks,
> >
> > Scott Malone
> >
> > Email Disclaimer: www.stjude.org/emaildisclaimer
> > _______________________________________________
> > users mailing list
> > users_at_[hidden]
> > http://www.open-mpi.org/mailman/listinfo.cgi/users
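
To compare the two accounting listings quoted above, the per-process cpu values can be summed. The sketch below simply parses "cpu <seconds>" lines as they appear in this thread; the function name total_cpu is made up for illustration, and the input layout is an assumption based on the listings in the post rather than on any documented SGE output format.

```python
def total_cpu(report):
    """Sum the value of every 'cpu <seconds>' line, ignoring quote markers."""
    total = 0.0
    for line in report.splitlines():
        # Strip any leading "> " mail-quote prefixes before splitting.
        fields = line.lstrip("> ").split()
        if len(fields) == 2 and fields[0] == "cpu":
            total += float(fields[1])
    return total


# Values copied from the two listings in the post.
daemonized = """
cpu 0.360
cpu 0.480
cpu 0.470
cpu 0.490
cpu 0.530
cpu 0.470
cpu 0.680
cpu 464.305
"""

with_debug_daemons = """
cpu 73.146
cpu 72.982
cpu 73.381
cpu 73.142
cpu 73.029
cpu 73.183
cpu 73.117
cpu 73.265
cpu 73.236
"""

print(f"daemonized orted:    {total_cpu(daemonized):.3f} s")
print(f"with --debug-daemons: {total_cpu(with_debug_daemons):.3f} s")
```

Note that the two listings have different numbers of entries as posted (eight and nine), so the totals are only a rough comparison.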