Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Try to submit OMPI job to SGE gives ERRORS (orte_plm_base_select failed & orte_ess_set_name failed) (Reuti)
From: Reuti (reuti_at_[hidden])
Date: 2011-04-15 18:23:23

On 15.04.2011 at 23:02, Derrick LIN wrote:

> - what is your SGE configuration `qconf -sconf`?
> <snip>
> rlogin_daemon /usr/sbin/sshd -i
> rlogin_command /usr/bin/ssh
> qlogin_daemon /usr/sbin/sshd -i
> qlogin_command /usr/share/gridengine/qlogin-wrapper
> rsh_daemon /usr/sbin/sshd -i
> rsh_command /usr/bin/ssh

So you route the SGE startup mechanism through `ssh`; nevertheless, it should work, of course. One small difference from a conventional `ssh` setup is that SGE starts a private daemon for each job on the nodes, listening on a random port.
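
To see at a glance which startup mechanism a cluster is configured with, the six relevant entries can be filtered out of the `qconf -sconf` output. This is only a minimal sketch; the function name `show_remote_startup` is made up for illustration and is not part of SGE:

```shell
#!/bin/bash
# Hypothetical helper: read `qconf -sconf` output on stdin and keep only
# the rsh/rlogin/qlogin command and daemon entries.
show_remote_startup() {
    grep -E '^(rsh|rlogin|qlogin)_(command|daemon)[[:space:]]'
}

# Typical use on a host where the SGE client tools are installed:
#   qconf -sconf | show_remote_startup
```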

When you use only one host, the processes are simply forked locally and no `ssh` call is made. Does your test use more than one node?
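
One way to check this from inside the job itself is to count the distinct hosts in the file that SGE points to with `$PE_HOSTFILE` (one line per host: hostname, slots, queue, processor range). A rough sketch, with `count_pe_hosts` being an illustrative name only:

```shell
#!/bin/bash
# Count the distinct hosts listed in an SGE PE hostfile.
# Each line has the form: hostname slots queue processor-range
count_pe_hosts() {
    awk 'NF { print $1 }' "$1" | sort -u | wc -l | tr -d ' '
}

# Inside a job submitted with a PE request, SGE sets $PE_HOSTFILE.
if [ -n "${PE_HOSTFILE:-}" ] && [ -r "$PE_HOSTFILE" ]; then
    echo "job spans $(count_pe_hosts "$PE_HOSTFILE") host(s)"
fi
```

If this reports a single host, the job never exercised the remote startup path at all.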

Did you copy your SGE-aware version to the same location on all nodes? Are you getting the correct `mpiexec` and shared libraries in your job script? Does the output of:

which mpiexec
ldd ompi_job

show the expected ones (ompi_job is the binary and the script) when submitted with a PE request?
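
A job script along these lines would collect that information on the execution host. It is only a sketch: the PE name `orte` and the binary name `ompi_job` are taken from this thread, while the slot count, the `-cwd` option, the binary's location in the submit directory and the helper `show_tool` are assumptions for illustration.

```shell
#!/bin/bash
#$ -S /bin/bash
#$ -cwd
#$ -pe orte 4

# Report where a command resolves on the execution host.
show_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1 -> $(command -v "$1")"
    else
        echo "$1 -> not found in PATH"
    fi
}

show_tool mpiexec

# ./ompi_job is assumed to be the MPI binary in the submit directory.
if [ -x ./ompi_job ]; then
    ldd ./ompi_job
else
    echo "./ompi_job not found or not executable"
fi
```

Submit it with `qsub` and compare the job's output file with what the same commands print in an interactive shell on that node.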

-- Reuti

> jsv_url none
> jsv_allowed_mod ac,h,i,e,o,j,M,N,p,w
> # my queue setting is:
> qname dev.q
> hostlist
> seq_no 0
> load_thresholds np_load_avg=1.75
> suspend_thresholds NONE
> nsuspend 1
> suspend_interval 00:05:00
> priority 0
> min_cpu_interval 00:05:00
> processors UNDEFINED
> ckpt_list NONE
> pe_list make orte
> rerun FALSE
> slots 8
> tmpdir /tmp
> shell /bin/bash
> prolog NONE
> epilog NONE
> shell_start_mode posix_compliant
> starter_method NONE
> suspend_method NONE
> resume_method NONE
> terminate_method NONE
> notify 00:00:60
> owner_list NONE
> user_lists NONE
> xuser_lists NONE
> subordinate_list NONE
> complex_values NONE
> projects NONE
> xprojects NONE
> calendar NONE
> initial_state default
> s_cpu INFINITY
> h_cpu INFINITY
> s_fsize INFINITY
> h_fsize INFINITY
> s_data INFINITY
> h_data INFINITY
> s_stack INFINITY
> h_stack INFINITY
> s_core INFINITY
> h_core INFINITY
> s_rss INFINITY
> h_rss INFINITY
> s_vmem INFINITY
> h_vmem INFINITY
> # my PE setting is:
> pe_name orte
> slots 4
> user_lists NONE
> xuser_lists NONE
> start_proc_args /bin/true
> stop_proc_args /bin/true
> allocation_rule $round_robin
> control_slaves TRUE
> job_is_first_task FALSE
> urgency_slots min
> accounting_summary FALSE
> a) you are testing from master to a node, but jobs are running between nodes.
> b) unless you need X11 forwarding, using SGE’s -builtin- communication works fine, this way you can have a cluster without `rsh` or `ssh` (or limited to admin staff) and can still run parallel jobs.
> Sorry for the misleading snip. All the hosts (both master and execution hosts) in the cluster can reach each other via passwordless `ssh` without an issue. As my 2) states, I could run a generic openmpi job successfully without SGE. So I do not think it is a communication issue?
> Then you are bypassing SGE’s slot allocation and will have wrong accounting and no job control of the slave tasks.
> I know it's not a proper submission as a PE job. I simply ran out of ideas about what to do next. Even though it's not the proper way, the openmpi error didn't happen and the job completed. I am wondering why.
> The correct version of my OpenMPI is 1.4.1, not 1.3 in my first post.
> I later installed OpenMPI on the submission host and the master as well, but it didn't help. So I guess OpenMPI is needed on the execution hosts only.