Subject: Re: [OMPI users] Try to submit OMPI job to SGE gives ERRORS (orte_plm_base_select failed & orte_ess_set_name failed) (Reuti)
From: Derrick LIN (klin938_at_[hidden])
Date: 2011-04-15 17:02:21


>
> - what is your SGE configuration `qconf -sconf`?

#global:
execd_spool_dir /var/spool/gridengine/execd
mailer /usr/bin/mail
xterm /usr/bin/xterm
load_sensor none
prolog none
epilog none
shell_start_mode posix_compliant
login_shells bash,sh,ksh,csh,tcsh
min_uid 0
min_gid 0
user_lists none
xuser_lists none
projects none
xprojects none
enforce_project false
enforce_user auto
load_report_time 00:00:40
max_unheard 00:05:00
reschedule_unknown 00:00:00
loglevel log_warning
administrator_mail root
set_token_cmd none
pag_cmd none
token_extend_time none
shepherd_cmd none
qmaster_params none
execd_params none
reporting_params accounting=true reporting=false \
                             flush_time=00:00:15 joblog=false \
                             sharelog=00:00:00
finished_jobs 100
gid_range 65400-65500
max_aj_instances 2000
max_aj_tasks 75000
max_u_jobs 0
max_jobs 0
auto_user_oticket 0
auto_user_fshare 0
auto_user_default_project none
auto_user_delete_time 86400
delegated_file_staging false
reprioritize false
rlogin_daemon /usr/sbin/sshd -i
rlogin_command /usr/bin/ssh
qlogin_daemon /usr/sbin/sshd -i
qlogin_command /usr/share/gridengine/qlogin-wrapper
rsh_daemon /usr/sbin/sshd -i
rsh_command /usr/bin/ssh
jsv_url none
jsv_allowed_mod ac,h,i,e,o,j,M,N,p,w
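
(A side note on the rlogin/rsh/qlogin entries above: since they point at sshd and ssh, interactive and tightly integrated jobs are started over ssh here. A quick sanity check of that startup path, independent of Open MPI, is simply:

    qrsh hostname

which should print the name of whichever execution host SGE picks for the interactive job.)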

# my queue setting is:

qname dev.q
hostlist sgeqexec01.domain.com.au
seq_no 0
load_thresholds np_load_avg=1.75
suspend_thresholds NONE
nsuspend 1
suspend_interval 00:05:00
priority 0
min_cpu_interval 00:05:00
processors UNDEFINED
qtype BATCH INTERACTIVE
ckpt_list NONE
pe_list make orte
rerun FALSE
slots 8
tmpdir /tmp
shell /bin/bash
prolog NONE
epilog NONE
shell_start_mode posix_compliant
starter_method NONE
suspend_method NONE
resume_method NONE
terminate_method NONE
notify 00:00:60
owner_list NONE
user_lists NONE
xuser_lists NONE
subordinate_list NONE
complex_values NONE
projects NONE
xprojects NONE
calendar NONE
initial_state default
s_rt INFINITY
h_rt INFINITY
s_cpu INFINITY
h_cpu INFINITY
s_fsize INFINITY
h_fsize INFINITY
s_data INFINITY
h_data INFINITY
s_stack INFINITY
h_stack INFINITY
s_core INFINITY
h_core INFINITY
s_rss INFINITY
h_rss INFINITY
s_vmem INFINITY
h_vmem INFINITY

# my PE setting is:

pe_name orte
slots 4
user_lists NONE
xuser_lists NONE
start_proc_args /bin/true
stop_proc_args /bin/true
allocation_rule $round_robin
control_slaves TRUE
job_is_first_task FALSE
urgency_slots min
accounting_summary FALSE
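
For reference, a minimal job script for this PE would look something like the following (the script name and binary are only placeholders; with tight integration mpirun should take the slot count and host list from SGE, so no -np or hostfile is given):

    #!/bin/bash
    #$ -S /bin/bash
    #$ -cwd
    #$ -pe orte 4
    mpirun ./my_mpi_program

submitted with:

    qsub myjob.sh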

> a) you are testing from master to a node, but jobs are running between
> nodes.

> b) unless you need X11 forwarding, using SGE’s -builtin- communication
> works fine, this way you can have a cluster without `rsh` or `ssh` (or
> limited to admin staff) and can still run parallel jobs.
>

Sorry for the misleading snip. All the hosts in the cluster (both the master and the
execution hosts) can reach each other over passwordless SSH without an issue. As my
point 2) states, I could run a generic Open MPI job without SGE successfully, so I
do not think it is a communication issue.
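
(By "a generic Open MPI job without SGE" I mean starting mpirun by hand with an explicit host list, roughly like the following, where the hostnames and the binary are only placeholders:

    mpirun -np 4 -host sgeqexec01,sgeqexec02 ./my_mpi_program

That runs fine, which is why I suspect the SGE integration rather than the network.)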

> Then you are bypassing SGE’s slot allocation and will have wrong accounting
> and no job control of the slave tasks.
>

I know it's not a proper submission of a PE job. I have simply run out of ideas about
what to try next. Even though it's not the proper way, that Open MPI error didn't
happen and the job completed, and I am wondering why.

The correct version of my Open MPI is 1.4.1, not 1.3 as stated in my first post.

I later installed Open MPI on the submission host and on the master as well, but it
didn't make any difference, so I guess Open MPI is only needed on the execution hosts.
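
One thing I can still check is whether the Open MPI build on the execution hosts actually has gridengine support compiled in; as far as I understand, ompi_info can show that:

    ompi_info | grep gridengine

If no gridengine component is listed, the build was made without SGE support, which seems worth ruling out given that the errors only appear when the job is started under SGE.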