- what is your SGE configuration `qconf -sconf`?
#global:
execd_spool_dir /var/spool/gridengine/execd
mailer /usr/bin/mail
xterm /usr/bin/xterm
load_sensor none
prolog none
epilog none
shell_start_mode posix_compliant
login_shells bash,sh,ksh,csh,tcsh
min_uid 0
min_gid 0
user_lists none
xuser_lists none
projects none
xprojects none
enforce_project false
enforce_user auto
load_report_time 00:00:40
max_unheard 00:05:00
reschedule_unknown 00:00:00
loglevel log_warning
administrator_mail root
set_token_cmd none
pag_cmd none
token_extend_time none
shepherd_cmd none
qmaster_params none
execd_params none
reporting_params accounting=true reporting=false \
flush_time=00:00:15 joblog=false sharelog=00:00:00
finished_jobs 100
gid_range 65400-65500
max_aj_instances 2000
max_aj_tasks 75000
max_u_jobs 0
max_jobs 0
auto_user_oticket 0
auto_user_fshare 0
auto_user_default_project none
auto_user_delete_time 86400
delegated_file_staging false
reprioritize false
rlogin_daemon /usr/sbin/sshd -i
rlogin_command /usr/bin/ssh
qlogin_daemon /usr/sbin/sshd -i
qlogin_command /usr/share/gridengine/qlogin-wrapper
rsh_daemon /usr/sbin/sshd -i
rsh_command /usr/bin/ssh
jsv_url none
jsv_allowed_mod ac,h,i,e,o,j,M,N,p,w
# my queue setting is:
qname dev.q
seq_no 0
load_thresholds np_load_avg=1.75
suspend_thresholds NONE
nsuspend 1
suspend_interval 00:05:00
priority 0
min_cpu_interval 00:05:00
processors UNDEFINED
qtype BATCH INTERACTIVE
ckpt_list NONE
pe_list make orte
rerun FALSE
slots 8
tmpdir /tmp
shell /bin/bash
prolog NONE
epilog NONE
shell_start_mode posix_compliant
starter_method NONE
suspend_method NONE
resume_method NONE
terminate_method NONE
notify 00:00:60
owner_list NONE
user_lists NONE
xuser_lists NONE
subordinate_list NONE
complex_values NONE
projects NONE
xprojects NONE
calendar NONE
initial_state default
s_rt INFINITY
h_rt INFINITY
s_cpu INFINITY
h_cpu INFINITY
s_fsize INFINITY
h_fsize INFINITY
s_data INFINITY
h_data INFINITY
s_stack INFINITY
h_stack INFINITY
s_core INFINITY
h_core INFINITY
s_rss INFINITY
h_rss INFINITY
s_vmem INFINITY
h_vmem INFINITY
# my PE setting is:
pe_name orte
slots 4
user_lists NONE
xuser_lists NONE
start_proc_args /bin/true
stop_proc_args /bin/true
allocation_rule $round_robin
control_slaves TRUE
job_is_first_task FALSE
urgency_slots min
accounting_summary FALSE
a) you are testing from master to a node, but jobs are running between nodes.
b) unless you need X11 forwarding, using SGE’s -builtin- communication works fine; this way you can have a cluster without `rsh` or `ssh` (or with them limited to admin staff) and still run parallel jobs.
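For point b), the entries that decide how SGE starts remote tasks are the `rsh_*`, `rlogin_*` and `qlogin_*` lines of the global configuration. A small sketch for listing them follows. `show_remote_startup` is only a hypothetical helper name, and the `builtin` value mentioned afterwards is assumed to be available in your SGE version (it is in 6.2 and later, as far as I know).

```shell
# Print only the remote-startup entries from `qconf -sconf` output on stdin.
show_remote_startup() {
    grep -E '^(rsh|rlogin|qlogin)_(command|daemon)[[:space:]]'
}

# Typical use on the qmaster or a submit host:
#   qconf -sconf | show_remote_startup
```

If every listed entry reads `builtin`, no `sshd` or `rshd` is involved in starting the slave tasks. The configuration posted above uses `sshd -i` and `ssh` instead; the values can be changed with `qconf -mconf`.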
Sorry for the misleading snip. All the hosts in the cluster (both the master and the execution hosts) can log in to each other without a password. As my 2) states, I could run a generic OpenMPI job successfully without SGE. So I do not think it is a communication issue?
Then you are bypassing SGE’s slot allocation and will have wrong accounting and no job control of the slave tasks.
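For comparison, a run that stays under SGE's control would go through the PE. The following is a minimal sketch of such a job script, assuming an Open MPI build with SGE support and the `orte` PE shown above. `./mpi_hello` is a hypothetical test program, and `granted_slots` is a hypothetical helper that only totals the slot column of `$PE_HOSTFILE`.

```shell
#!/bin/bash
#$ -N ompi_test
#$ -pe orte 4
#$ -cwd
#$ -j y

# Sum the slot counts (second column) of an SGE PE hostfile.
granted_slots() {
    awk '{ total += $2 } END { print total + 0 }' "$1"
}

# Only start MPI when running inside an SGE parallel job.
if [ -n "${PE_HOSTFILE:-}" ] && [ -r "$PE_HOSTFILE" ]; then
    echo "SGE granted $(granted_slots "$PE_HOSTFILE") slots (NSLOTS=$NSLOTS)"
    # ./mpi_hello is a placeholder; replace it with your own binary.
    mpirun -np "$NSLOTS" ./mpi_hello
fi
```

Submitted with `qsub`, this leaves the host and slot selection to SGE, so the slave tasks show up in the accounting and can be controlled by the scheduler.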
I know it's not a proper submission as a PE job. I simply ran out of ideas about what to do next. Even though it's not the proper way, the OpenMPI error didn't happen and the job completed. I am wondering why.
The correct version of my OpenMPI is 1.4.1, not 1.3 as stated in my first post.
I have since installed OpenMPI on the submission host and the master as well, but it didn't help. So I guess OpenMPI is needed on the execution hosts only.
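One thing that may be worth checking on each execution host is whether the installed OpenMPI was built with SGE support at all. This is a rough sketch, assuming `ompi_info` is on the `PATH` and that an SGE-enabled build lists a `gridengine` component. `has_gridengine_support` is a hypothetical helper name.

```shell
# Succeeds when the `ompi_info` output on stdin mentions a gridengine component.
has_gridengine_support() {
    grep -q 'gridengine'
}

# Run this on each execution host.
if command -v ompi_info >/dev/null 2>&1; then
    if ompi_info | has_gridengine_support; then
        echo "Open MPI reports a gridengine component"
    else
        echo "no gridengine component found"
    fi
fi
```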