Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Try to submit OMPI job to SGE gives ERRORS (orte_plm_base_select failed & orte_ess_set_name failed) (Reuti)
From: Reuti (reuti_at_[hidden])
Date: 2011-04-15 18:23:23


On 15.04.2011 at 23:02, Derrick LIN wrote:

> - what is your SGE configuration `qconf -sconf`?
>
> <snip>
> rlogin_daemon /usr/sbin/sshd -i
> rlogin_command /usr/bin/ssh
> qlogin_daemon /usr/sbin/sshd -i
> qlogin_command /usr/share/gridengine/qlogin-wrapper
> rsh_daemon /usr/sbin/sshd -i
> rsh_command /usr/bin/ssh

So you route the SGE startup mechanism through `ssh`; nevertheless, it should of course work. One small difference from a conventional `ssh` is that SGE starts a private daemon for each job on the nodes, listening on a random port.

When you use only one host, the processes are created by forks and no `ssh` call is made. Does your test use more than one node?
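One way to check this from inside the job is to count the distinct hosts in the hostfile that SGE hands to a parallel job. The following is only a minimal sketch: the helper name `count_pe_hosts` is made up for illustration, and it assumes the usual `$PE_HOSTFILE` layout with one line per host and the hostname in the first column.

```shell
#!/bin/sh
# Hypothetical helper: count the distinct hosts listed in an SGE PE hostfile.
# Assumption: one line per host, hostname in the first column.
count_pe_hosts() {
    awk 'NF { print $1 }' "$1" | sort -u | wc -l | tr -d ' '
}

if [ -n "${PE_HOSTFILE:-}" ] && [ -r "$PE_HOSTFILE" ]; then
    echo "slots granted:  ${NSLOTS:-unknown}"
    echo "distinct hosts: $(count_pe_hosts "$PE_HOSTFILE")"
    cat "$PE_HOSTFILE"
else
    # Outside a PE job SGE does not set PE_HOSTFILE
    echo "PE_HOSTFILE is not set or not readable; probably not a PE job"
fi
```

If this reports a single host, the run would only exercise the fork path and not the `ssh` startup.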

Did you copy your SGE-aware build to the same location on all nodes? Are you getting the correct `mpiexec` and shared libraries in your job script? When submitted with a PE request, does the output of:

#!/bin/sh
# which mpiexec is found first in PATH
which mpiexec
# library search path seen by the job
echo $LD_LIBRARY_PATH
# shared libraries the binary actually resolves to
ldd ompi_job

show the expected ones (ompi_job is the binary and ompi_job.sh the script)?
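A slightly fuller version of that check, with the PE request embedded in the script, might look like the sketch below. The job name `ompi_check`, the slot count of 4 and the `report` helper are assumptions for illustration; `orte` is the PE from the configuration quoted in this thread. `ompi_info | grep gridengine` is added to see whether the Open MPI build found by the job lists a gridengine component.

```shell
#!/bin/sh
#$ -N ompi_check
#$ -pe orte 4
#$ -cwd
#$ -j y
#$ -S /bin/sh

# Hypothetical helper: print "label: value", or "(not found)" for an empty value.
report() {
    value=$2
    if [ -z "$value" ]; then
        value="(not found)"
    fi
    printf '%s: %s\n' "$1" "$value"
}

report "mpiexec" "$(command -v mpiexec 2>/dev/null)"
report "LD_LIBRARY_PATH" "${LD_LIBRARY_PATH:-}"
report "NSLOTS" "${NSLOTS:-}"

# An SGE-aware Open MPI build should list a gridengine component here
if command -v ompi_info >/dev/null 2>&1; then
    ompi_info | grep gridengine || echo "no gridengine component listed"
fi

# ompi_job is the test binary, expected in the submission directory
if [ -x ./ompi_job ]; then
    ldd ./ompi_job
fi
```

Submitted with `qsub ompi_job.sh`, the output ends up in the job's output file in the submission directory and can be compared across nodes.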

-- Reuti

> jsv_url none
> jsv_allowed_mod ac,h,i,e,o,j,M,N,p,w
>
> # my queue setting is:
>
> qname dev.q
> hostlist sgeqexec01.domain.com.au
> seq_no 0
> load_thresholds np_load_avg=1.75
> suspend_thresholds NONE
> nsuspend 1
> suspend_interval 00:05:00
> priority 0
> min_cpu_interval 00:05:00
> processors UNDEFINED
> qtype BATCH INTERACTIVE
> ckpt_list NONE
> pe_list make orte
> rerun FALSE
> slots 8
> tmpdir /tmp
> shell /bin/bash
> prolog NONE
> epilog NONE
> shell_start_mode posix_compliant
> starter_method NONE
> suspend_method NONE
> resume_method NONE
> terminate_method NONE
> notify 00:00:60
> owner_list NONE
> user_lists NONE
> xuser_lists NONE
> subordinate_list NONE
> complex_values NONE
> projects NONE
> xprojects NONE
> calendar NONE
> initial_state default
> s_rt INFINITY
> h_rt INFINITY
> s_cpu INFINITY
> h_cpu INFINITY
> s_fsize INFINITY
> h_fsize INFINITY
> s_data INFINITY
> h_data INFINITY
> s_stack INFINITY
> h_stack INFINITY
> s_core INFINITY
> h_core INFINITY
> s_rss INFINITY
> h_rss INFINITY
> s_vmem INFINITY
> h_vmem INFINITY
>
> # my PE setting is:
>
> pe_name orte
> slots 4
> user_lists NONE
> xuser_lists NONE
> start_proc_args /bin/true
> stop_proc_args /bin/true
> allocation_rule $round_robin
> control_slaves TRUE
> job_is_first_task FALSE
> urgency_slots min
> accounting_summary FALSE
>
> a) you are testing from master to a node, but jobs are running between nodes.
>
> b) unless you need X11 forwarding, using SGE’s -builtin- communication works fine, this way you can have a cluster without `rsh` or `ssh` (or limited to admin staff) and can still run parallel jobs.
>
> Sorry for the misleading snip. All the hosts (both the master and the execution hosts) in the cluster can log in to each other without a password, with no issues. As my 2) states, I could run a generic openmpi job successfully without SGE. So I do not think it is a communication issue?
>
> Then you are bypassing SGE’s slot allocation and will have wrong accounting and no job control of the slave tasks.
>
> I know it's not a proper submission as a PE job. I simply ran out of ideas about what to do next. Even though it's not the proper way, the openmpi error didn't happen and the job completed. I am wondering why.
>
>
> The correct version of my OpenMPI is 1.4.1, not 1.3 in my first post.
>
> I later installed OpenMPI on the submission host and the master as well, but it didn't help. So I guess OpenMPI is needed on the execution hosts only.