
Subject: Re: [OMPI users] Ompi runs thru cmd line but fails when run thru SGE
From: Sangamesh B (forum.san_at_[hidden])
Date: 2009-01-25 00:16:00


Thanks Reuti for the reply.

On Sun, Jan 25, 2009 at 2:22 AM, Reuti <reuti_at_[hidden]> wrote:
> Am 24.01.2009 um 17:12 schrieb Jeremy Stout:
>
>> The RLIMIT error is very common when using OpenMPI + OFED + Sun Grid
>> Engine. You can find more information and several remedies here:
>> http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages
>>
>> I usually resolve this problem by adding "ulimit -l unlimited" near
>> the top of the SGE startup script on the computation nodes and
>> restarting SGE on every node.
>
> Did you request/set any limits with SGE's h_vmem/h_stack resource request?
>
No.

The queue being used is configured as follows:
qconf -sq ib.q
qname                 ib.q
hostlist              @ibhosts
seq_no                0
load_thresholds       np_load_avg=1.75
suspend_thresholds    NONE
nsuspend              1
suspend_interval      00:05:00
priority              0
min_cpu_interval      00:05:00
processors            UNDEFINED
qtype                 BATCH INTERACTIVE
ckpt_list             NONE
pe_list               orte
rerun                 FALSE
slots                 8
tmpdir                /tmp
shell                 /bin/bash
prolog                NONE
epilog                NONE
shell_start_mode      unix_behavior
starter_method        NONE
suspend_method        NONE
resume_method         NONE
terminate_method      NONE
notify                00:00:60
owner_list            NONE
user_lists            NONE
xuser_lists           NONE
subordinate_list      NONE
complex_values        NONE
projects              NONE
xprojects             NONE
calendar              NONE
initial_state         default
s_rt                  INFINITY
h_rt                  INFINITY
s_cpu                 INFINITY
h_cpu                 INFINITY
s_fsize               INFINITY
h_fsize               INFINITY
s_data                INFINITY
h_data                INFINITY
s_stack               INFINITY
h_stack               INFINITY
s_core                INFINITY
h_core                INFINITY
s_rss                 INFINITY
h_rss                 INFINITY
s_vmem                INFINITY
h_vmem                INFINITY

# qconf -sp orte
pe_name               orte
slots                 999
user_lists            NONE
xuser_lists           NONE
start_proc_args       /bin/true
stop_proc_args        /bin/true
allocation_rule       $fill_up
control_slaves        TRUE
job_is_first_task     FALSE
urgency_slots         min
# qconf -shgrp @ibhosts
group_name @ibhosts
hostlist node-0-0.local node-0-1.local node-0-2.local node-0-3.local \
         node-0-4.local node-0-5.local node-0-6.local node-0-7.local \
         node-0-8.local node-0-9.local node-0-10.local node-0-11.local \
         node-0-12.local node-0-13.local node-0-14.local node-0-16.local \
         node-0-17.local node-0-18.local node-0-19.local node-0-20.local \
         node-0-21.local node-0-22.local

The hostnames for the IB interfaces are of the form ibc0, ibc1, ..., ibc22.

Is this difference causing the problem?
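
All the s_*/h_* limits in the queue above are INFINITY, so SGE itself should
not be clamping the locked-memory limit; the 32 KB RLIMIT_MEMLOCK the jobs
report presumably comes from the environment the execution daemons were
started in. One way to confirm what the jobs actually see is to submit a
trivial job that prints the limit on each node. A minimal sketch (the script
and job names are made up; it assumes the ib.q queue and orte PE shown above):

    #!/bin/bash
    #$ -N check_memlock
    #$ -q ib.q
    #$ -pe orte 4
    #$ -cwd
    #$ -j y
    # check_memlock.sh -- hypothetical diagnostic job, not from this thread.
    # It only reports the locked-memory limit an SGE-spawned shell sees.
    echo "host: $(hostname)"
    echo "max locked memory (kB): $(ulimit -l)"

Submitted with "qsub check_memlock.sh", an output of something like 32 instead
of unlimited would mean the limit is inherited from sgeexecd rather than
imposed by the queue configuration.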

ssh status:
between master and nodes: works fine, but with some delay.

between nodes: works fine, no delay.
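
Since the multi-node failure below starts with "ssh_exchange_identification:
Connection closed by remote host", it may also be worth checking whether sshd
on the nodes drops connections when mpirun opens many of them at once; an
sshd MaxStartups limit that is too low is a common cause of that message. A
rough check, using node-0-4.local purely as an example target:

    # open a burst of simultaneous ssh connections to one node; if some of
    # them fail with ssh_exchange_identification errors, sshd is refusing
    # connections under load (see MaxStartups in sshd_config)
    for i in $(seq 1 20); do
        ssh node-0-4.local true &
    done
    wait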

From the command line, the Open MPI jobs run with no errors, even when the
master node is not included in the hostfile.
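
For reference, the remedy Jeremy describes above usually amounts to raising
the limit in whatever script starts sge_execd on the compute nodes and then
restarting the daemon, so that jobs inherit the higher limit. A rough sketch,
assuming the init script is /etc/init.d/sgeexecd (the exact path and name can
differ between SGE versions and Rocks rolls):

    # on each compute node, near the top of /etc/init.d/sgeexecd
    # (path is an assumption; adjust for your installation):
    ulimit -l unlimited

    # then restart the execution daemon so newly started jobs inherit it:
    /etc/init.d/sgeexecd stop
    /etc/init.d/sgeexecd start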

Thanks,
Sangamesh

> -- Reuti
>
>
>> Jeremy Stout
>>
>> On Sat, Jan 24, 2009 at 6:06 AM, Sangamesh B <forum.san_at_[hidden]> wrote:
>>>
>>> Hello all,
>>>
>>> Open MPI 1.3 is installed on a Rocks 4.3 Linux cluster with SGE
>>> support, i.e. configured using --with-sge.
>>> But the ompi_info shows only one component:
>>> # /opt/mpi/openmpi/1.3/intel/bin/ompi_info | grep gridengine
>>> MCA ras: gridengine (MCA v2.0, API v2.0, Component v1.3)
>>>
>>> Is this right? During the Open MPI installation the SGE qmaster daemon
>>> was not running.
>>>
>>> Now the problem is that the Open MPI parallel jobs submitted through
>>> gridengine fail (when run across multiple nodes) with the error:
>>>
>>> $ cat err.26.Helloworld-PRL
>>> ssh_exchange_identification: Connection closed by remote host
>>>
>>> --------------------------------------------------------------------------
>>> A daemon (pid 8462) died unexpectedly with status 129 while attempting
>>> to launch so we are aborting.
>>>
>>> There may be more information reported by the environment (see above).
>>>
>>> This may be because the daemon was unable to find all the needed shared
>>> libraries on the remote node. You may set your LD_LIBRARY_PATH to have
>>> the
>>> location of the shared libraries on the remote nodes and this will
>>> automatically be forwarded to the remote nodes.
>>>
>>> --------------------------------------------------------------------------
>>>
>>> --------------------------------------------------------------------------
>>> mpirun noticed that the job aborted, but has no info as to the process
>>> that caused that situation.
>>>
>>> --------------------------------------------------------------------------
>>> mpirun: clean termination accomplished
>>>
>>> When the job runs on a single node, it completes and produces the
>>> output, but with an error:
>>> $ cat err.23.Helloworld-PRL
>>> libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
>>> This will severely limit memory registrations.
>>> libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
>>> This will severely limit memory registrations.
>>> libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
>>> This will severely limit memory registrations.
>>> libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
>>> This will severely limit memory registrations.
>>> libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
>>> This will severely limit memory registrations.
>>>
>>> --------------------------------------------------------------------------
>>> WARNING: There was an error initializing an OpenFabrics device.
>>>
>>> Local host: node-0-4.local
>>> Local device: mthca0
>>>
>>> --------------------------------------------------------------------------
>>> libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
>>> This will severely limit memory registrations.
>>> libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
>>> This will severely limit memory registrations.
>>> libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
>>> This will severely limit memory registrations.
>>> [node-0-4.local:07869] 7 more processes have sent help message
>>> help-mpi-btl-openib.txt / error in device init
>>> [node-0-4.local:07869] Set MCA parameter "orte_base_help_aggregate" to
>>> 0 to see all help / error messages
>>>
>>> What could be causing this behavior?
>>>
>>> Thanks,
>>> Sangamesh
>>> _______________________________________________
>>> users mailing list
>>> users_at_[hidden]
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>
>> _______________________________________________
>> users mailing list
>> users_at_[hidden]
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>