
Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] Ompi runs thru cmd line but fails when run thru SGE
From: Reuti (reuti_at_[hidden])
Date: 2009-01-26 08:41:37


On 25.01.2009, at 06:16, Sangamesh B wrote:

> Thanks Reuti for the reply.
>
> On Sun, Jan 25, 2009 at 2:22 AM, Reuti <reuti_at_[hidden]>
> wrote:
>> On 24.01.2009, at 17:12, Jeremy Stout wrote:
>>
>>> The RLIMIT error is very common when using OpenMPI + OFED + Sun Grid
>>> Engine. You can find more information and several remedies here:
>>> http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages
>>>
>>> I usually resolve this problem by adding "ulimit -l unlimited" near
>>> the top of the SGE startup script on the computation nodes and
>>> restarting SGE on every node (see the sketch after this quoted exchange).
>>
>> Did you request/set any limits with SGE's h_vmem/h_stack resource
>> request?
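
To make Jeremy's remedy concrete, here is a minimal sketch; the script
locations are assumptions and vary by SGE installation and version:

    # On every compute node, edit the execd startup script, commonly
    #   /etc/init.d/sgeexecd   or   $SGE_ROOT/default/common/sgeexecd
    # and add this near the top, before sge_execd is started:
    ulimit -l unlimited    # lift RLIMIT_MEMLOCK for execd and its child jobs

    # Then restart the execution daemon on each node, e.g.:
    /etc/init.d/sgeexecd stop
    /etc/init.d/sgeexecd start

Jobs inherit their limits from sge_execd, so the daemon must be restarted
before the new limit takes effect.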

Was this also your problem:

http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=99442

-- Reuti

>>
> No.
>
> The queue in use is configured as follows:
> qconf -sq ib.q
> qname ib.q
> hostlist @ibhosts
> seq_no 0
> load_thresholds np_load_avg=1.75
> suspend_thresholds NONE
> nsuspend 1
> suspend_interval 00:05:00
> priority 0
> min_cpu_interval 00:05:00
> processors UNDEFINED
> qtype BATCH INTERACTIVE
> ckpt_list NONE
> pe_list orte
> rerun FALSE
> slots 8
> tmpdir /tmp
> shell /bin/bash
> prolog NONE
> epilog NONE
> shell_start_mode unix_behavior
> starter_method NONE
> suspend_method NONE
> resume_method NONE
> terminate_method NONE
> notify 00:00:60
> owner_list NONE
> user_lists NONE
> xuser_lists NONE
> subordinate_list NONE
> complex_values NONE
> projects NONE
> xprojects NONE
> calendar NONE
> initial_state default
> s_rt INFINITY
> h_rt INFINITY
> s_cpu INFINITY
> h_cpu INFINITY
> s_fsize INFINITY
> h_fsize INFINITY
> s_data INFINITY
> h_data INFINITY
> s_stack INFINITY
> h_stack INFINITY
> s_core INFINITY
> h_core INFINITY
> s_rss INFINITY
> h_rss INFINITY
> s_vmem INFINITY
> h_vmem INFINITY
>
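Note that all the queue limits above are already INFINITY, so the queue
configuration itself should not be capping RLIMIT_MEMLOCK. A quick way to
check what a job actually sees is a minimal diagnostic job; the script
name and contents below are only an illustration:

    #!/bin/bash
    # limits-test.sh -- hypothetical diagnostic job script
    #$ -q ib.q
    #$ -cwd
    ulimit -l    # locked-memory limit as seen inside the job
    ulimit -s    # stack limit

    # submit it and inspect the job's output file:
    #   qsub limits-test.sh
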
> # qconf -sp orte
> pe_name orte
> slots 999
> user_lists NONE
> xuser_lists NONE
> start_proc_args /bin/true
> stop_proc_args /bin/true
> allocation_rule $fill_up
> control_slaves TRUE
> job_is_first_task FALSE
> urgency_slots min
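
With control_slaves TRUE this PE gives a tight integration: Open MPI 1.3
should start its remote daemons through SGE's qrsh -inherit rather than
ssh, so an ssh_exchange_identification error during startup hints that the
launch is falling back to ssh. One way to check is to raise the launcher's
verbosity in the job script (the verbosity level is only a sketch, and
./hello_world stands in for the actual application):

    # show how the remote orted daemons are started (qrsh vs. ssh)
    mpirun --mca plm_base_verbose 10 ./hello_world
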
> # qconf -shgrp @ibhosts
> group_name @ibhosts
> hostlist node-0-0.local node-0-1.local node-0-2.local node-0-3.local \
> node-0-4.local node-0-5.local node-0-6.local node-0-7.local \
> node-0-8.local node-0-9.local node-0-10.local node-0-11.local \
> node-0-12.local node-0-13.local node-0-14.local node-0-16.local \
> node-0-17.local node-0-18.local node-0-19.local node-0-20.local \
> node-0-21.local node-0-22.local
>
> The hostnames for the IB interfaces are ibc0, ibc1, ..., ibc22.
>
> Is this difference causing the problem?
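
In general this difference should be harmless for the launch itself: SGE
hands Open MPI the hostnames from the hostgroup, and those are only used
to start the daemons, while the network used for MPI traffic is chosen by
the BTL components. To make sure the messages really go over InfiniBand,
the transports can be pinned explicitly (the binary name is a placeholder):

    # use only the InfiniBand, shared-memory and self transports
    mpirun --mca btl openib,sm,self ./hello_world
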
>
> ssh issues:
> between master & node: works fine, but with some delay.
>
> between nodes: works fine, no delay.
>
> From the command line, the Open MPI jobs ran with no errors, even
> when the master node is not in the hostfile.
>
> Thanks,
> Sangamesh
>
>> -- Reuti
>>
>>
>>> Jeremy Stout
>>>
>>> On Sat, Jan 24, 2009 at 6:06 AM, Sangamesh B
>>> <forum.san_at_[hidden]> wrote:
>>>>
>>>> Hello all,
>>>>
>>>> Open MPI 1.3 is installed on a Rocks 4.3 Linux cluster with SGE
>>>> support, i.e. built using --with-sge.
>>>> But ompi_info shows only one gridengine component:
>>>> # /opt/mpi/openmpi/1.3/intel/bin/ompi_info | grep gridengine
>>>> MCA ras: gridengine (MCA v2.0, API v2.0,
>>>> Component v1.3)
>>>>
>>>> Is this right? I ask because the SGE qmaster daemon was not running
>>>> during the Open MPI installation.
>>>>
>>>> Now the problem is that the Open MPI parallel jobs submitted through
>>>> gridengine fail (when run on multiple nodes) with this error:
>>>>
>>>> $ cat err.26.Helloworld-PRL
>>>> ssh_exchange_identification: Connection closed by remote host
>>>>
>>>> --------------------------------------------------------------------------
>>>> A daemon (pid 8462) died unexpectedly with status 129 while
>>>> attempting
>>>> to launch so we are aborting.
>>>>
>>>> There may be more information reported by the environment (see
>>>> above).
>>>>
>>>> This may be because the daemon was unable to find all the needed
>>>> shared
>>>> libraries on the remote node. You may set your LD_LIBRARY_PATH
>>>> to have
>>>> the
>>>> location of the shared libraries on the remote nodes and this will
>>>> automatically be forwarded to the remote nodes.
>>>>
>>>> --------------------------------------------------------------------------
>>>>
>>>> --------------------------------------------------------------------------
>>>> mpirun noticed that the job aborted, but has no info as to the
>>>> process
>>>> that caused that situation.
>>>>
>>>> --------------------------------------------------------------------------
>>>> mpirun: clean termination accomplished
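
As a side note, "ssh_exchange_identification: Connection closed by remote
host" is often sshd refusing a burst of parallel connections (its
MaxStartups limit, 10 by default) or a TCP-wrappers rule. With a working
tight integration ssh should not be used at all, but if ssh launches are
actually intended, a sketch of that workaround is:

    # /etc/ssh/sshd_config on each node (assumes ssh really is the launcher):
    MaxStartups 100    # allow more concurrent unauthenticated connections

    # then restart sshd, e.g.:
    /etc/init.d/sshd restart
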
>>>>
>>>> When the job runs on a single node, it runs well and produces its
>>>> output, but with these warnings:
>>>> $ cat err.23.Helloworld-PRL
>>>> libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
>>>> This will severely limit memory registrations.
>>>> libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
>>>> This will severely limit memory registrations.
>>>> libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
>>>> This will severely limit memory registrations.
>>>> libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
>>>> This will severely limit memory registrations.
>>>> libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
>>>> This will severely limit memory registrations.
>>>>
>>>> --------------------------------------------------------------------------
>>>> WARNING: There was an error initializing an OpenFabrics device.
>>>>
>>>> Local host: node-0-4.local
>>>> Local device: mthca0
>>>>
>>>> --------------------------------------------------------------------------
>>>> libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
>>>> This will severely limit memory registrations.
>>>> libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
>>>> This will severely limit memory registrations.
>>>> libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
>>>> This will severely limit memory registrations.
>>>> [node-0-4.local:07869] 7 more processes have sent help message
>>>> help-mpi-btl-openib.txt / error in device init
>>>> [node-0-4.local:07869] Set MCA parameter
>>>> "orte_base_help_aggregate" to
>>>> 0 to see all help / error messages
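
The last message above already names the relevant knob: to see every
individual error instead of the aggregated summary, re-run with help
aggregation disabled (the binary name is a placeholder):

    mpirun --mca orte_base_help_aggregate 0 ./hello_world
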
>>>>
>>>> What might be causing this behavior?
>>>>
>>>> Thanks,
>>>> Sangamesh
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users