Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Ompi runs thru cmd line but fails when run thru SGE
From: Reuti (reuti_at_[hidden])
Date: 2009-02-26 05:02:02


Hi,

the daemons will fork into daemon land - no accounting, no control by
SGE via qdel (nevertheless it runs, just not tightly integrated):

https://svn.open-mpi.org/trac/ompi/ticket/1783
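
A rough way to see this on a compute node while a job is running (the exact
process tree depends on the installation, so take this only as a sketch):

   $ ps -e f | egrep "sge_execd|sge_shepherd|orted|mpirun"
   # tightly integrated: orted appears below sge_shepherd
   # daemonized/loose:   orted hangs off init, so qdel and SGE
   #                     accounting no longer see it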

-- Reuti

Am 26.02.2009 um 06:13 schrieb Sangamesh B:

> Hello Reuti,
>
> I'm sorry for the late response.
>
> On Mon, Jan 26, 2009 at 7:11 PM, Reuti <reuti_at_[hidden]>
> wrote:
>> Am 25.01.2009 um 06:16 schrieb Sangamesh B:
>>
>>> Thanks Reuti for the reply.
>>>
>>> On Sun, Jan 25, 2009 at 2:22 AM, Reuti <reuti_at_staff.uni-
>>> marburg.de> wrote:
>>>>
>>>> Am 24.01.2009 um 17:12 schrieb Jeremy Stout:
>>>>
>>>>> The RLIMIT error is very common when using OpenMPI + OFED + Sun
>>>>> Grid
>>>>> Engine. You can find more information and several remedies here:
>>>>> http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages
>>>>>
>>>>> I usually resolve this problem by adding "ulimit -l unlimited"
>>>>> near
>>>>> the top of the SGE startup script on the computation nodes and
>>>>> restarting SGE on every node.
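
A hedged sketch of that change (the exact location of the execd startup
script varies by installation; /etc/init.d/sgeexecd is only an example):

   # near the top of the sge_execd startup script on each compute node,
   # e.g. /etc/init.d/sgeexecd (path is installation-specific)
   ulimit -l unlimited
   # then restart sge_execd on every node so newly started jobs inherit
   # the raised locked-memory limit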
>>>>
>>>> Did you request/set any limits with SGE's h_vmem/h_stack resource
>>>> request?
>>
>> Was this also your problem:
>>
>> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=99442
>>
> I did not post that mail, but the same setting is not working for me:
>
> $ qconf -sconf
> global:
> ..
> execd_params H_MEMORYLOCKED=infinity
> ..
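
For reference, a hedged sketch of how such a setting is usually put in place
(the editor step and the execd restart are assumptions about a standard SGE
setup, not something stated above):

   $ qconf -mconf global        # opens the global configuration in $EDITOR
   # add or extend the line:
   #   execd_params   H_MEMORYLOCKED=infinity
   # then restart sge_execd on the compute nodes so the limit is applied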
>
> But I'm using "unset SGE_ROOT" (suggested by you) inside the SGE job
> submission script, with a loose integration of Open MPI with SGE. It's
> working fine.
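
A minimal sketch of such a loosely integrated job script; the hostfile and
binary names are placeholders, and the mpirun path is taken from the
ompi_info output quoted later in this thread:

   #!/bin/bash
   #$ -N Helloworld-PRL
   #$ -cwd
   #$ -pe orte 8

   # loose integration: hide SGE from Open MPI so it launches over plain
   # ssh using an explicit hostfile instead of the SGE allocation
   unset SGE_ROOT
   /opt/mpi/openmpi/1.3/intel/bin/mpirun -np $NSLOTS \
       -hostfile ./hostfile ./helloworld

In practice the hostfile would be generated from $PE_HOSTFILE so it matches
the nodes SGE actually granted.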
>
> I'm curious to know why Open MPI 1.3 is not working with tight
> integration to SGE 6.0U8 in a Rocks 4.3 cluster.
>
> On another cluster, Open MPI 1.3 works well with tight integration to
> SGE.
>
> Thanks a lot,
>
> Sangamesh
>
>> -- Reuti
>>
>>
>>>>
>>> No.
>>>
>>> The used queue is as follows:
>>> qconf -sq ib.q
>>> qname ib.q
>>> hostlist @ibhosts
>>> seq_no 0
>>> load_thresholds np_load_avg=1.75
>>> suspend_thresholds NONE
>>> nsuspend 1
>>> suspend_interval 00:05:00
>>> priority 0
>>> min_cpu_interval 00:05:00
>>> processors UNDEFINED
>>> qtype BATCH INTERACTIVE
>>> ckpt_list NONE
>>> pe_list orte
>>> rerun FALSE
>>> slots 8
>>> tmpdir /tmp
>>> shell /bin/bash
>>> prolog NONE
>>> epilog NONE
>>> shell_start_mode unix_behavior
>>> starter_method NONE
>>> suspend_method NONE
>>> resume_method NONE
>>> terminate_method NONE
>>> notify 00:00:60
>>> owner_list NONE
>>> user_lists NONE
>>> xuser_lists NONE
>>> subordinate_list NONE
>>> complex_values NONE
>>> projects NONE
>>> xprojects NONE
>>> calendar NONE
>>> initial_state default
>>> s_rt INFINITY
>>> h_rt INFINITY
>>> s_cpu INFINITY
>>> h_cpu INFINITY
>>> s_fsize INFINITY
>>> h_fsize INFINITY
>>> s_data INFINITY
>>> h_data INFINITY
>>> s_stack INFINITY
>>> h_stack INFINITY
>>> s_core INFINITY
>>> h_core INFINITY
>>> s_rss INFINITY
>>> h_rss INFINITY
>>> s_vmem INFINITY
>>> h_vmem INFINITY
>>>
>>> # qconf -sp orte
>>> pe_name orte
>>> slots 999
>>> user_lists NONE
>>> xuser_lists NONE
>>> start_proc_args /bin/true
>>> stop_proc_args /bin/true
>>> allocation_rule $fill_up
>>> control_slaves TRUE
>>> job_is_first_task FALSE
>>> urgency_slots min
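
With this PE (control_slaves TRUE, start/stop_proc_args /bin/true), a tightly
integrated submission normally needs no hostfile at all - a hedged sketch,
with the script and binary names as placeholders:

   #!/bin/bash
   #$ -cwd
   #$ -pe orte 8

   # tight integration: Open MPI 1.3 reads the SGE allocation itself and
   # starts its daemons via qrsh -inherit, so qdel and accounting work
   /opt/mpi/openmpi/1.3/intel/bin/mpirun -np $NSLOTS ./helloworld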
>>> # qconf -shgrp @ibhosts
>>> group_name @ibhosts
>>> hostlist node-0-0.local node-0-1.local node-0-2.local node-0-3.local \
>>>          node-0-4.local node-0-5.local node-0-6.local node-0-7.local \
>>>          node-0-8.local node-0-9.local node-0-10.local node-0-11.local \
>>>          node-0-12.local node-0-13.local node-0-14.local node-0-16.local \
>>>          node-0-17.local node-0-18.local node-0-19.local node-0-20.local \
>>>          node-0-21.local node-0-22.local
>>>
>>> The hostnames for the IB interface are like ibc0, ibc1, ..., ibc22.
>>>
>>> Is this difference causing the problem?
>>>
>>> SSH issues:
>>> between master & node: works fine, but with some delay.
>>>
>>> between nodes: works fine, no delay.
>>>
>>> From the command line the Open MPI jobs ran with no error, even when
>>> the master node is not used in the hostfile.
>>>
>>> Thanks,
>>> Sangamesh
>>>
>>>> -- Reuti
>>>>
>>>>
>>>>> Jeremy Stout
>>>>>
>>>>> On Sat, Jan 24, 2009 at 6:06 AM, Sangamesh B <forum.san_at_[hidden]>
>>>>> wrote:
>>>>>>
>>>>>> Hello all,
>>>>>>
>>>>>> Open MPI 1.3 is installed on a Rocks 4.3 Linux cluster with support
>>>>>> for SGE, i.e. built using --with-sge.
>>>>>> But ompi_info shows only one gridengine component:
>>>>>> # /opt/mpi/openmpi/1.3/intel/bin/ompi_info | grep gridengine
>>>>>> MCA ras: gridengine (MCA v2.0, API v2.0, Component v1.3)
>>>>>>
>>>>>> Is this right? During the Open MPI installation, the SGE qmaster
>>>>>> daemon was not working.
>>>>>>
>>>>>> Now the problem is that the Open MPI parallel jobs submitted through
>>>>>> Grid Engine fail (when run on multiple nodes) with the error:
>>>>>>
>>>>>> $ cat err.26.Helloworld-PRL
>>>>>> ssh_exchange_identification: Connection closed by remote host
>>>>>>
>>>>>>
>>>>>> --------------------------------------------------------------------------
>>>>>> A daemon (pid 8462) died unexpectedly with status 129 while
>>>>>> attempting
>>>>>> to launch so we are aborting.
>>>>>>
>>>>>> There may be more information reported by the environment (see
>>>>>> above).
>>>>>>
>>>>>> This may be because the daemon was unable to find all the
>>>>>> needed shared
>>>>>> libraries on the remote node. You may set your LD_LIBRARY_PATH
>>>>>> to have
>>>>>> the
>>>>>> location of the shared libraries on the remote nodes and this
>>>>>> will
>>>>>> automatically be forwarded to the remote nodes.
>>>>>>
>>>>>>
>>>>>> --------------------------------------------------------------------------
>>>>>>
>>>>>>
>>>>>> --------------------------------------------------------------------------
>>>>>> mpirun noticed that the job aborted, but has no info as to the
>>>>>> process
>>>>>> that caused that situation.
>>>>>>
>>>>>>
>>>>>> --------------------------------------------------------------------------
>>>>>> mpirun: clean termination accomplished
>>>>>>
>>>>>> When the job runs on a single node, it runs well and produces the
>>>>>> output, but with an error:
>>>>>> $ cat err.23.Helloworld-PRL
>>>>>> libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
>>>>>> This will severely limit memory registrations.
>>>>>> libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
>>>>>> This will severely limit memory registrations.
>>>>>> libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
>>>>>> This will severely limit memory registrations.
>>>>>> libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
>>>>>> This will severely limit memory registrations.
>>>>>> libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
>>>>>> This will severely limit memory registrations.
>>>>>>
>>>>>>
>>>>>> --------------------------------------------------------------------------
>>>>>> WARNING: There was an error initializing an OpenFabrics device.
>>>>>>
>>>>>> Local host: node-0-4.local
>>>>>> Local device: mthca0
>>>>>>
>>>>>>
>>>>>> --------------------------------------------------------------------------
>>>>>> libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
>>>>>> This will severely limit memory registrations.
>>>>>> libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
>>>>>> This will severely limit memory registrations.
>>>>>> libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
>>>>>> This will severely limit memory registrations.
>>>>>> [node-0-4.local:07869] 7 more processes have sent help message
>>>>>> help-mpi-btl-openib.txt / error in device init
>>>>>> [node-0-4.local:07869] Set MCA parameter
>>>>>> "orte_base_help_aggregate" to
>>>>>> 0 to see all help / error messages
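
These warnings mean the locked-memory limit seen inside the job is only
32 KB. A hedged way to check what limit SGE-started processes actually get
(the script name is made up for illustration):

   $ cat > memlock_test.sh <<'EOF'
   #!/bin/bash
   ulimit -l
   EOF
   $ qsub -cwd -j y -o memlock.out memlock_test.sh
   # "32" in memlock.out matches the libibverbs warning above;
   # "unlimited" means the raised limit has reached SGE-started jobs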
>>>>>>
>>>>>> What may be the cause of this behavior?
>>>>>>
>>>>>> Thanks,
>>>>>> Sangamesh
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>