Subject: Re: [OMPI users] Ompi runs thru cmd line but fails when run thru SGE
From: Reuti (reuti_at_[hidden])
Date: 2009-01-26 08:41:37


On 25.01.2009, at 06:16, Sangamesh B wrote:

> Thanks Reuti for the reply.
>
> On Sun, Jan 25, 2009 at 2:22 AM, Reuti <reuti_at_[hidden]>
> wrote:
>> On 24.01.2009, at 17:12, Jeremy Stout wrote:
>>
>>> The RLIMIT error is very common when using OpenMPI + OFED + Sun Grid
>>> Engine. You can find more information and several remedies here:
>>> http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages
>>>
>>> I usually resolve this problem by adding "ulimit -l unlimited" near
>>> the top of the SGE startup script on the computation nodes and
>>> restarting SGE on every node.
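A minimal sketch of that change, assuming the execd startup script lives at /etc/init.d/sgeexecd (the exact path and name vary per installation, e.g. $SGE_ROOT/$SGE_CELL/common/sgeexecd):

    # near the top of the execd startup script, before sge_execd starts,
    # so every job spawned by the execd inherits the raised limit
    ulimit -l unlimited

    # then restart the execd on every compute node, e.g.
    /etc/init.d/sgeexecd stop
    /etc/init.d/sgeexecd start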
>>
>> Did you request/set any limits with SGE's h_vmem/h_stack resource
>> request?

Was this also your problem:

http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=99442

-- Reuti

>>
> No.
>
> The queue in use is as follows:
> qconf -sq ib.q
> qname ib.q
> hostlist @ibhosts
> seq_no 0
> load_thresholds np_load_avg=1.75
> suspend_thresholds NONE
> nsuspend 1
> suspend_interval 00:05:00
> priority 0
> min_cpu_interval 00:05:00
> processors UNDEFINED
> qtype BATCH INTERACTIVE
> ckpt_list NONE
> pe_list orte
> rerun FALSE
> slots 8
> tmpdir /tmp
> shell /bin/bash
> prolog NONE
> epilog NONE
> shell_start_mode unix_behavior
> starter_method NONE
> suspend_method NONE
> resume_method NONE
> terminate_method NONE
> notify 00:00:60
> owner_list NONE
> user_lists NONE
> xuser_lists NONE
> subordinate_list NONE
> complex_values NONE
> projects NONE
> xprojects NONE
> calendar NONE
> initial_state default
> s_rt INFINITY
> h_rt INFINITY
> s_cpu INFINITY
> h_cpu INFINITY
> s_fsize INFINITY
> h_fsize INFINITY
> s_data INFINITY
> h_data INFINITY
> s_stack INFINITY
> h_stack INFINITY
> s_core INFINITY
> h_core INFINITY
> s_rss INFINITY
> h_rss INFINITY
> s_vmem INFINITY
> h_vmem INFINITY
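A minimal way to check which locked-memory limit a job actually inherits from this queue (a sketch, assuming interactive qrsh access to ib.q):

    # run ulimit inside an SGE job on the ib.q queue
    qrsh -q ib.q bash -c 'ulimit -l'

    # the same check inside a batch job would be a plain
    #   ulimit -l
    # line near the top of the job script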
>
> # qconf -sp orte
> pe_name orte
> slots 999
> user_lists NONE
> xuser_lists NONE
> start_proc_args /bin/true
> stop_proc_args /bin/true
> allocation_rule $fill_up
> control_slaves TRUE
> job_is_first_task FALSE
> urgency_slots min
> # qconf -shgrp @ibhosts
> group_name @ibhosts
> hostlist node-0-0.local node-0-1.local node-0-2.local node-0-3.local \
>          node-0-4.local node-0-5.local node-0-6.local node-0-7.local \
>          node-0-8.local node-0-9.local node-0-10.local node-0-11.local \
>          node-0-12.local node-0-13.local node-0-14.local node-0-16.local \
>          node-0-17.local node-0-18.local node-0-19.local node-0-20.local \
>          node-0-21.local node-0-22.local
>
> The hostnames for the IB interfaces are like ibc0, ibc1, .. ibc22.
>
> Is this difference causing the problem?
>
> ssh issues:
> between master & node: works fine but with some delay.
>
> between nodes: works fine, no delay
>
> From the command line the Open MPI jobs run with no error, even
> though the master node is not used in the hostfile.
>
> Thanks,
> Sangamesh
>
>> -- Reuti
>>
>>
>>> Jeremy Stout
>>>
>>> On Sat, Jan 24, 2009 at 6:06 AM, Sangamesh B
>>> <forum.san_at_[hidden]> wrote:
>>>>
>>>> Hello all,
>>>>
>>>> Open MPI 1.3 is installed on a Rocks 4.3 Linux cluster with SGE
>>>> support, i.e. using --with-sge.
>>>> But ompi_info shows only one gridengine component:
>>>> # /opt/mpi/openmpi/1.3/intel/bin/ompi_info | grep gridengine
>>>> MCA ras: gridengine (MCA v2.0, API v2.0,
>>>> Component v1.3)
>>>>
>>>> Is this right? I ask because during the Open MPI installation the
>>>> SGE qmaster daemon was not running.
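A sketch of how the SGE support of this build could be inspected further, using only the installation path quoted above (the set of gridengine components differs between Open MPI versions, so a single ras component is not necessarily wrong):

    # list everything gridengine-related that was compiled in
    /opt/mpi/openmpi/1.3/intel/bin/ompi_info | grep -i gridengine

    # show the parameters of the gridengine RAS component
    /opt/mpi/openmpi/1.3/intel/bin/ompi_info --param ras gridengine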
>>>>
>>>> Now the problem is that the Open MPI parallel jobs submitted through
>>>> gridengine fail (when run on multiple nodes) with the error:
>>>>
>>>> $ cat err.26.Helloworld-PRL
>>>> ssh_exchange_identification: Connection closed by remote host
>>>>
>>>> --------------------------------------------------------------------------
>>>> A daemon (pid 8462) died unexpectedly with status 129 while
>>>> attempting
>>>> to launch so we are aborting.
>>>>
>>>> There may be more information reported by the environment (see
>>>> above).
>>>>
>>>> This may be because the daemon was unable to find all the needed
>>>> shared
>>>> libraries on the remote node. You may set your LD_LIBRARY_PATH
>>>> to have
>>>> the
>>>> location of the shared libraries on the remote nodes and this will
>>>> automatically be forwarded to the remote nodes.
>>>>
>>>> --------------------------------------------------------------------------
>>>>
>>>> --------------------------------------------------------------------------
>>>> mpirun noticed that the job aborted, but has no info as to the
>>>> process
>>>> that caused that situation.
>>>>
>>>> --------------------------------------------------------------------------
>>>> mpirun: clean termination accomplished
>>>>
>>>> When the job runs on a single node, it runs well and produces the
>>>> output, but with an error:
>>>> $ cat err.23.Helloworld-PRL
>>>> libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
>>>> This will severely limit memory registrations.
>>>> libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
>>>> This will severely limit memory registrations.
>>>> libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
>>>> This will severely limit memory registrations.
>>>> libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
>>>> This will severely limit memory registrations.
>>>> libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
>>>> This will severely limit memory registrations.
>>>>
>>>> --------------------------------------------------------------------------
>>>> WARNING: There was an error initializing an OpenFabrics device.
>>>>
>>>> Local host: node-0-4.local
>>>> Local device: mthca0
>>>>
>>>> --------------------------------------------------------------------------
>>>> libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
>>>> This will severely limit memory registrations.
>>>> libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
>>>> This will severely limit memory registrations.
>>>> libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
>>>> This will severely limit memory registrations.
>>>> [node-0-4.local:07869] 7 more processes have sent help message
>>>> help-mpi-btl-openib.txt / error in device init
>>>> [node-0-4.local:07869] Set MCA parameter
>>>> "orte_base_help_aggregate" to
>>>> 0 to see all help / error messages
>>>>
>>>> What could be the cause of this behavior?
>>>>
>>>> Thanks,
>>>> Sangamesh