Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Ompi runs thru cmd line but fails when run thru SGE
From: Sangamesh B (forum.san_at_[hidden])
Date: 2009-02-26 00:13:11


Hello Reuti,

   I'm sorry for the late response.

On Mon, Jan 26, 2009 at 7:11 PM, Reuti <reuti_at_[hidden]> wrote:
> On 25.01.2009 at 06:16, Sangamesh B wrote:
>
>> Thanks Reuti for the reply.
>>
>> On Sun, Jan 25, 2009 at 2:22 AM, Reuti <reuti_at_[hidden]> wrote:
>>>
>>> On 24.01.2009 at 17:12, Jeremy Stout wrote:
>>>
>>>> The RLIMIT error is very common when using OpenMPI + OFED + Sun Grid
>>>> Engine. You can find more information and several remedies here:
>>>> http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages
>>>>
>>>> I usually resolve this problem by adding "ulimit -l unlimited" near
>>>> the top of the SGE startup script on the computation nodes and
>>>> restarting SGE on every node.
>>>
>>> Did you request/set any limits with SGE's h_vmem/h_stack resource
>>> request?
>
> Was this also your problem:
>
> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=99442
>
I did not post that message, but the same setting is not working for me:

$ qconf -sconf
global:
..
execd_params H_MEMORYLOCKED=infinity
..
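
One way to check whether this setting actually reaches a job is to print the locked-memory limit from inside a job script. The following is a minimal sketch and not something posted in this thread; the function name `report_memlock` is made up for illustration. Note that `ulimit -l` reports kilobytes, so the 32768-byte RLIMIT_MEMLOCK from the libibverbs warning would appear as 32.

```shell
#!/bin/bash
# Print the soft and hard locked-memory limits seen by this shell.
# "report_memlock" is a hypothetical helper name, not an SGE or Open MPI tool.
report_memlock() {
    soft=$(ulimit -S -l)   # soft limit, in kilobytes or "unlimited"
    hard=$(ulimit -H -l)   # hard limit, in kilobytes or "unlimited"
    echo "host=$(uname -n) memlock_soft=${soft} memlock_hard=${hard}"
}

report_memlock
```

Running this as a job in the orte parallel environment and comparing the result with the same function run from an interactive shell on the node should show whether the limit is being passed on to jobs.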

However, I'm using "unset SGE_ROOT" (as you suggested) inside the SGE job
submission script, with a loose integration of Open MPI with SGE. It's
working fine.
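
For readers who want to try the same workaround, a loosely integrated job script might look roughly like the sketch below. It is an assumption-laden illustration, not the script used here: the slot count, the `./helloworld` binary name, and the helper `pe_to_ompi_hostfile` are hypothetical. It assumes the usual `$PE_HOSTFILE` layout (host, slot count, queue, processor range) and builds an Open MPI hostfile from it, since mpirun no longer reads the SGE allocation once SGE_ROOT is unset.

```shell
#!/bin/bash
#$ -N Helloworld-PRL
#$ -pe orte 8
#$ -cwd

# Convert an SGE $PE_HOSTFILE (host, slots, queue, processor range)
# into Open MPI hostfile lines of the form "host slots=N".
# Hypothetical helper, named here only for illustration.
pe_to_ompi_hostfile() {
    awk '{ print $1 " slots=" $2 }' "$1"
}

# Only attempt a launch when running inside an SGE parallel job.
if [ -n "${PE_HOSTFILE:-}" ] && [ -r "$PE_HOSTFILE" ]; then
    hostfile=$(mktemp)
    pe_to_ompi_hostfile "$PE_HOSTFILE" > "$hostfile"

    # Hide SGE from mpirun so that it starts remote daemons itself
    # (loose integration) instead of going through qrsh.
    unset SGE_ROOT

    # ./helloworld is a placeholder for the MPI program.
    mpirun -np "$NSLOTS" --hostfile "$hostfile" ./helloworld
    rm -f "$hostfile"
fi
```

With this approach SGE only allocates the slots; the remote processes are not under its control, which is the main trade-off compared with tight integration.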

I'm curious to know why Open MPI 1.3 does not work with tight integration
with SGE 6.0U8 on this Rocks 4.3 cluster.

On another cluster, Open MPI 1.3 works well with tight integration with SGE.
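
When comparing the two clusters, it may help to look at the global SGE settings that affect how qrsh starts remote tasks, alongside execd_params. The sketch below only filters `qconf -sconf` output; it assumes the standard sge_conf parameter names rsh_command and rsh_daemon, and the function name `sge_launch_settings` is made up for illustration. It does not by itself explain the failure.

```shell
#!/bin/bash
# Read `qconf -sconf` output on stdin and keep only the lines that control
# remote task startup (rsh_command, rsh_daemon) and execd parameters.
# "sge_launch_settings" is a hypothetical helper name.
sge_launch_settings() {
    grep -E '^(rsh_command|rsh_daemon|execd_params)[[:space:]]'
}

# Only query the live configuration where the SGE client tools exist.
if command -v qconf >/dev/null 2>&1; then
    qconf -sconf | sge_launch_settings || echo "no matching settings found"
fi
```

Running this on both clusters and comparing the output line by line would show whether they differ in these settings.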

Thanks a lot,

Sangamesh

> -- Reuti
>
>
>>>
>> No.
>>
>> The used queue is as follows:
>> qconf -sq ib.q
>> qname                 ib.q
>> hostlist              @ibhosts
>> seq_no                0
>> load_thresholds       np_load_avg=1.75
>> suspend_thresholds    NONE
>> nsuspend              1
>> suspend_interval      00:05:00
>> priority              0
>> min_cpu_interval      00:05:00
>> processors            UNDEFINED
>> qtype                 BATCH INTERACTIVE
>> ckpt_list             NONE
>> pe_list               orte
>> rerun                 FALSE
>> slots                 8
>> tmpdir                /tmp
>> shell                 /bin/bash
>> prolog                NONE
>> epilog                NONE
>> shell_start_mode      unix_behavior
>> starter_method        NONE
>> suspend_method        NONE
>> resume_method         NONE
>> terminate_method      NONE
>> notify                00:00:60
>> owner_list            NONE
>> user_lists            NONE
>> xuser_lists           NONE
>> subordinate_list      NONE
>> complex_values        NONE
>> projects              NONE
>> xprojects             NONE
>> calendar              NONE
>> initial_state         default
>> s_rt                  INFINITY
>> h_rt                  INFINITY
>> s_cpu                 INFINITY
>> h_cpu                 INFINITY
>> s_fsize               INFINITY
>> h_fsize               INFINITY
>> s_data                INFINITY
>> h_data                INFINITY
>> s_stack               INFINITY
>> h_stack               INFINITY
>> s_core                INFINITY
>> h_core                INFINITY
>> s_rss                 INFINITY
>> h_rss                 INFINITY
>> s_vmem                INFINITY
>> h_vmem                INFINITY
>>
>> # qconf -sp orte
>> pe_name           orte
>> slots             999
>> user_lists        NONE
>> xuser_lists       NONE
>> start_proc_args   /bin/true
>> stop_proc_args    /bin/true
>> allocation_rule   $fill_up
>> control_slaves    TRUE
>> job_is_first_task FALSE
>> urgency_slots     min
>> # qconf -shgrp @ibhosts
>> group_name @ibhosts
>> hostlist node-0-0.local node-0-1.local node-0-2.local node-0-3.local \
>>         node-0-4.local node-0-5.local node-0-6.local node-0-7.local \
>>         node-0-8.local node-0-9.local node-0-10.local node-0-11.local \
>>         node-0-12.local node-0-13.local node-0-14.local node-0-16.local \
>>         node-0-17.local node-0-18.local node-0-19.local node-0-20.local \
>>         node-0-21.local node-0-22.local
>>
>> The hostnames for the IB interfaces are ibc0, ibc1, ..., ibc22.
>>
>> Could this difference be causing the problem?
>>
>> ssh issues:
>> between master & node: works fine but with some delay.
>>
>> between nodes: works fine, no delay
>>
>> From the command line, the Open MPI jobs ran with no errors, even when
>> the master node is not used in the hostfile.
>>
>> Thanks,
>> Sangamesh
>>
>>> -- Reuti
>>>
>>>
>>>> Jeremy Stout
>>>>
>>>> On Sat, Jan 24, 2009 at 6:06 AM, Sangamesh B <forum.san_at_[hidden]>
>>>> wrote:
>>>>>
>>>>> Hello all,
>>>>>
>>>>> Open MPI 1.3 is installed on a Rocks 4.3 Linux cluster with SGE
>>>>> support, i.e. configured using --with-sge.
>>>>> But ompi_info shows only one component:
>>>>> # /opt/mpi/openmpi/1.3/intel/bin/ompi_info | grep gridengine
>>>>>               MCA ras: gridengine (MCA v2.0, API v2.0, Component v1.3)
>>>>>
>>>>> Is this right? I ask because the SGE qmaster daemon was not running
>>>>> during the Open MPI installation.
>>>>>
>>>>> The problem now is that Open MPI parallel jobs submitted through
>>>>> Grid Engine fail (when run on multiple nodes) with this error:
>>>>>
>>>>> $ cat err.26.Helloworld-PRL
>>>>> ssh_exchange_identification: Connection closed by remote host
>>>>>
>>>>>
>>>>> --------------------------------------------------------------------------
>>>>> A daemon (pid 8462) died unexpectedly with status 129 while attempting
>>>>> to launch so we are aborting.
>>>>>
>>>>> There may be more information reported by the environment (see above).
>>>>>
>>>>> This may be because the daemon was unable to find all the needed shared
>>>>> libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
>>>>> location of the shared libraries on the remote nodes and this will
>>>>> automatically be forwarded to the remote nodes.
>>>>>
>>>>>
>>>>> --------------------------------------------------------------------------
>>>>>
>>>>>
>>>>> --------------------------------------------------------------------------
>>>>> mpirun noticed that the job aborted, but has no info as to the process
>>>>> that caused that situation.
>>>>>
>>>>>
>>>>> --------------------------------------------------------------------------
>>>>> mpirun: clean termination accomplished
>>>>>
>>>>> When the job runs on a single node, it runs and produces its
>>>>> output, but with an error:
>>>>> $ cat err.23.Helloworld-PRL
>>>>> libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
>>>>>  This will severely limit memory registrations.
>>>>> libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
>>>>>  This will severely limit memory registrations.
>>>>> libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
>>>>>  This will severely limit memory registrations.
>>>>> libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
>>>>>  This will severely limit memory registrations.
>>>>> libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
>>>>>  This will severely limit memory registrations.
>>>>>
>>>>>
>>>>> --------------------------------------------------------------------------
>>>>> WARNING: There was an error initializing an OpenFabrics device.
>>>>>
>>>>>  Local host:   node-0-4.local
>>>>>  Local device: mthca0
>>>>>
>>>>>
>>>>> --------------------------------------------------------------------------
>>>>> libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
>>>>>  This will severely limit memory registrations.
>>>>> libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
>>>>>  This will severely limit memory registrations.
>>>>> libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
>>>>>  This will severely limit memory registrations.
>>>>> [node-0-4.local:07869] 7 more processes have sent help message
>>>>> help-mpi-btl-openib.txt / error in device init
>>>>> [node-0-4.local:07869] Set MCA parameter "orte_base_help_aggregate" to
>>>>> 0 to see all help / error messages
>>>>>
>>>>> What could be causing this behavior?
>>>>>
>>>>> Thanks,
>>>>> Sangamesh
>>>>> _______________________________________________
>>>>> users mailing list
>>>>> users_at_[hidden]
>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>>>