
Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] SGE tight integration and "tm" protocol for start
From: Reuti (reuti_at_[hidden])
Date: 2008-10-13 03:39:33


Am 13.10.2008 um 00:55 schrieb Sean Davis:

> On Sat, Oct 11, 2008 at 6:48 PM, Reuti <reuti_at_[hidden]>
> wrote:
>> Am 12.10.2008 um 00:21 schrieb Sean Davis:
>>
>>> <snip>
>>>
>>> Thanks, Pak. There is only one queue on the SGE system. Of course,
>>> there are queue instances for each machine, which is the usual for
>>> SGE.
>>>
>>> I'll give the -masterq a look. And the messages files for the
>>> involved machines are devoid of anything useful; in fact, there
>>> is no
>>> mention of these jobs, in general.
>>
>> Hi,
>>
>> to see more, you can set "loglevel log_info" in the scheduler
>> configuration.
>>
>> Do you have more than one network card installed, and did you give
>> them the same name?
>> Your defined "tmpdir" is local on each machine?
>> Do you redefine $TMPDIR in your .bashrc or anything else therein?
>
> The tmpdir is the same on each machine and is local to the machine.
> We do have two interfaces on each machine, one for a local subnet and
> the other for an outside connection from each machine. The DNS is
> resolved on the outside network. Why would the behavior be different
> for jobs run in a PE with $round_robin allocation than for standard
> serial jobs or for jobs on a single node?

Hi Sean,

I found this with MPICH(1) and don't know whether it also applies to
Open MPI. The node where the master of the parallel job starts, i.e.
where the mpirun command is executed, sends its own hostname to the
slave tasks in the startup message. When this name belongs to the
outside network, the other nodes will never find the master. Therefore
a special variable, MPI_HOST, must be set to the name of the internal
interface before calling mpirun:

MPI_HOST=$(grep "$(hostname)" "$SGE_ROOT/default/common/host_aliases" |
           cut -f 1 -d " ")

The master itself is aware of both names, but the other nodes are not.
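
For illustration, a minimal sketch of how this could look in the job
script. The PE name, slot count and program below are made-up
placeholders, and whether Open MPI honors MPI_HOST the same way as
MPICH(1) is exactly the open question here:

#!/bin/bash
#$ -S /bin/bash
#$ -cwd
#$ -pe round_robin 4   # hypothetical PE name and slot count
# Look up this host's internal name (first column of host_aliases)
# and export it so mpirun, a child process of this script, can see it.
MPI_HOST=$(grep "$(hostname)" "$SGE_ROOT/default/common/host_aliases" |
           cut -f 1 -d " ")
export MPI_HOST
mpirun -np $NSLOTS ./my_program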

Did you set up host_aliases? Is the primary interface used for the
internal or the external connection?
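
(For reference: in $SGE_ROOT/default/common/host_aliases each line
starts with the hostname SGE should use, followed by its aliases.
With made-up names:

node01  node01.external.example.com
node02  node02.external.example.com

With the internal short names in the first column, the grep/cut line
above picks up the internal name.)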

--- Reuti

> Sean
>>>>> Date: Sat, 11 Oct 2008 07:56:02 -0400
>>>>> From: Jeff Squyres <jsquyres_at_[hidden]>
>>>>> Subject: Re: [OMPI users] SGE tight integration and "tm" protocol for start
>>>>> To: Open MPI Users <users_at_[hidden]>
>>>>> Message-ID: <3E62159B-14B9-4D44-96F6-0345079BCCE5_at_[hidden]>
>>>>> Content-Type: text/plain; charset=US-ASCII; format=flowed;
>>>>> delsp=yes
>>>>>
>>>>> I don't know much/anything about SGE (I'll leave that to the Sun
>>>>> folks on this list to reply), but I can tell you about the tm
>>>>> plugins: tm is the protocol used by the PBS/Torque family of
>>>>> launchers. It looks like your Open MPI was built with TM support,
>>>>> but when you launch, it's likely unable to find the support
>>>>> libraries that it needs to load those plugins.
>>>>>
>>>>> This is probably fine in your case, since you want to use SGE,
>>>>> not TM.
>>>>>
>>>>>
>>>>> On Oct 9, 2008, at 4:40 PM, Sean Davis wrote:
>>>>>
>>>>>> I am relatively new to Open MPI and Sun Grid Engine parallel
>>>>>> integration. I have a small cluster running SGE 6.2 on Linux
>>>>>> machines, all using Intel Xeon processors. I have installed
>>>>>> Open MPI 1.2.7 from source using the --with-sge switch. Now I am
>>>>>> trying to troubleshoot some problems I am having. I have created
>>>>>> a simple job script:
>>>>>>
>>>>>> The job script looks like:
>>>>>> #!/bin/bash
>>>>>> #$ -S /bin/bash
>>>>>> #$ -cwd
>>>>>> mpirun --mca pls_gridengine_verbose 1 -np $NSLOTS hostname
>>>>>>
>>>>>> And the output on the error stream:
>>>>>>>
>>>>>>> more junksub.sh.e3574
>>>>>>
>>>>>> [shakespeare:05720] mca: base: component_find: unable to open ras tm: file not found (ignored)
>>>>>> [shakespeare:05720] mca: base: component_find: unable to open pls tm: file not found (ignored)
>>>>>> Starting server daemon at host "shakespeare.nci.nih.gov"
>>>>>> Starting server daemon at host "octopus.nci.nih.gov"
>>>>>> Server daemon successfully started with task id "1.shakespeare"
>>>>>> [shakespeare:05733] mca: base: component_find: unable to open ras tm: file not found (ignored)
>>>>>> [shakespeare:05733] mca: base: component_find: unable to open pls tm: file not found (ignored)
>>>>>> error: executing task of job 3576 failed: failed sending task to execd_at_[hidden]: can't find connection
>>>>>> [shakespeare:05720] ERROR: A daemon on node octopus.nci.nih.gov failed to start as expected.
>>>>>> [shakespeare:05720] ERROR: There may be more information available from
>>>>>> [shakespeare:05720] ERROR: the 'qstat -t' command on the Grid Engine tasks.
>>>>>> [shakespeare:05720] ERROR: If the problem persists, please restart the
>>>>>> [shakespeare:05720] ERROR: Grid Engine PE job
>>>>>> [shakespeare:05720] ERROR: The daemon exited unexpectedly with status 1.
>>>>>>
>>>>>> However, there is no output in any output stream.
>>>>>>
>>>>>> And if I log into shakespeare and qrsh -q all.q@octopus, I
>>>>>> immediately
>>>>>> get a slot, so there isn't a "direct" problem with connecting.
>>>>>>
>>>>>> As I got a hint from folks on the SGE mailing list, it appears
>>>>>> that
>>>>>> qrsh is not being used for job submission. Any suggestions as
>>>>>> to why
>>>>>> this might be the case (or if it is the case)?
>>>>>>
>>>>>> Thanks,
>>>>>> Sean