Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] EXTERNAL: Re: Problem running under SGE
From: Reuti (reuti_at_[hidden])
Date: 2011-09-13 18:36:06


On 14.09.2011 at 00:29, Ralph Castain wrote:

>
> On Sep 13, 2011, at 4:25 PM, Reuti wrote:
>
>> On 13.09.2011 at 23:54, Blosch, Edwin L wrote:
>>
>>> This version of OpenMPI I am running was built without any guidance regarding SGE in the configure command, but it was built on a system that did not have SGE, so I would presume support is absent.
>>
>> Whether SGE is installed on the build machine is not relevant. In contrast to Torque (and I think also SLURM), nothing is compiled into Open MPI that needs a library from the designated queuing system to support it. In the case of SGE, it will just check for the existence of some environment variables and call `qrsh -inherit ...`. Further startup is handled by SGE via the defined qrsh_daemon/qrsh_command.
>>
>> So, to check it you can issue:
>>
>> ompi_info | grep grid
>
> Just an FYI: that could still yield no output and not mean that qrsh won't be used by the launcher. The rsh launcher has the qrsh command embedded within it, so it won't show on ompi_info.

Got it - thx. - Reuti
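
To illustrate the environment-variable check mentioned above, here is a minimal sketch that reports whether the variables that usually mark an SGE parallel job are set. The variable list (SGE_ROOT, ARC, PE_HOSTFILE, JOB_ID) and the function name sge_env_status are assumptions made for this example; the thread does not name the exact variables Open MPI tests.

```shell
#!/bin/sh
# Report whether the variables that usually mark an SGE parallel job are set.
# Assumption: SGE_ROOT, ARC, PE_HOSTFILE and JOB_ID are the relevant ones.
sge_env_status() {
    missing=""
    for var in SGE_ROOT ARC PE_HOSTFILE JOB_ID; do
        # eval expands the variable whose name is stored in $var
        eval "value=\${$var:-}"
        if [ -z "$value" ]; then
            missing="$missing $var"
        fi
    done
    if [ -z "$missing" ]; then
        echo "SGE job environment detected"
    else
        echo "not an SGE job environment; unset:$missing"
    fi
}
```

Calling sge_env_status once in the interactive shell and once inside the submitted script shows whether the two environments differ in this respect.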

>> Any output?
>>
>>
>>> My hope is that OpenMPI will not attempt to use SGE in any way. But perhaps it is trying to.
>>>
>>> Yes, I did supply a machinefile on my own. It is formed on the fly within the submitted script by parsing the PE_HOSTFILE, and I leave the
>>
>> Parsing the PE_HOSTFILE and preparing it in a format suitable for the actual parallel library is usually done in start_proc_args, so that it happens once for all users and applications using this parallel library. With a tight integration, the start_proc_args/stop_proc_args entries could be set to NONE though.
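
A minimal sketch of the PE_HOSTFILE conversion described above, assuming the usual PE_HOSTFILE layout of one line per host with the host name in the first field and the granted slot count in the second. The function name pe_hostfile_to_machinefile is hypothetical; the output uses the "host slots=N" form that Open MPI accepts in a hostfile.

```shell
#!/bin/sh
# Convert an SGE PE_HOSTFILE into an Open MPI hostfile on stdout.
# Assumed input line layout: <host> <slots> <queue> <processor range>
# $1: path to the PE_HOSTFILE
pe_hostfile_to_machinefile() {
    # Skip lines with fewer than two fields (for example blank lines).
    awk 'NF >= 2 { print $1 " slots=" $2 }' "$1"
}
```

Inside a job script this could be used as `pe_hostfile_to_machinefile "$PE_HOSTFILE" > mpihosts.dat`. With a tight integration the file should not be needed at all, since mpirun can pick up the allocation from SGE.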
>>
>>
>>> resulting file lying around, and the result appears to be correct, i.e. it includes those nodes (and only those nodes) allocated to the job.
>>
>> Well, even without compiling --with-sge you could achieve a so-called tight integration and confuse the startup. What does your PE look like? Depending on whether Open MPI starts a task on the master node of the job by a local `qrsh -inherit ...`, job_is_first_task needs to be set to FALSE (this allows one `qrsh -inherit ...` call to be made locally). But if all is fine, the job script is already the first task and TRUE should work.
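
To answer the "what does your PE look like" question quickly, a small filter along these lines can pull the relevant entries out of a PE definition. It is only a sketch: it expects the text printed by `qconf -sp <pe_name>` on stdin, the function name pe_tight_settings is hypothetical, and control_slaves is included on the assumption that `qrsh -inherit` requires it to be TRUE.

```shell
#!/bin/sh
# Read a PE definition (as printed by `qconf -sp <pe_name>`) on stdin and
# print only the entries that matter for a tight integration, as key=value.
pe_tight_settings() {
    awk '$1 ~ /^(control_slaves|job_is_first_task|start_proc_args|stop_proc_args)$/ {
             key = $1
             $1 = ""              # drop the key; awk rebuilds $0 from the rest
             sub(/^ +/, "")       # strip the separator left at the front
             print key "=" $0
         }'
}
```

Typical use would be `qconf -sp <pe_name> | pe_tight_settings`, with the name of the PE requested by the job in place of `<pe_name>`.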
>>
>> -- Reuti
>>
>>
>>> -----Original Message-----
>>> From: users-bounces_at_[hidden] [mailto:users-bounces_at_[hidden]] On Behalf Of Reuti
>>> Sent: Tuesday, September 13, 2011 4:27 PM
>>> To: Open MPI Users
>>> Subject: EXTERNAL: Re: [OMPI users] Problem running under SGE
>>>
>>> On 13.09.2011 at 23:18, Blosch, Edwin L wrote:
>>>
>>>> I'm able to run this command below from an interactive shell window:
>>>>
>>>> <path>/bin/mpirun --machinefile mpihosts.dat -np 16 -mca plm_rsh_agent /usr/bin/rsh -x MPI_ENVIRONMENT=1 ./test_setup
>>>>
>>>> but it does not work if I put it into a shell script and 'qsub' that script to SGE. I get the message shown at the bottom of this post.
>>>>
>>>> I've tried everything I can think of. I would welcome any hints on how to proceed.
>>>>
>>>> For what it's worth, this OpenMPI is 1.4.3 and I built it on another system. I am setting and exporting OPAL_PREFIX and, as I said, everything works fine interactively, just not in batch. It was built with -disable-shared and I don't see any shared libs under openmpi/lib, and I've run 'ldd' from within the script on both the application executable and the orterun command; there are no unresolved shared libraries. So I don't think the error message hinting at LD_LIBRARY_PATH issues is pointing me in the right direction.
>>>>
>>>> Thanks for any guidance,
>>>>
>>>> Ed
>>>>
>>>
>>> Oh, I missed this:
>>>
>>>
>>>> error: executing task of job 139362 failed: execution daemon on host "f8312" didn't accept task
>>>
>>> Did you supply a machinefile of your own? In a proper SGE integration, the job runs in a parallel environment. Did you define and request one? The error looks like the job was started in a PE but tried to access a node not granted to the actual job.
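
One way to test the "node not granted" suspicion is to compare the hosts in the machinefile with those in the PE_HOSTFILE. The sketch below assumes both files carry the host name in the first field of each line; the function name hosts_not_granted is hypothetical. The comparison is an exact string match, so a short name in one file and a fully qualified name in the other will be reported as a mismatch.

```shell
#!/bin/sh
# Print every host named in a machinefile that is absent from the PE_HOSTFILE.
# $1: machinefile  (host name in the first field of each line)
# $2: PE_HOSTFILE  (host name in the first field of each line)
hosts_not_granted() {
    awk 'FILENAME == ARGV[1] { granted[$1] = 1; next }    # first file: PE_HOSTFILE
         NF > 0 && !($1 in granted) { print $1 }' "$2" "$1"
}
```

Within the job script, `hosts_not_granted mpihosts.dat "$PE_HOSTFILE"` should print nothing if the machinefile only names hosts that SGE granted to the job.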
>>>
>>> -- Reuti
>>>
>>>
>>>> --------------------------------------------------------------------------
>>>> A daemon (pid 2818) died unexpectedly with status 1 while attempting
>>>> to launch so we are aborting.
>>>>
>>>> There may be more information reported by the environment (see above).
>>>>
>>>> This may be because the daemon was unable to find all the needed shared
>>>> libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
>>>> location of the shared libraries on the remote nodes and this will
>>>> automatically be forwarded to the remote nodes.
>>>> --------------------------------------------------------------------------
>>>> --------------------------------------------------------------------------
>>>> mpirun noticed that the job aborted, but has no info as to the process
>>>> that caused that situation.
>>>> --------------------------------------------------------------------------
>>>> mpirun: clean termination accomplished
>>>>
>>>> _______________________________________________
>>>> users mailing list
>>>> users_at_[hidden]
>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users