
Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] EXTERNAL: Re: Problem running under SGE
From: Blosch, Edwin L (edwin.l.blosch_at_[hidden])
Date: 2011-09-13 19:12:10


We don't budget computer hours, so I don't think we would use accounting, although I'm not sure I know what this capability is all about. Also, I don't care about launch speed; a few minutes mean nothing when the job will take days to run. Finally, I have a highly portable strategy of wrapping the mpirun command with a shell script that figures out how many processes are allocated to the job and explicitly tells Open MPI how many hosts to use and which ones. I can adapt that script in very minor ways to support any job-queueing system, past, present, or future, and my invocation of the mpirun command remains the same and should always work.
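A sketch of such a wrapper, assuming an SGE-style $PE_HOSTFILE (one line per host: hostname, slot count, queue, processor range). This is not the author's actual script; all file names and the demo host list are illustrative:

```shell
#!/bin/sh
# Illustrative wrapper: parse the scheduler's host allocation, build an
# explicit machinefile, compute -np, and hand everything to mpirun.

# For demonstration only, fabricate a PE_HOSTFILE when not under SGE;
# inside a real SGE job the scheduler exports $PE_HOSTFILE itself.
if [ -z "$PE_HOSTFILE" ]; then
    PE_HOSTFILE=./pe_hostfile.demo
    printf 'n001 8 all.q@n001 UNDEFINED\nn002 8 all.q@n002 UNDEFINED\n' > "$PE_HOSTFILE"
fi

MACHINEFILE=mpihosts.dat
: > "$MACHINEFILE"          # truncate/create the machinefile
NP=0
while read -r host slots _; do
    [ -n "$host" ] || continue
    echo "$host slots=$slots" >> "$MACHINEFILE"
    NP=$((NP + slots))      # total process count across all hosts
done < "$PE_HOSTFILE"

echo "would run: mpirun --machinefile $MACHINEFILE -np $NP $*"
# In the real wrapper, replace the echo with:
#   exec mpirun --machinefile "$MACHINEFILE" -np "$NP" "$@"
```

Adapting it to another queueing system only means changing where the host list is read from; the final mpirun invocation stays the same.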

For these reasons I have preferred the rsh/ssh launcher; the less intelligent, the better. I'm sure there are benefits to tight integration; as you said, perhaps you can keep users from accidentally or intentionally using nodes outside their allocation. It's just not an issue for us.

I will check the FAQ to see if I can learn more about the benefits of tight integration with a job-queueing system.

Thank you again for the help

-----Original Message-----
From: users-bounces_at_[hidden] [mailto:users-bounces_at_[hidden]] On Behalf Of Reuti
Sent: Tuesday, September 13, 2011 5:36 PM
To: Open MPI Users
Subject: Re: [OMPI users] EXTERNAL: Re: Problem running under SGE

On 14.09.2011, at 00:25, Blosch, Edwin L wrote:

> Your comment guided me in the right direction, Reuti. And overlapped with your guidance, Ralph.
>
> It works: if I add this flag then it runs
> --mca plm_rsh_disable_qrsh
>
> Thank you both for the explanations.
>
> I had built OpenMPI on another system which, as I said, did not have SGE, and thus I did not give --without-sge (nor did I give --with-sge). In the future, when building 1.4.3, I will just add --without-sge, and presumably I won't run into the qrsh issue.

Do I understand correctly that you don't want a tight integration with correct accounting, but prefer to launch the slave tasks by rsh/ssh on your own? This can lead to oversubscribed machines if some users' scripts don't honor the machinefile correctly.

Having a tight integration (with ssh/rsh disabled inside the cluster) is the setup I usually prefer.
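For reference, a tight SGE integration means defining a parallel environment, attaching it to a queue, and requesting it at submission time. A minimal sketch; the PE name and values below are illustrative, not taken from this thread (see the SGE and Open MPI documentation for authoritative settings):

```
# Parallel environment definition (as shown by `qconf -sp orte`):
pe_name            orte
slots              9999
control_slaves     TRUE    # SGE starts slave tasks via qrsh, so they
                           # are accounted and killed with the job
job_is_first_task  FALSE

# Attach it to a queue's pe_list, then request it at submission:
#   qsub -pe orte 16 job.sh
# Inside a granted PE, mpirun reads the allocation from SGE itself,
# so no explicit machinefile is needed.
```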

-- Reuti

> Thanks again
>
>
>
>
> -----Original Message-----
> From: users-bounces_at_[hidden] [mailto:users-bounces_at_[hidden]] On Behalf Of Reuti
> Sent: Tuesday, September 13, 2011 4:27 PM
> To: Open MPI Users
> Subject: EXTERNAL: Re: [OMPI users] Problem running under SGE
>
> On 13.09.2011, at 23:18, Blosch, Edwin L wrote:
>
>> I'm able to run this command below from an interactive shell window:
>>
>> <path>/bin/mpirun --machinefile mpihosts.dat -np 16 -mca plm_rsh_agent /usr/bin/rsh -x MPI_ENVIRONMENT=1 ./test_setup
>>
>> but it does not work if I put it into a shell script and 'qsub' that script to SGE. I get the message shown at the bottom of this post.
>>
>> I've tried everything I can think of. I would welcome any hints on how to proceed.
>>
>> For what it's worth, this OpenMPI is 1.4.3 and I built it on another system. I am setting and exporting OPAL_PREFIX, and as I said, everything works fine interactively, just not in batch. It was built with --disable-shared, and I don't see any shared libs under openmpi/lib; I've also run 'ldd' from within the script on both the application executable and the orterun command, with no unresolved shared libraries. So I don't think the error message hinting at LD_LIBRARY_PATH issues is pointing me in the right direction.
>>
>> Thanks for any guidance,
>>
>> Ed
>>
>
> Oh, I missed this:
>
>
>> error: executing task of job 139362 failed: execution daemon on host "f8312" didn't accept task
>
> Did you supply a machinefile of your own? In a proper SGE integration the job runs in a parallel environment. Did you define and request one? The error looks like the job was started in a PE but tried to access a node not granted to it.
>
> -- Reuti
>
>
>> --------------------------------------------------------------------------
>> A daemon (pid 2818) died unexpectedly with status 1 while attempting
>> to launch so we are aborting.
>>
>> There may be more information reported by the environment (see above).
>>
>> This may be because the daemon was unable to find all the needed shared
>> libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
>> location of the shared libraries on the remote nodes and this will
>> automatically be forwarded to the remote nodes.
>> --------------------------------------------------------------------------
>> --------------------------------------------------------------------------
>> mpirun noticed that the job aborted, but has no info as to the process
>> that caused that situation.
>> --------------------------------------------------------------------------
>> mpirun: clean termination accomplished
>>
>> _______________________________________________
>> users mailing list
>> users_at_[hidden]
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
>
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users

_______________________________________________
users mailing list
users_at_[hidden]
http://www.open-mpi.org/mailman/listinfo.cgi/users