Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] EXTERNAL: Re: Problem running under SGE
From: Reuti (reuti_at_[hidden])
Date: 2011-09-13 18:36:02


On 14.09.2011 at 00:25, Blosch, Edwin L wrote:

> Your comment guided me in the right direction, Reuti. And overlapped with your guidance, Ralph.
>
> It works: if I add this flag then it runs
> --mca plm_rsh_disable_qrsh
>
> Thank you both for the explanations.
>
> I had built OpenMPI on another system which, as I said, did not have SGE, so I did not give --without-sge (nor did I give --with-sge). In the future, when building 1.4.3, I will just add --without-sge and presumably I won't run into the qrsh issue.
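
One way to check whether a given build picked up SGE support is to look for the gridengine component in the ompi_info output. A minimal sketch, assuming the ompi_info belonging to the same installation is on the PATH; the helper name has_sge_support is made up for illustration:

```shell
# Hypothetical helper: reads `ompi_info` output on stdin and succeeds
# if a gridengine component is listed (i.e. SGE support was compiled in).
has_sge_support() {
    grep -q 'gridengine'
}

# Usage (requires ompi_info on PATH):
#   if ompi_info | has_sge_support; then echo "SGE support present"; fi
```

If the component is listed, mpirun will try to start its daemons through qrsh when it finds itself inside an SGE job, unless that is disabled with an MCA parameter or the build is redone with --without-sge.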

Do I understand correctly that you don't want a tight integration with correct accounting, and would rather start the slave tasks by rsh/ssh on your own? This can lead to oversubscribed machines if some users' scripts do not honor the machinefile correctly.

Having a tight integration (with disabled ssh/rsh inside the cluster) is the setup I usually prefer.
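
For reference, a tightly integrated job script might look roughly like the following. This is a sketch under assumptions: the parallel environment name "orte" and the slot count are placeholders for whatever is defined on the cluster, the launch helper and the MPIRUN override are made up for illustration, and Open MPI is assumed to be built with SGE support. No machinefile and no rsh agent are given; mpirun takes the granted hosts from SGE and starts its remote daemons via qrsh.

```shell
#!/bin/sh
#$ -N test_setup
#$ -cwd
#$ -pe orte 16        # "orte" is a placeholder PE name; use one defined on your cluster

# SGE sets NSLOTS to the number of slots actually granted to the job.
# MPIRUN can be overridden, e.g. with the full path to the Open MPI install.
launch() {
    "${MPIRUN:-mpirun}" -np "${NSLOTS:-1}" "$@"
}

# Only launch when running inside an SGE parallel environment.
if [ -n "${PE_HOSTFILE:-}" ]; then
    launch ./test_setup
fi
```

Submitted with qsub, the tasks on all nodes then run under SGE's control and show up in its accounting.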

-- Reuti

> Thanks again
>
>
>
>
> -----Original Message-----
> From: users-bounces_at_[hidden] [mailto:users-bounces_at_[hidden]] On Behalf Of Reuti
> Sent: Tuesday, September 13, 2011 4:27 PM
> To: Open MPI Users
> Subject: EXTERNAL: Re: [OMPI users] Problem running under SGE
>
> On 13.09.2011 at 23:18, Blosch, Edwin L wrote:
>
>> I'm able to run this command below from an interactive shell window:
>>
>> <path>/bin/mpirun --machinefile mpihosts.dat -np 16 -mca plm_rsh_agent /usr/bin/rsh -x MPI_ENVIRONMENT=1 ./test_setup
>>
>> but it does not work if I put it into a shell script and 'qsub' that script to SGE. I get the message shown at the bottom of this post.
>>
>> I've tried everything I can think of. I would welcome any hints on how to proceed.
>>
>> For what it's worth, this OpenMPI is 1.4.3 and I built it on another system. I am setting and exporting OPAL_PREFIX and, as I said, everything works fine interactively, just not in batch. It was built with --disable-shared and I don't see any shared libs under openmpi/lib. I've also run 'ldd' from within the script, on both the application executable and the orterun command; there are no unresolved shared libraries. So I don't think the error message hinting at LD_LIBRARY_PATH issues is pointing me in the right direction.
>>
>> Thanks for any guidance,
>>
>> Ed
>>
>
> Oh, I missed this:
>
>
>> error: executing task of job 139362 failed: execution daemon on host "f8312" didn't accept task
>
> Did you supply a machinefile on your own? In a proper SGE integration the job runs in a parallel environment (PE). Did you define and request one? The error looks as if the job was started in a PE but tried to access a node that was not granted to this particular job.
>
> -- Reuti
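
To see whether a hand-written machinefile names nodes the job was not granted, its hosts can be compared with $PE_HOSTFILE from inside the job script. A rough sketch, assuming the usual $PE_HOSTFILE layout with the host name in the first column and a machinefile with the host name first on each line; ungranted_hosts is a made-up helper name:

```shell
# Hypothetical helper: print hosts that appear in the machinefile ($1)
# but not in the SGE-granted host list ($2, normally "$PE_HOSTFILE").
ungranted_hosts() {
    awk 'FILENAME == ARGV[1] { granted[$1] = 1; next }  # first file: granted hosts
         NF && !($1 in granted) { print $1 }            # second file: report the rest
        ' "$2" "$1"
}

# Usage inside the job script:
#   ungranted_hosts mpihosts.dat "$PE_HOSTFILE"
```

The comparison is a plain string match, so short and fully qualified host names would count as different hosts. Any host it prints is a candidate for the "didn't accept task" error, since the execution daemon there has no task slot for this job.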
>
>
>> --------------------------------------------------------------------------
>> A daemon (pid 2818) died unexpectedly with status 1 while attempting
>> to launch so we are aborting.
>>
>> There may be more information reported by the environment (see above).
>>
>> This may be because the daemon was unable to find all the needed shared
>> libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
>> location of the shared libraries on the remote nodes and this will
>> automatically be forwarded to the remote nodes.
>> --------------------------------------------------------------------------
>> --------------------------------------------------------------------------
>> mpirun noticed that the job aborted, but has no info as to the process
>> that caused that situation.
>> --------------------------------------------------------------------------
>> mpirun: clean termination accomplished
>>
>> _______________________________________________
>> users mailing list
>> users_at_[hidden]
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
>