Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Problem running under SGE
From: Reuti (reuti_at_[hidden])
Date: 2011-09-13 17:24:55


On 13.09.2011 at 23:18, Blosch, Edwin L wrote:

> I’m able to run this command below from an interactive shell window:
>
> <path>/bin/mpirun --machinefile mpihosts.dat -np 16 -mca plm_rsh_agent /usr/bin/rsh -x MPI_ENVIRONMENT=1 ./test_setup
>
> but it does not work if I put it into a shell script and ‘qsub’ that script to SGE. I get the message shown at the bottom of this post.
>
> I’ve tried everything I can think of. I would welcome any hints on how to proceed.
>
> For what it’s worth, this is Open MPI 1.4.3, and I built it on another system. I am setting and exporting OPAL_PREFIX and, as I said, everything works fine interactively, just not in batch. It was built with --disable-shared, and I don’t see any shared libs under openmpi/lib. I’ve also run ‘ldd’ from within the script on both the application executable and the orterun command; there are no unresolved shared libraries. So I don’t think the error message hinting at LD_LIBRARY_PATH issues is pointing me in the right direction.

Did you compile Open MPI with --with-sge?

What are the settings of qrsh_command and qrsh_daemon in SGE (qconf -sconf), and do any local configurations exist that override them (qconf -sconfl)?
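
Both checks could be scripted roughly as follows. This is only a sketch: the helper name show_remote_startup is made up for illustration, the additional parameter names in the pattern (rsh_*, rlogin_*, qlogin_*) are assumed from the SGE configuration man page rather than taken from this thread, and it assumes qconf and ompi_info are on the PATH of the host where it is run.

```shell
#!/bin/sh
# Hypothetical helper: keep only the remote-startup settings from SGE
# configuration text read on stdin. The parameter names besides
# qrsh_command/qrsh_daemon are an assumption, not from this thread.
show_remote_startup() {
    grep -E '^(qrsh|rsh|rlogin|qlogin)_(command|daemon)[[:space:]]'
}

if command -v qconf >/dev/null 2>&1; then
    echo "== global configuration =="
    qconf -sconf | show_remote_startup || true

    # qconf -sconfl lists hosts that have a local configuration;
    # qconf -sconf <host> prints that host's local settings.
    for host in $(qconf -sconfl 2>/dev/null); do
        echo "== local configuration: $host =="
        qconf -sconf "$host" | show_remote_startup || true
    done
fi

if command -v ompi_info >/dev/null 2>&1; then
    # A build configured with --with-sge normally lists a gridengine
    # component here; no output suggests SGE support is missing.
    ompi_info | grep -i gridengine || echo "no gridengine component found"
fi
```

If a local configuration prints different values than the global one, the local values are the ones that apply on that host.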

-- Reuti

>
> Thanks for any guidance,
>
> Ed
>
>
> error: executing task of job 139362 failed: execution daemon on host "f8312" didn't accept task
> --------------------------------------------------------------------------
> A daemon (pid 2818) died unexpectedly with status 1 while attempting
> to launch so we are aborting.
>
> There may be more information reported by the environment (see above).
>
> This may be because the daemon was unable to find all the needed shared
> libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
> location of the shared libraries on the remote nodes and this will
> automatically be forwarded to the remote nodes.
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> mpirun noticed that the job aborted, but has no info as to the process
> that caused that situation.
> --------------------------------------------------------------------------
> mpirun: clean termination accomplished