On 13.09.2011 at 23:18, Blosch, Edwin L wrote:
> I'm able to run the command below from an interactive shell window:
>
> <path>/bin/mpirun --machinefile mpihosts.dat -np 16 -mca plm_rsh_agent /usr/bin/rsh -x MPI_ENVIRONMENT=1 ./test_setup
>
> but it does not work if I put it into a shell script and qsub that script to SGE. I get the message shown at the bottom of this post.
>
> I've tried everything I can think of. I would welcome any hints on how to proceed.
>
> For what it's worth, this OpenMPI is 1.4.3 and I built it on another system. I am setting and exporting OPAL_PREFIX and, as I said, everything works fine interactively, just not in batch. It was built with --disable-shared and I don't see any shared libs under openmpi/lib. I've also run ldd from within the script, on both the application executable and the orterun command; there are no unresolved shared libraries. So I don't think the error message hinting at LD_LIBRARY_PATH issues is pointing me in the right direction.
>
> Thanks for any guidance,
>
> Ed
>
Oh, I missed this:
> error: executing task of job 139362 failed: execution daemon on host "f8312" didn't accept task
Did you supply a machinefile of your own? With a proper SGE integration, the job runs in a parallel environment (PE). Did you define and request one? The error looks as if the job was started in a PE but then tried to access a node that was not granted to the actual job.
-- Reuti
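
One way to check this from inside the job script is to compare the hand-written machinefile against the hosts SGE actually granted. The following is only a rough sketch: the function name hosts_not_granted is made up for illustration, and it assumes that $PE_HOSTFILE (set by SGE for jobs running in a PE) has one line per host with the host name in the first column, and that both files spell host names the same way (short vs. fully qualified).

```shell
#!/bin/sh
# Sketch only: print every host named in a machinefile that does not
# appear in the SGE PE hostfile, i.e. hosts not granted to this job.
# Assumption: PE hostfile lines look like "host slots queue processor-range";
# machinefile lines look like "host" or "host slots=N".

hosts_not_granted() {
    machinefile=$1
    pe_hostfile=$2
    # First pass (PE hostfile): remember granted hosts.
    # Second pass (machinefile): print hosts that were not granted,
    # skipping blank lines and comment lines.
    awk 'FILENAME == ARGV[1] { granted[$1] = 1; next }
         $1 != "" && $1 !~ /^#/ && !($1 in granted) { print $1 }' \
        "$pe_hostfile" "$machinefile" | sort -u
}

# Possible use inside the qsub script, before calling mpirun:
#   hosts_not_granted mpihosts.dat "$PE_HOSTFILE"
```

If this prints any host, mpirun is being pointed at a node outside the job's allocation. If Open MPI was built with SGE support, it may be enough to drop --machinefile and start the job with -np $NSLOTS so that the allocation granted by SGE is used.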
> --------------------------------------------------------------------------
> A daemon (pid 2818) died unexpectedly with status 1 while attempting
> to launch so we are aborting.
>
> There may be more information reported by the environment (see above).
>
> This may be because the daemon was unable to find all the needed shared
> libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
> location of the shared libraries on the remote nodes and this will
> automatically be forwarded to the remote nodes.
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> mpirun noticed that the job aborted, but has no info as to the process
> that caused that situation.
> --------------------------------------------------------------------------
> mpirun: clean termination accomplished