Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] job termination on grid
From: Reuti (reuti_at_[hidden])
Date: 2013-04-30 15:35:41


Hi,

On 30.04.2013 at 21:26, Vladimir Yamshchikov wrote:

> My recent job started normally, but after a few hours of running it died with the following message:
>
> --------------------------------------------------------------------------
> A daemon (pid 19390) died unexpectedly with status 137 while attempting
> to launch so we are aborting.

I wonder why it raised the failure only after running for hours. As 137 = 128 + 9, the daemon was killed by SIGKILL, maybe by the queuing system due to the time limit you set? If you check the accounting, what is the output of:

$ qacct -j <job_id>
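
As a minimal sketch (assuming a bash shell; the qacct field names are standard GridEngine accounting fields, but their interpretation for this particular job is my assumption), the signal behind such an exit status can be decoded like this:

$ status=137
$ kill -l $((status - 128))   # exit statuses above 128 mean "killed by signal (status - 128)"
KILL

In the qacct output, these fields would be worth checking:

  failed       - nonzero if GridEngine itself aborted the job
  exit_status  - 137 here, i.e. killed by signal 9 (SIGKILL)
  ru_wallclock - compare against the requested h_rt of 24:00:00
  maxvmem      - compare against the vf request of 3G per slot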

-- Reuti

> There may be more information reported by the environment (see above).
>
> This may be because the daemon was unable to find all the needed shared
> libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
> location of the shared libraries on the remote nodes and this will
> automatically be forwarded to the remote nodes.
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> mpirun noticed that the job aborted, but has no info as to the process
> that caused that situation.
>
> The scheduling script is below:
>
> #$ -S /bin/bash
> #$ -cwd
> #$ -N SC3blastx_64-96thr
> #$ -pe openmpi* 64-96
> #$ -l h_rt=24:00:00,vf=3G
> #$ -j y
> #$ -M yaximik_at_[hidden]
> #$ -m eas
> #
> # Load the appropriate module files
> # Should be loaded already
> #$ -V
>
> mpirun -np $NSLOTS blastx -query $UABGRID_SCRATCH/SC/AdQ30/fasta/SC1-IS4-Ind1-153ngFr1sep1run1R1AdQ30.fasta -db nr -out $UABGRID_SCRATCH/SC/blastx/SC/SC1-IS4-Ind1-153ngFr1sep1run1R1AdQ30.out -evalue 0.001 -max_intron_length 100000 -outfmt 5 -num_alignments 20 -lcase_masking -num_threads $NSLOTS
>
> What caused this termination? It does not seem to be a scheduling problem, as the program ran for several hours with 96 threads. My $LD_LIBRARY_PATH does have the /share/apps/openmpi/1.6.4-gcc/lib entry, so how else should I modify it?
>
> Vladimir
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users