
Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] job termination on grid
From: Ralph Castain (rhc_at_[hidden])
Date: 2013-04-30 17:02:02


On Apr 30, 2013, at 1:54 PM, Vladimir Yamshchikov <yaximik_at_[hidden]> wrote:

> This is the question I am trying to answer - how many threads can I use with blastx on a grid? If I could request resources by_node, I would use the -pernode option to have one process per node and then specify the correct number of threads for each node. But I cannot; resources (slots) are requested per-core (per_process),

I don't believe that is true - resources are requested for the entire job, not for each process
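As a minimal sketch of the per-node layout you describe (the PE name "openmpi_fixed" and its allocation_rule are assumptions here - whether such a PE exists depends on how your admins configured SGE):

    # Hypothetical PE configured with "allocation_rule 12", so SGE hands
    # out slots only in whole 12-core nodes; 96 slots -> exactly 8 nodes
    #$ -pe openmpi_fixed 96

    # one MPI process per node, each limited to the node's 12 cores
    mpirun -pernode blastx ... -num_threads 12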

> so I was instructed to request the total number of slots. However, as the allocated cores are spread across the nodes, it looks like this messes up the scheduling and causes the overload.

I suggest you look at the SGE documentation - I don't think you are using it correctly
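For example, you can inspect how a parallel environment hands out slots (standard SGE commands; the PE name is whatever your site defines):

    $ qconf -spl          # list the parallel environments on the cluster
    $ qconf -sp openmpi   # show a PE's slot limit and allocation_rule

The allocation_rule is what decides whether your 64-96 slots land on one node or get spread across many.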

>
>
> On Tue, Apr 30, 2013 at 3:46 PM, Ralph Castain <rhc_at_[hidden]> wrote:
>
> On Apr 30, 2013, at 1:34 PM, Vladimir Yamshchikov <yaximik_at_[hidden]> wrote:
>
>> I asked grid IT and they said they had to kill it because the job was overloading nodes. They saw loads up to 180 instead of close to 12 on 12-core nodes. They think that blastx is not an Open MPI application, so Open MPI is spawning between 64 and 96 blastx processes, each of which then starts up 96 worker threads. Or, if blastx can work with Open MPI, my blastx/mpirun syntax is wrong. Any advice?
>> I was advised earlier to use -pe openmpi [ARG], where ARG = number_of_processes x number_of_threads, and then to pass the desired number of threads as 'mpirun -np $NSLOTS -cpus-per-proc [number_of_threads]'. When I did that, I got an error that more threads were requested than the number of physical cores.
>>
>
> How many threads are you trying to launch? If it is a 12-core node, then you can't have more than 12 - it sounds like you are trying to start up 96!
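>
> As a minimal sketch of the arithmetic (assuming 12-core nodes and a 96-slot allocation; "blastx ..." stands in for your full argument list): with -np $NSLOTS and -num_threads $NSLOTS you ask for 96 x 96 threads in total. Instead, keep threads-per-process at or below the core count and divide the slots among the processes:
>
>     # 96 slots / 12 threads per process = 8 MPI processes
>     mpirun -np 8 -cpus-per-proc 12 blastx ... -num_threads 12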
>
>>
>> On Tue, Apr 30, 2013 at 2:35 PM, Reuti <reuti_at_[hidden]> wrote:
>> Hi,
>>
>> On 30.04.2013 at 21:26, Vladimir Yamshchikov wrote:
>>
>> > My recent job started normally, but after a few hours of running it died with the following message:
>> >
>> > --------------------------------------------------------------------------
>> > A daemon (pid 19390) died unexpectedly with status 137 while attempting
>> > to launch so we are aborting.
>>
>> I wonder why it raised the failure only after running for hours. As 137 = 128 + 9, the daemon was killed by signal 9 (SIGKILL) - maybe by the queuing system due to the set time limit? If you check the accounting, what is the output of:
>>
>> $ qacct -j <job_id>
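>>
>> (As a quick cross-check of the 137 = 128 + 9 reading - plain shell, nothing SGE-specific:
>>
>>     $ kill -l $((137 - 128))
>>     KILL
>>
>> i.e. the daemon received SIGKILL.)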
>>
>> -- Reuti
>>
>>
>> > There may be more information reported by the environment (see above).
>> >
>> > This may be because the daemon was unable to find all the needed shared
>> > libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
>> > location of the shared libraries on the remote nodes and this will
>> > automatically be forwarded to the remote nodes.
>> > --------------------------------------------------------------------------
>> > --------------------------------------------------------------------------
>> > mpirun noticed that the job aborted, but has no info as to the process
>> > that caused that situation.
>> >
>> > The scheduling script is below:
>> >
>> > #$ -S /bin/bash
>> > #$ -cwd
>> > #$ -N SC3blastx_64-96thr
>> > #$ -pe openmpi* 64-96
>> > #$ -l h_rt=24:00:00,vf=3G
>> > #$ -j y
>> > #$ -M yaximik_at_[hidden]
>> > #$ -m eas
>> > #
>> > # Load the appropriate module files
>> > # Should be loaded already
>> > #$ -V
>> >
>> > mpirun -np $NSLOTS blastx -query $UABGRID_SCRATCH/SC/AdQ30/fasta/SC1-IS4-Ind1-153ngFr1sep1run1R1AdQ30.fasta -db nr -out $UABGRID_SCRATCH/SC/blastx/SC/SC1-IS4-Ind1-153ngFr1sep1run1R1AdQ30.out -evalue 0.001 -max_intron_length 100000 -outfmt 5 -num_alignments 20 -lcase_masking -num_threads $NSLOTS
>> >
>> > What caused this termination? It does not seem to be a scheduling problem, as the program ran for several hours with 96 threads. My $LD_LIBRARY_PATH does have the /share/apps/openmpi/1.6.4-gcc/lib entry, so how else should I modify it?
>> >
>> > Vladimir
_______________________________________________
users mailing list
users_at_[hidden]
http://www.open-mpi.org/mailman/listinfo.cgi/users