Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] job termination on grid
From: Vladimir Yamshchikov (yaximik_at_[hidden])
Date: 2013-04-30 16:54:56


This is the question I am trying to answer: how many threads can I use
with blastx on a grid? If I could request resources by node, I would use the
-pernode option to have one process per node and then specify the correct
number of threads for each node. But I cannot: resources (slots) are requested
per core (per process), so I was instructed to request the total number of
slots. However, because the allocated cores are spread across the nodes, it
looks like this messes up scheduling and causes the overload.
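
The arithmetic behind the overload can be sketched in a few lines of shell. This is only an illustration under the numbers mentioned in this thread (96 slots, 12-core nodes); the function names are hypothetical helpers and not part of SGE, Open MPI, or BLAST+. It assumes, as grid IT suspects, that every copy started by mpirun runs its own full set of threads.

```shell
#!/bin/bash
# Hypothetical helpers for illustration only; not part of SGE, Open MPI, or BLAST+.

# Total worker threads when `copies` processes each start `threads` threads.
total_threads() {
    local copies=$1 threads=$2
    echo $(( copies * threads ))
}

# Largest thread count that does not exceed the cores available on one node.
capped_threads() {
    local requested=$1 cores_per_node=$2
    if (( requested > cores_per_node )); then
        echo "$cores_per_node"
    else
        echo "$requested"
    fi
}

# "mpirun -np $NSLOTS blastx ... -num_threads $NSLOTS" with NSLOTS=96:
total_threads 96 96     # prints 9216

# What a single blastx process on one 12-core node can use without oversubscribing:
capped_threads 96 12    # prints 12
```

Under these assumptions the thread count grows with the square of the slot count, which would be consistent with the node loads reported by grid IT.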

On Tue, Apr 30, 2013 at 3:46 PM, Ralph Castain <rhc_at_[hidden]> wrote:

>
> On Apr 30, 2013, at 1:34 PM, Vladimir Yamshchikov <yaximik_at_[hidden]>
> wrote:
>
> I asked grid IT and they said they had to kill it as the job was
> overloading nodes. They saw loads up to 180 instead of close to 12 on
> 12-core nodes. They think that blastx is not an Open MPI application, so
> Open MPI is spawning 64-96 blastx processes, each of which then starts
> up 96 worker threads. Or, if blastx can work with Open MPI, my mpirun
> syntax is wrong. Any advice?
>
> I was advised earlier to use -pe openmpi [ARG], where ARG =
> number_of_processes x number_of_threads, and then pass the desired number of
> threads as 'mpirun -np $NSLOTS -cpus-per-proc [number_of_threads]'. When I
> did that, I got an error that more threads were requested than the number of
> physical cores.
>
>
> How many threads are you trying to launch? If it is a 12-core node, then
> you can't have more than 12. It sounds like you are trying to start up 96!
>
> On Tue, Apr 30, 2013 at 2:35 PM, Reuti <reuti_at_[hidden]> wrote:
>
>> Hi,
>>
>> Am 30.04.2013 um 21:26 schrieb Vladimir Yamshchikov:
>>
>> > My recent job started normally but after a few hours of running died
>> > with the following message:
>> >
>> >
>> --------------------------------------------------------------------------
>> > A daemon (pid 19390) died unexpectedly with status 137 while attempting
>> > to launch so we are aborting.
>>
>> I wonder why it raised the failure only after running for hours. As 137 =
>> 128 + 9, it was killed (signal 9, SIGKILL), maybe by the queuing system due
>> to the set time limit? If you check the accounting, what is the output of:
>>
>> $ qacct -j <job_id>
>>
>> -- Reuti
>>
>>
>> > There may be more information reported by the environment (see above).
>> >
>> > This may be because the daemon was unable to find all the needed shared
>> > libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
>> > location of the shared libraries on the remote nodes and this will
>> > automatically be forwarded to the remote nodes.
>> >
>> --------------------------------------------------------------------------
>> >
>> --------------------------------------------------------------------------
>> > mpirun noticed that the job aborted, but has no info as to the process
>> > that caused that situation.
>> >
>> > The scheduling script is below:
>> >
>> > #$ -S /bin/bash
>> > #$ -cwd
>> > #$ -N SC3blastx_64-96thr
>> > #$ -pe openmpi* 64-96
>> > #$ -l h_rt=24:00:00,vf=3G
>> > #$ -j y
>> > #$ -M yaximik_at_[hidden]
>> > #$ -m eas
>> > #
>> > # Load the appropriate module files
>> > # Should be loaded already
>> > #$ -V
>> >
>> > mpirun -np $NSLOTS blastx \
>> >   -query $UABGRID_SCRATCH/SC/AdQ30/fasta/SC1-IS4-Ind1-153ngFr1sep1run1R1AdQ30.fasta \
>> >   -db nr \
>> >   -out $UABGRID_SCRATCH/SC/blastx/SC/SC1-IS4-Ind1-153ngFr1sep1run1R1AdQ30.out \
>> >   -evalue 0.001 -max_intron_length 100000 -outfmt 5 -num_alignments 20 \
>> >   -lcase_masking -num_threads $NSLOTS
>> >
>> > What caused this termination? It does not seem to be a scheduling problem,
>> > as the program ran for several hours with 96 threads. My $LD_LIBRARY_PATH
>> > does have a /share/apps/openmpi/1.6.4-gcc/lib entry, so how else should I
>> > modify it?
>> >
>> > Vladimir
>> > _______________________________________________
>> > users mailing list
>> > users_at_[hidden]
>> > http://www.open-mpi.org/mailman/listinfo.cgi/users
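
Reuti's reading of the daemon's exit status in this thread (137 = 128 + 9) can be checked from a shell. The sketch below assumes the usual shell convention that a process killed by signal N is reported with exit status 128 + N; the function name is a hypothetical helper, not part of Open MPI or SGE.

```shell
#!/bin/bash
# Hypothetical helper for illustration only; not part of Open MPI or SGE.
# Assumes the shell convention: "killed by signal N" is reported as status 128 + N.
signal_from_status() {
    local status=$1
    if (( status > 128 )); then
        echo $(( status - 128 ))
    else
        echo 0   # normal exit, or a status that is not signal-derived
    fi
}

sig=$(signal_from_status 137)
echo "signal number: $sig"
kill -l "$sig"   # asks the shell for the name of that signal
```

For status 137 this gives signal 9, which is SIGKILL on Linux. The status alone does not say who sent the signal; the qacct output Reuti asks for is one place to look for that.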