Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] qsub/SGE and OpenMPI
From: Reuti (reuti_at_[hidden])
Date: 2010-03-25 17:58:10


Hi,

On 25.03.2010 at 22:34, Matthew MacManes wrote:

> I am having an OpenMPI issue that seems to be related to job
> scheduling on TACC, one of the TeraGrid resources.
>
> The program I am trying to run, ABySS, seems to run fine without
> scheduling- i.e. when I run it on the login nodes without scheduling
> through qsub... but, using that same command, but scheduling it
> through qsub, the job fails.
>
> Here is the qsub script, fyi:
>
> #!/bin/bash
> #$ -N homo47
> #$ -j y
> #$ -o homo47
> #$ -pe 16way 128
> #$ -q normal
>
>
> #$ -l h_rt=00:30:00
> #$ -M macmanes_at_[hidden]
> #$ -m be
> cd /work/01301/mmacmane/abyss-1.1.2/bin
> #$ -cwd
Most likely one of the two lines above would be sufficient: -cwd would
also perform a `cd` to the current working directory.
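
As a minimal sketch of that point, assuming the job is submitted from
/work/01301/mmacmane/abyss-1.1.2/bin: with -cwd, SGE starts the job in
the directory `qsub` was called from, so the explicit `cd` can go. The
`pwd` line is only added here to show where the job runs and is not part
of the original script.

```shell
#!/bin/bash
#$ -N homo47
#$ -cwd
# With -cwd the job starts in the directory qsub was invoked from,
# so no separate "cd" line is needed when submitting from that directory.
job_dir=$(pwd)
echo "job runs in: $job_dir"
```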

> #$ -V
> ibrun ./abyss-pe k=19 in='/work/01301/mmacmane/homo/*.fastq'
> name='homo_47' n=5 s=200 c=13
What is `ibrun` doing in detail? Is this something you have to use to
run a job in the Grid?

> I get an error message:
> TACC: Done.
> TACC: Starting up job 1299149
> TACC: Setting up parallel environment for OpenMPI mpirun.
> TACC: Setup complete. Running job script.
> TACC: starting parallel tasks...
> /opt/apps/pgi7_2/openmpi/1.3/bin/mpirun -np 64 ABYSS-P
Was your application also compiled with Open MPI 1.3, i.e. do you use
the same mpirun when you start it on the command line?
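
One way to check this on the login node, as a rough sketch: compare the
first mpirun on your PATH with the path printed in the job log. Only the
path below comes from the log; the variable names are made up for the
illustration.

```shell
#!/bin/bash
# Path of the mpirun that the job log reports.
job_mpirun=/opt/apps/pgi7_2/openmpi/1.3/bin/mpirun

# First mpirun found on PATH in the current shell (empty if there is none).
shell_mpirun=$(command -v mpirun || true)

if [ -z "$shell_mpirun" ]; then
    result="no mpirun on PATH"
elif [ "$shell_mpirun" = "$job_mpirun" ]; then
    result="same mpirun as in the job"
else
    result="different mpirun: $shell_mpirun"
fi
echo "$result"
```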

> -k19 -c13 --coverage-hist=coverage.hist -s bubbles.fa -o
> homo_61-1.fa /work/01301/mmacmane/homo/SRR001665_1.fastq /work/01301/
> mmacmane/homo/SRR001665_2.fastq /work/01301/mmacmane/homo/
> SRR002271_1.fastq /work/01301/mmacmane/homo/SRR002271_2.fastq /work/
> 01301/mmacmane/homo/SRR002273_1.fastq /work/01301/mmacmane/homo/
> SRR002273_2.fastq /work/01301/mmacmane/homo/SRR002274_1.fastq /work/
> 01301/mmacmane/homo/SRR002274_2.fastq /work/01301/mmacmane/homo/
> SRR002275_1.fastq /work/01301/mmacmane/homo/SRR002275_2.fastq /work/
> 01301/mmacmane/homo/SRR002276_1.fastq /work/01301/mmacmane/homo/
> SRR002276_2.fastq /work/01301/mmacmane/homo/SRR002291_1.fastq /work/
> 01301/mmacmane/homo/SRR002291_2.fastq /work/01301/mmacmane/homo/
> SRR002295_1.fastq /work/01301/mmacmane/homo/SRR002295_2.fastq /work/
> 01301/mmacmane/homo/SRR002297_1.fastq /work/01301/mmacmane/homo/
> SRR002297_2.fastq /work/01301/mmacmane/homo/SRR029337_1.fastq /work/
> 01301/mmacmane/homo/SRR029337_2.fastq
This long list comes from the expansion of the *. Do you want to pass
the expression including the * to your application? In that case the
expansion by `ibrun` must be avoided.
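
To illustrate how a quoted pattern can still get expanded by a wrapper,
here is a small sketch. I do not know how `ibrun` is implemented; the two
launcher functions below are hypothetical stand-ins, and the scratch
directory and file names are invented for the demonstration.

```shell
#!/bin/bash
# Scratch directory with two dummy files (names invented for this demo).
demo_dir=$(mktemp -d)
touch "$demo_dir/a.fastq" "$demo_dir/b.fastq"

# Prints the number of arguments it receives.
count_args() { echo "$#"; }

# Hypothetical launcher that forwards its arguments unquoted:
# the shell re-expands any pattern they contain.
launch_unquoted() { count_args $*; }

# Hypothetical launcher that forwards its arguments verbatim.
launch_verbatim() { count_args "$@"; }

# Run inside the scratch directory so that *.fastq can match the files.
expanded=$(cd "$demo_dir" && launch_unquoted '*.fastq')
literal=$(cd "$demo_dir" && launch_verbatim '*.fastq')

echo "unquoted forwarding: $expanded argument(s)"
echo "verbatim forwarding: $literal argument(s)"
rm -r "$demo_dir"
```

With unquoted forwarding the application receives one argument per
matching file; with verbatim forwarding it receives the single pattern
and can expand it itself.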

-- Reuti

> ...many many lines of this...
> [i178-302.ranger.tacc.utexas.edu:28340] [[5795,1],19]
> ORTE_ERROR_LOG: A message is attempting to be sent to a process
> whose contact information is unknown in file rml_oob_send.c at line
> 105
> [i178-302.ranger.tacc.utexas.edu:28340] [[5795,1],19] could not get
> route to [[INVALID],INVALID]
> [i178-302.ranger.tacc.utexas.edu:28340] [[5795,1],19]
> ORTE_ERROR_LOG: A message is attempting to be sent to a process
> whose contact information is unknown in file base/plm_base_proxy.c
> at line 85
> [i176-303.ranger.tacc.utexas.edu:05045] [[5795,1],1] ORTE_ERROR_LOG:
> A message is attempting to be sent to a process whose contact
> information is unknown in file rml_oob_send.c at line 105
> [i176-303.ranger.tacc.utexas.edu:05045] [[5795,1],1] could not get
> route to [[INVALID],INVALID]
> [i176-303.ranger.tacc.utexas.edu:05045] [[5795,1],1] ORTE_ERROR_LOG:
> A message is attempting to be sent to a process whose contact
> information is unknown in file base/plm_base_proxy.c at line 85
> [i178-302.ranger.tacc.utexas.edu:28325] [[5795,1],18]
> ORTE_ERROR_LOG: A message is attempting to be sent to a process
> whose contact information is unknown in file rml_oob_send.c at line
> 105
> [i178-302.ranger.tacc.utexas.edu:28325] [[5795,1],18] could not get
> route to [[INVALID],INVALID]
>
> ...many many lines of this...
> TACC: Cleaning up after job: 1299149
> TACC: Done.
> The TACC systems administrators don't seem to have a great solution,
> and the authors of the program say it's MPI-related...
>
> _________________________________
> Matthew MacManes
> PhD Candidate
> University of California- Berkeley
> Museum of Vertebrate Zoology
> Phone: 510-495-5833
> Lab Website: http://ib.berkeley.edu/labs/lacey
> Personal Website: http://macmanes.com/