Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] qsub - mpirun problem
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2008-09-29 17:51:19


It sounds like your Torque is not set up properly if the job never
started.

You probably want to take the conversation back to the Torque list...
this unfortunately is not the right place to get Torque help.

Sorry!
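
One thing that may be worth checking before you post there: the
"address not trusted" line in the server log quoted below suggests
that pbs_server saw a connection from 172.16.100.1 and could not match
it to an entry in server_priv/nodes. Below is a minimal sketch of a
check for whether a given hostname appears as a node name in that
file. The function name node_listed is hypothetical, and the sketch
assumes that the node name is the first whitespace-separated field on
each line, as in the nodes file quoted below.

```shell
#!/bin/sh
# Hypothetical helper (not part of Torque or Open MPI): succeed if the
# given hostname is the first field of some line in the nodes file.
# Usage: node_listed <hostname> <nodes-file>
node_listed() {
    name=$1
    nodes_file=$2
    # Exact match on the first field only, so "node00" does not match
    # "node001". awk exits 0 when a match was found, 1 otherwise.
    awk -v n="$name" '$1 == n { found = 1 } END { exit !found }' "$nodes_file"
}
```

To use it, you would first see which name the address resolves to on
the head node (for example with "getent hosts 172.16.100.1") and then
run node_listed with that name and the path to the nodes file. I am
assuming the path is /var/spool/torque/server_priv/nodes, based on the
log location quoted below. If the name is not listed, or only a
differently spelled form of it is, that would be a good detail to
bring to the Torque list.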

On Sep 29, 2008, at 5:30 PM, Zhiliang Hu wrote:

> At 02:15 PM 9/29/2008 -0700, you wrote:
>> It sounds like you may not have set up passwordless ssh between all
>> your nodes.
>>
>> Doug Reeder
>
> That's not the case. Passwordless ssh is set up and it works fine.
> That's how I can run "mpirun -np 6 -machinefiles ......" without trouble.
>
> Zhiliang
>
>
>> On Sep 29, 2008, at 2:12 PM, Zhiliang Hu wrote:
>>
>>> At 10:45 PM 9/29/2008 +0200, you wrote:
>>>> Am 29.09.2008 um 22:33 schrieb Zhiliang Hu:
>>>>
>>>>> At 07:37 PM 9/29/2008 +0200, Reuti wrote:
>>>>>
>>>>>>> "-l nodes=6:ppn=2" is all I have to specify the node requests:
>>>>>>
>>>>>> this might help: http://www.open-mpi.org/faq/?category=tm
>>>>>
>>>>> Essentially, the examples given on that web page are no different
>>>>> from what I did.
>>>>> The only new thing is, I suppose, that "qsub -I" is for
>>>>> interactive mode.
>>>>> When I did this:
>>>>>
>>>>> qsub -I -l nodes=7 mpiblastn.sh
>>>>>
>>>>> It hangs on "qsub: waiting for job 798.nagrp2.ansci.iastate.edu to
>>>>> start".
>>>>>
>>>>>
>>>>>>> UNIX_PROMPT> qsub -l nodes=6:ppn=2 /path/to/mpi_program
>>>>>>> where "mpi_program" is a file with one line:
>>>>>>> /path/to/mpirun -np 12 /path/to/my_program
>>>>>>
>>>>>> Can you please try this jobscript instead:
>>>>>>
>>>>>> #!/bin/sh
>>>>>> set | grep PBS
>>>>>> /path/to/mpirun /path/to/my_program
>>>>>>
>>>>>> All should be handled by Open MPI automatically. With the "set"
>>>>>> bash command you will get a list with all defined variables for
>>>>>> further analysis, and there you can check for the variables set
>>>>>> by Torque.
>>>>>>
>>>>>> -- Reuti
>>>>>
>>>>> "set | grep PBS" part had nothing in output.
>>>>
>>>> Strange. Did you check the .o and .e files of the job? - Reuti
>>>
>>> There is nothing in either the -o or the -e output. I had to kill
>>> the job.
>>> I checked the Torque log (/var/spool/torque/server_logs), which shows:
>>>
>>> 09/29/2008 15:52:16;0100;PBS_Server;Job;799.xxx.xxx.xxx;enqueuing
>>> into default, state 1 hop 1
>>> 09/29/2008 15:52:16;0008;PBS_Server;Job;799.xxx.xxx.xxx;Job Queued
>>> at request of zhu_at_xxx.xxx.xxx, owner = zhu_at_xxx.xxx.xxx, job name =
>>> mpiblastn.sh, queue = default
>>> 09/29/2008 15:52:16;0040;PBS_Server;Svr;xxx.xxx.xxx;Scheduler sent
>>> command new
>>> 09/29/2008 15:52:16;0008;PBS_Server;Job;799.xxx.xxx.xxx;Job
>>> Modified at request of Scheduler_at_xxx.xxx.xxx
>>> 09/29/2008 15:52:27;0008;PBS_Server;Job;799.xxx.xxx.xxx;Job deleted
>>> at request of zhu_at_xxx.xxx.xxx
>>> 09/29/2008 15:52:27;0100;PBS_Server;Job;799.xxx.xxx.xxx;dequeuing
>>> from default, state EXITING
>>> 09/29/2008 15:52:27;0040;PBS_Server;Svr;xxx.xxx.xxx;Scheduler sent
>>> command term
>>> 09/29/2008 15:52:47;0001;PBS_Server;Svr;PBS_Server;is_request, bad
>>> attempt to connect from 172.16.100.1:1021 (address not trusted -
>>> check entry in server_priv/nodes)
>>>
>>> where the server_priv/nodes has:
>>> node001 np=4
>>> node002 np=4
>>> node003 np=4
>>> node004 np=4
>>> node005 np=4
>>> node006 np=4
>>> node007 np=4
>>>
>>> which was set up by the vendor.
>>>
>>> What is "address not trusted"?
>>>
>>> Zhiliang
>>>
>>>
>>>
>>>
>>> _______________________________________________
>>> users mailing list
>>> users_at_[hidden]
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users

-- 
Jeff Squyres
Cisco Systems