
Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] [torqueusers] Job dies randomly, but only through torque
From: Jim Kusznir (jkusznir_at_[hidden])
Date: 2008-05-29 14:25:40


I have verified that Maui is killing the job. I actually ran into
this with another user all of a sudden, and I don't know why it's only
affecting a few users currently. Here's the Maui log extract for a
current run of this user's program:

-----------
[root_at_aeolus log]# grep 2120 *
maui.log:05/29 09:01:45 INFO: job '2118' loaded: 1 patton
patton 1800 Idle 0 1212076905 [NONE] [NONE] [NONE] >=
0 >= 0 [NONE] 1212076905
maui.log:05/29 09:23:40 INFO: job '2119' loaded: 1 patton
patton 1800 Idle 0 1212078218 [NONE] [NONE] [NONE] >=
0 >= 0 [NONE] 1212078220
maui.log:05/29 09:26:19 MPBSJobLoad(2120,2120.aeolus.eecs.wsu.edu,J,TaskList,0)
maui.log:05/29 09:26:19 MReqCreate(2120,SrcRQ,DstRQ,DoCreate)
maui.log:05/29 09:26:19 MJobSetCreds(2120,patton,patton,)
maui.log:05/29 09:26:19 INFO: default QOS for job 2120 set to
DEFAULT(0) (P:DEFAULT,U:[NONE],G:[NONE],A:[NONE],C:[NONE])
maui.log:05/29 09:26:19 INFO: default QOS for job 2120 set to
DEFAULT(0) (P:DEFAULT,U:[NONE],G:[NONE],A:[NONE],C:[NONE])
maui.log:05/29 09:26:19 INFO: default QOS for job 2120 set to
DEFAULT(0) (P:DEFAULT,U:[NONE],G:[NONE],A:[NONE],C:[NONE])
maui.log:05/29 09:26:19 INFO: job '2120' loaded: 1 patton
patton 1800 Idle 0 1212078378 [NONE] [NONE] [NONE] >=
0 >= 0 [NONE] 1212078379
maui.log:05/29 09:26:19 INFO: job '2120' Priority: 1
maui.log:05/29 09:26:19 INFO: job '2120' Priority: 1
maui.log:05/29 09:26:19 INFO: 8 feasible tasks found for job
2120:0 in partition DEFAULT (1 Needed)
maui.log:05/29 09:26:19 INFO: 1 requested hostlist tasks allocated
for job 2120 (0 remain)
maui.log:05/29 09:26:19 MJobStart(2120)
maui.log:05/29 09:26:19 MJobDistributeTasks(2120,base,NodeList,TaskMap)
maui.log:05/29 09:26:19 MAMAllocJReserve(2120,RIndex,ErrMsg)
maui.log:05/29 09:26:19 MRMJobStart(2120,Msg,SC)
maui.log:05/29 09:26:19 MPBSJobStart(2120,base,Msg,SC)
maui.log:05/29 09:26:19
MPBSJobModify(2120,Resource_List,Resource,compute-0-0.local)
maui.log:05/29 09:26:19 MPBSJobModify(2120,Resource_List,Resource,1)
maui.log:05/29 09:26:19 INFO: job '2120' successfully started
maui.log:05/29 09:26:19 MStatUpdateActiveJobUsage(2120)
maui.log:05/29 09:26:19 MResJCreate(2120,MNodeList,00:00:00,ActiveJob,Res)
maui.log:05/29 09:26:19 INFO: starting job '2120'
maui.log:05/29 09:26:50 INFO: node compute-0-0.local has joblist
'0/2120.aeolus.eecs.wsu.edu'
maui.log:05/29 09:26:50 INFO: job 2120 adds 1 processors per task
to node compute-0-0.local (1)
maui.log:05/29 09:26:50 MPBSJobUpdate(2120,2120.aeolus.eecs.wsu.edu,TaskList,0)
maui.log:05/29 09:26:50 MStatUpdateActiveJobUsage(2120)
maui.log:05/29 09:26:50 MResDestroy(2120)
maui.log:05/29 09:26:50 MResChargeAllocation(2120,2)
maui.log:05/29 09:26:50 MResJCreate(2120,MNodeList,-00:00:31,ActiveJob,Res)
maui.log:05/29 09:26:50 INFO: job '2120' Priority: 1
maui.log:05/29 09:26:50 INFO: job '2120' Priority: 1
maui.log:05/29 09:27:21 INFO: node compute-0-0.local has joblist
'0/2120.aeolus.eecs.wsu.edu'
maui.log:05/29 09:27:21 INFO: job 2120 adds 1 processors per task
to node compute-0-0.local (1)
maui.log:05/29 09:27:21 MPBSJobUpdate(2120,2120.aeolus.eecs.wsu.edu,TaskList,0)
maui.log:05/29 09:27:21 MStatUpdateActiveJobUsage(2120)
maui.log:05/29 09:27:21 MResDestroy(2120)
maui.log:05/29 09:27:21 MResChargeAllocation(2120,2)
maui.log:05/29 09:27:21 MResJCreate(2120,MNodeList,-00:01:02,ActiveJob,Res)
maui.log:05/29 09:27:21 INFO: job '2120' Priority: 1
maui.log:05/29 09:27:21 INFO: job '2120' Priority: 1
maui.log:05/29 09:27:21 INFO: job 2120 exceeds requested proc
limit (3.72 > 1.00)
maui.log:05/29 09:27:21 MSysRegEvent(JOBRESVIOLATION: job '2120' in
state 'Running' has exceeded PROC resource limit (372 > 100) (action
CANCEL will be taken) job start time: Thu May 29 09:26:19
maui.log:05/29 09:27:21 MRMJobCancel(2120,job violates resource
utilization policies,SC)
maui.log:05/29 09:27:21 MPBSJobCancel(2120,base,CMsg,Msg,job violates
resource utilization policies)
maui.log:05/29 09:27:21 INFO: job '2120' successfully cancelled
maui.log:05/29 09:27:27 INFO: active PBS job 2120 has been removed
from the queue. assuming successful completion
maui.log:05/29 09:27:27 MJobProcessCompleted(2120)
maui.log:05/29 09:27:27 MAMAllocJDebit(A,2120,SC,ErrMsg)
maui.log:05/29 09:27:27 INFO: job ' 2120' completed.
QueueTime: 1 RunTime: 62 Accuracy: 3.44 XFactor: 0.04
maui.log:05/29 09:27:27 INFO: job '2120' completed X: 0.035000
T: 62 PS: 62 A: 0.034444
maui.log:05/29 09:27:27 MJobSendFB(2120)
maui.log:05/29 09:27:27 INFO: job usage sent for job '2120'
maui.log:05/29 09:27:27 MJobRemove(2120)
maui.log:05/29 09:27:27 MResDestroy(2120)
maui.log:05/29 09:27:27 MResChargeAllocation(2120,2)
maui.log:05/29 09:27:27 MJobDestroy(2120)
maui.log:05/29 09:42:54 INFO: job '2121' loaded: 1 sledburg
sledburg 1800 Idle 0 1212079373 [NONE] [NONE] [NONE] >=
  0 >= 0 [NONE] 1212079374
maui.log:05/29 09:43:34 INFO: job '2122' loaded: 1 sledburg
sledburg 1800 Idle 0 1212079413 [NONE] [NONE] [NONE] >=
  0 >= 0 [NONE] 1212079414
[root_at_aeolus log]#
------------------------------
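If I'm reading the key lines right (this is my interpretation, not
something confirmed yet): the job requested 1 processor, Maui measured it
using about 3.72, so used/requested = 3.72 > 1.00 and the JOBRESVIOLATION
fired with action CANCEL. If that's the cause, a job script that requests
as many slots as mpirun actually starts should keep the ratio at or below
1. A sketch, assuming 4 MPI ranks and 4-core nodes (the rank count and
program name are made up for illustration):

```
#!/bin/sh
# Hypothetical PBS job script: request the same number of processors
# (ppn=4) that mpirun will launch (-np 4), so Maui's used/requested
# PROC ratio stays at or below 1.00 and its utilization check passes.
#PBS -l nodes=1:ppn=4
#PBS -l walltime=00:30:00
cd "$PBS_O_WORKDIR"
mpirun -np 4 ./users_program
```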

Any thoughts?

Thank you.
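
P.S. For what it's worth, the CANCEL itself appears to come from Maui's
utilization policy rather than from Torque. I believe this is governed by
RESOURCELIMITPOLICY in maui.cfg; the line below is a guess at what such a
config might contain, not something I've verified on aeolus:

```
# maui.cfg (hypothetical): cancel any job whose measured PROC usage
# exceeds what it requested -- a setting like this would produce the
# JOBRESVIOLATION / "action CANCEL will be taken" lines shown in the log.
RESOURCELIMITPOLICY PROC:ALWAYS:CANCEL
```

Whether the right fix is to make the request match actual usage or to
relax that policy, I don't know yet.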

On Wed, May 28, 2008 at 5:21 AM, Jeff Squyres <jsquyres_at_[hidden]> wrote:
> (I'm not a subscriber to the torqueusers or mauiusers lists -- I'm not
> sure my post will get through)
>
> I wonder if Jan's idea has merit -- if Torque is killing the job for
> some other reason (i.e., not wallclock). The message printed by
> mpirun ("mpirun: killing job...") is *only* displayed if mpirun
> receives a SIGINT or SIGTERM. So perhaps some other resource limit is
> being reached...?
>
> Is there a way to have Torque log if it is killing a job for some
> reason?
>
>
> On May 27, 2008, at 7:02 PM, Jim Kusznir wrote:
>
>> Yep. Wall time is nowhere near violation (the job dies about 2 minutes
>> into a 30-minute allocation). I ran ulimit -a through qsub and directly
>> on the node (as the same user in both cases), and the results were
>> identical (most items were unlimited).
>>
>> Any other ideas?
>>
>> --Jim
>>
>> On Tue, May 27, 2008 at 9:25 AM, Jan Ploski <Jan.Ploski_at_[hidden]>
>> wrote:
>>>
>>> This suggestion is rather trivial, but since you have not mentioned
>>> anything in this area:
>>>
>>> Are you sure that the job is not exceeding resource limits
>>> (walltime, enforced by TORQUE, or rlimits such as memory, enforced
>>> by the kernel)? These limits could be set differently in TORQUE
>>> than in your manual invocations of mpirun.
>>>
>>> Regards,
>>> Jan Ploski
>>>
>> _______________________________________________
>> users mailing list
>> users_at_[hidden]
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
>
> --
> Jeff Squyres
> Cisco Systems
>
>