Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] [torqueusers] Job dies randomly, but only through torque
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2008-05-29 14:54:43


I don't know much about Maui, but these lines from the log seem
relevant:

-----
maui.log:05/29 09:27:21 INFO: job 2120 exceeds requested proc limit (3.72 > 1.00)
maui.log:05/29 09:27:21 MSysRegEvent(JOBRESVIOLATION: job '2120' in state 'Running' has exceeded PROC resource limit (372 > 100) (action CANCEL will be taken) job start time: Thu May 29 09:26:19
-----

I'm not sure what resource limits it's talking about.
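
My guess (and it is only a guess) from those numbers: the job asked
Torque/Maui for a single processor, but it is using about 3.72
processors' worth of CPU (372% against the 100% allowed), perhaps
because mpirun is starting more processes than the job requested. If
the killer is Maui's per-job resource utilization policy, the relevant
pieces would look something like the sketch below; the maui.cfg line
and the submission script are illustrative only (ppn=4, -np 4, and
./my_program are made-up values, not taken from your logs):

-----
# maui.cfg (sketch): a processor-utilization policy that would produce
# the JOBRESVIOLATION / CANCEL behavior seen in the log
RESOURCELIMITPOLICY  PROC:ALWAYS:CANCEL

# job script (sketch): request as many processors as mpirun will
# actually start, so usage stays within what was requested
#PBS -l nodes=1:ppn=4
#PBS -l walltime=00:30:00
cd $PBS_O_WORKDIR
mpirun -np 4 ./my_program
-----

If Open MPI was built with TM support, you can also omit -np entirely;
mpirun should then start one process per slot that Torque allocated,
which keeps usage and request in agreement automatically.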

On May 29, 2008, at 2:25 PM, Jim Kusznir wrote:

> I have verified that Maui is killing the job. I actually ran into
> this with another user all of a sudden. I don't know why it's only
> affecting a few currently. Here's the Maui log extract for a current
> run of this user's program:
>
> -----------
> [root_at_aeolus log]# grep 2120 *
> maui.log:05/29 09:01:45 INFO: job '2118' loaded: 1 patton
> patton 1800 Idle 0 1212076905 [NONE] [NONE] [NONE] >=
> 0 >= 0 [NONE] 1212076905
> maui.log:05/29 09:23:40 INFO: job '2119' loaded: 1 patton
> patton 1800 Idle 0 1212078218 [NONE] [NONE] [NONE] >=
> 0 >= 0 [NONE] 1212078220
> maui.log:05/29 09:26:19
> MPBSJobLoad(2120,2120.aeolus.eecs.wsu.edu,J,TaskList,0)
> maui.log:05/29 09:26:19 MReqCreate(2120,SrcRQ,DstRQ,DoCreate)
> maui.log:05/29 09:26:19 MJobSetCreds(2120,patton,patton,)
> maui.log:05/29 09:26:19 INFO: default QOS for job 2120 set to
> DEFAULT(0) (P:DEFAULT,U:[NONE],G:[NONE],A:[NONE],C:[NONE])
> maui.log:05/29 09:26:19 INFO: default QOS for job 2120 set to
> DEFAULT(0) (P:DEFAULT,U:[NONE],G:[NONE],A:[NONE],C:[NONE])
> maui.log:05/29 09:26:19 INFO: default QOS for job 2120 set to
> DEFAULT(0) (P:DEFAULT,U:[NONE],G:[NONE],A:[NONE],C:[NONE])
> maui.log:05/29 09:26:19 INFO: job '2120' loaded: 1 patton
> patton 1800 Idle 0 1212078378 [NONE] [NONE] [NONE] >=
> 0 >= 0 [NONE] 1212078379
> maui.log:05/29 09:26:19 INFO: job '2120' Priority: 1
> maui.log:05/29 09:26:19 INFO: job '2120' Priority: 1
> maui.log:05/29 09:26:19 INFO: 8 feasible tasks found for job
> 2120:0 in partition DEFAULT (1 Needed)
> maui.log:05/29 09:26:19 INFO: 1 requested hostlist tasks allocated
> for job 2120 (0 remain)
> maui.log:05/29 09:26:19 MJobStart(2120)
> maui.log:05/29 09:26:19
> MJobDistributeTasks(2120,base,NodeList,TaskMap)
> maui.log:05/29 09:26:19 MAMAllocJReserve(2120,RIndex,ErrMsg)
> maui.log:05/29 09:26:19 MRMJobStart(2120,Msg,SC)
> maui.log:05/29 09:26:19 MPBSJobStart(2120,base,Msg,SC)
> maui.log:05/29 09:26:19
> MPBSJobModify(2120,Resource_List,Resource,compute-0-0.local)
> maui.log:05/29 09:26:19 MPBSJobModify(2120,Resource_List,Resource,1)
> maui.log:05/29 09:26:19 INFO: job '2120' successfully started
> maui.log:05/29 09:26:19 MStatUpdateActiveJobUsage(2120)
> maui.log:05/29 09:26:19 MResJCreate(2120,MNodeList,
> 00:00:00,ActiveJob,Res)
> maui.log:05/29 09:26:19 INFO: starting job '2120'
> maui.log:05/29 09:26:50 INFO: node compute-0-0.local has joblist
> '0/2120.aeolus.eecs.wsu.edu'
> maui.log:05/29 09:26:50 INFO: job 2120 adds 1 processors per task
> to node compute-0-0.local (1)
> maui.log:05/29 09:26:50
> MPBSJobUpdate(2120,2120.aeolus.eecs.wsu.edu,TaskList,0)
> maui.log:05/29 09:26:50 MStatUpdateActiveJobUsage(2120)
> maui.log:05/29 09:26:50 MResDestroy(2120)
> maui.log:05/29 09:26:50 MResChargeAllocation(2120,2)
> maui.log:05/29 09:26:50
> MResJCreate(2120,MNodeList,-00:00:31,ActiveJob,Res)
> maui.log:05/29 09:26:50 INFO: job '2120' Priority: 1
> maui.log:05/29 09:26:50 INFO: job '2120' Priority: 1
> maui.log:05/29 09:27:21 INFO: node compute-0-0.local has joblist
> '0/2120.aeolus.eecs.wsu.edu'
> maui.log:05/29 09:27:21 INFO: job 2120 adds 1 processors per task
> to node compute-0-0.local (1)
> maui.log:05/29 09:27:21
> MPBSJobUpdate(2120,2120.aeolus.eecs.wsu.edu,TaskList,0)
> maui.log:05/29 09:27:21 MStatUpdateActiveJobUsage(2120)
> maui.log:05/29 09:27:21 MResDestroy(2120)
> maui.log:05/29 09:27:21 MResChargeAllocation(2120,2)
> maui.log:05/29 09:27:21
> MResJCreate(2120,MNodeList,-00:01:02,ActiveJob,Res)
> maui.log:05/29 09:27:21 INFO: job '2120' Priority: 1
> maui.log:05/29 09:27:21 INFO: job '2120' Priority: 1
> maui.log:05/29 09:27:21 INFO: job 2120 exceeds requested proc
> limit (3.72 > 1.00)
> maui.log:05/29 09:27:21 MSysRegEvent(JOBRESVIOLATION: job '2120' in
> state 'Running' has exceeded PROC resource limit (372 > 100) (action
> CANCEL will be taken) job start time: Thu May 29 09:26:19
> maui.log:05/29 09:27:21 MRMJobCancel(2120,job violates resource
> utilization policies,SC)
> maui.log:05/29 09:27:21 MPBSJobCancel(2120,base,CMsg,Msg,job violates
> resource utilization policies)
> maui.log:05/29 09:27:21 INFO: job '2120' successfully cancelled
> maui.log:05/29 09:27:27 INFO: active PBS job 2120 has been removed
> from the queue. assuming successful completion
> maui.log:05/29 09:27:27 MJobProcessCompleted(2120)
> maui.log:05/29 09:27:27 MAMAllocJDebit(A,2120,SC,ErrMsg)
> maui.log:05/29 09:27:27 INFO: job ' 2120' completed.
> QueueTime: 1 RunTime: 62 Accuracy: 3.44 XFactor: 0.04
> maui.log:05/29 09:27:27 INFO: job '2120' completed X: 0.035000
> T: 62 PS: 62 A: 0.034444
> maui.log:05/29 09:27:27 MJobSendFB(2120)
> maui.log:05/29 09:27:27 INFO: job usage sent for job '2120'
> maui.log:05/29 09:27:27 MJobRemove(2120)
> maui.log:05/29 09:27:27 MResDestroy(2120)
> maui.log:05/29 09:27:27 MResChargeAllocation(2120,2)
> maui.log:05/29 09:27:27 MJobDestroy(2120)
> maui.log:05/29 09:42:54 INFO: job '2121' loaded: 1 sledburg
> sledburg 1800 Idle 0 1212079373 [NONE] [NONE] [NONE] >=
> 0 >= 0 [NONE] 1212079374
> maui.log:05/29 09:43:34 INFO: job '2122' loaded: 1 sledburg
> sledburg 1800 Idle 0 1212079413 [NONE] [NONE] [NONE] >=
> 0 >= 0 [NONE] 1212079414
> [root_at_aeolus log]#
> ------------------------------
>
> Any thoughts?
>
> Thank you.
>
> On Wed, May 28, 2008 at 5:21 AM, Jeff Squyres <jsquyres_at_[hidden]>
> wrote:
>> (I'm not a subscriber to the torqueusers or mauiusers lists -- I'm
>> not
>> sure my post will get through)
>>
>> I wonder if Jan's idea has merit -- if Torque is killing the job for
>> some other reason (i.e., not wallclock). The message printed by
>> mpirun ("mpirun: killing job...") is *only* displayed if mpirun
>> receives a SIGINT or SIGTERM. So perhaps some other resource limit
>> is
>> being reached...?
>>
>> Is there a way to have Torque log if it is killing a job for some
>> reason?
>>
>>
>> On May 27, 2008, at 7:02 PM, Jim Kusznir wrote:
>>
>>> Yep. Wall time is nowhere near violation (the job dies about 2
>>> minutes into a 30-minute allocation). I ran ulimit -a through qsub
>>> and directly on the node (as the same user in both cases), and the
>>> results were identical (most items were unlimited).
>>>
>>> Any other ideas?
>>>
>>> --Jim
>>>
>>> On Tue, May 27, 2008 at 9:25 AM, Jan Ploski <Jan.Ploski_at_[hidden]>
>>> wrote:
>>>>
>>>> This suggestion is rather trivial, but since you have not mentioned
>>>> anything in this area:
>>>>
>>>> Are you sure that the job is not exceeding resource limits?
>>>> Walltime is enforced by TORQUE; rlimits such as memory are
>>>> enforced by the kernel, but they could be set differently under
>>>> TORQUE than in your manual invocations of mpirun.
>>>>
>>>> Regards,
>>>> Jan Ploski
>>>>
>>
>>
>> --
>> Jeff Squyres
>> Cisco Systems
>>
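
One more diagnostic idea, following up on the SIGINT/SIGTERM point in
the quoted exchange above: a trap in the job script will at least
confirm whether a signal is being delivered to the job when Maui
cancels it. A minimal sketch, assuming a bash job script (the log path
and program name are placeholders); bash only runs the trap after the
foreground mpirun returns, but that is still enough to show that a
signal arrived:

-----
#!/bin/bash
#PBS -l nodes=1:ppn=1
#PBS -l walltime=00:30:00
# Record any SIGTERM/SIGINT delivered to the job script; the trap body
# runs once the foreground mpirun has exited.
trap 'echo "job script caught SIGTERM/SIGINT at $(date)" >> $HOME/sig-trace.log' TERM INT
cd $PBS_O_WORKDIR
mpirun ./my_program
echo "mpirun exited with status $?" >> $HOME/sig-trace.log
-----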

-- 
Jeff Squyres
Cisco Systems