Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] [torqueusers] Job dies randomly, but only through torque
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2008-05-28 08:21:54


(I'm not a subscriber to the torqueusers or mauiusers lists -- I'm not
sure my post will get through)

I wonder if Jan's idea has merit -- if Torque is killing the job for
some other reason (i.e., not wallclock). The message printed by
mpirun ("mpirun: killing job...") is *only* displayed if mpirun
receives a SIGINT or SIGTERM. So perhaps some other resource limit is
being reached...?
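
One way to check both ideas at once is a small wrapper around the launch command. The following is only a minimal sketch, assuming Python 3 is available on the compute node; the names describe_limits, log_signal, install_handlers and run_logged are hypothetical helpers, not part of Open MPI or Torque. It prints the rlimits the job actually sees and logs any SIGINT or SIGTERM that reaches the wrapper itself.

```python
import resource
import signal
import subprocess
import sys
import time

# All names below are hypothetical helpers, not part of Open MPI or Torque.
LIMIT_NAMES = ("RLIMIT_CPU", "RLIMIT_AS", "RLIMIT_DATA",
               "RLIMIT_STACK", "RLIMIT_NOFILE", "RLIMIT_CORE")

received = []  # names of the signals this wrapper has caught


def describe_limits():
    """Return {name: (soft, hard)} for the rlimits this platform defines."""
    return {name: resource.getrlimit(getattr(resource, name))
            for name in LIMIT_NAMES if hasattr(resource, name)}


def log_signal(signum, frame):
    """Record the signal and write a timestamped line to stderr."""
    name = signal.Signals(signum).name
    received.append(name)
    stamp = time.strftime("%Y-%m-%d %H:%M:%S")
    sys.stderr.write("%s wrapper caught %s\n" % (stamp, name))
    sys.stderr.flush()


def install_handlers():
    """Log SIGINT and SIGTERM in this process instead of exiting on them."""
    for sig in (signal.SIGINT, signal.SIGTERM):
        signal.signal(sig, log_signal)


def run_logged(command):
    """Print the limits, then run command and return its exit status."""
    for name, (soft, hard) in sorted(describe_limits().items()):
        sys.stderr.write("%s soft=%s hard=%s\n" % (name, soft, hard))
    install_handlers()
    return subprocess.call(command)
```

Calling run_logged() with the usual mpirun command line from inside the job script, and again by hand on the node, gives two sets of limit lines to compare. If a "wrapper caught SIGTERM" line shows up around the time the job dies, something outside mpirun sent it. The wrapper only sees signals delivered to its own process, so a signal sent to mpirun alone will not be logged, and SIGKILL cannot be caught at all.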

Is there a way to have Torque log if it is killing a job for some
reason?

On May 27, 2008, at 7:02 PM, Jim Kusznir wrote:

> Yep. Wall time is nowhere near violation (dies about 2 minutes into
> a 30 minute allocation). I did a ulimit -a through qsub and directly on
> the node (as the same user in both cases), and the results were
> identical (most items were unlimited).
>
> Any other ideas?
>
> --Jim
>
> On Tue, May 27, 2008 at 9:25 AM, Jan Ploski <Jan.Ploski_at_[hidden]>
> wrote:
>>
>> This suggestion is rather trivial, but since you have not mentioned
>> anything in this area:
>>
>> Are you sure that the job is not exceeding resource limits (walltime,
>> enforced by TORQUE, or rlimits such as memory, enforced by the kernel;
>> the rlimits could be set differently in TORQUE and in your manual
>> invocations of mpirun)?
>>
>> Regards,
>> Jan Ploski
>>

-- 
Jeff Squyres
Cisco Systems