Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] [torqueusers] Job dies randomly, but only through torque
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2008-05-28 08:21:54


(I'm not a subscriber to the torqueusers or mauiusers lists -- I'm not
sure my post will get through)

I wonder if Jan's idea has merit -- if Torque is killing the job for
some other reason (i.e., not wallclock). The message printed by
mpirun ("mpirun: killing job...") is *only* displayed if mpirun
receives a SIGINT or SIGTERM. So perhaps some other resource limit is
being reached...?
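
As a rough illustration of that behavior (a minimal Python sketch, not
Open MPI's actual implementation; the names install_signal_logger and
on_signal are hypothetical), a wrapper in the job script could record
which of those two signals arrives and when:

```python
import os
import signal
import time

def install_signal_logger(log):
    """Append (signal name, timestamp) to `log` when SIGINT or SIGTERM arrives."""
    def on_signal(signum, frame):
        # Only record the event; a real launcher would also clean up its children.
        log.append((signal.Signals(signum).name, time.time()))
    for sig in (signal.SIGINT, signal.SIGTERM):
        signal.signal(sig, on_signal)

if __name__ == "__main__":
    events = []
    install_signal_logger(events)
    os.kill(os.getpid(), signal.SIGTERM)  # simulate an external kill
    time.sleep(0.1)  # give the interpreter a chance to run the handler
    print(events)
```

If such a log showed a SIGTERM about two minutes into the run, that
would suggest something outside the application sent it.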

Is there a way to have Torque log if it is killing a job for some
reason?

On May 27, 2008, at 7:02 PM, Jim Kusznir wrote:

> Yep. Wall time is nowhere near the limit (the job dies about 2
> minutes into a 30-minute allocation). I ran ulimit -a both through
> qsub and directly on the node (as the same user in both cases), and
> the results were identical (most items were unlimited).
>
> Any other ideas?
>
> --Jim
>
> On Tue, May 27, 2008 at 9:25 AM, Jan Ploski <Jan.Ploski_at_[hidden]> wrote:
>>
>> This suggestion is rather trivial, but since you have not mentioned
>> anything in this area:
>>
>> Are you sure that the job is not exceeding resource limits
>> (walltime, enforced by TORQUE, or rlimits such as memory, enforced
>> by the kernel, which could be set differently under TORQUE than in
>> your manual invocations of mpirun)?
>>
>> Regards,
>> Jan Ploski
>>
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users
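
For the rlimit comparison described in the quoted messages, one way to
make the check repeatable is to capture the limits programmatically in
both environments and diff them. This is only a sketch using Python's
standard resource module; snapshot_limits and diff_limits are
hypothetical helper names, and the set of limits covered may not match
what ulimit -a prints on a given system:

```python
import resource

def snapshot_limits():
    """Return {limit name: (soft, hard)} for every RLIMIT_* this platform defines."""
    limits = {}
    for name in sorted(n for n in dir(resource) if n.startswith("RLIMIT_")):
        try:
            limits[name] = resource.getrlimit(getattr(resource, name))
        except (ValueError, OSError):
            # Skip limits the running kernel does not support.
            continue
    return limits

def diff_limits(a, b):
    """Return only the limits whose (soft, hard) values differ between a and b."""
    return {
        name: (a.get(name), b.get(name))
        for name in sorted(set(a) | set(b))
        if a.get(name) != b.get(name)
    }

if __name__ == "__main__":
    # Run once inside a qsub job and once directly on the node, then compare.
    for name, (soft, hard) in snapshot_limits().items():
        print(name, soft, hard)
```

An empty result from diff_limits on the two snapshots would agree with
the identical ulimit -a output reported above.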

-- 
Jeff Squyres
Cisco Systems