Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] job runs with mpirun on a node but not if submitted via Torque.
From: Rahul Nabar (rpnabar_at_[hidden])
Date: 2009-03-31 22:18:40


2009/3/31 Ralph Castain <rhc_at_[hidden]>:

> It is very hard to debug the problem with so little information. We

Thanks Ralph! I'm sorry my first post lacked specifics. I'll
try my best to fill you guys in on as much debug info as I can.

> regularly run OMPI jobs on Torque without issue.

So do we. In fact, on the very same cluster other jobs using the same
code do run fine. It's only this one type of job that shows
this strange behavior. For those more curious, the code I am trying
to run is a computational chemistry code called DACAPO, developed at
CAMd at the Technical University of Denmark. Link:
https://wiki.fysik.dtu.dk/dacapo

Hardware Architecture:
Dell rack servers: PowerEdge SC1435.
2.2GHz Opteron 1Ghz. (AMD)
8 cpus per node.

> Are you getting an allocation from somewhere for the nodes?
>If so, are you
> using Moab to get it?

We are using Torque as the scheduler and Maui as the master scheduler.

>Do you have a $PBS_NODEFILE in your environment?

Yes, I do. As a test case I was trying to run on a single node (which
has 8 cpus).

If I cat $PBS_NODEFILE I do get the name "node17" 8 times.
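
For anyone who wants to repeat this check from inside a job script, here is
a minimal Python sketch that counts how many slots each host gets in the
node file. The function name count_slots is only illustrative, and the
sketch assumes the usual layout of one hostname per line.

```python
import os
from collections import Counter


def count_slots(nodefile_text):
    """Return a Counter mapping hostname -> number of lines (slots) listed."""
    hosts = [line.strip() for line in nodefile_text.splitlines() if line.strip()]
    return Counter(hosts)


if __name__ == "__main__":
    # PBS_NODEFILE is set by Torque inside a job; it is absent in a plain shell.
    path = os.environ.get("PBS_NODEFILE")
    if path:
        with open(path) as fh:
            for host, slots in count_slots(fh.read()).items():
                print(host, slots)
    else:
        print("PBS_NODEFILE is not set (probably not inside a Torque job)")
```

For the single-node test above this should report node17 with 8 slots.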

I did dump the environment variables from a running job. I get:
PBS_NODEFILE="/var/spool/torque/aux//4609.uranus.che.foo.edu"

> I have no idea why your processes are crashing when run via Torque - are you
> sure that the processes themselves crash?
>Are they segfaulting - if so, can

Yes, they are indeed segfaulting. And only when I run them through Torque.
########################################
forrtl: error (78): process killed (SIGTERM)
mpirun noticed that job rank 5 with PID 10580 on node node17 exited on
signal 11 (Segmentation fault).
#########################################

The exact same job runs like a charm if I launch it via mpirun on the node,
outside of Torque.

> you use gdb to find out where?

I can try that. I haven't used gdb much before. In case it matters, the
core executable is built from Fortran source with the Intel Fortran
Compiler (ifort). That executable runs fine for all other cases except
this one.

Maybe this helps more?

-- 
Rahul