Yes, the failure seems to be in mpirun; it never even gets to my application.
The proto for tm_init looks like this:
int tm_init(void *info, struct tm_roots *roots);
where the struct has 6 elements: 2 x tm_task_id + 3 x int + 1 x tm_task_id
If the API was different, wouldn't the compiler most likely generate an error at compile-time?
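One caveat on the compile-time question: the compiler only checks the call against the tm.h it is given. If the static libpbs that gets linked was built from a different version of that header, the prototype can still match while the struct layout does not, and nothing complains until runtime. A minimal sketch for comparing layouts follows. It mirrors the six-element struct described above. The field names are placeholders rather than the names in tm.h, and tm_task_id being an unsigned int is an assumption to confirm against the installed header.

```c
#include <assert.h>
#include <stddef.h>

/* ASSUMPTION: tm_task_id is an unsigned int; confirm against the installed tm.h. */
typedef unsigned int sketch_task_id;

/* Hypothetical mirror of the six-element layout described above.
 * Field names are placeholders, not the names used in tm.h. */
struct sketch_roots {
    sketch_task_id id_a;
    sketch_task_id id_b;
    int            count_a;
    int            count_b;
    int            count_c;
    sketch_task_id id_c;
};

/* Size of the mirrored struct as this compiler lays it out. */
size_t sketch_roots_size(void)
{
    return sizeof(struct sketch_roots);
}

/* Byte offset of the last element in the mirrored struct. */
size_t sketch_roots_last_offset(void)
{
    return offsetof(struct sketch_roots, id_c);
}

/* Aborts if the mirrored size differs from the expected size,
 * e.g. sizeof(struct tm_roots) taken from the real header. */
void sketch_check_size(size_t expected)
{
    assert(sketch_roots_size() == expected);
}
```

Calling sketch_check_size(sizeof(struct tm_roots)) from a small test program that includes the real tm.h would show whether the installed header matches this description. It cannot show what layout the prebuilt library itself expects.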
From: users-bounces_at_[hidden] [mailto:users-bounces_at_[hidden]] On Behalf Of Ralph Castain
Sent: Friday, February 12, 2010 3:21 PM
To: Open MPI Users
Subject: Re: [OMPI users] Seg fault with PBS Pro 10.2
I'm a tad confused - this trace would appear to indicate that mpirun is failing, yes? Not your application?
The reason it works for local procs is that tm_init isn't called for that case - mpirun just fork/exec's the procs directly. When remote nodes are required, mpirun must connect to Torque. This is done with a call to:
ret = tm_init(NULL, &tm_root);
My guess is that something changed in PBS Pro 10.2 to that API. Can you check the tm header file and see? I have no access to PBS any more, so I'll have to rely on your eyes to see a diff.
On Feb 12, 2010, at 8:50 AM, Repsher, Stephen J wrote:
I'm having problems running Open MPI jobs under PBS Pro 10.2. I've configured and built Open MPI 1.4.1 with the Intel 11.1 compiler on Linux with --with-tm support, and the build runs fine. I've also built with static libraries per the FAQ suggestion, since libpbs is static. However, my test application keeps failing with a segmentation fault, but ONLY when trying to select more than 1 node. Running on a single node within PBS works fine. Running outside of PBS via ssh works fine as well, even across multiple nodes. OpenIB support is also enabled, but that doesn't seem to affect the error, because I've also tried running with the --mca btl tcp,self flag and it still doesn't work. Here is the error I'm getting:
[n34:26892] *** Process received signal ***
[n34:26892] Signal: Segmentation fault (11)
[n34:26892] Signal code: Address not mapped (1)
[n34:26892] Failing at address: 0x3f
[n34:26892] [ 0] /lib64/libpthread.so.0 [0x7fc0309d6a90]
[n34:26892] [ 1] /part0/apps/MPI/intel/openmpi-1.4.1/bin/pbs_mpirun(discui_+0x84) [0x476a50]
[n34:26892] [ 2] /part0/apps/MPI/intel/openmpi-1.4.1/bin/pbs_mpirun(diswsi+0xc3) [0x474063]
[n34:26892] [ 3] /part0/apps/MPI/intel/openmpi-1.4.1/bin/pbs_mpirun [0x471d0c]
[n34:26892] [ 4] /part0/apps/MPI/intel/openmpi-1.4.1/bin/pbs_mpirun(tm_init+0x1fe) [0x471ff8]
[n34:26892] [ 5] /part0/apps/MPI/intel/openmpi-1.4.1/bin/pbs_mpirun [0x43f580]
[n34:26892] [ 6] /part0/apps/MPI/intel/openmpi-1.4.1/bin/pbs_mpirun [0x413921]
[n34:26892] [ 7] /part0/apps/MPI/intel/openmpi-1.4.1/bin/pbs_mpirun [0x412b78]
[n34:26892] [ 8] /lib64/libc.so.6(__libc_start_main+0xe6) [0x7fc03068d586]
[n34:26892] [ 9] /part0/apps/MPI/intel/openmpi-1.4.1/bin/pbs_mpirun [0x412ac9]
[n34:26892] *** End of error message ***
(NOTE: pbs_mpirun = orterun, mpirun, etc.)
Has anyone else seen errors like this within PBS?
Boeing Defense, Space, & Security - Rotorcraft
Phone: (610) 591-1510
Fax: (610) 591-6263