Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Seg fault with PBS Pro 10.2
From: Joshua Bernstein (jbernstein_at_[hidden])
Date: 2010-02-15 14:06:20


Well,

        We all wish the Altair guys would at least try to maintain backwards
compatibility with the community, but they have a big habit of
breaking things. This isn't the first time they've broken a more
customer-facing function like tm_spawn. (They also like breaking
pbs_statjob!)

        I have access to PBS Pro and I can raise the issue with Altair if it
would help. Just let me know how I can be helpful.

-Joshua Bernstein
Senior Software Engineer
Penguin Computing

On Feb 15, 2010, at 8:23 AM, Jeff Squyres wrote:

> Bummer!
>
> If it helps, could you put us in touch with the PBS Pro people? We
> usually only have access to Torque when developing the TM-launching
> stuff (PBS Pro and Torque supposedly share the same TM interface,
> but we don't have access to PBS Pro, so we don't know if it has
> diverged over time).
>
>
> On Feb 15, 2010, at 8:13 AM, Repsher, Stephen J wrote:
>
>> Ralph,
>>
>> This is my first build of OpenMPI so I haven't had this working
>> before. I'm pretty confident that PATH and LD_LIBRARY_PATH issues
>> are not the cause, otherwise launches outside of PBS would fail
>> too. Also, I tried compiling everything statically with the same
>> result.
>>
>> Some additional info... (1) I did a diff on tm.h for PBS 10.2 and
>> from version 8.0 that we had - they are identical, and (2) I've
>> tried this with both the Intel 11.1 and GCC compilers and gotten
>> the exact same run-time errors.
>>
>> For now, I've got a workaround set up that launches over ssh and
>> still attaches the processes to PBS.
>>
>> Thanks for your help.
>>
>> Steve
>>
>>
>> From: users-bounces_at_[hidden] [mailto:users-bounces_at_open-mpi.org]
>> On Behalf Of Ralph Castain
>> Sent: Friday, February 12, 2010 8:29 PM
>> To: Open MPI Users
>> Subject: Re: [OMPI users] Seg fault with PBS Pro 10.2
>>
>> Afraid compilers don't help when the param is a void*...
>>
>> It looks like this is consistent, but I've never tried it under
>> that particular environment. Did prior versions of OMPI work, or
>> are you trying this for the first time?
>>
>> One thing you might check is that you have the correct PATH and
>> LD_LIBRARY_PATH set to point to this version of OMPI and the
>> corresponding PBS Pro libs you used to build it. Most Linux distros
>> come with OMPI installed, and that can cause surprises.
>>
>> We run under Torque at major installations every day, so it
>> -should- work...unless PBS Pro has done something unusual.
>>
>>
>> On Feb 12, 2010, at 1:41 PM, Repsher, Stephen J wrote:
>>
>>> Yes, the failure seems to be in mpirun; it never even gets to my
>>> application.
>>>
>>> The proto for tm_init looks like this:
>>> int tm_init(void *info, struct tm_roots *roots);
>>>
>>> where the struct has 6 elements: 2 x tm_task_id + 3 x int + 1 x
>>> tm_task_id
>>>
>>> If the API was different, wouldn't the compiler most likely
>>> generate an error at compile-time?
>>>
>>> Thanks!
>>>
>>> Steve
>>>
>>>
>>> From: users-bounces_at_[hidden] [mailto:users-bounces_at_open-mpi.org]
>>> On Behalf Of Ralph Castain
>>> Sent: Friday, February 12, 2010 3:21 PM
>>> To: Open MPI Users
>>> Subject: Re: [OMPI users] Seg fault with PBS Pro 10.2
>>>
>>> I'm a tad confused - this trace would appear to indicate that
>>> mpirun is failing, yes? Not your application?
>>>
>>> The reason it works for local procs is that tm_init isn't called
>>> for that case - mpirun just fork/exec's the procs directly. When
>>> remote nodes are required, mpirun must connect to Torque. This is
>>> done with a call to:
>>>
>>> ret = tm_init(NULL, &tm_root);
>>>
>>> My guess is that something changed in PBS Pro 10.2 to that API.
>>> Can you check the tm header file and see? I have no access to
>>> PBS any more, so I'll have to rely on your eyes to see a diff.
>>>
>>> Thanks
>>> Ralph
>>>
>>> On Feb 12, 2010, at 8:50 AM, Repsher, Stephen J wrote:
>>>
>>>> Hello,
>>>>
>>>> I'm having problems running Open MPI jobs under PBS Pro 10.2.
>>>> I've configured and built OpenMPI 1.4.1 with the Intel 11.1
>>>> compiler on Linux and with --with-tm support and the build runs
>>>> fine. I've also built with static libraries per the FAQ
>>>> suggestion since libpbs is static. However, my test application
>>>> keeps failing with a segmentation fault, but ONLY when trying to
>>>> select more than 1 node. Running on a single node within PBS
>>>> works fine. Also, running outside of PBS via ssh runs fine as
>>>> well, even across multiple nodes. OpenIB support is also
>>>> enabled, but that doesn't seem to affect the error because I've
>>>> also tried running with the --mca btl tcp,self flag and it still
>>>> doesn't work. Here is the error I'm getting:
>>>>
>>>> [n34:26892] *** Process received signal ***
>>>> [n34:26892] Signal: Segmentation fault (11)
>>>> [n34:26892] Signal code: Address not mapped (1)
>>>> [n34:26892] Failing at address: 0x3f
>>>> [n34:26892] [ 0] /lib64/libpthread.so.0 [0x7fc0309d6a90]
>>>> [n34:26892] [ 1] /part0/apps/MPI/intel/openmpi-1.4.1/bin/pbs_mpirun(discui_+0x84) [0x476a50]
>>>> [n34:26892] [ 2] /part0/apps/MPI/intel/openmpi-1.4.1/bin/pbs_mpirun(diswsi+0xc3) [0x474063]
>>>> [n34:26892] [ 3] /part0/apps/MPI/intel/openmpi-1.4.1/bin/pbs_mpirun [0x471d0c]
>>>> [n34:26892] [ 4] /part0/apps/MPI/intel/openmpi-1.4.1/bin/pbs_mpirun(tm_init+0x1fe) [0x471ff8]
>>>> [n34:26892] [ 5] /part0/apps/MPI/intel/openmpi-1.4.1/bin/pbs_mpirun [0x43f580]
>>>> [n34:26892] [ 6] /part0/apps/MPI/intel/openmpi-1.4.1/bin/pbs_mpirun [0x413921]
>>>> [n34:26892] [ 7] /part0/apps/MPI/intel/openmpi-1.4.1/bin/pbs_mpirun [0x412b78]
>>>> [n34:26892] [ 8] /lib64/libc.so.6(__libc_start_main+0xe6) [0x7fc03068d586]
>>>> [n34:26892] [ 9] /part0/apps/MPI/intel/openmpi-1.4.1/bin/pbs_mpirun [0x412ac9]
>>>> [n34:26892] *** End of error message ***
>>>> Segmentation fault
>>>>
>>>> (NOTE: pbs_mpirun = orterun, mpirun, etc.)
>>>>
>>>> Has anyone else seen errors like this within PBS?
>>>>
>>>> ============================================
>>>> Steve Repsher
>>>> Boeing Defense, Space, & Security - Rotorcraft
>>>> Aerodynamics/CFD
>>>> Phone: (610) 591-1510
>>>> Fax: (610) 591-6263
>>>> stephen.j.repsher_at_[hidden]
>>>>
>>>>
>>>>
>>>> _______________________________________________
>>>> users mailing list
>>>> users_at_[hidden]
>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>
>>
>
>
> --
> Jeff Squyres
> jsquyres_at_[hidden]
>
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
>