Okay - thanks!
First, be assured we run 64-bit ifort code under Torque at large scale
all the time here at LANL, so this is likely to be something trivial
in your environment.
A few things to consider/try:
1. The most likely culprit is that your LD_LIBRARY_PATH is pointing to the
32-bit libraries on the other nodes. Torque does -not- copy your
environment by default, and neither does OMPI. Try adding
"-x LD_LIBRARY_PATH" to your cmd line, making sure that the 64-bit libs
come before any 32-bit libs in that envar. This tells mpirun to pick up
that envar and propagate it for you (see the first example below).
2. Check to ensure you are in fact using a 64-bit version of OMPI. Run
"ompi_info --config" to see how it was built. Also run "mpif90 --showme"
and see what libs it is linked to. Does your LD_LIBRARY_PATH match the
names and ordering? (See the second example below.)
3. Get a multi-node allocation and run "pbsdsh sh -c 'echo
$LD_LIBRARY_PATH'" and see what libs you are defaulting to on the other
nodes. The single quotes matter - they keep your local shell from
expanding the variable before it reaches the remote nodes (see the third
example below).
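A few sketches of what I mean. For point 1 - the lib directory here is
just a guess at a typical 64-bit ifort 10.1 install location, so
substitute whatever is correct on your system:

    export LD_LIBRARY_PATH=/opt/intel/fce/10.1/lib:$LD_LIBRARY_PATH
    mpirun -np 2 -x LD_LIBRARY_PATH ./MPI_li_64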
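For point 2 - the "file" and "ldd" checks are generic, not OMPI-specific:

    ompi_info --config      # how OMPI was configured/built
    mpif90 --showme         # underlying compile/link line and libs
    file ./MPI_li_64        # should report "ELF 64-bit"
    ldd ./MPI_li_64         # which libraries actually resolve at runtime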
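For point 3 (the x4gb property is taken from your qsub lines):

    qsub -I -l nodes=2:x4gb
    pbsdsh sh -c 'echo $LD_LIBRARY_PATH'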
I realize these are somewhat overlapping, but I think you catch the
drift - I suspect you are hitting the infamous "library confusion"
problem.
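One quick way to spot that mix, assuming your libs follow the usual
lib*.so naming:

    for d in `echo $LD_LIBRARY_PATH | tr ':' ' '`; do
        file $d/lib*.so* 2>/dev/null | grep -E '32-bit|64-bit'
    done

Any 32-bit entries appearing ahead of the 64-bit ones would explain the
segfault.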
On Jul 23, 2009, at 7:49 PM, Sims, James S. Dr. wrote:
> [sims_at_raritan openmpi]$ mpirun -V
> mpirun (Open MPI) 1.3.1rc4
> From: users-bounces_at_[hidden] [users-bounces_at_[hidden]] On
> Behalf Of Ralph Castain [rhc_at_[hidden]]
> Sent: Thursday, July 23, 2009 5:44 PM
> To: Open MPI Users
> Subject: Re: [OMPI users] Open MPI:Problem with 64-bit openMPI and
> intel compiler
> What OMPI version are you using?
> On Jul 23, 2009, at 3:00 PM, Sims, James S. Dr. wrote:
>> I have an OpenMPI program compiled with a version of OpenMPI built
>> using the ifort 10.1
>> compiler. I can compile and run this code with no problem using the
>> 32-bit version of ifort, and I can also submit batch jobs using torque
>> with this 32-bit code.
>> However, compiling the same code to produce a 64-bit executable
>> produces a code
>> that runs correctly only in the simplest cases. It does not run
>> correctly when run
>> under the torque batch queuing system, running for a while and then
>> giving a segmentation violation in a section of code that is fine in
>> the 32-bit version.
>> I have to run the MPI multinode jobs using our torque batch queuing
>> system, but we do have the capability of running the jobs in an
>> interactive batch environment.
>> If I do a qsub -I -l nodes=1:x4gb
>> I get an interactive session on the remote node assigned to my job.
>> I can run the
>> job using either
>> ./MPI_li_64 or
>> mpirun -np 1 ./MPI_li_64
>> and the job runs successfully to completion. I can also
>> start an interactive shell using
>> qsub -I -l nodes=1:ppn=2:x4gb
>> and I will get a single dual-processor (or greater) node. On this
>> single node,
>> mpirun -np 2 ./MPI_li_64 works.
>> However, if instead I ask for two nodes in my interactive batch job,
>> qsub -I -l nodes=2:x4gb,
>> two nodes will be assigned to me, but when I enter
>> mpirun -np 2 ./MPI_li_64
>> the job runs for a while, then fails with:
>> mpirun noticed that process rank 1 with PID 23104 on node n339
>> exited on signal 11 (Segmentation fault).
>> I can trace this in the intel debugger and see that the segmentation
>> fault is occurring in what should
>> be good code, and in code that executes with no problem when
>> everything is compiled 32-bit. I am
>> at a loss for what could be preventing this code from running within
>> the batch queuing environment in the 64-bit version.