Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Open MPI: Problem with 64-bit openMPI and intel compiler
From: Sims, James S. Dr. (james.sims_at_[hidden])
Date: 2009-08-12 13:52:56


Sorry, I don't understand what you want me to do. I assume you want me to run the app on n296 as
rank 0 and on n298 as rank 1, but I don't know how to do that outside of either torque
or mpirun -hostfile.
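[For reference: one way to pin the two ranks to specific nodes without torque or a hostfile would be mpirun's -H/--host flag. This is only a sketch using the node names from above; it is echoed rather than executed, since it needs the actual cluster.]

```shell
# Sketch only: -H/--host lists the nodes for rank placement in order,
# so rank 0 lands on n296 and rank 1 on n298 (node names from this
# thread; adjust for your cluster).
cmd="mpirun -np 2 -H n296,n298 ./MPI_li_64"
echo "$cmd"
```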

Jim

P.S. I tried -x LD_LIBRARY_PATH and it doesn't work.

________________________________________
From: users-bounces_at_[hidden] [users-bounces_at_[hidden]] On Behalf Of Ralph Castain [rhc_at_[hidden]]
Sent: Wednesday, August 12, 2009 7:47 AM
To: Open MPI Users
Subject: Re: [OMPI users] Open MPI: Problem with 64-bit openMPI and intel compiler

We use Torque with OMPI here on almost every cluster, running 64-bit
jobs with the Intel compilers, so I doubt the problem is with Torque.
It is probably an issue with library paths.

Torque doesn't automatically forward your environment, nor does it
execute your remote .bashrc (or equivalent) when starting your remote
process. While ssh also typically doesn't forward the environment
(though your sys admin may have set it up to do so), it does execute
the remote .bashrc, which could be setting the correct path. I should
also note that mpirun will automatically forward LD_LIBRARY_PATH and
PATH for you in that case, which differs from what we do for the
other launchers.
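[Editor's illustration of the distinction above: a variable exported in a shell rc file is visible to an interactive shell but not to a non-interactive one, which is how Torque starts remote processes. The rc-file path and library path here are made up for the demonstration.]

```shell
# Write a fake rc file that sets LD_LIBRARY_PATH (hypothetical value).
echo 'export LD_LIBRARY_PATH=/opt/intel/lib/intel64' > /tmp/fake_bashrc

# An interactive shell sources its rc file, so the variable appears:
env -u LD_LIBRARY_PATH bash --rcfile /tmp/fake_bashrc -i -c \
    'echo "interactive: $LD_LIBRARY_PATH"' 2>/dev/null

# A non-interactive shell (the Torque case) never reads the rc file:
env -u LD_LIBRARY_PATH bash -c \
    'echo "non-interactive: ${LD_LIBRARY_PATH:-unset}"'
```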

If you execute your MPI_li_64 program locally on each of your nodes
(i.e., both processes run locally), does it work? If so, then try adding

-x LD_LIBRARY_PATH

to your mpirun cmd line. This will tell mpirun to pick up your local
lib-path and forward it for you regardless of the launch environment.
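[Editor's sketch of the suggested command line. The library path is an assumption, not from the thread, and the hostfile is the one shown below; the command is echoed rather than executed, since it needs the cluster.]

```shell
# Assumed Intel runtime location -- adjust for your site.
export LD_LIBRARY_PATH="/opt/intel/lib/intel64:${LD_LIBRARY_PATH:-}"

# -x LD_LIBRARY_PATH forwards the launching shell's value to every rank.
cmd="mpirun -x LD_LIBRARY_PATH -np 2 --hostfile hostfile ./MPI_li_64"
echo "$cmd"
```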

On Aug 11, 2009, at 10:17 PM, Sims, James S. Dr. wrote:

> Back to this problem.
>
> The last suggestion was to upgrade to 1.3.3, which has been done.
> Still cannot get this code to
> run in 64-bit mode with torque. What I can do is run the job in 64-
> bit mode using a hostfile.
> Specifically, if I use
> qsub -I -l nodes=2:ppn=1, torque allocates two nodes to the job and,
> since this is an interactive
> shell, logs me in to the controlling node. In this example process
> rank 0 is n72 and process rank 1 is n89:
> [sims_at_n72 4000]$ mpirun --display-allocation -pernode --display-map
> hostname
>
> ====================== ALLOCATED NODES ======================
>
> Data for node: Name: n72.clust.nist.gov Num slots: 1 Max
> slots: 0
> Data for node: Name: n89 Num slots: 1 Max slots: 0
>
> =================================================================
>
> ======================== JOB MAP ========================
>
> Data for node: Name: n72.clust.nist.gov Num procs: 1
> Process OMPI jobid: [47657,1] Process rank: 0
>
> Data for node: Name: n89 Num procs: 1
> Process OMPI jobid: [47657,1] Process rank: 1
>
> =============================================================
> n89
> n72.clust.nist.gov
>
> My hostfile is
> [sims_at_n72 4000]$ cat hostfile
> n72
> n89
>
>
> If, logged in to n72, I use the command
> mpirun -np 2 ./MPI_li_64
> the job fails with a
> mpirun noticed that process rank 1 with PID 10538 on node n89 exited
> on signal 11 (Segmentation fault).
>
> If I use the command
> mpirun -np 2 --hostfile hostfile ./MPI_li_64
> the same thing happens.
>
> However, if I ssh to n73, for example, and use the command
> mpirun -np 2 --hostfile hostfile ./MPI_li_64
> everything works fine. So it appears that the problem is with torque.
>
> Any ideas?
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users