Open MPI User's Mailing List Archives

Subject: [OMPI users] OMPI 1.4.x ignores hostfile and launches all the processes on just one node in Grid Engine
From: Serge (skhan_at_[hidden])
Date: 2010-04-06 14:18:02


Hi,

Open MPI integrates with Sun Grid Engine really well: one does not need
to pass any parameters to the mpirun command to launch the processes on
the compute nodes. Having "mpirun ./program" in the submission script is
enough; there is no need for "-np XX" or "-hostfile file_name".

However, there are cases when being able to specify the hostfile is
important (hybrid jobs, users with MPICH jobs, etc.). For example, with
Grid Engine I can request four 4-core nodes, that is, a total of 16
slots. But I also want to specify how to distribute the processes across
the nodes, so I create the file 'hosts':

node01 slots=1
node02 slots=1
node03 slots=1
node04 slots=1

and modify the line in the submission script to:
mpirun -hostfile hosts ./program

With Open MPI 1.2.x everything worked properly: Open MPI counted the
number of slots specified in the 'hosts' file (4, which is effectively
the same as passing "-np 4" to mpirun) and distributed the processes
across the compute nodes correctly (one process per host).

It's different with Open MPI 1.4.1. It cannot process the 'hosts' file
properly at all. All the processes get launched on just one node -- the
shepherd host.

The format of the 'hosts' file does not matter. It can be, say:

node01
node01
node02
node02

meaning 2 slots on each node. Open MPI 1.2.x handles that with no
problem, but Open MPI 1.4.x does not.
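
To make the expected slot counts concrete, here is a minimal Python
sketch of how I read the two 'hosts' formats above. It is only an
illustration, not Open MPI's actual parser: the function name
parse_hostfile is made up, and it assumes that a host without
"slots=N" counts as one slot and that repeated lines for the same
host add up.

```python
def parse_hostfile(text):
    """Return a dict mapping each host name to its total slot count.

    Assumptions (for illustration only): a line without "slots=N"
    contributes one slot, repeated lines for a host accumulate, and
    anything after '#' is treated as a comment.
    """
    slots = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        fields = line.split()
        host = fields[0]
        count = 1
        for field in fields[1:]:
            if field.startswith("slots="):
                count = int(field.split("=", 1)[1])
        slots[host] = slots.get(host, 0) + count
    return slots


explicit = "node01 slots=1\nnode02 slots=1\nnode03 slots=1\nnode04 slots=1\n"
repeated = "node01\nnode01\nnode02\nnode02\n"

# Total slots in the first file: the implied process count.
print(sum(parse_hostfile(explicit).values()))   # 4
# Per-host slots in the second file.
print(parse_hostfile(repeated))                 # {'node01': 2, 'node02': 2}
```

Under that reading, the first file should give 4 processes, one per
host, and the second should give 4 processes, two on each of node01
and node02.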

The problem appears with OMPI 1.4.1, SGE 6.1u6. It was also tested with
OMPI 1.3.4 and SGE 6.2u4.

It's important to note that if the mpirun command is run interactively,
not from inside a Grid Engine script, then it interprets the contents of
the hostfile just fine.

I am wondering what changed between OMPI 1.2.x and OMPI 1.4.x that
prevents the expected behavior, and whether it is possible to get the
old behavior back in OMPI 1.4.x by, say, tuning some parameters.

= Serge