
Open MPI User's Mailing List Archives


From: Per Madsen (Per.Madsen_at_[hidden])
Date: 2007-07-23 07:10:40


I am in the process of moving a parallel program from our old 32-bit (Xeon @ 2.8 GHz) Linux cluster to a new EM64T-based (Intel Xeon 5160 @ 3.00 GHz) Linux cluster.

The OS on the old cluster is Red Hat 9; the new cluster runs Fedora 7.

I have installed the Intel Fortran compiler version 10.0 and openmpi-1.2.3.

I configured Open MPI with "--prefix=/opt/openmpi F77=ifort FC=ifort".
config.log and the output from ompi_info --all are in the attached files.

/opt/ is mounted on all nodes in the cluster.

The program causing me problems solves two large interrelated systems of equations (more than 200,000,000 equations) using PCG iteration. The program iterates on the first system until a certain degree of convergence is reached; then the master node executes a shell script which starts the parallel solver for the second system. Again, the iteration continues until a certain degree of convergence is reached, and some parameters from solving the second system are stored in different files. After the second system has been solved, the stored parameters are used in the solver for the first system. Both before and after the master node makes the system call, the nodes are synchronized via calls to MPI_BARRIER.
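The hand-off around the system call can be sketched like this (a minimal sketch based on the description above; the script name is illustrative, not the real one, and the real solver is far more complex):

```fortran
! Minimal sketch of the master's hand-off between the two solvers.
! The script name 'run_second_solver' is illustrative only.
program master_sketch
   include 'mpif.h'
   integer :: ierr, myid
   call MPI_INIT(ierr)
   call MPI_COMM_RANK(MPI_COMM_WORLD, myid, ierr)
   ! ... PCG iteration on the first system until partial convergence ...
   call MPI_BARRIER(MPI_COMM_WORLD, ierr)
   if (myid == 0) then
      ! Master alone starts the parallel solver for the second system
      call system('./run_second_solver')
   end if
   call MPI_BARRIER(MPI_COMM_WORLD, ierr)
   ! ... continue iterating on the first system with the stored parameters ...
   call MPI_FINALIZE(ierr)
end program master_sketch
```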

This setup has worked fine on the old cluster, but on the new cluster the system call does not start the parallel solver for the second system. The solver program is very complex, so I have made some small Fortran programs and shell scripts that illustrate the problem.

The setup is as follows:

mpi_main starts MPI on a number of nodes and checks that the nodes are alive. The master then executes a shell script via a system call; that script starts a serial Fortran program (serial_subprog). After returning from the system call, the master executes a second shell script, which tries to start mpi_subprog via mpirun.
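The second script can be sketched roughly as follows (a sketch, not the real script from test.tar.gz; the mpirun path is an assumption based on the --prefix used above, while mpi_subprog and nodelist are taken from the description):

```shell
#!/bin/sh
# Illustrative sketch of the second script started by mpi_main's
# system call. Echo a marker line first, then try to start
# mpi_subprog under a fresh mpirun; the absolute path to mpirun
# avoids depending on the PATH seen by the spawned shell.
echo "This is from $0"
/opt/openmpi/bin/mpirun -np 2 -hostfile nodelist ./mpi_subprog
```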

I have used mpif90 to compile the MPI programs and ifort to compile the serial program.

mpi_main starts as expected, and the first system call starts the serial program as expected. However, the second system call does not start mpi_subprog.

The Fortran programs and scripts are in the attached file test.tar.gz.

When I run the setup via:
mpirun -np 4 -hostfile nodelist ./mpi_main

I get the following:

MPI_INIT return code: 0
 MPI_INIT return code: 0
 MPI_COMM_RANK return code: 0
 MPI_COMM_SIZE return code: 0
 Process 1 of 2 is alive - Hostname= c01b04
           1 : 19
 MPI_COMM_RANK return code: 0
 MPI_COMM_SIZE return code: 0
 Process 0 of 2 is alive - Hostname= c01b05
           0 : 19
 MYID: 1 MPI_REDUCE 1 red_chk_sum= 0 rc= 0
 MYID: 0 MPI_REDUCE 1 red_chk_sum= 2 rc= 0

 Master will now execute the shell script

This is from

 We are now in the serial subprogram

 Master back from the shell script
 IERR= 0

 Master will now execute the shell script

This is from
[] OOB: Connection to HNP lost

 Master back from the shell script
 IERR= 0

 MYID: 0 MPI_REDUCE 2 red_chk_sum= 20 rc= 0
 MYID: 1 MPI_REDUCE 2 red_chk_sum= 0 rc= 0

As you can see, the execution of the serial program works, while the MPI program is not started.

I have checked that mpirun is in the PATH in the shell started by the system call, and I have checked that the script works if it is executed from the command prompt. Output from a run with the mpirun options -v -d is in the attached file test.tar.gz.

Has anyone out there tried to do something similar?


Per Madsen
Senior scientist

Det Jordbrugsvidenskabelige Fakultet / Faculty of Agricultural Sciences
Forskningscenter Foulum / Research Centre Foulum
Genetik og Bioteknologi / Dept. of Genetics and Biotechnology
Blichers Allé 20, P.O. BOX 50
DK-8830 Tjele