Could your problem be related to the MCA parameter “contamination” problem, where the child MPI process inherits MCA environment
variables from the parent process, if that problem still exists?
Back in 2007 I was implementing a program that solves two large interrelated systems of equations (more than 200,000,000 equations) using
PCG iteration. The program iterates on the first system until it reaches a certain degree of convergence, then the master node executes a shell script which starts the parallel solver on the second system. That solver again iterates to a certain degree of convergence,
and some parameters from solving the second system are stored in files. Once the second system has been solved, the stored parameters are used in the solver for the first system. Both before and after the master node makes the system call, the nodes are synchronized
via calls to MPI_BARRIER.
The program was hanging when the master node executed the shell script.
I found that this was because the MCA environment variables were inherited from the parent process, and solved the
problem by adding the following to the script that starts the second MPI program:
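# Unset every inherited Open MPI MCA variable before starting the second MPI program.
# The match is anchored to the start of the line so that only variable names are hit.
for i in $(env | grep '^OMPI_MCA' | sed 's/=.*//')
do
    unset "$i"
done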
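
A variant of the same idea, offered only as a sketch: if the script needs its own environment left alone, the cleanup can be wrapped in a function whose body runs in a subshell, so the variables are removed only for the command being launched. The function name `run_without_mca` and the `mpirun` command line in the comment are hypothetical and not part of my original script.

```shell
#!/bin/sh
# Hypothetical helper: run a command with all OMPI_MCA_* variables removed.
# The body is in parentheses, so it runs in a subshell and the caller's
# environment is not modified.
run_without_mca() (
    # Print only the names of variables that start with OMPI_MCA_.
    for name in $(env | sed -n 's/^\(OMPI_MCA_[A-Za-z0-9_]*\)=.*/\1/p')
    do
        unset "$name"
    done
    # Run the given command in the cleaned environment.
    "$@"
)

# Assumed usage (program name and process count are placeholders):
#   run_without_mca mpirun -np 4 ./second_solver
```

This keeps the parent's MCA settings available for anything the script does after the second solver returns.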