Open MPI User's Mailing List Archives

Subject: [OMPI users] MPI_Reduce hangs in multi-node configuration
From: Brian Blank (brianblank_at_[hidden])
Date: 2009-02-08 19:46:58


I'm trying to run a small "proof of concept" program using OpenMPI 1.3 on
Solaris 8 with Sparc processors across 2 nodes. The MPI_Reduce function
appears to hang. If I run the same program with 4 instances on 1 node, or
with 2 instances on 2 nodes, it works fine. The problem appears with 4
instances on 2 nodes.

First, I had some issues while compiling OpenMPI. I resolved them, and I
would like to share my fixes. I believe these compile-time issues come from
running an older version of Solaris, and probably not from any major issue in
OpenMPI. The fixes are not related to my run-time problem, but I thought you
might want to see them in case they provide insight into what is going wrong.

1) ./opal/mca/paffinity/solaris/paffinity_solaris_module.c
_SC_CPUID_MAX was undefined. I made the following change to 2 locations in
the source:
    cpuid_max = 7; /* sysconf(_SC_CPUID_MAX); */ /* Running on 8 CPU nodes */

2) ./ompi/contrib/vt/vt/vtlib/vt_iowrap.c
vfscanf was undefined. I had to comment out the following code (it appears
that fscanf() was not required anyway):

a) /* #include <stdint.h> */
b) /* VT_IOWRAP_INIT_FUNC(fscanf); */
c) I commented out the entire fscanf() function
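
For fix 1, hard-coding 7 only suits 8-CPU nodes. A less machine-specific workaround might look like the sketch below. It is not Open MPI code: max_cpu_id() is a hypothetical helper name, and the fallback assumes that _SC_NPROCESSORS_CONF is available and that CPU ids are numbered contiguously from 0, which Solaris does not guarantee.

```c
#include <assert.h>
#include <unistd.h>

/* Hypothetical helper (not part of Open MPI): highest CPU id to probe. */
long max_cpu_id(void)
{
    long v = -1;
#ifdef _SC_CPUID_MAX
    /* Use the real query where the platform headers define it. */
    v = sysconf(_SC_CPUID_MAX);
#endif
    if (v < 0) {
#ifdef _SC_NPROCESSORS_CONF
        /* Assumption: CPU ids run contiguously from 0 to (configured - 1). */
        long n = sysconf(_SC_NPROCESSORS_CONF);
        v = (n > 0) ? n - 1 : 0;
#else
        v = 0; /* last resort: assume a single CPU with id 0 */
#endif
    }
    assert(v >= 0);
    return v;
}
```

With something like this, the two assignments would become cpuid_max = max_cpu_id(); instead of the constant.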

Now, I seem to be stuck on a run-time issue. I wrote a program (located in
the attached bz2 file) called sieve.c which calculates prime numbers using
the sieve algorithm (copied the code from somewhere). When I run the
program with 4 processes on a single local node, it works fine. If I run it
with 2 processes on 2 nodes, it also works fine. If I run it with 4
processes on 2 nodes, it hangs. I made the following observations:

1) It is definitely hanging during the call to MPI_Reduce().

2) Some instances do exit MPI_Reduce(), while other instances enter but
never exit this function.

3) If I add the following code right before calling MPI_Reduce(), the
problem goes away. Delaying the destination (root) instance of the reduce
operation from making the call seems to let it complete. However, I do
realize this is a kludge and that there is no guarantee it will work all
the time.

     MPI_Barrier(MPI_COMM_WORLD);
     if(!id) sleep(1);

4) If I changed the MPI_Reduce() to an MPI_Allreduce(), the sieve program
also works with 4 instances across 2 nodes.
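
For anyone reading without the attachment, here is a rough serial sketch of the kind of computation involved. It is not the attached sieve.c: the function names and the block decomposition are my assumptions for illustration. Each loop iteration in count_primes_blocked() stands in for one MPI instance counting primes in its own block, and the running sum stands in for the MPI_Reduce() with MPI_SUM that hangs in the real program.

```c
#include <assert.h>
#include <stdlib.h>

/* Count primes in [lo, hi] by marking multiples of every k with k*k <= hi. */
long count_primes_in_block(long lo, long hi)
{
    if (lo < 2) lo = 2;
    if (hi < lo) return 0;
    long size = hi - lo + 1;
    char *marked = calloc((size_t)size, 1);
    assert(marked != NULL);
    for (long k = 2; k * k <= hi; k++) {
        /* First multiple of k inside the block, but never k itself. */
        long first = ((lo + k - 1) / k) * k;
        if (first < k * k) first = k * k;
        for (long m = first; m <= hi; m += k)
            marked[m - lo] = 1;
    }
    long count = 0;
    for (long i = 0; i < size; i++)
        if (!marked[i]) count++;
    free(marked);
    return count;
}

/* Serial stand-in for the parallel run: "rank" id owns one block of 2..n. */
long count_primes_blocked(long n, int nprocs)
{
    assert(nprocs > 0 && n >= 2);
    long items = n - 1;   /* how many numbers there are in 2..n */
    long total = 0;       /* plays the role of the reduce result on the root */
    for (int id = 0; id < nprocs; id++) {
        long lo = 2 + (long)id * items / nprocs;
        long hi = 1 + (long)(id + 1) * items / nprocs;
        total += count_primes_in_block(lo, hi);
    }
    return total;
}
```

Calling count_primes_blocked(100, 4) mirrors the "-np 4 ... sieve 100" run and should give 25, the number of primes up to 100.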

I did search your archives, and found someone else with a similar issue, but
I didn't see any response.
          http://www.open-mpi.org/community/lists/users/2008/07/6157.php

My PATH includes:
     /home/username/mpi/openmpi-1.3.local/bin

My LD_LIBRARY_PATH includes:
     /home/username/mpi/openmpi-1.3.local/lib

I used the following in my configure parameters:
./configure --prefix=/home/username/mpi/openmpi-1.3.local --disable-mpi-f77 \
    --disable-mpi-f90 CFLAGS=-xarch=v8plus CXXFLAGS=-xarch=v8plus

I compiled the program with:
mpicc -g -o sieve sieve.c

I ran the program with:
mpirun -np 4 -H node1,node2 sieve 100

Please let me know if you need any additional information. And thanks in
advance for any help you can provide.

Thanks,
Brian