
Subject: [OMPI users] Run Time problem: Program hangs when utilizing multiple nodes.
From: Tim Blattner (tblatt1_at_[hidden])
Date: 2011-12-06 20:17:59


Hello,

The problem I have been having is that my application hangs when run across
multiple nodes. Here are the details of what I have debugged so far.

I am going to follow the numbered list from the getting help page
(http://www.open-mpi.org/community/help/):
1) I checked the FAQ and the mailing list archives for a solution to this
problem, but was unable to resolve the issue.
2) Version of Open MPI: 1.4.4
3) I found the config.log, but it is very large, so I was unable to attach
it. If you would like, I can upload it and provide a link.
4) 'ompi_info --all' output: see the attached file 'ompi_info_all.txt'
5) 'ompi_info -v ompi full --parsable' output (run using: 'mpirun --bynode
--hostfile my_hostfile --tag-output ompi_info -v ompi full --parsable'):
[1,0]<stdout>:package:Open MPI root_at_intel16 Distribution
[1,0]<stdout>:ompi:version:full:1.4.4
[1,0]<stdout>:ompi:version:svn:r25188
[1,0]<stdout>:ompi:version:release_date:Sep 27, 2011
[1,0]<stdout>:orte:version:full:1.4.4
[1,0]<stdout>:orte:version:svn:r25188
[1,0]<stdout>:orte:version:release_date:Sep 27, 2011
[1,0]<stdout>:opal:version:full:1.4.4
[1,0]<stdout>:opal:version:svn:r25188
[1,0]<stdout>:opal:version:release_date:Sep 27, 2011
[1,0]<stdout>:ident:1.4.4
[1,1]<stdout>:package:Open MPI root_at_intel16 Distribution
[1,1]<stdout>:ompi:version:full:1.4.4
[1,1]<stdout>:ompi:version:svn:r25188
[1,1]<stdout>:ompi:version:release_date:Sep 27, 2011
[1,1]<stdout>:orte:version:full:1.4.4
[1,1]<stdout>:orte:version:svn:r25188
[1,1]<stdout>:orte:version:release_date:Sep 27, 2011
[1,1]<stdout>:opal:version:full:1.4.4
[1,1]<stdout>:opal:version:svn:r25188
[1,1]<stdout>:opal:version:release_date:Sep 27, 2011
[1,1]<stdout>:ident:1.4.4

6) Detailed description:
I have a Fortran 90 application that solves a system of linear equations
using LU decomposition. The application has three components: matrix_fill,
matrix_decomp, and matrix_solve. The application has a make option for
compiling it with MPI.

I have successfully compiled the application with Open MPI 1.4.4 and can
run it.
I use the '--hostfile' parameter when executing mpirun. For testing
purposes I modified this file to see if I could narrow down the problem.

I am able to run the program locally (on the same node where mpirun is
executed) using 1 or more slots (I was able to run with 12 slots on a
single node). I am also able to run with mpirun on 1 or 2 slots on a
single remote node.
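
(For reference, the single-node hostfile for those tests looked roughly like
the sketch below, assuming Open MPI's standard 'slots=' syntax; the count
shown is for the 12-slot case.)

intel15 slots=12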

The problem occurs when I try to have two nodes work together, i.e. when I
specify two separate nodes in the hostfile and use -np 2 when executing
mpirun.

Here is an example of my_hostfile (when the problem occurs):
intel15
intel16

and this is an example of the command used:
[intel15] > mpirun --hostfile my_hostfile -np 2 matrix_fill

The problem occurs at the second call to MPI_BARRIER. The first MPI_BARRIER
call succeeds, but the program hangs on the second one.
Here is a basic outline of the code up to the point where the program
hangs:
[code]
      CALL MPI_INIT(ierr)
      CALL MPI_COMM_RANK(MPI_COMM_WORLD, my_rank, ierr)
      CALL MPI_COMM_SIZE(MPI_COMM_WORLD, group_size, ierr)

      ! creates buffers for each image
      !synchronize buffers
      CALL MPI_BARRIER(MPI_COMM_WORLD, ierr)

      WRITE(6, *) 'Initializing I/O for image #', my_image
      CALL flushio

      ! At this barrier the program hangs and must be killed using CTRL+C
      CALL MPI_BARRIER(MPI_COMM_WORLD, ierr)
[/code]
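
For completeness, here is a minimal self-contained sketch of the same barrier
pattern. This is not the actual application code: it assumes the standard
mpif.h interface and omits the buffer setup and the flushio call. I can use
something like this to check whether the hang is in the barriers themselves
or elsewhere in our code:
[code]
      PROGRAM barrier_test
      IMPLICIT NONE
      INCLUDE 'mpif.h'
      INTEGER :: ierr, my_rank, group_size

      CALL MPI_INIT(ierr)
      CALL MPI_COMM_RANK(MPI_COMM_WORLD, my_rank, ierr)
      CALL MPI_COMM_SIZE(MPI_COMM_WORLD, group_size, ierr)

      ! first barrier (succeeds in the real application)
      CALL MPI_BARRIER(MPI_COMM_WORLD, ierr)
      WRITE(6, *) 'Rank ', my_rank, ' passed barrier 1'

      ! second barrier (where the real application hangs)
      CALL MPI_BARRIER(MPI_COMM_WORLD, ierr)
      WRITE(6, *) 'Rank ', my_rank, ' passed barrier 2'

      CALL MPI_FINALIZE(ierr)
      END PROGRAM barrier_test
[/code]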

The hang only occurs when trying to use -np 2 (or larger) across multiple
nodes that are networked together. At first I thought it was a firewall
issue, so I ran 'service iptables stop' as root, but sadly this did not fix
the problem. I am able to ssh between these nodes without a password, and
the nodes are part of a cluster of approximately 20 nodes at the University
of Maryland, Baltimore County.
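
If it helps with diagnosis, one more thing I can try is restricting Open MPI
to the TCP BTL on a specific interface, roughly like the command below (eth0
is only a placeholder; I would substitute whichever interface the nodes
actually use to reach each other):

[intel15] > mpirun --hostfile my_hostfile -np 2 --mca btl tcp,self \
            --mca btl_tcp_if_include eth0 matrix_fill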

7) Network info: see the attached network_info.txt file.

I have been trying to determine the root cause of this issue for the past
week, but without success.

Any help would be greatly appreciated.

Thank you,
Tim