Hello,

I am having a problem running my application across multiple nodes. Here are the details of what I have debugged so far.

I am following the numbered list from the Getting Help page (http://www.open-mpi.org/community/help/):
1) I checked for a solution to this problem throughout the FAQ as well as the mailing list, but was unsuccessful in resolving the issue.
2) Version of Open MPI: v1.4.4
3) I found the config.log, but it is very large, so I was unable to attach it. If you would like, I can upload it and provide a link.
4) ompi_info --all output: see attached file 'ompi_info_all.txt'
5) 'ompi_info -v ompi full --parsable' output (run using 'mpirun --bynode --hostfile my_hostfile --tag-output ompi_info -v ompi full --parsable'):
[1,0]<stdout>:package:Open MPI root@intel16 Distribution
[1,0]<stdout>:ompi:version:full:1.4.4
[1,0]<stdout>:ompi:version:svn:r25188
[1,0]<stdout>:ompi:version:release_date:Sep 27, 2011
[1,0]<stdout>:orte:version:full:1.4.4
[1,0]<stdout>:orte:version:svn:r25188
[1,0]<stdout>:orte:version:release_date:Sep 27, 2011
[1,0]<stdout>:opal:version:full:1.4.4
[1,0]<stdout>:opal:version:svn:r25188
[1,0]<stdout>:opal:version:release_date:Sep 27, 2011
[1,0]<stdout>:ident:1.4.4
[1,1]<stdout>:package:Open MPI root@intel16 Distribution
[1,1]<stdout>:ompi:version:full:1.4.4
[1,1]<stdout>:ompi:version:svn:r25188
[1,1]<stdout>:ompi:version:release_date:Sep 27, 2011
[1,1]<stdout>:orte:version:full:1.4.4
[1,1]<stdout>:orte:version:svn:r25188
[1,1]<stdout>:orte:version:release_date:Sep 27, 2011
[1,1]<stdout>:opal:version:full:1.4.4
[1,1]<stdout>:opal:version:svn:r25188
[1,1]<stdout>:opal:version:release_date:Sep 27, 2011
[1,1]<stdout>:ident:1.4.4


6) Detailed description:
I have a Fortran 90 application that solves a system of linear equations using LU decomposition. The application has three components: matrix_fill, matrix_decomp, and matrix_solve. The application has a make option for compiling it with MPI.

I have successfully compiled the application using openmpi v1.4.4, and can run the application.  
I utilize the '--hostfile' parameter when executing mpirun. For testing purposes I modified this file to see if I could narrow down the problem.

I am able to run the program locally (on the same node where mpirun is executed) with one or more slots (I was able to run with 12 slots on a single node). I am also able to run it with 1 or 2 slots on a single remote node.

The problem occurs when I try to have two nodes work together, that is, when I specify two separate nodes in the hostfile and use -np 2 when executing mpirun.

Here is an example of my_hostfile when the problem occurs:
intel15
intel16

and this is an example of the command used:
[intel15] > mpirun --hostfile my_hostfile -np 2 matrix_fill
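
For anyone trying to reproduce this, the setup above can be sketched as a small shell script. This is only a sketch: the helper names make_hostfile and mpirun_cmd are made up for illustration, the hostfile is written to a temporary file instead of my_hostfile, and the script prints the mpirun command line without running it.

```shell
#!/bin/sh
# Sketch only: make_hostfile and mpirun_cmd are hypothetical helper names.

# Write one node name per line to the given file, in the same
# layout as my_hostfile above.
make_hostfile() {
    hostfile=$1
    shift
    printf '%s\n' "$@" > "$hostfile"
}

# Print (do not run) the mpirun command line for a given hostfile,
# process count, and program name.
mpirun_cmd() {
    printf 'mpirun --hostfile %s -np %s %s\n' "$1" "$2" "$3"
}

# The failing two-node case: intel15 and intel16 with -np 2.
hf=$(mktemp)
make_hostfile "$hf" intel15 intel16
mpirun_cmd "$hf" 2 matrix_fill
```

Passing a single node name to make_hostfile gives the single-node case, which makes it easy to compare the working and hanging configurations side by side.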


The problem occurs at the second call to MPI_BARRIER. The first MPI_BARRIER call succeeds, but the second one hangs.
Here is a basic outline of the code up to the point where the program hangs:
[code]
      CALL MPI_INIT(ierr)
      CALL MPI_COMM_RANK(MPI_COMM_WORLD, my_rank, ierr)
      CALL MPI_COMM_SIZE(MPI_COMM_WORLD, group_size, ierr)

      ! creates buffers for each image
      !synchronize buffers
      CALL MPI_BARRIER(MPI_COMM_WORLD, ierr)

      WRITE(6, *) 'Initializing I/O for image #', my_image
      CALL flushio

      ! At this barrier the program hangs and must be killed using CTRL+C
      CALL MPI_BARRIER(MPI_COMM_WORLD, ierr)
[/code]


The hang only occurs when using -np 2 (or larger) across multiple nodes that are networked together. At first I thought it was a firewall issue, so I ran 'service iptables stop' as root, but this did not fix the problem. I am able to ssh between these nodes without a password, and the nodes are part of a cluster of approximately 20 nodes at the University of Maryland, Baltimore County.

7) Network info: see attached file 'network_info.txt'



I have been trying to determine the root cause of this hang for the past week, without success.

Any help would be greatly appreciated.  

Thank you,
Tim