Open MPI User's Mailing List Archives

Subject: [OMPI users] Network Problem?
From: David Ronis (David.Ronis_at_[hidden])
Date: 2009-06-30 14:49:53


(This may be a duplicate; an earlier post seems to have been lost.)

I'm using Open MPI (1.3.2) to run on three dual-processor machines
(running Linux, Slackware 12.1, gcc 4.4.0). Two are directly on my LAN,
while the third is connected to my LAN via VPN and NAT (I can
communicate in either direction between any of the LAN machines and
the remote machine using its NAT address).

The program I'm trying to run is very simple in terms of MPI.
Basically it is:

int main(int argc, char *argv[])
{
        [snip];

  MPI_Init(&argc,&argv);
  MPI_Comm_size(MPI_COMM_WORLD,&numprocs);
  MPI_Comm_rank(MPI_COMM_WORLD,&myrank);

        [snip];

  if(myrank==0)
    i=MPI_Reduce(MPI_IN_PLACE, C, N, MPI_DOUBLE,
                 MPI_SUM, 0, MPI_COMM_WORLD);
  else
    i=MPI_Reduce(C, MPI_IN_PLACE, N, MPI_DOUBLE,
                 MPI_SUM, 0, MPI_COMM_WORLD);

  if(i!=MPI_SUCCESS)
    {
      fprintf(stderr,"MPI_Reduce (C) fails on processor %d\n", myrank);
      MPI_Finalize();
      exit(1);
    }
  MPI_Barrier(MPI_COMM_WORLD);

         [snip];

}

I run by invoking:

        mpirun -v -np ${NPROC} -hostfile ${HOSTFILE} --stdin none $* > /dev/null

If I run on the 4 processors that are physically on the LAN, it works
as expected. When I add the 2 processors on the remote machine, things
don't work properly:

1. If I start with NPROC=6 on one of the LAN machines, all 6 processes
start (as shown by running ps), and all get to the MPI_Reduce
calls. At that point things hang (I see no network traffic, which is
strange given the size of the array I'm trying to reduce).

2. If I start on the remote machine with NPROC=6, then only the mpirun
process shows up under ps on the remote, while nothing shows up on the
other nodes. Killing the process gives messages like:

         hostname - daemon did not report back when launched

3. If I start on the remote with NPROC=2, the 2 processes start on
the remote and finish properly.

My suspicion is that there's some bad interaction with NAT and
authentication.

Any suggestions?

David