Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Mpirun only works when n< 3
From: Randolph Pullen (randolph_pullen_at_[hidden])
Date: 2011-07-13 08:29:30


Got it. Building a new Open MPI solved it.

I don't know whether the standard Ubuntu install was the problem or whether it just didn't like the slightly later kernel.
There seems to be reason to be suspicious of the Ubuntu 10.10 Open MPI builds if you have anything unusual in your system.
Thanks.
--- On Tue, 12/7/11, Jeff Squyres <jsquyres_at_[hidden]> wrote:

From: Jeff Squyres <jsquyres_at_[hidden]>
Subject: Re: [OMPI users] Mpirun only works when n< 3
To: randolph_pullen_at_[hidden]
Cc: "Open MPI Users" <users_at_[hidden]>
Received: Tuesday, 12 July, 2011, 10:29 PM

On Jul 11, 2011, at 11:31 AM, Randolph Pullen wrote:

> There are no firewalls by default.  I can ssh between both nodes without a password so I assumed that all is good with the comms.

FWIW, ssh'ing is different than "comms" (which I assume you mean opening random TCP sockets between two servers).
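
To illustrate that distinction, here is a minimal Python sketch that probes an arbitrary TCP port on another node instead of relying on ssh (port 22). The function name `probe_tcp` and the choice of port are illustrative assumptions and not part of Open MPI; it only reports what the TCP connect attempt saw.

```python
import socket


def probe_tcp(host, port, timeout=3.0):
    """Try to open a TCP connection to host:port and describe the outcome.

    "open"    - something accepted the connection
    "refused" - the host answered but nothing listens on that port
                (packets are getting through)
    "timeout" - no answer at all, which often points at filtering
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        return "refused"
    except socket.timeout:
        return "timeout"
    except OSError as exc:
        # Name resolution failures, unreachable networks, and so on.
        return "error: %s" % exc
```

For example, `probe_tcp("B", 50000)` run from node A (host name and port are placeholders) returning "timeout" while ssh to B works would suggest that something between the nodes treats port 22 differently from other ports.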

> I can also get both nodes to participate in the ring program at the same time.
> It's just that I am limited to only 2 processes if they are split between the nodes,
> i.e.:
> mpirun -H A,B ring              (works)
> mpirun -H A,A,A,A,A,A,A ring    (works)
> mpirun -H B,B,B,B ring          (works)
> mpirun -H A,B,A ring            (hangs)

It is odd that A,B works and A,B,A does not.

> I have discovered slightly more information:
> When I replace node 'B' from the new cluster with node 'C' from the old cluster,
> I get similar behavior, but with an error message:
> mpirun -H A,A,A,A,A,A,A ring    (works from either node)
> mpirun -H C,C,C ring            (works from either node)
> mpirun -H A,C ring              (fails from either node:)
> Process 0 sending 10 to 1, tag 201 (3 processes in ring)
> [C:23465] ***  An error occurred in MPI_Recv
> [C:23465] ***  on communicator MPI_COMM_WORLD
> [C:23465] ***  MPI_ERRORS_ARE FATAL (your job will now abort)
> Process 0 sent to 1
> ----------------------------------
> Running this on either node A or C produces the same result.
> Node C runs Open MPI 1.4.1 and is an ordinary dual core on FC10, not an i5 2400 like the others.
> All the binaries are compiled on FC10 with gcc 4.3.2.

Are you sure that all the versions of Open MPI being used on all nodes are exactly the same?  I.e., are you finding/using Open MPI v1.4.1 on all nodes?
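
One way to check this is sketched below in Python, under some assumptions: that `mpirun --version` prints a line of the form `mpirun (Open MPI) X.Y.Z` on your installation, and that a non-interactive `ssh <host> mpirun --version` finds the same `mpirun` that your jobs use. The helper names (`parse_version`, `group_by_version`, `collect`) are made up for this example.

```python
import re
import subprocess
from collections import defaultdict

# Assumed output format: "mpirun (Open MPI) 1.4.1"
VERSION_RE = re.compile(r"\(Open MPI\)\s+(\S+)")


def parse_version(output):
    """Return the version string found in mpirun's output, or None."""
    match = VERSION_RE.search(output)
    return match.group(1) if match else None


def group_by_version(outputs):
    """Map each version string (or None) to the hosts that reported it."""
    groups = defaultdict(list)
    for host, text in outputs.items():
        groups[parse_version(text)].append(host)
    return dict(groups)


def collect(hosts):
    """Run 'mpirun --version' on each host over ssh and return the raw text."""
    outputs = {}
    for host in hosts:
        result = subprocess.run(
            ["ssh", host, "mpirun", "--version"],
            capture_output=True,
            text=True,
        )
        # Some builds may print the version on stderr, so keep both streams.
        outputs[host] = result.stdout + result.stderr
    return outputs
```

Calling `group_by_version(collect(["A", "B", "C"]))` returns a dictionary keyed by version; more than one key means the nodes are not all picking up the same Open MPI, and a `None` key means the output on those hosts did not match the assumed format (or `mpirun` was not found in the non-interactive PATH).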

Are the nodes homogeneous in terms of software?  If they're heterogeneous in terms of hardware, you *might* need to have separate OMPI installations on each machine (vs., for example, a network-filesystem-based install shared to all 3) because the compiler's optimizer may produce code tailored for one of the machines, and it may therefore fail in unexpected ways on the other(s).  The same is true for your executable.

See this FAQ entry about heterogeneous setups:

    http://www.open-mpi.org/faq/?category=building#where-to-install

...hmm.  I could have sworn we had more on the FAQ about heterogeneity, but perhaps not.  The old LAM/MPI FAQ on heterogeneity is somewhat outdated, but most of its concepts are directly relevant to Open MPI as well:

    http://www.lam-mpi.org/faq/category11.php3

I should probably copy most of that LAM/MPI heterogeneous FAQ to the Open MPI FAQ, but it'll be waaay down on my priority list.  :-(  If anyone could help out here, I'd be happy to point them in the right direction to convert the LAM/MPI FAQ PHP to Open MPI FAQ PHP... 

To be clear: the PHP conversion will be pretty trivial; I stole heavily from the LAM/MPI FAQ PHP to create the Open MPI FAQ PHP -- but there are points where the LAM/MPI heterogeneity text needs to be updated; that'll take an hour or two to update all that content.

-- 
Jeff Squyres
jsquyres_at_[hidden]
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/