
Open MPI User's Mailing List Archives


From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2007-01-29 20:54:05


Without analyzing your source, it's hard to say. I will say that
OMPI may send fragments out of order, but we do, of course, provide
the same message ordering guarantees that MPI mandates. So let me
ask a few leading questions:

- Are you using any wildcards in your receives, such as
MPI_ANY_SOURCE or MPI_ANY_TAG?

- Have you run your code through a memory-checking debugger such as
valgrind?

- I don't know what Scali MPI uses, but MPICH and Intel MPI use
integers for MPI handles. Have you tried LAM/MPI as well? It, like
Open MPI, uses pointers for MPI handles. I mention this because apps
that unintentionally have code that takes advantage of integer
handles can sometimes behave unpredictably when switching to a
pointer-based MPI implementation.

- What network interconnect are you using between the two hosts?

On Jan 25, 2007, at 4:22 PM, Fisher, Mark S wrote:

> Recently I wanted to try Open MPI for use with our CFD flow solver
> WINDUS. The code uses a master/slave methodology where the master
> handles I/O and issues tasks for the slaves to perform. The original
> parallel implementation was done in 1993 using PVM, and in 1999 we
> added support for MPI.
>
> When testing the code with Open MPI 1.1.2 it ran fine when running on
> a single machine. As soon as I ran on more than one machine I started
> getting random errors right away (the attached tar ball has a good
> and bad output). It looked like either the messages were out of order
> or were meant for the other slave process. In the run mode used there
> is no slave-to-slave communication. In the file the code died near
> the beginning of the communication between master and slave.
> Sometimes it will run further before it fails.
>
> I have included a tar file with the build and configuration info. The
> two nodes are identical Xeon 2.8 GHz machines running SLED 10. I am
> running real-time (no queue) with the ssh starter, using the
> following appfile:
>
> -x PVMTASK -x BCFD_PS_MODE --mca pls_rsh_agent /usr/bin/ssh --host
> skipper2 -wdir /opt/scratch/m209290/ol.scr.16348 -np 1 ./__bcfdbeta.exe
> -x PVMTASK -x BCFD_PS_MODE --mca pls_rsh_agent /usr/bin/ssh --host
> copland -wdir /tmp/mpi.m209290 -np 2 ./__bcfdbeta.exe
>
> The above file fails but the following works:
>
> -x PVMTASK -x BCFD_PS_MODE --mca pls_rsh_agent /usr/bin/ssh --host
> skipper2 -wdir /opt/scratch/m209290/ol.scr.16348 -np 1 ./__bcfdbeta.exe
> -x PVMTASK -x BCFD_PS_MODE --mca pls_rsh_agent /usr/bin/ssh --host
> skipper2 -wdir /tmp/mpi.m209290 -np 2 ./__bcfdbeta.exe
>
> The first process is the master and the second two are the slaves.
> I am not sure what is going wrong; the code runs fine with many other
> MPI distributions (MPICH1/2, Intel, Scali...). I assume that either I
> built it wrong or am not running it properly, but I cannot see what I
> am doing wrong. Any help would be appreciated!
>
> <<mpipb.tgz>>
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users

-- 
Jeff Squyres
Server Virtualization Business Unit
Cisco Systems