Open MPI User's Mailing List Archives

From: Fisher, Mark S (mark.s.fisher_at_[hidden])
Date: 2007-01-30 09:35:28


The master process uses both MPI_ANY_SOURCE and MPI_ANY_TAG while
waiting for requests from slave processes. The slaves sometimes use
MPI_ANY_TAG but the source is always specified.
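
For reference, the pattern is roughly the following (a minimal,
self-contained C sketch of the idea only; the real code is Fortran 90
called through the FORTRAN 77 interface, and the buffer, count, and tag
values here are made up):

#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, nprocs, i, buf = 0;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    if (rank == 0) {
        /* master: take a request from any slave with any tag; the
         * actual sender and tag come back in the status object */
        for (i = 1; i < nprocs; i++) {
            MPI_Recv(&buf, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
                     MPI_COMM_WORLD, &status);
            MPI_Send(&buf, 1, MPI_INT, status.MPI_SOURCE, status.MPI_TAG,
                     MPI_COMM_WORLD);
        }
    } else {
        /* slave: the source (the master, rank 0) is always specified;
         * only the tag on the reply is wildcarded */
        buf = rank;
        MPI_Send(&buf, 1, MPI_INT, 0, 42, MPI_COMM_WORLD);
        MPI_Recv(&buf, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &status);
    }

    MPI_Finalize();
    return 0;
}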

We have run the code through valgrind for a number of cases, including
the one being used here.

The code is Fortran 90 and we are using the FORTRAN 77 MPI interface, so
I do not believe this is a problem.

We are using Gigabit Ethernet.

I could look at LAM again to see if it would work. The code needs to be
in a specific working directory and we need some environment variables
set. This was not supported well in pre-MPI-2 implementations. For
MPICH1 I actually launch a script for the slaves so that we have the
proper setup before running the executable. Note that I had tried that
with Open MPI and it hit an internal error in orterun. This is not a
problem, since mpirun can set up everything we need. If you think it is
worthwhile I will download LAM and try it.

-----Original Message-----
From: Jeff Squyres [mailto:jsquyres_at_[hidden]]
Sent: Monday, January 29, 2007 7:54 PM
To: Open MPI Users
Subject: Re: [OMPI users] Scrambled communications using ssh starter on
multiple nodes.

Without analyzing your source, it's hard to say. I will say that OMPI
may send fragments out of order, but we do, of course, provide the same
message ordering guarantees that MPI mandates. So let me ask a few
leading questions:

- Are you using any wildcards in your receives, such as MPI_ANY_SOURCE
or MPI_ANY_TAG?

- Have you run your code through a memory-checking debugger such as
valgrind?

- I don't know what Scali MPI uses, but MPICH and Intel MPI use integers
for MPI handles. Have you tried LAM/MPI as well? It, like Open MPI,
uses pointers for MPI handles. I mention this because apps that
unintentionally have code that takes advantage of integer handles can
sometimes behave unpredictably when switching to a pointer-based MPI
implementation (see the sketch after these questions).

- What network interconnect are you using between the two hosts?
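
To illustrate what I mean about integer vs. pointer handles, here is a
minimal, purely hypothetical C sketch (not from any real application) of
the kind of code that happens to work with an integer-handle MPI but
breaks with a pointer-handle MPI:

#include <mpi.h>
#include <stdio.h>
#include <stdint.h>

int main(int argc, char **argv)
{
    int rank, saved;
    MPI_Comm comm;

    MPI_Init(&argc, &argv);

    /* BAD: stashing an MPI handle in a plain int.  With MPICH or Intel
     * MPI (MPI_Comm is an int) this just copies the handle; with a
     * pointer-based MPI such as Open MPI or LAM/MPI it truncates a
     * pointer on 64-bit systems. */
    saved = (int) (intptr_t) MPI_COMM_WORLD;

    /* ...much later, the handle is "reconstructed" and used... */
    comm = (MPI_Comm) (intptr_t) saved;

    /* Works by accident with integer handles; undefined behavior with
     * pointer handles. */
    MPI_Comm_rank(comm, &rank);
    printf("rank %d\n", rank);

    MPI_Finalize();
    return 0;
}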

On Jan 25, 2007, at 4:22 PM, Fisher, Mark S wrote:

> Recently I wanted to try Open MPI for use with our CFD flow solver
> WINDUS. The code uses a master/slave methodology where the master
> handles I/O and issues tasks for the slaves to perform. The original
> parallel implementation was done in 1993 using PVM, and in 1999 we
> added support for MPI.
>
> When testing the code with Open MPI 1.1.2 it ran fine when running on
> a single machine. As soon as I ran on more than one machine I started
> getting random errors right away (the attached tar ball has a good and
> a bad output). It looked like either the messages were out of order or
> were meant for the other slave process. In the run mode used there is
> no slave-to-slave communication. In the attached file the code died
> near the beginning of the communication between master and slave.
> Sometimes it will run further before it fails.
>
> I have included a tar file with the build and configuration info. The
> two nodes are identical Xeon 2.8 GHz machines running SLED 10. I am
> running real-time (no queue) using the ssh starter with the following
> appfile:
>
> -x PVMTASK -x BCFD_PS_MODE --mca pls_rsh_agent /usr/bin/ssh --host skipper2 -wdir /opt/scratch/m209290/ol.scr.16348 -np 1 ./__bcfdbeta.exe
> -x PVMTASK -x BCFD_PS_MODE --mca pls_rsh_agent /usr/bin/ssh --host copland -wdir /tmp/mpi.m209290 -np 2 ./__bcfdbeta.exe
>
> The above appfile fails, but the following one works:
>
> -x PVMTASK -x BCFD_PS_MODE --mca pls_rsh_agent /usr/bin/ssh --host skipper2 -wdir /opt/scratch/m209290/ol.scr.16348 -np 1 ./__bcfdbeta.exe
> -x PVMTASK -x BCFD_PS_MODE --mca pls_rsh_agent /usr/bin/ssh --host skipper2 -wdir /tmp/mpi.m209290 -np 2 ./__bcfdbeta.exe
>
> The first process is the master and the second two are the slaves. I
> am not sure what is going wrong; the code runs fine with many other
> MPI distributions (MPICH1/2, Intel, Scali...). I assume that either I
> built it wrong or am not running it properly, but I cannot see what I
> am doing wrong. Any help would be appreciated!
>
> <<mpipb.tgz>>

--
Jeff Squyres
Server Virtualization Business Unit
Cisco Systems
_______________________________________________
users mailing list
users_at_[hidden]
http://www.open-mpi.org/mailman/listinfo.cgi/users