Thank you for taking the time to compose the detailed explanation. It gives me a better understanding of the underlying plumbing, which I appreciate.
Bottom line: the update to r19377 appears to have resolved the truncate problem. While I have tested on only a limited number of hosts, it seems to behave as expected. Thanks!! Tom
--- On Mon, 8/18/08, George Bosilca <bosilca_at_[hidden]> wrote:
From: George Bosilca <bosilca_at_[hidden]>
Subject: Re: [OMPI users] MPI_ERR_TRUNCATE with MPI_Revc without Infinipath
To: "Open MPI Users" <users_at_[hidden]>
Cc: "Tom Riddle" <rarebitusa_at_[hidden]>
Date: Monday, August 18, 2008, 4:16 PM
This makes perfect sense. However, the fact that one of the network
devices (BTL in Open MPI terms) is not available at runtime should not
modify the behavior of the application. At least this is the theory :)
Changing from named receives to unnamed ones definitely modifies the
signature (i.e. communication pattern) of the application, and might
in most cases introduce mismatches if the same tag is used. However,
with osu_latency there are only two ranks involved in the
communication (rank 0 and 1), so the communication pattern should stay
the same whether you use ANY_SOURCE or not, as the MPI standard
enforces message ordering.
Now, let me explain a little bit of the internal black magic behind
Open MPI. When we discover that a BTL is overloaded, we reroute the
new messages into a local "pending" queue until some space on the
device becomes available. Once we start book-keeping messages, we still
have to enforce the MPI logical ordering, so all new messages will
go into the "pending" queue until the device is capable of
sending data again, and then the messages will be delivered in order
to their respective destinations. What might happen, and this is only
speculation at this point, is that somehow a message bypasses this
"pending" queue and goes onto the wire too early. As this message
has the same tag, Open MPI might match it when the message arrives at
the destination, and can generate a TRUNCATE error if this message
belongs to the next loop in the osu_latency benchmark. As you can see,
there are many ifs in the previous paragraph, so let's assume for now
that this is just pure speculation. Please upgrade to the latest
version of Open MPI, and if you encounter the same problem then we
will try to dig a little bit deeper into this "speculation".
On Aug 19, 2008, at 12:36 AM, Tom Riddle wrote:
> Thanks George, I will update and try the latest repo. However I'd
> like to describe our usage case a bit more to see if there is
> something that may not be proper in our development approach.
> Forgive me if this is repetitious...
> We have configured and built OpenMPI originally on a machine with
> Infinipath / PSM installed. Since we desire a flexible software
> development environment across a number of machines (most of them
> are without the Infinipath hw), we run these same OpenMPI bins in a
> shared user area. That means other developers' machines, which do
> not have Infinipath / PSM installed locally, will simulate the
> multiple machine communication by running in shared memory mode.
> But again these OpenMPI bins have been configured with Infinipath.
> We see the error when running in shared memory mode on machines
> that don't have Infinipath. Is there a way at runtime to
> force shared memory mode exclusively? We are wondering if
> designating MPI_ANY_SOURCE may then direct OpenMPI to look at every
> possible communications mode, which would probably cause conflicts
> if the PSM libs weren't present.
> Hope this makes sense, Tom
> Things were working without issue until we went to the wildcard
> MPI_ANY_SOURCE on our receives, but only on machines without
> Infinipath. I guess I wonder what the mechanism is in wildcard mode.
> --- On Sun, 8/17/08, George Bosilca <bosilca_at_[hidden]> wrote:
> From: George Bosilca <bosilca_at_[hidden]>
> Subject: Re: [OMPI users] MPI_ERR_TRUNCATE with MPI_Revc without
> To: rarebitusa_at_[hidden], "Open MPI Users"
> Date: Sunday, August 17, 2008, 2:42 PM
> I did the same modification as you on the osu_latency and the
> resulting application runs to completion. I don't get any TRUNCATE
> error messages. I'm using the latest version of Open MPI
> There was a bug that might be related to your problem but our commit
> log shows it was fixed by commit 18830 on July 9.
> On Aug 13, 2008, at 5:49 PM, Tom Riddle wrote:
> > Hi,
> > A bit more info wrt the question below. I have run other releases of
> > OpenMPI and they seem to be fine. The reason I need to run the
> > latest is because it supports valgrind fully.
> > openmpi-1.2.4
> > openmpi-1.3ar18303
> > TIA, Tom
> > --- On Tue, 8/12/08, Tom Riddle <rarebitusa_at_[hidden]> wrote:
> > Hi,
> > I am getting a curious error on a simple communications test. I have
> > altered the std mvapich osu_latency test to accept receives from any
> > source and I get the following error
> > [d013.sc.net:15455] *** An error occurred in MPI_Recv
> > [d013.sc.net:15455] *** on communicator MPI_COMM_WORLD
> > [d013.sc.net:15455] *** MPI_ERR_TRUNCATE: message truncated
> > [d013.sc.net:15455] *** MPI_ERRORS_ARE_FATAL (goodbye)
> > the code change was...
> > MPI_Recv(r_buf, size, MPI_CHAR, MPI_ANY_SOURCE, 1, MPI_COMM_WORLD,
> > &reqstat);
> > the command line I run was
> > > mpirun -np 2 ./osu_latency
> > Now I run this on 2 types of host machine configurations. One that
> > has Infinipath HCAs installed and another that doesn't. I run both
> > of these in shared memory mode, i.e. dual processes on the same node.
> > I have verified that when I am on the host with Infinipath I am
> > actually running the OpenMPI mpirun, not the mpi that comes with the
> > HCA.
> > I have built OpenMPI with psm support from a fairly recent svn pull
> > and run the same bins on both host machines... The config was as
> > follows:
> > > $ ../configure --prefix=/opt/wkspace/openmpi-1.3 CC=gcc CXX=g++
> > > --disable-mpi-f77 --enable-debug --enable-memchecker
> > > --with-psm=/usr/include --with-valgrind=/opt/wkspace/
> > > mpirun --version
> > mpirun (Open MPI) 1.4a1r18908
> > The error presents itself only on the host that does not have
> > Infinipath installed. I have combed through the mca args to see if
> > there is a setting I am missing but I cannot see anything obvious.
> > Any input would be appreciated. Thanks. Tom