Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] mpi.isend still not working (was trying to use personal copy of 1.7.4--solved)
From: Ross Boylan (ross_at_[hidden])
Date: 2014-04-23 16:45:34


On Wed, 2014-04-23 at 13:05 -0400, Hao Yu wrote:
> Hi Ross,
>
> Sorry for getting back to you so late on this issue. After finishing my course, I
> am working on Rmpi 0.6-4 to be released soon to CRAN.
>
> I did a few tests like yours and indeed I was able to produce some
> deadlocks whenever mpi.isend.Robj is used. Later on I traced it to some
> kind of race condition. If you use mpi.test to test whether mpi.isend.Robj
> finishes its job or not, this deadlock may be avoided. I did something like
> mpi.isend.Robj(r,1,4,request=0)
> mpi.test(0)
>
> If mpi.test(0) returns FALSE and I run
> mpi.send.Robj(r2,1,4)
>
> then I get no prompt. If mpi.test(0) returns TRUE, then
> mpi.send.Robj(r2,1,4)
>
> is OK. So, if any nonblocking calls are used, one must use mpi.test or
> mpi.wait to check if they are complete before trying any blocking calls.
That sounds like a different problem than the one I encountered. The
system did get hung up, but the reason was that processes received
corrupted R objects, threw an error, and stopped responding.

The root of my problem was that objects got garbage collected before the
isend completed. This will happen regardless of subsequent R-level
calls (e.g., to mpi.test). The object to be transmitted is serialized
and passed to C, but when the call returns there are no R references to
the object--that is, the serialized version of the object--and so it is
subject to garbage collection.

I'd be happy to provide my modifications to get around this. Although
they worked for me, they are not really suitable for general use. There
are 2 main issues: first, I ignored the asynchronous receive since I
didn't use it. Since MPI request objects are used for both sending and
receiving, I suspect that mixing irecv's in with code doing isends would
not work right. I don't think there's any reason in principle the
handling of isend's couldn't be extended to include irecv's; I just didn't
do it. I also did not put the hooks for the new stuff in the calls that
reset the maximum number of requests.

The second issue is that my fix changed the interface to a slightly
higher level of abstraction. Request objects and numbers are managed
more by Rmpi than by the user. Rmpi keeps references to
the serialized objects around as long as the request is outstanding. For
example, the revised mpi.isend does not take a request number; the
function works out one and returns it. And in general the calls do more
than simply call the corresponding C function.
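
To make the idea concrete, here is a minimal sketch in C of that kind of
bookkeeping. It is not the code from my modified Rmpi: the names
sketch_isend, sketch_test, sketch_outstanding and MAX_REQUESTS are made up
for illustration, it retains a private copy of the bytes rather than an R
reference, and the "completed" argument stands in for the flag that
MPI_Test would report. The point is only that the library picks the
request number and keeps the payload alive until the request is known to
be complete.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define MAX_REQUESTS 4   /* hypothetical table size */

/* One slot per outstanding nonblocking send. */
typedef struct {
    int in_use;
    unsigned char *payload;  /* retained copy of the serialized object */
    size_t len;
} send_slot;

static send_slot slots[MAX_REQUESTS];

/* Start a "send": pick a free request number, retain the payload, and
 * return the number (or -1 if every slot is busy or memory runs out).
 * A real version would hand slots[i].payload to MPI_Isend here. */
int sketch_isend(const void *data, size_t len)
{
    assert(data != NULL);
    for (int i = 0; i < MAX_REQUESTS; i++) {
        if (!slots[i].in_use) {
            unsigned char *copy = malloc(len > 0 ? len : 1);
            if (copy == NULL)
                return -1;
            memcpy(copy, data, len);
            slots[i].in_use = 1;
            slots[i].payload = copy;
            slots[i].len = len;
            return i;
        }
    }
    return -1;
}

/* Check request `req`. `completed` is a stand-in for the completion flag
 * a real MPI_Test call would produce. Returns 1 if the payload was
 * released, 0 if the request is still pending or the number is invalid. */
int sketch_test(int req, int completed)
{
    if (req < 0 || req >= MAX_REQUESTS || !slots[req].in_use)
        return 0;
    if (!completed)
        return 0;            /* still in flight: keep the payload alive */
    free(slots[req].payload);
    slots[req].payload = NULL;
    slots[req].len = 0;
    slots[req].in_use = 0;
    return 1;
}

/* Number of sends whose payloads are still being retained. */
int sketch_outstanding(void)
{
    int n = 0;
    for (int i = 0; i < MAX_REQUESTS; i++)
        n += slots[i].in_use;
    return n;
}
```

A caller would keep the number returned by sketch_isend and pass it to
sketch_test later; until that reports completion, the payload stays
allocated no matter what else the caller does with its own copy.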

Ross Boylan
>
> Hao
>
>
> Ross Boylan wrote:
> > I changed the calls to dlopen in Rmpi.c so that it tried libmpi.so
> > before libmpi.so.0. I also rebuilt MPI, R, and Rmpi as suggested
> > earlier by Bennet Fauber
> > (http://www.open-mpi.org/community/lists/users/2014/03/23823.php).
> > Thanks Bennet!
> >
> > My theory is that the change to dlopen by itself was sufficient. The
> > rebuilding done before (by others) may have worked because they made the
> > load of libmpi.so.0 fail. That's not a great theory since a) if there
> > was no libmpi.so.0 on the system it would fail anyway and b) dlopen
> > could probably find libmpi.so.0 in standard system locations regardless
> > of how R was built or how LD_LIBRARY_PATH was set up (assuming it didn't find it
> > in a custom place first).
> >
> > Which brings me back to my original problem: mpi.isend.Robj (or possibly
> > mpi.recv.Robj on the other end) did not seem to be working properly. I
> > had hoped switching to a newer MPI library (1.7.4) would fix this; if
> > anything, it made it worse. I am sending to a fake receiver (at rank 1)
> > that does nothing but print a message when it gets a message. r is a
> > list with
> >> length(serialize(r, NULL)) # the mpi.isend.Robj R function serializes the object and then mpi.isend's it.
> > [1] 599499 # ~ 0.5 MB
> >> mpi.send.Robj(1, 1, 4) # send of number works
> > Fake Assembler: 0 4 numeric
> >> mpi.send.Robj(r, 1, 4) # send of r works
> > NULL
> >> Fake Assembler: 0 4 list
> > mpi.isend.Robj(1, 1, 4) # isend of number works
> >> Fake Assembler: 0 4 numeric
> > mpi.isend.Robj(r, 1, 4) # sometimes this used to work the first time
> >> mpi.send.Robj(r, 1, 4) # sometimes used to get previous message unstuck
> > # never get the command prompt back
> > # presumably mpi.send, the C function, does not return.
> >
> > I might just switch to mpi.send, though the fact that something is going
> > wrong makes me nervous.
> >
> > Obviously given the involvement of R it's not clear the problem lies
> > with the MPI layer, but that seems at least a possibility.
> >
> > Ross
> > On Thu, 2014-03-13 at 12:15 -0700, Ross Boylan wrote:
> >> On Wed, 2014-03-12 at 10:52 -0400, Bennet Fauber wrote:
> >> > My experience with Rmpi and OpenMPI is that it doesn't seem to do well
> >> > with the dlopen or dynamic loading. I recently installed R 3.0.3, and
> >> > Rmpi, which failed when built against our standard OpenMPI but
> >> > succeeded using the following 'secret recipe'. Perhaps there is
> >> > something here that will be helpful for you.
> >> >
> >> I have a couple of things to report. First,
> >> http://www.stats.uwo.ca/faculty/yu/Rmpi/changelogs.htm says
> >> It looks like that the option --disable-dlopen is not necessary to
> >> install Open MPI 1.6, at least on Debian. This might be R's .onLoad
> >> correctly loading dynamic libraries and Open MPI is not required to be
> >> compiled with static libraries enabled.
> >>
> >> Second, I tried rebuilding MPI with --disable-dlopen WITHOUT any of the
> >> changes to R or Rmpi. The behavior didn't change. Nobody said it
> >> would, but I thought it was worth a try.
> >>
> >> Third, the source of the double-load of mpi-related libraries looks like
> >> this code in Rmpi.c:
> >> if (!dlopen("libmpi.so.0", RTLD_GLOBAL | RTLD_LAZY)
> >> && !dlopen("libmpi.so", RTLD_GLOBAL | RTLD_LAZY)){
> >> So libmpi.so.1 is loaded because it's linked to Rmpi.so, and libmpi.so.0
> >> is loaded because the code does so explicitly.
> >>
> >> The motivation is given in the notes at
> >> http://www.stats.uwo.ca/faculty/yu/Rmpi/changelogs.htm:
> >> ----------------------------------
> >> 2007-10-24, version 0.5-5:
> >>
> >> dlopen has been used to load libmpi.so explicitly. This is mainly useful
> >> for Rmpi under OpenMPI where one might see many error messages:
> >> mca: base: component_find: unable to open osc pt2pt: file not found
> >> (ignored)
> >> if libmpi.so is not loaded with RTLD_GLOBAL flag.
> >> -------------------------------------
> >>
> >> I think I'll try changing it to try libmpi.so first so that it picks up
> >> libmpi.so.1 if available. I've already rebuilt R, though it looks as if
> >> Rmpi may have been the source of the problems.
> >>
> >> Ross
> >> > ### Install openmpi 1.6.5
> >> >
> >> > export PREFIX=/scratch/support_flux/bennet/local
> >> > COMPILERS='CC=gcc CXX=g++ FC=gfortran F77=gfortran'
> >> > CONFIGURE_FLAGS='--disable-dlopen --enable-static'
> >> > cd openmpi-1.6.5
> >> > ./configure --prefix=${PREFIX} \
> >> > --mandir=${PREFIX}/man \
> >> > --with-tm=/usr/local/torque \
> >> > --with-openib --with-psm \
> >> > --with-io-romio-flags='--with-file-system=testfs+ufs+nfs+lustre' \
> >> > $CONFIGURE_FLAGS \
> >> > $COMPILERS
> >> > make
> >> > make check
> >> > make install
> >> >
> >> > ### Install R 3.0.3
> >> >
> >> > wget http://cran.case.edu/src/base/R-3/R-3.0.3.tar.gz
> >> > tar xzvf R-3.0.3.tar.gz
> >> > cd R-3.0.3
> >> >
> >> > export MPI_HOME=/scratch/support_flux/bennet/local
> >> > export LD_LIBRARY_PATH=$MPI_HOME/lib:${LD_LIBRARY_PATH}
> >> > export LD_LIBRARY_PATH=$MPI_HOME/openmpi:${LD_LIBRARY_PATH}
> >> > export PATH=${PATH}:${MPI_HOME}/bin
> >> > export LDFLAGS='-Wl,-O1'
> >> > export R_PAPERSIZE=letter
> >> > export R_INST=${PREFIX}
> >> > export FFLAGS='-O3 -mtune=native'
> >> > export CFLAGS='-O3 -mtune=native'
> >> > ./configure --prefix=${R_INST} --mandir=${R_INST}/man \
> >> > --enable-R-shlib --without-x
> >> > make
> >> > make check
> >> > make install
> >> > wget http://www.stats.uwo.ca/faculty/yu/Rmpi/download/linux/Rmpi_0.6-3.tar.gz
> >> > R CMD INSTALL Rmpi_0.6-3.tar.gz \
> >> > --configure-args="--with-Rmpi-include=$MPI_HOME/include --with-Rmpi-libpath=$MPI_HOME/lib --with-Rmpi-type=OPENMPI"
> >> >
> >> > Make sure environment variables and paths are set
> >> >
> >> > MPI_HOME=/home/software/rhel6/openmpi-1.6.5/gcc-4.4.7-static
> >> > PATH=/home/software/rhel6/openmpi-1.6.5/gcc-4.4.7-static/bin
> >> > LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:/home/software/rhel6/openmpi-1.6.5/gcc-4.4.7-static/lib
> >> > LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:/home/software/rhel6/openmpi-1.6.5/gcc-4.4.7-static/lib/openmpi
> >> > PATH=/home/software/rhel6/R/3.0.3/bin:${PATH}
> >> > LD_LIBRARY_PATH=/home/software/rhel6/R/3.0.3/lib64/R/lib:${LD_LIBRARY_PATH}
> >> >
> >> > ## Then install snow with
> >> > R
> >> > > install.packages('snow')
> >> > [ . . . .
> >> >
> >> >
> >> > I think the key thing is the --disable-dlopen, though it might require
> >> > both. Jeff Squyres had a post about this quite a while ago that gives
> >> > more detail about what's happening:
> >> >
> >> > http://www.open-mpi.org/community/lists/devel/2012/04/10840.php
> >> >
> >> > -- bennet
> >>
> >>
> >
> >
>
>