A common approach in this situation is for a user to download the OMPI tarball, install it in their own home directory, and then link R etc. to the updated version. This avoids affecting everyone else on the system and is a low-risk way to see whether the update fixes the problem.
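As a rough sketch of what a home-directory install might look like: this assumes the tarball has already been downloaded, and the variable names OMPI_TARBALL and OMPI_PREFIX, the paths, and the version in the file name are placeholders chosen for illustration, not anything specific to your cluster.

```shell
# Sketch only: build Open MPI under $HOME and point this shell at it.
# OMPI_TARBALL and OMPI_PREFIX are hypothetical names for this example;
# the version in the file name is just a placeholder.
OMPI_TARBALL="${OMPI_TARBALL:-$HOME/src/openmpi-1.6.5.tar.gz}"
OMPI_PREFIX="${OMPI_PREFIX:-$HOME/opt/openmpi}"

if [ -f "$OMPI_TARBALL" ]; then
    build_dir=$(mktemp -d)
    tar -xzf "$OMPI_TARBALL" -C "$build_dir"
    # The tarball unpacks into a single top-level openmpi-<version> directory.
    (cd "$build_dir"/openmpi-* \
        && ./configure --prefix="$OMPI_PREFIX" \
        && make all install)
else
    echo "tarball not found at $OMPI_TARBALL; skipping build" >&2
fi

# Put the private install ahead of the system one for this session.
PATH="$OMPI_PREFIX/bin:$PATH"
LD_LIBRARY_PATH="$OMPI_PREFIX/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
export PATH LD_LIBRARY_PATH
```

Source the file rather than executing it so the exported paths persist in your shell. Rmpi would then need to be rebuilt against this prefix (check its install notes for the configure arguments that set the MPI location), and the same install has to be visible on every node, which is usually the case if home directories are shared across the cluster.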
On Feb 6, 2014, at 10:23 AM, Ross Boylan <ross_at_[hidden]> wrote:
> On 2/6/2014 3:24 AM, Jeff Squyres (jsquyres) wrote:
>> Have you tried upgrading to a newer version of Open MPI? The 1.4.x series is several generations old. Open MPI 1.7.4 was just released yesterday.
> It's on a cluster running Debian squeeze, with perhaps some upgrades to wheezy coming. However, even wheezy is at 1.4.5 (the next generation is currently at 1.6.5). I don't administer the cluster, and upgrading basic infrastructure seems somewhat hazardous.
> I checked for backports of a more recent version (at backports.debian.org) but there don't seem to be any for squeeze or wheezy.
> Can we mix later and earlier versions of MPI? The documentation at http://www.open-mpi.org/software/ompi/versions/ seems to indicate that 1.4, 1.6 and 1.7 would all be binary incompatible, though 1.5 and 1.6, or 1.7 and 1.8 would be compatible. However, point 10 of the FAQ (http://www.open-mpi.org/faq/?category=sysadmin#new-openmpi-version) seems to say compatibility is broader.
> Also, the documents don't seem to address on-the-wire compatibility; that is, if nodes are on different versions, can they work together reliably?
>> On Feb 5, 2014, at 9:58 PM, Ross Boylan <ross_at_[hidden]> wrote:
>>> On 1/31/2014 1:08 PM, Ross Boylan wrote:
>>>> I am getting the following error, amidst many successful message sends:
>>>> [n10][[50048,1],1][../../../../../../ompi/mca/btl/tcp/btl_tcp_frag.c:118:mca_btl_tcp_frag_send] mca_btl_tcp_frag_send: writev error (0x7f6155970038, 578659815)
>>>> Bad address(1)
>>> I think I've tracked down the immediate cause: I was sending a very large object (from R--I assume serialized into a byte stream) that was over 3G. I'm not sure why it would produce that error, but it doesn't seem that surprising that something would go wrong.
>>>> Any ideas about what is going on or what I can do to fix it?
>>>> I am using the openmpi-bin 1.4.2-4 Debian package on a cluster running Debian squeeze.
>>>> I couldn't find a config.log file; there is /etc/openmpi/openmpi-mca-params.conf, which is completely commented out.
>>>> Invocation is from R 3.0.1 (debian package) with Rmpi 0.6.3 built by me from source in a local directory. My sends all use mpi.isend.Robj and the receives use mpi.recv.Robj, both from the Rmpi library.
>>>> The jobs were started with rmpilaunch; it and the hosts file are included in the attachments. TCP connections. rmpilaunch leaves me in an R session on the master. I invoked the code inside the toplevel() function toward the bottom of dbox-master.R.
>>>> The program source files and other background information are in the attached file. n10 has the output of ompi_info --all, and n1011 has other info for both nodes that were active (n10 was master; n11 had some slaves).
>>>> users mailing list