Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Can't get a fully functional openmpi build on Mac OSX Mavericks
From: Ralph Castain (rhc_at_[hidden])
Date: 2014-01-17 15:29:15


We did update ROMIO at some point in there, so it is possible this is a ROMIO bug that we have picked up. I've asked someone to check upstream about it.

On Jan 17, 2014, at 12:02 PM, Ronald Cohen <rhcohen_at_[hidden]> wrote:

> Sorry, too many entries in this thread, I guess. My general goal is to get a working parallel hdf5 with openmpi on Mac OS X Mavericks. At one point in the saga I had romio disabled, which naturally doesn't work for hdf5 (which is trying to read/write files in parallel). So the hdf5 tests would of course fail. I subsequently had link errors with hdf5 because I was building openmpi with --disable-static, whereas the default (and recommended) option for hdf5 is to disable shared and build static. My most recent attempts were with openmpi configured with --enable-static and --disable-dlopen. In that case, with openmpi 1.7.4rc1, hdf5 1.8.12 configured and built successfully, but make check-p produced many errors in its t_mpi tests, with messages like "proc 4: found data error at [2140143616+0], expect -7, got 6". The errors were reproduced by the HDF5 testing team with openmpi 1.7.4rc1, but not with 1.7.3 (which I am now building).
>
> Hopefully that is an adequate summary.
>
> Ron
>
>
>
> On Fri, Jan 17, 2014 at 11:44 AM, Jeff Squyres (jsquyres) <jsquyres_at_[hidden]> wrote:
> Can you specify exactly which issue you're referring to?
>
> - test failing when you had ROMIO disabled
> - test (sometimes) failing when you had ROMIO enabled
> - compiling / linking issues
>
> ?
>
>
> On Jan 17, 2014, at 1:50 PM, Ronald Cohen <rhcohen_at_[hidden]> wrote:
>
> > Hello Ralph and others, I just got the following back from the HDF-5 support group, suggesting an ompi bug. So I should either try 1.7.3 or a recent nightly 1.7.4. Will likely opt for 1.7.3, but hopefully someone at openmpi can look at the problem for 1.7.4. In short, the challenge is to get a parallel hdf5 that passes make check-p with 1.7.4.
> >
> >
> >
> >
> >
> > ------------------
> > Hi Ron,
> >
> > I had sent your message to the developer and he can reproduce the issue.
> > Here is what he says:
> >
> > ---
> > I replicated this on Jam with ompi 1.7.4rc1. I saw the same error he is seeing.
> > Note that this is an un-stable release for ompi.
> > I tried ompi 1.7.3 (feature - little more stable release). I didn't see the
> > problems there.
> >
> > So this is an ompi bug. He can report it to the ompi list. He can just point
> > them to the t_mpi.c tests in our test suite in testpar/ and say it occurs with
> > their 1.7.4 rc1.
> > ---
> >
> > -Barbara
> >
> >
> >
> > On Fri, Jan 17, 2014 at 9:39 AM, Ronald Cohen <rhcohen_at_[hidden]> wrote:
> > Thanks, I've just gotten an email with some suggestions (and promise of more help) from the HDF5 support team. I will report back here, as it may be of interest to others trying to build hdf5 on mavericks.
> >
> >
> > On Fri, Jan 17, 2014 at 9:08 AM, Ralph Castain <rhc_at_[hidden]> wrote:
> > Afraid I have no idea, but hopefully someone else here with experience with HDF5 can chime in?
> >
> >
> > On Jan 17, 2014, at 9:03 AM, Ronald Cohen <rhcohen_at_[hidden]> wrote:
> >
> >> Still a timely response, thank you. The particular problem I noted hasn't recurred; for reasons I will explain shortly I had to rebuild openmpi again, and this time Sample_mpio.c compiled and ran successfully from the start.
> >>
> >> But now my problem is trying to get parallel HDF5 to run. In my first attempt to build HDF5 it failed in the load stage because of unsatisfied externals from openmpi, and I deduced the problem was having built openmpi with --disable-static. So I rebuilt with --enable-static and --disable-dlopen (emulating a successful openmpi + hdf5 combination I had built on Snow Leopard). Once again openmpi passed its make checks, and as noted above the Sample_mpio.c test compiled and ran fine. And the parallel hdf5 configure and make steps ran successfully. But when I ran make check for hdf5, the serial tests passed but none of the parallel tests did. Over a million test failures! Error messages like:
> >>
> >> Proc 0: *** MPIO File size range test...
> >> --------------------------------
> >> MPI_Offset is signed 8 bytes integeral type
> >> MPIO GB file write test MPItest.h5
> >> MPIO GB file write test MPItest.h5
> >> MPIO GB file write test MPItest.h5
> >> MPIO GB file write test MPItest.h5
> >> MPIO GB file write test MPItest.h5
> >> MPIO GB file write test MPItest.h5
> >> MPIO GB file read test MPItest.h5
> >> MPIO GB file read test MPItest.h5
> >> MPIO GB file read test MPItest.h5
> >> MPIO GB file read test MPItest.h5
> >> proc 3: found data error at [2141192192+0], expect -6, got 5
> >> proc 3: found data error at [2141192192+1], expect -6, got 5
> >>
> >> Also, the specific errors reported (which processor, which location) and the total number of errors change each time I rerun make check.
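
The failing offsets in the log above are easier to read once converted to 1 MB block numbers. The sketch below is plain shell arithmetic. The function name decode_offset is made up for illustration, and the idea that the test expects each block to hold its block index truncated to a signed byte is an assumption inferred from the numbers, not something confirmed in this thread.

```shell
# Hypothetical helper: convert a byte offset from a "found data error" line
# into its 1 MB block index, and show that index as a signed 8-bit value.
# ASSUMPTION: the test stores the block index in a signed byte.
decode_offset() {
    off=$1
    mb=$(( off / 1048576 ))        # 1 MB = 1048576 bytes
    byte=$(( mb % 256 ))           # keep only the low 8 bits
    if [ "$byte" -ge 128 ]; then   # reinterpret as a signed byte
        byte=$(( byte - 256 ))
    fi
    echo "$mb $byte"
}

decode_offset 2141192192   # prints "2042 -6"
decode_offset 2140143616   # prints "2041 -7"
```

Under that assumption, the expected values -6 and -7 in the reports line up with blocks 2042 and 2041, both just below the 2 GB (2048 MB) mark.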
> >>
> >> I've sent configure, make and make check logs to the HDF5 help desk but haven't gotten a response.
> >>
> >> I am now configuring openmpi (still 1.7.4rc1) with:
> >>
> >> ./configure --prefix=/usr/local/openmpi CC=gcc CXX=g++ FC=gfortran F77=gfortran --enable-static --with-pic --disable-dlopen --enable-mpirun-prefix-by-default
> >>
> >> and configuring HDF5 (version 1.8.12) with:
> >>
> >> configure --prefix=/usr/local/hdf5/par CC=mpicc CFLAGS=-fPIC FC=mpif90 FCFLAGS=-fPIC CXX=mpicxx CXXFLAGS=-fPIC --enable-parallel --enable-fortran
> >>
> >> This is the combination that worked for me with Snow Leopard (though with earlier versions of both openmpi and hdf5).
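
The two build pitfalls described in this thread (ROMIO disabled, and a shared-only Open MPI under a static HDF5) can be caught before running configure. This is a minimal sketch, assuming a hypothetical helper named check_ompi_flags. It only looks for the two flags mentioned in this thread and knows nothing about any other configure option.

```shell
# Hypothetical helper: scan Open MPI configure arguments for the two flags
# that caused trouble in this thread. Returns non-zero if either is present.
check_ompi_flags() {
    rc=0
    for flag in "$@"; do
        case "$flag" in
            --disable-io-romio)
                # Parallel HDF5 relies on MPI-IO, which ROMIO provides here.
                echo "warning: $flag leaves out ROMIO, which parallel HDF5 needs"
                rc=1
                ;;
            --disable-static)
                # HDF5 builds static by default, so it wants static MPI libraries.
                echo "warning: $flag may cause link errors against a static HDF5 build"
                rc=1
                ;;
        esac
    done
    return $rc
}

# Example: the flag set quoted above passes silently.
check_ompi_flags --prefix=/usr/local/openmpi --enable-static --with-pic --disable-dlopen
```

Calling it with the earlier flag set, for example `check_ompi_flags --disable-static --disable-io-romio`, would print both warnings and return a non-zero status.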
> >>
> >> If it matters, the gcc is the stock one with Mavericks' XCode, and gfortran is 4.9.0.
> >>
> >> (I just noticed that the mpi fortran wrapper is now mpifort, but I also see that mpif90 is still there and is just a link to mpifort.)
> >>
> >> Any suggestions?
> >>
> >>
> >> On Fri, Jan 17, 2014 at 8:14 AM, Ralph Castain <rhc_at_[hidden]> wrote:
> >> Sorry for the delayed response - just getting back from travel. I don't know why you would get that behavior other than a race condition. Afraid that code path is foreign to me, but perhaps one of the folks in the MPI-IO area can respond.
> >>
> >>
> >> On Jan 15, 2014, at 4:26 PM, Ronald Cohen <rhcohen_at_[hidden]> wrote:
> >>
> >>> Update: I reconfigured with enable_io_romio=yes, and this time -- mostly -- the test using Sample_mpio.c passes. Oddly the very first time I tried I got errors:
> >>>
> >>> % mpirun -np 2 sampleio
> >>> Proc 1: hostname=Ron-Cohen-MBP.local
> >>> Testing simple C MPIO program with 2 processes accessing file ./mpitest.data
> >>> (Filename can be specified via program argument)
> >>> Proc 0: hostname=Ron-Cohen-MBP.local
> >>> Proc 1: read data[0:1] got 0, expect 1
> >>> Proc 1: read data[0:2] got 0, expect 2
> >>> Proc 1: read data[0:3] got 0, expect 3
> >>> Proc 1: read data[0:4] got 0, expect 4
> >>> Proc 1: read data[0:5] got 0, expect 5
> >>> Proc 1: read data[0:6] got 0, expect 6
> >>> Proc 1: read data[0:7] got 0, expect 7
> >>> Proc 1: read data[0:8] got 0, expect 8
> >>> Proc 1: read data[0:9] got 0, expect 9
> >>> Proc 1: read data[1:0] got 0, expect 10
> >>> Proc 1: read data[1:1] got 0, expect 11
> >>> Proc 1: read data[1:2] got 0, expect 12
> >>> Proc 1: read data[1:3] got 0, expect 13
> >>> Proc 1: read data[1:4] got 0, expect 14
> >>> Proc 1: read data[1:5] got 0, expect 15
> >>> Proc 1: read data[1:6] got 0, expect 16
> >>> Proc 1: read data[1:7] got 0, expect 17
> >>> Proc 1: read data[1:8] got 0, expect 18
> >>> Proc 1: read data[1:9] got 0, expect 19
> >>> --------------------------------------------------------------------------
> >>> MPI_ABORT was invoked on rank 1 in communicator MPI_COMM_WORLD
> >>> with errorcode 1.
> >>>
> >>> But when I reran the same mpirun command, the test was successful. After deleting the executable, recompiling, and running the same mpirun command again, the test was also successful. Can someone explain that?
> >>>
> >>>
> >>>
> >>>
> >>> On Wed, Jan 15, 2014 at 1:16 PM, Ronald Cohen <rhcohen_at_[hidden]> wrote:
> >>> Aha. I guess I didn't know what the io-romio option does. If you look at my config.log you will see my configure line included --disable-io-romio. Guess I should change --disable to --enable.
> >>>
> >>> You seem to imply that the nightly build is stable enough that I should probably switch to that rather than 1.7.4rc1. Am I reading between the lines correctly?
> >>>
> >>>
> >>>
> >>> On Wed, Jan 15, 2014 at 10:56 AM, Ralph Castain <rhc_at_[hidden]> wrote:
> >>> Oh, a word of caution on those config params - you might need to check to ensure I don't disable romio in them. I don't normally build it as I don't use it. Since that is what you are trying to use, just change the "no" to "yes" (or delete that line altogether) and it will build.
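
One way to confirm whether a given install actually has ROMIO is to look for the io component in the output of ompi_info. This is a rough sketch, assuming the component shows up on a line of the form "MCA io: romio (...)". The exact layout may differ between versions, and has_romio is a hypothetical helper name.

```shell
# Hypothetical helper: reads ompi_info-style output on stdin and succeeds
# if a ROMIO io component is listed.
# ASSUMPTION: the component appears on a line like "MCA io: romio (...)".
has_romio() {
    grep -Eq 'MCA[[:space:]]+io:[[:space:]]+romio'
}

# Typical use against a real install (not run here):
#   ompi_info | has_romio && echo "ROMIO present" || echo "ROMIO missing"
```

If the check reports ROMIO missing, reconfiguring without --disable-io-romio, as described above, would be the first thing to try.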
> >>>
> >>>
> >>>
> >>> On Wed, Jan 15, 2014 at 10:53 AM, Ralph Castain <rhc_at_[hidden]> wrote:
> >>> You can find my configure options in the OMPI distribution at contrib/platform/intel/bend/mac. You are welcome to use them - just configure --with-platform=intel/bend/mac
> >>>
> >>> I work on the developer's trunk, of course, but also run the head of the 1.7.4 branch (essentially the nightly tarball) on a fairly regular basis.
> >>>
> >>> As for the opal_bitmap test: it wouldn't surprise me if that one was stale. I can check on it later tonight, but I'd suspect that the test is bad as we use that class in the code base and haven't seen an issue.
> >>>
> >>>
> >>>
> >>> On Wed, Jan 15, 2014 at 10:49 AM, Ronald Cohen <rhcohen_at_[hidden]> wrote:
> >>> Ralph,
> >>>
> >>> I just sent out another post with the c file attached.
> >>>
> >>> If you can get that to work, and even if you can't, can you tell me what configure options you use, and what version of open-mpi? Thanks.
> >>>
> >>> Ron
> >>>
> >>>
> >>> On Wed, Jan 15, 2014 at 10:36 AM, Ralph Castain <rhc_at_[hidden]> wrote:
> >>> BTW: could you send me your sample test code?
> >>>
> >>>
> >>> On Wed, Jan 15, 2014 at 10:34 AM, Ralph Castain <rhc_at_[hidden]> wrote:
> >>> I regularly build on Mavericks and run without problem, though I haven't tried a parallel IO app. I'll give yours a try later, when I get back to my Mac.
> >>>
> >>>
> >>>
> >>> On Wed, Jan 15, 2014 at 10:04 AM, Ronald Cohen <rhcohen_at_[hidden]> wrote:
> >>> I have been struggling trying to get a usable build of openmpi on Mac OSX Mavericks (10.9.1). I can get openmpi to configure and build without error, but have problems after that which depend on the openmpi version.
> >>>
> >>> With 1.6.5, make check fails the opal_datatype_test, ddt_test, and ddt_raw tests. The various atomic_* tests pass. See checklogs_1.6.5, attached as a .gz file.
> >>>
> >>> Following suggestions from openmpi discussions I tried openmpi version 1.7.4rc1. In this case make check indicates all tests passed. But when I proceeded to try to build a parallel code (parallel HDF5) it failed. Following an email exchange with the HDF5 support people, they suggested I try to compile and run the attached bit of simple code Sample_mpio.c (which they supplied) which does not use any HDF5, but just attempts a parallel write to a file and parallel read. That test failed when requesting more than 1 processor -- which they say indicates a failure of the openmpi installation. The error message was:
> >>>
> >>> MPI_INIT: argc 1
> >>> MPI_INIT: argc 1
> >>> Testing simple C MPIO program with 2 processes accessing file ./mpitest.data
> >>> (Filename can be specified via program argument)
> >>> Proc 0: hostname=Ron-Cohen-MBP.local
> >>> Proc 1: hostname=Ron-Cohen-MBP.local
> >>> MPI_BARRIER[0]: comm MPI_COMM_WORLD
> >>> MPI_BARRIER[1]: comm MPI_COMM_WORLD
> >>> Proc 0: MPI_File_open with MPI_MODE_EXCL failed (MPI_ERR_FILE: invalid file)
> >>> MPI_ABORT[0]: comm MPI_COMM_WORLD errorcode 1
> >>> MPI_BCAST[1]: buffer 7fff5a483048 count 1 datatype MPI_INT root 0 comm MPI_COMM_WORLD
> >>>
> >>> I then went back to my openmpi directories and tried running some of the individual tests in the test and examples directories. In particular, in test/class I found one test that seems not to be run as part of make check and which failed, even with one processor; this is opal_bitmap. Not sure if this is because 1.7.4rc1 is incomplete, or there is something wrong with the installation, or maybe a 32 vs 64 bit thing? The error message is
> >>>
> >>> mpirun detected that one or more processes exited with non-zero status, thus causing the job to be terminated. The first process to do so was:
> >>>
> >>> Process name: [[48805,1],0]
> >>> Exit code: 255
> >>>
> >>> Any suggestions?
> >>>
> >>> More generally has anyone out there gotten an openmpi build on Mavericks to work with sufficient success that they can get the attached Sample_mpio.c (or better yet, parallel HDF5) to build?
> >>>
> >>> Details: Running Mac OS X 10.9.1 on a mid-2009 Macbook pro with 4 GB memory; tried openmpi 1.6.5 and 1.7.4rc1. Built openmpi against the stock gcc that comes with XCode 5.0.2, and gfortran 4.9.0.
> >>>
> >>> Files attached: config.log.gz, openmpialllog.gz (output of running ompi_info --all), checklog2.gz (output of make check in top openmpi directory).
> >>>
> >>> I am not attaching logs of make and install since those seem to have been successful, but can generate those if that would be helpful.
> >>>
> >>> _______________________________________________
> >>> users mailing list
> >>> users_at_[hidden]
> >>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
>
> --
> Jeff Squyres
> jsquyres_at_[hidden]
> For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/