
Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] CMAQ crashes with OpenMPI
From: Matthew Russell (mrussel2_at_[hidden])
Date: 2011-08-10 12:14:56


Hmm, I didn't know that. Is OS X's small stack something that can be
alleviated with "ulimit" in bash? Right now, I have my ulimit set to
unlimited. Does that limit carry over to Open MPI? (I might be wrong, but
doesn't MPI work over TCP, such that newly spawned processes on my host
wouldn't be affected by my bash settings?)
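One quick way to check whether the shell's limit actually reaches the launched processes, assuming mpirun is on the PATH, is to have each rank print its own stack limit:

```shell
# Stack limit in the launching shell
ulimit -s

# Stack limit as seen inside each launched rank
# (assumes the MacPorts mpirun is first on the PATH)
mpirun -np 2 sh -c 'ulimit -s'
```

If the ranks report a smaller value than the shell does, the limit is being reset somewhere between launch and exec.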

What is discouraging, and possibly related, is that one member of my research
group has to set, unset, and reset her ulimit on OS X (Snow Leopard) when
running this model statically. I haven't experienced the same, but it gives
me the impression that something on her computer or OS is very finicky.

I'm going to try re-building OpenMPI to ensure that everything I am using is
built with the same compiler (PGI), and then if (when) I run into this error
again, I'll run the debugger as you suggested below.
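To confirm that the rebuild really picks up the PGI-built MacPorts install rather than the Open MPI that ships with Lion, something like the following may help (a sketch, assuming the MacPorts mpirun/mpicc/ompi_info are the ones on the PATH):

```shell
# Which Open MPI binaries resolve first on the PATH?
which mpirun mpicc ompi_info

# Which compilers was this Open MPI build configured with?
# (ompi_info reports the build-time C/Fortran compilers)
ompi_info | grep -i compiler
```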

Thanks!

On Tue, Aug 9, 2011 at 5:27 PM, Barrett, Brian W <bwbarre_at_[hidden]> wrote:

> The error message looks like it's nowhere near an MPI function; I would
> guess that this is not an Open MPI problem but (particularly given your
> statements about Snow Leopard) a CMAQ problem. The easiest way to debug
> on OS X is to launch the application code in a debugger, something like:
>
> mpirun -np 2 xterm -e gdb <app>
>
> One thing that can get people on OS X is that the maximum stack size is
> extremely small compared to Linux. Fortran apps, in particular, can end
> up putting things on the stack which cause an overrun and all kinds of fun.
>
> Brian
>
> On 8/9/11 3:18 PM, "Ralph Castain" <rhc_at_[hidden]> wrote:
>
> >Also, please be aware that we haven't done any testing of OMPI on Lion,
> >so this is truly new ground.
> >
> >On Aug 9, 2011, at 3:00 PM, Doug Reeder wrote:
> >
> >
> >Matt,
> >Are you sure you are building against your MacPorts version of Open MPI
> >and not the one that ships w/ Lion? In the traceback, items 4-9 end w/
> >x86_64pg, from the PGI compiler. You said you are using pgf90 and pgcc,
> >but in the configure input it looks like gcc is being used on Lion.
> >
> >Doug Reeder
> >On Aug 9, 2011, at 1:49 PM, Matthew Russell wrote:
> >
> >
> >
> >Hi,
> >I'm trying to run CMAQ - an air quality model developed by the US EPA -
> >on a Mac (Lion) using OpenMPI (1.5.3) installed with MacPorts.
> >
> >I am able to run CMAQ in parallel, and am able to run small programs that
> >use OpenMPI.
> >
> >I set the OpenMPI environment variables to use pgf90/pgcc (10.9) as my
> >compiler. Using PGI because some of the code I need to build is fortran
> >77 ( *sigh* ), and for some other reasons.
> >
> >
> >The error I get is:
> >
> >/opt/local/lib/openmpi/bin/mpirun -v -machinefile
> >/Users/matt/cmaq/darwin11/scripts/cctm/machines8 -np 2
> >/Users/matt/cmaq/darwin11/scripts/cctm/CCTM_e1a_Darwin11_x86_64pg
> >[pontus:72547] *** Process received signal ***
> >[pontus:72547] Signal: Segmentation fault: 11 (11)
> >[pontus:72547] Signal code: Address not mapped (1)
> >[pontus:72547] Failing at address: 0x0
> >[pontus:72547] [ 0] 2 libsystem_c.dylib 0x00007fff91065cfa _sigtramp + 26
> >[pontus:72547] [ 1] 3 ??? 0x00007fff5fbe58ab 0x0 + 140734799698091
> >[pontus:72547] [ 2] 4 CCTM_e1a_Darwin11_x86_64pg 0x000000010003c89b distr_env_ + 971
> >[pontus:72547] [ 3] 5 CCTM_e1a_Darwin11_x86_64pg 0x000000010003cbe5 par_init_ + 565
> >[pontus:72547] [ 4] 6 CCTM_e1a_Darwin11_x86_64pg 0x0000000100032e1b MAIN_ + 219
> >[pontus:72547] [ 5] 7 CCTM_e1a_Darwin11_x86_64pg 0x00000001000016f6 main + 70
> >[pontus:72547] [ 6] 8 CCTM_e1a_Darwin11_x86_64pg 0x000000010000163a _start + 248
> >[pontus:72547] [ 7] 9 CCTM_e1a_Darwin11_x86_64pg 0x0000000100001541 start + 33
> >[pontus:72547] [ 8] 10 ??? 0x0000000000000001 0x0 + 1
> >[pontus:72547] *** End of error message ***
> >--------------------------------------------------------------------------
> >mpirun noticed that process rank 1 with PID 72547 on node
> >pontus.cee.carleton.ca exited on signal 11 (Segmentation fault: 11).
> >--------------------------------------------------------------------------
> >
> >
> >I don't expect anyone to know the solution from this brief error message,
> >but I was wondering if anyone has insight into how I might debug it.
> >I am too new to both Open MPI and CMAQ to get much out of this
> >traceback.
> >
> >I'm told by others in my research group that CMAQ with Open MPI on Linux
> >works fine, and that the error I'm getting is very similar to the error
> >others got when trying this on a Mac (Snow Leopard) with ifort, before
> >they gave up.
> >
> >OpenMPI was configured with:
> >configure.args --sysconfdir=${prefix}/etc/${name} \
> >
> > --includedir=${prefix}/include/${name} \
> > --bindir=${prefix}/lib/${name}/bin \
> > --mandir=${prefix}/share/man \
> > --with-memory-manager=none
> >
> ># enable build on Lion
> >if {${os.major} >= 11} {
> > configure.compiler gcc-4.2
> >}
> >
> >
> >The --with-memory-manager=none flag is there because I saw it fix
> >potentially similar problems in other postings to this mailing list. It
> >didn't make a difference, though.
> >
> >Thanks!
> >
> >
> >_______________________________________________
> >users mailing list
> >users_at_[hidden]
> >http://www.open-mpi.org/mailman/listinfo.cgi/users
> >
>
>
> --
> Brian W. Barrett
> Dept. 1423: Scalable System Software
> Sandia National Laboratories
>
>