Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] WRF Problem running in Parallel (Gus Correa)
From: Ahsan Ali (ahsanshah01_at_[hidden])
Date: 2011-02-23 21:55:32


Hello Gus, Jody

 The system has enough memory. I set the stack size to unlimited before
running WRF with the command *ulimit -s unlimited*, but the problem still occurred.
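
For reference, a minimal sketch of the sequence I mean, assuming a bash shell
and that wrf.exe sits in the current run directory (both assumptions on my part):

    ulimit -s unlimited      # lift the per-process stack size limit for this shell
    ulimit -s                # confirm it now reports "unlimited"
    mpirun -np 4 wrf.exe     # then launch WRF under Open MPI on 4 ranks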
Thanks

> Hi Ahsan, Jody
>
> Just a guess that this may be a stack size problem.
> Did you try to run WRF with unlimited stack size?
> Also, does your machine have enough memory to run WRF?
>
> I hope this helps,
> Gus Correa
>
>
> jody wrote:
> > Hi
> > At first glance I would say this is not an Open MPI problem,
> > but a WRF problem (though I must admit I have no knowledge
> > whatsoever of WRF).
> >
> > Have you tried running a single instance of wrf.exe?
> > Have you tried to run a simple application (like a "hello world") on
> > your nodes?
> >
> > Jody
> >
> >
> > On Tue, Feb 22, 2011 at 7:37 AM, Ahsan Ali <ahsanshah01_at_[hidden]> wrote:
> >> Hello,
> >> I am stuck on a problem with running the Weather Research and
> >> Forecasting Model (WRF V3.2.1). I get the following error when
> >> running with mpirun. Any help would be highly appreciated.
> >>
> >> [pmdtest_at_pmd02 em_real]$ mpirun -np 4 wrf.exe
> >> starting wrf task 0 of 4
> >> starting wrf task 1 of 4
> >> starting wrf task 3 of 4
> >> starting wrf task 2 of 4
> >>
> >> --------------------------------------------------------------------------
> >> mpirun noticed that process rank 3 with PID 6044 on node pmd02.pakmet.com
> >> exited on signal 11 (Segmentation fault).
> >>
> >>
> >>
> >> --
> >> Syed Ahsan Ali Bokhari
> >> Electronic Engineer (EE)
> >> Research & Development Division
> >> Pakistan Meteorological Department H-8/4, Islamabad.
> >> Phone # off +92518358714
> >> Cell # +923155145014
> >>
> >>
> Dear Jody,
>
> WRF runs fine in serial mode (i.e. a single instance). I am running
> another application, HRM, using Open MPI; there is no issue with that, and
> the application runs on a cluster of many nodes. The WRF manual says the
> following about MPI runs:
>
> If you have run the model on multiple processors using MPI, you should have
> a number of rsl.out.* and rsl.error.* files. Type "tail rsl.out.0000" to see
> if you get "SUCCESS COMPLETE WRF". This is a good indication that the model
> has run successfully.
>
> Take a look at either rsl.out.0000 file or other standard out file. This
> file logs the times taken to compute for one model time step, and to write
> one history and restart output:
>
> Timing for main: time 2006-01-21_23:55:00 on domain 2: 4.91110 elapsed seconds.
> Timing for main: time 2006-01-21_23:56:00 on domain 2: 4.73350 elapsed seconds.
> Timing for main: time 2006-01-21_23:57:00 on domain 2: 4.72360 elapsed seconds.
> Timing for main: time 2006-01-21_23:57:00 on domain 1: 19.55880 elapsed seconds.
>
> and
>
> Timing for Writing wrfout_d02_2006-01-22_00:00:00 for domain 2: 1.17970 elapsed seconds.
> Timing for main: time 2006-01-22_00:00:00 on domain 1: 27.66230 elapsed seconds.
> Timing for Writing wrfout_d01_2006-01-22_00:00:00 for domain 1: 0.60250 elapsed seconds.
>
> If the model did not run to completion, take a look at these standard
> output/error files too. If the model has become numerically unstable, it
> may have violated the CFL criterion (for numerical stability). Check
> whether this is true by typing the following:
>
> grep cfl rsl.error.* or grep cfl wrf.out
>
> you might see something like these:
>
> 5 points exceeded cfl=2 in domain 1 at time 4.200000
> MAX AT i,j,k: 123 48 3 cfl,w,d(eta)= 4.165821
> 21 points exceeded cfl=2 in domain 1 at time 4.200000
> MAX AT i,j,k: 123 49 4 cfl,w,d(eta)= 10.66290
>
> But when I check the rsl.out* or rsl.error* files there is no indication
> that any error occurred; it seems that the application just didn't start.
> [pmdtest_at_pmd02 em_real]$ tail rsl.out.0000
> WRF NUMBER OF TILES FROM OMP_GET_MAX_THREADS = 8
> WRF TILE 1 IS 1 IE 360 JS 1 JE 25
> WRF TILE 2 IS 1 IE 360 JS 26 JE 50
> WRF TILE 3 IS 1 IE 360 JS 51 JE 74
> WRF TILE 4 IS 1 IE 360 JS 75 JE 98
> WRF TILE 5 IS 1 IE 360 JS 99 JE 122
> WRF TILE 6 IS 1 IE 360 JS 123 JE 146
> WRF TILE 7 IS 1 IE 360 JS 147 JE 170
> WRF TILE 8 IS 1 IE 360 JS 171 JE 195
> WRF NUMBER OF TILES = 8
>
>
>
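
In case it is useful, a minimal sketch of the checks the manual describes,
assuming a bash shell in the em_real run directory where the rsl.* files are
written (assumptions on my part):

    tail rsl.out.0000                          # look for "SUCCESS COMPLETE WRF"
    grep -l "SUCCESS COMPLETE WRF" rsl.out.*   # list the ranks that finished cleanly
    grep cfl rsl.error.*                       # any CFL violations (numerical instability)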
Syed Ahsan Ali Bokhari
Electronic Engineer (EE)

Research & Development Division
Pakistan Meteorological Department H-8/4, Islamabad.
Phone # off +92518358714
Cell # +923155145014