Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] OpenMPI causing WRF to crash
From: Dmitry N. Mikushin (maemarcus_at_[hidden])
Date: 2011-08-03 07:46:03


BasitAli,

Signal 15 (SIGTERM) means one of WRF's MPI processes was terminated
unexpectedly, possibly by a deliberate program decision. Whether or not
the issue is Open MPI-specific, it needs to be tracked down somehow to
get more details. Ideally, you would attach a debugger at the moment the
process receives the signal; then you can inspect the call trace and see
exactly what happened. This can be done by registering a custom signal
handler (see the Unix documentation on signals) or by running the MPI
processes inside an external diagnostic tool, for example valgrind:

mpirun -np <nprocesses> valgrind --db-attach=yes ./appname

... or by consulting the WRF community to check whether they have
already set up some other approach.

Good luck with resolving this case!
- D.

2011/8/3 BasitAli Khan <BasitAli.Khan_at_[hidden]>:
> I am trying to run a rather heavy WRF simulation with spectral nudging, but
> the simulation crashes after 1.8 minutes of integration.
> The simulation has two domains, with d01 = 601x601 and d02 = 721x721 and
> 51 vertical levels. I tried this simulation on two different systems, but
> the result was more or less the same. For example:
> On our Blue Gene/P (SUSE Linux Enterprise Server 10 ppc, XLF
> compiler) I tried to run WRF on 2048 shared-memory nodes (1 compute node = 4
> cores, 32-bit, 850 MHz). For the parallel run I used mpixlc, mpixlcxx and
> mpixlf90. I got the following error message in the wrf.err file:
> <Aug 01 19:50:21.244540> BE_MPI (ERROR): The error message in the job
> record is as follows:
> <Aug 01 19:50:21.244657> BE_MPI (ERROR):   "killed with signal 15"
> I also tried to run the same simulation on our Linux cluster (Red Hat
> Enterprise Linux 5.4, x86_64, Intel compiler) with 8, 16 and 64 nodes (1
> compute node = 8 cores). For the parallel run I
> used mpi/openmpi/1.4.2-intel-11. I got the following error message in the
> error log after a couple of minutes of integration:
> "mpirun has exited due to process rank 45 with PID 19540 on
> node ci118 exiting without calling "finalize". This may
> have caused other processes in the application to be
> terminated by signals sent by mpirun (as reported here)."
> I tried many things but nothing seems to work. However, if I reduce the
> grid points below 200, the simulation runs fine. It appears that Open MPI
> may have a problem with a large number of grid points, but I have no idea how
> to fix it. I would greatly appreciate it if you could suggest a solution.
> Best regards,
> ---
> Basit A. Khan, Ph.D.
> Postdoctoral Fellow
> Division of Physical Sciences & Engineering
> Office# 3204, Level 3, Building 1,
> King Abdullah University of Science & Technology
> 4700 King Abdullah Blvd, Box 2753, Thuwal 23955-6900,
> Kingdom of Saudi Arabia.
> Office: +966(0)2 808 0276,  Mobile: +966(0)5 9538 7592
> E-mail: basitali.khan_at_[hidden]
> Skype name: basit.a.khan
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>