Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] exited on signal 11 (Segmentation fault).
From: Gus Correa (gus_at_[hidden])
Date: 2011-10-25 11:24:03


Hi Mouhamad

The locked memory is set to unlimited, but the lines
about the stack are commented out.
Have you tried adding this line:

* - stack -1

and then running wrf again? [Note: no leading "#" hash character,
i.e. the line must not be commented out.]
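
As a quick test before touching limits.conf, you can also raise the
limit for the current shell only (just a sketch; adjust the process
count and the wrf launch command to match your actual job script):

  # bash/sh: remove the stack size limit for this shell,
  # then launch wrf as usual
  ulimit -s unlimited
  mpirun -np 4 ./wrf.exe

  # csh/tcsh equivalent
  limit stacksize unlimited

Note that a limit raised this way may not propagate to the other
compute nodes, which is why the limits.conf change is the more
reliable fix.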

Also, if you log in to the compute nodes,
what is the output of 'limit' [csh,tcsh] or 'ulimit -a' [sh,bash]?
This should tell you which limits are actually in effect.
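
For example (a sketch; part034 is just the node name taken from your
error message, substitute your own):

  ssh part034 'ulimit -a'   # all limits seen by a non-interactive shell
  ssh part034 'ulimit -s'   # stack size only; "unlimited" is the goal

Since the limits in effect under the MPI launcher can differ from
those of an ssh session, it may also be worth checking through
mpirun itself:

  mpirun -np 1 --host part034 sh -c 'ulimit -s'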

I hope this helps,
Gus Correa

Mouhamad Al-Sayed-Ali wrote:
> Hi all,
>
> I've checked the "limits.conf" file, and it contains these lines:
>
>
> # Jcb 29.06.2007 : pbs wrf (Siji)
> #* hard stack 1000000
> #* soft stack 1000000
>
> # Dr 14.02.2008 : for voltaire mpi
> * hard memlock unlimited
> * soft memlock unlimited
>
> Many thanks for your help
> Mouhamad
>
> Gus Correa <gus_at_[hidden]> wrote:
>
>> Hi Mouhamad, Ralph, Terry
>>
>> Very often, big programs like wrf crash with a segfault because they
>> cannot allocate memory on the stack; they assume the system imposes
>> no limit on it. This has nothing to do with MPI.
>>
>> Mouhamad: Check whether your stack size is set to unlimited on all
>> compute nodes. The easiest way to get this done is to edit
>> /etc/security/limits.conf, where you or your system administrator
>> could add these lines:
>>
>> * - memlock -1
>> * - stack -1
>> * - nofile 4096
>>
>> My two cents,
>> Gus Correa
>>
>> Ralph Castain wrote:
>>> Looks like you are crashing in wrf - have you asked the wrf
>>> developers for help?
>>>
>>> On Oct 25, 2011, at 7:53 AM, Mouhamad Al-Sayed-Ali wrote:
>>>
>>>> Hi again,
>>>>
>>>> This is exactly the error I have:
>>>>
>>>> ----
>>>> taskid: 0 hostname: part034.u-bourgogne.fr
>>>> [part034:21443] *** Process received signal ***
>>>> [part034:21443] Signal: Segmentation fault (11)
>>>> [part034:21443] Signal code: Address not mapped (1)
>>>> [part034:21443] Failing at address: 0xfffffffe01eeb340
>>>> [part034:21443] [ 0] /lib64/libpthread.so.0 [0x3612c0de70]
>>>> [part034:21443] [ 1] wrf.exe(__module_ra_rrtm_MOD_taugb3+0x418)
>>>> [0x11cc9d8]
>>>> [part034:21443] [ 2] wrf.exe(__module_ra_rrtm_MOD_gasabs+0x260)
>>>> [0x11cfca0]
>>>> [part034:21443] [ 3] wrf.exe(__module_ra_rrtm_MOD_rrtm+0xb31)
>>>> [0x11e6e41]
>>>> [part034:21443] [ 4] wrf.exe(__module_ra_rrtm_MOD_rrtmlwrad+0x25ec)
>>>> [0x11e9bcc]
>>>> [part034:21443] [ 5]
>>>> wrf.exe(__module_radiation_driver_MOD_radiation_driver+0xe573)
>>>> [0xcc4ed3]
>>>> [part034:21443] [ 6]
>>>> wrf.exe(__module_first_rk_step_part1_MOD_first_rk_step_part1+0x40c5)
>>>> [0xe0e4f5]
>>>> [part034:21443] [ 7] wrf.exe(solve_em_+0x22e58) [0x9b45c8]
>>>> [part034:21443] [ 8] wrf.exe(solve_interface_+0x80a) [0x902dda]
>>>> [part034:21443] [ 9] wrf.exe(__module_integrate_MOD_integrate+0x236)
>>>> [0x4b2c4a]
>>>> [part034:21443] [10] wrf.exe(__module_wrf_top_MOD_wrf_run+0x24)
>>>> [0x47a924]
>>>> [part034:21443] [11] wrf.exe(main+0x41) [0x4794d1]
>>>> [part034:21443] [12] /lib64/libc.so.6(__libc_start_main+0xf4)
>>>> [0x361201d8b4]
>>>> [part034:21443] [13] wrf.exe [0x4793c9]
>>>> [part034:21443] *** End of error message ***
>>>> -------
>>>>
>>>> Mouhamad