Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] exited on signal 11 (Segmentation fault).
From: Mouhamad Al-Sayed-Ali (Mouhamad.Al-Sayed-Ali_at_[hidden])
Date: 2011-10-26 14:44:36


Hi Gus,

   I have done as you suggested, but it still doesn't work!

Many thanks for your help

Mouhamad
Gus Correa <gus_at_[hidden]> wrote:

> Hi Mouhamad
>
> A stack of 10240 kB is probably the Linux default,
> and not necessarily a good choice for HPC and number crunching.
> I'd suggest that you change it to unlimited,
> unless your system administrator has a very good reason not to do
> so.
> We've seen many atmosphere/ocean/climate models crash because
> they couldn't allocate memory on the stack [automatic arrays
> in subroutines, etc].
>
> This has nothing to do with MPI;
> the programs can fail even when they run in serial mode
> because of this.
>
> You can just append this line to /etc/security/limits.conf:
>
> * - stack -1
>
>
> I hope this helps,
> Gus Correa
>
>
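The failure mode Gus describes can be reproduced outside of WRF with a few lines of C. The sketch below is illustrative only (the file name and array size are arbitrary): it puts a roughly 64 MB automatic array on the stack, so it typically dies with a segmentation fault under the default 10240 kB stack limit and runs fine after 'ulimit -s unlimited'.

----
/* stack_demo.c -- illustrative sketch of a large automatic (stack) array,
   similar in spirit to the Fortran automatic arrays mentioned above.
   Build: gcc -O0 stack_demo.c -o stack_demo                              */
#include <stdio.h>
#include <string.h>

static void work(void)
{
    double buf[8 * 1024 * 1024];   /* ~64 MB allocated on the stack */
    memset(buf, 0, sizeof(buf));   /* touch the pages so the overflow shows up */
    printf("allocated %zu MB on the stack\n", sizeof(buf) >> 20);
}

int main(void)
{
    work();
    return 0;
}
----

Running it with 'ulimit -s 10240' in one shell and with 'ulimit -s unlimited' in another shows the difference directly, with no MPI involved.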
> Mouhamad Al-Sayed-Ali wrote:
>> Hi Gus Correa,
>>
>> the output of ulimit -a is
>>
>>
>> ----
>> file(blocks) unlimited
>> coredump(blocks) 2048
>> data(kbytes) unlimited
>> stack(kbytes) 10240
>> lockedmem(kbytes) unlimited
>> memory(kbytes) unlimited
>> nofiles(descriptors) 1024
>> processes 256
>> --------
>>
>>
>> Thanks
>>
>> Mouhamad
>> Gus Correa <gus_at_[hidden]> wrote:
>>
>>> Hi Mouhamad
>>>
>>> The locked memory is set to unlimited, but the lines
>>> about the stack are commented out.
>>> Have you tried to add this line:
>>>
>>> * - stack -1
>>>
>>> then run wrf again? [Note no "#" hash character]
>>>
>>> Also, if you login to the compute nodes,
>>> what is the output of 'limit' [csh,tcsh] or 'ulimit -a' [sh,bash]?
>>> This should tell you what limits are actually set.
>>>
>>> I hope this helps,
>>> Gus Correa
>>>
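To check what stack limit the launched processes actually inherit on the compute nodes, one option is to run a small probe under mpirun. A sketch, assuming Open MPI's mpirun and a hypothetical hostfile named 'hosts':

----
/* showstack.c -- print the stack limit seen by each launched process.
   Build: gcc showstack.c -o showstack
   Run  : mpirun -np 4 --hostfile hosts ./showstack
          (the hostfile name and process count are just examples)        */
#include <stdio.h>
#include <sys/resource.h>

static void print_limit(const char *name, rlim_t v)
{
    if (v == RLIM_INFINITY)
        printf("%s: unlimited\n", name);
    else
        printf("%s: %llu kB\n", name, (unsigned long long)(v / 1024));
}

int main(void)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_STACK, &rl) != 0) {
        perror("getrlimit");
        return 1;
    }
    print_limit("soft stack limit", rl.rlim_cur);
    print_limit("hard stack limit", rl.rlim_max);
    return 0;
}
----

If some nodes still report 10240 kB after /etc/security/limits.conf has been changed, the new setting is probably not reaching the processes there (for example, a batch system daemon started before the change may impose its own limits).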
>>> Mouhamad Al-Sayed-Ali wrote:
>>>> Hi all,
>>>>
>>>> I've checked "limits.conf", and it contains these lines:
>>>>
>>>>
>>>> # Jcb 29.06.2007 : pbs wrf (Siji)
>>>> #* hard stack 1000000
>>>> #* soft stack 1000000
>>>>
>>>> # Dr 14.02.2008 : pour voltaire mpi
>>>> * hard memlock unlimited
>>>> * soft memlock unlimited
>>>>
>>>>
>>>>
>>>> Many thanks for your help
>>>> Mouhamad
>>>>
>>>> Gus Correa <gus_at_[hidden]> wrote:
>>>>
>>>>> Hi Mouhamad, Ralph, Terry
>>>>>
>>>>> Very often, big programs like wrf crash with a segfault because they
>>>>> can't allocate memory on the stack; they assume the system doesn't
>>>>> impose any limit on it. This has nothing to do with MPI.
>>>>>
>>>>> Mouhamad: Check if your stack size is set to unlimited on all compute
>>>>> nodes. The easy way to get it done
>>>>> is to change /etc/security/limits.conf,
>>>>> where you or your system administrator could add these lines:
>>>>>
>>>>> * - memlock -1
>>>>> * - stack -1
>>>>> * - nofile 4096
>>>>>
>>>>> My two cents,
>>>>> Gus Correa
>>>>>
>>>>> Ralph Castain wrote:
>>>>>> Looks like you are crashing in wrf - have you asked them for help?
>>>>>>
>>>>>> On Oct 25, 2011, at 7:53 AM, Mouhamad Al-Sayed-Ali wrote:
>>>>>>
>>>>>>> Hi again,
>>>>>>>
>>>>>>> This is exactly the error I have:
>>>>>>>
>>>>>>> ----
>>>>>>> taskid: 0 hostname: part034.u-bourgogne.fr
>>>>>>> [part034:21443] *** Process received signal ***
>>>>>>> [part034:21443] Signal: Segmentation fault (11)
>>>>>>> [part034:21443] Signal code: Address not mapped (1)
>>>>>>> [part034:21443] Failing at address: 0xfffffffe01eeb340
>>>>>>> [part034:21443] [ 0] /lib64/libpthread.so.0 [0x3612c0de70]
>>>>>>> [part034:21443] [ 1]
>>>>>>> wrf.exe(__module_ra_rrtm_MOD_taugb3+0x418) [0x11cc9d8]
>>>>>>> [part034:21443] [ 2]
>>>>>>> wrf.exe(__module_ra_rrtm_MOD_gasabs+0x260) [0x11cfca0]
>>>>>>> [part034:21443] [ 3] wrf.exe(__module_ra_rrtm_MOD_rrtm+0xb31)
>>>>>>> [0x11e6e41]
>>>>>>> [part034:21443] [ 4]
>>>>>>> wrf.exe(__module_ra_rrtm_MOD_rrtmlwrad+0x25ec) [0x11e9bcc]
>>>>>>> [part034:21443] [ 5]
>>>>>>> wrf.exe(__module_radiation_driver_MOD_radiation_driver+0xe573)
>>>>>>> [0xcc4ed3]
>>>>>>> [part034:21443] [ 6]
>>>>>>> wrf.exe(__module_first_rk_step_part1_MOD_first_rk_step_part1+0x40c5)
>>>>>>> [0xe0e4f5]
>>>>>>> [part034:21443] [ 7] wrf.exe(solve_em_+0x22e58) [0x9b45c8]
>>>>>>> [part034:21443] [ 8] wrf.exe(solve_interface_+0x80a) [0x902dda]
>>>>>>> [part034:21443] [ 9]
>>>>>>> wrf.exe(__module_integrate_MOD_integrate+0x236) [0x4b2c4a]
>>>>>>> [part034:21443] [10]
>>>>>>> wrf.exe(__module_wrf_top_MOD_wrf_run+0x24) [0x47a924]
>>>>>>> [part034:21443] [11] wrf.exe(main+0x41) [0x4794d1]
>>>>>>> [part034:21443] [12] /lib64/libc.so.6(__libc_start_main+0xf4)
>>>>>>> [0x361201d8b4]
>>>>>>> [part034:21443] [13] wrf.exe [0x4793c9]
>>>>>>> [part034:21443] *** End of error message ***
>>>>>>> -------
>>>>>>>
>>>>>>> Mouhamad