Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] How do I run OpenMPI safely on a Nehalem standalone machine?
From: Samuel K. Gutierrez (samuel_at_[hidden])
Date: 2010-05-06 17:41:30


Hi Gus,

Doh! I didn't see the kernel-related messages after the segfault
message. Definitely some weirdness here that is beyond your
control... Sorry about that.

--
Samuel K. Gutierrez
Los Alamos National Laboratory
On May 6, 2010, at 3:28 PM, Gus Correa wrote:
> Hi Samuel
>
> Samuel K. Gutierrez wrote:
>> Hi Gus,
>> This may not help, but it's worth a try.  If it's not too much  
>> trouble, can you please reconfigure your Open MPI installation with  
>> --enable-debug and then rebuild?  After that, may we see the stack  
>> trace from a core file that is produced after the segmentation fault?
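>>
>> For example, keeping whatever configure options you used before and
>> just adding the debug flag (the prefix below is only a placeholder):
>>
>>   ./configure --prefix=/opt/openmpi --enable-debug
>>   make all install
>>
>> Then enable core dumps before running ("ulimit -c unlimited" in bash),
>> reproduce the failure, and open the core with something like
>> "gdb a.out core" followed by "bt".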
>> Thanks,
>> -- 
>> Samuel K. Gutierrez
>> Los Alamos National Laboratory
>
> Thank you for the suggestion.
>
> I am a bit reluctant to try this because when it fails,
> it *really* fails.
> Most of the time the machine doesn't even return the prompt,
> and in all cases it freezes and requires a hard reboot.
> It is not a segfault that the OS can catch, I guess.
> I wonder whether enabling debug mode would do much for us:
> whether the run would get as far as dumping a core, or just die before that.
>
> Gus Correa
> ---------------------------------------------------------------------
> Gustavo Correa
> Lamont-Doherty Earth Observatory - Columbia University
> Palisades, NY, 10964-8000 - USA
> ---------------------------------------------------------------------
>
>> On May 6, 2010, at 12:01 PM, Gus Correa wrote:
>>> Hi Eugene
>>>
>>> Thanks for the detailed answer.
>>>
>>> *************
>>>
>>> 1) Now I can see and use the btl_sm_num_fifos parameter:
>>>
>>> I had already committed "btl = ^sm" to the openmpi-mca-params.conf
>>> file.  This apparently hides btl_sm_num_fifos from ompi_info.
>>>
>>> After I removed all options from openmpi-mca-params.conf,
>>> ompi_info showed the btl_sm_num_fifos parameter.
>>>
>>> ompi_info --all | grep btl_sm_num_fifos
>>>                MCA btl: parameter "btl_sm_num_fifos" (current value: "1", data source: default value)
>>>
>>> A side comment:
>>> This means that the system administrator can
>>> hide some Open MPI options from the users, depending on what
>>> he puts in the openmpi-mca-params.conf file, right?
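>>>
>>> For example, with "btl = ^sm" in that file, sm is off by default for
>>> everyone, although I believe a user can still override it explicitly
>>> on the command line, e.g.
>>>
>>>   mpiexec --mca btl self,sm,tcp -np 4 a.out
>>>
>>> since command-line settings take precedence over the config file.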
>>>
>>> *************
>>>
>>> 2) However, running with "sm" still breaks, unfortunately:
>>>
>>> Boomer!
>>> I get the same errors that I reported in my very first email
>>> if I increase the number of processes to 16
>>> to explore the hyperthreading range.
>>>
>>> This is using "sm" (i.e., not excluded in the MCA config file) and
>>> setting btl_sm_num_fifos on the mpiexec command line.
>>>
>>> The machine hangs, requires a hard reboot, etc., as reported earlier.
>>> See below, please.
>>>
>>> So, I guess the conclusion is that I can use sm,
>>> but I have to remain within the range of physical cores (8),
>>> and not oversubscribe or try to explore the HT range.
>>> Should I expect it to also work for np > number of physical cores?
>>>
>>> I wonder if this would still work with np<=8, but with heavier code.
>>> (I only used hello_c.c so far.)
>>> Not sure I'll be able to test this, though, since the user wants to
>>> use the machine.
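>>>
>>> A heavier test might be a simple loop that floods rank 0 with
>>> messages and keeps the sm FIFOs busy.  A minimal sketch (the message
>>> count and size are made up):
>>>
>>> #include <mpi.h>
>>>
>>> int main(int argc, char **argv)
>>> {
>>>     static double buf[4096];   /* ~32 KB payload, zero-initialized */
>>>     int rank, size, i, src;
>>>     MPI_Init(&argc, &argv);
>>>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>>     MPI_Comm_size(MPI_COMM_WORLD, &size);
>>>     for (i = 0; i < 10000; i++) {
>>>         if (rank == 0) {   /* rank 0 drains one message per peer */
>>>             for (src = 1; src < size; src++)
>>>                 MPI_Recv(buf, 4096, MPI_DOUBLE, src, 0,
>>>                          MPI_COMM_WORLD, MPI_STATUS_IGNORE);
>>>         } else {           /* everyone else sends to rank 0 */
>>>             MPI_Send(buf, 4096, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
>>>         }
>>>     }
>>>     MPI_Finalize();
>>>     return 0;
>>> }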
>>>
>>>
>>> $ mpiexec -mca btl_sm_num_fifos 4 -np 4 a.out
>>> Hello, world, I am 0 of 4
>>> Hello, world, I am 1 of 4
>>> Hello, world, I am 2 of 4
>>> Hello, world, I am 3 of 4
>>>
>>> $ mpiexec -mca btl_sm_num_fifos 8 -np 8 a.out
>>> Hello, world, I am 0 of 8
>>> Hello, world, I am 1 of 8
>>> Hello, world, I am 2 of 8
>>> Hello, world, I am 3 of 8
>>> Hello, world, I am 4 of 8
>>> Hello, world, I am 5 of 8
>>> Hello, world, I am 6 of 8
>>> Hello, world, I am 7 of 8
>>>
>>> $ mpiexec -mca btl_sm_num_fifos 16 -np 16 a.out
>>> --------------------------------------------------------------------------
>>> mpiexec noticed that process rank 8 with PID 3659 on node  
>>> spinoza.ldeo.columbia.edu exited on signal 11 (Segmentation fault).
>>> --------------------------------------------------------------------------
>>> $
>>>
>>> Message from syslogd_at_spinoza at May  6 13:38:13 ...
>>> kernel:------------[ cut here ]------------
>>>
>>> Message from syslogd_at_spinoza at May  6 13:38:13 ...
>>> kernel:invalid opcode: 0000 [#1] SMP
>>>
>>> Message from syslogd_at_spinoza at May  6 13:38:13 ...
>>> kernel:last sysfs file: /sys/devices/system/cpu/cpu15/topology/physical_package_id
>>>
>>> Message from syslogd_at_spinoza at May  6 13:38:13 ...
>>> kernel:Stack:
>>>
>>> Message from syslogd_at_spinoza at May  6 13:38:13 ...
>>> kernel:Call Trace:
>>>
>>> Message from syslogd_at_spinoza at May  6 13:38:13 ...
>>> kernel:Code: 48 89 45 a0 4c 89 ff e8 e0 dd 2b 00 41 8b b6 58 03 00  
>>> 00 4c 89 e7 ff c6 e8 b5 bc ff ff 41 8b 96 5c 03 00 00 48 98 48 39  
>>> d0 73 04 <0f> 0b eb fe 48 29 d0 48 89 45 a8 66 41 ff 07 49 8b 94  
>>> 24 00 01
>>>
>>> *****************
>>>
>>> Many thanks,
>>> Gus Correa
>>> ---------------------------------------------------------------------
>>> Gustavo Correa
>>> Lamont-Doherty Earth Observatory - Columbia University
>>> Palisades, NY, 10964-8000 - USA
>>> ---------------------------------------------------------------------
>>>
>>>
>>> Eugene Loh wrote:
>>>> Gus Correa wrote:
>>>>> Hi Eugene
>>>>>
>>>>> Thank you for answering one of my original questions.
>>>>>
>>>>> However, there seems to be a problem with the syntax.
>>>>> Is it really "-mca btl btl_sm_num_fifos=some_number"?
>>>> No.  Try "--mca btl_sm_num_fifos 4".  Or,
>>>> % setenv OMPI_MCA_btl_sm_num_fifos 4
>>>> % ompi_info -a | grep btl_sm_num_fifos   # check that things were set correctly
>>>> % mpirun -n 4 a.out
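>>>> (Those are csh commands; in bash the first would be
>>>> "export OMPI_MCA_btl_sm_num_fifos=4".)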
>>>>> When I grep any component starting with btl_sm I get nothing:
>>>>>
>>>>> ompi_info --all | grep btl_sm
>>>>> (No output)
>>>> I'm no guru, but I think the reason has something to do with  
>>>> dynamically loaded somethings.  E.g.,
>>>> % /home/eugene/ompi/bin/ompi_info --all | grep btl_sm_num_fifos
>>>> (no output)
>>>> % setenv OPAL_PREFIX /home/eugene/ompi
>>>> % set path = ( $OPAL_PREFIX/bin $path )
>>>> % ompi_info --all | grep btl_sm_num_fifos
>>>>               MCA btl: parameter "btl_sm_num_fifos" (current value: "1", data source: default value)
>>>
>
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users