Hi Gus,
This may not help, but it's worth a try. If it's not too much
trouble, can you please reconfigure your Open MPI installation with
--enable-debug and then rebuild? After that, may we see the stack trace
from a core file that is produced after the segmentation fault?
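
In case it is useful, here is a rough sketch of those steps. It is only a sketch under assumptions: PREFIX, APP, and CORE are placeholder values that do not come from this thread, the core file name depends on your kernel's core_pattern setting, and the steps are wrapped in shell functions (names made up for illustration) so that nothing runs until you call them.

```shell
# Placeholders (assumptions, adjust for your system).
PREFIX="$HOME/ompi-debug"   # assumed install location for the debug build
APP="./a.out"               # the test program (hello_c in the runs quoted below)
CORE="core"                 # assumed core file name; may be core.<pid> on your system

# Reconfigure and rebuild; run this from the top of the Open MPI source tree.
rebuild_with_debug() {
    ./configure --prefix="$PREFIX" --enable-debug &&
    make all install
}

# Allow core dumps in this shell, then repeat the failing 16-process run.
run_failing_case() {
    ulimit -c unlimited
    "$PREFIX/bin/mpiexec" -mca btl_sm_num_fifos 16 -np 16 "$APP"
}

# Print the stack trace from the core file without an interactive gdb session.
show_backtrace() {
    gdb -batch -ex bt "$APP" "$CORE"
}
```

Calling rebuild_with_debug, run_failing_case, and show_backtrace in that order should produce the trace, provided the program is recompiled against the debug build and the core file lands in the current directory.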
Thanks,
--
Samuel K. Gutierrez
Los Alamos National Laboratory
On May 6, 2010, at 12:01 PM, Gus Correa wrote:
> Hi Eugene
>
> Thanks for the detailed answer.
>
> *************
>
> 1) Now I can see and use the btl_sm_num_fifos parameter:
>
> I had already put "btl = ^sm" in the openmpi-mca-params.conf
> file. This apparently hides btl_sm_num_fifos from ompi_info.
>
> After I removed all options from openmpi-mca-params.conf,
> ompi_info showed the btl_sm_num_fifos parameter.
>
> ompi_info --all | grep btl_sm_num_fifos
> MCA btl: parameter "btl_sm_num_fifos" (current
> value: "1", data source: default value)
>
> A side comment:
> This means that the system administrator can
> hide some Open MPI options from the users, depending on what
> he puts in the openmpi-mca-params.conf file, right?
>
> *************
>
> 2) However, running with "sm" still breaks, unfortunately:
>
> Boomer!
> I get the same errors that I reported in my very
> first email, if I increase the number of processes to 16,
> to explore the hyperthreading range.
>
> This is using "sm" (i.e. not excluded in the mca config file),
> with btl_sm_num_fifos set on the mpiexec command line.
>
> The machine hangs and requires a hard reboot,
> as reported earlier. Please see the output below.
>
> So, I guess the conclusion is that I can use sm,
> but I have to remain within the range of physical cores (8),
> not oversubscribe, not try to explore the HT range.
> Should I expect it to work also for np>number of physical cores?
>
> I wonder if this would still work with np<=8, but with heavier code.
> (I only used hello_c.c so far.)
> Not sure I'll be able to test this, the user wants to use the machine.
>
>
> $ mpiexec -mca btl_sm_num_fifos 4 -np 4 a.out
> Hello, world, I am 0 of 4
> Hello, world, I am 1 of 4
> Hello, world, I am 2 of 4
> Hello, world, I am 3 of 4
>
> $ mpiexec -mca btl_sm_num_fifos 8 -np 8 a.out
> Hello, world, I am 0 of 8
> Hello, world, I am 1 of 8
> Hello, world, I am 2 of 8
> Hello, world, I am 3 of 8
> Hello, world, I am 4 of 8
> Hello, world, I am 5 of 8
> Hello, world, I am 6 of 8
> Hello, world, I am 7 of 8
>
> $ mpiexec -mca btl_sm_num_fifos 16 -np 16 a.out
> --------------------------------------------------------------------------
> mpiexec noticed that process rank 8 with PID 3659 on node
> spinoza.ldeo.columbia.edu exited on signal 11 (Segmentation fault).
> --------------------------------------------------------------------------
> $
>
> Message from syslogd_at_spinoza at May 6 13:38:13 ...
> kernel:------------[ cut here ]------------
>
> Message from syslogd_at_spinoza at May 6 13:38:13 ...
> kernel:invalid opcode: 0000 [#1] SMP
>
> Message from syslogd_at_spinoza at May 6 13:38:13 ...
> kernel:last sysfs file: /sys/devices/system/cpu/cpu15/topology/physical_package_id
>
> Message from syslogd_at_spinoza at May 6 13:38:13 ...
> kernel:Stack:
>
> Message from syslogd_at_spinoza at May 6 13:38:13 ...
> kernel:Call Trace:
>
> Message from syslogd_at_spinoza at May 6 13:38:13 ...
> kernel:Code: 48 89 45 a0 4c 89 ff e8 e0 dd 2b 00 41 8b b6 58 03 00
> 00 4c 89 e7 ff c6 e8 b5 bc ff ff 41 8b 96 5c 03 00 00 48 98 48 39 d0
> 73 04 <0f> 0b eb fe 48 29 d0 48 89 45 a8 66 41 ff 07 49 8b 94 24 00 01
>
> *****************
>
> Many thanks,
> Gus Correa
> ---------------------------------------------------------------------
> Gustavo Correa
> Lamont-Doherty Earth Observatory - Columbia University
> Palisades, NY, 10964-8000 - USA
> ---------------------------------------------------------------------
>
>
> Eugene Loh wrote:
>> Gus Correa wrote:
>>> Hi Eugene
>>>
>>> Thank you for answering one of my original questions.
>>>
>>> However, there seems to be a problem with the syntax.
>>> Is it really "-mca btl btl_sm_num_fifos=some_number"?
>> No. Try "--mca btl_sm_num_fifos 4". Or,
>> % setenv OMPI_MCA_btl_sm_num_fifos 4
>> % ompi_info -a | grep btl_sm_num_fifos # check that things were set correctly
>> % mpirun -n 4 a.out
>>> When I grep any component starting with btl_sm I get nothing:
>>>
>>> ompi_info --all | grep btl_sm
>>> (No output)
>> I'm no guru, but I think the reason has something to do with
>> dynamically loaded somethings. E.g.,
>> % /home/eugene/ompi/bin/ompi_info --all | grep btl_sm_num_fifos
>> (no output)
>> % setenv OPAL_PREFIX /home/eugene/ompi
>> % set path = ( $OPAL_PREFIX/bin $path )
>> % ompi_info --all | grep btl_sm_num_fifos
>> MCA btl: parameter "btl_sm_num_fifos" (current
>> value: "1", data source: default value)
>> _______________________________________________
>> users mailing list
>> users_at_[hidden]
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>