Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] bindings not reported and other problems in openmpi-1.7a1r27358
From: Ralph Castain (rhc_at_[hidden])
Date: 2012-09-24 10:51:01


Please keep the users list copied on your messages so that others can chime in.

You can see the topology by adding "-mca ess_base_verbose 5" to your command line. You'll get other output as well, and you'll need to configure with --enable-debug.
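
A minimal sketch of that suggestion, reusing the sunpc0 command from the earlier message. The variable names (TARGET_HOST, NP, DRY_RUN) and the log file name topology.log are illustrative assumptions, not Open MPI settings; only the mpiexec options come from this thread.

```shell
#!/bin/sh
# Assumes Open MPI was rebuilt with debugging enabled, e.g.
#   ./configure --enable-debug <your usual configure options>
# TARGET_HOST, NP, DRY_RUN and topology.log are illustrative names only.
TARGET_HOST="sunpc0"
NP=2
DRY_RUN="${DRY_RUN:-1}"

# Verbosity option suggested above, prepended to the original test command.
VERBOSE_OPTS="-mca ess_base_verbose 5"
CMD="mpiexec $VERBOSE_OPTS -np $NP -host $TARGET_HOST -report-bindings -map-by core -bind-to-core date"

# Always print the command so it can be checked before launching.
echo "$CMD"

# Launch only when DRY_RUN=0; stderr is merged into stdout and saved,
# because the verbose output is mixed in with the normal job output.
if [ "$DRY_RUN" = "0" ]; then
    $CMD 2>&1 | tee topology.log
fi
```

Running it with DRY_RUN=0 launches the job and keeps a copy of the output in topology.log, which can then be posted to the list.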

On Sep 24, 2012, at 4:47 AM, Siegmar Gross <Siegmar.Gross_at_[hidden]> wrote:

> Hi,
>
>> The 1.7 series has a completely different way of handling node
>> topology than was used in the 1.6 series. It provides some
>> enhanced features, but it does have some drawbacks in the case
>> where the topology info isn't correct. I fear you are running
>> into this problem (again).
>>
>> All the commands you show here work fine for me on a Linux
>> x86_64 box using 1.7r27361 on a Westmere 6-core single-socket
>> machine with hyperthreads enabled. I cannot replicate any of
>> the reported problems, so there isn't much I can do at this point.
>>
>> As I've said before, the root problem here appears to be some
>> hwloc-related issue with your setup. Until that gets resolved
>> so we get correct topology info, I'm not sure what can be done
>> to resolve what you are seeing. I'll raise the question of
>> possibly providing some alternative support for setups like
>> yours that just can't get topology info, but that would
>> definitely be a long-term question.
>
> Can we check if you get wrong topology info or which info you get
> at all? Can you tell me a file and location where I can print the
> values of relevant variables on my architecture? Perhaps that can
> help to determine what goes wrong. I would use the latest trunk
> tarball and can make the test a day later, because all changes on
> our "installation server" are mirrored in the night to our file
> server for all machines.
>
>
> Kind regards
>
> Siegmar
>
>
>
>
>> On Sep 23, 2012, at 3:20 AM, Siegmar Gross <Siegmar.Gross_at_[hidden]> wrote:
>>
>>> Hi,
>>>
>>> yesterday I installed openmpi-1.7a1r27358 and it has an improved
>>> error message compared to openmpi-1.6.2, but doesn't show process bindings
>>> and has some other problems as well.
>>>
>>>
>>> "sunpc0" and "linpc0" are equipped with two dual-core processors running
>>> Solaris 10 x86_64 and Linux x86_64 resp. "tyr" is a dual-processor machine
>>> running Solaris 10 Sparc.
>>>
>>> tyr fd1026 105 mpiexec -np 2 -host sunpc0 -report-bindings \
>>> -map-by core -bind-to-core date
>>> Sun Sep 23 11:46:36 CEST 2012
>>> Sun Sep 23 11:46:36 CEST 2012
>>>
>>> tyr fd1026 106 mpicc -showme
>>> cc -I/usr/local/openmpi-1.7_64_cc/include -mt -m64
>>> -L/usr/local/openmpi-1.7_64_cc/lib64 -lmpi -lpicl -lm -lkstat -llgrp
>>> -lsocket -lnsl -lrt -lm
>>>
>>>
>>> openmpi-1.6.2 shows process bindings.
>>>
>>> tyr fd1026 103 mpiexec -np 2 -host sunpc0 -report-bindings \
>>> -bycore -bind-to-core date
>>> Sun Sep 23 12:09:06 CEST 2012
>>> [sunpc0:13197] MCW rank 0 bound to socket 0[core 0]: [B .][. .]
>>> [sunpc0:13197] MCW rank 1 bound to socket 0[core 1]: [. B][. .]
>>> Sun Sep 23 12:09:06 CEST 2012
>>>
>>>
>>> tyr fd1026 104 mpicc -showme
>>> cc -I/usr/local/openmpi-1.6.2_64_cc/include -mt -m64
>>> -L/usr/local/openmpi-1.6.2_64_cc/lib64 -lmpi -lm -lkstat -llgrp
>>> -lsocket -lnsl -lrt -lm
>>>
>>>
>>> On my Linux machine I get a warning.
>>>
>>> tyr fd1026 113 mpiexec -np 2 -host linpc0 -report-bindings \
>>> -map-by core -bind-to-core date
>>> --------------------------------------------------------------------------
>>> WARNING: a request was made to bind a process. While the system
>>> supports binding the process itself, at least one node does NOT
>>> support binding memory to the process location.
>>>
>>> Node: linpc0
>>>
>>> This is a warning only; your job will continue, though performance may
>>> be degraded.
>>> --------------------------------------------------------------------------
>>> Sun Sep 23 11:56:04 CEST 2012
>>> Sun Sep 23 11:56:04 CEST 2012
>>>
>>>
>>>
>>> Everything works fine with openmpi-1.6.2.
>>>
>>> tyr fd1026 106 mpiexec -np 2 -host linpc0 -report-bindings \
>>> -bycore -bind-to-core date
>>> [linpc0:15808] MCW rank 0 bound to socket 0[core 0]: [B .][. .]
>>> [linpc0:15808] MCW rank 1 bound to socket 0[core 1]: [. B][. .]
>>> Sun Sep 23 12:11:47 CEST 2012
>>> Sun Sep 23 12:11:47 CEST 2012
>>>
>>>
>>>
>>>
>>> On my Solaris Sparc machine I get the following errors.
>>>
>>>
>>> tyr fd1026 121 mpiexec -np 2 -report-bindings -map-by core -bind-to-core date
>>> [tyr.informatik.hs-fulda.de:23773] [[32457,0],0] ORTE_ERROR_LOG: Value out of bounds in file ../../../../openmpi-1.7a1r27358/orte/mca/odls/base/odls_base_default_fns.c at line 847
>>> [tyr.informatik.hs-fulda.de:23773] [[32457,0],0] ORTE_ERROR_LOG: Value out of bounds in file ../../../../openmpi-1.7a1r27358/orte/mca/odls/base/odls_base_default_fns.c at line 1414
>>> [tyr.informatik.hs-fulda.de:23773] [[32457,0],0] ORTE_ERROR_LOG: Value out of bounds in file ../../../../openmpi-1.7a1r27358/orte/mca/odls/base/odls_base_default_fns.c at line 847
>>> [tyr.informatik.hs-fulda.de:23773] [[32457,0],0] ORTE_ERROR_LOG: Value out of bounds in file ../../../../openmpi-1.7a1r27358/orte/mca/odls/base/odls_base_default_fns.c at line 1414
>>>
>>>
>>>
>>> tyr fd1026 122 mpiexec -np 2 -host tyr -report-bindings -map-by core -bind-to core date
>>> --------------------------------------------------------------------------
>>> All nodes which are allocated for this job are already filled.
>>> --------------------------------------------------------------------------
>>>
>>>
>>> Once more everything works fine with openmpi-1.6.2.
>>>
>>> tyr fd1026 109 mpiexec -np 2 -report-bindings -bycore -bind-to-core date
>>> [tyr.informatik.hs-fulda.de:23869] MCW rank 0 bound to socket 0[core 0]: [B][.]
>>> [tyr.informatik.hs-fulda.de:23869] MCW rank 1 bound to socket 1[core 0]: [.][B]
>>> Sun Sep 23 12:14:09 CEST 2012
>>> Sun Sep 23 12:14:09 CEST 2012
>>>
>>> tyr fd1026 110 mpiexec -np 2 -host tyr -report-bindings -bycore -bind-to-core date
>>> [tyr.informatik.hs-fulda.de:23877] MCW rank 0 bound to socket 0[core 0]: [B][.]
>>> [tyr.informatik.hs-fulda.de:23877] MCW rank 1 bound to socket 1[core 0]: [.][B]
>>> Sun Sep 23 12:16:05 CEST 2012
>>> Sun Sep 23 12:16:05 CEST 2012
>>>
>>>
>>> Kind regards
>>>
>>> Siegmar