Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] bindings not reported and other problems in openmpi-1.7a1r27358
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2012-09-25 10:45:01


I will also add that Oracle seems to be fading away from Open MPI; their priorities seem to be shifting, so it's quite possible that Open MPI is experiencing bit rot / lack of testing on Solaris.

We already ran into the one issue where process binding is not well supported on Solaris (i.e., you can only bind on specific boundaries, as discussed here: http://www.open-mpi.org/community/lists/hwloc-users/2012/09/0708.php). You may well be running into other issues that we're finding difficult to answer because our Solaris developers have more-or-less left the building.

:-\

On Sep 24, 2012, at 4:51 PM, Ralph Castain wrote:

> Please try to keep the users list on these messages; it allows others to chime in.
>
> You can see the topology by adding "-mca ess_base_verbose 5" to your command line. You'll get other output as well, and you'll need to configure Open MPI with --enable-debug.
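
The suggestion above could be sketched as a small shell script. The flags "-mca ess_base_verbose 5", "--enable-debug" and "-report-bindings" are taken from this thread. The variable names TARGET_HOST, NP and RUN and the dry-run default are assumptions made for illustration only and are not part of Open MPI.

```shell
#!/bin/sh
# Sketch only: prints (default) or runs an mpiexec command that asks the
# ess framework for verbose output, as suggested in the message above.
# Open MPI must have been built with debugging enabled, e.g.:
#   ./configure --enable-debug   (plus your usual configure options)

# Hypothetical knobs; "sunpc0" is one of the hosts named in this thread.
TARGET_HOST="${TARGET_HOST:-sunpc0}"
NP="${NP:-2}"

CMD="mpiexec -np $NP -host $TARGET_HOST -mca ess_base_verbose 5 -report-bindings date"

if [ "${RUN:-0}" = "1" ]; then
    # Left unquoted on purpose so the shell splits it into arguments.
    $CMD
else
    # Dry run: only show the command that would be executed.
    echo "$CMD"
fi
```

Running the script with RUN=1 executes the command; without it, the command line is only printed, so it can be inspected or pasted into a terminal.
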
>
>
> On Sep 24, 2012, at 4:47 AM, Siegmar Gross <Siegmar.Gross_at_[hidden]> wrote:
>
>> Hi,
>>
>>> The 1.7 series has a completely different way of handling node
>>> topology than was used in the 1.6 series. It provides some
>>> enhanced features, but it does have some drawbacks in the case
>>> where the topology info isn't correct. I fear you are running
>>> into this problem (again).
>>>
>>> All the commands you show here work fine for me on a Linux
>>> x86_64 box using 1.7r27361 on a Westmere 6-core single-socket
>>> machine with hyperthreads enabled. I cannot replicate any of
>>> the reported problems, so there isn't much I can do at this point.
>>>
>>> As I've said before, the root problem here appears to be some
>>> hwloc-related issue with your setup. Until that gets resolved
>>> so we get correct topology info, I'm not sure what can be done
>>> to resolve what you are seeing. I'll raise the question of
>>> possibly providing some alternative support for setups like
>>> yours that just can't get topology info, but that would
>>> definitely be a long-term question.
>>
>> Can we check if you get wrong topology info or which info you get
>> at all? Can you tell me a file and location where I can print the
>> values of relevant variables on my architecture? Perhaps that can
>> help to determine what goes wrong. I would use the latest trunk
>> tarball and can make the test a day later, because all changes on
>> our "installation server" are mirrored in the night to our file
>> server for all machines.
>>
>>
>> Kind regards
>>
>> Siegmar
>>
>>
>>
>>
>>> On Sep 23, 2012, at 3:20 AM, Siegmar Gross <Siegmar.Gross_at_[hidden]> wrote:
>>>
>>>> Hi,
>>>>
>>>> yesterday I installed openmpi-1.7a1r27358 and it has an improved
>>>> error message compared to openmpi-1.6.2, but doesn't show process bindings
>>>> and has some other problems as well.
>>>>
>>>>
>>>> "sunpc0" and "linpc0" are equipped with two dual-core processors running
>>>> Solaris 10 x86_64 and Linux x86_64 resp. "tyr" is a dual-processor machine
>>>> running Solaris 10 Sparc.
>>>>
>>>> tyr fd1026 105 mpiexec -np 2 -host sunpc0 -report-bindings \
>>>> -map-by core -bind-to-core date
>>>> Sun Sep 23 11:46:36 CEST 2012
>>>> Sun Sep 23 11:46:36 CEST 2012
>>>>
>>>> tyr fd1026 106 mpicc -showme
>>>> cc -I/usr/local/openmpi-1.7_64_cc/include -mt -m64
>>>> -L/usr/local/openmpi-1.7_64_cc/lib64 -lmpi -lpicl -lm -lkstat -llgrp
>>>> -lsocket -lnsl -lrt -lm
>>>>
>>>>
>>>> openmpi-1.6.2 shows process bindings.
>>>>
>>>> tyr fd1026 103 mpiexec -np 2 -host sunpc0 -report-bindings \
>>>> -bycore -bind-to-core date
>>>> Sun Sep 23 12:09:06 CEST 2012
>>>> [sunpc0:13197] MCW rank 0 bound to socket 0[core 0]: [B .][. .]
>>>> [sunpc0:13197] MCW rank 1 bound to socket 0[core 1]: [. B][. .]
>>>> Sun Sep 23 12:09:06 CEST 2012
>>>>
>>>>
>>>> tyr fd1026 104 mpicc -showme
>>>> cc -I/usr/local/openmpi-1.6.2_64_cc/include -mt -m64
>>>> -L/usr/local/openmpi-1.6.2_64_cc/lib64 -lmpi -lm -lkstat -llgrp
>>>> -lsocket -lnsl -lrt -lm
>>>>
>>>>
>>>> On my Linux machine I get a warning.
>>>>
>>>> tyr fd1026 113 mpiexec -np 2 -host linpc0 -report-bindings \
>>>> -map-by core -bind-to-core date
>>>> --------------------------------------------------------------------------
>>>> WARNING: a request was made to bind a process. While the system
>>>> supports binding the process itself, at least one node does NOT
>>>> support binding memory to the process location.
>>>>
>>>> Node: linpc0
>>>>
>>>> This is a warning only; your job will continue, though performance may
>>>> be degraded.
>>>> --------------------------------------------------------------------------
>>>> Sun Sep 23 11:56:04 CEST 2012
>>>> Sun Sep 23 11:56:04 CEST 2012
>>>>
>>>>
>>>>
>>>> Everything works fine with openmpi-1.6.2.
>>>>
>>>> tyr fd1026 106 mpiexec -np 2 -host linpc0 -report-bindings \
>>>> -bycore -bind-to-core date
>>>> [linpc0:15808] MCW rank 0 bound to socket 0[core 0]: [B .][. .]
>>>> [linpc0:15808] MCW rank 1 bound to socket 0[core 1]: [. B][. .]
>>>> Sun Sep 23 12:11:47 CEST 2012
>>>> Sun Sep 23 12:11:47 CEST 2012
>>>>
>>>>
>>>>
>>>>
>>>> On my Solaris Sparc machine I get the following errors.
>>>>
>>>>
>>>> tyr fd1026 121 mpiexec -np 2 -report-bindings -map-by core -bind-to-core date
>>>> [tyr.informatik.hs-fulda.de:23773] [[32457,0],0] ORTE_ERROR_LOG: Value out of bounds in file ../../../../openmpi-1.7a1r27358/orte/mca/odls/base/odls_base_default_fns.c at line 847
>>>> [tyr.informatik.hs-fulda.de:23773] [[32457,0],0] ORTE_ERROR_LOG: Value out of bounds in file ../../../../openmpi-1.7a1r27358/orte/mca/odls/base/odls_base_default_fns.c at line 1414
>>>> [tyr.informatik.hs-fulda.de:23773] [[32457,0],0] ORTE_ERROR_LOG: Value out of bounds in file ../../../../openmpi-1.7a1r27358/orte/mca/odls/base/odls_base_default_fns.c at line 847
>>>> [tyr.informatik.hs-fulda.de:23773] [[32457,0],0] ORTE_ERROR_LOG: Value out of bounds in file ../../../../openmpi-1.7a1r27358/orte/mca/odls/base/odls_base_default_fns.c at line 1414
>>>>
>>>>
>>>>
>>>> tyr fd1026 122 mpiexec -np 2 -host tyr -report-bindings -map-by core -bind-to core date
>>>> --------------------------------------------------------------------------
>>>> All nodes which are allocated for this job are already filled.
>>>> --------------------------------------------------------------------------
>>>>
>>>>
>>>> Once more everything works fine with openmpi-1.6.2.
>>>>
>>>> tyr fd1026 109 mpiexec -np 2 -report-bindings -bycore -bind-to-core date
>>>> [tyr.informatik.hs-fulda.de:23869] MCW rank 0 bound to socket 0[core 0]: [B][.]
>>>> [tyr.informatik.hs-fulda.de:23869] MCW rank 1 bound to socket 1[core 0]: [.][B]
>>>> Sun Sep 23 12:14:09 CEST 2012
>>>> Sun Sep 23 12:14:09 CEST 2012
>>>>
>>>> tyr fd1026 110 mpiexec -np 2 -host tyr -report-bindings -bycore -bind-to-core date
>>>> [tyr.informatik.hs-fulda.de:23877] MCW rank 0 bound to socket 0[core 0]: [B][.]
>>>> [tyr.informatik.hs-fulda.de:23877] MCW rank 1 bound to socket 1[core 0]: [.][B]
>>>> Sun Sep 23 12:16:05 CEST 2012
>>>> Sun Sep 23 12:16:05 CEST 2012
>>>>
>>>>
>>>> Kind regards
>>>>
>>>> Siegmar
>>>>
>>>> _______________________________________________
>>>> users mailing list
>>>> users_at_[hidden]
>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>
>>>
>>
>
>

-- 
Jeff Squyres
jsquyres_at_[hidden]
For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/