Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] bindings not reported and other problems in openmpi-1.7a1r27358
From: Ralph Castain (rhc_at_[hidden])
Date: 2012-09-23 13:05:25


The 1.7 series handles node topology in a completely different way than the 1.6 series did. It provides some enhanced features, but it has drawbacks when the topology info isn't correct. I fear you are running into this problem (again).

All the commands you show here work fine for me on a Linux x86_64 box using 1.7r27361 on a Westmere 6-core single-socket machine with hyperthreads enabled. I cannot replicate any of the reported problems, so there isn't much I can do at this point.

As I've said before, the root problem here appears to be some hwloc-related issue with your setup. Until that gets resolved so we get correct topology info, I'm not sure what can be done to resolve what you are seeing. I'll raise the question of possibly providing some alternative support for setups like yours that just can't get topology info, but that would definitely be a long-term question.
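
For comparing runs across versions, one quick check is whether each rank actually got a "bound to" line from -report-bindings. The following is only a minimal sketch, assuming the 1.6.2-style line format shown in the quoted output below; the function names parse_bindings and missing_ranks are hypothetical helpers, not part of Open MPI, and the 1.7 output format may differ.

```python
import re

# Matches 1.6.2-style lines such as:
#   [sunpc0:13197] MCW rank 0 bound to socket 0[core 0]: [B .][. .]
# Assumption: this format is taken from the quoted output only.
BINDING_RE = re.compile(
    r"^\[(?P<host>[^:\]]+):(?P<pid>\d+)\] MCW rank (?P<rank>\d+) "
    r"bound to (?P<where>.+?): (?P<mask>(?:\[[B. ]+\])+)\s*$"
)


def parse_bindings(output):
    """Return {rank: (host, location, mask)} for every binding line found."""
    bindings = {}
    for line in output.splitlines():
        # Tolerate mail-quote prefixes ("> ") when pasting from an archive.
        match = BINDING_RE.match(line.lstrip("> ").rstrip())
        if match:
            bindings[int(match["rank"])] = (
                match["host"], match["where"], match["mask"])
    return bindings


def missing_ranks(output, nprocs):
    """Ranks in 0..nprocs-1 for which no binding line was reported."""
    return sorted(set(range(nprocs)) - set(parse_bindings(output)))
```

Feeding it the captured stdout/stderr of an mpiexec run together with the -np value gives the list of ranks that reported no binding; an empty list means every rank printed one.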

On Sep 23, 2012, at 3:20 AM, Siegmar Gross <Siegmar.Gross_at_[hidden]> wrote:

> Hi,
>
> yesterday I installed openmpi-1.7a1r27358 and it has an improved
> error message compared to openmpi-1.6.2, but doesn't show process bindings
> and has some other problems as well.
>
>
> "sunpc0" and "linpc0" are each equipped with two dual-core processors and
> run Solaris 10 x86_64 and Linux x86_64, respectively. "tyr" is a
> dual-processor machine running Solaris 10 Sparc.
>
> tyr fd1026 105 mpiexec -np 2 -host sunpc0 -report-bindings \
> -map-by core -bind-to-core date
> Sun Sep 23 11:46:36 CEST 2012
> Sun Sep 23 11:46:36 CEST 2012
>
> tyr fd1026 106 mpicc -showme
> cc -I/usr/local/openmpi-1.7_64_cc/include -mt -m64
> -L/usr/local/openmpi-1.7_64_cc/lib64 -lmpi -lpicl -lm -lkstat -llgrp
> -lsocket -lnsl -lrt -lm
>
>
> openmpi-1.6.2 shows process bindings.
>
> tyr fd1026 103 mpiexec -np 2 -host sunpc0 -report-bindings \
> -bycore -bind-to-core date
> Sun Sep 23 12:09:06 CEST 2012
> [sunpc0:13197] MCW rank 0 bound to socket 0[core 0]: [B .][. .]
> [sunpc0:13197] MCW rank 1 bound to socket 0[core 1]: [. B][. .]
> Sun Sep 23 12:09:06 CEST 2012
>
>
> tyr fd1026 104 mpicc -showme
> cc -I/usr/local/openmpi-1.6.2_64_cc/include -mt -m64
> -L/usr/local/openmpi-1.6.2_64_cc/lib64 -lmpi -lm -lkstat -llgrp
> -lsocket -lnsl -lrt -lm
>
>
> On my Linux machine I get a warning.
>
> tyr fd1026 113 mpiexec -np 2 -host linpc0 -report-bindings \
> -map-by core -bind-to-core date
> --------------------------------------------------------------------------
> WARNING: a request was made to bind a process. While the system
> supports binding the process itself, at least one node does NOT
> support binding memory to the process location.
>
> Node: linpc0
>
> This is a warning only; your job will continue, though performance may
> be degraded.
> --------------------------------------------------------------------------
> Sun Sep 23 11:56:04 CEST 2012
> Sun Sep 23 11:56:04 CEST 2012
>
>
>
> Everything works fine with openmpi-1.6.2.
>
> tyr fd1026 106 mpiexec -np 2 -host linpc0 -report-bindings \
> -bycore -bind-to-core date
> [linpc0:15808] MCW rank 0 bound to socket 0[core 0]: [B .][. .]
> [linpc0:15808] MCW rank 1 bound to socket 0[core 1]: [. B][. .]
> Sun Sep 23 12:11:47 CEST 2012
> Sun Sep 23 12:11:47 CEST 2012
>
>
>
>
> On my Solaris Sparc machine I get the following errors.
>
>
> tyr fd1026 121 mpiexec -np 2 -report-bindings -map-by core -bind-to-core date
> [tyr.informatik.hs-fulda.de:23773] [[32457,0],0] ORTE_ERROR_LOG: Value out of bounds in file
> ../../../../openmpi-1.7a1r27358/orte/mca/odls/base/odls_base_default_fns.c at line 847
> [tyr.informatik.hs-fulda.de:23773] [[32457,0],0] ORTE_ERROR_LOG: Value out of bounds in file
> ../../../../openmpi-1.7a1r27358/orte/mca/odls/base/odls_base_default_fns.c at line 1414
> [tyr.informatik.hs-fulda.de:23773] [[32457,0],0] ORTE_ERROR_LOG: Value out of bounds in file
> ../../../../openmpi-1.7a1r27358/orte/mca/odls/base/odls_base_default_fns.c at line 847
> [tyr.informatik.hs-fulda.de:23773] [[32457,0],0] ORTE_ERROR_LOG: Value out of bounds in file
> ../../../../openmpi-1.7a1r27358/orte/mca/odls/base/odls_base_default_fns.c at line 1414
>
>
>
> tyr fd1026 122 mpiexec -np 2 -host tyr -report-bindings -map-by core -bind-to core date
> --------------------------------------------------------------------------
> All nodes which are allocated for this job are already filled.
> --------------------------------------------------------------------------
>
>
> Once more everything works fine with openmpi-1.6.2.
>
> tyr fd1026 109 mpiexec -np 2 -report-bindings -bycore -bind-to-core date
> [tyr.informatik.hs-fulda.de:23869] MCW rank 0 bound to socket 0[core 0]: [B][.]
> [tyr.informatik.hs-fulda.de:23869] MCW rank 1 bound to socket 1[core 0]: [.][B]
> Sun Sep 23 12:14:09 CEST 2012
> Sun Sep 23 12:14:09 CEST 2012
>
> tyr fd1026 110 mpiexec -np 2 -host tyr -report-bindings -bycore -bind-to-core date
> [tyr.informatik.hs-fulda.de:23877] MCW rank 0 bound to socket 0[core 0]: [B][.]
> [tyr.informatik.hs-fulda.de:23877] MCW rank 1 bound to socket 1[core 0]: [.][B]
> Sun Sep 23 12:16:05 CEST 2012
> Sun Sep 23 12:16:05 CEST 2012
>
>
> Kind regards
>
> Siegmar