Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] openmpi 1.5.4 paffinity with Magny-Cours
From: Kaizaad Bilimorya (kaizaad_at_[hidden])
Date: 2011-09-09 15:39:29


On Fri, 9 Sep 2011, Brice Goglin wrote:
> On 09/09/2011 21:03, Kaizaad Bilimorya wrote:
>>
>> We seem to have an issue similar to this thread
>>
>> "Bug in openmpi 1.5.4 in paffinity"
>> http://www.open-mpi.org/community/lists/users/2011/09/17151.php
>>
>> Using the following version of hwloc (from EPEL repo - we run CentOS 5.6)
>>
>> $ hwloc-info --version
>> hwloc-info 1.1rc6
>
> Hello,
>
> Note that Open MPI 1.5.4 uses its own embedded copy of hwloc 1.2.0.

Ok thanks, good to know.

> Your own 1.1rc6 should actually work fine (does lstopo crash?) but OMPI
> cannot use it :)

lstopo works. When we first got these chips I ran it (great tool btw, it
gave me a better understanding of the chip architecture). It shows an
"interesting" picture for Magny-Cours (i.e., 2 dies per socket along with 2
NUMA nodes - yes, Magny-Cours is a strange beast). We knew this was the
case; it is just nice to see the diagram in all its glory:

http://www.sharcnet.ca/~kaizaad/orca/orca_lstopo.jpg
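
In case it helps anyone reading the archives: the same socket / NUMA-node
counts that lstopo draws can also be queried programmatically through the
hwloc 1.x C API. This is only a minimal sketch (it assumes hwloc headers
and -lhwloc are installed; HWLOC_OBJ_SOCKET / HWLOC_OBJ_NODE are the 1.x
spellings of the object types, and the file name is made up):

/* topo_summary.c - print the socket / NUMA-node layout that lstopo draws.
 * Build with: cc topo_summary.c -o topo_summary -lhwloc
 */
#include <stdio.h>
#include <hwloc.h>

int main(void)
{
    hwloc_topology_t topo;

    hwloc_topology_init(&topo);   /* allocate a topology context               */
    hwloc_topology_load(topo);    /* discover the machine (on Linux this is
                                     restricted to the current cpuset)         */

    int nsockets = hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_SOCKET);
    int nnodes   = hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_NODE);
    int ncores   = hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_CORE);

    printf("%d socket(s), %d NUMA node(s), %d core(s)\n",
           nsockets, nnodes, ncores);

    hwloc_topology_destroy(topo);
    return 0;
}

On a Magny-Cours node like the one in the diagram it should report twice
as many NUMA nodes as sockets.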

>> A simple "mpi_hello" program works fine with cpusets and Open MPI 1.4.2,
>> but with Open MPI 1.5.4 and cpusets we get the following segfault
>> (it works fine on the node without cpusets enabled):
>>
>> [red2:28263] *** Process received signal ***
>> [red2:28263] Signal: Segmentation fault (11)
>> [red2:28263] Signal code: Address not mapped (1)
>> [red2:28263] Failing at address: 0x8
>> [red2:28263] [ 0] /lib64/libpthread.so.0 [0x2b3dce315b10]
>> [red2:28263] [ 1] /opt/sharcnet/openmpi/1.5.4/intel/lib/openmpi/mca_paffinity_hwloc.so(opal_paffinity_hwloc_bitmap_or+0x142) [0x2b3dcef75cb2]
>> [red2:28263] [ 2] /opt/sharcnet/openmpi/1.5.4/intel/lib/openmpi/mca_paffinity_hwloc.so [0x2b3dcef71404]
>> [red2:28263] [ 3] /opt/sharcnet/openmpi/1.5.4/intel/lib/openmpi/mca_paffinity_hwloc.so [0x2b3dcef6bb26]
>> [red2:28263] [ 4] /opt/sharcnet/openmpi/1.5.4/intel/lib/openmpi/mca_paffinity_hwloc.so(opal_paffinity_hwloc_topology_load+0xe2) [0x2b3dcef6e0b2]
>> [red2:28263] [ 5] /opt/sharcnet/openmpi/1.5.4/intel/lib/openmpi/mca_paffinity_hwloc.so [0x2b3dcef68b72]
>> [red2:28263] [ 6] /opt/sharcnet/openmpi/1.5.4/intel/lib/libopen-rte.so.3(mca_base_components_open+0x302) [0x2b3dcd2b08f2]
>> [red2:28263] [ 7] /opt/sharcnet/openmpi/1.5.4/intel/lib/libopen-rte.so.3(opal_paffinity_base_open+0x67) [0x2b3dcd2d3a87]
>> [red2:28263] [ 8] /opt/sharcnet/openmpi/1.5.4/intel/lib/libopen-rte.so.3(opal_init+0x71) [0x2b3dcd28bfb1]
>> [red2:28263] [ 9] /opt/sharcnet/openmpi/1.5.4/intel/lib/libopen-rte.so.3(orte_init+0x23) [0x2b3dcd2318f3]
>> [red2:28263] [10] /opt/sharcnet/openmpi/1.5.4/intel/bin/mpirun [0x4049b5]
>> [red2:28263] [11] /opt/sharcnet/openmpi/1.5.4/intel/bin/mpirun [0x404388]
>> [red2:28263] [12] /lib64/libc.so.6(__libc_start_main+0xf4) [0x2b3dce540994]
>> [red2:28263] [13] /opt/sharcnet/openmpi/1.5.4/intel/bin/mpirun [0x4042b9]
>> [red2:28263] *** End of error message ***
>> /var/spool/torque/mom_priv/jobs/968.SC: line 3: 28263 Segmentation fault
>> /opt/sharcnet/openmpi/1.5.4/intel/bin/mpirun -np 2 ./a.out
>>
>> Please let me know if you need more information about this issue
>
> This looks like the exact same issue. Did you try the patch(es) I sent
> earlier?
> See http://www.open-mpi.org/community/lists/users/2011/09/17159.php
> If it's not enough, try adding the other patch from
> http://www.open-mpi.org/community/lists/users/2011/09/17156.php
> Brice

I'll do that now.
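
For reference, the "mpi_hello" test mentioned above is just the usual
minimal MPI program, roughly like the sketch below (built with mpicc and
run with "mpirun -np 2 ./a.out" as in the trace). Since the segfault
happens inside mpirun's own opal_init, the program body hardly matters:

/* mpi_hello.c - minimal test program (a representative sketch).
 * Build: mpicc mpi_hello.c -o a.out
 * Run:   mpirun -np 2 ./a.out
 */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, size;

    /* The segfault in the trace occurs in mpirun itself, so execution
     * never even reaches this point on the cpuset-enabled node. */
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    printf("Hello from rank %d of %d\n", rank, size);

    MPI_Finalize();
    return 0;
}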

thanks a bunch
-k