Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Bug in openmpi 1.5.4 in paffinity
From: Brice Goglin (Brice.Goglin_at_[hidden])
Date: 2011-09-04 16:13:40


Hello,

Could you log in again on this node (with the same cgroups enabled), run
    hwloc-gather-topology <name>
and send the resulting <name>.output and <name>.tar.bz2?

Send them to the hwloc-devel list or open a ticket on
https://svn.open-mpi.org/trac/hwloc (or send them to me in private if
you don't want to subscribe).

Thanks,
Brice

On 04/09/2011 22:00, Ake Sandgren wrote:
> Hi!
>
> I'm getting a segfault in hwloc_setup_distances_from_os_matrix in the
> call to hwloc_bitmap_or due to objs or objs[i]->cpuset being freed and
> containing garbage; objs[i]->cpuset has infinite < 0.
>
> I only get this when using slurm with cgroups, asking for 2 nodes with 1
> cpu each. The cpuset is then already set when mpiexec starts and
> something breaks down.
>
> valgrind on mpiexec says:
> ==27540== Invalid read of size 8
> ==27540== at 0x7178F79: opal_paffinity_hwloc_finalize_logical_distances (distances.c:412)
> ==27540== by 0x7172C1E: hwloc_discover (topology.c:1805)
> ==27540== by 0x71745F2: opal_paffinity_hwloc_topology_load (topology.c:2244)
> ==27540== by 0x7164FB4: hwloc_open (paffinity_hwloc_component.c:93)
> ==27540== by 0x4F98D2E: mca_base_components_open (mca_base_components_open.c:214)
> ==27540== by 0x500084B: opal_paffinity_base_open (paffinity_base_open.c:120)
> ==27540== by 0x4F525BB: opal_init (opal_init.c:307)
> ==27540== by 0x4E50CA8: orte_init (orte_init.c:78)
> ==27540== by 0x403C8F: orterun (orterun.c:615)
> ==27540== by 0x4032C3: main (main.c:13)
> ==27540== Address 0x6e38380 is 160 bytes inside a block of size 248 free'd
> ==27540== at 0x4C270BD: free (vg_replace_malloc.c:366)
> ==27540== by 0x716B6A1: unlink_and_free_object_and_children (topology.c:1131)
> ==27540== by 0x716BB35: remove_empty (topology.c:1150)
> ==27540== by 0x7170CBB: hwloc_discover (topology.c:1768)
> ==27540== by 0x71745F2: opal_paffinity_hwloc_topology_load (topology.c:2244)
> ==27540== by 0x7164FB4: hwloc_open (paffinity_hwloc_component.c:93)
> ==27540== by 0x4F98D2E: mca_base_components_open (mca_base_components_open.c:214)
> ==27540== by 0x500084B: opal_paffinity_base_open (paffinity_base_open.c:120)
> ==27540== by 0x4F525BB: opal_init (opal_init.c:307)
> ==27540== by 0x4E50CA8: orte_init (orte_init.c:78)
> ==27540== by 0x403C8F: orterun (orterun.c:615)
> ==27540== by 0x4032C3: main (main.c:13)
>
> I hope the above info is enough and that you can fix it :-)
>
> /Åke S.
>
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users