Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] After OS Update MPI_Init fails on one host
From: Ralph Castain (rhc_at_[hidden])
Date: 2013-07-23 14:22:40


Yeah, it's failing when trying to unpack the topology obtained from hwloc. My guess is that one of the following calls changed in hwloc-1.4.3:

        if (0 != hwloc_topology_set_xmlbuffer(t, xmlbuffer, strlen(xmlbuffer))) {
            rc = OPAL_ERROR;
            free(xmlbuffer);
            hwloc_topology_destroy(t);
            goto cleanup;
        }
        /* since we are loading this from an external source, we have to
         * explicitly set a flag so hwloc sets things up correctly
         */
        if (0 != hwloc_topology_set_flags(t, HWLOC_TOPOLOGY_FLAG_IS_THISSYSTEM)) {
            free(xmlbuffer);
            rc = OPAL_ERROR;
            goto cleanup;
        }

The only other things in that routine are hwloc_topology_init and hwloc_topology_load, and those haven't changed in a while.

On Jul 23, 2013, at 11:12 AM, Kevin H. Hobbs <hobbsk_at_[hidden]> wrote:

> On 07/23/2013 09:54 AM, Jeff Squyres (jsquyres) wrote:
>>
>> I don't know if Fedora RPMs include -g in their builds, or if Fedora
>> includes a debuginfo RPM that you could install such that you can attach
>> a debugger and be able to dig into OMPI's internals yourself.
>>
>
> There is a debuginfo package.
>
> Since I removed all of fedora's openmpi packages and installed from
> source into /opt/openmpi-1.6.5 and /opt/openmpi-1.6.5_hwloc-1.4.3 to
> narrow down on this problem, I now have to re-install the rpms with yum.
>
> sudo yum install openmpi openmpi-devel openmpi-debuginfo
>
> These don't put anything into my PATH or LD_LIBRARY_PATH so I have to :
>
> module load mpi/openmpi-x86_64
>
> I compiled my simple program with :
>
> mpicc -g -o mpi_simple mpi_simple.c
>
> The program links to fedora's copies of the libraries of interest :
>
> mpirun -n 1 ldd mpi_simple | grep hwloc
> libhwloc.so.5 => /lib64/libhwloc.so.5 (0x0000003c57600000)
> mpirun -n 1 ldd mpi_simple | grep mpi
> libmpi.so.1 => /usr/lib64/openmpi/lib/libmpi.so.1 (0x00007f7207e29000)
>
> I started the debugger with :
>
> mpirun -n 1 gdb mpi_simple
>
> When run in the debugger I got the error I described.
>
> I reran and in gdb did :
>
> set breakpoint pending on
> break util/nidmap.c:146
> run
> step
>
> took me into 'opal_dss_unpack'. Then I did 'next' until I got past
> 'opal_dss_unpack_buffer', which returned the -1 we see outside.