Open MPI Development Mailing List Archives

Subject: [OMPI devel] [patch] return value not updated in ompi_mpi_init()
From: Guillaume Thouvenin (guillaume.thouvenin_at_[hidden])
Date: 2010-02-09 09:13:38


Hello,

 It seems that the return value is not updated when setting up process
affinity in ompi_mpi_init() (ompi/runtime/ompi_mpi_init.c:459).

 The problem is in the following piece of code:

    [... here ret == OPAL_SUCCESS ...]
    phys_cpu = opal_paffinity_base_get_physical_processor_id(nrank);
    if (0 > phys_cpu) {
        error = "Could not get physical processor id - cannot set processor affinity";
        goto error;
    }
    [...]

 If opal_paffinity_base_get_physical_processor_id() fails, ret is not
updated, so we reach the "error:" label while ret == OPAL_SUCCESS.

 As a result, MPI_Init() returns without having initialized the
MPI_COMM_WORLD struct, which leads to a segmentation fault on calls like
MPI_Comm_size().
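
 To make the pattern concrete, here is a minimal standalone sketch of the
same control flow. It is not the Open MPI code: every sketch_* name and
the SKETCH_* constants are hypothetical, and I am assuming that the
"error:" label only reports a failure when ret is not the success value.

```c
#include <stdio.h>

/* Hypothetical status codes for this sketch (not the real OPAL values). */
#define SKETCH_SUCCESS        0
#define SKETCH_ERR_NOT_FOUND  (-5)

/* Hypothetical stand-in for the physical processor id lookup:
 * returns a negative error code on failure, the id otherwise. */
int sketch_get_physical_id(int rank, int simulate_failure)
{
    return simulate_failure ? SKETCH_ERR_NOT_FOUND : rank;
}

/* Buggy variant: jumps to the error label without updating ret. */
int sketch_init_buggy(int rank, int simulate_failure)
{
    int ret = SKETCH_SUCCESS;
    const char *error = NULL;
    int phys_cpu = sketch_get_physical_id(rank, simulate_failure);

    if (0 > phys_cpu) {
        error = "Could not get physical processor id";
        goto error;               /* ret is still SKETCH_SUCCESS here */
    }
    /* ... the rest of the initialization would happen here ... */

error:
    if (SKETCH_SUCCESS != ret) {  /* never true on the failure path above */
        fprintf(stderr, "init failed: %s (%d)\n", error, ret);
    }
    return ret;                   /* caller sees success after a failure */
}

/* Fixed variant: propagate the error code before jumping. */
int sketch_init_fixed(int rank, int simulate_failure)
{
    int ret = SKETCH_SUCCESS;
    const char *error = NULL;
    int phys_cpu = sketch_get_physical_id(rank, simulate_failure);

    if (0 > phys_cpu) {
        ret = phys_cpu;           /* the one-line change */
        error = "Could not get physical processor id";
        goto error;
    }
    /* ... the rest of the initialization would happen here ... */

error:
    if (SKETCH_SUCCESS != ret) {
        fprintf(stderr, "init failed: %s (%d)\n", error, ret);
    }
    return ret;
}
```

 With a simulated lookup failure, sketch_init_buggy() returns
SKETCH_SUCCESS even though the remaining initialization was skipped,
while sketch_init_fixed() reports the failure and returns the negative
code to the caller.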

 I hit the bug recently on new Westmere processors, on which
opal_paffinity_base_get_physical_processor_id() fails when the MCA
parameter "opal_paffinity_alone 1" is set at run time.

 I'm not sure this is the right way to fix the problem, but here is a
patch tested with v1.5. With it, the problem is reported instead of
causing a segmentation fault.

With the patch, the output is:

--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems. This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  Could not get physical processor id - cannot set processor affinity
  --> Returned "Not found" (-5) instead of "Success" (0)
--------------------------------------------------------------------------

Without the patch, the output was:

 *** Process received signal ***
 Signal: Segmentation fault (11)
 Signal code: Address not mapped (1)
 Failing at address: 0x10
[ 0] /lib64/libpthread.so.0 [0x3d4e20ee90]
[ 1] /home_nfs/thouveng/dev/openmpi-v1.5/lib/libmpi.so.0(MPI_Comm_size+0x9c) [0x7fce74468dfc]
[ 2] ./IMB-MPI1(IMB_init_pointers+0x2f) [0x40629f]
[ 3] ./IMB-MPI1(main+0x65) [0x4035c5]
[ 4] /lib64/libc.so.6(__libc_start_main+0xfd) [0x3d4da1ea2d]
[ 5] ./IMB-MPI1 [0x403499]

Regards,
Guillaume

---
diff --git a/ompi/runtime/ompi_mpi_init.c b/ompi/runtime/ompi_mpi_init.c
--- a/ompi/runtime/ompi_mpi_init.c
+++ b/ompi/runtime/ompi_mpi_init.c
@@ -459,6 +459,7 @@ int ompi_mpi_init(int argc, char **argv,
                 OPAL_PAFFINITY_CPU_ZERO(mask);
                 phys_cpu = opal_paffinity_base_get_physical_processor_id(nrank);
                 if (0 > phys_cpu) {
+                    ret = phys_cpu;
                     error = "Could not get physical processor id - cannot set processor affinity";
                     goto error;
                 }