
Subject: [OMPI devel] [patch] return value not updated in ompi_mpi_init()
From: Guillaume Thouvenin (guillaume.thouvenin_at_[hidden])
Date: 2010-02-09 09:13:38


Hello,

 It seems that a return value is not updated during the setup of
process affinity in the function ompi_mpi_init()
(ompi/runtime/ompi_mpi_init.c:459).

 The problem is in the following piece of code:

    [... here ret == OPAL_SUCCESS ...]
    phys_cpu = opal_paffinity_base_get_physical_processor_id(nrank);
    if (0 > phys_cpu) {
        error = "Could not get physical processor id - cannot set processor affinity";
        goto error;
    }
    [...]

 If opal_paffinity_base_get_physical_processor_id() fails, ret is not
updated and we reach the "error:" label while ret is still OPAL_SUCCESS.
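
 To make the effect of the missing assignment clearer, here is a small
standalone illustration of the pattern (hypothetical code, not taken from
Open MPI): the error label decides what to report based on ret, so when
the failing branch forgets to set it, the caller sees success even though
initialization bailed out early.

    #include <stdio.h>

    #define SUCCESS 0

    static int get_processor_id(void) { return -5; /* simulate the failure */ }

    static int init(void)
    {
        int ret = SUCCESS;
        const char *error = NULL;
        int phys_cpu;

        phys_cpu = get_processor_id();
        if (0 > phys_cpu) {
            /* BUG: ret is not updated here */
            error = "Could not get physical processor id";
            goto error;
        }

        /* ... rest of the initialization would go here ... */

    error:
        if (SUCCESS != ret) {
            fprintf(stderr, "init failed: %s (%d)\n", error, ret);
        }
        return ret;   /* returns SUCCESS even though we bailed out early */
    }

    int main(void)
    {
        if (SUCCESS == init()) {
            puts("caller believes init succeeded");
        }
        return 0;
    }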

 As a result MPI_Init() returns without having initialized the
MPI_COMM_WORLD struct, which leads to a segmentation fault on subsequent
calls such as MPI_Comm_size().
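
 For reference, any program that calls MPI_Comm_size() right after
MPI_Init() is enough to trigger the crash; a minimal reproducer along the
lines of what IMB does (hypothetical, not the IMB source) would be:

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int size;

        MPI_Init(&argc, &argv);    /* "succeeds" but MPI_COMM_WORLD is not set up */
        MPI_Comm_size(MPI_COMM_WORLD, &size);    /* segfaults here */
        printf("size = %d\n", size);
        MPI_Finalize();
        return 0;
    }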

 I hit the bug recently on new Westmere processors, on which
opal_paffinity_base_get_physical_processor_id() fails when the MCA
parameter "opal_paffinity_alone 1" is used.
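
 For example (assuming the usual mpirun syntax for setting an MCA
parameter on the command line):

    mpirun --mca opal_paffinity_alone 1 -np 2 ./IMB-MPI1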

 I'm not sure this is the right way to fix the problem, but here is a
patch tested against v1.5. With the patch the problem is reported instead
of causing a segmentation fault.

With the patch, the output is:

--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems. This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  Could not get physical processor id - cannot set processor affinity
  --> Returned "Not found" (-5) instead of "Success" (0)
--------------------------------------------------------------------------

Without the patch, the output was:

 *** Process received signal ***
 Signal: Segmentation fault (11)
 Signal code: Address not mapped (1)
 Failing at address: 0x10
[ 0] /lib64/libpthread.so.0 [0x3d4e20ee90]
[ 1] /home_nfs/thouveng/dev/openmpi-v1.5/lib/libmpi.so.0(MPI_Comm_size+0x9c) [0x7fce74468dfc]
[ 2] ./IMB-MPI1(IMB_init_pointers+0x2f) [0x40629f]
[ 3] ./IMB-MPI1(main+0x65) [0x4035c5]
[ 4] /lib64/libc.so.6(__libc_start_main+0xfd) [0x3d4da1ea2d]
[ 5] ./IMB-MPI1 [0x403499]

Regards,
Guillaume

---
diff --git a/ompi/runtime/ompi_mpi_init.c b/ompi/runtime/ompi_mpi_init.c
--- a/ompi/runtime/ompi_mpi_init.c
+++ b/ompi/runtime/ompi_mpi_init.c
@@ -459,6 +459,7 @@ int ompi_mpi_init(int argc, char **argv,
                 OPAL_PAFFINITY_CPU_ZERO(mask);
                 phys_cpu = opal_paffinity_base_get_physical_processor_id(nrank);
                 if (0 > phys_cpu) {
+                    ret = phys_cpu;
                     error = "Could not get physical processor id - cannot set processor affinity";
                     goto error;
                 }