Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] [patch] return value not updated in ompi_mpi_init()
From: Ralph Castain (rhc_at_[hidden])
Date: 2010-02-09 11:29:19


Oops - yep, that is an oversight! Will fix - thanks!

On Feb 9, 2010, at 7:13 AM, Guillaume Thouvenin wrote:

> Hello,
>
> It seems that a return value is not updated during the setup of
> process affinity in ompi_mpi_init() (ompi/runtime/ompi_mpi_init.c:459).
>
> The problem is in the following piece of code:
>
> [... here ret == OPAL_SUCCESS ...]
> phys_cpu = opal_paffinity_base_get_physical_processor_id(nrank);
> if (0 > phys_cpu) {
>     error = "Could not get physical processor id - cannot set processor affinity";
>     goto error;
> }
> [...]
> [...]
>
> If opal_paffinity_base_get_physical_processor_id() fails, ret is not
> updated, so we reach the "error:" label while ret == OPAL_SUCCESS.
>
> As a result, MPI_Init() returns without having initialized the
> MPI_COMM_WORLD struct, leading to a segmentation fault on subsequent
> calls such as MPI_Comm_size() (see the reproducer sketch at the end of
> this message).
>
> I hit this bug recently on new Westmere processors, on which
> opal_paffinity_base_get_physical_processor_id() fails when the MCA
> parameter "opal_paffinity_alone 1" is used during execution.
>
> I'm not sure this is the right way to fix the problem, but here is a
> patch tested with v1.5. It reports the error instead of generating a
> segmentation fault.
>
> With the patch, the output is:
>
> --------------------------------------------------------------------------
> It looks like MPI_INIT failed for some reason; your parallel process is
> likely to abort. There are many reasons that a parallel process can
> fail during MPI_INIT; some of which are due to configuration or environment
> problems. This failure appears to be an internal failure; here's some
> additional information (which may only be relevant to an Open MPI
> developer):
>
> Could not get physical processor id - cannot set processor affinity
> --> Returned "Not found" (-5) instead of "Success" (0)
> --------------------------------------------------------------------------
>
> Without the patch, the output was:
>
> *** Process received signal ***
> Signal: Segmentation fault (11)
> Signal code: Address not mapped (1)
> Failing at address: 0x10
> [ 0] /lib64/libpthread.so.0 [0x3d4e20ee90]
> [ 1] /home_nfs/thouveng/dev/openmpi-v1.5/lib/libmpi.so.0(MPI_Comm_size+0x9c) [0x7fce74468dfc]
> [ 2] ./IMB-MPI1(IMB_init_pointers+0x2f) [0x40629f]
> [ 3] ./IMB-MPI1(main+0x65) [0x4035c5]
> [ 4] /lib64/libc.so.6(__libc_start_main+0xfd) [0x3d4da1ea2d]
> [ 5] ./IMB-MPI1 [0x403499]
>
>
> Regards,
> Guillaume
>
> ---
> diff --git a/ompi/runtime/ompi_mpi_init.c b/ompi/runtime/ompi_mpi_init.c
> --- a/ompi/runtime/ompi_mpi_init.c
> +++ b/ompi/runtime/ompi_mpi_init.c
> @@ -459,6 +459,7 @@ int ompi_mpi_init(int argc, char **argv,
>          OPAL_PAFFINITY_CPU_ZERO(mask);
>          phys_cpu = opal_paffinity_base_get_physical_processor_id(nrank);
>          if (0 > phys_cpu) {
> +            ret = phys_cpu;
>              error = "Could not get physical processor id - cannot set processor affinity";
>              goto error;
>          }
> _______________________________________________
> devel mailing list
> devel_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
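
As a quick illustration of the failure mode described above, here is a
minimal reproducer that follows the same call sequence as IMB-MPI1. This
is an illustrative sketch, not part of the original report; the file name
and build/run lines are assumptions, and the crash only occurs on systems
where opal_paffinity_base_get_physical_processor_id() fails (e.g. the
Westmere machines mentioned above):

/* repro.c - query MPI_COMM_WORLD right after MPI_Init().
 *
 * Build:  mpicc repro.c -o repro
 * Run:    mpirun --mca opal_paffinity_alone 1 -np 2 ./repro
 *
 * Without the patch, ompi_mpi_init() reaches its "error:" label with
 * ret still OPAL_SUCCESS, so MPI_Init() returns as if initialization
 * had succeeded and MPI_Comm_size() segfaults on the partially
 * initialized MPI_COMM_WORLD. With the patch, MPI_Init() fails cleanly
 * and the "Could not get physical processor id" help message is
 * printed instead. */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int size;

    MPI_Init(&argc, &argv);

    /* Crash site without the patch. */
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("MPI_COMM_WORLD size = %d\n", size);

    MPI_Finalize();
    return 0;
}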