
Open MPI Development Mailing List Archives


Subject: Re: [OMPI devel] SEGFAULT in mpi_init from paffinity with intel 11.1.059 compiler
From: Ralph Castain (rhc_at_[hidden])
Date: 2009-12-12 20:48:44


This looks like an uninitialized variable that gnu c handles one way and intel another. Someone recently contributed a patch to the ompi trunk to fix just such a thing in this code area - don't know if it addresses this problem or not.

Can you try the ompi trunk (a nightly tarball from the last day or so forward) and see if this still occurs?

Thanks
Ralph

On Dec 11, 2009, at 4:06 PM, Daan van Rossum wrote:

> Hi all,
>
> There's a problem with ompi 1.3.4 when compiled with the intel 11.1.059 c compiler, related to the built-in processor binding functionality. The problem does not occur when ompi is compiled with the gnu c compiler.
>
> An mpi program fails with a segfault in mpi_init() when the following rank file is used:
> rank 0=node01 slot=0-3
> rank 1=node01 slot=0-3
> but runs fine with:
> rank 0=node01 slot=0
> rank 1=node01 slot=1-3
> and fine with:
> rank 0=node01 slot=0-1
> rank 1=node01 slot=1-3
> but segfaults with:
> rank 0=node01 slot=0-2
> rank 1=node01 slot=1-3
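
To compare the four rankfile variants above side by side, a small helper can expand each slot range into CPU ids. This is only a sketch: `parse_rankfile` and `shared_cpus` are hypothetical names, this is not Open MPI's own rankfile parser, and it handles only the plain `slot=N` and `slot=N-M` forms, not socket-qualified ones such as `slot=p0:0-3` or `slot=0:*`.

```python
import re

# Matches only the simple rankfile form quoted in this message, e.g.
# "rank 0=node01 slot=0-3" or "rank 0=node01 slot=0".
# Assumption: socket-qualified forms (slot=p0:0-3, slot=0:*) are out of scope.
_LINE = re.compile(r"^rank\s+(\d+)=(\S+)\s+slot=(\d+)(?:-(\d+))?$")


def parse_rankfile(text):
    """Return {rank: (host, [cpu ids])} for simple slot=N / slot=N-M lines."""
    result = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        m = _LINE.match(line)
        if m is None:
            raise ValueError(f"unsupported rankfile line: {line!r}")
        rank, host, first, last = m.groups()
        lo = int(first)
        hi = int(last) if last is not None else lo
        if hi < lo:
            raise ValueError(f"descending slot range in line: {line!r}")
        result[int(rank)] = (host, list(range(lo, hi + 1)))
    return result


def shared_cpus(rankfile, a, b):
    """CPU ids that ranks a and b may both run on (empty if hosts differ)."""
    host_a, cpus_a = rankfile[a]
    host_b, cpus_b = rankfile[b]
    if host_a != host_b:
        return []
    return sorted(set(cpus_a) & set(cpus_b))
```

Applied to the variants above, the working `0-1` / `1-3` file and the failing `0-2` / `1-3` file both have overlapping slot lists, so overlap alone does not separate the passing cases from the failing ones.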
>
> This is on a two-processor quad-core opteron machine (occurs on all nodes of the cluster) with Ubuntu 8.10, kernel 2.6.27-16.
> This is the simplest case that fails. Generally, I would like to bind processes to physical processors but always allow any core, like
> rank 0=node01 slot=p0:0-3
> rank 1=node01 slot=p0:0-3
> rank 2=node01 slot=p0:0-3
> rank 3=node01 slot=p0:0-3
> rank 4=node01 slot=p1:0-3
> rank 5=node01 slot=p1:0-3
> rank 6=node01 slot=p1:0-3
> rank 7=node01 slot=p1:0-3
> which fails too.
>
> This happens with a test code that contains only two lines of code, calling mpi_init and then mpi_finalize, and happens in both fortran and c.
>
> One more interesting thing is that the problem with setting the process affinity does not occur on our four-processor quad-core opteron nodes, which run exactly the same OS etc.
>
>
> Setting "--mca paffinity_base_verbose 5" shows what is going wrong for this rankfile:
> rank 0=node01 slot=0-3
> rank 1=node01 slot=0-3
> ------------- WRONG -----------------
> [node01:23174] mca:base:select:(paffinity) Querying component [linux]
> [node01:23174] mca:base:select:(paffinity) Query of component [linux] set priority to 10
> [node01:23174] mca:base:select:(paffinity) Selected component [linux]
> [node01:23174] paffinity slot assignment: slot_list == 0-3
> [node01:23174] paffinity slot assignment: rank 0 runs on cpu #0 (#0)
> [node01:23174] paffinity slot assignment: rank 0 runs on cpu #1 (#1)
> [node01:23174] paffinity slot assignment: rank 0 runs on cpu #2 (#2)
> [node01:23174] paffinity slot assignment: rank 0 runs on cpu #3 (#3)
> [node01:23174] paffinity slot assignment: slot_list == 0-3
> [node01:23174] paffinity slot assignment: rank 1 runs on cpu #0 (#0)
> [node01:23174] paffinity slot assignment: rank 1 runs on cpu #1 (#1)
> [node01:23174] paffinity slot assignment: rank 1 runs on cpu #2 (#2)
> [node01:23174] paffinity slot assignment: rank 1 runs on cpu #3 (#3)
> [node01:23175] mca:base:select:(paffinity) Querying component [linux]
> [node01:23175] mca:base:select:(paffinity) Query of component [linux] set priority to 10
> [node01:23175] mca:base:select:(paffinity) Selected component [linux]
> [node01:23176] mca:base:select:(paffinity) Querying component [linux]
> [node01:23176] mca:base:select:(paffinity) Query of component [linux] set priority to 10
> [node01:23176] mca:base:select:(paffinity) Selected component [linux]
> [node01:23175] paffinity slot assignment: slot_list == 0-3
> [node01:23175] paffinity slot assignment: rank 0 runs on cpu #0 (#0)
> [node01:23175] paffinity slot assignment: rank 0 runs on cpu #1 (#1)
> [node01:23175] paffinity slot assignment: rank 0 runs on cpu #2 (#2)
> [node01:23175] paffinity slot assignment: rank 0 runs on cpu #3 (#3)
> [node01:23176] paffinity slot assignment: slot_list == 0-3
> [node01:23176] paffinity slot assignment: rank 1 runs on cpu #0 (#0)
> [node01:23176] paffinity slot assignment: rank 1 runs on cpu #1 (#1)
> [node01:23176] paffinity slot assignment: rank 1 runs on cpu #2 (#2)
> [node01:23176] paffinity slot assignment: rank 1 runs on cpu #3 (#3)
> [node01:23175] *** Process received signal ***
> [node01:23176] *** Process received signal ***
> [node01:23175] Signal: Segmentation fault (11)
> [node01:23175] Signal code: Address not mapped (1)
> [node01:23175] Failing at address: 0x30
> [node01:23176] Signal: Segmentation fault (11)
> [node01:23176] Signal code: Address not mapped (1)
> [node01:23176] Failing at address: 0x30
> ------------- WRONG -----------------
>
> ------------- RIGHT -----------------
> [node25:23241] mca:base:select:(paffinity) Querying component [linux]
> [node25:23241] mca:base:select:(paffinity) Query of component [linux] set priority to 10
> [node25:23241] mca:base:select:(paffinity) Selected component [linux]
> [node25:23241] paffinity slot assignment: slot_list == 0-3
> [node25:23241] paffinity slot assignment: rank 0 runs on cpu #0 (#0)
> [node25:23241] paffinity slot assignment: rank 0 runs on cpu #1 (#1)
> [node25:23241] paffinity slot assignment: rank 0 runs on cpu #2 (#2)
> [node25:23241] paffinity slot assignment: rank 0 runs on cpu #3 (#3)
> [node25:23241] paffinity slot assignment: slot_list == 0-3
> [node25:23241] paffinity slot assignment: rank 1 runs on cpu #0 (#0)
> [node25:23241] paffinity slot assignment: rank 1 runs on cpu #1 (#1)
> [node25:23241] paffinity slot assignment: rank 1 runs on cpu #2 (#2)
> [node25:23241] paffinity slot assignment: rank 1 runs on cpu #3 (#3)
> [node25:23242] mca:base:select:(paffinity) Querying component [linux]
> [node25:23242] mca:base:select:(paffinity) Query of component [linux] set priority to 10
> [node25:23242] mca:base:select:(paffinity) Selected component [linux]
> [node25:23243] mca:base:select:(paffinity) Querying component [linux]
> [node25:23243] mca:base:select:(paffinity) Query of component [linux] set priority to 10
> [node25:23243] mca:base:select:(paffinity) Selected component [linux]
> ------------- RIGHT -----------------
>
> Apparently, only a master process (ID [node01:23174] and [node25:23241]) sets the paffinity in the RIGHT case, but in the WRONG case the compute processes ([node01:23175] and [node01:23176], rank0 and rank1) also try to set their own paffinity properties.
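
One way to check this observation on longer logs is to group the slot-assignment lines by the process that printed them. The following is a minimal sketch that assumes the log line shape shown in the verbose output above; `assignments_by_process` is a hypothetical helper name, not part of Open MPI.

```python
import re
from collections import defaultdict

# Line shape taken from the verbose output quoted in this message:
# "[node01:23174] paffinity slot assignment: rank 0 runs on cpu #0 (#0)"
# search() is used so a leading "> " quote prefix does not matter.
_ASSIGN = re.compile(
    r"\[(?P<host>[^:\]]+):(?P<pid>\d+)\] paffinity slot assignment: "
    r"rank (?P<rank>\d+) runs on cpu #(?P<cpu>\d+)"
)


def assignments_by_process(log_text):
    """Map (host, pid) -> {rank: [cpu, ...]} for every slot-assignment line."""
    found = defaultdict(lambda: defaultdict(list))
    for line in log_text.splitlines():
        m = _ASSIGN.search(line)
        if m:
            key = (m["host"], int(m["pid"]))
            found[key][int(m["rank"])].append(int(m["cpu"]))
    # Convert nested defaultdicts to plain dicts for easier comparison.
    return {key: dict(ranks) for key, ranks in found.items()}
```

If the result has a single key, only one process printed slot assignments, as in the RIGHT log. Several keys mean that more than one process did, as in the WRONG log.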
>
>
>
> Note that the following rankfile notation does not work either. That seems to have a different origin, though, as it tries to bind to core #4, whereas there are only cores 0-3.
> rank 0=node01 slot=0:*
> rank 1=node01 slot=0:*
>
>
> Thanks for your help on this!
>
> --
> Daan van Rossum