Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] SEGFAULT in mpi_init from paffinity with intel 11.1.059 compiler
From: Terry Dontje (Terry.Dontje_at_[hidden])
Date: 2009-12-14 14:57:43


I don't really want to throw FUD on this list, but we've seen all sorts
of oddities with OMPI 1.3.4 being built with Intel's 11.1 compiler
versus their 11.0 or other compilers (gcc, Sun Studio, PGI, and
PathScale). I have not tested your specific failing case, but
considering your issue doesn't show up with gcc, I am wondering if there
is some sort of optimization issue with the 11.1 compiler.

It might be interesting to see if using certain optimization levels with
the Intel 11.1 compiler produces a working OMPI library.
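
For example, rebuilding with optimization dialed down would be a quick first test (the flags below are just a guess at a starting point, not a known fix):

./configure CC=icc CXX=icpc F77=ifort FC=ifort CFLAGS=-O1 CXXFLAGS=-O1 FFLAGS=-O1
make all install

If that build passes your rankfile test while the default-optimization build segfaults, that would point fairly strongly at a compiler code-generation issue.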

--td

Daan van Rossum wrote:
> Hi Ralph,
>
> I took the Dec 10th snapshot, but got exactly the same behavior as with version 1.3.4.
>
> I just noticed that even this rankfile, with just a single process, doesn't work:
> rank 0=node01 slot=0-3
>
> ------------
> [node01:31105] mca:base:select:(paffinity) Querying component [linux]
> [node01:31105] mca:base:select:(paffinity) Query of component [linux] set priority to 10
> [node01:31105] mca:base:select:(paffinity) Selected component [linux]
> [node01:31105] paffinity slot assignment: slot_list == 0-3
> [node01:31105] paffinity slot assignment: rank 0 runs on cpu #0 (#0)
> [node01:31105] paffinity slot assignment: rank 0 runs on cpu #1 (#1)
> [node01:31105] paffinity slot assignment: rank 0 runs on cpu #2 (#2)
> [node01:31105] paffinity slot assignment: rank 0 runs on cpu #3 (#3)
> [node01:31106] mca:base:select:(paffinity) Querying component [linux]
> [node01:31106] mca:base:select:(paffinity) Query of component [linux] set priority to 10
> [node01:31106] mca:base:select:(paffinity) Selected component [linux]
> [node01:31106] paffinity slot assignment: slot_list == 0-3
> [node01:31106] paffinity slot assignment: rank 0 runs on cpu #0 (#0)
> [node01:31106] paffinity slot assignment: rank 0 runs on cpu #1 (#1)
> [node01:31106] paffinity slot assignment: rank 0 runs on cpu #2 (#2)
> [node01:31106] paffinity slot assignment: rank 0 runs on cpu #3 (#3)
> [node01:31106] *** An error occurred in MPI_Comm_rank
> [node01:31106] *** on a NULL communicator
> [node01:31106] *** Unknown error
> [node01:31106] *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
> forrtl: severe (174): SIGSEGV, segmentation fault occurred
> ------------
>
> The spawned compute process doesn't sense that it should skip setting paffinity...
>
>
> I saw the posting from last July about a similar problem (the one I mention at the bottom, with the slot=0:* notation not working). But that is a different problem (besides, it still doesn't seem to work).
>
> Best,
> Daan
>
> * on Saturday, 12.12.09 at 18:48, Ralph Castain <rhc_at_[hidden]> wrote:
>
>
>> This looks like an uninitialized variable that GNU C handles one way and Intel another. Someone recently contributed a patch to the OMPI trunk to fix just such a thing in this code area - don't know if it addresses this problem or not.
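>>
>> Just to illustrate the failure mode I mean (a contrived sketch in plain C, not the actual OMPI code; the variable name is made up):
>>
>> #include <stdio.h>
>>
>> int main(void) {
>>     int use_slot_list;                  /* never initialized - reading it is undefined */
>>     if (use_slot_list) {                /* one compiler's stack may happen to hold 0   */
>>         printf("setting affinity\n");   /* here, another's garbage, so the code takes  */
>>     }                                   /* different paths on different compilers      */
>>     return 0;
>> }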
>>
>> Can you try the ompi trunk (a nightly tarball from the last day or so forward) and see if this still occurs?
>>
>> Thanks
>> Ralph
>>
>> On Dec 11, 2009, at 4:06 PM, Daan van Rossum wrote:
>>
>>
>>> Hi all,
>>>
>>> There's a problem with OMPI 1.3.4 when compiled with the Intel 11.1.059 C compiler, related to the built-in processor binding functionality. The problem does not occur when OMPI is compiled with the GNU C compiler.
>>>
>>> An MPI program fails (segfaults) in mpi_init() when the following rankfile is used:
>>> rank 0=node01 slot=0-3
>>> rank 1=node01 slot=0-3
>>> but runs fine with:
>>> rank 0=node01 slot=0
>>> rank 1=node01 slot=1-3
>>> and fine with:
>>> rank 0=node01 slot=0-1
>>> rank 1=node01 slot=1-3
>>> but segfaults with:
>>> rank 0=node01 slot=0-2
>>> rank 1=node01 slot=1-3
>>>
>>> This is on a two-processor quad-core Opteron machine (the problem occurs on all nodes of the cluster) with Ubuntu 8.10, kernel 2.6.27-16.
>>> This is the simplest case that fails. Generally, I would like to bind processes to physical processors but always allow any core within them, like
>>> rank 0=node01 slot=p0:0-3
>>> rank 1=node01 slot=p0:0-3
>>> rank 2=node01 slot=p0:0-3
>>> rank 3=node01 slot=p0:0-3
>>> rank 4=node01 slot=p1:0-3
>>> rank 5=node01 slot=p1:0-3
>>> rank 6=node01 slot=p1:0-3
>>> rank 7=node01 slot=p1:0-3
>>> which fails too.
>>>
>>> This happens with a test code that contains only two lines, calling mpi_init and then mpi_finalize, and it happens in both Fortran and C.
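>>>
>>> For reference, the C version is essentially just this (a minimal sketch):
>>>
>>> #include <mpi.h>
>>>
>>> int main(int argc, char **argv) {
>>>     MPI_Init(&argc, &argv);    /* the segfault happens in here */
>>>     MPI_Finalize();
>>>     return 0;
>>> }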
>>>
>>> One more interesting thing is that the problem with setting the process affinity does not occur on our four-processor quad-core Opteron nodes, with exactly the same OS etc.
>>>
>>>
>>> Setting "--mca paffinity_base_verbose 5" shows what is going wrong for this rankfile:
>>> rank 0=node01 slot=0-3
>>> rank 1=node01 slot=0-3
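>>>
>>> (The runs below were launched with something along these lines; the executable name is just a placeholder:)
>>>
>>> mpirun -np 2 -rf rankfile --mca paffinity_base_verbose 5 ./a.out
>>>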
>>> ------------- WRONG -----------------
>>> [node01:23174] mca:base:select:(paffinity) Querying component [linux]
>>> [node01:23174] mca:base:select:(paffinity) Query of component [linux] set priority to 10
>>> [node01:23174] mca:base:select:(paffinity) Selected component [linux]
>>> [node01:23174] paffinity slot assignment: slot_list == 0-3
>>> [node01:23174] paffinity slot assignment: rank 0 runs on cpu #0 (#0)
>>> [node01:23174] paffinity slot assignment: rank 0 runs on cpu #1 (#1)
>>> [node01:23174] paffinity slot assignment: rank 0 runs on cpu #2 (#2)
>>> [node01:23174] paffinity slot assignment: rank 0 runs on cpu #3 (#3)
>>> [node01:23174] paffinity slot assignment: slot_list == 0-3
>>> [node01:23174] paffinity slot assignment: rank 1 runs on cpu #0 (#0)
>>> [node01:23174] paffinity slot assignment: rank 1 runs on cpu #1 (#1)
>>> [node01:23174] paffinity slot assignment: rank 1 runs on cpu #2 (#2)
>>> [node01:23174] paffinity slot assignment: rank 1 runs on cpu #3 (#3)
>>> [node01:23175] mca:base:select:(paffinity) Querying component [linux]
>>> [node01:23175] mca:base:select:(paffinity) Query of component [linux] set priority to 10
>>> [node01:23175] mca:base:select:(paffinity) Selected component [linux]
>>> [node01:23176] mca:base:select:(paffinity) Querying component [linux]
>>> [node01:23176] mca:base:select:(paffinity) Query of component [linux] set priority to 10
>>> [node01:23176] mca:base:select:(paffinity) Selected component [linux]
>>> [node01:23175] paffinity slot assignment: slot_list == 0-3
>>> [node01:23175] paffinity slot assignment: rank 0 runs on cpu #0 (#0)
>>> [node01:23175] paffinity slot assignment: rank 0 runs on cpu #1 (#1)
>>> [node01:23175] paffinity slot assignment: rank 0 runs on cpu #2 (#2)
>>> [node01:23175] paffinity slot assignment: rank 0 runs on cpu #3 (#3)
>>> [node01:23176] paffinity slot assignment: slot_list == 0-3
>>> [node01:23176] paffinity slot assignment: rank 1 runs on cpu #0 (#0)
>>> [node01:23176] paffinity slot assignment: rank 1 runs on cpu #1 (#1)
>>> [node01:23176] paffinity slot assignment: rank 1 runs on cpu #2 (#2)
>>> [node01:23176] paffinity slot assignment: rank 1 runs on cpu #3 (#3)
>>> [node01:23175] *** Process received signal ***
>>> [node01:23176] *** Process received signal ***
>>> [node01:23175] Signal: Segmentation fault (11)
>>> [node01:23175] Signal code: Address not mapped (1)
>>> [node01:23175] Failing at address: 0x30
>>> [node01:23176] Signal: Segmentation fault (11)
>>> [node01:23176] Signal code: Address not mapped (1)
>>> [node01:23176] Failing at address: 0x30
>>> ------------- WRONG -----------------
>>>
>>> ------------- RIGHT -----------------
>>> [node25:23241] mca:base:select:(paffinity) Querying component [linux]
>>> [node25:23241] mca:base:select:(paffinity) Query of component [linux] set priority to 10
>>> [node25:23241] mca:base:select:(paffinity) Selected component [linux]
>>> [node25:23241] paffinity slot assignment: slot_list == 0-3
>>> [node25:23241] paffinity slot assignment: rank 0 runs on cpu #0 (#0)
>>> [node25:23241] paffinity slot assignment: rank 0 runs on cpu #1 (#1)
>>> [node25:23241] paffinity slot assignment: rank 0 runs on cpu #2 (#2)
>>> [node25:23241] paffinity slot assignment: rank 0 runs on cpu #3 (#3)
>>> [node25:23241] paffinity slot assignment: slot_list == 0-3
>>> [node25:23241] paffinity slot assignment: rank 1 runs on cpu #0 (#0)
>>> [node25:23241] paffinity slot assignment: rank 1 runs on cpu #1 (#1)
>>> [node25:23241] paffinity slot assignment: rank 1 runs on cpu #2 (#2)
>>> [node25:23241] paffinity slot assignment: rank 1 runs on cpu #3 (#3)
>>> [node25:23242] mca:base:select:(paffinity) Querying component [linux]
>>> [node25:23242] mca:base:select:(paffinity) Query of component [linux] set priority to 10
>>> [node25:23242] mca:base:select:(paffinity) Selected component [linux]
>>> [node25:23243] mca:base:select:(paffinity) Querying component [linux]
>>> [node25:23243] mca:base:select:(paffinity) Query of component [linux] set priority to 10
>>> [node25:23243] mca:base:select:(paffinity) Selected component [linux]
>>> ------------- RIGHT -----------------
>>>
>>> Apparently, only the master process ([node01:23174] and [node25:23241], respectively) should set the paffinity, as happens in the RIGHT case; but in the WRONG case the compute processes ([node01:23175] and [node01:23176], rank 0 and rank 1) also try to set their own paffinity properties.
>>>
>>>
>>>
>>> Note that the following rankfile notation does not work either. But that seems to have a different origin, as it tries to bind to core #4, whereas there are just cores 0-3:
>>> rank 0=node01 slot=0:*
>>> rank 1=node01 slot=0:*
>>>
>>>
>>> Thanks for your help on this!
>>>
>>> --
>>> Daan van Rossum
>
> --
> Daan van Rossum
>
> University of Chicago
> Department of Astronomy and Astrophysics
> 5640 S. Ellis Ave
> Chicago, IL 60637
> phone: 773-7020624