
Open MPI Development Mailing List Archives


Subject: Re: [OMPI devel] SEGFAULT in mpi_init from paffinity with intel 11.1.059 compiler
From: Ralph Castain (rhc_at_[hidden])
Date: 2009-12-14 14:41:08


I'll have to look through the logic and see if I can spot something obvious. I don't have access to an Intel compiler, and as you noted it works fine with gcc. Afraid I can't do much more than that, so this may take a while and won't necessarily have a positive result.

On Dec 14, 2009, at 12:32 PM, Daan van Rossum wrote:

> Hi Ralph,
>
> I took the Dec 10th snapshot, but got exactly the same behavior as with version 1.3.4.
>
> I just noticed that even this rankfile, with a single process, doesn't work:
> rank 0=node01 slot=0-3
>
> ------------
> [node01:31105] mca:base:select:(paffinity) Querying component [linux]
> [node01:31105] mca:base:select:(paffinity) Query of component [linux] set priority to 10
> [node01:31105] mca:base:select:(paffinity) Selected component [linux]
> [node01:31105] paffinity slot assignment: slot_list == 0-3
> [node01:31105] paffinity slot assignment: rank 0 runs on cpu #0 (#0)
> [node01:31105] paffinity slot assignment: rank 0 runs on cpu #1 (#1)
> [node01:31105] paffinity slot assignment: rank 0 runs on cpu #2 (#2)
> [node01:31105] paffinity slot assignment: rank 0 runs on cpu #3 (#3)
> [node01:31106] mca:base:select:(paffinity) Querying component [linux]
> [node01:31106] mca:base:select:(paffinity) Query of component [linux] set priority to 10
> [node01:31106] mca:base:select:(paffinity) Selected component [linux]
> [node01:31106] paffinity slot assignment: slot_list == 0-3
> [node01:31106] paffinity slot assignment: rank 0 runs on cpu #0 (#0)
> [node01:31106] paffinity slot assignment: rank 0 runs on cpu #1 (#1)
> [node01:31106] paffinity slot assignment: rank 0 runs on cpu #2 (#2)
> [node01:31106] paffinity slot assignment: rank 0 runs on cpu #3 (#3)
> [node01:31106] *** An error occurred in MPI_Comm_rank
> [node01:31106] *** on a NULL communicator
> [node01:31106] *** Unknown error
> [node01:31106] *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
> forrtl: severe (174): SIGSEGV, segmentation fault occurred
> ------------
>
> The spawned compute process doesn't sense that it should skip setting the paffinity...
>
>
> I saw the posting from last July about a similar problem (the problem I mentioned at the bottom, with the slot=0:* notation not working). But that is a different problem (and besides, it still doesn't seem to work).
>
> Best,
> Daan
>
> * on Saturday, 12.12.09 at 18:48, Ralph Castain <rhc_at_[hidden]> wrote:
>
>> This looks like an uninitialized variable that GNU C handles one way and Intel another. Someone recently contributed a patch to the ompi trunk to fix just such a thing in this code area - I don't know whether it addresses this problem or not.
>>
>> Can you try the ompi trunk (a nightly tarball from the last day or so forward) and see if this still occurs?
>>
>> Thanks
>> Ralph
>>
>> On Dec 11, 2009, at 4:06 PM, Daan van Rossum wrote:
>>
>>> Hi all,
>>>
>>> There's a problem with ompi 1.3.4 when compiled with the Intel 11.1.059 C compiler, related to the built-in processor-binding functionality. The problem does not occur when ompi is compiled with the GNU C compiler.
>>>
>>> An MPI program fails (segfaults) in mpi_init() when the following rankfile is used:
>>> rank 0=node01 slot=0-3
>>> rank 1=node01 slot=0-3
>>> but runs fine with:
>>> rank 0=node01 slot=0
>>> rank 1=node01 slot=1-3
>>> and fine with:
>>> rank 0=node01 slot=0-1
>>> rank 1=node01 slot=1-3
>>> but segfaults with:
>>> rank 0=node01 slot=0-2
>>> rank 1=node01 slot=1-3
>>>
>>> This is on a two-processor quad-core Opteron machine (it occurs on all nodes of the cluster) with Ubuntu 8.10, kernel 2.6.27-16.
>>> This is the simplest case that fails. Generally, I would like to bind processes to physical processors but always allow any core, like
>>> rank 0=node01 slot=p0:0-3
>>> rank 1=node01 slot=p0:0-3
>>> rank 2=node01 slot=p0:0-3
>>> rank 3=node01 slot=p0:0-3
>>> rank 4=node01 slot=p1:0-3
>>> rank 5=node01 slot=p1:0-3
>>> rank 6=node01 slot=p1:0-3
>>> rank 7=node01 slot=p1:0-3
>>> which fails too.
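[For reference, a reproduction along the lines described might look like the sketch below. The program name `./hello_mpi` and the rankfile path are placeholders; the rankfile contents are the failing two-rank case from the report.]

```shell
# Write the two-rank rankfile that reportedly triggers the segfault.
cat > myrankfile <<'EOF'
rank 0=node01 slot=0-3
rank 1=node01 slot=0-3
EOF

# Launch with the rankfile and verbose paffinity output; guarded so the
# sketch is harmless where Open MPI is not installed.
if command -v mpirun >/dev/null 2>&1; then
    mpirun -np 2 -rf myrankfile --mca paffinity_base_verbose 5 ./hello_mpi
fi
```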
>>>
>>> This happens with a test code that contains only two lines of code, calling mpi_init and mpi_finalize in succession, and it happens in both Fortran and C.
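[The minimal test program described above, written out in C; it contains nothing but the init and finalize calls. It needs an MPI installation (mpicc) to build, so it cannot be verified standalone here.]

```c
/* Minimal reproducer as described: nothing but MPI_Init and MPI_Finalize. */
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);   /* the segfault reportedly occurs in here */
    MPI_Finalize();
    return 0;
}
```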
>>>
>>> One more interesting thing is that the problem with setting the process affinity does not occur on our four-processor quad-core Opteron nodes, with exactly the same OS etc.
>>>
>>>
>>> Setting "--mca paffinity_base_verbose 5" shows what is going wrong for this rankfile:
>>> rank 0=node01 slot=0-3
>>> rank 1=node01 slot=0-3
>>> ------------- WRONG -----------------
>>> [node01:23174] mca:base:select:(paffinity) Querying component [linux]
>>> [node01:23174] mca:base:select:(paffinity) Query of component [linux] set priority to 10
>>> [node01:23174] mca:base:select:(paffinity) Selected component [linux]
>>> [node01:23174] paffinity slot assignment: slot_list == 0-3
>>> [node01:23174] paffinity slot assignment: rank 0 runs on cpu #0 (#0)
>>> [node01:23174] paffinity slot assignment: rank 0 runs on cpu #1 (#1)
>>> [node01:23174] paffinity slot assignment: rank 0 runs on cpu #2 (#2)
>>> [node01:23174] paffinity slot assignment: rank 0 runs on cpu #3 (#3)
>>> [node01:23174] paffinity slot assignment: slot_list == 0-3
>>> [node01:23174] paffinity slot assignment: rank 1 runs on cpu #0 (#0)
>>> [node01:23174] paffinity slot assignment: rank 1 runs on cpu #1 (#1)
>>> [node01:23174] paffinity slot assignment: rank 1 runs on cpu #2 (#2)
>>> [node01:23174] paffinity slot assignment: rank 1 runs on cpu #3 (#3)
>>> [node01:23175] mca:base:select:(paffinity) Querying component [linux]
>>> [node01:23175] mca:base:select:(paffinity) Query of component [linux] set priority to 10
>>> [node01:23175] mca:base:select:(paffinity) Selected component [linux]
>>> [node01:23176] mca:base:select:(paffinity) Querying component [linux]
>>> [node01:23176] mca:base:select:(paffinity) Query of component [linux] set priority to 10
>>> [node01:23176] mca:base:select:(paffinity) Selected component [linux]
>>> [node01:23175] paffinity slot assignment: slot_list == 0-3
>>> [node01:23175] paffinity slot assignment: rank 0 runs on cpu #0 (#0)
>>> [node01:23175] paffinity slot assignment: rank 0 runs on cpu #1 (#1)
>>> [node01:23175] paffinity slot assignment: rank 0 runs on cpu #2 (#2)
>>> [node01:23175] paffinity slot assignment: rank 0 runs on cpu #3 (#3)
>>> [node01:23176] paffinity slot assignment: slot_list == 0-3
>>> [node01:23176] paffinity slot assignment: rank 1 runs on cpu #0 (#0)
>>> [node01:23176] paffinity slot assignment: rank 1 runs on cpu #1 (#1)
>>> [node01:23176] paffinity slot assignment: rank 1 runs on cpu #2 (#2)
>>> [node01:23176] paffinity slot assignment: rank 1 runs on cpu #3 (#3)
>>> [node01:23175] *** Process received signal ***
>>> [node01:23176] *** Process received signal ***
>>> [node01:23175] Signal: Segmentation fault (11)
>>> [node01:23175] Signal code: Address not mapped (1)
>>> [node01:23175] Failing at address: 0x30
>>> [node01:23176] Signal: Segmentation fault (11)
>>> [node01:23176] Signal code: Address not mapped (1)
>>> [node01:23176] Failing at address: 0x30
>>> ------------- WRONG -----------------
>>>
>>> ------------- RIGHT -----------------
>>> [node25:23241] mca:base:select:(paffinity) Querying component [linux]
>>> [node25:23241] mca:base:select:(paffinity) Query of component [linux] set priority to 10
>>> [node25:23241] mca:base:select:(paffinity) Selected component [linux]
>>> [node25:23241] paffinity slot assignment: slot_list == 0-3
>>> [node25:23241] paffinity slot assignment: rank 0 runs on cpu #0 (#0)
>>> [node25:23241] paffinity slot assignment: rank 0 runs on cpu #1 (#1)
>>> [node25:23241] paffinity slot assignment: rank 0 runs on cpu #2 (#2)
>>> [node25:23241] paffinity slot assignment: rank 0 runs on cpu #3 (#3)
>>> [node25:23241] paffinity slot assignment: slot_list == 0-3
>>> [node25:23241] paffinity slot assignment: rank 1 runs on cpu #0 (#0)
>>> [node25:23241] paffinity slot assignment: rank 1 runs on cpu #1 (#1)
>>> [node25:23241] paffinity slot assignment: rank 1 runs on cpu #2 (#2)
>>> [node25:23241] paffinity slot assignment: rank 1 runs on cpu #3 (#3)
>>> [node25:23242] mca:base:select:(paffinity) Querying component [linux]
>>> [node25:23242] mca:base:select:(paffinity) Query of component [linux] set priority to 10
>>> [node25:23242] mca:base:select:(paffinity) Selected component [linux]
>>> [node25:23243] mca:base:select:(paffinity) Querying component [linux]
>>> [node25:23243] mca:base:select:(paffinity) Query of component [linux] set priority to 10
>>> [node25:23243] mca:base:select:(paffinity) Selected component [linux]
>>> ------------- RIGHT -----------------
>>>
>>> Apparently, only the master process (IDs [node01:23174] and [node25:23241]) sets the paffinity in the RIGHT case, but in the WRONG case the compute processes ([node01:23175] and [node01:23176], rank 0 and rank 1) also try to set their own paffinity properties.
>>>
>>>
>>>
>>> Note that the following rankfile notation does not work either. But that seems to have a different origin, as it tries to bind to core #4, whereas there are only cores 0-3.
>>> rank 0=node01 slot=0:*
>>> rank 1=node01 slot=0:*
>>>
>>>
>>> Thanks for your help on this!
>>>
>>> --
>>> Daan van Rossum
>>> _______________________________________________
>>> devel mailing list
>>> devel_at_[hidden]
>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>
>>
>
> --
> Daan van Rossum
>
> University of Chicago
> Department of Astronomy and Astrophysics
> 5640 S. Ellis Ave
> Chicago, IL 60637
> phone: 773-7020624