
Open MPI Development Mailing List Archives


Subject: Re: [OMPI devel] SEGFAULT in mpi_init from paffinity with intel 11.1.059 compiler
From: Lenny Verkhovsky (lenny.verkhovsky_at_[hidden])
Date: 2009-12-16 11:12:52


Hi,
Can you provide the output of "cat /proc/cpuinfo"?
I am not optimistic that it will help, but still...
Thanks,
Lenny.

On Wed, Dec 16, 2009 at 6:01 PM, Daan van Rossum <daan_at_[hidden]> wrote:

> Hi Terry,
>
> Thanks for your hint. I tried configure --enable-debug and even compiled it
> with all kinds of manual debug flags turned on, but it doesn't help to get
> rid of this problem. So it definitely is not an optimization flaw.
> One more interesting test would be to try an older version of the Intel
> compiler. But the next older version that I have is 10.0.015, which is too
> old for the operating system (must be >10.1).
>
>
> A good thing is that this bug is very easy to test: you only need one line
> of MPI code and a single process in the execution.
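>
> For reference, the C version of the test is essentially nothing more than
> the following (the file name is mine; compile with something like
> "mpicc mpitest.c -o mpitest" and launch one process with the rankfile):
>
> /* mpitest.c - segfaults in MPI_Init() with the failing rankfiles */
> #include <mpi.h>
>
> int main(int argc, char **argv)
> {
>     MPI_Init(&argc, &argv);
>     MPI_Finalize();
>     return 0;
> }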
>
> A few more test cases:
> rank 0=node01 slot=1-7
> and
> rank 0=node01 slot=0,2-7
> and
> rank 0=node01 slot=0-1,3-7
> work WELL.
> But
> rank 0=node01 slot=0-2,4-7
> FAILS.
>
> As long as either slot 0, 1, OR 2 is excluded from the list, it's all right.
> Excluding a different slot, like slot 3, does not help.
>
>
> I'll try to get hold of an Intel v10.1 compiler version.
>
> Best,
> Daan
>
> * on Monday, 14.12.09 at 14:57, Terry Dontje <Terry.Dontje_at_[hidden]> wrote:
>
> > I don't really want to throw FUD on this list, but we've seen all
> > sorts of oddities with OMPI 1.3.4 being built with Intel's 11.1
> > compiler versus their 11.0 or other compilers (gcc, Sun Studio, PGI,
> > and PathScale). I have not tested your specific failing case, but
> > considering your issue doesn't show up with gcc, I am wondering if
> > there is some sort of optimization issue with the 11.1 compiler.
> >
> > It might be interesting to see if using certain optimization levels
> > with the Intel 11.1 compiler produces a working OMPI library.
> >
> > --td
> >
> > Daan van Rossum wrote:
> > >Hi Ralph,
> > >
> > >I took the Dec 10th snapshot, but got exactly the same behavior as with
> version 1.3.4.
> > >
> > >I just noticed that even this rankfile doesn't work, with a single
> process:
> > > rank 0=node01 slot=0-3
> > >
> > >------------
> > >[node01:31105] mca:base:select:(paffinity) Querying component [linux]
> > >[node01:31105] mca:base:select:(paffinity) Query of component [linux]
> set priority to 10
> > >[node01:31105] mca:base:select:(paffinity) Selected component [linux]
> > >[node01:31105] paffinity slot assignment: slot_list == 0-3
> > >[node01:31105] paffinity slot assignment: rank 0 runs on cpu #0 (#0)
> > >[node01:31105] paffinity slot assignment: rank 0 runs on cpu #1 (#1)
> > >[node01:31105] paffinity slot assignment: rank 0 runs on cpu #2 (#2)
> > >[node01:31105] paffinity slot assignment: rank 0 runs on cpu #3 (#3)
> > >[node01:31106] mca:base:select:(paffinity) Querying component [linux]
> > >[node01:31106] mca:base:select:(paffinity) Query of component [linux]
> set priority to 10
> > >[node01:31106] mca:base:select:(paffinity) Selected component [linux]
> > >[node01:31106] paffinity slot assignment: slot_list == 0-3
> > >[node01:31106] paffinity slot assignment: rank 0 runs on cpu #0 (#0)
> > >[node01:31106] paffinity slot assignment: rank 0 runs on cpu #1 (#1)
> > >[node01:31106] paffinity slot assignment: rank 0 runs on cpu #2 (#2)
> > >[node01:31106] paffinity slot assignment: rank 0 runs on cpu #3 (#3)
> > >[node01:31106] *** An error occurred in MPI_Comm_rank
> > >[node01:31106] *** on a NULL communicator
> > >[node01:31106] *** Unknown error
> > >[node01:31106] *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
> > >forrtl: severe (174): SIGSEGV, segmentation fault occurred
> > >------------
> > >
> > >The spawned compute process doesn't sense that it should skip setting
> the paffinity...
> > >
> > >
> > >I saw the posting from last July about a similar problem (the problem
> that I mentioned at the bottom, with the slot=0:* notation not working). But
> that is a different problem (besides, it still does not seem to work).
> > >
> > >Best,
> > >Daan
> > >
> > >* on Saturday, 12.12.09 at 18:48, Ralph Castain <rhc_at_[hidden]>
> wrote:
> > >
> > >>This looks like an uninitialized variable that GNU C handles one way
> and Intel another. Someone recently contributed a patch to the OMPI trunk to
> fix just such a thing in this code area - I don't know if it addresses this
> problem or not.
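> > >>
> > >>Purely as an illustrative sketch (not the actual OMPI code), the class
> > >>of bug I mean is something like this - reading the uninitialized value
> > >>is undefined behavior, so gcc and icc can legitimately disagree:
> > >>
> > >>#include <stdio.h>
> > >>
> > >>static int pick_slot(int have_list)
> > >>{
> > >>    int slot;                 /* never initialized */
> > >>    if (have_list)
> > >>        slot = 0;
> > >>    return slot;              /* garbage when have_list == 0 */
> > >>}
> > >>
> > >>int main(void)
> > >>{
> > >>    /* the printed value can differ by compiler and optimization level */
> > >>    printf("slot = %d\n", pick_slot(0));
> > >>    return 0;
> > >>}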
> > >>
> > >>Can you try the ompi trunk (a nightly tarball from the last day or so
> forward) and see if this still occurs?
> > >>
> > >>Thanks
> > >>Ralph
> > >>
> > >>On Dec 11, 2009, at 4:06 PM, Daan van Rossum wrote:
> > >>
> > >>>Hi all,
> > >>>
> > >>>There's a problem with OMPI 1.3.4 when compiled with the Intel
> 11.1.059 C compiler, related to the built-in processor binding
> functionality. The problem does not occur when OMPI is compiled with the
> GNU C compiler.
> > >>>
> > >>>An MPI program fails (segfault) in mpi_init() when the
> following rankfile is used:
> > >>>rank 0=node01 slot=0-3
> > >>>rank 1=node01 slot=0-3
> > >>>but runs fine with:
> > >>>rank 0=node01 slot=0
> > >>>rank 1=node01 slot=1-3
> > >>>and fine with:
> > >>>rank 0=node01 slot=0-1
> > >>>rank 1=node01 slot=1-3
> > >>>but segfaults with:
> > >>>rank 0=node01 slot=0-2
> > >>>rank 1=node01 slot=1-3
> > >>>
> > >>>This is on a two-processor quad-core Opteron machine (occurs on all
> nodes of the cluster) with Ubuntu 8.10, kernel 2.6.27-16.
> > >>>This is the simplest case that fails. Generally, I would like to bind
> processes to physical processors but always allow any core, like
> > >>>rank 0=node01 slot=p0:0-3
> > >>>rank 1=node01 slot=p0:0-3
> > >>>rank 2=node01 slot=p0:0-3
> > >>>rank 3=node01 slot=p0:0-3
> > >>>rank 4=node01 slot=p1:0-3
> > >>>rank 5=node01 slot=p1:0-3
> > >>>rank 6=node01 slot=p1:0-3
> > >>>rank 7=node01 slot=p1:0-3
> > >>>which fails too.
> > >>>
> > >>>This happens with a test code that contains only two lines of code,
> calling mpi_init and mpi_finalize in succession, and happens in both Fortran
> and C.
> > >>>
> > >>>One more interesting thing is that the problem with setting the
> process affinity does not occur on our four-processor quad-core Opteron
> nodes, with exactly the same OS etc.
> > >>>
> > >>>
> > >>>Setting "--mca paffinity_base_verbose 5" shows what is going wrong for
> this rankfile:
> > >>>rank 0=node01 slot=0-3
> > >>>rank 1=node01 slot=0-3
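> > >>>
> > >>>(The output below came from an invocation along these lines; the
> > >>>rankfile and binary names here are placeholders:
> > >>>  mpirun --mca paffinity_base_verbose 5 -rf rankfile -np 2 ./mpitest )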
> > >>>------------- WRONG -----------------
> > >>>[node01:23174] mca:base:select:(paffinity) Querying component [linux]
> > >>>[node01:23174] mca:base:select:(paffinity) Query of component [linux]
> set priority to 10
> > >>>[node01:23174] mca:base:select:(paffinity) Selected component [linux]
> > >>>[node01:23174] paffinity slot assignment: slot_list == 0-3
> > >>>[node01:23174] paffinity slot assignment: rank 0 runs on cpu #0 (#0)
> > >>>[node01:23174] paffinity slot assignment: rank 0 runs on cpu #1 (#1)
> > >>>[node01:23174] paffinity slot assignment: rank 0 runs on cpu #2 (#2)
> > >>>[node01:23174] paffinity slot assignment: rank 0 runs on cpu #3 (#3)
> > >>>[node01:23174] paffinity slot assignment: slot_list == 0-3
> > >>>[node01:23174] paffinity slot assignment: rank 1 runs on cpu #0 (#0)
> > >>>[node01:23174] paffinity slot assignment: rank 1 runs on cpu #1 (#1)
> > >>>[node01:23174] paffinity slot assignment: rank 1 runs on cpu #2 (#2)
> > >>>[node01:23174] paffinity slot assignment: rank 1 runs on cpu #3 (#3)
> > >>>[node01:23175] mca:base:select:(paffinity) Querying component [linux]
> > >>>[node01:23175] mca:base:select:(paffinity) Query of component [linux]
> set priority to 10
> > >>>[node01:23175] mca:base:select:(paffinity) Selected component [linux]
> > >>>[node01:23176] mca:base:select:(paffinity) Querying component [linux]
> > >>>[node01:23176] mca:base:select:(paffinity) Query of component [linux]
> set priority to 10
> > >>>[node01:23176] mca:base:select:(paffinity) Selected component [linux]
> > >>>[node01:23175] paffinity slot assignment: slot_list == 0-3
> > >>>[node01:23175] paffinity slot assignment: rank 0 runs on cpu #0 (#0)
> > >>>[node01:23175] paffinity slot assignment: rank 0 runs on cpu #1 (#1)
> > >>>[node01:23175] paffinity slot assignment: rank 0 runs on cpu #2 (#2)
> > >>>[node01:23175] paffinity slot assignment: rank 0 runs on cpu #3 (#3)
> > >>>[node01:23176] paffinity slot assignment: slot_list == 0-3
> > >>>[node01:23176] paffinity slot assignment: rank 1 runs on cpu #0 (#0)
> > >>>[node01:23176] paffinity slot assignment: rank 1 runs on cpu #1 (#1)
> > >>>[node01:23176] paffinity slot assignment: rank 1 runs on cpu #2 (#2)
> > >>>[node01:23176] paffinity slot assignment: rank 1 runs on cpu #3 (#3)
> > >>>[node01:23175] *** Process received signal ***
> > >>>[node01:23176] *** Process received signal ***
> > >>>[node01:23175] Signal: Segmentation fault (11)
> > >>>[node01:23175] Signal code: Address not mapped (1)
> > >>>[node01:23175] Failing at address: 0x30
> > >>>[node01:23176] Signal: Segmentation fault (11)
> > >>>[node01:23176] Signal code: Address not mapped (1)
> > >>>[node01:23176] Failing at address: 0x30
> > >>>------------- WRONG -----------------
> > >>>
> > >>>------------- RIGHT -----------------
> > >>>[node25:23241] mca:base:select:(paffinity) Querying component [linux]
> > >>>[node25:23241] mca:base:select:(paffinity) Query of component [linux]
> set priority to 10
> > >>>[node25:23241] mca:base:select:(paffinity) Selected component [linux]
> > >>>[node25:23241] paffinity slot assignment: slot_list == 0-3
> > >>>[node25:23241] paffinity slot assignment: rank 0 runs on cpu #0 (#0)
> > >>>[node25:23241] paffinity slot assignment: rank 0 runs on cpu #1 (#1)
> > >>>[node25:23241] paffinity slot assignment: rank 0 runs on cpu #2 (#2)
> > >>>[node25:23241] paffinity slot assignment: rank 0 runs on cpu #3 (#3)
> > >>>[node25:23241] paffinity slot assignment: slot_list == 0-3
> > >>>[node25:23241] paffinity slot assignment: rank 1 runs on cpu #0 (#0)
> > >>>[node25:23241] paffinity slot assignment: rank 1 runs on cpu #1 (#1)
> > >>>[node25:23241] paffinity slot assignment: rank 1 runs on cpu #2 (#2)
> > >>>[node25:23241] paffinity slot assignment: rank 1 runs on cpu #3 (#3)
> > >>>[node25:23242] mca:base:select:(paffinity) Querying component [linux]
> > >>>[node25:23242] mca:base:select:(paffinity) Query of component [linux]
> set priority to 10
> > >>>[node25:23242] mca:base:select:(paffinity) Selected component [linux]
> > >>>[node25:23243] mca:base:select:(paffinity) Querying component [linux]
> > >>>[node25:23243] mca:base:select:(paffinity) Query of component [linux]
> set priority to 10
> > >>>[node25:23243] mca:base:select:(paffinity) Selected component [linux]
> > >>>------------- RIGHT -----------------
> > >>>
> > >>>Apparently, only the master process ([node01:23174] and
> [node25:23241]) sets the paffinity in the RIGHT case, but in the WRONG case
> the compute processes ([node01:23175] and [node01:23176], rank 0 and
> rank 1) also try to set their own paffinity properties.
> > >>>
> > >>>
> > >>>
> > >>>Note that the slot=0:* notation below also does not work for the
> rankfile. But that seems to have a different origin, as it tries to bind to
> core #4, whereas there are just cores 0-3.
> > >>>rank 0=node01 slot=0:*
> > >>>rank 1=node01 slot=0:*
> > >>>
> > >>>
> > >>>Thanks for your help on this!
> > >>>
> > >>>--
> > >>>Daan van Rossum
> > >
>
> --
> Daan van Rossum
>
> University of Chicago
> Department of Astronomy and Astrophysics
> 5640 S. Ellis Ave
> Chicago, IL 60637
> phone: 773-7020624
> _______________________________________________
> devel mailing list
> devel_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>