
Open MPI Development Mailing List Archives


Subject: Re: [OMPI devel] SEGFAULT in mpi_init from paffinity with intel 11.1.059 compiler
From: Daan van Rossum (daan_at_[hidden])
Date: 2009-12-16 11:24:20


Sure. Processors were scaled down to 1000 MHz while idling.
(I hope this shows up as an attachment instead of inlined...)
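The clock-speed observation above can be verified on any Linux box from standard procfs entries (a generic sketch, not commands from the thread):

```shell
# Count logical CPUs, then show the kernel's current per-core clock readings
# (the "cpu MHz" lines drop when cpufreq scales idle cores down).
grep -c '^processor' /proc/cpuinfo
grep 'cpu MHz' /proc/cpuinfo
```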
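For reference, the two-line reproducer described further down the thread would look roughly like this in C (a sketch: only the mpi_init/mpi_finalize pair is from the original report, the surrounding boilerplate is assumed):

```c
/* Minimal reproducer sketch: the thread reports that a program doing
 * nothing but MPI_Init and MPI_Finalize already segfaults when launched
 * with one of the failing rankfiles under the Intel-built OMPI 1.3.4. */
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);   /* reported crash site (during mpi_init) */
    MPI_Finalize();
    return 0;
}
```

Per the thread this would be launched with a rankfile such as `rank 0=node01 slot=0-3`; the exact mpirun invocation used is not shown in the messages.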

* on Wednesday, 16.12.09 at 18:12, Lenny Verkhovsky <lenny.verkhovsky_at_[hidden]> wrote:

> Hi,
> can you provide $cat /proc/cpuinfo
> I am not optimistic that it will help, but still...
> thanks
> Lenny.
>
> On Wed, Dec 16, 2009 at 6:01 PM, Daan van Rossum <daan_at_[hidden]>wrote:
>
> > Hi Terry,
> >
> > Thanks for your hint. I tried configure --enable-debug and even compiled it
> > with all kinds of manual debug flags turned on, but it doesn't help to get
> > rid of this problem. So it is definitely not an optimization flaw.
> > One more interesting test would be to try an older version of the Intel
> > compiler. But the next older version that I have is 10.0.015, which is too
> > old for the operating system (must be >10.1).
> >
> >
> > A good thing is that this bug is very easy to test. You only need one line
> > of MPI code and one process in the execution.
> >
> > A few more test cases:
> > rank 0=node01 slot=1-7
> > and
> > rank 0=node01 slot=0,2-7
> > and
> > rank 0=node01 slot=0-1,3-7
> > work WELL.
> > But
> > rank 0=node01 slot=0-2,4-7
> > FAILS.
> >
> > As long as either slot 0, 1, or 2 is excluded from the list, it's all right.
> > Excluding a different slot, such as slot 3, does not help.
> >
> >
> > I'll try to get hold of an Intel v10.1 compiler version.
> >
> > Best,
> > Daan
> >
> > * on Monday, 14.12.09 at 14:57, Terry Dontje <Terry.Dontje_at_[hidden]> wrote:
> >
> > > I don't really want to throw FUD on this list, but we've seen all
> > > sorts of oddities with OMPI 1.3.4 built with Intel's 11.1
> > > compiler versus their 11.0 or other compilers (gcc, Sun Studio, PGI,
> > > and PathScale). I have not tested your specific failing case, but
> > > considering your issue doesn't show up with gcc, I am wondering if
> > > there is some sort of optimization issue with the 11.1 compiler.
> > >
> > > It might be interesting to see if using certain optimization levels
> > > with the Intel 11.1 compiler produces a working OMPI library.
> > >
> > > --td
> > >
> > > Daan van Rossum wrote:
> > > >Hi Ralph,
> > > >
> > > >I took the Dec 10th snapshot, but got exactly the same behavior as with
> > version 1.3.4.
> > > >
> > > >I just noticed that even this rankfile doesn't work, with a single
> > process:
> > > > rank 0=node01 slot=0-3
> > > >
> > > >------------
> > > >[node01:31105] mca:base:select:(paffinity) Querying component [linux]
> > > >[node01:31105] mca:base:select:(paffinity) Query of component [linux]
> > set priority to 10
> > > >[node01:31105] mca:base:select:(paffinity) Selected component [linux]
> > > >[node01:31105] paffinity slot assignment: slot_list == 0-3
> > > >[node01:31105] paffinity slot assignment: rank 0 runs on cpu #0 (#0)
> > > >[node01:31105] paffinity slot assignment: rank 0 runs on cpu #1 (#1)
> > > >[node01:31105] paffinity slot assignment: rank 0 runs on cpu #2 (#2)
> > > >[node01:31105] paffinity slot assignment: rank 0 runs on cpu #3 (#3)
> > > >[node01:31106] mca:base:select:(paffinity) Querying component [linux]
> > > >[node01:31106] mca:base:select:(paffinity) Query of component [linux]
> > set priority to 10
> > > >[node01:31106] mca:base:select:(paffinity) Selected component [linux]
> > > >[node01:31106] paffinity slot assignment: slot_list == 0-3
> > > >[node01:31106] paffinity slot assignment: rank 0 runs on cpu #0 (#0)
> > > >[node01:31106] paffinity slot assignment: rank 0 runs on cpu #1 (#1)
> > > >[node01:31106] paffinity slot assignment: rank 0 runs on cpu #2 (#2)
> > > >[node01:31106] paffinity slot assignment: rank 0 runs on cpu #3 (#3)
> > > >[node01:31106] *** An error occurred in MPI_Comm_rank
> > > >[node01:31106] *** on a NULL communicator
> > > >[node01:31106] *** Unknown error
> > > >[node01:31106] *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
> > > >forrtl: severe (174): SIGSEGV, segmentation fault occurred
> > > >------------
> > > >
> > > >The spawned compute process doesn't sense that it should skip
> > setting the paffinity...
> > > >
> > > >
> > > >I saw the posting from last July about a similar problem (the problem
> > that I mentioned at the bottom, with the slot=0:* notation not working). But
> > that is a different problem (besides, it still does not seem to be
> > working).
> > > >
> > > >Best,
> > > >Daan
> > > >
> > > >* on Saturday, 12.12.09 at 18:48, Ralph Castain <rhc_at_[hidden]>
> > wrote:
> > > >
> > > >>This looks like an uninitialized variable that GNU C handles one way
> > and Intel another. Someone recently contributed a patch to the ompi trunk to
> > fix just such a thing in this code area - don't know if it addresses this
> > problem or not.
> > > >>
> > > >>Can you try the ompi trunk (a nightly tarball from the last day or so
> > forward) and see if this still occurs?
> > > >>
> > > >>Thanks
> > > >>Ralph
> > > >>
> > > >>On Dec 11, 2009, at 4:06 PM, Daan van Rossum wrote:
> > > >>
> > > >>>Hi all,
> > > >>>
> > > >>>There's a problem with ompi 1.3.4 when compiled with the Intel
> > 11.1.059 C compiler, related to the built-in processor binding
> > functionality. The problem does not occur when ompi is compiled with the
> > GNU C compiler.
> > > >>>
> > > >>>An MPI program fails with a segfault in mpi_init() when the
> > following rankfile is used:
> > > >>>rank 0=node01 slot=0-3
> > > >>>rank 1=node01 slot=0-3
> > > >>>but runs fine with:
> > > >>>rank 0=node01 slot=0
> > > >>>rank 1=node01 slot=1-3
> > > >>>and fine with:
> > > >>>rank 0=node01 slot=0-1
> > > >>>rank 1=node01 slot=1-3
> > > >>>but segfaults with:
> > > >>>rank 0=node01 slot=0-2
> > > >>>rank 1=node01 slot=1-3
> > > >>>
> > > >>>This is on a two-processor quad-core Opteron machine (occurs on all
> > nodes of the cluster) with Ubuntu 8.10, kernel 2.6.27-16.
> > > >>>This is the simplest case that fails. Generally, I would like to bind
> > processes to physical processors but always allow any core, like
> > > >>>rank 0=node01 slot=p0:0-3
> > > >>>rank 1=node01 slot=p0:0-3
> > > >>>rank 2=node01 slot=p0:0-3
> > > >>>rank 3=node01 slot=p0:0-3
> > > >>>rank 4=node01 slot=p1:0-3
> > > >>>rank 5=node01 slot=p1:0-3
> > > >>>rank 6=node01 slot=p1:0-3
> > > >>>rank 7=node01 slot=p1:0-3
> > > >>>which fails too.
> > > >>>
> > > >>>This happens with a test code that contains only two lines, calling
> > mpi_init and then mpi_finalize, and it happens in both Fortran
> > and C.
> > > >>>
> > > >>>One more interesting thing is that the problem with setting the
> > process affinity does not occur on our four-processor quad-core Opteron
> > nodes, with exactly the same OS etc.
> > > >>>
> > > >>>
> > > >>>Setting "--mca paffinity_base_verbose 5" shows what is going wrong for
> > this rankfile:
> > > >>>rank 0=node01 slot=0-3
> > > >>>rank 1=node01 slot=0-3
> > > >>>------------- WRONG -----------------
> > > >>>[node01:23174] mca:base:select:(paffinity) Querying component [linux]
> > > >>>[node01:23174] mca:base:select:(paffinity) Query of component [linux]
> > set priority to 10
> > > >>>[node01:23174] mca:base:select:(paffinity) Selected component [linux]
> > > >>>[node01:23174] paffinity slot assignment: slot_list == 0-3
> > > >>>[node01:23174] paffinity slot assignment: rank 0 runs on cpu #0 (#0)
> > > >>>[node01:23174] paffinity slot assignment: rank 0 runs on cpu #1 (#1)
> > > >>>[node01:23174] paffinity slot assignment: rank 0 runs on cpu #2 (#2)
> > > >>>[node01:23174] paffinity slot assignment: rank 0 runs on cpu #3 (#3)
> > > >>>[node01:23174] paffinity slot assignment: slot_list == 0-3
> > > >>>[node01:23174] paffinity slot assignment: rank 1 runs on cpu #0 (#0)
> > > >>>[node01:23174] paffinity slot assignment: rank 1 runs on cpu #1 (#1)
> > > >>>[node01:23174] paffinity slot assignment: rank 1 runs on cpu #2 (#2)
> > > >>>[node01:23174] paffinity slot assignment: rank 1 runs on cpu #3 (#3)
> > > >>>[node01:23175] mca:base:select:(paffinity) Querying component [linux]
> > > >>>[node01:23175] mca:base:select:(paffinity) Query of component [linux]
> > set priority to 10
> > > >>>[node01:23175] mca:base:select:(paffinity) Selected component [linux]
> > > >>>[node01:23176] mca:base:select:(paffinity) Querying component [linux]
> > > >>>[node01:23176] mca:base:select:(paffinity) Query of component [linux]
> > set priority to 10
> > > >>>[node01:23176] mca:base:select:(paffinity) Selected component [linux]
> > > >>>[node01:23175] paffinity slot assignment: slot_list == 0-3
> > > >>>[node01:23175] paffinity slot assignment: rank 0 runs on cpu #0 (#0)
> > > >>>[node01:23175] paffinity slot assignment: rank 0 runs on cpu #1 (#1)
> > > >>>[node01:23175] paffinity slot assignment: rank 0 runs on cpu #2 (#2)
> > > >>>[node01:23175] paffinity slot assignment: rank 0 runs on cpu #3 (#3)
> > > >>>[node01:23176] paffinity slot assignment: slot_list == 0-3
> > > >>>[node01:23176] paffinity slot assignment: rank 1 runs on cpu #0 (#0)
> > > >>>[node01:23176] paffinity slot assignment: rank 1 runs on cpu #1 (#1)
> > > >>>[node01:23176] paffinity slot assignment: rank 1 runs on cpu #2 (#2)
> > > >>>[node01:23176] paffinity slot assignment: rank 1 runs on cpu #3 (#3)
> > > >>>[node01:23175] *** Process received signal ***
> > > >>>[node01:23176] *** Process received signal ***
> > > >>>[node01:23175] Signal: Segmentation fault (11)
> > > >>>[node01:23175] Signal code: Address not mapped (1)
> > > >>>[node01:23175] Failing at address: 0x30
> > > >>>[node01:23176] Signal: Segmentation fault (11)
> > > >>>[node01:23176] Signal code: Address not mapped (1)
> > > >>>[node01:23176] Failing at address: 0x30
> > > >>>------------- WRONG -----------------
> > > >>>
> > > >>>------------- RIGHT -----------------
> > > >>>[node25:23241] mca:base:select:(paffinity) Querying component [linux]
> > > >>>[node25:23241] mca:base:select:(paffinity) Query of component [linux]
> > set priority to 10
> > > >>>[node25:23241] mca:base:select:(paffinity) Selected component [linux]
> > > >>>[node25:23241] paffinity slot assignment: slot_list == 0-3
> > > >>>[node25:23241] paffinity slot assignment: rank 0 runs on cpu #0 (#0)
> > > >>>[node25:23241] paffinity slot assignment: rank 0 runs on cpu #1 (#1)
> > > >>>[node25:23241] paffinity slot assignment: rank 0 runs on cpu #2 (#2)
> > > >>>[node25:23241] paffinity slot assignment: rank 0 runs on cpu #3 (#3)
> > > >>>[node25:23241] paffinity slot assignment: slot_list == 0-3
> > > >>>[node25:23241] paffinity slot assignment: rank 1 runs on cpu #0 (#0)
> > > >>>[node25:23241] paffinity slot assignment: rank 1 runs on cpu #1 (#1)
> > > >>>[node25:23241] paffinity slot assignment: rank 1 runs on cpu #2 (#2)
> > > >>>[node25:23241] paffinity slot assignment: rank 1 runs on cpu #3 (#3)
> > > >>>[node25:23242] mca:base:select:(paffinity) Querying component [linux]
> > > >>>[node25:23242] mca:base:select:(paffinity) Query of component [linux]
> > set priority to 10
> > > >>>[node25:23242] mca:base:select:(paffinity) Selected component [linux]
> > > >>>[node25:23243] mca:base:select:(paffinity) Querying component [linux]
> > > >>>[node25:23243] mca:base:select:(paffinity) Query of component [linux]
> > set priority to 10
> > > >>>[node25:23243] mca:base:select:(paffinity) Selected component [linux]
> > > >>>------------- RIGHT -----------------
> > > >>>
> > > >>>Apparently, only the master process (ID [node01:23174] and
> > [node25:23241]) sets the paffinity in the RIGHT case, but in the WRONG case
> > the compute processes ([node01:23175] and [node01:23176], rank 0 and
> > rank 1) also try to set their own paffinity properties.
> > > >>>
> > > >>>
> > > >>>
> > > >>>Note that the following rankfile notation does not work either. But that
> > seems to have a different origin, as it tries to bind to core #4, whereas
> > there are just cores 0-3.
> > > >>>rank 0=node01 slot=0:*
> > > >>>rank 1=node01 slot=0:*
> > > >>>
> > > >>>
> > > >>>Thanks for your help on this!
> > > >>>
> > > >>>--
> > > >>>Daan van Rossum
> > > >>>_______________________________________________
> > > >>>devel mailing list
> > > >>>devel_at_[hidden]
> > > >>>http://www.open-mpi.org/mailman/listinfo.cgi/devel
> > > >
> > > >--
> > > >Daan van Rossum
> > > >
> > > >University of Chicago
> > > >Department of Astronomy and Astrophysics
> > > >5640 S. Ellis Ave
> > > >Chicago, IL 60637
> > > >phone: 773-7020624
> > >
> >
> >


--
Daan van Rossum
University of Chicago
Department of Astronomy and Astrophysics
5640 S. Ellis Ave
Chicago, IL 60637
phone: 773-7020624