* on Monday, 14.12.09 at 14:57, Terry Dontje <Terry.Dontje@Sun.COM> wrote:
> I don't really want to throw FUD on this list, but we've seen all
> sorts of oddities with OMPI 1.3.4 being built with Intel's 11.1
> compiler versus their 11.0 or other compilers (gcc, Sun Studio, PGI,
> and PathScale). I have not tested your specific failing case, but
> considering your issue doesn't show up with gcc, I am wondering if
> there is some sort of optimization issue with the 11.1 compiler.
>
> It might be interesting to see if using certain optimization levels
> with the Intel 11.1 compiler produces a working OMPI library.
>
> --td
>
> Daan van Rossum wrote:
> >Hi Ralph,
> >
> >I took the Dec 10th snapshot, but got exactly the same behavior as with version 1.3.4.
> >
> >I just noticed that even this rankfile doesn't work, with a single process:
> > rank 0=node01 slot=0-3
> >
> >------------
> >[node01:31105] mca:base:select:(paffinity) Querying component [linux]
> >[node01:31105] mca:base:select:(paffinity) Query of component [linux] set priority to 10
> >[node01:31105] mca:base:select:(paffinity) Selected component [linux]
> >[node01:31105] paffinity slot assignment: slot_list == 0-3
> >[node01:31105] paffinity slot assignment: rank 0 runs on cpu #0 (#0)
> >[node01:31105] paffinity slot assignment: rank 0 runs on cpu #1 (#1)
> >[node01:31105] paffinity slot assignment: rank 0 runs on cpu #2 (#2)
> >[node01:31105] paffinity slot assignment: rank 0 runs on cpu #3 (#3)
> >[node01:31106] mca:base:select:(paffinity) Querying component [linux]
> >[node01:31106] mca:base:select:(paffinity) Query of component [linux] set priority to 10
> >[node01:31106] mca:base:select:(paffinity) Selected component [linux]
> >[node01:31106] paffinity slot assignment: slot_list == 0-3
> >[node01:31106] paffinity slot assignment: rank 0 runs on cpu #0 (#0)
> >[node01:31106] paffinity slot assignment: rank 0 runs on cpu #1 (#1)
> >[node01:31106] paffinity slot assignment: rank 0 runs on cpu #2 (#2)
> >[node01:31106] paffinity slot assignment: rank 0 runs on cpu #3 (#3)
> >[node01:31106] *** An error occurred in MPI_Comm_rank
> >[node01:31106] *** on a NULL communicator
> >[node01:31106] *** Unknown error
> >[node01:31106] *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
> >forrtl: severe (174): SIGSEGV, segmentation fault occurred
> >------------
> >
> >The spawned compute process doesn't sense that it should skip setting paffinity...
> >
> >
> >I saw the posting from last July about a similar problem (the problem that I mentioned at the bottom, with the slot=0:* notation not working). But that is a different problem (besides, it still does not seem to work).
> >
> >Best,
> >Daan
> >
> >* on Saturday, 12.12.09 at 18:48, Ralph Castain <rhc@open-mpi.org> wrote:
> >
> >>This looks like an uninitialized variable that GNU C handles one way and Intel another. Someone recently contributed a patch to the ompi trunk to fix just such a thing in this code area - I don't know whether it addresses this problem or not.
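
To illustrate the class of bug described here, a generic sketch follows. It is not the Open MPI code: `binding_info_t`, `binding_info_init`, and `needs_binding` are hypothetical names invented for this example. The point is only that an automatic struct left uninitialized holds indeterminate values, so different compilers or optimization levels may expose different behavior, while zeroing it explicitly removes that dependence.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stand-in for a per-process binding record.
 * This is NOT an Open MPI type; it exists only for illustration. */
typedef struct {
    int  bound;          /* nonzero once an affinity mask has been applied */
    char slot_list[64];  /* textual slot list, e.g. "0-3"; empty if none */
} binding_info_t;

/* Zero every field so that no code path reads an indeterminate value. */
static void binding_info_init(binding_info_t *info)
{
    assert(info != NULL);
    memset(info, 0, sizeof(*info));
}

/* A process has to bind itself only if a slot list was given
 * and nobody has applied the binding yet. */
static int needs_binding(const binding_info_t *info)
{
    assert(info != NULL);
    return !info->bound && info->slot_list[0] != '\0';
}
```

Without the call to `binding_info_init`, the result of `needs_binding` on a stack-allocated record would depend on whatever happened to be in memory, which is the kind of difference that can show up with one compiler and not another.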
> >>
> >>Can you try the ompi trunk (a nightly tarball from the last day or so forward) and see if this still occurs?
> >>
> >>Thanks
> >>Ralph
> >>
> >>On Dec 11, 2009, at 4:06 PM, Daan van Rossum wrote:
> >>
> >>>Hi all,
> >>>
> >>>There's a problem with ompi 1.3.4 when compiled with the Intel 11.1.059 C compiler, related to the built-in processor binding functionality. The problem does not occur when ompi is compiled with the GNU C compiler.
> >>>
> >>>An MPI program execution fails (segfault) on mpi_init() when the following rank file is used:
> >>>rank 0=node01 slot=0-3
> >>>rank 1=node01 slot=0-3
> >>>but runs fine with:
> >>>rank 0=node01 slot=0
> >>>rank 1=node01 slot=1-3
> >>>and fine with:
> >>>rank 0=node01 slot=0-1
> >>>rank 1=node01 slot=1-3
> >>>but segfaults with:
> >>>rank 0=node01 slot=0-2
> >>>rank 1=node01 slot=1-3
> >>>
> >>>This is on a two-processor quad-core opteron machine (occurs on all nodes of the cluster) with Ubuntu 8.10, kernel 2.6.27-16.
> >>>This is the simplest case that fails. Generally, I would like to bind processes to physical processors but always allow any core, like:
> >>>rank 0=node01 slot=p0:0-3
> >>>rank 1=node01 slot=p0:0-3
> >>>rank 2=node01 slot=p0:0-3
> >>>rank 3=node01 slot=p0:0-3
> >>>rank 4=node01 slot=p1:0-3
> >>>rank 5=node01 slot=p1:0-3
> >>>rank 6=node01 slot=p1:0-3
> >>>rank 7=node01 slot=p1:0-3
> >>>which fails too.
> >>>
> >>>This happens with a test code that contains only two lines of code, calling mpi_init and then mpi_finalize, and happens in both Fortran and C.
> >>>
> >>>One more interesting thing is that the problem with setting the process affinity does not occur on our four-processor quad-core Opteron nodes, with exactly the same OS etc.
> >>>
> >>>
> >>>Setting "--mca paffinity_base_verbose 5" shows what is going wrong for this rankfile:
> >>>rank 0=node01 slot=0-3
> >>>rank 1=node01 slot=0-3
> >>>------------- WRONG -----------------
> >>>[node01:23174] mca:base:select:(paffinity) Querying component [linux]
> >>>[node01:23174] mca:base:select:(paffinity) Query of component [linux] set priority to 10
> >>>[node01:23174] mca:base:select:(paffinity) Selected component [linux]
> >>>[node01:23174] paffinity slot assignment: slot_list == 0-3
> >>>[node01:23174] paffinity slot assignment: rank 0 runs on cpu #0 (#0)
> >>>[node01:23174] paffinity slot assignment: rank 0 runs on cpu #1 (#1)
> >>>[node01:23174] paffinity slot assignment: rank 0 runs on cpu #2 (#2)
> >>>[node01:23174] paffinity slot assignment: rank 0 runs on cpu #3 (#3)
> >>>[node01:23174] paffinity slot assignment: slot_list == 0-3
> >>>[node01:23174] paffinity slot assignment: rank 1 runs on cpu #0 (#0)
> >>>[node01:23174] paffinity slot assignment: rank 1 runs on cpu #1 (#1)
> >>>[node01:23174] paffinity slot assignment: rank 1 runs on cpu #2 (#2)
> >>>[node01:23174] paffinity slot assignment: rank 1 runs on cpu #3 (#3)
> >>>[node01:23175] mca:base:select:(paffinity) Querying component [linux]
> >>>[node01:23175] mca:base:select:(paffinity) Query of component [linux] set priority to 10
> >>>[node01:23175] mca:base:select:(paffinity) Selected component [linux]
> >>>[node01:23176] mca:base:select:(paffinity) Querying component [linux]
> >>>[node01:23176] mca:base:select:(paffinity) Query of component [linux] set priority to 10
> >>>[node01:23176] mca:base:select:(paffinity) Selected component [linux]
> >>>[node01:23175] paffinity slot assignment: slot_list == 0-3
> >>>[node01:23175] paffinity slot assignment: rank 0 runs on cpu #0 (#0)
> >>>[node01:23175] paffinity slot assignment: rank 0 runs on cpu #1 (#1)
> >>>[node01:23175] paffinity slot assignment: rank 0 runs on cpu #2 (#2)
> >>>[node01:23175] paffinity slot assignment: rank 0 runs on cpu #3 (#3)
> >>>[node01:23176] paffinity slot assignment: slot_list == 0-3
> >>>[node01:23176] paffinity slot assignment: rank 1 runs on cpu #0 (#0)
> >>>[node01:23176] paffinity slot assignment: rank 1 runs on cpu #1 (#1)
> >>>[node01:23176] paffinity slot assignment: rank 1 runs on cpu #2 (#2)
> >>>[node01:23176] paffinity slot assignment: rank 1 runs on cpu #3 (#3)
> >>>[node01:23175] *** Process received signal ***
> >>>[node01:23176] *** Process received signal ***
> >>>[node01:23175] Signal: Segmentation fault (11)
> >>>[node01:23175] Signal code: Address not mapped (1)
> >>>[node01:23175] Failing at address: 0x30
> >>>[node01:23176] Signal: Segmentation fault (11)
> >>>[node01:23176] Signal code: Address not mapped (1)
> >>>[node01:23176] Failing at address: 0x30
> >>>------------- WRONG -----------------
> >>>
> >>>------------- RIGHT -----------------
> >>>[node25:23241] mca:base:select:(paffinity) Querying component [linux]
> >>>[node25:23241] mca:base:select:(paffinity) Query of component [linux] set priority to 10
> >>>[node25:23241] mca:base:select:(paffinity) Selected component [linux]
> >>>[node25:23241] paffinity slot assignment: slot_list == 0-3
> >>>[node25:23241] paffinity slot assignment: rank 0 runs on cpu #0 (#0)
> >>>[node25:23241] paffinity slot assignment: rank 0 runs on cpu #1 (#1)
> >>>[node25:23241] paffinity slot assignment: rank 0 runs on cpu #2 (#2)
> >>>[node25:23241] paffinity slot assignment: rank 0 runs on cpu #3 (#3)
> >>>[node25:23241] paffinity slot assignment: slot_list == 0-3
> >>>[node25:23241] paffinity slot assignment: rank 1 runs on cpu #0 (#0)
> >>>[node25:23241] paffinity slot assignment: rank 1 runs on cpu #1 (#1)
> >>>[node25:23241] paffinity slot assignment: rank 1 runs on cpu #2 (#2)
> >>>[node25:23241] paffinity slot assignment: rank 1 runs on cpu #3 (#3)
> >>>[node25:23242] mca:base:select:(paffinity) Querying component [linux]
> >>>[node25:23242] mca:base:select:(paffinity) Query of component [linux] set priority to 10
> >>>[node25:23242] mca:base:select:(paffinity) Selected component [linux]
> >>>[node25:23243] mca:base:select:(paffinity) Querying component [linux]
> >>>[node25:23243] mca:base:select:(paffinity) Query of component [linux] set priority to 10
> >>>[node25:23243] mca:base:select:(paffinity) Selected component [linux]
> >>>------------- RIGHT -----------------
> >>>
> >>>Apparently, only the master process (ID [node01:23174] and [node25:23241]) sets the paffinity in the RIGHT case, but in the WRONG case the compute processes ([node01:23175] and [node01:23176], rank 0 and rank 1) also try to set their own paffinity properties.
> >>>
> >>>
> >>>
> >>>Note that the following rankfile notation does not work either. But that seems to have a different origin, as it tries to bind to core #4, whereas there are only cores 0-3:
> >>>rank 0=node01 slot=0:*
> >>>rank 1=node01 slot=0:*
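
The core #4 symptom above looks like an index running one past the last valid core. As a minimal sketch of the range check involved (not Open MPI's rankfile parser; `expand_cores` is a hypothetical helper, and the accepted forms "N", "N-M" and "*" are an assumed subset of the slot syntax covering only the core part), the following expands a core specification into a bitmask and rejects anything outside 0..ncores-1:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Expand a core specification into a bitmask of core indices.
 * Accepted forms (an assumption for this sketch): "N", "N-M", or "*".
 * Returns 0 on success, -1 on a malformed or out-of-range specification.
 */
static int expand_cores(const char *spec, int ncores, unsigned long *mask)
{
    long lo, hi, c;

    assert(spec != NULL && mask != NULL);
    if (ncores <= 0 || ncores > (int)(8 * sizeof(unsigned long)))
        return -1;
    *mask = 0;

    if (strcmp(spec, "*") == 0) {
        lo = 0;
        hi = ncores - 1;          /* last valid index is ncores - 1 */
    } else {
        char *end;
        lo = strtol(spec, &end, 10);
        if (end == spec)
            return -1;            /* no digits at all */
        if (*end == '-') {
            const char *p = end + 1;
            hi = strtol(p, &end, 10);
            if (end == p)
                return -1;        /* "N-" with nothing after the dash */
        } else {
            hi = lo;              /* single core */
        }
        if (*end != '\0')
            return -1;            /* trailing garbage */
    }

    /* Reject anything outside the cores that actually exist. */
    if (lo < 0 || hi < lo || hi >= ncores)
        return -1;

    for (c = lo; c <= hi; c++)
        *mask |= 1UL << c;
    return 0;
}
```

With four cores, `expand_cores("*", 4, &mask)` and `expand_cores("0-3", 4, &mask)` both yield the mask 0xF, while a request for core 4 is rejected instead of being passed on to the affinity call.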
> >>>
> >>>
> >>>Thanks for your help on this!
> >>>
> >>>--
> >>>Daan van Rossum
> >>>_______________________________________________
> >>>devel mailing list
> >>>devel@open-mpi.org
> >>>http://www.open-mpi.org/mailman/listinfo.cgi/devel
> >
> >--
> >Daan van Rossum
> >
> >University of Chicago
> >Department of Astronomy and Astrophysics
> >5640 S. Ellis Ave
> >Chicago, IL 60637
> >phone: 773-7020624
>
--
Daan van Rossum
University of Chicago
Department of Astronomy and Astrophysics
5640 S. Ellis Ave
Chicago, IL 60637
phone: 773-7020624