Hi Jeff

On 08/24/10 15:24, Jeff Squyres wrote:
I'm a little confused by your configure line:

./configure --prefix=/g/software/openmpi-1.4.3a1r23542/gcc-4.1.2 2 --enable-cxx-exceptions CFLAGS=-O2 CXXFLAGS=-O2 FFLAGS=-O2 FCFLAGS=-O2


Oops, that '2' was a leftover character from when I edited the command line to configure for GCC (the original command line configured with the Intel compilers). Thanks for noticing this.

I reran configure with

./configure --prefix="/g/software/openmpi-1.4.3a1r23542/gcc-4.1.2" --enable-cxx-exceptions  CFLAGS="-O2" CXXFLAGS="-O2"  FFLAGS="-O2" FCFLAGS="-O2"

and ran make, and this time I did NOT notice any error messages.

Thanks for the help with this. I will now run mpirun with various options in a PBS/Torque environment and see if hybrid MPI+OMP jobs are placed on the nodes in a sane fashion.
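
A minimal sketch of how such a hybrid launch line could be derived inside a PBS/Torque job. The node count, thread count and the ./hybrid_app binary are hypothetical values picked for illustration (8 cores per node matches the 2 sockets x 4 cores reported in the map output further down in this thread). The script only prints the command, it does not run it.

```shell
#!/bin/sh
# Hypothetical job geometry; in a real job these would mirror the
# "#PBS -l nodes=M:ppn=n" request.
nodes=3              # M nodes requested
cores_per_node=8     # n ppn (2 sockets x 4 cores per node)
threads_per_rank=4   # OpenMP threads each MPI rank should spawn

# One MPI rank per group of threads_per_rank cores on every node.
ranks_per_node=$((cores_per_node / threads_per_rank))
total_ranks=$((nodes * ranks_per_node))

# Print (do not execute) the launch line; ./hybrid_app is a placeholder.
launch_cmd="mpirun -x OMP_NUM_THREADS=$threads_per_rank -npernode $ranks_per_node -np $total_ranks ./hybrid_app"
echo "$launch_cmd"
```

With the values above this gives 2 ranks per node and 6 ranks in total; whether the ranks actually land where expected is exactly what still needs to be checked on the cluster.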

Thanks


Michael



What's the lone "2" in the middle (after the prefix)?

With that extra "2", I'm not able to get configure to complete successfully (because it interprets that "2" as a platform name that does not exist).  If I remove that "2", configure completes properly and the build completes properly.
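
To see why the stray character matters, here is a small sketch of how the shell splits that command line. show_args is a hypothetical helper, not part of configure; it only prints each argument it receives on its own line, so the lone 2 shows up as a separate word that configure then has to interpret.

```shell
#!/bin/sh
# Hypothetical helper: print every argument in brackets, one per line.
show_args() {
    for arg in "$@"; do
        printf '[%s]\n' "$arg"
    done
}

# Same shape as the failing configure line: the unquoted 2 after the
# prefix becomes its own argument.
args_seen=$(show_args --prefix=/g/software/openmpi-1.4.3a1r23542/gcc-4.1.2 2 --enable-cxx-exceptions CFLAGS=-O2)
echo "$args_seen"
```

The second bracketed line contains only the 2, which is the non-option argument that trips up configure.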

I'm afraid I no longer have any RH hosts to test on.  Can you do the following:

cd top_of_build_dir
cd ompi/debuggers
rm ompi_debuggers.lo
make

Then copy-n-paste the gcc command used to compile the ompi_debuggers.o file, remove "-o .libs/libdebuggers_la-ompi_debuggers.o", and add "-E", and redirect the output to a file.  Then send me that file -- it should give more of a clue as to exactly what the problem is that you're seeing.
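
The edit described above could be sketched roughly like this. The compile line here is a made-up stand-in (the real one has to be copied from the make output), and ompi_debuggers.i is just an assumed name for the output file.

```shell
#!/bin/sh
# Hypothetical compile line; paste the real one from the make output instead.
compile_cmd='gcc -DHAVE_CONFIG_H -I. -O2 -c ompi_debuggers.c -o .libs/libdebuggers_la-ompi_debuggers.o'

# Drop the "-o <object>" pair and add -E right after the compiler name so
# gcc stops after preprocessing and writes the result to stdout.
preprocess_cmd=$(printf '%s\n' "$compile_cmd" |
    sed -e 's| -o \.libs/libdebuggers_la-ompi_debuggers\.o||' \
        -e 's|^gcc |gcc -E |')

# Show the command to run; the redirect target is an assumed file name.
echo "$preprocess_cmd > ompi_debuggers.i"
```

The leftover -c should be harmless, since -E stops gcc before the compile stage.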




On Aug 24, 2010, at 3:25 PM, Michael E. Thomadakis wrote:

On 08/24/10 14:22, Michael E. Thomadakis wrote:
Hi,

I used a 'tee' command to capture the output but I forgot to also redirect
stderr to the file.
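
For reference, a minimal sketch of capturing both streams with tee. fake_build is a hypothetical stand-in for make that writes one line to stdout and one to stderr; the log goes to a temporary file.

```shell
#!/bin/sh
# Stand-in for "make": one line on stdout, one on stderr.
fake_build() {
    echo "compiling ompi_debuggers.c"
    echo "ompi_debuggers.c:81: error: missing terminating \" character" >&2
}

log_file=$(mktemp)

# 2>&1 merges stderr into stdout before the pipe, so tee sees both.
fake_build 2>&1 | tee "$log_file"
```

With the real build the same pattern would be "make 2>&1 | tee make.log", where make.log is an assumed file name.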

This is what a fresh make gave (gcc 4.1.2 again) :

------------------------------------------------------------------
ompi_debuggers.c:81: error: missing terminating " character
ompi_debuggers.c:81: error: expected expression before ‘;’ token
ompi_debuggers.c: In function ‘ompi_wait_for_debugger’:
ompi_debuggers.c:212: error: ‘mpidbg_dll_locations’ undeclared
(first use in this function)
ompi_debuggers.c:212: error: (Each undeclared identifier is reported only once
ompi_debuggers.c:212: error: for each function it appears in.)
ompi_debuggers.c:212: warning: passing argument 3 of ‘check’ from
incompatible pointer type
make[2]: *** [libdebuggers_la-ompi_debuggers.lo] Error 1
make[1]: *** [all-recursive] Error 1
make: *** [all-recursive] Error 1

------------------------------------------------------------------

Is this critical to run OMPI code?

Thanks for the quick reply Ralph,

Michael

On Tue, 24 Aug 2010, Ralph Castain wrote:

| Date: Tue, 24 Aug 2010 13:16:10 -0600
| From: Ralph Castain<rhc@open-mpi.org>
| To: Michael E.Thomadakis<miket7777@gmail.com>
| Cc: Open MPI Users<users@open-mpi.org>, miket@sc.tamu.edu
| Subject: Re: [OMPI users] Open-MPI 1.4.2 : mpirun core-dumps when
|     "-npernode N" is used at command line
|
| Ummm....the configure log terminates normally, indicating it configured fine. The make log ends, but with no error shown - everything was building just fine.
|
| Did you maybe stop it before it was complete? Run out of disk quota? Or...?
|
|
| On Aug 24, 2010, at 1:06 PM, Michael E. Thomadakis wrote:
|
|>  Hi Ralph,
|>
|>  I tried to build 1.4.3a1r23542 (08/02/2010) with
|>
|>  ./configure --prefix="/g/software/openmpi-1.4.3a1r23542/gcc-4.1.2 2" --enable-cxx-exceptions  CFLAGS="-O2" CXXFLAGS="-O2"  FFLAGS="-O2" FCFLAGS="-O2"
|>  with the GCC 4.1.2
|>
|>  miket@login002[pts/26]openmpi-1.4.3a1r23542 $ gcc -v
|>  Using built-in specs.
|>  Target: x86_64-redhat-linux
|>  Configured with: ../configure --prefix=/usr --mandir=/usr/share/man --infodir=/usr/share/info --enable-shared --enable-threads=posix --enable-checking=release --with-system-zlib --enable-__cxa_atexit --disable-libunwind-exceptions --enable-libgcj-multifile --enable-languages=c,c++,objc,obj-c++,java,fortran,ada --enable-java-awt=gtk --disable-dssi --enable-plugin --with-java-home=/usr/lib/jvm/java-1.4.2-gcj-1.4.2.0/jre --with-cpu=generic --host=x86_64-redhat-linux
|>  Thread model: posix
|>  gcc version 4.1.2 20080704 (Red Hat 4.1.2-46)
|>
|>
|>  but it failed. I am attaching the configure and make logs.
|>
|>  regards
|>
|>  Michael
|>
|>
|>  On 08/23/10 20:53, Ralph Castain wrote:
|>>
|>>  Nope - none of them will work with 1.4.2. Sorry - bug not discovered until after release
|>>
|>>  On Aug 23, 2010, at 7:45 PM, Michael E. Thomadakis wrote:
|>>
|>>>  Hi Jeff,
|>>>  thanks for the quick reply.
|>>>
|>>>  Would using '--cpus-per-proc N' in place of '-npernode N' or just '-bynode' do the trick?
|>>>
|>>>  It seems that using '--loadbalance' also crashes mpirun.
|>>>
|>>>  best ...
|>>>
|>>>  Michael
|>>>
|>>>
|>>>  On 08/23/10 19:30, Jeff Squyres wrote:
|>>>>
|>>>>  Yes, the -npernode segv is a known issue.
|>>>>
|>>>>  We have it fixed in the 1.4.x nightly tarballs; can you give it a whirl and see if that fixes your problem?
|>>>>
|>>>>      http://www.open-mpi.org/nightly/v1.4/
|>>>>
|>>>>
|>>>>
|>>>>  On Aug 23, 2010, at 8:20 PM, Michael E. Thomadakis wrote:
|>>>>
|>>>>>  Hello OMPI:
|>>>>>
|>>>>>  We have installed OMPI V1.4.2 on a Nehalem cluster running CentOS 5.4. OMPI was built using Intel compilers 11.1.072. I am attaching the configuration log and output from ompi_info -a.
|>>>>>
|>>>>>  The problem we are encountering is that whenever we use the option '-npernode N' on the mpirun command line we get a segmentation fault, as shown below:
|>>>>>
|>>>>>
|>>>>>  miket@login002[pts/7]PS $ mpirun -npernode 1  --display-devel-map  --tag-output -np 6 -cpus-per-proc 2 -H 'login001,login002,login003' hostname
|>>>>>
|>>>>>   Map generated by mapping policy: 0402
|>>>>>          Npernode: 1     Oversubscribe allowed: TRUE     CPU Lists: FALSE
|>>>>>          Num new daemons: 2      New daemon starting vpid 1
|>>>>>          Num nodes: 3
|>>>>>
|>>>>>   Data for node: Name: login001          Launch id: -1   Arch: 0 State: 2
|>>>>>          Num boards: 1   Num sockets/board: 2    Num cores/socket: 4
|>>>>>          Daemon: [[44812,0],1]   Daemon launched: False
|>>>>>          Num slots: 1    Slots in use: 2
|>>>>>          Num slots allocated: 1  Max slots: 0
|>>>>>          Username on node: NULL
|>>>>>          Num procs: 1    Next node_rank: 1
|>>>>>          Data for proc: [[44812,1],0]
|>>>>>                  Pid: 0  Local rank: 0   Node rank: 0
|>>>>>                  State: 0        App_context: 0  Slot list: NULL
|>>>>>
|>>>>>   Data for node: Name: login002          Launch id: -1   Arch: ffc91200  State: 2
|>>>>>          Num boards: 1   Num sockets/board: 2    Num cores/socket: 4
|>>>>>          Daemon: [[44812,0],0]   Daemon launched: True
|>>>>>          Num slots: 1    Slots in use: 2
|>>>>>          Num slots allocated: 1  Max slots: 0
|>>>>>          Username on node: NULL
|>>>>>          Num procs: 1    Next node_rank: 1
|>>>>>          Data for proc: [[44812,1],0]
|>>>>>                  Pid: 0  Local rank: 0   Node rank: 0
|>>>>>                  State: 0        App_context: 0  Slot list: NULL
|>>>>>
|>>>>>   Data for node: Name: login003          Launch id: -1   Arch: 0 State: 2
|>>>>>          Num boards: 1   Num sockets/board: 2    Num cores/socket: 4
|>>>>>          Daemon: [[44812,0],2]   Daemon launched: False
|>>>>>          Num slots: 1    Slots in use: 2
|>>>>>          Num slots allocated: 1  Max slots: 0
|>>>>>          Username on node: NULL
|>>>>>          Num procs: 1    Next node_rank: 1
|>>>>>          Data for proc: [[44812,1],0]
|>>>>>                  Pid: 0  Local rank: 0   Node rank: 0
|>>>>>                  State: 0        App_context: 0  Slot list: NULL
|>>>>>  [login002:02079] *** Process received signal ***
|>>>>>  [login002:02079] Signal: Segmentation fault (11)
|>>>>>  [login002:02079] Signal code: Address not mapped (1)
|>>>>>  [login002:02079] Failing at address: 0x50
|>>>>>  [login002:02079] [ 0] /lib64/libpthread.so.0 [0x3569a0e7c0]
|>>>>>  [login002:02079] [ 1] /g/software/openmpi-1.4.2/intel/lib/libopen-rte.so.0(orte_util_encode_pidmap+0xa7) [0x2afa70d25de7]
|>>>>>  [login002:02079] [ 2] /g/software/openmpi-1.4.2/intel/lib/libopen-rte.so.0(orte_odls_base_default_get_add_procs_data+0x3b8) [0x2afa70d36088]
|>>>>>  [login002:02079] [ 3] /g/software/openmpi-1.4.2/intel/lib/libopen-rte.so.0(orte_plm_base_launch_apps+0xd7) [0x2afa70d37fc7]
|>>>>>  [login002:02079] [ 4] /g/software/openmpi-1.4.2/intel/lib/openmpi/mca_plm_rsh.so [0x2afa721085a1]
|>>>>>  [login002:02079] [ 5] mpirun [0x404c27]
|>>>>>  [login002:02079] [ 6] mpirun [0x403e38]
|>>>>>  [login002:02079] [ 7] /lib64/libc.so.6(__libc_start_main+0xf4) [0x3568e1d994]
|>>>>>  [login002:02079] [ 8] mpirun [0x403d69]
|>>>>>  [login002:02079] *** End of error message ***
|>>>>>  Segmentation fault
|>>>>>
|>>>>>  We tried version 1.4.1 and this problem did not emerge.
|>>>>>
|>>>>>  This option is necessary when our users launch hybrid MPI-OMP code, where they can request M nodes and n ppn in a PBS/Torque setup, so that they get only the right number of MPI tasks. Unfortunately, as soon as we use the '-npernode N' option mpirun crashes.
|>>>>>
|>>>>>  Is this a known issue? I found a related problem (from around May 2010) where people were using the same option, but in a SLURM environment.
|>>>>>
|>>>>>  regards
|>>>>>
|>>>>>  Michael
|>>>>>
|>>>>>  <config.log.gz><ompi_info-a.out.gz>_______________________________________________
|>>>>>  users mailing list
|>>>>>  users@open-mpi.org
|>>>>>  http://www.open-mpi.org/mailman/listinfo.cgi/users
|>>>
|>>
|>>
|>
|>  <config_1.4.3.log.gz><make_1.4.3.out.gz>
|
|
