Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Open-MPI 1.4.2 : mpirun core-dumps when "-npernode N" is used at command line
From: Michael E. Thomadakis (miket7777_at_[hidden])
Date: 2010-08-24 15:25:19


  On 08/24/10 14:22, Michael E. Thomadakis wrote:
> Hi,
>
> I used a 'tee' command to capture the output but I forgot to also redirect
> stderr to the file.
>
> This is what a fresh make gave (gcc 4.1.2 again):
>
> ------------------------------------------------------------------
> ompi_debuggers.c:81: error: missing terminating " character
> ompi_debuggers.c:81: error: expected expression before ';' token
> ompi_debuggers.c: In function 'ompi_wait_for_debugger':
> ompi_debuggers.c:212: error: 'mpidbg_dll_locations' undeclared
> (first use in this function)
> ompi_debuggers.c:212: error: (Each undeclared identifier is reported only once
> ompi_debuggers.c:212: error: for each function it appears in.)
> ompi_debuggers.c:212: warning: passing argument 3 of 'check' from
> incompatible pointer type
> make[2]: *** [libdebuggers_la-ompi_debuggers.lo] Error 1
> make[1]: *** [all-recursive] Error 1
> make: *** [all-recursive] Error 1
>
> ------------------------------------------------------------------
>
> Is this critical for running OMPI code?
>
> Thanks for the quick reply, Ralph.
>
> Michael
>
> On Tue, 24 Aug 2010, Ralph Castain wrote:
>
> | Date: Tue, 24 Aug 2010 13:16:10 -0600
> | From: Ralph Castain<rhc_at_[hidden]>
> | To: Michael E.Thomadakis<miket7777_at_[hidden]>
> | Cc: Open MPI Users<users_at_[hidden]>, miket_at_[hidden]
> | Subject: Re: [OMPI users] Open-MPI 1.4.2 : mpirun core-dumps when
> | "-npernode N" is used at command line
> |
> | Ummm....the configure log terminates normally, indicating it configured fine. The make log ends, but with no error shown - everything was building just fine.
> |
> | Did you maybe stop it before it was complete? Run out of disk quota? Or...?
> |
> |
> | On Aug 24, 2010, at 1:06 PM, Michael E. Thomadakis wrote:
> |
> |> Hi Ralph,
> |>
> |> I tried to build 1.4.3.a1r23542 (08/02/2010) with
> |>
> |> ./configure --prefix="/g/software/openmpi-1.4.3a1r23542/gcc-4.1.2 2" --enable-cxx-exceptions CFLAGS="-O2" CXXFLAGS="-O2" FFLAGS="-O2" FCFLAGS="-O2"
> |> with the GCC 4.1.2
> |>
> |> miket_at_login002[pts/26]openmpi-1.4.3a1r23542 $ gcc -v
> |> Using built-in specs.
> |> Target: x86_64-redhat-linux
> |> Configured with: ../configure --prefix=/usr --mandir=/usr/share/man --infodir=/usr/share/info --enable-shared --enable-threads=posix --enable-checking=release --with-system-zlib --enable-__cxa_atexit --disable-libunwind-exceptions --enable-libgcj-multifile --enable-languages=c,c++,objc,obj-c++,java,fortran,ada --enable-java-awt=gtk --disable-dssi --enable-plugin --with-java-home=/usr/lib/jvm/java-1.4.2-gcj-1.4.2.0/jre --with-cpu=generic --host=x86_64-redhat-linux
> |> Thread model: posix
> |> gcc version 4.1.2 20080704 (Red Hat 4.1.2-46)
> |>
> |>
> |> but it failed. I am attaching the configure and make logs.
> |>
> |> regards
> |>
> |> Michael
> |>
> |>
> |> On 08/23/10 20:53, Ralph Castain wrote:
> |>>
> |>> Nope - none of them will work with 1.4.2. Sorry - bug not discovered until after release
> |>>
> |>> On Aug 23, 2010, at 7:45 PM, Michael E. Thomadakis wrote:
> |>>
> |>>> Hi Jeff,
> |>>> thanks for the quick reply.
> |>>>
> |>>> Would using '--cpus-per-proc N' in place of '-npernode N' or just '-bynode' do the trick?
> |>>>
> |>>> It seems that using '--loadbalance' also crashes mpirun.
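For reference, the alternative mapping options discussed above would be invoked roughly as follows. These are illustrative command lines only, reusing the host list and the 'hostname' test program that appear in the original report further down; they are not output from the reporter's cluster:

    mpirun -bynode -np 6 -H 'login001,login002,login003' hostname
    mpirun -cpus-per-proc 2 -np 6 -H 'login001,login002,login003' hostname
    mpirun --loadbalance -np 6 -H 'login001,login002,login003' hostname

As noted above, '--loadbalance' triggered the same crash as '-npernode' on 1.4.2.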
> |>>>
> |>>> best ...
> |>>>
> |>>> Michael
> |>>>
> |>>>
> |>>> On 08/23/10 19:30, Jeff Squyres wrote:
> |>>>>
> |>>>> Yes, the -npernode segv is a known issue.
> |>>>>
> |>>>> We have it fixed in the 1.4.x nightly tarballs; can you give it a whirl and see if that fixes your problem?
> |>>>>
> |>>>> http://www.open-mpi.org/nightly/v1.4/
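For anyone trying the same fix, fetching and building a 1.4 nightly looks roughly like this. This is a sketch only: the exact tarball name under the URL above changes from night to night, and the install prefix and flags here are placeholders, not the reporter's settings:

    wget http://www.open-mpi.org/nightly/v1.4/openmpi-1.4.3a1r23542.tar.gz
    tar xzf openmpi-1.4.3a1r23542.tar.gz && cd openmpi-1.4.3a1r23542
    ./configure --prefix=/opt/openmpi-1.4.3-nightly CFLAGS=-O2
    make -j8 all && make install

The 1.4.3a1r23542 snapshot used above is the one Michael reports building elsewhere in this thread.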
> |>>>>
> |>>>>
> |>>>>
> |>>>> On Aug 23, 2010, at 8:20 PM, Michael E. Thomadakis wrote:
> |>>>>
> |>>>>> Hello OMPI:
> |>>>>>
> |>>>>> We have installed OMPI v1.4.2 on a Nehalem cluster running CentOS 5.4. OMPI was built using Intel compilers 11.1.072. I am attaching the configuration log and the output from ompi_info -a.
> |>>>>>
> |>>>>> The problem we are encountering is that whenever we use the option '-npernode N' on the mpirun command line, we get a segmentation fault as shown below:
> |>>>>>
> |>>>>>
> |>>>>> miket_at_login002[pts/7]PS $ mpirun -npernode 1 --display-devel-map --tag-output -np 6 -cpus-per-proc 2 -H 'login001,login002,login003' hostname
> |>>>>>
> |>>>>> Map generated by mapping policy: 0402
> |>>>>> Npernode: 1 Oversubscribe allowed: TRUE CPU Lists: FALSE
> |>>>>> Num new daemons: 2 New daemon starting vpid 1
> |>>>>> Num nodes: 3
> |>>>>>
> |>>>>> Data for node: Name: login001 Launch id: -1 Arch: 0 State: 2
> |>>>>> Num boards: 1 Num sockets/board: 2 Num cores/socket: 4
> |>>>>> Daemon: [[44812,0],1] Daemon launched: False
> |>>>>> Num slots: 1 Slots in use: 2
> |>>>>> Num slots allocated: 1 Max slots: 0
> |>>>>> Username on node: NULL
> |>>>>> Num procs: 1 Next node_rank: 1
> |>>>>> Data for proc: [[44812,1],0]
> |>>>>> Pid: 0 Local rank: 0 Node rank: 0
> |>>>>> State: 0 App_context: 0 Slot list: NULL
> |>>>>>
> |>>>>> Data for node: Name: login002 Launch id: -1 Arch: ffc91200 State: 2
> |>>>>> Num boards: 1 Num sockets/board: 2 Num cores/socket: 4
> |>>>>> Daemon: [[44812,0],0] Daemon launched: True
> |>>>>> Num slots: 1 Slots in use: 2
> |>>>>> Num slots allocated: 1 Max slots: 0
> |>>>>> Username on node: NULL
> |>>>>> Num procs: 1 Next node_rank: 1
> |>>>>> Data for proc: [[44812,1],0]
> |>>>>> Pid: 0 Local rank: 0 Node rank: 0
> |>>>>> State: 0 App_context: 0 Slot list: NULL
> |>>>>>
> |>>>>> Data for node: Name: login003 Launch id: -1 Arch: 0 State: 2
> |>>>>> Num boards: 1 Num sockets/board: 2 Num cores/socket: 4
> |>>>>> Daemon: [[44812,0],2] Daemon launched: False
> |>>>>> Num slots: 1 Slots in use: 2
> |>>>>> Num slots allocated: 1 Max slots: 0
> |>>>>> Username on node: NULL
> |>>>>> Num procs: 1 Next node_rank: 1
> |>>>>> Data for proc: [[44812,1],0]
> |>>>>> Pid: 0 Local rank: 0 Node rank: 0
> |>>>>> State: 0 App_context: 0 Slot list: NULL
> |>>>>> [login002:02079] *** Process received signal ***
> |>>>>> [login002:02079] Signal: Segmentation fault (11)
> |>>>>> [login002:02079] Signal code: Address not mapped (1)
> |>>>>> [login002:02079] Failing at address: 0x50
> |>>>>> [login002:02079] [ 0] /lib64/libpthread.so.0 [0x3569a0e7c0]
> |>>>>> [login002:02079] [ 1] /g/software/openmpi-1.4.2/intel/lib/libopen-rte.so.0(orte_util_encode_pidmap+0xa7) [0x2afa70d25de7]
> |>>>>> [login002:02079] [ 2] /g/software/openmpi-1.4.2/intel/lib/libopen-rte.so.0(orte_odls_base_default_get_add_procs_data+0x3b8) [0x2afa70d36088]
> |>>>>> [login002:02079] [ 3] /g/software/openmpi-1.4.2/intel/lib/libopen-rte.so.0(orte_plm_base_launch_apps+0xd7) [0x2afa70d37fc7]
> |>>>>> [login002:02079] [ 4] /g/software/openmpi-1.4.2/intel/lib/openmpi/mca_plm_rsh.so [0x2afa721085a1]
> |>>>>> [login002:02079] [ 5] mpirun [0x404c27]
> |>>>>> [login002:02079] [ 6] mpirun [0x403e38]
> |>>>>> [login002:02079] [ 7] /lib64/libc.so.6(__libc_start_main+0xf4) [0x3568e1d994]
> |>>>>> [login002:02079] [ 8] mpirun [0x403d69]
> |>>>>> [login002:02079] *** End of error message ***
> |>>>>> Segmentation fault
> |>>>>>
> |>>>>> We tried version 1.4.1 and this problem did not emerge.
> |>>>>>
> |>>>>> This option is necessary when our users launch hybrid MPI-OpenMP code: in a PBS/Torque setup they request M nodes with n ppn each, and '-npernode' is how they get exactly the right number of MPI tasks per node. Unfortunately, as soon as we use the '-npernode N' option, mpirun crashes.
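For context, the hybrid MPI+OpenMP launch pattern described here usually looks something like the following under PBS/Torque. This is a sketch with placeholder values; the node/ppn counts, thread count, and application name are illustrative, not taken from the reporter's site:

    #PBS -l nodes=4:ppn=8
    export OMP_NUM_THREADS=8
    mpirun -npernode 1 -np 4 ./hybrid_app

That is, one MPI task per node with OpenMP threads filling each node's cores, which is exactly the placement '-npernode' is meant to provide.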
> |>>>>>
> |>>>>> Is this a known issue? I found a related problem (from around May 2010) where people were using the same option, but in a SLURM environment.
> |>>>>>
> |>>>>> regards
> |>>>>>
> |>>>>> Michael
> |>>>>>
> |>>>>> <config.log.gz><ompi_info-a.out.gz>_______________________________________________
> |>>>>> users mailing list
> |>>>>> users_at_[hidden]
> |>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
> |>>>
> |>>> _______________________________________________
> |>>> users mailing list
> |>>> users_at_[hidden]
> |>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
> |>>
> |>>
> |>> _______________________________________________
> |>> users mailing list
> |>> users_at_[hidden]
> |>> http://www.open-mpi.org/mailman/listinfo.cgi/users
> |>
> |> <config_1.4.3.log.gz><make_1.4.3.out.gz>
> |
> |
>