Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] Open-MPI 1.4.2 : mpirun core-dumps when "-npernode N" is used at command line
From: Michael E. Thomadakis (miket7777_at_[hidden])
Date: 2010-08-24 15:06:46


  Hi Ralph,

I tried to build 1.4.3a1r23542 (08/02/2010) with

./configure --prefix="/g/software/openmpi-1.4.3a1r23542/gcc-4.1.2" \
    --enable-cxx-exceptions CFLAGS="-O2" CXXFLAGS="-O2" FFLAGS="-O2" \
    FCFLAGS="-O2"

using GCC 4.1.2:

miket_at_login002[pts/26]openmpi-1.4.3a1r23542 $ gcc -v
Using built-in specs.
Target: x86_64-redhat-linux
Configured with: ../configure --prefix=/usr --mandir=/usr/share/man
--infodir=/usr/share/info --enable-shared --enable-threads=posix
--enable-checking=release --with-system-zlib --enable-__cxa_atexit
--disable-libunwind-exceptions --enable-libgcj-multifile
--enable-languages=c,c++,objc,obj-c++,java,fortran,ada
--enable-java-awt=gtk --disable-dssi --enable-plugin
--with-java-home=/usr/lib/jvm/java-1.4.2-gcj-1.4.2.0/jre
--with-cpu=generic --host=x86_64-redhat-linux
Thread model: posix
gcc version 4.1.2 20080704 (Red Hat 4.1.2-46)

but it failed. I am attaching the configure and make logs.

regards

Michael

On 08/23/10 20:53, Ralph Castain wrote:
> Nope - none of them will work with 1.4.2. Sorry - bug not discovered
> until after release
>
> On Aug 23, 2010, at 7:45 PM, Michael E. Thomadakis wrote:
>
>> Hi Jeff,
>> thanks for the quick reply.
>>
>> Would using '--cpus-per-proc N' in place of '-npernode N', or just
>> '-bynode', do the trick?
>>
>> It seems that using '--loadbalance' also crashes mpirun.
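>>
>> For concreteness, the alternative invocations I have in mind would look
>> roughly like this (the hostnames and counts are just placeholders from
>> my earlier test):
>>
>> # one rank per node via round-robin placement, instead of -npernode 1
>> mpirun -bynode -np 3 -H login001,login002,login003 hostname
>>
>> # same rank count as before, relying only on --cpus-per-proc
>> mpirun -np 6 --cpus-per-proc 2 -H login001,login002,login003 hostname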
>>
>> best ...
>>
>> Michael
>>
>>
>> On 08/23/10 19:30, Jeff Squyres wrote:
>>> Yes, the -npernode segv is a known issue.
>>>
>>> We have it fixed in the 1.4.x nightly tarballs; can you give it a whirl and see if that fixes your problem?
>>>
>>> http://www.open-mpi.org/nightly/v1.4/
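>>>
>>> Something along these lines should do it (the exact snapshot name changes
>>> nightly, so treat this purely as a sketch):
>>>
>>> wget http://www.open-mpi.org/nightly/v1.4/openmpi-1.4.3a1r23542.tar.gz
>>> tar xzf openmpi-1.4.3a1r23542.tar.gz
>>> cd openmpi-1.4.3a1r23542
>>> ./configure --prefix=$HOME/openmpi-1.4-nightly && make -j4 all install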
>>>
>>>
>>>
>>> On Aug 23, 2010, at 8:20 PM, Michael E. Thomadakis wrote:
>>>
>>>> Hello OMPI:
>>>>
>>>> We have installed OMPI V1.4.2 on a Nehalem cluster running CentOS 5.4. OMPI was built using the Intel compilers 11.1.072. I am attaching the configuration log and the output of ompi_info -a.
>>>>
>>>> The problem we are encountering is that whenever we use the '-npernode N' option on the mpirun command line, we get a segmentation fault, as shown below:
>>>>
>>>>
>>>> miket_at_login002[pts/7]PS $ mpirun -npernode 1 --display-devel-map --tag-output -np 6 -cpus-per-proc 2 -H 'login001,login002,login003' hostname
>>>>
>>>> Map generated by mapping policy: 0402
>>>> Npernode: 1 Oversubscribe allowed: TRUE CPU Lists: FALSE
>>>> Num new daemons: 2 New daemon starting vpid 1
>>>> Num nodes: 3
>>>>
>>>> Data for node: Name: login001 Launch id: -1 Arch: 0 State: 2
>>>> Num boards: 1 Num sockets/board: 2 Num cores/socket: 4
>>>> Daemon: [[44812,0],1] Daemon launched: False
>>>> Num slots: 1 Slots in use: 2
>>>> Num slots allocated: 1 Max slots: 0
>>>> Username on node: NULL
>>>> Num procs: 1 Next node_rank: 1
>>>> Data for proc: [[44812,1],0]
>>>> Pid: 0 Local rank: 0 Node rank: 0
>>>> State: 0 App_context: 0 Slot list: NULL
>>>>
>>>> Data for node: Name: login002 Launch id: -1 Arch: ffc91200 State: 2
>>>> Num boards: 1 Num sockets/board: 2 Num cores/socket: 4
>>>> Daemon: [[44812,0],0] Daemon launched: True
>>>> Num slots: 1 Slots in use: 2
>>>> Num slots allocated: 1 Max slots: 0
>>>> Username on node: NULL
>>>> Num procs: 1 Next node_rank: 1
>>>> Data for proc: [[44812,1],0]
>>>> Pid: 0 Local rank: 0 Node rank: 0
>>>> State: 0 App_context: 0 Slot list: NULL
>>>>
>>>> Data for node: Name: login003 Launch id: -1 Arch: 0 State: 2
>>>> Num boards: 1 Num sockets/board: 2 Num cores/socket: 4
>>>> Daemon: [[44812,0],2] Daemon launched: False
>>>> Num slots: 1 Slots in use: 2
>>>> Num slots allocated: 1 Max slots: 0
>>>> Username on node: NULL
>>>> Num procs: 1 Next node_rank: 1
>>>> Data for proc: [[44812,1],0]
>>>> Pid: 0 Local rank: 0 Node rank: 0
>>>> State: 0 App_context: 0 Slot list: NULL
>>>> [login002:02079] *** Process received signal ***
>>>> [login002:02079] Signal: Segmentation fault (11)
>>>> [login002:02079] Signal code: Address not mapped (1)
>>>> [login002:02079] Failing at address: 0x50
>>>> [login002:02079] [ 0] /lib64/libpthread.so.0 [0x3569a0e7c0]
>>>> [login002:02079] [ 1] /g/software/openmpi-1.4.2/intel/lib/libopen-rte.so.0(orte_util_encode_pidmap+0xa7) [0x2afa70d25de7]
>>>> [login002:02079] [ 2] /g/software/openmpi-1.4.2/intel/lib/libopen-rte.so.0(orte_odls_base_default_get_add_procs_data+0x3b8) [0x2afa70d36088]
>>>> [login002:02079] [ 3] /g/software/openmpi-1.4.2/intel/lib/libopen-rte.so.0(orte_plm_base_launch_apps+0xd7) [0x2afa70d37fc7]
>>>> [login002:02079] [ 4] /g/software/openmpi-1.4.2/intel/lib/openmpi/mca_plm_rsh.so [0x2afa721085a1]
>>>> [login002:02079] [ 5] mpirun [0x404c27]
>>>> [login002:02079] [ 6] mpirun [0x403e38]
>>>> [login002:02079] [ 7] /lib64/libc.so.6(__libc_start_main+0xf4) [0x3568e1d994]
>>>> [login002:02079] [ 8] mpirun [0x403d69]
>>>> [login002:02079] *** End of error message ***
>>>> Segmentation fault
>>>>
>>>> We tried version 1.4.1 and this problem did not emerge.
>>>>
>>>> This option is necessary because our users launch hybrid MPI-OpenMP codes where they request M nodes and n ppn in our PBS/Torque setup, and '-npernode' is what ensures they get exactly the right number of MPI tasks per node. Unfortunately, as soon as we use the '-npernode N' option, mpirun crashes.
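>>>>
>>>> As a rough illustration (the resource request, thread count, and program
>>>> name below are just placeholders), a typical hybrid job script here looks
>>>> like:
>>>>
>>>> #!/bin/bash
>>>> #PBS -l nodes=4:ppn=8
>>>> export OMP_NUM_THREADS=8
>>>> # one MPI rank per node; each rank runs 8 OpenMP threads
>>>> mpirun -npernode 1 -np 4 -x OMP_NUM_THREADS ./hybrid_app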
>>>>
>>>> Is this a known issue? I found a report of a related problem (from around May 2010) where people were using the same option, but in a SLURM environment.
>>>>
>>>> regards
>>>>
>>>> Michael
>>>>
>>>> <config.log.gz><ompi_info-a.out.gz>