
Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] Open-MPI 1.4.2 : mpirun core-dumps when "-npernode N" is used at command line
From: Ralph Castain (rhc_at_[hidden])
Date: 2010-08-24 15:16:10


Ummm... the configure log terminates normally, indicating it configured fine. And the make log simply ends with no error shown - everything was building just fine.

Did you maybe stop it before it was complete? Run out of disk quota? Or...?
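
If it helps, rerunning the build like this would make any failure explicit (just a sketch - the log file name is only illustrative):

  make all > make.out 2>&1
  echo "make exited with status $?"   # non-zero means make really did fail
  quota -s ; df -h .                  # rule out a blown quota or a full filesystem

If make exits with status 0 and make.out ends cleanly, then the tree did finish building.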

On Aug 24, 2010, at 1:06 PM, Michael E. Thomadakis wrote:

> Hi Ralph,
>
> I tried to build 1.4.3a1r23542 (08/02/2010) with
>
> ./configure --prefix="/g/software/openmpi-1.4.3a1r23542/gcc-4.1.2" --enable-cxx-exceptions CFLAGS="-O2" CXXFLAGS="-O2" FFLAGS="-O2" FCFLAGS="-O2"
> using GCC 4.1.2:
>
> miket_at_login002[pts/26]openmpi-1.4.3a1r23542 $ gcc -v
> Using built-in specs.
> Target: x86_64-redhat-linux
> Configured with: ../configure --prefix=/usr --mandir=/usr/share/man --infodir=/usr/share/info --enable-shared --enable-threads=posix --enable-checking=release --with-system-zlib --enable-__cxa_atexit --disable-libunwind-exceptions --enable-libgcj-multifile --enable-languages=c,c++,objc,obj-c++,java,fortran,ada --enable-java-awt=gtk --disable-dssi --enable-plugin --with-java-home=/usr/lib/jvm/java-1.4.2-gcj-1.4.2.0/jre --with-cpu=generic --host=x86_64-redhat-linux
> Thread model: posix
> gcc version 4.1.2 20080704 (Red Hat 4.1.2-46)
>
>
> but it failed. I am attaching the configure and make logs.
>
> regards
>
> Michael
>
>
> On 08/23/10 20:53, Ralph Castain wrote:
>>
>> Nope - none of them will work with 1.4.2. Sorry - bug not discovered until after release
>>
>> On Aug 23, 2010, at 7:45 PM, Michael E. Thomadakis wrote:
>>
>>> Hi Jeff,
>>> thanks for the quick reply.
>>>
>>> Would using '--cpus-per-proc N' in place of '-npernode N' or just '-bynode' do the trick?
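>>>
>>> For reference, the variants I have in mind look roughly like this (an untested sketch, just adapting the command line from my earlier report; the host list is whatever the allocation gives us):
>>>
>>> mpirun -bynode -np 6 -H login001,login002,login003 hostname
>>> mpirun -cpus-per-proc 2 -np 6 -H login001,login002,login003 hostname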
>>>
>>> It seems that using '--loadbalance' also crashes mpirun.
>>>
>>> best ...
>>>
>>> Michael
>>>
>>>
>>> On 08/23/10 19:30, Jeff Squyres wrote:
>>>>
>>>> Yes, the -npernode segv is a known issue.
>>>>
>>>> We have it fixed in the 1.4.x nightly tarballs; can you give it a whirl and see if that fixes your problem?
>>>>
>>>> http://www.open-mpi.org/nightly/v1.4/
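>>>>
>>>> Something along these lines should do it (a rough sketch - substitute whatever tarball name is current on that page and your usual install prefix):
>>>>
>>>> wget http://www.open-mpi.org/nightly/v1.4/openmpi-1.4.3a1r23542.tar.gz
>>>> tar xzf openmpi-1.4.3a1r23542.tar.gz && cd openmpi-1.4.3a1r23542
>>>> ./configure --prefix=$HOME/ompi-nightly-test && make -j4 all install
>>>> $HOME/ompi-nightly-test/bin/mpirun -npernode 1 -H login001,login002,login003 hostname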
>>>>
>>>>
>>>>
>>>> On Aug 23, 2010, at 8:20 PM, Michael E. Thomadakis wrote:
>>>>
>>>>> Hello OMPI:
>>>>>
>>>>> We have installed OMPI v1.4.2 on a Nehalem cluster running CentOS 5.4. OMPI was built using Intel compilers 11.1.072. I am attaching the configuration log and the output of ompi_info -a.
>>>>>
>>>>> The problem we are encountering is that whenever we use the '-npernode N' option on the mpirun command line, we get a segmentation fault, as shown below:
>>>>>
>>>>>
>>>>> miket_at_login002[pts/7]PS $ mpirun -npernode 1 --display-devel-map --tag-output -np 6 -cpus-per-proc 2 -H 'login001,login002,login003' hostname
>>>>>
>>>>> Map generated by mapping policy: 0402
>>>>> Npernode: 1 Oversubscribe allowed: TRUE CPU Lists: FALSE
>>>>> Num new daemons: 2 New daemon starting vpid 1
>>>>> Num nodes: 3
>>>>>
>>>>> Data for node: Name: login001 Launch id: -1 Arch: 0 State: 2
>>>>> Num boards: 1 Num sockets/board: 2 Num cores/socket: 4
>>>>> Daemon: [[44812,0],1] Daemon launched: False
>>>>> Num slots: 1 Slots in use: 2
>>>>> Num slots allocated: 1 Max slots: 0
>>>>> Username on node: NULL
>>>>> Num procs: 1 Next node_rank: 1
>>>>> Data for proc: [[44812,1],0]
>>>>> Pid: 0 Local rank: 0 Node rank: 0
>>>>> State: 0 App_context: 0 Slot list: NULL
>>>>>
>>>>> Data for node: Name: login002 Launch id: -1 Arch: ffc91200 State: 2
>>>>> Num boards: 1 Num sockets/board: 2 Num cores/socket: 4
>>>>> Daemon: [[44812,0],0] Daemon launched: True
>>>>> Num slots: 1 Slots in use: 2
>>>>> Num slots allocated: 1 Max slots: 0
>>>>> Username on node: NULL
>>>>> Num procs: 1 Next node_rank: 1
>>>>> Data for proc: [[44812,1],0]
>>>>> Pid: 0 Local rank: 0 Node rank: 0
>>>>> State: 0 App_context: 0 Slot list: NULL
>>>>>
>>>>> Data for node: Name: login003 Launch id: -1 Arch: 0 State: 2
>>>>> Num boards: 1 Num sockets/board: 2 Num cores/socket: 4
>>>>> Daemon: [[44812,0],2] Daemon launched: False
>>>>> Num slots: 1 Slots in use: 2
>>>>> Num slots allocated: 1 Max slots: 0
>>>>> Username on node: NULL
>>>>> Num procs: 1 Next node_rank: 1
>>>>> Data for proc: [[44812,1],0]
>>>>> Pid: 0 Local rank: 0 Node rank: 0
>>>>> State: 0 App_context: 0 Slot list: NULL
>>>>> [login002:02079] *** Process received signal ***
>>>>> [login002:02079] Signal: Segmentation fault (11)
>>>>> [login002:02079] Signal code: Address not mapped (1)
>>>>> [login002:02079] Failing at address: 0x50
>>>>> [login002:02079] [ 0] /lib64/libpthread.so.0 [0x3569a0e7c0]
>>>>> [login002:02079] [ 1] /g/software/openmpi-1.4.2/intel/lib/libopen-rte.so.0(orte_util_encode_pidmap+0xa7) [0x2afa70d25de7]
>>>>> [login002:02079] [ 2] /g/software/openmpi-1.4.2/intel/lib/libopen-rte.so.0(orte_odls_base_default_get_add_procs_data+0x3b8) [0x2afa70d36088]
>>>>> [login002:02079] [ 3] /g/software/openmpi-1.4.2/intel/lib/libopen-rte.so.0(orte_plm_base_launch_apps+0xd7) [0x2afa70d37fc7]
>>>>> [login002:02079] [ 4] /g/software/openmpi-1.4.2/intel/lib/openmpi/mca_plm_rsh.so [0x2afa721085a1]
>>>>> [login002:02079] [ 5] mpirun [0x404c27]
>>>>> [login002:02079] [ 6] mpirun [0x403e38]
>>>>> [login002:02079] [ 7] /lib64/libc.so.6(__libc_start_main+0xf4) [0x3568e1d994]
>>>>> [login002:02079] [ 8] mpirun [0x403d69]
>>>>> [login002:02079] *** End of error message ***
>>>>> Segmentation fault
>>>>>
>>>>> We tried version 1.4.1 and this problem did not occur.
>>>>>
>>>>> This option is necessary when our users launch hybrid MPI-OpenMP codes, where they request M nodes and n ppn in a PBS/Torque setup so that they get exactly the right number of MPI tasks. Unfortunately, as soon as we use the '-npernode N' option, mpirun crashes.
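>>>>>
>>>>> For illustration, the intended usage looks roughly like this (a simplified sketch; the node/thread counts and the executable name are placeholders):
>>>>>
>>>>> #PBS -l nodes=4:ppn=8
>>>>> export OMP_NUM_THREADS=8
>>>>> mpirun -npernode 1 ./hybrid_app    # one MPI task per node, OpenMP threads fill the cores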
>>>>>
>>>>> Is this a known issue? I found a related problem (from around May 2010) where people were using the same option, but in a SLURM environment.
>>>>>
>>>>> regards
>>>>>
>>>>> Michael
>>>>>
>>>>> <config.log.gz><ompi_info-a.out.gz>
>>>
>>
>>
>
> <config_1.4.3.log.gz><make_1.4.3.out.gz>