Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Segmentation fault in oob_tcp.c of openmpi-1.7.4a1r29646
From: Ralph Castain (rhc_at_[hidden])
Date: 2013-11-13 19:11:31


It has nothing to do with LAMA as you aren't using that mapper.

How many nodes are in this allocation?
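A quick way to check under Torque (assuming the same batch environment as the
script quoted below) is to count the unique host names in $PBS_NODEFILE:

  sort -u $PBS_NODEFILE | wc -l    # number of distinct nodes
  wc -l < $PBS_NODEFILE            # total slots in the allocation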

On Nov 13, 2013, at 4:06 PM, tmishima_at_[hidden] wrote:

>
>
> Hi Ralph, here is some additional information.
>
> Here is the main part of the output after adding "-mca rmaps_base_verbose 50".
>
> [node08.cluster:26952] [[56581,0],0] plm:base:setup_vm
> [node08.cluster:26952] [[56581,0],0] plm:base:setup_vm creating map
> [node08.cluster:26952] [[56581,0],0] plm:base:setup_vm only HNP in
> allocation
> [node08.cluster:26952] mca:rmaps: mapping job [56581,1]
> [node08.cluster:26952] mca:rmaps: creating new map for job [56581,1]
> [node08.cluster:26952] mca:rmaps:ppr: job [56581,1] not using ppr mapper
> [node08.cluster:26952] [[56581,0],0] rmaps:seq mapping job [56581,1]
> [node08.cluster:26952] mca:rmaps:seq: job [56581,1] not using seq mapper
> [node08.cluster:26952] mca:rmaps:resilient: cannot perform initial map of
> job [56581,1] - no fault groups
> [node08.cluster:26952] mca:rmaps:mindist: job [56581,1] not using mindist
> mapper
> [node08.cluster:26952] mca:rmaps:rr: mapping job [56581,1]
> [node08.cluster:26952] [[56581,0],0] Starting with 1 nodes in list
> [node08.cluster:26952] [[56581,0],0] Filtering thru apps
> [node08.cluster:26952] [[56581,0],0] Retained 1 nodes in list
> [node08.cluster:26952] [[56581,0],0] Removing node node08 slots 0 inuse 0
>
> From this result, I guess it's related to oversubscription.
> So I added "-oversubscribe" and reran, and then it worked well, as shown below:
>
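> For reference, a minimal sketch of the rerun command (the flags other than
> "-oversubscribe" come from the batch script quoted later in this thread;
> the exact combination is an assumption):
>
>   mpirun -oversubscribe -machinefile pbs_hosts -np ${NPROCS} \
>          -report-bindings -bind-to core Myprog
>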
> [node08.cluster:27019] [[56774,0],0] Starting with 1 nodes in list
> [node08.cluster:27019] [[56774,0],0] Filtering thru apps
> [node08.cluster:27019] [[56774,0],0] Retained 1 nodes in list
> [node08.cluster:27019] AVAILABLE NODES FOR MAPPING:
> [node08.cluster:27019] node: node08 daemon: 0
> [node08.cluster:27019] [[56774,0],0] Starting bookmark at node node08
> [node08.cluster:27019] [[56774,0],0] Starting at node node08
> [node08.cluster:27019] mca:rmaps:rr: mapping by slot for job [56774,1]
> slots 1 num_procs 8
> [node08.cluster:27019] mca:rmaps:rr:slot working node node08
> [node08.cluster:27019] mca:rmaps:rr:slot node node08 is full - skipping
> [node08.cluster:27019] mca:rmaps:rr:slot job [56774,1] is oversubscribed -
> performing second pass
> [node08.cluster:27019] mca:rmaps:rr:slot working node node08
> [node08.cluster:27019] mca:rmaps:rr:slot adding up to 8 procs to node
> node08
> [node08.cluster:27019] mca:rmaps:base: computing vpids by slot for job
> [56774,1]
> [node08.cluster:27019] mca:rmaps:base: assigning rank 0 to node node08
> [node08.cluster:27019] mca:rmaps:base: assigning rank 1 to node node08
> [node08.cluster:27019] mca:rmaps:base: assigning rank 2 to node node08
> [node08.cluster:27019] mca:rmaps:base: assigning rank 3 to node node08
> [node08.cluster:27019] mca:rmaps:base: assigning rank 4 to node node08
> [node08.cluster:27019] mca:rmaps:base: assigning rank 5 to node node08
> [node08.cluster:27019] mca:rmaps:base: assigning rank 6 to node node08
> [node08.cluster:27019] mca:rmaps:base: assigning rank 7 to node node08
>
> I think something is wrong with the treatment of oversubscription, which
> might be related to "#3893: LAMA mapper has problems".
>
> tmishima
>
>> Hmmm...looks like we aren't getting your allocation. Can you rerun and
>> add -mca ras_base_verbose 50?
>>
>> On Nov 12, 2013, at 11:30 PM, tmishima_at_[hidden] wrote:
>>
>>>
>>>
>>> Hi Ralph,
>>>
>>> Here is the output of "-mca plm_base_verbose 5".
>>>
>>> [node08.cluster:23573] mca:base:select:( plm) Querying component [rsh]
>>> [node08.cluster:23573] [[INVALID],INVALID] plm:rsh_lookup on
>>> agent /usr/bin/rsh path NULL
>>> [node08.cluster:23573] mca:base:select:( plm) Query of component [rsh] set
>>> priority to 10
>>> [node08.cluster:23573] mca:base:select:( plm) Querying component [slurm]
>>> [node08.cluster:23573] mca:base:select:( plm) Skipping component [slurm].
>>> Query failed to return a module
>>> [node08.cluster:23573] mca:base:select:( plm) Querying component [tm]
>>> [node08.cluster:23573] mca:base:select:( plm) Query of component [tm] set
>>> priority to 75
>>> [node08.cluster:23573] mca:base:select:( plm) Selected component [tm]
>>> [node08.cluster:23573] plm:base:set_hnp_name: initial bias 23573 nodename
>>> hash 85176670
>>> [node08.cluster:23573] plm:base:set_hnp_name: final jobfam 59480
>>> [node08.cluster:23573] [[59480,0],0] plm:base:receive start comm
>>> [node08.cluster:23573] [[59480,0],0] plm:base:setup_job
>>> [node08.cluster:23573] [[59480,0],0] plm:base:setup_vm
>>> [node08.cluster:23573] [[59480,0],0] plm:base:setup_vm creating map
>>> [node08.cluster:23573] [[59480,0],0] plm:base:setup_vm only HNP in
>>> allocation
>>>
>>> --------------------------------------------------------------------------
>>> All nodes which are allocated for this job are already filled.
>>> --------------------------------------------------------------------------
>>>
>>> Here, openmpi's configuration is as follows:
>>>
>>> ./configure \
>>> --prefix=/home/mishima/opt/mpi/openmpi-1.7.4a1-pgi13.10 \
>>> --with-tm \
>>> --with-verbs \
>>> --disable-ipv6 \
>>> --disable-vt \
>>> --enable-debug \
>>> CC=pgcc CFLAGS="-tp k8-64e" \
>>> CXX=pgCC CXXFLAGS="-tp k8-64e" \
>>> F77=pgfortran FFLAGS="-tp k8-64e" \
>>> FC=pgfortran FCFLAGS="-tp k8-64e"
>>>
>>>> Hi Ralph,
>>>>
>>>> Okay, I can help you. Please give me some time to report the output.
>>>>
>>>> Tetsuya Mishima
>>>>
>>>>> I can try, but I have no way of testing Torque any more, so all I can
>>>>> do is a code review. If you can build with --enable-debug and add -mca
>>>>> plm_base_verbose 5 to your command line, I'd appreciate seeing the
>>>>> output.
>>>>>
>>>>>
>>>>> On Nov 12, 2013, at 9:58 PM, tmishima_at_[hidden] wrote:
>>>>>
>>>>>>
>>>>>>
>>>>>> Hi Ralph,
>>>>>>
>>>>>> Thank you for your quick response.
>>>>>>
>>>>>> I'd like to report one more regression in the Torque support of
>>>>>> openmpi-1.7.4a1r29646, which might be related to the "#3893: LAMA
>>>>>> mapper has problems" issue I reported a few days ago.
>>>>>>
>>>>>> The script below does not work with openmpi-1.7.4a1r29646,
>>>>>> although it worked with openmpi-1.7.3 as I told you before.
>>>>>>
>>>>>> #!/bin/sh
>>>>>> #PBS -l nodes=node08:ppn=8
>>>>>> export OMP_NUM_THREADS=1
>>>>>> cd $PBS_O_WORKDIR
>>>>>> cp $PBS_NODEFILE pbs_hosts
>>>>>> NPROCS=`wc -l < pbs_hosts`
>>>>>> mpirun -machinefile pbs_hosts -np ${NPROCS} -report-bindings -bind-to core Myprog
>>>>>>
>>>>>> If I drop "-machinefile pbs_hosts -np ${NPROCS}", then it works fine.
>>>>>> Since this happens without a lama request, I guess it's not a problem
>>>>>> in lama itself. Anyway, please look into this issue as well.
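>>>>>>
>>>>>> For reference, a minimal sketch of the working variant (the exact line
>>>>>> is a guess; under Torque, mpirun should pick up the allocation and the
>>>>>> slot count on its own when -machinefile and -np are omitted):
>>>>>>
>>>>>>   mpirun -report-bindings -bind-to core Myprog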
>>>>>>
>>>>>> Regards,
>>>>>> Tetsuya Mishima
>>>>>>
>>>>>>> Done - thanks!
>>>>>>>
>>>>>>> On Nov 12, 2013, at 7:35 PM, tmishima_at_[hidden] wrote:
>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Dear openmpi developers,
>>>>>>>>
>>>>>>>> I got a segmentation fault in a trial run of openmpi-1.7.4a1r29646
>>>>>>>> built with PGI 13.10, as shown below:
>>>>>>>>
>>>>>>>> [mishima_at_manage testbed-openmpi-1.7.3]$ mpirun -np 4 -cpus-per-proc 2
>>>>>>>> -report-bindings mPre
>>>>>>>> [manage.cluster:23082] MCW rank 2 bound to socket 0[core 4[hwt 0]],
>>>>>>>> socket 0[core 5[hwt 0]]: [././././B/B][./././././.]
>>>>>>>> [manage.cluster:23082] MCW rank 3 bound to socket 1[core 6[hwt 0]],
>>>>>>>> socket 1[core 7[hwt 0]]: [./././././.][B/B/./././.]
>>>>>>>> [manage.cluster:23082] MCW rank 0 bound to socket 0[core 0[hwt 0]],
>>>>>>>> socket 0[core 1[hwt 0]]: [B/B/./././.][./././././.]
>>>>>>>> [manage.cluster:23082] MCW rank 1 bound to socket 0[core 2[hwt 0]],
>>>>>>>> socket 0[core 3[hwt 0]]: [././B/B/./.][./././././.]
>>>>>>>> [manage:23082] *** Process received signal ***
>>>>>>>> [manage:23082] Signal: Segmentation fault (11)
>>>>>>>> [manage:23082] Signal code: Address not mapped (1)
>>>>>>>> [manage:23082] Failing at address: 0x34
>>>>>>>> [manage:23082] *** End of error message ***
>>>>>>>> Segmentation fault (core dumped)
>>>>>>>>
>>>>>>>> [mishima_at_manage testbed-openmpi-1.7.3]$ gdb mpirun core.23082
>>>>>>>> GNU gdb (GDB) CentOS (7.0.1-42.el5.centos.1)
>>>>>>>> Copyright (C) 2009 Free Software Foundation, Inc.
>>>>>>>> ...
>>>>>>>> Core was generated by `mpirun -np 4 -cpus-per-proc 2 -report-bindings mPre'.
>>>>>>>> Program terminated with signal 11, Segmentation fault.
>>>>>>>> #0  0x00002b5f861c9c4f in recv_connect (mod=0x5f861ca20b00007f, sd=32767,
>>>>>>>>     hdr=0x1ca20b00007fff25) at ./oob_tcp.c:631
>>>>>>>> 631 peer = OBJ_NEW(mca_oob_tcp_peer_t);
>>>>>>>> (gdb) where
>>>>>>>> #0  0x00002b5f861c9c4f in recv_connect (mod=0x5f861ca20b00007f, sd=32767,
>>>>>>>>     hdr=0x1ca20b00007fff25) at ./oob_tcp.c:631
>>>>>>>> #1  0x00002b5f861ca20b in recv_handler (sd=1778385023, flags=32767,
>>>>>>>>     cbdata=0x8eb06a00007fff25) at ./oob_tcp.c:760
>>>>>>>> #2  0x00002b5f848eb06a in event_process_active_single_queue
>>>>>>>>     (base=0x5f848eb27000007f, activeq=0x848eb27000007fff) at ./event.c:1366
>>>>>>>> #3  0x00002b5f848eb270 in event_process_active (base=0x5f848eb84900007f)
>>>>>>>>     at ./event.c:1435
>>>>>>>> #4  0x00002b5f848eb849 in opal_libevent2021_event_base_loop
>>>>>>>>     (base=0x4077a000007f, flags=32767) at ./event.c:1645
>>>>>>>> #5  0x00000000004077a0 in orterun (argc=7, argv=0x7fff25bbd4a8)
>>>>>>>>     at ./orterun.c:1030
>>>>>>>> #6  0x00000000004067fb in main (argc=7, argv=0x7fff25bbd4a8)
>>>>>>>>     at ./main.c:13
>>>>>>>> (gdb) quit
>>>>>>>>
>>>>>>>>
>>>>>>>> Line 627 in orte/mca/oob/tcp/oob_tcp.c is apparently unnecessary: it
>>>>>>>> touches "peer" inside the branch where "peer" is known to be NULL,
>>>>>>>> and it causes the segfault.
>>>>>>>>
>>>>>>>> 624         /* lookup the corresponding process */
>>>>>>>> 625         peer = mca_oob_tcp_peer_lookup(mod, &hdr->origin);
>>>>>>>> 626         if (NULL == peer) {
>>>>>>>> 627             ui64 = (uint64_t*)(&peer->name);
>>>>>>>> 628             opal_output_verbose(OOB_TCP_DEBUG_CONNECT, orte_oob_base_framework.framework_output,
>>>>>>>> 629                                 "%s mca_oob_tcp_recv_connect: connection from new peer",
>>>>>>>> 630                                 ORTE_NAME_PRINT(ORTE_PROC_MY_NAME));
>>>>>>>> 631             peer = OBJ_NEW(mca_oob_tcp_peer_t);
>>>>>>>> 632             peer->mod = mod;
>>>>>>>> 633             peer->name = hdr->origin;
>>>>>>>> 634             peer->state = MCA_OOB_TCP_ACCEPTING;
>>>>>>>> 635             ui64 = (uint64_t*)(&peer->name);
>>>>>>>> 636             if (OPAL_SUCCESS != opal_hash_table_set_value_uint64(&mod->peers, (*ui64), peer)) {
>>>>>>>> 637                 OBJ_RELEASE(peer);
>>>>>>>> 638                 return;
>>>>>>>> 639             }
>>>>>>>>
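>>>>>>>> A minimal sketch of the same block with the stray line 627 removed
>>>>>>>> (illustration only, based on the snippet above, not a tested patch):
>>>>>>>>
>>>>>>>>         /* lookup the corresponding process */
>>>>>>>>         peer = mca_oob_tcp_peer_lookup(mod, &hdr->origin);
>>>>>>>>         if (NULL == peer) {
>>>>>>>>             /* peer is NULL here, so it must not be touched until it
>>>>>>>>              * has been allocated below */
>>>>>>>>             opal_output_verbose(OOB_TCP_DEBUG_CONNECT,
>>>>>>>>                                 orte_oob_base_framework.framework_output,
>>>>>>>>                                 "%s mca_oob_tcp_recv_connect: connection from new peer",
>>>>>>>>                                 ORTE_NAME_PRINT(ORTE_PROC_MY_NAME));
>>>>>>>>             peer = OBJ_NEW(mca_oob_tcp_peer_t);
>>>>>>>>             peer->mod = mod;
>>>>>>>>             peer->name = hdr->origin;
>>>>>>>>             peer->state = MCA_OOB_TCP_ACCEPTING;
>>>>>>>>             ui64 = (uint64_t*)(&peer->name);
>>>>>>>>             if (OPAL_SUCCESS != opal_hash_table_set_value_uint64(&mod->peers,
>>>>>>>>                                                                  (*ui64), peer)) {
>>>>>>>>                 OBJ_RELEASE(peer);
>>>>>>>>                 return;
>>>>>>>>             }
>>>>>>>>             /* ... remainder of the block unchanged ... */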
>>>>>>>>
>>>>>>>> Please fix this mistake in the next release.
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Tetsuya Mishima
>>>>>>>>