Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Segmentation fault in oob_tcp.c of openmpi-1.7.4a1r29646
From: tmishima_at_[hidden]
Date: 2013-11-13 19:06:02


Hi Ralph, here is some additional information.

Here is the main part of the output after adding "-mca rmaps_base_verbose 50".

[node08.cluster:26952] [[56581,0],0] plm:base:setup_vm
[node08.cluster:26952] [[56581,0],0] plm:base:setup_vm creating map
[node08.cluster:26952] [[56581,0],0] plm:base:setup_vm only HNP in allocation
[node08.cluster:26952] mca:rmaps: mapping job [56581,1]
[node08.cluster:26952] mca:rmaps: creating new map for job [56581,1]
[node08.cluster:26952] mca:rmaps:ppr: job [56581,1] not using ppr mapper
[node08.cluster:26952] [[56581,0],0] rmaps:seq mapping job [56581,1]
[node08.cluster:26952] mca:rmaps:seq: job [56581,1] not using seq mapper
[node08.cluster:26952] mca:rmaps:resilient: cannot perform initial map of job [56581,1] - no fault groups
[node08.cluster:26952] mca:rmaps:mindist: job [56581,1] not using mindist mapper
[node08.cluster:26952] mca:rmaps:rr: mapping job [56581,1]
[node08.cluster:26952] [[56581,0],0] Starting with 1 nodes in list
[node08.cluster:26952] [[56581,0],0] Filtering thru apps
[node08.cluster:26952] [[56581,0],0] Retained 1 nodes in list
[node08.cluster:26952] [[56581,0],0] Removing node node08 slots 0 inuse 0

From this result, I guess it is related to oversubscription.
So I added "-oversubscribe" and reran the job, and then it worked well, as shown below:

[node08.cluster:27019] [[56774,0],0] Starting with 1 nodes in list
[node08.cluster:27019] [[56774,0],0] Filtering thru apps
[node08.cluster:27019] [[56774,0],0] Retained 1 nodes in list
[node08.cluster:27019] AVAILABLE NODES FOR MAPPING:
[node08.cluster:27019] node: node08 daemon: 0
[node08.cluster:27019] [[56774,0],0] Starting bookmark at node node08
[node08.cluster:27019] [[56774,0],0] Starting at node node08
[node08.cluster:27019] mca:rmaps:rr: mapping by slot for job [56774,1] slots 1 num_procs 8
[node08.cluster:27019] mca:rmaps:rr:slot working node node08
[node08.cluster:27019] mca:rmaps:rr:slot node node08 is full - skipping
[node08.cluster:27019] mca:rmaps:rr:slot job [56774,1] is oversubscribed - performing second pass
[node08.cluster:27019] mca:rmaps:rr:slot working node node08
[node08.cluster:27019] mca:rmaps:rr:slot adding up to 8 procs to node node08
[node08.cluster:27019] mca:rmaps:base: computing vpids by slot for job [56774,1]
[node08.cluster:27019] mca:rmaps:base: assigning rank 0 to node node08
[node08.cluster:27019] mca:rmaps:base: assigning rank 1 to node node08
[node08.cluster:27019] mca:rmaps:base: assigning rank 2 to node node08
[node08.cluster:27019] mca:rmaps:base: assigning rank 3 to node node08
[node08.cluster:27019] mca:rmaps:base: assigning rank 4 to node node08
[node08.cluster:27019] mca:rmaps:base: assigning rank 5 to node node08
[node08.cluster:27019] mca:rmaps:base: assigning rank 6 to node node08
[node08.cluster:27019] mca:rmaps:base: assigning rank 7 to node node08

I think something is wrong with the treatment of oversubscription, which might be
related to "#3893: LAMA mapper has problems".

tmishima

> Hmmm...looks like we aren't getting your allocation. Can you rerun and
> add -mca ras_base_verbose 50?
>
> On Nov 12, 2013, at 11:30 PM, tmishima_at_[hidden] wrote:
>
> >
> >
> > Hi Ralph,
> >
> > Here is the output of "-mca plm_base_verbose 5".
> >
> > [node08.cluster:23573] mca:base:select:( plm) Querying component [rsh]
> > [node08.cluster:23573] [[INVALID],INVALID] plm:rsh_lookup on agent /usr/bin/rsh path NULL
> > [node08.cluster:23573] mca:base:select:( plm) Query of component [rsh] set priority to 10
> > [node08.cluster:23573] mca:base:select:( plm) Querying component [slurm]
> > [node08.cluster:23573] mca:base:select:( plm) Skipping component [slurm]. Query failed to return a module
> > [node08.cluster:23573] mca:base:select:( plm) Querying component [tm]
> > [node08.cluster:23573] mca:base:select:( plm) Query of component [tm] set priority to 75
> > [node08.cluster:23573] mca:base:select:( plm) Selected component [tm]
> > [node08.cluster:23573] plm:base:set_hnp_name: initial bias 23573 nodename hash 85176670
> > [node08.cluster:23573] plm:base:set_hnp_name: final jobfam 59480
> > [node08.cluster:23573] [[59480,0],0] plm:base:receive start comm
> > [node08.cluster:23573] [[59480,0],0] plm:base:setup_job
> > [node08.cluster:23573] [[59480,0],0] plm:base:setup_vm
> > [node08.cluster:23573] [[59480,0],0] plm:base:setup_vm creating map
> > [node08.cluster:23573] [[59480,0],0] plm:base:setup_vm only HNP in allocation
> > --------------------------------------------------------------------------
> > All nodes which are allocated for this job are already filled.
> > --------------------------------------------------------------------------
> >
> > Here, openmpi's configuration is as follows:
> >
> > ./configure \
> > --prefix=/home/mishima/opt/mpi/openmpi-1.7.4a1-pgi13.10 \
> > --with-tm \
> > --with-verbs \
> > --disable-ipv6 \
> > --disable-vt \
> > --enable-debug \
> > CC=pgcc CFLAGS="-tp k8-64e" \
> > CXX=pgCC CXXFLAGS="-tp k8-64e" \
> > F77=pgfortran FFLAGS="-tp k8-64e" \
> > FC=pgfortran FCFLAGS="-tp k8-64e"
> >
> >> Hi Ralph,
> >>
> >> Okay, I can help you. Please give me some time to report the output.
> >>
> >> Tetsuya Mishima
> >>
> >>> I can try, but I have no way of testing Torque any more - so all I can do
> >>> is a code review. If you can build --enable-debug and add -mca
> >>> plm_base_verbose 5 to your cmd line, I'd appreciate seeing the
> >>> output.
> >>>
> >>>
> >>> On Nov 12, 2013, at 9:58 PM, tmishima_at_[hidden] wrote:
> >>>
> >>>>
> >>>>
> >>>> Hi Ralph,
> >>>>
> >>>> Thank you for your quick response.
> >>>>
> >>>> I'd like to report one more regression in the Torque support of
> >>>> openmpi-1.7.4a1r29646, which might be related to "#3893: LAMA mapper
> >>>> has problems", which I reported a few days ago.
> >>>>
> >>>> The script below does not work with openmpi-1.7.4a1r29646,
> >>>> although it worked with openmpi-1.7.3 as I told you before.
> >>>>
> >>>> #!/bin/sh
> >>>> #PBS -l nodes=node08:ppn=8
> >>>> export OMP_NUM_THREADS=1
> >>>> cd $PBS_O_WORKDIR
> >>>> cp $PBS_NODEFILE pbs_hosts
> >>>> NPROCS=`wc -l < pbs_hosts`
> >>>> mpirun -machinefile pbs_hosts -np ${NPROCS} -report-bindings -bind-to core Myprog
> >>>>
> >>>> If I drop "-machinefile pbs_hosts -np ${NPROCS} ", then it works fine.
> >>>> Since this happens without a lama request, I guess the problem is not
> >>>> in lama itself. Anyway, please look into this issue as well.
> >>>>
> >>>> Regards,
> >>>> Tetsuya Mishima
> >>>>
> >>>>> Done - thanks!
> >>>>>
> >>>>> On Nov 12, 2013, at 7:35 PM, tmishima_at_[hidden] wrote:
> >>>>>
> >>>>>>
> >>>>>>
> >>>>>> Dear openmpi developers,
> >>>>>>
> >>>>>> I got a segmentation fault in a trial use of openmpi-1.7.4a1r29646
> >>>>>> built with PGI13.10, as shown below:
> >>>>>>
> >>>>>> [mishima_at_manage testbed-openmpi-1.7.3]$ mpirun -np 4 -cpus-per-proc 2 -report-bindings mPre
> >>>>>> [manage.cluster:23082] MCW rank 2 bound to socket 0[core 4[hwt 0]], socket 0[core 5[hwt 0]]: [././././B/B][./././././.]
> >>>>>> [manage.cluster:23082] MCW rank 3 bound to socket 1[core 6[hwt 0]], socket 1[core 7[hwt 0]]: [./././././.][B/B/./././.]
> >>>>>> [manage.cluster:23082] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]]: [B/B/./././.][./././././.]
> >>>>>> [manage.cluster:23082] MCW rank 1 bound to socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [././B/B/./.][./././././.]
> >>>>>> [manage:23082] *** Process received signal ***
> >>>>>> [manage:23082] Signal: Segmentation fault (11)
> >>>>>> [manage:23082] Signal code: Address not mapped (1)
> >>>>>> [manage:23082] Failing at address: 0x34
> >>>>>> [manage:23082] *** End of error message ***
> >>>>>> Segmentation fault (core dumped)
> >>>>>>
> >>>>>> [mishima_at_manage testbed-openmpi-1.7.3]$ gdb mpirun core.23082
> >>>>>> GNU gdb (GDB) CentOS (7.0.1-42.el5.centos.1)
> >>>>>> Copyright (C) 2009 Free Software Foundation, Inc.
> >>>>>> ...
> >>>>>> Core was generated by `mpirun -np 4 -cpus-per-proc 2 -report-bindings mPre'.
> >>>>>> Program terminated with signal 11, Segmentation fault.
> >>>>>> #0 0x00002b5f861c9c4f in recv_connect (mod=0x5f861ca20b00007f, sd=32767, hdr=0x1ca20b00007fff25) at ./oob_tcp.c:631
> >>>>>> 631 peer = OBJ_NEW(mca_oob_tcp_peer_t);
> >>>>>> (gdb) where
> >>>>>> #0 0x00002b5f861c9c4f in recv_connect (mod=0x5f861ca20b00007f, sd=32767, hdr=0x1ca20b00007fff25) at ./oob_tcp.c:631
> >>>>>> #1 0x00002b5f861ca20b in recv_handler (sd=1778385023, flags=32767, cbdata=0x8eb06a00007fff25) at ./oob_tcp.c:760
> >>>>>> #2 0x00002b5f848eb06a in event_process_active_single_queue (base=0x5f848eb27000007f, activeq=0x848eb27000007fff) at ./event.c:1366
> >>>>>> #3 0x00002b5f848eb270 in event_process_active (base=0x5f848eb84900007f) at ./event.c:1435
> >>>>>> #4 0x00002b5f848eb849 in opal_libevent2021_event_base_loop (base=0x4077a000007f, flags=32767) at ./event.c:1645
> >>>>>> #5 0x00000000004077a0 in orterun (argc=7, argv=0x7fff25bbd4a8) at ./orterun.c:1030
> >>>>>> #6 0x00000000004067fb in main (argc=7, argv=0x7fff25bbd4a8) at ./main.c:13
> >>>>>> (gdb) quit
> >>>>>>
> >>>>>>
> >>>>>> Line 627 in orte/mca/oob/tcp/oob_tcp.c is apparently unnecessary,
> >>>>>> and it causes the segfault.
> >>>>>>
> >>>>>> 624 /* lookup the corresponding process */
> >>>>>> 625 peer = mca_oob_tcp_peer_lookup(mod, &hdr->origin);
> >>>>>> 626 if (NULL == peer) {
> >>>>>> 627     ui64 = (uint64_t*)(&peer->name);
> >>>>>> 628     opal_output_verbose(OOB_TCP_DEBUG_CONNECT, orte_oob_base_framework.framework_output,
> >>>>>> 629                         "%s mca_oob_tcp_recv_connect: connection from new peer",
> >>>>>> 630                         ORTE_NAME_PRINT(ORTE_PROC_MY_NAME));
> >>>>>> 631     peer = OBJ_NEW(mca_oob_tcp_peer_t);
> >>>>>> 632     peer->mod = mod;
> >>>>>> 633     peer->name = hdr->origin;
> >>>>>> 634     peer->state = MCA_OOB_TCP_ACCEPTING;
> >>>>>> 635     ui64 = (uint64_t*)(&peer->name);
> >>>>>> 636     if (OPAL_SUCCESS != opal_hash_table_set_value_uint64(&mod->peers, (*ui64), peer)) {
> >>>>>> 637         OBJ_RELEASE(peer);
> >>>>>> 638         return;
> >>>>>> 639     }
> >>>>>>
> >>>>>>
> >>>>>> Please fix this mistake in the next release.
> >>>>>>
> >>>>>> Regards,
> >>>>>> Tetsuya Mishima
> >>>>>>
> >>>>>> _______________________________________________
> >>>>>> users mailing list
> >>>>>> users_at_[hidden]
> >>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users