Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Segmentation fault in oob_tcp.c of openmpi-1.7.4a1r29646
From: tmishima_at_[hidden]
Date: 2013-11-14 18:25:09


Hi Ralph,

I checked -cpus-per-proc in openmpi-1.7.4a1r29646.
It works just as I want: it adjusts the number of procs
on each node by dividing the slot count by the number of threads.

I think my problem is solved for now using -cpus-per-proc,
thank you very much.

Regarding the oversubscription problem, I confirmed that NPROCS was really 8
by printing out its value.

SCRIPT:
echo mpirun -machinefile pbs_hosts -np $NPROCS -report-bindings -bind-to core Myprog
mpirun -machinefile pbs_hosts -np $NPROCS -report-bindings -bind-to core Myprog

OUTPUT:
mpirun -machinefile pbs_hosts -np 8 -report-bindings -bind-to core Myprog
--------------------------------------------------------------------------
All nodes which are allocated for this job are already filled.
--------------------------------------------------------------------------

By the way, how did you verify the problem? It looks to me like you ran
the job directly from the command line:

[rhc_at_bend001 svn-trunk]$ mpirun -n 3 --bind-to core --cpus-per-proc 4 --report-bindings -hostfile hosts hostname

In my environment, such a direct run without a Torque script also works fine.
Anyway, as I already told you, my problem itself is solved, so I think the
priority of checking this is very low.

tmishima

> FWIW: I verified that this works fine under a slurm allocation of 2 nodes, each with 12 slots. I filled the node without getting an "oversubscribed" error message.
>
> [rhc_at_bend001 svn-trunk]$ mpirun -n 3 --bind-to core --cpus-per-proc 4 --report-bindings -hostfile hosts hostname
> [bend001:24318] MCW rank 0 bound to socket 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]]: [BB/BB/BB/BB/../..][../../../../../..]
> [bend001:24318] MCW rank 1 bound to socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 1[core 6[hwt 0-1]], socket 1[core 7[hwt 0-1]]: [../../../../BB/BB][BB/BB/../../../..]
> [bend001:24318] MCW rank 2 bound to socket 1[core 8[hwt 0-1]], socket 1[core 9[hwt 0-1]], socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt 0-1]]: [../../../../../..][../../BB/BB/BB/BB]
> bend001
> bend001
> bend001
>
> where
>
> [rhc_at_bend001 svn-trunk]$ cat hosts
> bend001 slots=12
>
> The only way I get the "out of resources" error is if I ask for more processes than I have slots - i.e., I give it the hosts file as shown, but ask for 13 or more processes.
>
>
> BTW: note one important issue with cpus-per-proc, as shown above. Because I specified 4 cpus/proc, and my sockets each have 6 cpus, one of my procs wound up being split across the two sockets (2 cores on each). That's about the worst situation you can have.
>
> So a word of caution: it is up to the user to ensure that the mapping is "good". We just do what you asked us to do.
>
>
> On Nov 13, 2013, at 8:30 PM, Ralph Castain <rhc_at_[hidden]> wrote:
>
> Guess I don't see why modifying the allocation is required - we have mapping options that should support such things. If you specify the total number of procs you want, and cpus-per-proc=4, it should do the same thing I would think. You'd get 2 procs on the 8 slot nodes, 8 on the 32 slot nodes, and up to 6 on the 64 slot nodes (since you specified np=16). So I guess I don't understand the issue.
>
> Regardless, if NPROCS=8 (and you verified that by printing it out, not just assuming wc -l got that value), then it shouldn't think it is oversubscribed. I'll take a look under a slurm allocation as that is all I can access.
>
>
> On Nov 13, 2013, at 7:23 PM, tmishima_at_[hidden] wrote:
>
>
>
> Our cluster consists of three types of nodes. They have 8, 32
> and 64 slots respectively. Since the performance of each core is
> almost the same, mixed use of these nodes is possible.
>
> Furthermore, in this case, for a hybrid application with openmpi+openmp,
> modification of the hostfile is necessary, as follows:
>
> #PBS -l nodes=1:ppn=32+4:ppn=8
> export OMP_NUM_THREADS=4
> modify $PBS_NODEFILE pbs_hosts # 64 lines are condensed to 16 lines
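> # (e.g., one way to condense: awk "NR % $OMP_NUM_THREADS == 1" $PBS_NODEFILE > pbs_hosts)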
> mpirun -hostfile pbs_hosts -np 16 -cpus-per-proc 4 -x OMP_NUM_THREADS Myprog
>
> That's why I want to do that.
>
> Of course I know that if I give up mixed use, -npernode is better for this
> purpose.
>
> (The script I showed you first is just a simplified one to clarify the problem.)
>
> tmishima
>
>
> Why do it the hard way? I'll look at the FAQ because that definitely isn't a recommended thing to do - better to use -host to specify the subset, or just specify the desired mapping using all the various mappers we provide.
>
> On Nov 13, 2013, at 6:39 PM, tmishima_at_[hidden] wrote:
>
>
>
> Sorry for the cross-post.
>
> The nodefile is very simple; it consists of 8 lines:
>
> node08
> node08
> node08
> node08
> node08
> node08
> node08
> node08
>
> Therefore, NPROCS=8
>
> My aim is to modify the allocation, as you pointed out. According to the
> Open MPI FAQ, a proper subset of the hosts allocated to the Torque / PBS
> Pro job should be allowed.
>
> tmishima
>
> Please - can you answer my question on script2? What is the value of NPROCS?
>
> Why would you want to do it this way? Are you planning to modify the allocation?? That generally is a bad idea as it can confuse the system.
>
>
> On Nov 13, 2013, at 5:55 PM, tmishima_at_[hidden] wrote:
>
>
>
> Since what I really want is to run script2 correctly, please let us
> concentrate on script2.
>
> I'm not an expert on the internals of openmpi; all I can do is observe
> from the outside. I suspect these lines are strange, especially the last
> one.
>
> [node08.cluster:26952] mca:rmaps:rr: mapping job [56581,1]
> [node08.cluster:26952] [[56581,0],0] Starting with 1 nodes in list
> [node08.cluster:26952] [[56581,0],0] Filtering thru apps
> [node08.cluster:26952] [[56581,0],0] Retained 1 nodes in list
> [node08.cluster:26952] [[56581,0],0] Removing node node08 slots 0 inuse 0
>
> These lines come from this part of orte_rmaps_base_get_target_nodes in rmaps_base_support_fns.c:
>
>     } else if (node->slots <= node->slots_inuse &&
>                (ORTE_MAPPING_NO_OVERSUBSCRIBE & ORTE_GET_MAPPING_DIRECTIVE(policy))) {
>         /* remove the node as fully used */
>         OPAL_OUTPUT_VERBOSE((5, orte_rmaps_base_framework.framework_output,
>                              "%s Removing node %s slots %d inuse %d",
>                              ORTE_NAME_PRINT(ORTE_PROC_MY_NAME),
>                              node->name, node->slots, node->slots_inuse));
>         opal_list_remove_item(allocated_nodes, item);
>         OBJ_RELEASE(item);  /* "un-retain" it */
>
> I wonder why node->slots and node->slots_inuse are 0, which I can read
> from the line "Removing node node08 slots 0 inuse 0" above.
>
> Or, I'm not sure, but should
> "else if (node->slots <= node->slots_inuse &&" be
> "else if (node->slots < node->slots_inuse &&" instead?
>
> tmishima
>
> On Nov 13, 2013, at 4:43 PM, tmishima_at_[hidden] wrote:
>
>
>
> Yes, node08 has 8 slots, but the number of processes I run is also 8.
>
> #PBS -l nodes=node08:ppn=8
>
> Therefore, I think it should allow this allocation. Is that right?
>
> Correct
>
>
> My question is why script1 works and script2 does not. They are almost the
> same.
>
> #PBS -l nodes=node08:ppn=8
> export OMP_NUM_THREADS=1
> cd $PBS_O_WORKDIR
> cp $PBS_NODEFILE pbs_hosts
> NPROCS=`wc -l < pbs_hosts`
>
> #SCRIPT1
> mpirun -report-bindings -bind-to core Myprog
>
> #SCRIPT2
> mpirun -machinefile pbs_hosts -np ${NPROCS} -report-bindings -bind-to core Myprog
>
> This version is not only reading the PBS allocation, but also invoking the hostfile filter on top of it. Different code path. I'll take a look - it should still match up assuming NPROCS=8. Any possibility that it is a different number? I don't recall, but aren't there some extra lines in the nodefile - e.g., comments?
>
>
>
> tmishima
>
> I guess here's my confusion. If you are using only one node, and that node has 8 allocated slots, then we will not allow you to run more than 8 processes on that node unless you specifically provide the --oversubscribe flag. This is because you are operating in a managed environment (in this case, under Torque), and so we treat the allocation as "mandatory" by default.
>
> I suspect that is the issue here, in which case the system is behaving as it should.
>
> Is the above accurate?
>
>
> On Nov 13, 2013, at 4:11 PM, Ralph Castain <rhc_at_[hidden]> wrote:
>
> It has nothing to do with LAMA as you aren't using that mapper.
>
> How many nodes are in this allocation?
>
> On Nov 13, 2013, at 4:06 PM, tmishima_at_[hidden] wrote:
>
>
>
> Hi Ralph, here is some additional information.
>
> Here is the main part of the output after adding "-mca rmaps_base_verbose 50":
>
> [node08.cluster:26952] [[56581,0],0] plm:base:setup_vm
> [node08.cluster:26952] [[56581,0],0] plm:base:setup_vm creating map
> [node08.cluster:26952] [[56581,0],0] plm:base:setup_vm only HNP in allocation
> [node08.cluster:26952] mca:rmaps: mapping job [56581,1]
> [node08.cluster:26952] mca:rmaps: creating new map for job [56581,1]
> [node08.cluster:26952] mca:rmaps:ppr: job [56581,1] not using ppr mapper
> [node08.cluster:26952] [[56581,0],0] rmaps:seq mapping job [56581,1]
> [node08.cluster:26952] mca:rmaps:seq: job [56581,1] not using seq mapper
> [node08.cluster:26952] mca:rmaps:resilient: cannot perform initial map of job [56581,1] - no fault groups
> [node08.cluster:26952] mca:rmaps:mindist: job [56581,1] not using mindist mapper
> [node08.cluster:26952] mca:rmaps:rr: mapping job [56581,1]
> [node08.cluster:26952] [[56581,0],0] Starting with 1 nodes in list
> [node08.cluster:26952] [[56581,0],0] Filtering thru apps
> [node08.cluster:26952] [[56581,0],0] Retained 1 nodes in list
> [node08.cluster:26952] [[56581,0],0] Removing node node08 slots 0 inuse 0
>
> From this result, I guessed it was related to oversubscription.
> So I added "-oversubscribe" and reran; then it worked well, as shown
> below:
>
> [node08.cluster:27019] [[56774,0],0] Starting with 1 nodes in list
> [node08.cluster:27019] [[56774,0],0] Filtering thru apps
> [node08.cluster:27019] [[56774,0],0] Retained 1 nodes in list
> [node08.cluster:27019] AVAILABLE NODES FOR MAPPING:
> [node08.cluster:27019]     node: node08 daemon: 0
> [node08.cluster:27019] [[56774,0],0] Starting bookmark at node node08
> [node08.cluster:27019] [[56774,0],0] Starting at node node08
> [node08.cluster:27019] mca:rmaps:rr: mapping by slot for job [56774,1] slots 1 num_procs 8
> [node08.cluster:27019] mca:rmaps:rr:slot working node node08
> [node08.cluster:27019] mca:rmaps:rr:slot node node08 is full - skipping
> [node08.cluster:27019] mca:rmaps:rr:slot job [56774,1] is oversubscribed - performing second pass
> [node08.cluster:27019] mca:rmaps:rr:slot working node node08
> [node08.cluster:27019] mca:rmaps:rr:slot adding up to 8 procs to node node08
> [node08.cluster:27019] mca:rmaps:base: computing vpids by slot for job [56774,1]
> [node08.cluster:27019] mca:rmaps:base: assigning rank 0 to node node08
> [node08.cluster:27019] mca:rmaps:base: assigning rank 1 to node node08
> [node08.cluster:27019] mca:rmaps:base: assigning rank 2 to node node08
> [node08.cluster:27019] mca:rmaps:base: assigning rank 3 to node node08
> [node08.cluster:27019] mca:rmaps:base: assigning rank 4 to node node08
> [node08.cluster:27019] mca:rmaps:base: assigning rank 5 to node node08
> [node08.cluster:27019] mca:rmaps:base: assigning rank 6 to node node08
> [node08.cluster:27019] mca:rmaps:base: assigning rank 7 to node node08
>
> I think something is wrong with the treatment of oversubscription, which
> might be related to "#3893: LAMA mapper has problems".
>
> tmishima
>
> Hmmm...looks like we aren't getting your allocation. Can you rerun and
> add -mca ras_base_verbose 50?
>
> On Nov 12, 2013, at 11:30 PM, tmishima_at_[hidden] wrote:
>
>
>
> Hi Ralph,
>
> Here is the output of "-mca plm_base_verbose 5".
>
> [node08.cluster:23573] mca:base:select:(  plm) Querying component [rsh]
> [node08.cluster:23573] [[INVALID],INVALID] plm:rsh_lookup on agent /usr/bin/rsh path NULL
> [node08.cluster:23573] mca:base:select:(  plm) Query of component [rsh] set priority to 10
> [node08.cluster:23573] mca:base:select:(  plm) Querying component [slurm]
> [node08.cluster:23573] mca:base:select:(  plm) Skipping component [slurm]. Query failed to return a module
> [node08.cluster:23573] mca:base:select:(  plm) Querying component [tm]
> [node08.cluster:23573] mca:base:select:(  plm) Query of component [tm] set priority to 75
> [node08.cluster:23573] mca:base:select:(  plm) Selected component [tm]
> [node08.cluster:23573] plm:base:set_hnp_name: initial bias 23573 nodename hash 85176670
> [node08.cluster:23573] plm:base:set_hnp_name: final jobfam 59480
> [node08.cluster:23573] [[59480,0],0] plm:base:receive start comm
> [node08.cluster:23573] [[59480,0],0] plm:base:setup_job
> [node08.cluster:23573] [[59480,0],0] plm:base:setup_vm
> [node08.cluster:23573] [[59480,0],0] plm:base:setup_vm creating map
> [node08.cluster:23573] [[59480,0],0] plm:base:setup_vm only HNP in allocation
>
> --------------------------------------------------------------------------
> All nodes which are allocated for this job are already filled.
> --------------------------------------------------------------------------
>
> Here, openmpi's configuration is as follows:
>
> ./configure \
> --prefix=/home/mishima/opt/mpi/openmpi-1.7.4a1-pgi13.10 \
> --with-tm \
> --with-verbs \
> --disable-ipv6 \
> --disable-vt \
> --enable-debug \
> CC=pgcc CFLAGS="-tp k8-64e" \
> CXX=pgCC CXXFLAGS="-tp k8-64e" \
> F77=pgfortran FFLAGS="-tp k8-64e" \
> FC=pgfortran FCFLAGS="-tp k8-64e"
>
> Hi Ralph,
>
> Okay, I can help you. Please give me some time to report the output.
>
> Tetsuya Mishima
>
> I can try, but I have no way of testing Torque any more - so all I can
> do is a code review. If you can build --enable-debug and add -mca
> plm_base_verbose 5 to your cmd line, I'd appreciate seeing the output.
>
>
> On Nov 12, 2013, at 9:58 PM, tmishima_at_[hidden] wrote:
>
>
>
> Hi Ralph,
>
> Thank you for your quick response.
>
> I'd like to report one more regression in the Torque support of
> openmpi-1.7.4a1r29646, which might be related to "#3893: LAMA mapper
> has problems", which I reported a few days ago.
>
> The script below does not work with openmpi-1.7.4a1r29646,
> although it worked with openmpi-1.7.3 as I told you before.
>
> #!/bin/sh
> #PBS -l nodes=node08:ppn=8
> export OMP_NUM_THREADS=1
> cd $PBS_O_WORKDIR
> cp $PBS_NODEFILE pbs_hosts
> NPROCS=`wc -l < pbs_hosts`
> mpirun -machinefile pbs_hosts -np ${NPROCS} -report-bindings -bind-to core Myprog
>
> If I drop "-machinefile pbs_hosts -np ${NPROCS}", then it works fine.
> Since this happens without requesting lama, I guess the problem is not
> in lama itself. Anyway, please look into this issue as well.
>
> Regards,
> Tetsuya Mishima
>
> Done - thanks!
>
> On Nov 12, 2013, at 7:35 PM, tmishima_at_[hidden] wrote:
>
>
>
> Dear openmpi developers,
>
> I got a segmentation fault in a trial run of openmpi-1.7.4a1r29646 built
> with PGI 13.10, as shown below:
>
> [mishima_at_manage testbed-openmpi-1.7.3]$ mpirun -np 4 -cpus-per-proc 2 -report-bindings mPre
> [manage.cluster:23082] MCW rank 2 bound to socket 0[core 4[hwt 0]], socket 0[core 5[hwt 0]]: [././././B/B][./././././.]
> [manage.cluster:23082] MCW rank 3 bound to socket 1[core 6[hwt 0]], socket 1[core 7[hwt 0]]: [./././././.][B/B/./././.]
> [manage.cluster:23082] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]]: [B/B/./././.][./././././.]
> [manage.cluster:23082] MCW rank 1 bound to socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [././B/B/./.][./././././.]
> [manage:23082] *** Process received signal ***
> [manage:23082] Signal: Segmentation fault (11)
> [manage:23082] Signal code: Address not mapped (1)
> [manage:23082] Failing at address: 0x34
> [manage:23082] *** End of error message ***
> Segmentation fault (core dumped)
>
> [mishima_at_manage testbed-openmpi-1.7.3]$ gdb mpirun core.23082
> GNU gdb (GDB) CentOS (7.0.1-42.el5.centos.1)
> Copyright (C) 2009 Free Software Foundation, Inc.
> ...
> Core was generated by `mpirun -np 4 -cpus-per-proc 2 -report-bindings mPre'.
> Program terminated with signal 11, Segmentation fault.
> #0  0x00002b5f861c9c4f in recv_connect (mod=0x5f861ca20b00007f, sd=32767, hdr=0x1ca20b00007fff25) at ./oob_tcp.c:631
> 631             peer = OBJ_NEW(mca_oob_tcp_peer_t);
> (gdb) where
> #0  0x00002b5f861c9c4f in recv_connect (mod=0x5f861ca20b00007f, sd=32767, hdr=0x1ca20b00007fff25) at ./oob_tcp.c:631
> #1  0x00002b5f861ca20b in recv_handler (sd=1778385023, flags=32767, cbdata=0x8eb06a00007fff25) at ./oob_tcp.c:760
> #2  0x00002b5f848eb06a in event_process_active_single_queue (base=0x5f848eb27000007f, activeq=0x848eb27000007fff) at ./event.c:1366
> #3  0x00002b5f848eb270 in event_process_active (base=0x5f848eb84900007f) at ./event.c:1435
> #4  0x00002b5f848eb849 in opal_libevent2021_event_base_loop (base=0x4077a000007f, flags=32767) at ./event.c:1645
> #5  0x00000000004077a0 in orterun (argc=7, argv=0x7fff25bbd4a8) at ./orterun.c:1030
> #6  0x00000000004067fb in main (argc=7, argv=0x7fff25bbd4a8) at ./main.c:13
> (gdb) quit
>
>
> Line 627 in orte/mca/oob/tcp/oob_tcp.c is apparently unnecessary, and it
> causes the segfault.
>
> 624      /* lookup the corresponding process */
> 625      peer = mca_oob_tcp_peer_lookup(mod, &hdr->origin);
> 626      if (NULL == peer) {
> 627          ui64 = (uint64_t*)(&peer->name);
> 628          opal_output_verbose(OOB_TCP_DEBUG_CONNECT, orte_oob_base_framework.framework_output,
> 629                              "%s mca_oob_tcp_recv_connect: connection from new peer",
> 630                              ORTE_NAME_PRINT(ORTE_PROC_MY_NAME));
> 631          peer = OBJ_NEW(mca_oob_tcp_peer_t);
> 632          peer->mod = mod;
> 633          peer->name = hdr->origin;
> 634          peer->state = MCA_OOB_TCP_ACCEPTING;
> 635          ui64 = (uint64_t*)(&peer->name);
> 636          if (OPAL_SUCCESS != opal_hash_table_set_value_uint64(&mod->peers, (*ui64), peer)) {
> 637              OBJ_RELEASE(peer);
> 638              return;
> 639          }
>
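> A sketch of the corrected block - not an official patch, just the stray
> assignment at line 627 dropped, since peer is still NULL at that point and
> &peer->name is therefore a near-NULL address:
>
>     /* lookup the corresponding process */
>     peer = mca_oob_tcp_peer_lookup(mod, &hdr->origin);
>     if (NULL == peer) {
>         /* the assignment to ui64 that was here used peer while it was
>          * still NULL - it is redundant with line 635 below, so drop it */
>         opal_output_verbose(OOB_TCP_DEBUG_CONNECT, orte_oob_base_framework.framework_output,
>                             "%s mca_oob_tcp_recv_connect: connection from new peer",
>                             ORTE_NAME_PRINT(ORTE_PROC_MY_NAME));
>         peer = OBJ_NEW(mca_oob_tcp_peer_t);   /* allocate the peer first */
>         peer->mod = mod;
>         peer->name = hdr->origin;
>         peer->state = MCA_OOB_TCP_ACCEPTING;
>         ui64 = (uint64_t*)(&peer->name);      /* the hash key is valid now */
>         if (OPAL_SUCCESS != opal_hash_table_set_value_uint64(&mod->peers, (*ui64), peer)) {
>             OBJ_RELEASE(peer);
>             return;
>         }
>     }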
>
> Please fix this mistake in the next release.
>
> Regards,
> Tetsuya Mishima
>