Open MPI User's Mailing List Archives

From: Galen M. Shipman (gshipman_at_[hidden])
Date: 2006-02-02 23:49:01


Hi Jean,

I suspect the problem may be in the bcast,
ompi_coll_tuned_bcast_intra_basic_linear. Can you try the same run using:

mpirun -prefix /opt/ompi -wdir `pwd` -machinefile /root/machines \
    -np 2 -mca coll self,basic -d xterm -e gdb PMB-MPI1

This will use the basic collectives and may help isolate the problem.
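
If it is easier than re-running the full benchmark, a stripped-down test along
the following lines should exercise the same MPI_Comm_split path that both of
your backtraces are sitting in. This is just a sketch I put together, not code
from PMB; compile it with mpicc and launch it with the same mpirun line,
substituting the test binary for PMB-MPI1.

/* comm_split_test.c -- minimal sketch, not taken from the PMB sources.
 * MPI_Comm_split has to agree on a new context id, which appears to be
 * where both backtraces are stuck (allreduce/bcast inside
 * ompi_comm_nextcid / ompi_comm_split), so this alone may reproduce
 * the hang if the collectives are at fault. */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size, newrank;
    MPI_Comm newcomm;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Split the world into two groups; the split itself triggers the
     * internal collective traffic seen in the backtraces. */
    MPI_Comm_split(MPI_COMM_WORLD, rank % 2, rank, &newcomm);
    MPI_Comm_rank(newcomm, &newrank);
    printf("world rank %d of %d -> new rank %d\n", rank, size, newrank);

    MPI_Comm_free(&newcomm);
    MPI_Finalize();
    return 0;
}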

Thanks,

Galen

On Feb 2, 2006, at 7:04 PM, Jean-Christophe Hugly wrote:

> On Thu, 2006-02-02 at 15:19 -0700, Galen M. Shipman wrote:
>
>> Is it possible for you to get a stack trace where this is hanging?
>>
>> You might try:
>>
>>
>> mpirun -prefix /opt/ompi -wdir `pwd` -machinefile /root/machines -np
>> 2 -d xterm -e gdb PMB-MPI1
>>
>>
>
> I did that, and when it was hanging I control-C'd in each gdb and asked
> for a bt.
>
> Here's the debug output from the mpirun command:
> ======================================================================
>
> mpirun -prefix /opt/ompi -wdir `pwd` -machinefile /root/machines -np 2
> -d xterm -e gdb PMB-MPI1
> [bench1:16017] procdir: (null)
> [bench1:16017] jobdir: (null)
> [bench1:16017] unidir: /tmp/openmpi-sessions-root_at_bench1_0/default-universe
> [bench1:16017] top: openmpi-sessions-root_at_bench1_0
> [bench1:16017] tmp: /tmp
> [bench1:16017] connect_uni: contact info read
> [bench1:16017] connect_uni: connection not allowed
> [bench1:16017] [0,0,0] setting up session dir with
> [bench1:16017] tmpdir /tmp
> [bench1:16017] universe default-universe-16017
> [bench1:16017] user root
> [bench1:16017] host bench1
> [bench1:16017] jobid 0
> [bench1:16017] procid 0
> [bench1:16017] procdir: /tmp/openmpi-sessions-root_at_bench1_0/default-universe-16017/0/0
> [bench1:16017] jobdir: /tmp/openmpi-sessions-root_at_bench1_0/default-universe-16017/0
> [bench1:16017] unidir: /tmp/openmpi-sessions-root_at_bench1_0/default-universe-16017
> [bench1:16017] top: openmpi-sessions-root_at_bench1_0
> [bench1:16017] tmp: /tmp
> [bench1:16017] [0,0,0] contact_file /tmp/openmpi-sessions-root_at_bench1_0/default-universe-16017/universe-setup.txt
> [bench1:16017] [0,0,0] wrote setup file
> [bench1:16017] spawn: in job_state_callback(jobid = 1, state = 0x1)
> [bench1:16017] pls:rsh: local csh: 0, local bash: 1
> [bench1:16017] pls:rsh: assuming same remote shell as local shell
> [bench1:16017] pls:rsh: remote csh: 0, remote bash: 1
> [bench1:16017] pls:rsh: final template argv:
> [bench1:16017] pls:rsh: /usr/bin/ssh -X <template> orted --debug
> --bootproxy 1 --name <template> --num_procs 3 --vpid_start 0 --nodename
> <template> --universe root_at_bench1:default-universe-16017 --nsreplica
> "0.0.0;tcp://10.1.40.61:32793;tcp://10.2.40.61:32793" --gprreplica
> "0.0.0;tcp://10.1.40.61:32793;tcp://10.2.40.61:32793" --mpi-call-yield 0
> [bench1:16017] pls:rsh: launching on node bench2
> [bench1:16017] pls:rsh: not oversubscribed -- setting
> mpi_yield_when_idle to 0
> [bench1:16017] pls:rsh: bench2 is a REMOTE node
> [bench1:16017] pls:rsh: executing: /usr/bin/ssh -X bench2
> PATH=/opt/ompi/bin:$PATH ; export PATH ;
> LD_LIBRARY_PATH=/opt/ompi/lib:$LD_LIBRARY_PATH ; export LD_LIBRARY_PATH ;
> /opt/ompi/bin/orted --debug --bootproxy 1 --name 0.0.1 --num_procs 3
> --vpid_start 0 --nodename bench2 --universe root_at_bench1:default-universe-16017
> --nsreplica "0.0.0;tcp://10.1.40.61:32793;tcp://10.2.40.61:32793"
> --gprreplica "0.0.0;tcp://10.1.40.61:32793;tcp://10.2.40.61:32793"
> --mpi-call-yield 0
> [bench2:16980] [0,0,1] setting up session dir with
> [bench2:16980] universe default-universe-16017
> [bench2:16980] user root
> [bench2:16980] host bench2
> [bench2:16980] jobid 0
> [bench2:16980] procid 1
> [bench2:16980] procdir: /tmp/openmpi-sessions-root_at_bench2_0/default-universe-16017/0/1
> [bench2:16980] jobdir: /tmp/openmpi-sessions-root_at_bench2_0/default-universe-16017/0
> [bench2:16980] unidir: /tmp/openmpi-sessions-root_at_bench2_0/default-universe-16017
> [bench2:16980] top: openmpi-sessions-root_at_bench2_0
> [bench2:16980] tmp: /tmp
> [bench1:16017] pls:rsh: launching on node bench1
> [bench1:16017] pls:rsh: not oversubscribed -- setting
> mpi_yield_when_idle to 0
> [bench1:16017] pls:rsh: bench1 is a LOCAL node
> [bench1:16017] pls:rsh: reset PATH: /opt/ompi/bin:/sbin:/usr/sbin:/usr/local/sbin:/opt/gnome/sbin:/usr/local/bin:/usr/bin:/usr/X11R6/bin:/bin:/usr/games:/opt/gnome/bin:/usr/lib/mit/bin:/usr/lib/mit/sbin:/opt/ompi/bin
> [bench1:16017] pls:rsh: reset LD_LIBRARY_PATH: /opt/ompi/lib
> [bench1:16017] pls:rsh: executing: orted --debug --bootproxy 1 --name
> 0.0.2 --num_procs 3 --vpid_start 0 --nodename bench1 --universe
> root_at_bench1:default-universe-16017 --nsreplica
> "0.0.0;tcp://10.1.40.61:32793;tcp://10.2.40.61:32793" --gprreplica
> "0.0.0;tcp://10.1.40.61:32793;tcp://10.2.40.61:32793" --mpi-call-
> yield 0
> [bench1:16021] [0,0,2] setting up session dir with
> [bench1:16021] universe default-universe-16017
> [bench1:16021] user root
> [bench1:16021] host bench1
> [bench1:16021] jobid 0
> [bench1:16021] procid 2
> [bench1:16021] procdir: /tmp/openmpi-sessions-root_at_bench1_0/default-universe-16017/0/2
> [bench1:16021] jobdir: /tmp/openmpi-sessions-root_at_bench1_0/default-universe-16017/0
> [bench1:16021] unidir: /tmp/openmpi-sessions-root_at_bench1_0/default-universe-16017
> [bench1:16021] top: openmpi-sessions-root_at_bench1_0
> [bench1:16021] tmp: /tmp
> Warning: translation table syntax error: Unknown keysym name: DRemove
> Warning: ... found while parsing '<Key>DRemove: ignore()'
> Warning: String to TranslationTable conversion encountered errors
> Warning: translation table syntax error: Unknown keysym name: DRemove
> Warning: ... found while parsing '<Key>DRemove: ignore()'
> Warning: String to TranslationTable conversion encountered errors
> [bench1:16017] spawn: in job_state_callback(jobid = 1, state = 0x3)
> [bench1:16017] Info: Setting up debugger process table for applications
> MPIR_being_debugged = 0
> MPIR_debug_gate = 0
> MPIR_debug_state = 1
> MPIR_acquired_pre_main = 0
> MPIR_i_am_starter = 0
> MPIR_proctable_size = 2
> MPIR_proctable:
> (i, host, exe, pid) = (0, bench1, /usr/bin/xterm, 16025)
> (i, host, exe, pid) = (1, bench2, /usr/bin/xterm, 16984)
> [bench1:16017] spawn: in job_state_callback(jobid = 1, state = 0x4)
>
>
> Here's the output in one xterm:
> ======================================================================
> GNU gdb 6.3
> Copyright 2004 Free Software Foundation, Inc.
> GDB is free software, covered by the GNU General Public License, and you are
> welcome to change it and/or distribute copies of it under certain conditions.
> Type "show copying" to see the conditions.
> There is absolutely no warranty for GDB. Type "show warranty" for details.
> This GDB was configured as "x86_64-suse-linux"...Using host libthread_db
> library "/lib64/tls/libthread_db.so.1".
>
> (gdb) run
> Starting program: /root/SRC_PMB/PMB-MPI1
> [Thread debugging using libthread_db enabled]
> [New Thread 46912509890080 (LWP 16984)]
> [bench2:16984] [0,1,0] setting up session dir with
> [bench2:16984] universe default-universe-16017
> [bench2:16984] user root
> [bench2:16984] host bench2
> [bench2:16984] jobid 1
> [bench2:16984] procid 0
> [bench2:16984] procdir: /tmp/openmpi-sessions-root_at_bench2_0/default-universe-16017/1/0
> [bench2:16984] jobdir: /tmp/openmpi-sessions-root_at_bench2_0/default-universe-16017/1
> [bench2:16984] unidir: /tmp/openmpi-sessions-root_at_bench2_0/default-universe-16017
> [bench2:16984] top: openmpi-sessions-root_at_bench2_0
> [bench2:16984] tmp: /tmp
> [bench2:16984] [0,1,0] ompi_mpi_init completed
> #---------------------------------------------------
> # PALLAS MPI Benchmark Suite V2.2, MPI-1 part
> #---------------------------------------------------
> # Date : Thu Feb 2 09:51:32 2006
> # Machine : x86_64
> # System : Linux
> # Release : 2.6.13-15-smp
> # Version : #7 SMP Mon Jan 30 12:05:45 PST 2006
>
> #
> # Minimum message length in bytes: 0
> # Maximum message length in bytes: 4194304
> #
> # MPI_Datatype : MPI_BYTE
> # MPI_Datatype for reductions : MPI_FLOAT
> # MPI_Op : MPI_SUM
> #
> #
>
> # List of Benchmarks to run:
>
> # PingPong
> # PingPing
> # Sendrecv
> # Exchange
> # Allreduce
> # Reduce
> # Reduce_scatter
> # Allgather
> # Allgatherv
> # Alltoall
> # Bcast
> # Barrier
>
> Program received signal SIGINT, Interrupt.
> [Switching to Thread 46912509890080 (LWP 16984)]
> mthca_poll_cq (ibcq=0x718820, ne=1, wc=0x7fffffd071a0) at cq.c:469
> 469 cq.c: No such file or directory.
> in cq.c
> (gdb) inf st
> #0 mthca_poll_cq (ibcq=0x718820, ne=1, wc=0x7fffffd071a0) at cq.c:469
> #1 0x00002aaaadef1a85 in mca_btl_openib_component_progress ()
> from /opt/ompi/lib/openmpi/mca_btl_openib.so
> #2 0x00002aaaadde8f62 in mca_bml_r2_progress ()
> from /opt/ompi/lib/openmpi/mca_bml_r2.so
> #3 0x00002aaaaaec79c0 in opal_progress ()
> from /opt/ompi/lib/libopal.so.0
> #4 0x00002aaaadac7255 in mca_pml_ob1_recv ()
> from /opt/ompi/lib/openmpi/mca_pml_ob1.so
> #5 0x00002aaaaea434c2 in ompi_coll_tuned_reduce_intra_basic_linear ()
> from /opt/ompi/lib/openmpi/mca_coll_tuned.so
> #6 0x00002aaaaea405e6 in ompi_coll_tuned_allreduce_intra_nonoverlapping ()
> from /opt/ompi/lib/openmpi/mca_coll_tuned.so
> #7 0x00002aaaaac06b17 in ompi_comm_nextcid ()
> from /opt/ompi/lib/libmpi.so.0
> #8 0x00002aaaaac0513b in ompi_comm_split ()
> from /opt/ompi/lib/libmpi.so.0
> #9 0x00002aaaaac2bcd8 in PMPI_Comm_split ()
> from /opt/ompi/lib/libmpi.so.0
> #10 0x0000000000403b81 in Set_Communicator ()
> #11 0x000000000040385e in Init_Communicator ()
> #12 0x0000000000402e06 in main ()
> (gdb)
>
> Here's the output in the other xterm:
> ======================================================================
> GNU gdb 6.3
> Copyright 2004 Free Software Foundation, Inc.
> GDB is free software, covered by the GNU General Public License, and you are
> welcome to change it and/or distribute copies of it under certain conditions.
> Type "show copying" to see the conditions.
> There is absolutely no warranty for GDB. Type "show warranty" for details.
> This GDB was configured as "x86_64-suse-linux"...Using host libthread_db
> library "/lib64/tls/libthread_db.so.1".
>
> (gdb) run
> Starting program: /root/SRC_PMB/PMB-MPI1
> [Thread debugging using libthread_db enabled]
> [New Thread 46912509889280 (LWP 16025)]
> [bench1:16025] [0,1,1] setting up session dir with
> [bench1:16025] universe default-universe-16017
> [bench1:16025] user root
> [bench1:16025] host bench1
> [bench1:16025] jobid 1
> [bench1:16025] procid 1
> [bench1:16025] procdir: /tmp/openmpi-sessions-root_at_bench1_0/default-universe-16017/1/1
> [bench1:16025] jobdir: /tmp/openmpi-sessions-root_at_bench1_0/default-universe-16017/1
> [bench1:16025] unidir: /tmp/openmpi-sessions-root_at_bench1_0/default-universe-16017
> [bench1:16025] top: openmpi-sessions-root_at_bench1_0
> [bench1:16025] tmp: /tmp
> [bench1:16025] [0,1,1] ompi_mpi_init completed
>
> Program received signal SIGINT, Interrupt.
> [Switching to Thread 46912509889280 (LWP 16025)]
> 0x00002aaaab493fc5 in pthread_spin_lock ()
> from /lib64/tls/libpthread.so.0
> (gdb) inf st
> #0 0x00002aaaab493fc5 in pthread_spin_lock ()
> from /lib64/tls/libpthread.so.0
> #1 0x00002aaaaeb50d3e in mthca_poll_cq (ibcq=0x8b4990, ne=1,
> wc=0x7fffffbe0290) at cq.c:454
> #2 0x00002aaaaddf0ce0 in mca_btl_openib_component_progress ()
> from /opt/ompi/lib/openmpi/mca_btl_openib.so
> #3 0x00002aaaadce7f62 in mca_bml_r2_progress ()
> from /opt/ompi/lib/openmpi/mca_bml_r2.so
> #4 0x00002aaaaaec79c0 in opal_progress ()
> from /opt/ompi/lib/libopal.so.0
> #5 0x00002aaaad9c6255 in mca_pml_ob1_recv ()
> from /opt/ompi/lib/openmpi/mca_pml_ob1.so
> #6 0x00002aaaae940ba9 in ompi_coll_tuned_bcast_intra_basic_linear ()
> from /opt/ompi/lib/openmpi/mca_coll_tuned.so
> #7 0x00002aaaae52830f in mca_coll_basic_allgather_intra ()
> from /opt/ompi/lib/openmpi/mca_coll_basic.so
> #8 0x00002aaaaac04ee3 in ompi_comm_split ()
> from /opt/ompi/lib/libmpi.so.0
> #9 0x00002aaaaac2bcd8 in PMPI_Comm_split ()
> from /opt/ompi/lib/libmpi.so.0
> #10 0x0000000000403b81 in Set_Communicator ()
> #11 0x000000000040385e in Init_Communicator ()
> #12 0x0000000000402e06 in main ()
> (gdb)
>
>
>
>
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users