Open MPI User's Mailing List Archives


From: Jeff Squyres (jsquyres) (jsquyres_at_[hidden])
Date: 2006-07-02 07:55:50


A few clarifying questions:

What is your netmask on these hosts?

Where is the MPI_ALLREDUCE in your app -- right away, or somewhere deep
within the application? Can you replicate this with a simple MPI
application that essentially calls MPI_INIT, MPI_ALLREDUCE, and
MPI_FINALIZE?
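A minimal reproducer along those lines might look like the following (a sketch, not code from this thread; the filename and the hostfile are assumptions, and it would be built with mpicc and launched with the same hostfile as the failing run):

```c
/* allreduce_test.c -- minimal MPI_INIT / MPI_ALLREDUCE / MPI_FINALIZE test.
 *
 * Build and run (paths are illustrative):
 *   mpicc -o allreduce_test allreduce_test.c
 *   mpirun -np 7 --hostfile ./hostfile ./allreduce_test
 */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, size, sum = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Each rank contributes its rank number; every rank should end up
     * with the same result, 0 + 1 + ... + (size-1). */
    MPI_Allreduce(&rank, &sum, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
    printf("rank %d of %d: allreduce sum = %d\n", rank, size, sum);

    MPI_Finalize();
    return 0;
}
```

If this small program fails the same way across the two subnets, the problem is in the transport setup rather than in the application.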

Can you replicate this with a simple MPI app that does an MPI_SEND /
MPI_RECV between two processes, one on each subnet?
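A point-to-point test of that shape could be sketched as follows (again an illustration, not code from this thread; it assumes the hostfile places rank 0 on one subnet and rank 1 on the other):

```c
/* sendrecv_test.c -- minimal MPI_SEND / MPI_RECV between two ranks.
 * Run with exactly two processes, one per subnet:
 *   mpicc -o sendrecv_test sendrecv_test.c
 *   mpirun -np 2 --hostfile ./hostfile ./sendrecv_test
 */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, value;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("rank 1 received %d from rank 0\n", value);
    }

    MPI_Finalize();
    return 0;
}
```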

Thanks.

> -----Original Message-----
> From: users-bounces_at_[hidden]
> [mailto:users-bounces_at_[hidden]] On Behalf Of openmpi-user
> Sent: Sunday, July 02, 2006 7:20 AM
> To: users_at_[hidden]
> Subject: [OMPI users] OS X, OpenMPI 1.1: An error occurred in
> MPI_Allreduce on communicator MPI_COMM_WORLD
>
> Hi All,
>
> when the nodes belong to different subnets, the following error
> messages pop up:
> [powerbook.2-net:20826] *** An error occurred in MPI_Allreduce
> [powerbook.2-net:20826] *** on communicator MPI_COMM_WORLD
> [powerbook.2-net:20826] *** MPI_ERR_INTERN: internal error
> [powerbook.2-net:20826] *** MPI_ERRORS_ARE_FATAL (goodbye)
>
> Here the hostfile sets up three nodes in two subnets (192.168.3.x and
> 192.168.2.x, netmask 255.255.255.0). The 192.168.3.x nodes are
> connected via Gigabit Ethernet, the 192.168.2.x nodes via WLAN.
>
> Frank
>
>
> This is the full output:
>
> [powerbook:/Network/CFD/MVH-1.0] motte% mpirun -d -np 7 --hostfile
> ./hostfile /Network/CFD/MVH-1.0/vhone
> [powerbook.2-net:20821] procdir: (null)
> [powerbook.2-net:20821] jobdir: (null)
> [powerbook.2-net:20821] unidir:
> /tmp/openmpi-sessions-motte_at_powerbook.2-net_0/default-universe
> [powerbook.2-net:20821] top: openmpi-sessions-motte_at_powerbook.2-net_0
> [powerbook.2-net:20821] tmp: /tmp
> [powerbook.2-net:20821] connect_uni: contact info read
> [powerbook.2-net:20821] connect_uni: connection not allowed
> [powerbook.2-net:20821] [0,0,0] setting up session dir with
> [powerbook.2-net:20821] tmpdir /tmp
> [powerbook.2-net:20821] universe default-universe-20821
> [powerbook.2-net:20821] user motte
> [powerbook.2-net:20821] host powerbook.2-net
> [powerbook.2-net:20821] jobid 0
> [powerbook.2-net:20821] procid 0
> [powerbook.2-net:20821] procdir:
> /tmp/openmpi-sessions-motte_at_powerbook.2-net_0/default-universe
> -20821/0/0
> [powerbook.2-net:20821] jobdir:
> /tmp/openmpi-sessions-motte_at_powerbook.2-net_0/default-universe-20821/0
> [powerbook.2-net:20821] unidir:
> /tmp/openmpi-sessions-motte_at_powerbook.2-net_0/default-universe-20821
> [powerbook.2-net:20821] top: openmpi-sessions-motte_at_powerbook.2-net_0
> [powerbook.2-net:20821] tmp: /tmp
> [powerbook.2-net:20821] [0,0,0] contact_file
> /tmp/openmpi-sessions-motte_at_powerbook.2-net_0/default-universe
> -20821/universe-setup.txt
> [powerbook.2-net:20821] [0,0,0] wrote setup file
> [powerbook.2-net:20821] pls:rsh: local csh: 1, local bash: 0
> [powerbook.2-net:20821] pls:rsh: assuming same remote shell
> as local shell
> [powerbook.2-net:20821] pls:rsh: remote csh: 1, remote bash: 0
> [powerbook.2-net:20821] pls:rsh: final template argv:
> [powerbook.2-net:20821] pls:rsh: /usr/bin/ssh <template> orted
> --debug --bootproxy 1 --name <template> --num_procs 4 --vpid_start 0
> --nodename <template> --universe
> motte_at_powerbook.2-net:default-universe-20821 --nsreplica
> "0.0.0;tcp://192.168.2.3:54609" --gprreplica
> "0.0.0;tcp://192.168.2.3:54609" --mpi-call-yield 0
> [powerbook.2-net:20821] pls:rsh: launching on node Powerbook.2-net
> [powerbook.2-net:20821] pls:rsh: not oversubscribed -- setting
> mpi_yield_when_idle to 0
> [powerbook.2-net:20821] pls:rsh: Powerbook.2-net is a LOCAL node
> [powerbook.2-net:20821] pls:rsh: changing to directory /Users/motte
> [powerbook.2-net:20821] pls:rsh: executing: orted --debug
> --bootproxy 1
> --name 0.0.1 --num_procs 4 --vpid_start 0 --nodename Powerbook.2-net
> --universe motte_at_powerbook.2-net:default-universe-20821 --nsreplica
> "0.0.0;tcp://192.168.2.3:54609" --gprreplica
> "0.0.0;tcp://192.168.2.3:54609" --mpi-call-yield 0
> [powerbook.2-net:20822] [0,0,1] setting up session dir with
> [powerbook.2-net:20822] universe default-universe-20821
> [powerbook.2-net:20822] user motte
> [powerbook.2-net:20822] host Powerbook.2-net
> [powerbook.2-net:20822] jobid 0
> [powerbook.2-net:20822] procid 1
> [powerbook.2-net:20822] procdir:
> /tmp/openmpi-sessions-motte_at_Powerbook.2-net_0/default-universe
> -20821/0/1
> [powerbook.2-net:20822] jobdir:
> /tmp/openmpi-sessions-motte_at_Powerbook.2-net_0/default-universe-20821/0
> [powerbook.2-net:20822] unidir:
> /tmp/openmpi-sessions-motte_at_Powerbook.2-net_0/default-universe-20821
> [powerbook.2-net:20822] top: openmpi-sessions-motte_at_Powerbook.2-net_0
> [powerbook.2-net:20822] tmp: /tmp
> [powerbook.2-net:20821] pls:rsh: launching on node g4d003.3-net
> [powerbook.2-net:20821] pls:rsh: not oversubscribed -- setting
> mpi_yield_when_idle to 0
> [powerbook.2-net:20821] pls:rsh: g4d003.3-net is a REMOTE node
> [powerbook.2-net:20821] pls:rsh: executing: /usr/bin/ssh g4d003.3-net
> orted --debug --bootproxy 1 --name 0.0.2 --num_procs 4 --vpid_start 0
> --nodename g4d003.3-net --universe
> motte_at_powerbook.2-net:default-universe-20821 --nsreplica
> "0.0.0;tcp://192.168.2.3:54609" --gprreplica
> "0.0.0;tcp://192.168.2.3:54609" --mpi-call-yield 0
> [powerbook.2-net:20821] pls:rsh: launching on node G5Dual.3-net
> [powerbook.2-net:20821] pls:rsh: not oversubscribed -- setting
> mpi_yield_when_idle to 0
> [powerbook.2-net:20821] pls:rsh: G5Dual.3-net is a REMOTE node
> [powerbook.2-net:20821] pls:rsh: executing: /usr/bin/ssh G5Dual.3-net
> orted --debug --bootproxy 1 --name 0.0.3 --num_procs 4 --vpid_start 0
> --nodename G5Dual.3-net --universe
> motte_at_powerbook.2-net:default-universe-20821 --nsreplica
> "0.0.0;tcp://192.168.2.3:54609" --gprreplica
> "0.0.0;tcp://192.168.2.3:54609" --mpi-call-yield 0
> [g4d003.3-net:00396] [0,0,2] setting up session dir with
> [g4d003.3-net:00396] universe default-universe-20821
> [g4d003.3-net:00396] user motte
> [g4d003.3-net:00396] host g4d003.3-net
> [g4d003.3-net:00396] jobid 0
> [g4d003.3-net:00396] procid 2
> [g4d003.3-net:00396] procdir:
> /tmp/openmpi-sessions-motte_at_g4d003.3-net_0/default-universe-20821/0/2
> [g4d003.3-net:00396] jobdir:
> /tmp/openmpi-sessions-motte_at_g4d003.3-net_0/default-universe-20821/0
> [g4d003.3-net:00396] unidir:
> /tmp/openmpi-sessions-motte_at_g4d003.3-net_0/default-universe-20821
> [g4d003.3-net:00396] top: openmpi-sessions-motte_at_g4d003.3-net_0
> [g4d003.3-net:00396] tmp: /tmp
> [g5dual.3-net:00938] [0,0,3] setting up session dir with
> [g5dual.3-net:00938] universe default-universe-20821
> [g5dual.3-net:00938] user motte
> [g5dual.3-net:00938] host G5Dual.3-net
> [g5dual.3-net:00938] jobid 0
> [g5dual.3-net:00938] procid 3
> [g5dual.3-net:00938] procdir:
> /tmp/openmpi-sessions-motte_at_G5Dual.3-net_0/default-universe-20821/0/3
> [g5dual.3-net:00938] jobdir:
> /tmp/openmpi-sessions-motte_at_G5Dual.3-net_0/default-universe-20821/0
> [g5dual.3-net:00938] unidir:
> /tmp/openmpi-sessions-motte_at_G5Dual.3-net_0/default-universe-20821
> [g5dual.3-net:00938] top: openmpi-sessions-motte_at_G5Dual.3-net_0
> [g5dual.3-net:00938] tmp: /tmp
> [powerbook.2-net:20826] [0,1,6] setting up session dir with
> [powerbook.2-net:20826] universe default-universe-20821
> [powerbook.2-net:20826] user motte
> [powerbook.2-net:20826] host Powerbook.2-net
> [powerbook.2-net:20826] jobid 1
> [powerbook.2-net:20826] procid 6
> [powerbook.2-net:20826] procdir:
> /tmp/openmpi-sessions-motte_at_Powerbook.2-net_0/default-universe
> -20821/1/6
> [powerbook.2-net:20826] jobdir:
> /tmp/openmpi-sessions-motte_at_Powerbook.2-net_0/default-universe-20821/1
> [powerbook.2-net:20826] unidir:
> /tmp/openmpi-sessions-motte_at_Powerbook.2-net_0/default-universe-20821
> [powerbook.2-net:20826] top: openmpi-sessions-motte_at_Powerbook.2-net_0
> [powerbook.2-net:20826] tmp: /tmp
> [g5dual.3-net:00940] [0,1,0] setting up session dir with
> [g5dual.3-net:00940] universe default-universe-20821
> [g5dual.3-net:00940] user motte
> [g5dual.3-net:00940] host G5Dual.3-net
> [g5dual.3-net:00940] jobid 1
> [g5dual.3-net:00940] procid 0
> [g5dual.3-net:00940] procdir:
> /tmp/openmpi-sessions-motte_at_G5Dual.3-net_0/default-universe-20821/1/0
> [g5dual.3-net:00940] jobdir:
> /tmp/openmpi-sessions-motte_at_G5Dual.3-net_0/default-universe-20821/1
> [g5dual.3-net:00940] unidir:
> /tmp/openmpi-sessions-motte_at_G5Dual.3-net_0/default-universe-20821
> [g5dual.3-net:00940] top: openmpi-sessions-motte_at_G5Dual.3-net_0
> [g5dual.3-net:00940] tmp: /tmp
> [g5dual.3-net:00946] [0,1,3] setting up session dir with
> [g5dual.3-net:00946] universe default-universe-20821
> [g5dual.3-net:00946] user motte
> [g5dual.3-net:00946] host G5Dual.3-net
> [g5dual.3-net:00946] jobid 1
> [g5dual.3-net:00946] procid 3
> [g5dual.3-net:00946] procdir:
> /tmp/openmpi-sessions-motte_at_G5Dual.3-net_0/default-universe-20821/1/3
> [g5dual.3-net:00946] jobdir:
> /tmp/openmpi-sessions-motte_at_G5Dual.3-net_0/default-universe-20821/1
> [g5dual.3-net:00946] unidir:
> /tmp/openmpi-sessions-motte_at_G5Dual.3-net_0/default-universe-20821
> [g5dual.3-net:00946] top: openmpi-sessions-motte_at_G5Dual.3-net_0
> [g5dual.3-net:00946] tmp: /tmp
> [g5dual.3-net:00942] [0,1,1] setting up session dir with
> [g5dual.3-net:00942] universe default-universe-20821
> [g5dual.3-net:00942] user motte
> [g5dual.3-net:00942] host G5Dual.3-net
> [g5dual.3-net:00942] jobid 1
> [g5dual.3-net:00942] procid 1
> [g5dual.3-net:00942] procdir:
> /tmp/openmpi-sessions-motte_at_G5Dual.3-net_0/default-universe-20821/1/1
> [g5dual.3-net:00942] jobdir:
> /tmp/openmpi-sessions-motte_at_G5Dual.3-net_0/default-universe-20821/1
> [g5dual.3-net:00942] unidir:
> /tmp/openmpi-sessions-motte_at_G5Dual.3-net_0/default-universe-20821
> [g5dual.3-net:00942] top: openmpi-sessions-motte_at_G5Dual.3-net_0
> [g5dual.3-net:00942] tmp: /tmp
> [g5dual.3-net:00944] [0,1,2] setting up session dir with
> [g5dual.3-net:00944] universe default-universe-20821
> [g5dual.3-net:00944] user motte
> [g5dual.3-net:00944] host G5Dual.3-net
> [g5dual.3-net:00944] jobid 1
> [g5dual.3-net:00944] procid 2
> [g5dual.3-net:00944] procdir:
> /tmp/openmpi-sessions-motte_at_G5Dual.3-net_0/default-universe-20821/1/2
> [g5dual.3-net:00944] jobdir:
> /tmp/openmpi-sessions-motte_at_G5Dual.3-net_0/default-universe-20821/1
> [g5dual.3-net:00944] unidir:
> /tmp/openmpi-sessions-motte_at_G5Dual.3-net_0/default-universe-20821
> [g5dual.3-net:00944] top: openmpi-sessions-motte_at_G5Dual.3-net_0
> [g5dual.3-net:00944] tmp: /tmp
> [g4d003.3-net:00398] [0,1,4] setting up session dir with
> [g4d003.3-net:00398] universe default-universe-20821
> [g4d003.3-net:00398] user motte
> [g4d003.3-net:00398] host g4d003.3-net
> [g4d003.3-net:00398] jobid 1
> [g4d003.3-net:00398] procid 4
> [g4d003.3-net:00398] procdir:
> /tmp/openmpi-sessions-motte_at_g4d003.3-net_0/default-universe-20821/1/4
> [g4d003.3-net:00398] jobdir:
> /tmp/openmpi-sessions-motte_at_g4d003.3-net_0/default-universe-20821/1
> [g4d003.3-net:00398] unidir:
> /tmp/openmpi-sessions-motte_at_g4d003.3-net_0/default-universe-20821
> [g4d003.3-net:00398] top: openmpi-sessions-motte_at_g4d003.3-net_0
> [g4d003.3-net:00398] tmp: /tmp
> [g4d003.3-net:00400] [0,1,5] setting up session dir with
> [g4d003.3-net:00400] universe default-universe-20821
> [g4d003.3-net:00400] user motte
> [g4d003.3-net:00400] host g4d003.3-net
> [g4d003.3-net:00400] jobid 1
> [g4d003.3-net:00400] procid 5
> [g4d003.3-net:00400] procdir:
> /tmp/openmpi-sessions-motte_at_g4d003.3-net_0/default-universe-20821/1/5
> [g4d003.3-net:00400] jobdir:
> /tmp/openmpi-sessions-motte_at_g4d003.3-net_0/default-universe-20821/1
> [g4d003.3-net:00400] unidir:
> /tmp/openmpi-sessions-motte_at_g4d003.3-net_0/default-universe-20821
> [g4d003.3-net:00400] top: openmpi-sessions-motte_at_g4d003.3-net_0
> [g4d003.3-net:00400] tmp: /tmp
> [powerbook.2-net:20821] spawn: in job_state_callback(jobid =
> 1, state = 0x4)
> [powerbook.2-net:20821] Info: Setting up debugger process table for
> applications
> MPIR_being_debugged = 0
> MPIR_debug_gate = 0
> MPIR_debug_state = 1
> MPIR_acquired_pre_main = 0
> MPIR_i_am_starter = 0
> MPIR_proctable_size = 7
> MPIR_proctable:
> (i, host, exe, pid) = (0, G5Dual.3-net,
> /Network/CFD/MVH-1.0/vhone, 940)
> (i, host, exe, pid) = (1, G5Dual.3-net,
> /Network/CFD/MVH-1.0/vhone, 942)
> (i, host, exe, pid) = (2, G5Dual.3-net,
> /Network/CFD/MVH-1.0/vhone, 944)
> (i, host, exe, pid) = (3, G5Dual.3-net,
> /Network/CFD/MVH-1.0/vhone, 946)
> (i, host, exe, pid) = (4, g4d003.3-net,
> /Network/CFD/MVH-1.0/vhone, 398)
> (i, host, exe, pid) = (5, g4d003.3-net,
> /Network/CFD/MVH-1.0/vhone, 400)
> (i, host, exe, pid) = (6, Powerbook.2-net,
> /Network/CFD/MVH-1.0/vhone, 20826)
> [powerbook.2-net:20826] [0,1,6] ompi_mpi_init completed
> [g5dual.3-net:00940] [0,1,0] ompi_mpi_init completed
> [g5dual.3-net:00942] [0,1,1] ompi_mpi_init completed
> [g5dual.3-net:00944] [0,1,2] ompi_mpi_init completed
> [g5dual.3-net:00946] [0,1,3] ompi_mpi_init completed
> [g4d003.3-net:00398] [0,1,4] ompi_mpi_init completed
> [g4d003.3-net:00400] [0,1,5] ompi_mpi_init completed
> [powerbook.2-net:20826] *** An error occurred in MPI_Allreduce
> [powerbook.2-net:20826] *** on communicator MPI_COMM_WORLD
> [powerbook.2-net:20826] *** MPI_ERR_INTERN: internal error
> [powerbook.2-net:20826] *** MPI_ERRORS_ARE_FATAL (goodbye)
> --------------------------------------------------------------
> ------------
> WARNING: A process refused to die!
>
> Host: powerbook.2-net
> PID: 20826
>
> This process may still be running and/or consuming resources.
> --------------------------------------------------------------
> ------------
> --------------------------------------------------------------
> ------------
> WARNING: A process refused to die!
>
> Host: g4d003.3-net
> PID: 398
>
> This process may still be running and/or consuming resources.
> --------------------------------------------------------------
> ------------
> --------------------------------------------------------------
> ------------
> (skipped)
>