Open MPI User's Mailing List Archives

From: George Bosilca (bosilca_at_[hidden])
Date: 2006-05-02 12:07:28


Do you have a firewall on the node called darwin? It looks like fisher
is unable to create a TCP connection to darwin, and a firewall
is one of the most common causes of this problem.

   Thanks,
     george.
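
For reference, on Linux errno 113 is EHOSTUNREACH ("No route to host"), which is often what a rejecting firewall rule produces on connect(). The following is a minimal Python sketch, not part of Open MPI, for decoding an errno value and probing whether one node can open a TCP connection to another. The function names describe_errno and can_connect are hypothetical helpers introduced here for illustration only.

```python
import errno
import os
import socket


def describe_errno(code):
    """Return (symbolic name, message) for an errno value on this platform."""
    # errno numbering is platform specific; 113 is EHOSTUNREACH on Linux.
    return errno.errorcode.get(code, "UNKNOWN"), os.strerror(code)


def can_connect(host, port, timeout=3.0):
    """Try to open a TCP connection to (host, port).

    Returns (True, None) on success, or (False, exc) where exc is the
    OSError raised by the failed attempt (exc.errno may be None on timeout).
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, None
    except OSError as exc:
        return False, exc
```

As a usage example, running can_connect("darwin.phsx.ku.edu", 22) on fisher (and the reverse on darwin) would show whether a basic TCP connection works in each direction; passing the errno of a failure to describe_errno gives the symbolic name. A successful probe of one port does not rule out a firewall blocking others, since, as far as I know, the TCP BTL listens on dynamically chosen ports by default.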

On May 2, 2006, at 5:19 AM, Ali Soleimani wrote:

> Hello all,
>
> I recently got OpenMPI 1.0.2 (rev 9571) compiled and running on a
> small EM64T-based cluster. Everything works fine when running on a
> single host, or when running simple commands or test scripts on
> multiple hosts. But when I try to run a major program (cosmomc), I
> get the following error:
>
>
> [alis_at_darwin cosmomc_mpi]$ mpirun -np 2 cosmomc params.ini
> Number of MPI processes: 2
> [0,1,0][btl_tcp_endpoint.c:559:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=113
>
>
> I do not have more than one network interface (just eth0 and lo), and
> I tried the various options suggested in the FAQ for disabling
> interfaces. My machines have only one IP address each. It does not
> seem to matter whether I use single hostnames, fully-qualified
> hostnames, or IP addresses in the host list.
>
> Curiously, even though it reports this error, the processes still
> seem to start up on the remote machines, though they do not produce
> output properly. The relevant ps lines on the non-host machine:
>
> alis 4393 0.0 0.0 37124 2896 ? S 05:10 0:00 sshd: alis_at_notty
> alis 4394 0.1 0.0 36396 1964 ? Ss 05:10 0:00 orted --debug --bootproxy 1 --name 0.0.2 --num_procs 3 --vpid_start 0
> alis 4411 99.9 0.1 628872 5520 ? R 05:10 0:14 cosmomc params.ini
>
> Any suggestions? A copy of the mpirun output with --debug is
> included below.
>
>
> -----
>
>
> [alis_at_darwin cosmomc_mpi]$ mpirun --debug -np 2 cosmomc params.ini
> [darwin.phsx.ku.edu:25140] procdir: (null)
> [darwin.phsx.ku.edu:25140] jobdir: (null)
> [darwin.phsx.ku.edu:25140] unidir: /tmp/openmpi-sessions-
> alis_at_[hidden]_0/default-universe
> [darwin.phsx.ku.edu:25140] top: openmpi-sessions-
> alis_at_[hidden]_0
> [darwin.phsx.ku.edu:25140] tmp: /tmp
> [darwin.phsx.ku.edu:25140] connect_uni: contact info read
> [darwin.phsx.ku.edu:25140] connect_uni: connection not allowed
> [darwin.phsx.ku.edu:25140] [0,0,0] setting up session dir with
> [darwin.phsx.ku.edu:25140] tmpdir /tmp
> [darwin.phsx.ku.edu:25140] universe default-universe-25140
> [darwin.phsx.ku.edu:25140] user alis
> [darwin.phsx.ku.edu:25140] host darwin.phsx.ku.edu
> [darwin.phsx.ku.edu:25140] jobid 0
> [darwin.phsx.ku.edu:25140] procid 0
> [darwin.phsx.ku.edu:25140] procdir: /tmp/openmpi-sessions-
> alis_at_[hidden]_0/default-universe-25140/0/0
> [darwin.phsx.ku.edu:25140] jobdir: /tmp/openmpi-sessions-
> alis_at_[hidden]_0/default-universe-25140/0
> [darwin.phsx.ku.edu:25140] unidir: /tmp/openmpi-sessions-
> alis_at_[hidden]_0/default-universe-25140
> [darwin.phsx.ku.edu:25140] top: openmpi-sessions-
> alis_at_[hidden]_0
> [darwin.phsx.ku.edu:25140] tmp: /tmp
> [darwin.phsx.ku.edu:25140] [0,0,0] contact_file /tmp/openmpi-
> sessions-alis_at_[hidden]_0/default-universe-25140/universe-
> setup.txt
> [darwin.phsx.ku.edu:25140] [0,0,0] wrote setup file
> [darwin.phsx.ku.edu:25140] spawn: in job_state_callback(jobid = 1,
> state = 0x1)
> [darwin.phsx.ku.edu:25140] pls:rsh: local csh: 0, local bash: 1
> [darwin.phsx.ku.edu:25140] pls:rsh: assuming same remote shell as
> local shell
> [darwin.phsx.ku.edu:25140] pls:rsh: remote csh: 0, remote bash: 1
> [darwin.phsx.ku.edu:25140] pls:rsh: final template argv:
> [darwin.phsx.ku.edu:25140] pls:rsh: /usr/bin/ssh <template>
> orted --debug --bootproxy 1 --name <template> --num_procs 3 --
> vpid_start 0 --nodename <template> --universe
> alis_at_[hidden]:default-universe-25140 --nsreplica
> "0.0.0;tcp://129.237.98.242:37853" --gprreplica "0.0.0;tcp://
> 129.237.98.242:37853" --mpi-call-yield 0
> [darwin.phsx.ku.edu:25140] pls:rsh: launching on node 129.237.98.242
> [darwin.phsx.ku.edu:25140] pls:rsh: not oversubscribed -- setting
> mpi_yield_when_idle to 0
> [darwin.phsx.ku.edu:25140] pls:rsh: 129.237.98.242 is a LOCAL node
> [darwin.phsx.ku.edu:25140] pls:rsh: changing to directory /home/alis
> [darwin.phsx.ku.edu:25140] pls:rsh: executing: orted --debug --
> bootproxy 1 --name 0.0.1 --num_procs 3 --vpid_start 0 --nodename
> 129.237.98.242 --universe alis_at_[hidden]:default-
> universe-25140 --nsreplica "0.0.0;tcp://129.237.98.242:37853" --
> gprreplica "0.0.0;tcp://129.237.98.242:37853" --mpi-call-yield 0
> [darwin.phsx.ku.edu:25141] [0,0,1] setting up session dir with
> [darwin.phsx.ku.edu:25141] universe default-universe-25140
> [darwin.phsx.ku.edu:25141] user alis
> [darwin.phsx.ku.edu:25141] host 129.237.98.242
> [darwin.phsx.ku.edu:25141] jobid 0
> [darwin.phsx.ku.edu:25141] procid 1
> [darwin.phsx.ku.edu:25141] procdir: /tmp/openmpi-sessions-
> alis_at_129.237.98.242_0/default-universe-25140/0/1
> [darwin.phsx.ku.edu:25141] jobdir: /tmp/openmpi-sessions-
> alis_at_129.237.98.242_0/default-universe-25140/0
> [darwin.phsx.ku.edu:25141] unidir: /tmp/openmpi-sessions-
> alis_at_129.237.98.242_0/default-universe-25140
> [darwin.phsx.ku.edu:25141] top: openmpi-sessions-alis_at_129.237.98.242_0
> [darwin.phsx.ku.edu:25141] tmp: /tmp
> [darwin.phsx.ku.edu:25140] pls:rsh: launching on node 129.237.98.243
> [darwin.phsx.ku.edu:25140] pls:rsh: not oversubscribed -- setting
> mpi_yield_when_idle to 0
> [darwin.phsx.ku.edu:25140] pls:rsh: 129.237.98.243 is a REMOTE node
> [darwin.phsx.ku.edu:25140] pls:rsh: executing: /usr/bin/ssh
> 129.237.98.243 orted --debug --bootproxy 1 --name 0.0.2 --num_procs
> 3 --vpid_start 0 --nodename 129.237.98.243 --universe
> alis_at_[hidden]:default-universe-25140 --nsreplica
> "0.0.0;tcp://129.237.98.242:37853" --gprreplica "0.0.0;tcp://
> 129.237.98.242:37853" --mpi-call-yield 0
> [fisher.phsx.ku.edu:04445] [0,0,2] setting up session dir with
> [fisher.phsx.ku.edu:04445] universe default-universe-25140
> [fisher.phsx.ku.edu:04445] user alis
> [fisher.phsx.ku.edu:04445] host 129.237.98.243
> [fisher.phsx.ku.edu:04445] jobid 0
> [fisher.phsx.ku.edu:04445] procid 2
> [fisher.phsx.ku.edu:04445] procdir: /tmp/openmpi-sessions-
> alis_at_129.237.98.243_0/default-universe-25140/0/2
> [fisher.phsx.ku.edu:04445] jobdir: /tmp/openmpi-sessions-
> alis_at_129.237.98.243_0/default-universe-25140/0
> [fisher.phsx.ku.edu:04445] unidir: /tmp/openmpi-sessions-
> alis_at_129.237.98.243_0/default-universe-25140
> [fisher.phsx.ku.edu:04445] top: openmpi-sessions-alis_at_129.237.98.243_0
> [fisher.phsx.ku.edu:04445] tmp: /tmp
> [darwin.phsx.ku.edu:25143] [0,1,0] setting up session dir with
> [darwin.phsx.ku.edu:25143] universe default-universe-25140
> [darwin.phsx.ku.edu:25143] user alis
> [darwin.phsx.ku.edu:25143] host 129.237.98.242
> [darwin.phsx.ku.edu:25143] jobid 1
> [darwin.phsx.ku.edu:25143] procid 0
> [darwin.phsx.ku.edu:25143] procdir: /tmp/openmpi-sessions-
> alis_at_129.237.98.242_0/default-universe-25140/1/0
> [darwin.phsx.ku.edu:25143] jobdir: /tmp/openmpi-sessions-
> alis_at_129.237.98.242_0/default-universe-25140/1
> [darwin.phsx.ku.edu:25143] unidir: /tmp/openmpi-sessions-
> alis_at_129.237.98.242_0/default-universe-25140
> [darwin.phsx.ku.edu:25143] top: openmpi-sessions-alis_at_129.237.98.242_0
> [darwin.phsx.ku.edu:25143] tmp: /tmp
> [fisher.phsx.ku.edu:04462] [0,1,1] setting up session dir with
> [fisher.phsx.ku.edu:04462] universe default-universe-25140
> [fisher.phsx.ku.edu:04462] user alis
> [fisher.phsx.ku.edu:04462] host 129.237.98.243
> [fisher.phsx.ku.edu:04462] jobid 1
> [fisher.phsx.ku.edu:04462] procid 1
> [fisher.phsx.ku.edu:04462] procdir: /tmp/openmpi-sessions-
> alis_at_129.237.98.243_0/default-universe-25140/1/1
> [fisher.phsx.ku.edu:04462] jobdir: /tmp/openmpi-sessions-
> alis_at_129.237.98.243_0/default-universe-25140/1
> [fisher.phsx.ku.edu:04462] unidir: /tmp/openmpi-sessions-
> alis_at_129.237.98.243_0/default-universe-25140
> [fisher.phsx.ku.edu:04462] top: openmpi-sessions-alis_at_129.237.98.243_0
> [fisher.phsx.ku.edu:04462] tmp: /tmp
> [darwin.phsx.ku.edu:25140] spawn: in job_state_callback(jobid = 1,
> state = 0x3)
> [darwin.phsx.ku.edu:25140] Info: Setting up debugger process table
> for applications
> MPIR_being_debugged = 0
> MPIR_debug_gate = 0
> MPIR_debug_state = 1
> MPIR_acquired_pre_main = 0
> MPIR_i_am_starter = 0
> MPIR_proctable_size = 2
> MPIR_proctable:
> (i, host, exe, pid) = (0, 129.237.98.243, cosmomc, 4462)
> (i, host, exe, pid) = (1, 129.237.98.242, cosmomc, 25143)
> [darwin.phsx.ku.edu:25141] orted: job_state_callback(jobid = 1,
> state = 5453392)
> [darwin.phsx.ku.edu:25140] spawn: in job_state_callback(jobid = 1,
> state = 0x4)
> [darwin.phsx.ku.edu:25141] orted: job_state_callback(jobid = 1,
> state = 5389856)
> [darwin.phsx.ku.edu:25143] [0,1,0] ompi_mpi_init completed
> [fisher.phsx.ku.edu:04462] [0,1,1] ompi_mpi_init completed
> [fisher.phsx.ku.edu:04445] orted: job_state_callback(jobid = 1,
> state = 5449344)
> [fisher.phsx.ku.edu:04445] orted: job_state_callback(jobid = 1,
> state = 5379136)
> Number of MPI processes: 2
> [0,1,0][btl_tcp_endpoint.c:559:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=113
>
> ---
> At this point I have to kill the proc with Ctrl-C.
> ---
>
> [darwin.phsx.ku.edu:25141] sess_dir_finalize: found job session dir
> empty - deleting
> [darwin.phsx.ku.edu:25141] sess_dir_finalize: univ session dir not
> empty - leaving
> Killed by signal 2.
> [darwin.phsx.ku.edu:25140] sess_dir_finalize: proc session dir not
> empty - leaving
> [darwin.phsx.ku.edu:25141] orted: job_state_callback(jobid = 1,
> state = ORTE_PROC_STATE_ABORTED)
> [darwin.phsx.ku.edu:25140] spawn: in job_state_callback(jobid = 1,
> state = 0xa)
> [darwin.phsx.ku.edu:25140] ERROR: A daemon on node 129.237.98.243
> failed to start as expected.
> [darwin.phsx.ku.edu:25140] ERROR: There may be more information
> available from
> [darwin.phsx.ku.edu:25140] ERROR: the remote shell (see above).
> [darwin.phsx.ku.edu:25140] ERROR: The daemon exited unexpectedly
> with status 255.
> mpirun: killing job...
> [darwin.phsx.ku.edu:25140] [0,0,0]-[0,0,2]
> mca_oob_tcp_msg_send_handler: writev failed with errno=104
> [darwin.phsx.ku.edu:25140] [0,0,0] ORTE_ERROR_LOG: Connection
> failed in file pls_base_proxy.c at line 140
> forrtl: error (69): process interrupted (SIGINT)
> ----------------------------------------------------------------------
> ----
> WARNING: A process refused to die!
>
> Host: darwin.phsx.ku.edu
> PID: 25143
>
> This process may still be running and/or consuming resources.
> ----------------------------------------------------------------------
> ----
> ----------------------------------------------------------------------
> ----
> WARNING: A process refused to die!
>
> Host: darwin.phsx.ku.edu
> PID: 25143
>
> This process may still be running and/or consuming resources.
> ----------------------------------------------------------------------
> ----
> ----------------------------------------------------------------------
> ----
> WARNING: A process refused to die!
>
> Host: darwin.phsx.ku.edu
> PID: 25143
>
> This process may still be running and/or consuming resources.
> ----------------------------------------------------------------------
> ----
> [darwin.phsx.ku.edu:25141] sess_dir_finalize: proc session dir not
> empty - leaving
> [darwin.phsx.ku.edu:25141] orted: job_state_callback(jobid = 1,
> state = ORTE_PROC_STATE_TERMINATED)
> [darwin.phsx.ku.edu:25141] sess_dir_finalize: found proc session
> dir empty - deleting
> [darwin.phsx.ku.edu:25141] sess_dir_finalize: found job session dir
> empty - deleting
> [darwin.phsx.ku.edu:25141] sess_dir_finalize: found univ session
> dir empty - deleting
> [darwin.phsx.ku.edu:25141] sess_dir_finalize: top session dir not
> empty - leaving
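
When reading debug output like the log above, it can help to know which host each process name (for example [0,1,0]) runs on, since the connect() error is reported under a process name rather than a hostname. The following is a rough Python sketch, assuming the "[host:pid] [name] setting up session dir" line format seen in this particular log; the function name map_process_names is hypothetical and the pattern may not match output from other Open MPI versions.

```python
import re

# Matches lines such as:
#   > [darwin.phsx.ku.edu:25143] [0,1,0] setting up session dir with
# The leading "> " quote marker is ignored because search() scans the line.
SESSION_RE = re.compile(
    r"\[(?P<host>[^\]:]+):(?P<pid>\d+)\]\s+"
    r"\[(?P<name>\d+,\d+,\d+)\]\s+setting up session dir"
)


def map_process_names(lines):
    """Map each process name (e.g. "0,1,0") to the (host, pid) that logged it."""
    mapping = {}
    for line in lines:
        match = SESSION_RE.search(line)
        if match:
            mapping[match.group("name")] = (
                match.group("host"),
                int(match.group("pid")),
            )
    return mapping
```

Feeding the quoted log lines to map_process_names gives a lookup table from process name to host, which can then be used to see which machine printed a given btl_tcp_endpoint.c message.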