Open MPI User's Mailing List Archives

From: Ali Soleimani (alis_at_[hidden])
Date: 2006-05-02 05:19:53


Hello all,

        I recently got Open MPI 1.0.2 (rev 9571) compiled and running on a
small EM64T-based cluster. Everything works fine when running on a single
host, or when running simple commands or test scripts on multiple hosts. But
when I try to run a major program (cosmomc), I get the following error:

[alis_at_darwin cosmomc_mpi]$ mpirun -np 2 cosmomc params.ini
Number of MPI processes: 2
[0,1,0][btl_tcp_endpoint.c:559:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=113
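
For reference, the numeric errno values in this output (113 here, and 104 in the debug output further down) can be turned into symbolic names with a few lines of Python. This is only a sketch: the helper name `describe_errno` is made up for illustration, and errno numbering is platform-specific, so it should be run on one of the cluster nodes. On Linux, 113 is expected to decode as EHOSTUNREACH ("No route to host") and 104 as ECONNRESET.

```python
import errno
import os


def describe_errno(code):
    """Return (symbolic name, message) for a numeric errno on this platform.

    Hypothetical helper for illustration; not part of Open MPI.
    """
    name = errno.errorcode.get(code, "UNKNOWN")
    return name, os.strerror(code)


# The two values that appear in the mpirun output in this post.
for code in (113, 104):
    name, message = describe_errno(code)
    print(f"errno={code}: {name} ({message})")
```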

        I do not have more than one network interface (just eth0 and lo), and I
have tried the various options suggested in the FAQ for disabling interfaces.
My machines have only one IP address each. It does not seem to matter whether
I use short hostnames, fully qualified hostnames, or IP addresses in the host
list.
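
Since the failure comes from connect() in the TCP BTL, one way to narrow it down might be to test plain TCP connectivity between the two nodes outside of Open MPI. The sketch below is an assumption-laden illustration, not something from the FAQ: `probe_tcp` and `report` are hypothetical helper names, and the port to test is whatever a listener on the other node happens to be bound to.

```python
import errno
import socket


def probe_tcp(host, port, timeout=5.0):
    """Attempt one TCP connection; return 0 on success, else an errno value.

    Hypothetical helper for illustration; not part of Open MPI.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        try:
            # connect_ex returns the error code instead of raising for
            # ordinary connect() failures.
            return sock.connect_ex((host, port))
        except OSError as exc:
            # Name resolution failures and similar still raise.
            return exc.errno or -1


def report(host, port):
    """Format the probe result as a one-line summary."""
    code = probe_tcp(host, port)
    if code == 0:
        return f"{host}:{port} reachable"
    name = errno.errorcode.get(code, "UNKNOWN")
    return f"{host}:{port} failed: errno={code} ({name})"
```

Running something like `print(report("129.237.98.243", port))` from the first node, with `port` set to a port that a test listener on the second node is bound to, would show whether the same errno appears without MPI involved.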
        Curiously, even though it reports this error, the processes still seem
to start up on the remote machines, though they do not produce output
properly. The relevant ps lines on the remote machine:

alis 4393 0.0 0.0 37124 2896 ? S 05:10 0:00 sshd: alis_at_notty
alis 4394 0.1 0.0 36396 1964 ? Ss 05:10 0:00 orted --debug --bootproxy 1 --name 0.0.2 --num_procs 3 --vpid_start 0
alis 4411 99.9 0.1 628872 5520 ? R 05:10 0:14 cosmomc params.ini

        Any suggestions? A copy of the mpirun output with --debug is
included below.

-----

[alis_at_darwin cosmomc_mpi]$ mpirun --debug -np 2 cosmomc params.ini
[darwin.phsx.ku.edu:25140] procdir: (null)
[darwin.phsx.ku.edu:25140] jobdir: (null)
[darwin.phsx.ku.edu:25140] unidir: /tmp/openmpi-sessions-alis_at_[hidden]_0/default-universe
[darwin.phsx.ku.edu:25140] top: openmpi-sessions-alis_at_[hidden]_0
[darwin.phsx.ku.edu:25140] tmp: /tmp
[darwin.phsx.ku.edu:25140] connect_uni: contact info read
[darwin.phsx.ku.edu:25140] connect_uni: connection not allowed
[darwin.phsx.ku.edu:25140] [0,0,0] setting up session dir with
[darwin.phsx.ku.edu:25140] tmpdir /tmp
[darwin.phsx.ku.edu:25140] universe default-universe-25140
[darwin.phsx.ku.edu:25140] user alis
[darwin.phsx.ku.edu:25140] host darwin.phsx.ku.edu
[darwin.phsx.ku.edu:25140] jobid 0
[darwin.phsx.ku.edu:25140] procid 0
[darwin.phsx.ku.edu:25140] procdir: /tmp/openmpi-sessions-alis_at_[hidden]_0/default-universe-25140/0/0
[darwin.phsx.ku.edu:25140] jobdir: /tmp/openmpi-sessions-alis_at_[hidden]_0/default-universe-25140/0
[darwin.phsx.ku.edu:25140] unidir: /tmp/openmpi-sessions-alis_at_[hidden]_0/default-universe-25140
[darwin.phsx.ku.edu:25140] top: openmpi-sessions-alis_at_[hidden]_0
[darwin.phsx.ku.edu:25140] tmp: /tmp
[darwin.phsx.ku.edu:25140] [0,0,0] contact_file /tmp/openmpi-sessions-alis_at_[hidden]_0/default-universe-25140/universe-setup.txt
[darwin.phsx.ku.edu:25140] [0,0,0] wrote setup file
[darwin.phsx.ku.edu:25140] spawn: in job_state_callback(jobid = 1, state = 0x1)
[darwin.phsx.ku.edu:25140] pls:rsh: local csh: 0, local bash: 1
[darwin.phsx.ku.edu:25140] pls:rsh: assuming same remote shell as local shell
[darwin.phsx.ku.edu:25140] pls:rsh: remote csh: 0, remote bash: 1
[darwin.phsx.ku.edu:25140] pls:rsh: final template argv:
[darwin.phsx.ku.edu:25140] pls:rsh: /usr/bin/ssh <template> orted --debug --bootproxy 1 --name <template> --num_procs 3 --vpid_start 0 --nodename <template> --universe alis_at_[hidden]:default-universe-25140 --nsreplica "0.0.0;tcp://129.237.98.242:37853" --gprreplica "0.0.0;tcp://129.237.98.242:37853" --mpi-call-yield 0
[darwin.phsx.ku.edu:25140] pls:rsh: launching on node 129.237.98.242
[darwin.phsx.ku.edu:25140] pls:rsh: not oversubscribed -- setting mpi_yield_when_idle to 0
[darwin.phsx.ku.edu:25140] pls:rsh: 129.237.98.242 is a LOCAL node
[darwin.phsx.ku.edu:25140] pls:rsh: changing to directory /home/alis
[darwin.phsx.ku.edu:25140] pls:rsh: executing: orted --debug --bootproxy 1 --name 0.0.1 --num_procs 3 --vpid_start 0 --nodename 129.237.98.242 --universe alis_at_[hidden]:default-universe-25140 --nsreplica "0.0.0;tcp://129.237.98.242:37853" --gprreplica "0.0.0;tcp://129.237.98.242:37853" --mpi-call-yield 0
[darwin.phsx.ku.edu:25141] [0,0,1] setting up session dir with
[darwin.phsx.ku.edu:25141] universe default-universe-25140
[darwin.phsx.ku.edu:25141] user alis
[darwin.phsx.ku.edu:25141] host 129.237.98.242
[darwin.phsx.ku.edu:25141] jobid 0
[darwin.phsx.ku.edu:25141] procid 1
[darwin.phsx.ku.edu:25141] procdir: /tmp/openmpi-sessions-alis_at_129.237.98.242_0/default-universe-25140/0/1
[darwin.phsx.ku.edu:25141] jobdir: /tmp/openmpi-sessions-alis_at_129.237.98.242_0/default-universe-25140/0
[darwin.phsx.ku.edu:25141] unidir: /tmp/openmpi-sessions-alis_at_129.237.98.242_0/default-universe-25140
[darwin.phsx.ku.edu:25141] top: openmpi-sessions-alis_at_129.237.98.242_0
[darwin.phsx.ku.edu:25141] tmp: /tmp
[darwin.phsx.ku.edu:25140] pls:rsh: launching on node 129.237.98.243
[darwin.phsx.ku.edu:25140] pls:rsh: not oversubscribed -- setting mpi_yield_when_idle to 0
[darwin.phsx.ku.edu:25140] pls:rsh: 129.237.98.243 is a REMOTE node
[darwin.phsx.ku.edu:25140] pls:rsh: executing: /usr/bin/ssh 129.237.98.243 orted --debug --bootproxy 1 --name 0.0.2 --num_procs 3 --vpid_start 0 --nodename 129.237.98.243 --universe alis_at_[hidden]:default-universe-25140 --nsreplica "0.0.0;tcp://129.237.98.242:37853" --gprreplica "0.0.0;tcp://129.237.98.242:37853" --mpi-call-yield 0
[fisher.phsx.ku.edu:04445] [0,0,2] setting up session dir with
[fisher.phsx.ku.edu:04445] universe default-universe-25140
[fisher.phsx.ku.edu:04445] user alis
[fisher.phsx.ku.edu:04445] host 129.237.98.243
[fisher.phsx.ku.edu:04445] jobid 0
[fisher.phsx.ku.edu:04445] procid 2
[fisher.phsx.ku.edu:04445] procdir: /tmp/openmpi-sessions-alis_at_129.237.98.243_0/default-universe-25140/0/2
[fisher.phsx.ku.edu:04445] jobdir: /tmp/openmpi-sessions-alis_at_129.237.98.243_0/default-universe-25140/0
[fisher.phsx.ku.edu:04445] unidir: /tmp/openmpi-sessions-alis_at_129.237.98.243_0/default-universe-25140
[fisher.phsx.ku.edu:04445] top: openmpi-sessions-alis_at_129.237.98.243_0
[fisher.phsx.ku.edu:04445] tmp: /tmp
[darwin.phsx.ku.edu:25143] [0,1,0] setting up session dir with
[darwin.phsx.ku.edu:25143] universe default-universe-25140
[darwin.phsx.ku.edu:25143] user alis
[darwin.phsx.ku.edu:25143] host 129.237.98.242
[darwin.phsx.ku.edu:25143] jobid 1
[darwin.phsx.ku.edu:25143] procid 0
[darwin.phsx.ku.edu:25143] procdir: /tmp/openmpi-sessions-alis_at_129.237.98.242_0/default-universe-25140/1/0
[darwin.phsx.ku.edu:25143] jobdir: /tmp/openmpi-sessions-alis_at_129.237.98.242_0/default-universe-25140/1
[darwin.phsx.ku.edu:25143] unidir: /tmp/openmpi-sessions-alis_at_129.237.98.242_0/default-universe-25140
[darwin.phsx.ku.edu:25143] top: openmpi-sessions-alis_at_129.237.98.242_0
[darwin.phsx.ku.edu:25143] tmp: /tmp
[fisher.phsx.ku.edu:04462] [0,1,1] setting up session dir with
[fisher.phsx.ku.edu:04462] universe default-universe-25140
[fisher.phsx.ku.edu:04462] user alis
[fisher.phsx.ku.edu:04462] host 129.237.98.243
[fisher.phsx.ku.edu:04462] jobid 1
[fisher.phsx.ku.edu:04462] procid 1
[fisher.phsx.ku.edu:04462] procdir: /tmp/openmpi-sessions-alis_at_129.237.98.243_0/default-universe-25140/1/1
[fisher.phsx.ku.edu:04462] jobdir: /tmp/openmpi-sessions-alis_at_129.237.98.243_0/default-universe-25140/1
[fisher.phsx.ku.edu:04462] unidir: /tmp/openmpi-sessions-alis_at_129.237.98.243_0/default-universe-25140
[fisher.phsx.ku.edu:04462] top: openmpi-sessions-alis_at_129.237.98.243_0
[fisher.phsx.ku.edu:04462] tmp: /tmp
[darwin.phsx.ku.edu:25140] spawn: in job_state_callback(jobid = 1, state = 0x3)
[darwin.phsx.ku.edu:25140] Info: Setting up debugger process table for applications
  MPIR_being_debugged = 0
  MPIR_debug_gate = 0
  MPIR_debug_state = 1
  MPIR_acquired_pre_main = 0
  MPIR_i_am_starter = 0
  MPIR_proctable_size = 2
  MPIR_proctable:
    (i, host, exe, pid) = (0, 129.237.98.243, cosmomc, 4462)
    (i, host, exe, pid) = (1, 129.237.98.242, cosmomc, 25143)
[darwin.phsx.ku.edu:25141] orted: job_state_callback(jobid = 1, state = 5453392)
[darwin.phsx.ku.edu:25140] spawn: in job_state_callback(jobid = 1, state = 0x4)
[darwin.phsx.ku.edu:25141] orted: job_state_callback(jobid = 1, state = 5389856)
[darwin.phsx.ku.edu:25143] [0,1,0] ompi_mpi_init completed
[fisher.phsx.ku.edu:04462] [0,1,1] ompi_mpi_init completed
[fisher.phsx.ku.edu:04445] orted: job_state_callback(jobid = 1, state = 5449344)
[fisher.phsx.ku.edu:04445] orted: job_state_callback(jobid = 1, state = 5379136)
 Number of MPI processes: 2
[0,1,0][btl_tcp_endpoint.c:559:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=113

---
At this point I have to kill the proc with Ctrl-C.
---
[darwin.phsx.ku.edu:25141] sess_dir_finalize: found job session dir empty - deleting
[darwin.phsx.ku.edu:25141] sess_dir_finalize: univ session dir not empty - leaving
Killed by signal 2.
[darwin.phsx.ku.edu:25140] sess_dir_finalize: proc session dir not empty - leaving
[darwin.phsx.ku.edu:25141] orted: job_state_callback(jobid = 1, state = ORTE_PROC_STATE_ABORTED)
[darwin.phsx.ku.edu:25140] spawn: in job_state_callback(jobid = 1, state = 0xa)
[darwin.phsx.ku.edu:25140] ERROR: A daemon on node 129.237.98.243 failed to start as expected.
[darwin.phsx.ku.edu:25140] ERROR: There may be more information available from
[darwin.phsx.ku.edu:25140] ERROR: the remote shell (see above).
[darwin.phsx.ku.edu:25140] ERROR: The daemon exited unexpectedly with status 255.
mpirun: killing job...
[darwin.phsx.ku.edu:25140] [0,0,0]-[0,0,2] mca_oob_tcp_msg_send_handler: writev failed with errno=104
[darwin.phsx.ku.edu:25140] [0,0,0] ORTE_ERROR_LOG: Connection failed in file pls_base_proxy.c at line 140
forrtl: error (69): process interrupted (SIGINT)
--------------------------------------------------------------------------
WARNING: A process refused to die!
Host: darwin.phsx.ku.edu
PID:  25143
This process may still be running and/or consuming resources.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: A process refused to die!
Host: darwin.phsx.ku.edu
PID:  25143
This process may still be running and/or consuming resources.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: A process refused to die!
Host: darwin.phsx.ku.edu
PID:  25143
This process may still be running and/or consuming resources.
--------------------------------------------------------------------------
[darwin.phsx.ku.edu:25141] sess_dir_finalize: proc session dir not empty - leaving
[darwin.phsx.ku.edu:25141] orted: job_state_callback(jobid = 1, state = ORTE_PROC_STATE_TERMINATED)
[darwin.phsx.ku.edu:25141] sess_dir_finalize: found proc session dir empty - deleting
[darwin.phsx.ku.edu:25141] sess_dir_finalize: found job session dir empty - deleting
[darwin.phsx.ku.edu:25141] sess_dir_finalize: found univ session dir empty - deleting
[darwin.phsx.ku.edu:25141] sess_dir_finalize: top session dir not empty - leaving