Open MPI User's Mailing List Archives

From: Jonathan Underwood (jonathan.underwood_at_[hidden])
Date: 2007-06-11 20:13:04


On 11/06/07, Adrian Knoth <adi_at_[hidden]> wrote:
>
> What's the exact problem? compute-node -> frontend? I don't think you
> have two processes on the frontend node, and even if you do, they should
> use shared memory.

I changed the setup so that only a single process runs on the frontend
node, but this had no effect on the problem. The problem is that the
processes seem unable to communicate data to each other, even though I
can ssh between the machines without any problem (I have set up
passphraseless keys).

> Use tcpdump and/or recompile with debug enabled. In addition, set
> WANT_PEER_DUMP in ompi/mca/btl/tcp/btl_tcp_endpoint.c to 1 (line 120)
> and recompile, thus giving you more debug output.
>
> Depending on your OMPI version, you can also add
>
> mpi_preconnect_all=1
>
> to your ~/.openmpi/mca-params.conf, by this establishing all connections
> during MPI_Init().

I can't use tcpdump as I don't have root access, but I have made the
change to btl_tcp_endpoint.c that you mention, rebuilt Open MPI (make
distclean... ./configure --enable-debug), rebuilt the application
against the new version of Open MPI, and re-ran the program; the edit
and rebuild are sketched below.
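
(Roughly what I did; the exact line number of the define may differ
between Open MPI versions, so treat this as a sketch rather than a
patch:

    /* ompi/mca/btl/tcp/btl_tcp_endpoint.c, near line 120 */
    #define WANT_PEER_DUMP 1   /* changed from 0 for extra per-peer debug output */

    $ make distclean
    $ ./configure --enable-debug ...   # "..." = the same options as my original build
    $ make all install
)
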
This is the output I see (with -np 3, and only 1 slot on the
frontend):

[steinbeck.phys.ucl.ac.uk:08475] [0,0,0] setting up session dir with
[steinbeck.phys.ucl.ac.uk:08475] universe default-universe-8475
[steinbeck.phys.ucl.ac.uk:08475] user jgu
[steinbeck.phys.ucl.ac.uk:08475] host steinbeck.phys.ucl.ac.uk
[steinbeck.phys.ucl.ac.uk:08475] jobid 0
[steinbeck.phys.ucl.ac.uk:08475] procid 0
[steinbeck.phys.ucl.ac.uk:08475] procdir:
/tmp/openmpi-sessions-jgu_at_[hidden]_0/default-universe-8475/0/0
[steinbeck.phys.ucl.ac.uk:08475] jobdir:
/tmp/openmpi-sessions-jgu_at_[hidden]_0/default-universe-8475/0
[steinbeck.phys.ucl.ac.uk:08475] unidir:
/tmp/openmpi-sessions-jgu_at_[hidden]_0/default-universe-8475
[steinbeck.phys.ucl.ac.uk:08475] top:
openmpi-sessions-jgu_at_[hidden]_0
[steinbeck.phys.ucl.ac.uk:08475] tmp: /tmp
[steinbeck.phys.ucl.ac.uk:08475] [0,0,0] contact_file
/tmp/openmpi-sessions-jgu_at_[hidden]_0/default-universe-8475/universe-setup.txt
[steinbeck.phys.ucl.ac.uk:08475] [0,0,0] wrote setup file
[steinbeck.phys.ucl.ac.uk:08475] pls:rsh: local csh: 0, local sh: 1
[steinbeck.phys.ucl.ac.uk:08475] pls:rsh: assuming same remote shell
as local shell
[steinbeck.phys.ucl.ac.uk:08475] pls:rsh: remote csh: 0, remote sh: 1
[steinbeck.phys.ucl.ac.uk:08475] pls:rsh: final template argv:
[steinbeck.phys.ucl.ac.uk:08475] pls:rsh: /usr/bin/ssh <template> orted --debug --debug-daemons --bootproxy 1 --name <template> --num_procs 3 --vpid_start 0 --nodename <template> --universe jgu_at_[hidden]:default-universe-8475 --nsreplica "0.0.0;tcp://128.40.5.39:37256;tcp://192.168.1.1:37256" --gprreplica "0.0.0;tcp://128.40.5.39:37256;tcp://192.168.1.1:37256"
[steinbeck.phys.ucl.ac.uk:08475] pls:rsh: launching on node frontend
[steinbeck.phys.ucl.ac.uk:08475] pls:rsh: frontend is a LOCAL node
[steinbeck.phys.ucl.ac.uk:08475] pls:rsh: changing to directory /homes/jgu
[steinbeck.phys.ucl.ac.uk:08475] pls:rsh: executing: (/cluster/data/jgu/bin/orted) orted --debug --debug-daemons --bootproxy 1 --name 0.0.1 --num_procs 3 --vpid_start 0 --nodename frontend --universe jgu_at_[hidden]:default-universe-8475 --nsreplica "0.0.0;tcp://128.40.5.39:37256;tcp://192.168.1.1:37256" --gprreplica "0.0.0;tcp://128.40.5.39:37256;tcp://192.168.1.1:37256" --set-sid
[BIBINPUTS=.::/amp/
tex// NNTPSERVER=nntp-server.ucl.ac.uk SSH_AGENT_PID=8473
HOSTNAME=steinbeck.phys.ucl.ac.uk BSTINPUTS=.::/amp/tex// TERM=screen
SHELL=/bin/
bash HISTSIZE=1000 TMPDIR=/tmp SSH_CLIENT=128.40.5.249 55312 22
QTDIR=/usr/lib64/qt-3.3 SSH_TTY=/dev/pts/0 USER=jgu
LD_LIBRARY_PATH=:/clust
er/data/jgu/lib:/cluster/data/jgu/lib
LS_COLORS=no=00:fi=00:di=01;34:ln=01;36:pi=40;33:so=01;35:bd=40;33;01:cd=40;33;01:or=01;05;37;41:mi=0
1;05;37;41:ex=01;32:*.cmd=01;32:*.exe=01;32:*.com=01;32:*.btm=01;32:*.bat=01;32:*.sh=01;32:*.csh=01;32:*.tar=01;31:*.tgz=01;31:*.arj=01;31:
*.taz=01;31:*.lzh=01;31:*.zip=01;31:*.z=01;31:*.Z=01;31:*.gz=01;31:*.bz2=01;31:*.bz=01;31:*.tz=01;31:*.rpm=01;31:*.cpio=01;31:*.jpg=01;35:*
.gif=01;35:*.bmp=01;35:*.xbm=01;35:*.xpm=01;35:*.png=01;35:*.tif=01;35:
SSH_AUTH_SOCK=/tmp/ssh-KjHUoC8472/agent.8472 TERMCAP=SC|screen|VT 1
00/ANSI X3.64 virtual terminal:\
        :DO=\E[%dB:LE=\E[%dD:RI=\E[%dC:UP=\E[%dA:bs:bt=\E[Z:\
        :cd=\E[J:ce=\E[K:cl=\E[H\E[J:cm=\E[%i%d;%dH:ct=\E[3g:\
        :do=^J:nd=\E[C:pt:rc=\E8:rs=\Ec:sc=\E7:st=\EH:up=\EM:\
        :le=^H:bl=^G:cr=^M:it#8:ho=\E[H:nw=\EE:ta=^I:is=\E)0:\
        :li#24:co#80:am:xn:xv:LP:sr=\EM:al=\E[L:AL=\E[%dL:\
        :cs=\E[%i%d;%dr:dl=\E[M:DL=\E[%dM:dc=\E[P:DC=\E[%dP:\
        :im=\E[4h:ei=\E[4l:mi:IC=\E[%d@:ks=\E[?1h\E=:\
        :ke=\E[?1l\E>:vi=\E[?25l:ve=\E[34h\E[?25h:vs=\E[34l:\
        :ti=\E[?1049h:te=\E[?1049l:us=\E[4m:ue=\E[24m:so=\E[3m:\
        :se=\E[23m:mb=\E[5m:md=\E[1m:mr=\E[7m:me=\E[m:ms:\
        :Co#8:pa#64:AF=\E[3%dm:AB=\E[4%dm:op=\E[39;49m:AX:\
        :vb=\Eg:G0:as=\E(0:ae=\E(B:\
        :ac=\140\140aaffggjjkkllmmnnooppqqrrssttuuvvwwxxyyzz{{||}}~~..--++,,hhII00:\
        :po=\E[5i:pf=\E[4i:Z0=\E[?3h:Z1=\E[?3l:k0=\E[10~:\
        :k1=\EOP:k2=\EOQ:k3=\EOR:k4=\EOS:k5=\E[15~:k6=\E[17~:\
        :k7=\E[18~:k8=\E[19~:k9=\E[20~:k;=\E[21~:F1=\E[23~:\
        :F2=\E[24~:F3=\EO2P:F4=\EO2Q:F5=\EO2R:F6=\EO2S:\
        :F7=\E[15;2~:F8=\E[17;2~:F9=\E[18;2~:FA=\E[19;2~:kb=^H:\
        :K2=\EOE:kB=\E[Z:*4=\E[3;2~:*7=\E[1;2F:#2=\E[1;2H:\
        :#3=\E[2;2~:#4=\E[1;2D:%c=\E[6;2~:%e=\E[5;2~:%i=\E[1;2C:\
        :kh=\E[1~:@1=\E[1~:kH=\E[4~:@7=\E[4~:kN=\E[6~:kP=\E[5~:\
        :kI=\E[2~:kD=\E[3~:ku=\EOA:kd=\EOB:kr=\EOC:kl=\EOD:km:
KDEDIR=/usr MOZ_PLUGIN_PATH=/usr/local/plugins
MAIL=/var/spool/mail/jgu PATH
=/usr/kerberos/bin:/usr/local/bin64:/usr/local/bin:/bin:/usr/bin:/usr/X11R6/bin:.:/cluster/data/jgu/bin:/cluster/data/jgu/bin
STY=1936.pts-
0.steinbeck INPUTRC=/etc/inputrc
PWD=/cluster/data/jgu/wrk/ethene_hhg_align LANG=en_GB.UTF-8
LM_LICENSE_FILE=/homes/jgu/licenses:2600_at_hadry
a.phys.ucl.ac.uk:27000_at_[hidden]:1700_at_[hidden]
SSH_ASKPASS=/usr/libexec/openssh/gnome-ssh-askpass TEXINPUTS=
.::/amp/tex/styles// SHLVL=3 HOME=/homes/jgu LOGNAME=jgu WINDOW=0
SSH_CONNECTION=128.40.5.249 55312 128.40.5.39 22
LESSOPEN=|/usr/bin/lessp
ipe.sh %s PROMPT_COMMAND=echo -ne
"\033_${USER}@${HOSTNAME%%.*}:${PWD/#$HOME/~}\033\\" GLOBAL=skip
G_BROKEN_FILENAMES=1 NAG_KUSARI_FILE=had
rya.phys.ucl.ac.uk:7733 _=/cluster/data/jgu/bin/mpirun
OMPI_MCA_rds_hostfile_path=/cluster/data/jgu/etc/hostfile
OMPI_MCA_orte_debug=1 OMPI
_MCA_orte_debug_daemons=1 OMPI_MCA_seed=0]
[steinbeck.phys.ucl.ac.uk:08475] pls:rsh: launching on node node0
[steinbeck.phys.ucl.ac.uk:08475] pls:rsh: node0 is a REMOTE node
[steinbeck.phys.ucl.ac.uk:08475] pls:rsh: executing: (//usr/bin/ssh) /usr/bin/ssh node0 orted --debug --debug-daemons --bootproxy 1 --name 0.0.2 --num_procs 3 --vpid_start 0 --nodename node0 --universe jgu_at_[hidden]:default-universe-8475 --nsreplica "0.0.0;tcp://128.40.5.39:37256;tcp://192.168.1.1:37256" --gprreplica "0.0.0;tcp://128.40.5.39:37256;tcp://192.168.1.1:37256"
[BIBINPUTS=.::/amp/tex// NN
TPSERVER=nntp-server.ucl.ac.uk SSH_AGENT_PID=8473
HOSTNAME=steinbeck.phys.ucl.ac.uk BSTINPUTS=.::/amp/tex// TERM=screen
SHELL=/bin/bash HIS
TSIZE=1000 TMPDIR=/tmp SSH_CLIENT=128.40.5.249 55312 22
QTDIR=/usr/lib64/qt-3.3 SSH_TTY=/dev/pts/0 USER=jgu
LD_LIBRARY_PATH=:/cluster/data/
jgu/lib:/cluster/data/jgu/lib
LS_COLORS=no=00:fi=00:di=01;34:ln=01;36:pi=40;33:so=01;35:bd=40;33;01:cd=40;33;01:or=01;05;37;41:mi=01;05;37;
41:ex=01;32:*.cmd=01;32:*.exe=01;32:*.com=01;32:*.btm=01;32:*.bat=01;32:*.sh=01;32:*.csh=01;32:*.tar=01;31:*.tgz=01;31:*.arj=01;31:*.taz=01
;31:*.lzh=01;31:*.zip=01;31:*.z=01;31:*.Z=01;31:*.gz=01;31:*.bz2=01;31:*.bz=01;31:*.tz=01;31:*.rpm=01;31:*.cpio=01;31:*.jpg=01;35:*.gif=01;
35:*.bmp=01;35:*.xbm=01;35:*.xpm=01;35:*.png=01;35:*.tif=01;35:
SSH_AUTH_SOCK=/tmp/ssh-KjHUoC8472/agent.8472 TERMCAP=SC|screen|VT
100/ANSI
X3.64 virtual terminal:\
        :DO=\E[%dB:LE=\E[%dD:RI=\E[%dC:UP=\E[%dA:bs:bt=\E[Z:\
        :cd=\E[J:ce=\E[K:cl=\E[H\E[J:cm=\E[%i%d;%dH:ct=\E[3g:\
        :do=^J:nd=\E[C:pt:rc=\E8:rs=\Ec:sc=\E7:st=\EH:up=\EM:\
        :le=^H:bl=^G:cr=^M:it#8:ho=\E[H:nw=\EE:ta=^I:is=\E)0:\
        :li#24:co#80:am:xn:xv:LP:sr=\EM:al=\E[L:AL=\E[%dL:\
        :cs=\E[%i%d;%dr:dl=\E[M:DL=\E[%dM:dc=\E[P:DC=\E[%dP:\
        :im=\E[4h:ei=\E[4l:mi:IC=\E[%d@:ks=\E[?1h\E=:\
        :ke=\E[?1l\E>:vi=\E[?25l:ve=\E[34h\E[?25h:vs=\E[34l:\
        :ti=\E[?1049h:te=\E[?1049l:us=\E[4m:ue=\E[24m:so=\E[3m:\
        :se=\E[23m:mb=\E[5m:md=\E[1m:mr=\E[7m:me=\E[m:ms:\
        :Co#8:pa#64:AF=\E[3%dm:AB=\E[4%dm:op=\E[39;49m:AX:\
        :vb=\Eg:G0:as=\E(0:ae=\E(B:\
        :ac=\140\140aaffggjjkkllmmnnooppqqrrssttuuvvwwxxyyzz{{||}}~~..--++,,hhII00:\
        :po=\E[5i:pf=\E[4i:Z0=\E[?3h:Z1=\E[?3l:k0=\E[10~:\
        :k1=\EOP:k2=\EOQ:k3=\EOR:k4=\EOS:k5=\E[15~:k6=\E[17~:\
        :k7=\E[18~:k8=\E[19~:k9=\E[20~:k;=\E[21~:F1=\E[23~:\
        :F2=\E[24~:F3=\EO2P:F4=\EO2Q:F5=\EO2R:F6=\EO2S:\
        :F7=\E[15;2~:F8=\E[17;2~:F9=\E[18;2~:FA=\E[19;2~:kb=^H:\
        :K2=\EOE:kB=\E[Z:*4=\E[3;2~:*7=\E[1;2F:#2=\E[1;2H:\
        :#3=\E[2;2~:#4=\E[1;2D:%c=\E[6;2~:%e=\E[5;2~:%i=\E[1;2C:\
        :kh=\E[1~:@1=\E[1~:kH=\E[4~:@7=\E[4~:kN=\E[6~:kP=\E[5~:\
        :kI=\E[2~:kD=\E[3~:ku=\EOA:kd=\EOB:kr=\EOC:kl=\EOD:km:
KDEDIR=/usr MOZ_PLUGIN_PATH=/usr/local/plugins
MAIL=/var/spool/mail/jgu PATH
=/usr/kerberos/bin:/usr/local/bin64:/usr/local/bin:/bin:/usr/bin:/usr/X11R6/bin:.:/cluster/data/jgu/bin:/cluster/data/jgu/bin
STY=1936.pts-
0.steinbeck INPUTRC=/etc/inputrc
PWD=/cluster/data/jgu/wrk/ethene_hhg_align LANG=en_GB.UTF-8
LM_LICENSE_FILE=/homes/jgu/licenses:2600_at_hadry
a.phys.ucl.ac.uk:27000_at_[hidden]:1700_at_[hidden]
SSH_ASKPASS=/usr/libexec/openssh/gnome-ssh-askpass TEXINPUTS=
.::/amp/tex/styles// SHLVL=3 HOME=/homes/jgu LOGNAME=jgu WINDOW=0
SSH_CONNECTION=128.40.5.249 55312 128.40.5.39 22
LESSOPEN=|/usr/bin/lessp
ipe.sh %s PROMPT_COMMAND=echo -ne
"\033_${USER}@${HOSTNAME%%.*}:${PWD/#$HOME/~}\033\\" GLOBAL=skip
G_BROKEN_FILENAMES=1 NAG_KUSARI_FILE=had
rya.phys.ucl.ac.uk:7733 _=/cluster/data/jgu/bin/mpirun
OMPI_MCA_rds_hostfile_path=/cluster/data/jgu/etc/hostfile
OMPI_MCA_orte_debug=1 OMPI
_MCA_orte_debug_daemons=1 OMPI_MCA_seed=0]
[steinbeck.phys.ucl.ac.uk:08476] [0,0,1] setting up session dir with
[steinbeck.phys.ucl.ac.uk:08476] universe default-universe-8475
[steinbeck.phys.ucl.ac.uk:08476] user jgu
[steinbeck.phys.ucl.ac.uk:08476] host frontend
[steinbeck.phys.ucl.ac.uk:08476] jobid 0
[steinbeck.phys.ucl.ac.uk:08476] procid 1
[steinbeck.phys.ucl.ac.uk:08476] procdir:
/tmp/openmpi-sessions-jgu_at_frontend_0/default-universe-8475/0/1
[steinbeck.phys.ucl.ac.uk:08476] jobdir:
/tmp/openmpi-sessions-jgu_at_frontend_0/default-universe-8475/0
[steinbeck.phys.ucl.ac.uk:08476] unidir:
/tmp/openmpi-sessions-jgu_at_frontend_0/default-universe-8475
[steinbeck.phys.ucl.ac.uk:08476] top: openmpi-sessions-jgu_at_frontend_0
[steinbeck.phys.ucl.ac.uk:08476] tmp: /tmp
Daemon [0,0,1] checking in as pid 8476 on host frontend
[node0.cluster:08628] [0,0,2] setting up session dir with
[node0.cluster:08628] universe default-universe-8475
[node0.cluster:08628] user jgu
[node0.cluster:08628] host node0
[node0.cluster:08628] jobid 0
[node0.cluster:08628] procid 2
[node0.cluster:08628] procdir:
/tmp/openmpi-sessions-jgu_at_node0_0/default-universe-8475/0/2
[node0.cluster:08628] jobdir:
/tmp/openmpi-sessions-jgu_at_node0_0/default-universe-8475/0
[node0.cluster:08628] unidir:
/tmp/openmpi-sessions-jgu_at_node0_0/default-universe-8475
[node0.cluster:08628] top: openmpi-sessions-jgu_at_node0_0
[node0.cluster:08628] tmp: /tmp
Daemon [0,0,2] checking in as pid 8628 on host node0
[steinbeck.phys.ucl.ac.uk:08476] [0,0,1] orted: received launch callback
[node0.cluster:08628] [0,0,2] orted: received launch callback
[steinbeck.phys.ucl.ac.uk:08478] [0,1,0] setting up session dir with
[steinbeck.phys.ucl.ac.uk:08478] universe default-universe-8475
[steinbeck.phys.ucl.ac.uk:08478] user jgu
[steinbeck.phys.ucl.ac.uk:08478] host frontend
[steinbeck.phys.ucl.ac.uk:08478] jobid 1
[steinbeck.phys.ucl.ac.uk:08478] procid 0
[steinbeck.phys.ucl.ac.uk:08478] procdir:
/tmp/openmpi-sessions-jgu_at_frontend_0/default-universe-8475/1/0
[steinbeck.phys.ucl.ac.uk:08478] jobdir:
/tmp/openmpi-sessions-jgu_at_frontend_0/default-universe-8475/1
[steinbeck.phys.ucl.ac.uk:08478] unidir:
/tmp/openmpi-sessions-jgu_at_frontend_0/default-universe-8475
[steinbeck.phys.ucl.ac.uk:08478] top: openmpi-sessions-jgu_at_frontend_0
[steinbeck.phys.ucl.ac.uk:08478] tmp: /tmp
[node0.cluster:08650] [0,1,1] setting up session dir with
[node0.cluster:08650] universe default-universe-8475
[node0.cluster:08650] user jgu
[node0.cluster:08650] host node0
[node0.cluster:08650] jobid 1
[node0.cluster:08650] procid 1
[node0.cluster:08650] procdir:
/tmp/openmpi-sessions-jgu_at_node0_0/default-universe-8475/1/1
[node0.cluster:08650] jobdir:
/tmp/openmpi-sessions-jgu_at_node0_0/default-universe-8475/1
[node0.cluster:08650] unidir:
/tmp/openmpi-sessions-jgu_at_node0_0/default-universe-8475
[node0.cluster:08650] top: openmpi-sessions-jgu_at_node0_0
[node0.cluster:08650] tmp: /tmp
[node0.cluster:08651] [0,1,2] setting up session dir with
[node0.cluster:08651] universe default-universe-8475
[node0.cluster:08651] user jgu
[node0.cluster:08651] host node0
[node0.cluster:08651] jobid 1
[node0.cluster:08651] procid 2
[node0.cluster:08651] procdir:
/tmp/openmpi-sessions-jgu_at_node0_0/default-universe-8475/1/2
[node0.cluster:08651] jobdir:
/tmp/openmpi-sessions-jgu_at_node0_0/default-universe-8475/1
[node0.cluster:08651] unidir:
/tmp/openmpi-sessions-jgu_at_node0_0/default-universe-8475
[node0.cluster:08651] top: openmpi-sessions-jgu_at_node0_0
[node0.cluster:08651] tmp: /tmp
[steinbeck.phys.ucl.ac.uk:08475] spawn: in job_state_callback(jobid = 1, state = 0x4)
[steinbeck.phys.ucl.ac.uk:08475] Info: Setting up debugger process
table for applications
  MPIR_being_debugged = 0
  MPIR_debug_gate = 0
  MPIR_debug_state = 1
  MPIR_acquired_pre_main = 0
  MPIR_i_am_starter = 0
  MPIR_proctable_size = 3
  MPIR_proctable:
    (i, host, exe, pid) = (0, frontend,
/export/data/jgu/wrk/ethene_hhg_align/align-cls-mpi, 8478)
    (i, host, exe, pid) = (1, node0,
/export/data/jgu/wrk/ethene_hhg_align/align-cls-mpi, 8650)
    (i, host, exe, pid) = (2, node0,
/export/data/jgu/wrk/ethene_hhg_align/align-cls-mpi, 8651)
[steinbeck.phys.ucl.ac.uk:08478] [0,1,0] ompi_mpi_init completed
[frontend][0,1,0][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=110
[frontend][0,1,0][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=110
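
Incidentally, errno 110 on this (Linux) system is ETIMEDOUT, i.e. the
TCP connect() attempt timed out rather than being refused. A trivial
way to confirm the mapping (plain C, nothing Open MPI specific):

    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        /* 110 is the errno reported by btl_tcp_endpoint.c above */
        printf("errno 110: %s\n", strerror(110));
        return 0;
    }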