
Subject: Re: [OMPI users] delay in launch?
From: Reuti (reuti_at_[hidden])
Date: 2009-01-15 17:28:51


Am 15.01.2009 um 16:20 schrieb Jeff Dusenberry:

> I'm trying to launch multiple xterms under OpenMPI 1.2.8 and the
> SGE job scheduler for purposes of running a serial debugger. I'm
> experiencing file-locking problems on the .Xauthority file.
>
> I tried to fix this by asking for a delay between successive
> launches, to reduce the chances of contention for the lock by:
>
> ~$ qrsh -pe mpi 4 -P CIS /share/apps/openmpi/bin/mpiexec --mca
> pls_rsh_debug 1 --mca pls_rsh_delay 5 xterm
>
> The 'pls_rsh_delay 5' parameter seems to have no effect. I tried
> replacing 'pls_rsh_debug 1' with 'orte_debug 1', which gave me
> additional debugging output, but didn't fix the file locking problem.
>
> Sometimes the above commands will work and I will get all 4 xterms,
> but more often I will get an error:
>
> /usr/bin/X11/xauth: error in locking authority file /export/home/
> duse/.Xauthority
>
> followed by
>
> X11 connection rejected because of wrong authentication.
> xterm Xt error: Can't open display: localhost:11.0
>
> and one or more of the xterms will fail to open.
>
> Am I missing something? Is there another debug flag I need to
> set? Any suggestions for a better way to do this would be
> appreciated.

You are right that it's neither Open MPI's nor SGE's fault, but a
race condition in the SSH startup. You defined SSH with X11
forwarding in SGE (qconf -mconf), right? Then you first have an ssh
connection from your workstation to the login machine, then one from
the login machine to the node where mpiexec runs, and then one for
each slave node (which means an additional one on the machine where
mpiexec is already running).
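
Each of these sshds sets up its own proxy display, but they all add
their cookie to the same authority file; a quick way to see this from
inside one of the forwarded sessions (assuming the default
~/.Xauthority location):

echo $DISPLAY     # per-sshd proxy display, e.g. localhost:10.0, localhost:11.0
xauth list        # every hop appends its cookie to the one shared ~/.Xauthority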

Although it might be possible to give every started sshd a unique
.Xauthority file, it's not straightforward to implement due to SGE's
startup of the daemons: you would need a sophisticated ~/.ssh/rc to
create the files at different locations and to make the forthcoming
xterm use them.
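
Just as a rough, untested sketch of what such a ~/.ssh/rc could look
like (it assumes X11UseLocalhost=yes, i.e. DISPLAY looks like
localhost:11.0, and writes each connection's cookie into a
per-display authority file; the hard part left out here is getting
the matching XAUTHORITY setting into the environment of the xterm
started later, which is exactly what SGE's daemon startup makes
awkward):

$ cat ~/.ssh/rc
#!/bin/sh
# sshd hands this script "proto cookie" on stdin when X11 forwarding
# is in use and expects it to run xauth itself
if read proto cookie && [ -n "$DISPLAY" ]; then
     dpy=`echo $DISPLAY | cut -d: -f2`        # e.g. "11.0"
     # one authority file per forwarded display instead of ~/.Xauthority
     XAUTHORITY=$HOME/.Xauthority-$dpy
     export XAUTHORITY
     echo add unix:$dpy $proto $cookie | xauth -q -
fi
# note: the XAUTHORITY exported here does not reach the later xterm;
# a wrapper would have to derive the same path from $DISPLAY again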

If you just want to open a bunch of xterms, you could also use a
script like this:

$ cat multi.sh
#!/bin/sh
# get the SGE environment, so that qrsh is found
. /usr/sge/default/common/settings.sh
# start one xterm per granted node via SGE's own remote startup,
# with a small delay between the startups
for node in `cat $TMPDIR/machines`; do
     qrsh -inherit $node xterm &
     sleep 1
done
wait

The $TMPDIR/machines file is usually created for MPICH(1)'s parallel
startup, but not for Open MPI, as it doesn't need it. Nevertheless,
you could have it created for your Open MPI PE, or create another PE,
with the line:

$ qconf -sp mpi
...
start_proc_args /usr/sge/mpi/startmpi.sh $pe_hostfile
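
Purely as an illustration (the PE name, the slot count and the script
path are just taken over from above and may differ on your cluster),
the relevant part of such a PE could look like the following; note
that control_slaves must be TRUE anyway for the qrsh -inherit calls
in the script to work:

$ qconf -sp mpi           (to change it: "qconf -mp mpi")
pe_name            mpi
slots              999
user_lists         NONE
xuser_lists        NONE
start_proc_args    /usr/sge/mpi/startmpi.sh $pe_hostfile
stop_proc_args     /usr/sge/mpi/stopmpi.sh
allocation_rule    $round_robin
control_slaves     TRUE
job_is_first_task  FALSE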

When you run the script with "qrsh -pe mpi 4 ~/multi.sh", you should
get the xterms.

(It might be advisable to define "execd_params
ENABLE_ADDGRP_KILL=1" in your SGE configuration, so that SGE can
kill all the created xterm processes.)
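
In case you haven't set execd_params before: it's part of the
cluster configuration, i.e. (just a sketch, with the value from
above):

$ qconf -mconf
...
execd_params                 ENABLE_ADDGRP_KILL=1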

HTH - Reuti