
Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] delay in launch?
From: Jeff Dusenberry (jdusenberry_at_[hidden])
Date: 2009-01-16 16:20:20


Reuti wrote:
> Am 15.01.2009 um 16:20 schrieb Jeff Dusenberry:
>
>> I'm trying to launch multiple xterms under OpenMPI 1.2.8 and the SGE
>> job scheduler for purposes of running a serial debugger. I'm
>> experiencing file-locking problems on the .Xauthority file.
>>
>> I tried to fix this by asking for a delay between successive launches,
>> to reduce the chances of contention for the lock by:
>>
>> ~$ qrsh -pe mpi 4 -P CIS /share/apps/openmpi/bin/mpiexec --mca
>> pls_rsh_debug 1 --mca pls_rsh_delay 5 xterm
>>
>> The 'pls_rsh_delay 5' parameter seems to have no effect. I tried
>> replacing 'pls_rsh_debug 1' with 'orte_debug 1', which gave me
>> additional debugging output, but didn't fix the file locking problem.
>>
>> Sometimes the above commands will work and I will get all 4 xterms,
>> but more often I will get an error:
>>
>> /usr/bin/X11/xauth: error in locking authority file
>> /export/home/duse/.Xauthority
>>
>> followed by
>>
>> X11 connection rejected because of wrong authentication.
>> xterm Xt error: Can't open display: localhost:11.0
>>
>> and one or more of the xterms will fail to open.
>>
>> Am I missing something? Is there another debug flag I need to set?
>> Any suggestions for a better way to do this would be appreciated.
>
> You are right that it's neither Open MPI's nor SGE's fault, but a race
> condition in the SSH startup. You defined SSH with X11 forwarding in SGE
> (qconf -mconf) - right? Then you first have an ssh connection from your
> workstation to the login machine, then one from the login machine to the
> node where mpiexec runs, and then one for each slave node (meaning an
> additional one on the machine where mpiexec is already running).

Yes, that's all correct. Clearly not very efficient, but I haven't had
any luck getting xauth or xhost to work more directly.

> Although it might be possible to give every started sshd a unique
> .Xauthority file, it's not straightforward to implement due to SGE's
> startup of the daemons, and you would need a sophisticated ~/.ssh/rc to
> create the files at a different location and use them in the resulting xterm.

Thanks, that helped a lot, but I still can't quite get it to work. I do
want the xterms to run mpi jobs. I tried this sshrc script (modified
from the sshd man page):

# Per-tty authority file, so concurrent sshd instances don't contend
# for the lock on a single shared ~/.Xauthority
XAUTHORITY=/local/$USER/.Xauthority${SSH_TTY##*/}
export XAUTHORITY
# sshd passes "proto cookie" on stdin when X11 forwarding is enabled
if read proto cookie && [ -n "$DISPLAY" ]; then
         if [ "`echo $DISPLAY | cut -c1-10`" = 'localhost:' ]; then
                 # X11UseLocalhost=yes
                 echo add "unix:`echo $DISPLAY | cut -c11-`" "$proto" "$cookie"
         else
                 # X11UseLocalhost=no
                 echo add "$DISPLAY" "$proto" "$cookie"
         fi | xauth -q -
fi

and I am able to create a unique .Xauthority for each process locally on
each node when I log in via ssh directly. Unfortunately, I still have to
provide another definition of XAUTHORITY somewhere in my startup scripts -
the one set above is not visible outside the sshrc execution.
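One way to carry it over (just a sketch, assuming the /local/$USER location
from the sshrc above) is to re-derive the same per-tty file name in a login
startup script such as ~/.profile, so interactive shells inherit it:

```shell
# Hypothetical ~/.profile fragment: rebuild the per-tty XAUTHORITY name
# that the sshrc chose, so child processes of the login shell see it too.
# The /local/$USER location is an assumption carried over from the sshrc.
if [ -n "$SSH_TTY" ] && [ -f "/local/$USER/.Xauthority${SSH_TTY##*/}" ]; then
        XAUTHORITY="/local/$USER/.Xauthority${SSH_TTY##*/}"
        export XAUTHORITY
fi
```

The ${SSH_TTY##*/} expansion strips everything up to the last slash, so
/dev/pts/7 yields the suffix 7, matching what the sshrc produced.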

When I try to run this under qrsh/mpiexec, it acts as if the SSH_TTY
environment variable is not set (is that due to SGE?), and we're back
to a race condition. Is there another variable I can use in the SGE/MPI
context? I also don't understand where I would define the XAUTHORITY
variable when running under mpiexec.
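As a sketch of one possible fallback (untested here, and the location is
the same assumption as before): when SSH_TTY is empty, something
process-unique such as hostname plus the shell's PID could stand in for
the tty suffix:

```shell
# Fall back to hostname+PID when qrsh allocates no tty (SSH_TTY unset).
# /local/$USER is carried over from the earlier sshrc sketch.
suffix=${SSH_TTY##*/}
if [ -z "$suffix" ]; then
        suffix="$(uname -n).$$"
fi
XAUTHORITY="/local/$USER/.Xauthority$suffix"
export XAUTHORITY
```

That still leaves the question of propagating the value into the
xterm's environment, which the sshrc alone doesn't solve.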

I'm not sure this is the best way to approach this - I was originally
hoping that the mpiexec call would have a way to introduce a delay
between successive launches but that doesn't seem to be working either.
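For what it's worth, a wrapper script could impose the stagger by hand
instead of relying on an MCA parameter. This is only a sketch: the rank
variable names are assumptions (Open MPI 1.2 exposed the rank as
OMPI_MCA_ns_nds_vpid, while later releases use OMPI_COMM_WORLD_RANK), and
the 2-second spacing is arbitrary:

```shell
# Sketch of a launch wrapper: each rank sleeps in proportion to its rank
# before starting its command, spreading out the xauth lock traffic.
launch_staggered() {
        rank=${OMPI_COMM_WORLD_RANK:-${OMPI_MCA_ns_nds_vpid:-0}}
        sleep $((rank * 2))
        "$@"
}

# e.g. in a script handed to mpiexec instead of plain xterm:
# launch_staggered xterm
```

This only reduces the window for the race rather than closing it, but it
doesn't depend on any pls_rsh parameter taking effect.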

Jeff