Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Handling output of processes
From: Ralph Castain (rhc_at_[hidden])
Date: 2009-02-03 17:03:47


Hi Jody

Well, the problem with both the output filename and the xterm option
was that I wasn't passing them back to the remote daemons under the
ssh launch environment. I should have that corrected now - things will
hopefully work with any tarball of r20407 or above.

Let me know...
Ralph

On Feb 3, 2009, at 11:34 AM, Ralph Castain wrote:

> Ah! I know the problem - forgot you are running under ssh, so the
> environment doesn't get passed.
>
> I'll have to find a way to pass the output filename to the backend
> nodes...should have it later today.
>
>
> On Feb 3, 2009, at 11:09 AM, jody wrote:
>
>> Hi Ralph
>>>>
>>>> --output-filename
>>>> It creates files, but only for the local processes:
>>>> [jody_at_localhost neander]$ mpirun -np 8 -hostfile testhosts \
>>>>     --output-filename gnana ./MPITest
>>>> ... output ...
>>>> [jody_at_localhost neander]$ ls -l gna*
>>>> -rw-r--r-- 1 jody morpho 549 2009-02-03 18:02 gnana.0
>>>> -rw-r--r-- 1 jody morpho 549 2009-02-03 18:02 gnana.1
>>>> -rw-r--r-- 1 jody morpho 549 2009-02-03 18:02 gnana.2
>>>> (I set slots=3 on my workstation)
>>>>
>>>
>>> Did you give a location that is on an NFS mount?
>>
>> Yes, I started mpirun on a drive which all the remote nodes mount
>> via NFS.
>>>
>>> I'm willing to bet the files are being created - they are on your
>>> remote nodes. The daemons create their own local files for output
>>> from their local procs. We decided to do this for scalability
>>> reasons - if we have mpirun open all the output files, then you
>>> could easily hit the file descriptor limit on that node and cause
>>> the job not to launch.
>>>
>>> Check your remote nodes and see if the files are there.
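>>>
>>> As a rough sketch (assuming your hostfile lists one host entry per
>>> line and passwordless ssh works), a loop like this would show
>>> whether they exist on each node:
>>>
>>>     for host in `awk '!/^#/ {print $1}' testhosts`; do
>>>         echo "=== $host ==="
>>>         ssh $host 'ls -l gnana.* 2>/dev/null || echo "  none here"'
>>>     done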
>>
>> Where would I have to look? They are not in my home directories on
>> the nodes.
>>
>>>
>>> I can fix that easily enough - we'll just test to see if the xterm
>>> option has been set, and add the -X to ssh if so.
>>>
>>> Note that you can probably set this yourself right now with
>>>     -mca plm_rsh_agent "ssh -X"
>>
>> I tried this, but it didn't work, though we may be getting there:
>>
>> [jody_at_localhost neander]$ mpirun -np 8 -mca plm_rsh_agent "ssh -X" \
>>     -hostfile testhosts --xterm 2,3,4,5! -x DISPLAY ./MPITest
>> Warning: No xauth data; using fake authentication data for X11 forwarding.
>> Warning: No xauth data; using fake authentication data for X11 forwarding.
>> Warning: No xauth data; using fake authentication data for X11 forwarding.
>> ...
>> => The 3 remote processes (3,4,5) tried to get access.
>>
>> I remember having had an xauth problem like this in another setup
>> before, but I've forgotten how to solve it. I'll try to find out and
>> get back to you once I've figured it out.
>>
>> BTW: calling an X application over SSH works, e.g.
>> ssh -X node_00 xclock
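>>
>> One thing I still want to try (just a guess at this point): the
>> "fake authentication data" warning comes from untrusted X11
>> forwarding, so trusted forwarding might behave differently, e.g.
>>
>>     ssh -Y node_00 xclock
>>
>> (or the equivalent "ForwardX11Trusted yes" in ~/.ssh/config).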
>>
>>
>> Jody
>>>
>>>>
>>>>
>>>> So here is what I currently do to get my xterms running:
>>>> On my workstation I call
>>>>     xhost + <hostname>
>>>> for all machines in my hostfile, to allow them to use X on my
>>>> workstation. Then I set my DISPLAY variable to point to my
>>>> workstation:
>>>>     export DISPLAY=<mymachine>:0.0
>>>> Finally, I call mpirun with the -x option (to export the DISPLAY
>>>> variable to all nodes):
>>>>     mpirun -np 4 -hostfile myfiles -x DISPLAY run_xterm.sh \
>>>>         MyApplication arg1 arg2
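>>>>
>>>> As a side note, the xhost calls can be done in one loop - a sketch,
>>>> assuming one host entry per line in the hostfile:
>>>>
>>>>     for h in `awk '!/^#/ {print $1}' myfiles`; do xhost +$h; done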
>>>>
>>>> Here run_xterm.sh is a shell script which creates a useful title
>>>> for the xterm window and calls the application with all its
>>>> arguments (-hold leaves the xterm open after the program
>>>> terminates):
>>>> #!/bin/sh -f
>>>>
>>>> # feedback for the command line
>>>> echo "Running on node `hostname`"
>>>>
>>>> # version 1.3 sets the documented OMPI_COMM_WORLD_RANK; fall back
>>>> # to the undocumented 1.2 variable if it is empty
>>>> export ID=$OMPI_COMM_WORLD_RANK
>>>> if [ "X$ID" = "X" ]; then
>>>>     export ID=$OMPI_MCA_ns_nds_vpid
>>>> fi
>>>>
>>>> export TITLE="node #$ID"
>>>> # start the terminal; "$@" (not $*) keeps arguments with spaces intact
>>>> xterm -T "$TITLE" -hold -e "$@"
>>>>
>>>> exit 0
>>>>
>>>> (I have similar scripts to run gdb or valgrind in xterm windows;
>>>> see the sketch below.)
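>>>>
>>>> A gdb variant only differs in the last line - a sketch (call it
>>>> run_gdb_xterm.sh; same rank-variable logic as above):
>>>>
>>>> #!/bin/sh -f
>>>> export ID=$OMPI_COMM_WORLD_RANK
>>>> if [ "X$ID" = "X" ]; then
>>>>     export ID=$OMPI_MCA_ns_nds_vpid
>>>> fi
>>>> # run the application under gdb inside the xterm; gdb's --args
>>>> # passes the program and its arguments through unchanged
>>>> xterm -T "node #$ID (gdb)" -hold -e gdb --args "$@"
>>>> exit 0
>>>>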
>>>> I know that the 'xhost +' is a horror for certain sysadmins, but I
>>>> feel quite safe, because the machines listed in my hostfile are not
>>>> accessible from outside our department.
>>>>
>>>> I haven't found any alternative for getting nice xterms when I
>>>> can't use 'ssh -X'.
>>>>
>>>> To come back to the '--xterm' option: I just ran my xterm script
>>>> after doing the above xhost and DISPLAY steps, and it worked - all
>>>> local and remote processes created their xterm windows. (In other
>>>> words, the environment was set up so that my remote nodes could use
>>>> xterms on my workstation.)
>>>>
>>>> Immediately thereafter I called the same application with
>>>>     mpirun -np 8 -hostfile testhosts --xterm 2,3,4,5! -x DISPLAY ./MPITest
>>>> but still, only the local process (#2) created an xterm.
>>>>
>>>>
>>>> Do you think it would be possible to have Open MPI make its ssh
>>>> connections with '-X', or are there technical or security-related
>>>> objections?
>>>>
>>>> Regards
>>>>
>>>> Jody
>>>>
>>>> On Mon, Feb 2, 2009 at 4:47 PM, Ralph Castain <rhc_at_[hidden]> wrote:
>>>>>
>>>>> On Feb 2, 2009, at 2:55 AM, jody wrote:
>>>>>
>>>>>> Hi Ralph
>>>>>> The new options are great stuff!
>>>>>> Following your suggestion, I downloaded and installed
>>>>>>
>>>>>> http://www.open-mpi.org/nightly/trunk/openmpi-1.4a1r20392.tar.gz
>>>>>>
>>>>>> and tested the new options. (I have a simple cluster of 8
>>>>>> machines over TCP.) Not everything worked as specified, though:
>>>>>> * timestamp-output : works
>>>>>
>>>>> good!
>>>>>
>>>>>>
>>>>>> * xterm : doesn't work completely -
>>>>>> with a comma-separated rank list, an xterm is opened only for the
>>>>>> local processes. The other processes (the ones on remote machines)
>>>>>> only write to the stdout of the calling window.
>>>>>> (Just to be sure, I started my own script for opening separate
>>>>>> xterms - that did work for the remote processes, too.)
>>>>>
>>>>> This is a problem we wrestled with for some time. The issue is
>>>>> that we really aren't comfortable modifying the DISPLAY envar on
>>>>> the remote nodes like you do in your script. It is fine for a user
>>>>> to do whatever they want, but for OMPI to do it...that's another
>>>>> matter. We can't even know for sure what to do because of the wide
>>>>> range of scenarios that might occur (e.g., is mpirun local to you,
>>>>> or on a remote node connected to you via xterm, or...?).
>>>>>
>>>>> What you (the user) need to do is ensure that X11 is set up
>>>>> properly so that an X window opened on the remote host is
>>>>> displayed on your screen. In this case, I believe you have to
>>>>> enable X forwarding - I'm not an xterm expert, so I can't advise
>>>>> you on how to do this. I suspect you may already know - in which
>>>>> case, can you please pass it along and I'll add it to our docs? :-)
>>>>>
>>>>>>
>>>>>>
>>>>>> If a '-1' is given instead of a list of ranks, it fails (locally
>>>>>> and with remotes):
>>>>>> [jody_at_localhost neander]$ mpirun -np 4 --xterm -1 ./MPITest
>>>>>>
>>>>>>
>>>>>> --------------------------------------------------------------------------
>>>>>> Sorry! You were supposed to get help about:
>>>>>> orte-odls-base:xterm-rank-out-of-bounds
>>>>>> from the file:
>>>>>> help-odls-base.txt
>>>>>> But I couldn't find any file matching that name. Sorry!
>>>>>>
>>>>>>
>>>>>> --------------------------------------------------------------------------
>>>>>>
>>>>>>
>>>>>> --------------------------------------------------------------------------
>>>>>> mpirun was unable to start the specified application as it
>>>>>> encountered an error
>>>>>> on node localhost. More information may be available above.
>>>>>>
>>>>>>
>>>>>> --------------------------------------------------------------------------
>>>>>
>>>>>
>>>>> Fixed as of r20398 - this was a bug, had an if statement out of
>>>>> sequence.
>>>>>
>>>>>
>>>>>>
>>>>>> * output-filename : doesn't work here:
>>>>>> [jody_at_localhost neander]$ mpirun -np 4 --output-filename gnagna \
>>>>>>     ./MPITest
>>>>>> [jody_at_localhost neander]$ ls -l gna*
>>>>>> -rw-r--r-- 1 jody morpho 549 2009-02-02 09:07 gnagna.%10lu
>>>>>>
>>>>>> There is output from the processes on remote machines on stdout,
>>>>>> but none from the local ones.
>>>>>
>>>>> Fixed as of r20400 - the format string syntax was okay with some
>>>>> compilers, but not others.
>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> A question about installing: I installed the usual way
>>>>>> (configure, make all install), but the new man pages apparently
>>>>>> weren't copied to their destination: if I do 'man mpirun' I see
>>>>>> the contents of an old man page (without the new options). I had
>>>>>> to do
>>>>>>     less /opt/openmpi-1.4a1r20394/share/man/man1/mpirun.1
>>>>>> to see them.
>>>>>
>>>>> Strange - the install should put them in the right place, but I
>>>>> wonder if you updated your manpath to point at it?
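>>>>>
>>>>> For example (a sketch - substitute whatever --prefix you
>>>>> configured with; the path below is the one from your command
>>>>> above):
>>>>>
>>>>>     export MANPATH=/opt/openmpi-1.4a1r20394/share/man:$MANPATH
>>>>>     man mpirun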
>>>>>
>>>>>>
>>>>>>
>>>>>> About the xterm option: when the application ends, all xterms are
>>>>>> closed immediately. (When doing things 'by hand' I used the -hold
>>>>>> option for xterm.) Would it be possible to add this feature to
>>>>>> your xterm option? Perhaps by adding a '!' at the end of the rank
>>>>>> list?
>>>>>
>>>>> Done! A "!" at the end of the list will activate -hold as of
>>>>> r20398.
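>>>>>
>>>>> e.g., to hold the windows open for those ranks:
>>>>>
>>>>>     mpirun -np 8 --xterm 2,3,4,5! ./MPITest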
>>>>>
>>>>>>
>>>>>>
>>>>>> About orte-iof: with the new version it works, but no matter
>>>>>> which rank I specify, it only prints out rank 0's output:
>>>>>> [jody_at_localhost ~]$ orte-iof --pid 31049 --rank 4 --stdout
>>>>>> [localhost]I am #0/9 before the barrier
>>>>>>
>>>>>
>>>>> The problem here is that the option name changed from "rank" to
>>>>> "ranks", since you can now specify any number of ranks as
>>>>> comma-separated ranges. I have updated orte-iof so it will
>>>>> gracefully fail if you provide an unrecognized command line option
>>>>> and output the "help" detailing the accepted options.
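>>>>>
>>>>> So your earlier command would become something like
>>>>>
>>>>>     orte-iof --pid 31049 --ranks 4 --stdout
>>>>>
>>>>> and comma-separated ranges such as "--ranks 2,4-6" should work as
>>>>> well.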
>>>>>
>>>>>
>>>>>>
>>>>>>
>>>>>> Thanks
>>>>>>
>>>>>> Jody
>>>>>>
>>>>>> On Sun, Feb 1, 2009 at 10:49 PM, Ralph Castain <rhc_at_[hidden]>
>>>>>> wrote:
>>>>>>>
>>>>>>> I'm afraid we discovered a bug in optimized builds with r20392.
>>>>>>> Please use any tarball with r20394 or above.
>>>>>>>
>>>>>>> Sorry for the confusion
>>>>>>> Ralph
>>>>>>>
>>>>>>>
>>>>>>> On Feb 1, 2009, at 5:27 AM, Jeff Squyres wrote:
>>>>>>>
>>>>>>>> On Jan 31, 2009, at 11:39 AM, Ralph Castain wrote:
>>>>>>>>
>>>>>>>>> For anyone following this thread:
>>>>>>>>>
>>>>>>>>> I have completed the IOF options discussed below. Specifically,
>>>>>>>>> I have added the following:
>>>>>>>>>
>>>>>>>>> * a new "timestamp-output" option that timestamps each line
>>>>>>>>> of output
>>>>>>>>>
>>>>>>>>> * a new "output-filename" option that redirects each proc's
>>>>>>>>> output to a separate rank-named file.
>>>>>>>>>
>>>>>>>>> * a new "xterm" option that redirects the output of the
>>>>>>>>> specified ranks to a separate xterm window.
>>>>>>>>>
>>>>>>>>> You can obtain a copy of the updated code at:
>>>>>>>>>
>>>>>>>>> http://www.open-mpi.org/nightly/trunk/openmpi-1.4a1r20392.tar.gz
>>>>>>>>
>>>>>>>> Sweet stuff. :-)
>>>>>>>>
>>>>>>>> Note that the URL/tarball that Ralph cites is a nightly
>>>>>>>> snapshot and will expire after a while -- we only keep the 5
>>>>>>>> most recent nightly tarballs available. You can find Ralph's
>>>>>>>> new IOF stuff in any 1.4a1 nightly tarball after the one he
>>>>>>>> cited above. Note that the last part of the tarball name refers
>>>>>>>> to the subversion commit number (which increases monotonically);
>>>>>>>> any 1.4 nightly snapshot tarball beyond "r20392" will contain
>>>>>>>> this new IOF stuff. Here's where to get our nightly snapshot
>>>>>>>> tarballs:
>>>>>>>>
>>>>>>>> http://www.open-mpi.org/nightly/trunk/
>>>>>>>>
>>>>>>>> Don't read anything into the "1.4" version number -- we've
>>>>>>>> just bumped the version number internally to be different than
>>>>>>>> the current stable series (1.3). We haven't yet branched for
>>>>>>>> the v1.4 series; hence, "1.4a1" currently refers to our
>>>>>>>> development trunk.
>>>>>>>>
>>>>>>>> --
>>>>>>>> Jeff Squyres
>>>>>>>> Cisco Systems
>>>>>>>>