Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] Handling output of processes
From: Ralph Castain (rhc_at_[hidden])
Date: 2009-02-03 13:34:17


Ah! I know the problem - I forgot you are running under ssh, so the
environment doesn't get passed to the remote nodes.

I'll have to find a way to pass the output filename to the backend
nodes...should have it later today.
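
A minimal sketch of what "the environment doesn't get passed" means here: under
the ssh launcher, the local environment is generally not forwarded to the remote
nodes; individual variables can be passed explicitly with mpirun's -x option
(the hostfile and program names below are the ones used elsewhere in this thread):

    # only variables listed with -x reach the remote ranks
    mpirun -np 8 -hostfile testhosts -x DISPLAY ./MPITest

Settings internal to mpirun, such as the output filename here, have to be
carried to the backend daemons by mpirun itself, which is the fix described above.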

On Feb 3, 2009, at 11:09 AM, jody wrote:

> Hi Ralph
>>>
>>> --output-filename
>>> It creates files, but only for the local processes:
>>> [jody_at_localhost neander]$ mpirun -np 8 -hostfile testhosts
>>> --output-filename gnana ./MPITest
>>> ... output ...
>>> [jody_at_localhost neander]$ ls -l gna*
>>> -rw-r--r-- 1 jody morpho 549 2009-02-03 18:02 gnana.0
>>> -rw-r--r-- 1 jody morpho 549 2009-02-03 18:02 gnana.1
>>> -rw-r--r-- 1 jody morpho 549 2009-02-03 18:02 gnana.2
>>> (I set slots=3 on my workstation)
>>>
>>
>> Did you give a location that is on an NFS mount?
>
> Yes, I started mpirun on a drive that all the remote nodes mount via NFS.
>>
>> I'm willing to bet the files are being created - they are on your remote
>> nodes. The daemons create their own local files for output from their
>> local procs. We decided to do this for scalability reasons - if we have
>> mpirun open all the output files, then you could easily hit the file
>> descriptor limit on that node and cause the job not to launch.
>>
>> Check your remote nodes and see if the files are there.
>
> Where would I have to look? They are not in my home directories on the nodes.
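
One quick way to look for them, assuming password-less ssh to the nodes and a
hostfile with one host per line (possibly followed by a slots=... field), would
be something like:

    # search each node's home directory for the per-rank output files
    for h in $(awk '!/^#/ {print $1}' testhosts); do
        echo "== $h =="
        ssh $h 'find $HOME -maxdepth 2 -name "gnana.*" 2>/dev/null'
    done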
>
>>
>> I can fix that easily enough - we'll just test to see if the xterm option
>> has been set, and add the -X to ssh if so.
>>
>> Note that you can probably set this yourself right now with
>> -mca plm_rsh_agent "ssh -X"
>
> I tried this, but it didn't work, though we may be getting there:
>
> [jody_at_localhost neander]$ mpirun -np 8 -mca plm_rsh_agent "ssh -X"
> -hostfile testhosts --xterm 2,3,4,5! -x DISPLAY ./MPITest
> Warning: No xauth data; using fake authentication data for X11 forwarding.
> Warning: No xauth data; using fake authentication data for X11 forwarding.
> Warning: No xauth data; using fake authentication data for X11 forwarding.
> ...
> => The 3 remote processes (3,4,5) tried to get access.
>
> I remember having had an xauth problem like this in another setup before,
> but I've forgotten how to solve it. I'll try to find out and get back to
> you when I've figured it out.
>
> BTW: calling an X-application over SSH works, e.g.
> ssh -X node_00 xclock
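
Two things commonly checked when that xauth warning appears (general ssh/X11
notes, not something established in this thread):

    ssh node_00 'which xauth'   # X11 forwarding needs xauth installed on the remote host
    ssh -Y node_00 xclock       # -Y requests trusted X11 forwarding instead of untrusted -X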
>
>
> Jody
>>
>>>
>>>
>>> So what I currently do to have my xterms running: on my workstation I call
>>> xhost + <hostname>
>>> for all machines in my hostfile, to allow them to use X on my workstation.
>>> Then I set my DISPLAY variable to point to my workstation:
>>> export DISPLAY=<mymachine>:0.0
>>> Finally, I call mpirun with the -x option (to export the DISPLAY variable
>>> to all nodes):
>>> mpirun -np 4 -hostfile myfiles -x DISPLAY run_xterm.sh MyApplication arg1 arg2
>>>
>>> Here run_xterm.sh is a shell script which creates a useful title for the
>>> xterm window and calls the application with all its arguments (-hold
>>> leaves the xterm open after the program terminates):
>>> #!/bin/sh -f
>>>
>>> # feedback for command line
>>> echo "Running on node `hostname`"
>>>
>>> # version 1.3 provides the documented rank variable; fall back to the
>>> # undocumented 1.2 variable if it is not set
>>> export ID=$OMPI_COMM_WORLD_RANK
>>> if [ X$ID = X ]; then
>>>     export ID=$OMPI_MCA_ns_nds_vpid
>>> fi
>>>
>>> export TITLE="node #$ID"
>>> # start terminal ("$@" keeps quoted arguments intact)
>>> xterm -T "$TITLE" -hold -e "$@"
>>>
>>> exit 0
>>>
>>> (I have similar scripts to run gdb or valgrind in xterm windows.)
>>> I know that the 'xhost +' is a horror for certain sysadmins, but I feel
>>> quite safe, because the machines listed in my hostfile are not accessible
>>> from outside our department.
>>>
>>> I haven't found any other alternative to have nice xterms when I can't
>>> use 'ssh -X'.
>>>
>>> To come back to the '--xterm' option: I just ran my xterm script after
>>> doing the above xhost+ and DISPLAY things, and it worked - all local and
>>> remote processes created their xterm windows. (In other words, the
>>> environment was set to have my remote nodes use xterms on my workstation.)
>>>
>>> Immediately thereafter I called the same application with
>>> mpirun -np 8 -hostfile testhosts --xterm 2,3,4,5! -x DISPLAY ./MPITest
>>> but still, only the local process (#2) created an xterm.
>>>
>>>
>>> Do you think it would be possible to have Open MPI make its ssh
>>> connections with '-X', or are there technical or security-related
>>> objections?
>>>
>>> Regards
>>>
>>> Jody
>>>
>>> On Mon, Feb 2, 2009 at 4:47 PM, Ralph Castain <rhc_at_[hidden]> wrote:
>>>>
>>>> On Feb 2, 2009, at 2:55 AM, jody wrote:
>>>>
>>>>> Hi Ralph
>>>>> The new options are great stuff!
>>>>> Following your suggestion, I downloaded and installed
>>>>>
>>>>> http://www.open-mpi.org/nightly/trunk/openmpi-1.4a1r20392.tar.gz
>>>>>
>>>>> and tested the new options. (I have a simple cluster of 8 machines
>>>>> over TCP.) Not everything worked as specified, though:
>>>>> * timestamp-output : works
>>>>
>>>> good!
>>>>
>>>>>
>>>>> * xterm : doesn't work completely -
>>>>> comma-separated rank list:
>>>>> An xterm is opened only for the local processes. The processes on
>>>>> remote machines only write to the stdout of the calling window.
>>>>> (Just to be sure, I started my own script for opening separate xterms -
>>>>> that did work for the remote ones, too.)
>>>>
>>>> This is a problem we wrestled with for some time. The issue is that we
>>>> really aren't comfortable modifying the DISPLAY envar on the remote nodes
>>>> like you do in your script. It is fine for a user to do whatever they
>>>> want, but for OMPI to do it...that's another matter. We can't even know
>>>> for sure what to do because of the wide range of scenarios that might
>>>> occur (e.g., is mpirun local to you, or on a remote node connected to you
>>>> via xterm, or...?).
>>>>
>>>> What you (the user) need to do is ensure that X11 is set up properly so
>>>> that an X window opened on the remote host is displayed on your screen.
>>>> In this case, I believe you have to enable X forwarding - I'm not an
>>>> xterm expert, so I can't advise you on how to do this. I suspect you may
>>>> already know - in which case, can you please pass it along and I'll add
>>>> it to our docs? :-)
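
For reference, the manual setup jody describes earlier in the thread boils down
to roughly the following (hostnames, hostfile and script names are the ones used
there, and <mymachine> is a placeholder for the workstation's name; granting
nodes access to your X server is only reasonable on a trusted network):

    xhost +node_00                  # grant that node access to the local X server (repeat per host)
    export DISPLAY=<mymachine>:0.0  # have the remote xterms draw on the local X server
    mpirun -np 4 -hostfile myfiles -x DISPLAY run_xterm.sh MyApplication arg1 arg2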
>>>>
>>>>>
>>>>>
>>>>> If a '-1' is given instead of a list of ranks, it fails (locally &
>>>>> with remotes):
>>>>> [jody_at_localhost neander]$ mpirun -np 4 --xterm -1 ./MPITest
>>>>>
>>>>>
>>>>> --------------------------------------------------------------------------
>>>>> Sorry! You were supposed to get help about:
>>>>> orte-odls-base:xterm-rank-out-of-bounds
>>>>> from the file:
>>>>> help-odls-base.txt
>>>>> But I couldn't find any file matching that name. Sorry!
>>>>>
>>>>>
>>>>> --------------------------------------------------------------------------
>>>>>
>>>>>
>>>>> --------------------------------------------------------------------------
>>>>> mpirun was unable to start the specified application as it encountered
>>>>> an error on node localhost. More information may be available above.
>>>>>
>>>>>
>>>>> --------------------------------------------------------------------------
>>>>
>>>>
>>>> Fixed as of r20398 - this was a bug; an if statement was out of
>>>> sequence.
>>>>
>>>>
>>>>>
>>>>> * output-filename : doesn't work here:
>>>>> [jody_at_localhost neander]$ mpirun -np 4 --output-filename gnagna
>>>>> ./MPITest
>>>>> [jody_at_localhost neander]$ ls -l gna*
>>>>> -rw-r--r-- 1 jody morpho 549 2009-02-02 09:07 gnagna.%10lu
>>>>>
>>>>> There is output from the processes on remote machines on stdout, but
>>>>> none from the local ones.
>>>>
>>>> Fixed as of r20400 - the format statement syntax was okay in some
>>>> compilers, but not others.
>>>>
>>>>>
>>>>>
>>>>>
>>>>> A question about installing: I installed the usual way (configure,
>>>>> make all install), but the new man files apparently weren't copied to
>>>>> their destination: if I do 'man mpirun' I am shown the contents of an
>>>>> old man file (without the new options). I had to do
>>>>> 'less /opt//openmpi-1.4a1r20394/share/man/man1/mpirun.1'
>>>>> to see them.
>>>>
>>>> Strange - the install should put them in the right place, but I wonder
>>>> if you updated your manpath to point at it?
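
The usual fix, if the man pages landed under a non-default prefix, is simply to
extend MANPATH; the prefix below is taken from the path jody quotes above and is
otherwise an assumption:

    export MANPATH=/opt/openmpi-1.4a1r20394/share/man:$MANPATH
    man mpirun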
>>>>
>>>>>
>>>>>
>>>>> About the xterm option: when the application ends, all xterms are
>>>>> closed immediately. (When doing things 'by hand' I used the -hold
>>>>> option for xterm.) Would it be possible to add this feature to your
>>>>> xterm option? Perhaps by adding a '!' at the end of the rank list?
>>>>
>>>> Done! A "!" at the end of the list will activate -hold as of r20398.
>>>>
>>>>>
>>>>>
>>>>> About orte_iof: with the new version it works, but no matter which
>>>>> rank I specify, it only prints out rank 0's output:
>>>>> [jody_at_localhost ~]$ orte-iof --pid 31049 --rank 4 --stdout
>>>>> [localhost]I am #0/9 before the barrier
>>>>>
>>>>
>>>> The problem here is that the option name changed from "rank" to "ranks"
>>>> since you can now specify any number of ranks as comma-separated ranges.
>>>> I have updated orte-iof so it will gracefully fail if you provide an
>>>> unrecognized command-line option, and output the "help" detailing the
>>>> accepted options.
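
So, presumably, the earlier command would now be written with the plural option
name (pid and rank taken from jody's output above):

    orte-iof --pid 31049 --ranks 4 --stdout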
>>>>
>>>>
>>>>>
>>>>>
>>>>> Thanks
>>>>>
>>>>> Jody
>>>>>
>>>>> On Sun, Feb 1, 2009 at 10:49 PM, Ralph Castain <rhc_at_[hidden]>
>>>>> wrote:
>>>>>>
>>>>>> I'm afraid we discovered a bug in optimized builds with r20392.
>>>>>> Please use any tarball with r20394 or above.
>>>>>>
>>>>>> Sorry for the confusion
>>>>>> Ralph
>>>>>>
>>>>>>
>>>>>> On Feb 1, 2009, at 5:27 AM, Jeff Squyres wrote:
>>>>>>
>>>>>>> On Jan 31, 2009, at 11:39 AM, Ralph Castain wrote:
>>>>>>>
>>>>>>>> For anyone following this thread:
>>>>>>>>
>>>>>>>> I have completed the IOF options discussed below.
>>>>>>>> Specifically, I
>>>>>>>> have
>>>>>>>> added the following:
>>>>>>>>
>>>>>>>> * a new "timestamp-output" option that timestamps each line of output
>>>>>>>>
>>>>>>>> * a new "output-filename" option that redirects each proc's output to
>>>>>>>> a separate rank-named file.
>>>>>>>>
>>>>>>>> * a new "xterm" option that redirects the output of the specified
>>>>>>>> ranks to a separate xterm window.
>>>>>>>>
>>>>>>>> You can obtain a copy of the updated code at:
>>>>>>>>
>>>>>>>> http://www.open-mpi.org/nightly/trunk/openmpi-1.4a1r20392.tar.gz
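
A rough sketch of how the three options are invoked, based only on the names
above and the examples that appear elsewhere in this thread (the exact syntax is
in the mpirun man page of that snapshot):

    mpirun -np 8 --timestamp-output ./MPITest        # prefix each output line with a timestamp
    mpirun -np 8 --output-filename gnana ./MPITest   # one file per rank: gnana.0, gnana.1, ...
    mpirun -np 8 --xterm 2,3,4,5 ./MPITest           # open an xterm for the listed ranks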
>>>>>>>
>>>>>>> Sweet stuff. :-)
>>>>>>>
>>>>>>> Note that the URL/tarball that Ralph cites is a nightly snapshot and
>>>>>>> will expire after a while -- we only keep the 5 most recent nightly
>>>>>>> tarballs available. You can find Ralph's new IOF stuff in any 1.4a1
>>>>>>> nightly tarball after the one he cited above. Note that the last part
>>>>>>> of the tarball name refers to the subversion commit number (which
>>>>>>> increases monotonically); any 1.4 nightly snapshot tarball beyond
>>>>>>> "r20392" will contain this new IOF stuff. Here's where to get our
>>>>>>> nightly snapshot tarballs:
>>>>>>>
>>>>>>> http://www.open-mpi.org/nightly/trunk/
>>>>>>>
>>>>>>> Don't read anything into the "1.4" version number -- we've just
>>>>>>> bumped the version number internally to be different than the current
>>>>>>> stable series (1.3). We haven't yet branched for the v1.4 series;
>>>>>>> hence, "1.4a1" currently refers to our development trunk.
>>>>>>>
>>>>>>> --
>>>>>>> Jeff Squyres
>>>>>>> Cisco Systems
>>>>>>>