
Subject: Re: [OMPI users] Handling output of processes
From: Ralph Castain (rhc_at_[hidden])
Date: 2009-02-03 13:34:17


Ah! I know the problem - forgot you are running under ssh, so the
environment doesn't get passed.

I'll have to find a way to pass the output filename to the backend
nodes...should have it later today.
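
For example (the variable and host names here are just placeholders), an
environment variable set in the local shell is not visible to a command run
over ssh:

export OUTPUT_NAME=gnana
ssh node_00 'echo ${OUTPUT_NAME:-unset}'    # prints "unset" on the remote node

so the daemons launched via ssh never see the filename that was given to
mpirun on the head node.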

On Feb 3, 2009, at 11:09 AM, jody wrote:

> Hi Ralph
>>>
>>> --output-filename
>>> It creates files, but only for the local processes:
>>> [jody_at_localhost neander]$ mpirun -np 8 -hostfile testhosts
>>> --output-filename gnana ./MPITest
>>> ... output ...
>>> [jody_at_localhost neander]$ ls -l gna*
>>> -rw-r--r-- 1 jody morpho 549 2009-02-03 18:02 gnana.0
>>> -rw-r--r-- 1 jody morpho 549 2009-02-03 18:02 gnana.1
>>> -rw-r--r-- 1 jody morpho 549 2009-02-03 18:02 gnana.2
>>> (I set slots=3 on my workstation.)
>>>
>>
>> Did you give a location that is on an NFS mount?
>
> Yes, I started mpirun on a drive that all the remote nodes mount via NFS.
>>
>> I'm willing to bet the files are being created - they are on your remote
>> nodes. The daemons create their own local files for output from their
>> local procs. We decided to do this for scalability reasons - if we have
>> mpirun open all the output files, then you could easily hit the file
>> descriptor limit on that node and cause the job not to launch.
>>
>> Check your remote nodes and see if the files are there.
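>>
>> One quick way to look (the node names and search paths here are only
>> examples; adjust them for your cluster):
>>
>> for n in node_00 node_01 node_02; do
>>     ssh $n "find ~ /tmp -name 'gnana.*' 2>/dev/null"
>> done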
>
> Where would I have to look? They are not in my home directories on the
> nodes.
>
>>
>> I can fix that easily enough - we'll just test to see if the xterm option
>> has been set, and add the -X to ssh if so.
>>
>> Note that you can probably set this yourself right now with
>> -mca plm_rsh_agent "ssh -X"
>
> I tried this, but it didn't work, though we may be getting there:
>
> [jody_at_localhost neander]$ mpirun -np 8 -mca plm_rsh_agent "ssh -X"
> -hostfile testhosts --xterm 2,3,4,5! -x DISPLAY ./MPITest
> Warning: No xauth data; using fake authentication data for X11 forwarding.
> Warning: No xauth data; using fake authentication data for X11 forwarding.
> Warning: No xauth data; using fake authentication data for X11 forwarding.
> ...
> => The 3 remote processes (3,4,5) tried to get access.
>
> I remember having had an xauth problem like this in another setup before,
> but I've forgotten how to solve it. I'll try to find out and get back to
> you when I've figured it out.
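>
> (One thing I could check, as a guess: whether xauth is installed and
> ~/.Xauthority exists on the remote nodes, e.g.
>
> ssh node_00 'which xauth; ls -l ~/.Xauthority'
>
> since that warning is about ssh not finding usable xauth data.)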
>
> BTW: calling an X-application over SSH works, e.g.
> ssh -X node_00 xclock
>
>
> Jody
>>
>>>
>>>
>>> So what I currently do to have my xterms running: on my workstation I call
>>> xhost + <hostname>
>>> for all machines in my hostfile, to allow them to use X on my workstation.
>>> Then I set my DISPLAY variable to point to my workstation:
>>> export DISPLAY=<mymachine>:0.0
>>> Finally, I call mpirun with the -x option (to export the DISPLAY variable
>>> to all nodes):
>>> mpirun -np 4 -hostfile myfiles -x DISPLAY run_xterm.sh MyApplication arg1 arg2
>>>
>>> Here run_xterm.sh is a shell script that creates a useful title for the
>>> xterm window and calls the application with all its arguments (-hold
>>> leaves the xterm open after the program terminates):
>>> #!/bin/sh -f
>>>
>>> # feedback for the command line
>>> echo "Running on node `hostname`"
>>>
>>> # for version 1.3 use the documented env variable;
>>> # for version 1.2 fall back to the undocumented one
>>> export ID=$OMPI_COMM_WORLD_RANK
>>> if [ X$ID = X ]; then
>>>     export ID=$OMPI_MCA_ns_nds_vpid
>>> fi
>>>
>>> export TITLE="node #$ID"
>>> # start the terminal; "$@" keeps arguments with spaces intact
>>> xterm -T "$TITLE" -hold -e "$@"
>>>
>>> exit 0
>>>
>>> (I have similar scripts to run gdb or valgrind in xterm windows.)
>>> I know that the 'xhost +' is a horror for certain sysadmins, but I feel
>>> quite safe, because the machines listed in my hostfile are not accessible
>>> from outside our department.
>>>
>>> I haven't found any other alternative for getting nice xterms when I can't
>>> use 'ssh -X'.
>>>
>>> To come back to the '--xterm' option: I just ran my xterm script after
>>> doing the xhost+ and DISPLAY steps above, and it worked - all local and
>>> remote processes created their xterm windows. (In other words, the
>>> environment was set up so that my remote nodes could use xterms on my
>>> workstation.)
>>>
>>> Immediately thereafter I called the same application with
>>> mpirun -np 8 -hostfile testhosts --xterm 2,3,4,5! -x DISPLAY ./MPITest
>>> but still, only the local process (#2) created an xterm.
>>>
>>>
>>> Do you think it would be possible to have Open MPI make its ssh
>>> connections with '-X', or are there technical or security-related
>>> objections?
>>>
>>> Regards
>>>
>>> Jody
>>>
>>> On Mon, Feb 2, 2009 at 4:47 PM, Ralph Castain <rhc_at_[hidden]> wrote:
>>>>
>>>> On Feb 2, 2009, at 2:55 AM, jody wrote:
>>>>
>>>>> Hi Ralph
>>>>> The new options are great stuff!
>>>>> Following your suggestion, I downloaded and installed
>>>>>
>>>>> http://www.open-mpi.org/nightly/trunk/openmpi-1.4a1r20392.tar.gz
>>>>>
>>>>> and tested the new options. (I have a simple cluster of 8 machines
>>>>> over TCP.) Not everything worked as specified, though:
>>>>> * timestamp-output : works
>>>>
>>>> good!
>>>>
>>>>>
>>>>> * xterm : doesn't work completely -
>>>>> with a comma-separated rank list, an xterm is opened only for the local
>>>>> processes. The other processes (the ones on remote machines) only write
>>>>> to the stdout of the calling window.
>>>>> (Just to be sure, I started my own script for opening separate xterms -
>>>>> that did work for the remote processes, too.)
>>>>
>>>> This is a problem we wrestled with for some time. The issue is that we
>>>> really aren't comfortable modifying the DISPLAY envar on the remote nodes
>>>> like you do in your script. It is fine for a user to do whatever they
>>>> want, but for OMPI to do it...that's another matter. We can't even know
>>>> for sure what to do because of the wide range of scenarios that might
>>>> occur (e.g., is mpirun local to you, or on a remote node connected to you
>>>> via xterm, or...?).
>>>>
>>>> What you (the user) need to do is ensure that X11 is set up properly so
>>>> that an X window opened on the remote host is displayed on your screen.
>>>> In this case, I believe you have to enable X forwarding - I'm not an X
>>>> expert, so I can't advise you on how to do this. I suspect you may
>>>> already know - in which case, can you please pass it along and I'll add
>>>> it to our docs? :-)
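>>>>
>>>> With OpenSSH this typically means either passing -X to ssh, or enabling
>>>> it per host in ~/.ssh/config on the workstation (just a sketch - the host
>>>> pattern is a placeholder):
>>>>
>>>> cat >> ~/.ssh/config <<'EOF'
>>>> Host node_*
>>>>     ForwardX11 yes
>>>> EOF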
>>>>
>>>>>
>>>>>
>>>>> If a '-1' is given instead of a list of ranks, it fails (locally &
>>>>> with remotes):
>>>>> [jody_at_localhost neander]$ mpirun -np 4 --xterm -1 ./MPITest
>>>>>
>>>>>
>>>>> --------------------------------------------------------------------------
>>>>> Sorry! You were supposed to get help about:
>>>>> orte-odls-base:xterm-rank-out-of-bounds
>>>>> from the file:
>>>>> help-odls-base.txt
>>>>> But I couldn't find any file matching that name. Sorry!
>>>>>
>>>>>
>>>>> --------------------------------------------------------------------------
>>>>>
>>>>>
>>>>> --------------------------------------------------------------------------
>>>>> mpirun was unable to start the specified application as it
>>>>> encountered an error
>>>>> on node localhost. More information may be available above.
>>>>>
>>>>>
>>>>> --------------------------------------------------------------------------
>>>>
>>>>
>>>> Fixed as of r20398 - this was a bug, had an if statement out of sequence.
>>>>
>>>>
>>>>>
>>>>> * output-filename : doesn't work here:
>>>>> [jody_at_localhost neander]$ mpirun -np 4 --output-filename gnagna
>>>>> ./MPITest
>>>>> [jody_at_localhost neander]$ ls -l gna*
>>>>> -rw-r--r-- 1 jody morpho 549 2009-02-02 09:07 gnagna.%10lu
>>>>>
>>>>> There is output from the processes on remote machines on stdout, but
>>>>> none from the local ones.
>>>>
>>>> Fixed as of r20400 - had a format statement syntax that was okay in some
>>>> compilers, but not others.
>>>>
>>>>>
>>>>>
>>>>>
>>>>> A question about installing: I installed the usual way (configure,
>>>>> make all install), but the new man-files apparently weren't copied to
>>>>> their destination: if I do 'man mpirun' I am shown the contents of an
>>>>> old man-file (without the new options). I had to do
>>>>> 'less /opt//openmpi-1.4a1r20394/share/man/man1/mpirun.1'
>>>>> to see them.
>>>>
>>>> Strange - the install should put them in the right place, but I
>>>> wonder if
>>>> you updated your manpath to point at it?
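>>>>
>>>> Something along these lines should pick them up (a sketch; the prefix is
>>>> taken from the path you quoted above):
>>>>
>>>> export MANPATH=/opt/openmpi-1.4a1r20394/share/man:$MANPATH
>>>> man mpirun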
>>>>
>>>>>
>>>>>
>>>>> About the xterm option: when the application ends, all xterms are
>>>>> closed immediately. (When doing things 'by hand' I used the -hold option
>>>>> for xterm.) Would it be possible to add this feature to your xterm
>>>>> option? Perhaps by adding a '!' at the end of the rank list?
>>>>
>>>> Done! A "!" at the end of the list will activate -hold as of r20398.
>>>>
>>>>>
>>>>>
>>>>> About orte-iof: with the new version it works, but no matter which rank
>>>>> I specify, it only prints out rank 0's output:
>>>>> [jody_at_localhost ~]$ orte-iof --pid 31049 --rank 4 --stdout
>>>>> [localhost]I am #0/9 before the barrier
>>>>>
>>>>
>>>> The problem here is that the option name changed from "rank" to "ranks"
>>>> since you can now specify any number of ranks as comma-separated ranges.
>>>> I have updated orte-iof so it will gracefully fail if you provide an
>>>> unrecognized cmd line option and output the "help" detailing the accepted
>>>> options.
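>>>>
>>>> So the earlier command would presumably become something like this (the
>>>> pid is the one from your example; the exact range syntax is my
>>>> assumption):
>>>>
>>>> orte-iof --pid 31049 --ranks 4 --stdout
>>>> orte-iof --pid 31049 --ranks 2,4-6 --stdout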
>>>>
>>>>
>>>>>
>>>>>
>>>>> Thanks
>>>>>
>>>>> Jody
>>>>>
>>>>> On Sun, Feb 1, 2009 at 10:49 PM, Ralph Castain <rhc_at_[hidden]>
>>>>> wrote:
>>>>>>
>>>>>> I'm afraid we discovered a bug in optimized builds with r20392. Please
>>>>>> use any tarball with r20394 or above.
>>>>>>
>>>>>> Sorry for the confusion
>>>>>> Ralph
>>>>>>
>>>>>>
>>>>>> On Feb 1, 2009, at 5:27 AM, Jeff Squyres wrote:
>>>>>>
>>>>>>> On Jan 31, 2009, at 11:39 AM, Ralph Castain wrote:
>>>>>>>
>>>>>>>> For anyone following this thread:
>>>>>>>>
>>>>>>>> I have completed the IOF options discussed below. Specifically, I
>>>>>>>> have added the following:
>>>>>>>>
>>>>>>>> * a new "timestamp-output" option that timestamps each line of output
>>>>>>>>
>>>>>>>> * a new "output-filename" option that redirects each proc's output to
>>>>>>>> a separate rank-named file.
>>>>>>>>
>>>>>>>> * a new "xterm" option that redirects the output of the specified
>>>>>>>> ranks to a separate xterm window.
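>>>>>>>>
>>>>>>>> For example (just a sketch - the executable name and process counts
>>>>>>>> are placeholders):
>>>>>>>>
>>>>>>>> mpirun -np 4 --timestamp-output ./a.out
>>>>>>>> mpirun -np 4 --output-filename myout ./a.out   # writes myout.0 ... myout.3
>>>>>>>> mpirun -np 4 --xterm 1,3 ./a.out               # ranks 1 and 3 in xterms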
>>>>>>>>
>>>>>>>> You can obtain a copy of the updated code at:
>>>>>>>>
>>>>>>>> http://www.open-mpi.org/nightly/trunk/openmpi-1.4a1r20392.tar.gz
>>>>>>>
>>>>>>> Sweet stuff. :-)
>>>>>>>
>>>>>>> Note that the URL/tarball that Ralph cites is a nightly snapshot and
>>>>>>> will expire after a while -- we only keep the 5 most recent nightly
>>>>>>> tarballs available. You can find Ralph's new IOF stuff in any 1.4a1
>>>>>>> nightly tarball after the one he cited above. Note that the last part
>>>>>>> of the tarball name refers to the subversion commit number (which
>>>>>>> increases monotonically); any 1.4 nightly snapshot tarball beyond
>>>>>>> "r20392" will contain this new IOF stuff. Here's where to get our
>>>>>>> nightly snapshot tarballs:
>>>>>>>
>>>>>>> http://www.open-mpi.org/nightly/trunk/
>>>>>>>
>>>>>>> Don't read anything into the "1.4" version number -- we've just bumped
>>>>>>> the version number internally to be different than the current stable
>>>>>>> series (1.3). We haven't yet branched for the v1.4 series; hence,
>>>>>>> "1.4a1" currently refers to our development trunk.
>>>>>>>
>>>>>>> --
>>>>>>> Jeff Squyres
>>>>>>> Cisco Systems
>>>>>>>