
Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] Handling output of processes
From: jody (jody.xha_at_[hidden])
Date: 2009-02-05 10:23:42


Hi Ralph

Thanks - i downloaded and installed openmpi-1.4a1r20435 and
now everything works as it should:
--output-filename : all processes write their outputs to the correct files
--xterm : all specified processes opened their xterms

I started my application with --xterm as i wrote in a previous mail:
- call 'xhost +<remote_node>' for all nodes in my hostfile
- export DISPLAY=<my_workstation>:0.0
- call
    mpirun -np 8 -x DISPLAY --hostfile testhosts --xterm --ranks=2,3,4,5! ./MPITest

combining --xterm with --output-filename also worked.
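
For reference, the xhost step above can be scripted. The following is only a sketch under assumptions: hosts_from_file and allow_hosts are hypothetical helper names, and the hostfile is assumed to list one host per line, optionally followed by slots=N, with '#' starting a comment.

```shell
#!/bin/sh
# Sketch only: hosts_from_file and allow_hosts are hypothetical helpers,
# not part of Open MPI.

# Print the host name (first field) of every non-empty, non-comment line.
hosts_from_file() {
    sed -e 's/#.*//' "$1" | awk 'NF { print $1 }'
}

# Allow every host listed in the hostfile to use the local X server.
allow_hosts() {
    for h in $(hosts_from_file "$1"); do
        xhost "+$h"
    done
}
```

With these helpers sourced, 'allow_hosts testhosts' replaces the manual xhost calls. DISPLAY still has to be exported before calling mpirun as shown above.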

Thanks again!
  Jody

On Tue, Feb 3, 2009 at 11:03 PM, Ralph Castain <rhc_at_[hidden]> wrote:
> Hi Jody
>
> Well, the problem with both the output filename and the xterm option was
> that I wasn't passing them back to the remote daemons under the ssh launch
> environment. I should have that corrected now - things will hopefully work
> with any tarball of r20407 or above.
>
> Let me know...
> Ralph
>
> On Feb 3, 2009, at 11:34 AM, Ralph Castain wrote:
>
>> Ah! I know the problem - forgot you are running under ssh, so the
>> environment doesn't get passed.
>>
>> I'll have to find a way to pass the output filename to the backend
>> nodes...should have it later today.
>>
>>
>> On Feb 3, 2009, at 11:09 AM, jody wrote:
>>
>>> Hi Ralph
>>>>>
>>>>> --output-filename
>>>>> It creates files, but only for the local processes:
>>>>> [jody_at_localhost neander]$ mpirun -np 8 -hostfile testhosts --output-filename gnana ./MPITest
>>>>> ... output ...
>>>>> [jody_at_localhost neander]$ ls -l gna*
>>>>> -rw-r--r-- 1 jody morpho 549 2009-02-03 18:02 gnana.0
>>>>> -rw-r--r-- 1 jody morpho 549 2009-02-03 18:02 gnana.1
>>>>> -rw-r--r-- 1 jody morpho 549 2009-02-03 18:02 gnana.2
>>>>> ( i set slots=3 on my workstation)
>>>>>
>>>>
>>>> Did you give a location that is on an NFS mount?
>>>
>>> Yes, i started mpirun on a drive which all the remote nodes mount as NFS
>>> drives.
>>>>
>>>> I'm willing to bet the files are being created - they are on your remote
>>>> nodes. The daemons create their own local files for output from their
>>>> local procs. We decided to do this for scalability reasons - if we have
>>>> mpirun open all the output files, then you could easily hit the file
>>>> descriptor limit on that node and cause the job not to launch.
>>>>
>>>> Check your remote nodes and see if the files are there.
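
The file descriptor argument can be checked with a rough estimate. This sketch is not part of Open MPI: fd_budget_ok is a hypothetical helper, and the number of output files per process is an assumption you supply yourself.

```shell
#!/bin/sh
# Sketch only: fd_budget_ok is a hypothetical helper.
# Usage: fd_budget_ok NPROCS FILES_PER_PROC [LIMIT]
# Succeeds if NPROCS * FILES_PER_PROC stays below LIMIT, which defaults
# to this shell's soft limit on open files (ulimit -n).
fd_budget_ok() {
    nprocs=$1
    files_per_proc=$2
    limit=${3:-$(ulimit -n)}
    if [ "$limit" = "unlimited" ]; then
        return 0
    fi
    [ $((nprocs * files_per_proc)) -lt "$limit" ]
}
```

For example, 'fd_budget_ok 4096 2' asks whether 4096 processes with two output files each would stay under the current soft limit on the node where it is run.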
>>>
>>> Where would i have to look? They are not in my home directories on the
>>> nodes.
>>>
>>>>
>>>> I can fix that easily enough - we'll just test to see if the xterm
>>>> option has been set, and add the -X to ssh if so.
>>>>
>>>> Note that you can probably set this yourself right now by
>>>> -mca plm_rsh_agent "ssh -X"
>>>
>>> I tried this, but it didn't work, though we may be getting there:
>>>
>>> [jody_at_localhost neander]$ mpirun -np 8 -mca plm_rsh_agent "ssh -X" -hostfile testhosts --xterm 2,3,4,5! -x DISPLAY ./MPITest
>>> Warning: No xauth data; using fake authentication data for X11 forwarding.
>>> Warning: No xauth data; using fake authentication data for X11 forwarding.
>>> Warning: No xauth data; using fake authentication data for X11 forwarding.
>>> ...
>>> => The 3 remote processes (3,4,5) tried to get access.
>>>
>>> I remember having had an xauth problem like this in another setup
>>> before, but i've forgotten how to solve it. I'll try to find out, and
>>> get back to you when i have figured it out.
>>>
>>> BTW: calling an X-application over SSH works, e.g.
>>> ssh -X node_00 xclock
>>>
>>>
>>> Jody
>>>>
>>>>>
>>>>>
>>>>> So what i currently do to have my xterms running:
>>>>> on my workstation i call
>>>>>   xhost +<hostname>
>>>>> for all machines in my hostfile, to allow them to use X on my
>>>>> workstation.
>>>>> Then i set my DISPLAY variable to point to my workstation
>>>>>   export DISPLAY=<mymachine>:0.0
>>>>> Finally, i call mpirun with the -x option (to export the DISPLAY
>>>>> variable to all nodes):
>>>>>   mpirun -np 4 -hostfile myfiles -x DISPLAY run_xterm.sh MyApplication arg1 arg2
>>>>>
>>>>> Here run_xterm.sh is a shell script which creates a useful title for
>>>>> the xterm window
>>>>> and calls the application with all its arguments (-hold leaves the
>>>>> xterm open after the program terminates):
>>>>> #!/bin/sh -f
>>>>>
>>>>> # feedback for command line
>>>>> echo "Running on node $(hostname)"
>>>>>
>>>>> # version 1.3: use the documented env variable OMPI_COMM_WORLD_RANK
>>>>> # version 1.2: fall back to the undocumented OMPI_MCA_ns_nds_vpid
>>>>> export ID=$OMPI_COMM_WORLD_RANK
>>>>> if [ -z "$ID" ]; then
>>>>>     export ID=$OMPI_MCA_ns_nds_vpid
>>>>> fi
>>>>>
>>>>> export TITLE="node #$ID"
>>>>> # start terminal; "$@" keeps arguments containing spaces intact
>>>>> xterm -T "$TITLE" -hold -e "$@"
>>>>>
>>>>> exit 0
>>>>>
>>>>> (i have similar scripts to run gdb or valgrind in xterm windows)
>>>>> I know that the 'xhost +' is a horror for certain sysadmins,
>>>>> but i feel quite safe, because the machines listed in my hostfile
>>>>> are not accessible from outside our department.
>>>>>
>>>>> I haven't found any other alternative to have nice xterms when i can't
>>>>> use 'ssh -X'.
>>>>>
>>>>> To come back to the '--xterm' option: i just ran my xterm-script after
>>>>> doing the above xhost+ and DISPLAY things, and it worked - all local
>>>>> and remote processes created their xterm windows. (In other words, the
>>>>> environment was set to have my remote nodes use xterms on my
>>>>> workstation.)
>>>>>
>>>>> Immediately thereafter i called the same application with
>>>>> mpirun -np 8 -hostfile testhosts --xterm 2,3,4,5! -x DISPLAY ./MPITest
>>>>> but still, only the local process (#2) created an xterm.
>>>>>
>>>>>
>>>>> Do you think it would be possible to have open MPI make its
>>>>> ssh-connections with '-X',
>>>>> or are there technical or security-related objections?
>>>>>
>>>>> Regards
>>>>>
>>>>> Jody
>>>>>
>>>>> On Mon, Feb 2, 2009 at 4:47 PM, Ralph Castain <rhc_at_[hidden]> wrote:
>>>>>>
>>>>>> On Feb 2, 2009, at 2:55 AM, jody wrote:
>>>>>>
>>>>>>> Hi Ralph
>>>>>>> The new options are great stuff!
>>>>>>> Following your suggestion, i downloaded and installed
>>>>>>>
>>>>>>> http://www.open-mpi.org/nightly/trunk/openmpi-1.4a1r20392.tar.gz
>>>>>>>
>>>>>>> and tested the new options. (i have a simple cluster of
>>>>>>> 8 machines over tcp). Not everything worked as specified, though:
>>>>>>> * timestamp-output : works
>>>>>>
>>>>>> good!
>>>>>>
>>>>>>>
>>>>>>> * xterm : doesn't work completely -
>>>>>>> comma-separated rank list:
>>>>>>> An xterm is opened only for the local processes. The other processes
>>>>>>> (the ones on remote machines) only output to the stdout of the
>>>>>>> calling window.
>>>>>>> (Just to be sure i started my own script for opening separate xterms
>>>>>>> - that did work for the remote ones, too)
>>>>>>
>>>>>> This is a problem we wrestled with for some time. The issue is that we
>>>>>> really aren't comfortable modifying the DISPLAY envar on the remote
>>>>>> nodes like you do in your script. It is fine for a user to do whatever
>>>>>> they want, but for OMPI to do it...that's another matter. We can't even
>>>>>> know for sure what to do because of the wide range of scenarios that
>>>>>> might occur (e.g., is mpirun local to you, or on a remote node
>>>>>> connected to you via xterm, or...?).
>>>>>>
>>>>>> What you (the user) need to do is ensure that X11 is set up properly so
>>>>>> that an Xwindow opened on the remote host is displayed on your screen.
>>>>>> In this case, I believe you have to enable xforwarding - I'm not an
>>>>>> xterm expert, so I can't advise you on how to do this. Suspect you may
>>>>>> already know - in which case, can you please pass it along and I'll add
>>>>>> it to our docs? :-)
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> If a '-1' is given instead of a list of ranks, it fails (locally &
>>>>>>> with remotes):
>>>>>>> [jody_at_localhost neander]$ mpirun -np 4 --xterm -1 ./MPITest
>>>>>>> --------------------------------------------------------------------------
>>>>>>> Sorry! You were supposed to get help about:
>>>>>>> orte-odls-base:xterm-rank-out-of-bounds
>>>>>>> from the file:
>>>>>>> help-odls-base.txt
>>>>>>> But I couldn't find any file matching that name. Sorry!
>>>>>>> --------------------------------------------------------------------------
>>>>>>> --------------------------------------------------------------------------
>>>>>>> mpirun was unable to start the specified application as it encountered an error
>>>>>>> on node localhost. More information may be available above.
>>>>>>> --------------------------------------------------------------------------
>>>>>>
>>>>>>
>>>>>> Fixed as of r20398 - this was a bug, had an if statement out of sequence.
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> * output-filename : doesn't work here:
>>>>>>> [jody_at_localhost neander]$ mpirun -np 4 --output-filename gnagna ./MPITest
>>>>>>> [jody_at_localhost neander]$ ls -l gna*
>>>>>>> -rw-r--r-- 1 jody morpho 549 2009-02-02 09:07 gnagna.%10lu
>>>>>>>
>>>>>>> There is output from the processes on remote machines on stdout, but
>>>>>>> none from the local ones.
>>>>>>
>>>>>> Fixed as of r20400 - had a format statement syntax that was okay in
>>>>>> some compilers, but not others.
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> A question about installing: i installed the usual way (configure,
>>>>>>> make all install), but the new man-files apparently weren't copied
>>>>>>> to their destination: if i do 'man mpirun' i get shown the contents
>>>>>>> of an old man-file (without the new options).
>>>>>>> I had to do 'less /opt//openmpi-1.4a1r20394/share/man/man1/mpirun.1'
>>>>>>> to see them.
>>>>>>
>>>>>> Strange - the install should put them in the right place, but I wonder
>>>>>> if you updated your manpath to point at it?
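
If the man pages were installed under a non-default prefix, prepending that directory to MANPATH is one way to make 'man mpirun' find them. A minimal sketch, assuming a POSIX shell: prepend_manpath is a hypothetical helper name, not something Open MPI provides.

```shell
#!/bin/sh
# Sketch only: prepend_manpath is a hypothetical helper.
# Put the given directory at the front of MANPATH unless it is already listed.
prepend_manpath() {
    case ":${MANPATH}:" in
        *":$1:"*) ;;                      # already present, nothing to do
        *) MANPATH="$1:${MANPATH}" ;;     # new entry goes first
    esac
    export MANPATH
}
```

With the install prefix quoted above this would be 'prepend_manpath /opt//openmpi-1.4a1r20394/share/man'. If MANPATH was empty before, the result ends in a colon, which man-db documents as "search the default locations as well"; other man implementations may behave differently.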
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> About the xterm-option : when the application ends all xterms are
>>>>>>> closed immediately.
>>>>>>> (when doing things 'by hand' i used the -hold option for xterm)
>>>>>>> Would it be possible to add this feature for your xterm option?
>>>>>>> Perhaps by adding a '!' at the end of the rank list?
>>>>>>
>>>>>> Done! A "!" at the end of the list will activate -hold as of r20398.
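
To illustrate the syntax, here is a small sketch of how such a rank list could be split into its ranks and the hold flag. This is not Open MPI's parser: parse_xterm_spec is a hypothetical helper written only for illustration.

```shell
#!/bin/sh
# Sketch only: parse_xterm_spec is a hypothetical helper, not OMPI code.
# Splits a spec such as "2,3,4,5!" into a hold flag and a list of ranks.
parse_xterm_spec() {
    spec=$1
    hold=no
    case $spec in
        *\!) hold=yes; spec=${spec%\!} ;;   # trailing "!" requests -hold
    esac
    printf 'hold=%s ranks=%s\n' "$hold" "$(printf '%s' "$spec" | tr ',' ' ')"
}
```

Calling 'parse_xterm_spec "2,3,4,5!"' reports the hold flag as set and lists ranks 2 to 5, while the same list without the "!" leaves the flag unset.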
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> About orte_iof: with the new version it works, but no matter which
>>>>>>> rank i specify,
>>>>>>> it only prints out rank0's output:
>>>>>>> [jody_at_localhost ~]$ orte-iof --pid 31049 --rank 4 --stdout
>>>>>>> [localhost]I am #0/9 before the barrier
>>>>>>>
>>>>>>
>>>>>> The problem here is that the option name changed from "rank" to
>>>>>> "ranks" since you can now specify any number of ranks as
>>>>>> comma-separated ranges. I have updated orte-iof so it will gracefully
>>>>>> fail if you provide an unrecognized cmd line option and output the
>>>>>> "help" detailing the accepted options.
>>>>>>
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Thanks
>>>>>>>
>>>>>>> Jody
>>>>>>>
>>>>>>> On Sun, Feb 1, 2009 at 10:49 PM, Ralph Castain <rhc_at_[hidden]> wrote:
>>>>>>>>
>>>>>>>> I'm afraid we discovered a bug in optimized builds with r20392.
>>>>>>>> Please use any tarball with r20394 or above.
>>>>>>>>
>>>>>>>> Sorry for the confusion
>>>>>>>> Ralph
>>>>>>>>
>>>>>>>>
>>>>>>>> On Feb 1, 2009, at 5:27 AM, Jeff Squyres wrote:
>>>>>>>>
>>>>>>>>> On Jan 31, 2009, at 11:39 AM, Ralph Castain wrote:
>>>>>>>>>
>>>>>>>>>> For anyone following this thread:
>>>>>>>>>>
>>>>>>>>>> I have completed the IOF options discussed below. Specifically, I
>>>>>>>>>> have added the following:
>>>>>>>>>>
>>>>>>>>>> * a new "timestamp-output" option that timestamps each line of
>>>>>>>>>> output
>>>>>>>>>>
>>>>>>>>>> * a new "output-filename" option that redirects each proc's output
>>>>>>>>>> to a separate rank-named file.
>>>>>>>>>>
>>>>>>>>>> * a new "xterm" option that redirects the output of the specified
>>>>>>>>>> ranks to a separate xterm window.
>>>>>>>>>>
>>>>>>>>>> You can obtain a copy of the updated code at:
>>>>>>>>>>
>>>>>>>>>> http://www.open-mpi.org/nightly/trunk/openmpi-1.4a1r20392.tar.gz
>>>>>>>>>
>>>>>>>>> Sweet stuff. :-)
>>>>>>>>>
>>>>>>>>> Note that the URL/tarball that Ralph cites is a nightly snapshot
>>>>>>>>> and will expire after a while -- we only keep the 5 most recent
>>>>>>>>> nightly tarballs available. You can find Ralph's new IOF stuff in
>>>>>>>>> any 1.4a1 nightly tarball after the one he cited above. Note that
>>>>>>>>> the last part of the tarball name refers to the subversion commit
>>>>>>>>> number (which increases monotonically); any 1.4 nightly snapshot
>>>>>>>>> tarball beyond "r20392" will contain this new IOF stuff. Here's
>>>>>>>>> where to get our nightly snapshot tarballs:
>>>>>>>>>
>>>>>>>>> http://www.open-mpi.org/nightly/trunk/
>>>>>>>>>
>>>>>>>>> Don't read anything into the "1.4" version number -- we've just
>>>>>>>>> bumped the version number internally to be different than the
>>>>>>>>> current stable series (1.3). We haven't yet branched for the v1.4
>>>>>>>>> series; hence, "1.4a1" currently refers to our development trunk.
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Jeff Squyres
>>>>>>>>> Cisco Systems
>>>>>>>>>
>>>>>>>>> _______________________________________________
>>>>>>>>> users mailing list
>>>>>>>>> users_at_[hidden]
>>>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users