MTT Devel Mailing List Archives

Subject: Re: [MTT devel] Analysis of hung jobs.
From: Ethan Mallove (ethan.mallove_at_[hidden])
Date: 2009-10-08 10:46:35


On Thu, Oct/08/2009 03:18:07PM, Ashley Pittman wrote:
> On Thu, 2009-10-08 at 09:51 -0400, Ethan Mallove wrote:
>
> > $ padb --verbose --debug=all --config-option rmgr=mpirun --full-report=6336
> > ...
> > full job report for job 6336
> >
> > Attaching to job 6336
> > mpirun resource manager requires pdsh to be installed
> > Use of uninitialized value in printf at padb line 729.
> > Use of uninitialized value in printf at padb line 729.
> > DEBUG (verbose): 0: There are 0 processes over 0 hosts
> > Fatal problem setting up the resource manager: mpirun
> >
> > I assume it's referring to the below "pdsh"?
> >
> > http://sourceforge.net/projects/pdsh
>
> Yes, you'll need to be able to ssh freely around from the node where
> padb/pdsh is running to all compute nodes as well. For debian I had to
> add "export PDSH_RCMD_TYPE=ssh" to my .bashrc to tell it to use ssh
> rather than rsh.
>
> Could you update to r283 as well, the "mpirun" resource manager is new
> and I discovered this morning that it didn't like digits in hostnames.
> As an added benefit it won't use pdsh or ssh if all processes are local.
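
A small sketch of the environment setup described above. Setting PDSH_RCMD_TYPE=ssh presumably only helps if the local pdsh build includes an ssh rcmd module; pdsh lists the modules it was built with in its usage text ("available rcmd modules: ..."). The helper has_rcmd_module is hypothetical, written here for illustration, and is not something pdsh or padb provides. The module list passed to it in the example is an assumed value.

```shell
# Tell pdsh to use ssh rather than rsh (e.g. in ~/.bashrc).
export PDSH_RCMD_TYPE=ssh

# Hypothetical helper: succeed if module $2 appears in the
# comma-separated module list $1.
has_rcmd_module() {
    case ",$1," in
        *",$2,"*) return 0 ;;
        *)        return 1 ;;
    esac
}

# Example with an assumed module list; substitute what your pdsh reports.
if has_rcmd_module "rsh,ssh,exec" "$PDSH_RCMD_TYPE"; then
    echo "ssh module available"
else
    echo "ssh module missing: PDSH_RCMD_TYPE=ssh is unlikely to work" >&2
fi
```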

With r283, it looks like padb is passing pdsh an option it doesn't accept?

  $ padb --debug=all --verbose --config-option rmgr=mpirun --full-report=24303
  ...
  padb version 3.n (Revision 283)
  full job report for job 24303

  Attaching to job 24303
  Use of uninitialized value in string ne at padb line 2720.
  Job has 1 process(es)
  Job spans 0 host(s)
  DEBUG (verbose): 0: There are 1 processes over 0 hosts
  DEBUG (verbose): 0: Remote process data available on frontend
  DEBUG (show_cmd): 0: pdsh -w padb --inner --outer="burl-ct-v20z-0:52314"
  einner: pdsh: illegal option -- -
  einner: Usage: pdsh [-options] command ...
  einner: -S return largest of remote command return values
  einner: -h output usage menu and quit
  einner: -V output version information and quit
  einner: -q list the option settings and quit
  einner: -b disable ^C status feature (batch mode)
  einner: -d enable extra debug information from ^C status
  einner: -l user execute remote commands as user
  einner: -t seconds set connect timeout (default is 10 sec)
  einner: -u seconds set command timeout (no default)
  einner: -f n use fanout of n nodes
  einner: -w host,host,... set target node list on command line
  einner: -x host,host,... set node exclusion list on command line
  einner: -R name set rcmd module to name
  einner: -N disable hostname: labels on output lines
  einner: -L list info on all loaded modules and exit
  einner: available rcmd modules: rsh,exec (default: rsh)
  Unexpected EOF from Inner stdout (connecting)
  Unexpected EOF from Inner stderr (connecting)
  Unexpected exit from parallel command (state=connecting)
  result from parallel command is 256 (state=connecting)
  Bad exit code from parallel command (exit_code=1)
  DEBUG (verbose): 5: Completed command
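
One possible reading of the output above, offered as a guess rather than a diagnosis: padb reports 0 hosts, so the host list handed to "pdsh -w" may have been empty. In that case "padb" would be consumed as the argument to -w, and "--inner" would then be parsed as a pdsh option, which matches the "illegal option" error. The sketch below reproduces that argument shift in plain shell. Both functions are hypothetical helpers, not padb code, and they only print the command line rather than running pdsh.

```shell
#!/bin/sh
# Hypothetical helper (not part of padb): print the pdsh command line that
# would result from a given host list. Nothing is executed.
build_pdsh_cmd() {
    hosts=$1
    # Left unquoted on purpose: an empty $hosts vanishes during word
    # splitting, so the next word ("padb") becomes the argument to -w.
    set -- pdsh -w $hosts padb --inner
    echo "$*"
}

# Guarded variant: refuse to build the command when no hosts were found.
build_pdsh_cmd_checked() {
    if [ -z "$1" ]; then
        echo "no hosts found for job" >&2
        return 1
    fi
    build_pdsh_cmd "$1"
}

build_pdsh_cmd "node1,node2"   # prints: pdsh -w node1,node2 padb --inner
build_pdsh_cmd ""              # prints: pdsh -w padb --inner
```

If this reading is right, the underlying question is why the job is reported as spanning 0 hosts, since the pdsh error would only be a symptom of that.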

-Ethan

>
> Ashley,
>
> --
> Ashley Pittman, Bath, UK.
>
> Padb - A parallel job inspection tool for cluster computing
> http://padb.pittman.org.uk
>