As mentioned in another thread, I've recently ported padb, a command-line
job inspection tool (kinda like a parallel debugger), to orte and
OpenMPI. Padb is an existing, stable product which has worked for a
number of years on Slurm and RMS; orte support is new and not widely
tested yet, although it works for all cases I've tried.
For those who haven't used it, padb is an open source command-line tool
which, among other things, can collect stack traces, display MPI message
queues and present a lot of process information about parallel jobs to
the user in an accessible way.
Ideally padb will find its place within the day-to-day workings of
OpenMPI developers and become a recommended tool for users as well. It
also has a mode where it can be launched automatically to gather
information about job hangs without human intervention. I'd be willing
to work with the OpenMPI team to integrate this into the MTT code if
I would encourage you to download it and try it out. If it works for you
and you like it, that's great; if not, let me know and I'll do what I can
to fix it. There is a website and public mailing lists for padb issues,
or I am happy to discuss orte-specific issues on this list.
The website is at http://padb.pittman.org.uk and I welcome any feedback,
either here, off-list or on either of the padb mailing lists.
Ashley Pittman, Bath, UK.
Padb - A parallel job inspection tool for cluster computing