Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] simplest way to check message queues
From: Brock Palen (brockp_at_[hidden])
Date: 2010-09-01 18:01:32

We have DDT, but our licenses do not cover attaching to the number of cores these jobs run on.

I tried padb, but it fails. After ssh'ing to the root node of the running MPI job, I ran:

/tmp/padb -Q -a

[] [[22211,0],0]-[[25542,0],0] oob-tcp: Communication retries exceeded. Can not communicate with peer
[] [[22211,0],0] ORTE_ERROR_LOG: Unreachable in file util/comm/comm.c at line 62
[] [[22211,0],0] ORTE_ERROR_LOG: Unreachable in file orte-ps.c at line 799
[] [[22211,0],0]-[[25542,0],0] oob-tcp: Communication retries exceeded. Can not communicate with peer
einner: --------------------------------------------------------------------------
einner: orterun was unable to launch the specified application as it could not access
einner: or execute an executable:
Unexpected EOF from Inner stdout (connecting)
Unexpected EOF from Inner stderr (connecting)
Unexpected exit from parallel command (state=connecting)
Bad exit code from parallel command (exit_code=131)
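
To make the output above easier to read, here is a minimal Python sketch that pulls the process-name pairs out of the oob-tcp lines. It is not part of padb or Open MPI. The function name and regular expression are made up for illustration, and reading the three numbers in a name like [[22211,0],0] as (job family, local job id, vpid) is an assumption about how ORTE prints process names.

```python
import re

# Matches ORTE-style process names such as [[22211,0],0].
# Interpreting the fields as (job family, local job id, vpid) is an assumption.
NAME = r"\[\[(\d+),(\d+)\],(\d+)\]"
PAIR = re.compile(NAME + r"-" + NAME + r" oob-tcp: (.*)")


def unreachable_peers(log_text):
    """Return a (local, peer, message) tuple for each oob-tcp line found."""
    results = []
    for line in log_text.splitlines():
        m = PAIR.search(line)
        if m:
            local = tuple(int(x) for x in m.group(1, 2, 3))
            peer = tuple(int(x) for x in m.group(4, 5, 6))
            results.append((local, peer, m.group(7).strip()))
    return results


# Two lines copied from the padb output above.
sample = """\
[] [[22211,0],0]-[[25542,0],0] oob-tcp: Communication retries exceeded. Can not communicate with peer
[] [[22211,0],0] ORTE_ERROR_LOG: Unreachable in file util/comm/comm.c at line 62
"""

for local, peer, msg in unreachable_peers(sample):
    print(local, "->", peer, ":", msg)
```

On this sample it reports a single pair, with the first name in the oob-tcp line failing to reach the second. That only narrows down which two processes are involved; it does not explain why the connection fails.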

Brock Palen
Center for Advanced Computing

On Sep 1, 2010, at 5:32 PM, Ashley Pittman wrote:

> On 1 Sep 2010, at 21:13, Brock Palen wrote:
>> I have a code for a user (NAMD, if anyone cares) that locks up on one specific case. A quick ltrace shows the processes doing Iprobes over and over, which makes me think that a process somewhere is blocking on communication.
>> What is the best way to look at the message queues, to see which process is stuck and drill into it?
> The only three programs I know of that can do this are TotalView, DDT and Padb. TotalView and DDT are graphical parallel debuggers and are commercial products; Padb is a command-line tool and is open source.
> Ashley (padb developer)
> --
> Ashley Pittman, Bath, UK.
> Padb - A parallel job inspection tool for cluster computing