Open MPI User's Mailing List Archives

This page is part of a frozen web archive of this mailing list.

You can still navigate around this archive, but know that no new mails have been added to it since July of 2016.

Subject: Re: [OMPI users] simplest way to check message queues
From: Ashley Pittman (ashley_at_[hidden])
Date: 2010-09-02 17:55:46

On 2 Sep 2010, at 15:56, Brock Palen wrote:

> Ashley, still having trouble using padb with openmpi/1.4.2
> [dianawon_at_nyx0862 ~]$ /home/software/rhel5/padb/3.0/padb -a -Q
> [] [[16608,0],0]-[[25542,0],0] oob-tcp: Communication retries exceeded. Can not communicate with peer
> [] [[16608,0],0] ORTE_ERROR_LOG: Unreachable in file util/comm/comm.c at line 62
> [] [[16608,0],0] ORTE_ERROR_LOG: Unreachable in file orte-ps.c at line 799
> [] [[16608,0],0]-[[25542,0],0] oob-tcp: Communication retries exceeded. Can not communicate with peer
> No active jobs could be found for user 'dianawon'
> The job is running, I get this error running just orte-ps,

If orte-ps isn't running correctly then there is very little padb can do. If that is the case, try using the "mpirun" resource manager interface rather than "orte". This will cause padb to use the MPIR interface and try to get the information directly from the mpirun process before launching itself via pdsh. It doesn't scale as well as the orte integration (pdsh eventually runs out of file descriptors) but it is more generic and might get you to somewhere that works. If your job spans more than 32 nodes you may need to set the FANOUT variable for pdsh to work.
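
As a rough sketch of the above, something like the following could be tried. The padb path and the -a -Q options are copied from the quoted session. The -Ormgr=mpirun spelling of the config option and the FANOUT value of 64 are assumptions for illustration, so check them against your padb and pdsh versions. The script only prints the command so it can be reviewed before running it.

```shell
#!/bin/sh
# Path copied from the quoted session; adjust for your site.
PADB=/home/software/rhel5/padb/3.0/padb

# pdsh reads its fan-out from the FANOUT environment variable (default 32).
# 64 is an illustrative value only; pick one that covers your node count.
FANOUT=64
export FANOUT

# Assumption: -Ormgr=mpirun selects the "mpirun" resource manager
# interface instead of "orte".
PADB_CMD="$PADB -Ormgr=mpirun -a -Q"

# Print the command for review rather than executing it.
echo "FANOUT=$FANOUT $PADB_CMD"
```

If the printed command looks right for your installation, run it from the node where mpirun was started.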


Ashley Pittman, Bath, UK.
Padb - A parallel job inspection tool for cluster computing