Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Could following situations caused by RDMA mca parameters?
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2009-04-23 06:56:20


On Apr 22, 2009, at 11:43 PM, Tsung Han Shie wrote:

> Unfortunately, after I thoroughly examined the entire cluster, I found
> a bad node with a busted hard drive. That's why this job hung.
> Also, when this job is submitted with one bad node in the machinefile,
> neither Open MPI nor my program gives me any error messages.
> That's why I couldn't find the reason the job hung.

Interesting. Sorry OMPI didn't provide more diagnostics. :-\
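
In the meantime, a pre-flight check along the following lines can catch
unreachable nodes before a job is launched. This is only a sketch and not
part of Open MPI. It assumes passwordless ssh to every node and a
machinefile with one host per line, optionally followed by fields such as
slots=N. The function names are made up for illustration. Note that a
node with a failing disk may still answer ssh, so this catches only some
kinds of bad node.

```python
import subprocess
import sys


def parse_machinefile(text):
    """Return unique host names in file order, skipping comments and blanks."""
    hosts = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        if not line:
            continue
        host = line.split()[0]  # ignore extra fields such as "slots=4"
        if host not in hosts:
            hosts.append(host)
    return hosts


def host_responds(host, timeout=10, run=subprocess.run):
    """Return True if `ssh host true` exits 0 within `timeout` seconds."""
    cmd = ["ssh", "-o", "BatchMode=yes", host, "true"]
    try:
        result = run(cmd, timeout=timeout,
                     stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    except (subprocess.TimeoutExpired, OSError):
        # A hung or unlaunchable ssh counts as a failed check.
        return False
    return result.returncode == 0


if __name__ == "__main__":
    # Usage (script name is hypothetical): python check_nodes.py machinefile
    with open(sys.argv[1]) as f:
        hosts = parse_machinefile(f.read())
    bad = [h for h in hosts if not host_responds(h)]
    for h in bad:
        print("no response from", h)
    sys.exit(1 if bad else 0)
```

The timeout matters here: a node in a bad state can make ssh hang
rather than fail, so a check without a timeout could hang just like the
job did.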

Did you get the information that you needed about the OpenFabrics
optimization stuff?

Note that OMPI 1.3.2, which we released yesterday, fixes the
mpi_leave_pinned problems. Also note that the treatment of the
mpi_leave_pinned MCA parameter changed slightly. Please see this FAQ
entry for details:

     http://www.open-mpi.org/faq/?category=openfabrics#setting-mpi-leave-pinned-1.3.2

Also, since you did apparently mean v1.1.3, note that that version is
ancient. Much has happened in Open MPI to improve scalability and
performance (and diagnostics!) since the 1.1 series. If it's possible
for you to upgrade, I encourage you to do so.

-- 
Jeff Squyres
Cisco Systems