Open MPI User's Mailing List Archives

From: Jeff Squyres (jsquyres) (jsquyres_at_[hidden])
Date: 2006-04-08 14:52:18


> -----Original Message-----
> From: Prakash Velayutham [mailto:Prakash.Velayutham_at_[hidden]]
> Sent: Saturday, April 08, 2006 2:45 PM
> To: Jeff Squyres (jsquyres); users_at_[hidden]
> Subject: Re: [OMPI users] Open MPI and Torque error
>
> >>> jsquyres_at_[hidden] 04/08/06 7:10 AM >>>
> I am also curious as to why this would not work -- I was not under the
> impression that tm_init() would fail from a non-mother-superior
> node...?
>
> What others say is that it will fail this way inside an Open MPI job,
> as Open MPI's RTE is taking the only TM connection available. But the

Note that Open RTE does not hold a TM connection open because of the
one-TM-connection-per-MOM restriction (which was only recently
alleviated with Garrick's patch). Open RTE's TM support opens a TM
connection, does its thing, and then closes the connection.
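
In code terms, the pattern is roughly the following -- a minimal sketch
against Torque's tm.h with error handling trimmed (the real Open RTE
code is considerably more involved):

    #include <stdio.h>
    #include <stdlib.h>
    #include <tm.h>   /* Torque/PBS task management (TM) API */

    int main(void)
    {
        struct tm_roots roots;
        tm_node_id *nodes = NULL;
        int nnodes, rc;

        /* Open the one TM connection this MOM allows... */
        rc = tm_init(NULL, &roots);
        if (rc != TM_SUCCESS) {
            fprintf(stderr, "tm_init failed: %d\n", rc);
            return 1;
        }

        /* ...do its thing, e.g., discover the allocated nodes... */
        if (tm_nodeinfo(&nodes, &nnodes) == TM_SUCCESS) {
            printf("job has %d node(s)\n", nnodes);
            free(nodes);
        }

        /* ...and close the connection so others can use it. */
        tm_finalize();
        return 0;
    }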

> strange thing is that it works from Mother Superior without Garrick's
> patch (actually, regardless of the patch, the behaviour is
> the same, but
> I have not rigorously tested the patch in itself, so cannot comment
> about that), which I think should have failed according to the above
> contention.

Based on my explanation above, the behavior you have observed makes
sense.

> FWIW: It has been our experience with both Torque and the various
> flavors of PBS that you can repeatedly call tm_init() and
> tm_finalize()
> within a single process, so I would be surprised if that was
> the issue.
> Indeed, I'd have to double check, but I'm pretty sure that our MPI
> processes do not call tm_init() (I believe that only mpirun does).
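
For what it's worth, by "repeatedly call" I mean something as trivial
as the loop below (a sketch, same caveats as above), which has worked
for us within a single process:

    #include <stdio.h>
    #include <tm.h>

    int main(void)
    {
        struct tm_roots roots;
        int i, rc;

        /* Open and close the TM connection several times in a row. */
        for (i = 0; i < 5; i++) {
            rc = tm_init(NULL, &roots);
            if (rc != TM_SUCCESS) {
                fprintf(stderr, "cycle %d: tm_init failed: %d\n", i, rc);
                return 1;
            }
            tm_finalize();
        }
        printf("all init/finalize cycles succeeded\n");
        return 0;
    }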

> But I am running my code using mpirun, so is this expected
> behaviour? I
> am attaching my simple code below:

Yes. What I am saying is that only Open MPI's mpirun invokes tm_init()
-- the MPI processes themselves do not. Hence, there is no possibility
of TM connection contention from the MPI processes.

Even if you launch an MPI process on the same node as mpirun, there are
synchronization points that guarantee that MPI_INIT will not complete
until mpirun's TM connection has finished its work and been closed via
tm_finalize().

This is why I, too, am curious as to why your tm_init() is failing. You
might have to dive a bit deeper into the TM library to figure it out. :-\
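
A standalone probe along these lines might be a place to start (a
sketch; compare the return code against the TM_E* constants in tm.h):

    #include <stdio.h>
    #include <stdlib.h>
    #include <tm.h>

    int main(void)
    {
        struct tm_roots roots;
        int rc;

        /* TM talks to the MOM via the job environment; if PBS_JOBID is
           not set, this process is not inside a Torque job at all. */
        const char *jobid = getenv("PBS_JOBID");
        printf("PBS_JOBID = %s\n", jobid ? jobid : "(not set)");

        rc = tm_init(NULL, &roots);
        if (rc == TM_SUCCESS) {
            printf("tm_init succeeded (tm_me = %d)\n", (int) roots.tm_me);
            tm_finalize();
        } else {
            fprintf(stderr, "tm_init failed with code %d\n", rc);
        }
        return 0;
    }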

-- 
Jeff Squyres
Server Virtualization Business Unit
Cisco Systems