I ran our application using the final version of openmpi-1.7.5 again
with coll_ml_priority = 90.
This time coll/ml was actually activated, and I got the error messages
shown below:
List manager is empty.
COLL-ML lmngr failed.
COLL-ML mca_coll_ml_allocate_block exited wi
Unfortunately coll/ml seems to still have some problems ...
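
For anyone who wants to reproduce this, here is a minimal sketch of how the priority might be set for a run. It relies on two standard Open MPI mechanisms: the `--mca <name> <value>` option of mpirun, and environment variables with the `OMPI_MCA_` prefix. The helper names, the process count, and the `./solver` binary are hypothetical placeholders, not the actual setup used above.

```python
import shlex


def mpirun_command(priority, nprocs, app):
    """Build an mpirun argument list that sets coll_ml_priority.

    nprocs and app are hypothetical placeholders for a real job.
    """
    return ["mpirun", "--mca", "coll_ml_priority", str(priority),
            "-np", str(nprocs), app]


def mca_env(priority):
    """Equivalent setting expressed as an environment variable."""
    return {"OMPI_MCA_coll_ml_priority": str(priority)}


if __name__ == "__main__":
    # Print the command line instead of running it, so the sketch
    # works on a machine without Open MPI installed.
    print(shlex.join(mpirun_command(90, 4, "./solver")))
    print(mca_env(0))
```

Either form should have the same effect; the command-line option is simply easier to vary between test runs.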
These errors also mean that coll/ml was not activated in my test run with
coll_ml_priority = 27. So I guess the slowdown was due to the expensive
connectivity computation, as you pointed out.
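
To illustrate why a fixed setup cost would show up only on the small tests, here is a rough sketch of the arithmetic. The timings are made-up illustrative values, not measurements from our application, and the function name is hypothetical.

```python
def setup_overhead_fraction(setup_s, compute_s):
    """Fraction of total wall time spent in one-time setup."""
    return setup_s / (setup_s + compute_s)


# Hypothetical numbers for illustration only: the same fixed setup
# cost paired with a short run and a long run.
SETUP_S = 0.5
RUNS = [("small data", 2.0), ("large data", 600.0)]

if __name__ == "__main__":
    for label, compute_s in RUNS:
        frac = setup_overhead_fraction(SETUP_S, compute_s)
        print(f"{label}: {frac:.1%} of wall time is setup")
```

With a fixed setup cost, the fraction shrinks as the computation time grows, which matches the slowdown being visible only with the smaller test data.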
> On Mar 20, 2014, at 5:56 PM, tmishima_at_[hidden] wrote:
> > Hi Ralph, congratulations on releasing the new openmpi-1.7.5.
> > By the way, openmpi-1.7.5rc3 has been slowing down our application
> > with smaller test data sets, where the time-consuming part
> > of our application is a so-called sparse solver. The slowdown is
> > negligible with medium or large data sets, which are the more
> > practical ones, so I have been deferring this problem.
> > However, this slowdown disappears in the final version of
> > openmpi-1.7.5. After some investigation, I found that coll_ml caused
> > this slowdown. The final version seems to set coll_ml_priority to zero
> > again.
> > Could you briefly explain the advantages of coll_ml? In what kind
> > of situation is it effective, and so on ...
> I'm not really the one to speak about coll/ml as I wasn't involved in it -
> Nathan would be the one to ask. It is supposed to be significantly faster
> for most collectives, but I imagine it would depend on the precise
> collective being used and the size of the data. We did find and fix a
> number of problems right at the end (which is why we dropped the priority
> until we can better test/debug it), and so we might have hit something
> that was causing your slowdown.
> > In addition, I'm not sure why coll_ml is activated in openmpi-1.7.5rc3,
> > although its priority is lower than tuned as described in the message
> > of changeset 30790:
> > We are initially setting the priority lower than
> > tuned until this has had some time to soak in the trunk.
> Were you actually seeing coll/ml being used? It shouldn't have been.
> However, coll/ml was getting called during the collective initialization
> phase so it could set itself up, even if it wasn't being used. One part of
> its setup is a somewhat expensive connectivity computation - one of our
> last-minute cleanups was removal of a static 1MB array in that procedure.
> Changing the priority to 0 completely disables the coll/ml component, thus
> removing it from even the initialization phase. My guess is that you were
> seeing a measurable "hit" by that procedure on your small data tests, which
> probably ran fairly quickly - and not seeing it on the other tests because
> the setup time was swamped by the computation time.
> > Tetsuya
> > _______________________________________________
> > users mailing list
> > users_at_[hidden]
> > http://www.open-mpi.org/mailman/listinfo.cgi/users