Hello.
Sorry for the delay in confirming the minimum load that would trigger
the RnR error; the holidays here were a significant interruption.
On Mon, Dec 19, 2011, at 03:30 PM, Yevgeny Kliteynik wrote:
> What's the smallest number of nodes that are needed to reproduce this
> problem? Does it happen with just two HCAs, one process per node?
Our nodes with these HCAs are dual-socket, with 4 Intel cores per
socket (8 cores per node).
Working with the users, we found that we could not reproduce the
issue with fewer than 3 nodes and 17 processes total, with no nodes
oversubscribed. That is, two nodes were running 8 processes each and
the third was running 1 process.
It could be some sort of race condition or timing issue that could
theoretically be triggered with a smaller configuration than this,
but we weren't able to provoke it.
> Let's get you to the latest firmware GA of this card.
Just as a reminder, I responded to the firmware part of this earlier:
http://www.open-mpi.org/community/lists/users/2011/12/18014.php
Thank you,
V. Ram