>> What version of GM are you running?
> # rpm -qa |egrep "^gm-[0-9]+|^gm-devel"
> Is this too old?
Nope, that's just fine.
>> A mismatch between the list
>> of nodes actually configured onto the Myrinet fabric and the machine file
>> is a common source of errors like this. The mismatch could be caused by
>> failure or other mapping issues.
> Could you elaborate on the mapping issues you mentioned? What are they?
Suppose you have 3 nodes, A, B, and C, and the mapper on node C dies for some
reason (very unusual, but maybe it was killed by mistake, say). If node B is
then rebooted, it will come back up with routes to only node A and itself,
though A and C will still have routes everywhere. The map versions on A and B
will match, but C will have an old map version. Thus, an MPI job spanning
A, B, and C would fail, even though all 3 nodes show up in gm_board_info
from node A.
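
To make the check concrete, here is a minimal Python sketch of how you might spot the odd node out once you have noted each node's map version by hand. The function name, the dict layout, and the version numbers are all made up for illustration; nothing here parses real gm_board_info output, and treating the most common version as the "good" one is just an assumption that holds in the A/B/C example above.

```python
from collections import Counter


def find_map_version_outliers(map_versions):
    """Return the sorted host names whose map version differs from the
    most common version in the given {host: version} dict.

    Assumption: the majority version is the current one. That matches
    the A/B/C scenario, but it is not a rule of the GM mapper.
    """
    if not map_versions:
        return []
    majority_version, _ = Counter(map_versions.values()).most_common(1)[0]
    return sorted(
        host for host, version in map_versions.items()
        if version != majority_version
    )


# Hypothetical versions mirroring the scenario: A and B agree, C is stale.
observed = {"A": 12, "B": 12, "C": 11}
print(find_map_version_outliers(observed))  # prints ['C']
```

Any node this reports is a candidate for restarting its mapper before you launch a job across the whole set.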
>> Why GM instead of MX, by the way?
> We have a few MX cards in-house, but no MX switch due to its current
> market price. So we're only able to perform MX testing using
> direct-connection cables, which is not very exciting :) By
> contrast, we already have GM boards and a switch, and have found them
> sufficient for OpenMPI testing purposes. It would be great to upgrade to
> MX in the near future.
MX is just a different software stack; the hardware is the same. MX works
with both 2G and 10G, but GM does not work with the 10G cards. I see from
your gm_board_info output that you are using D-cards, which MX supports
(anything D or later is supported by MX, but not B or C cards). Switches
don't care about MX vs. GM. MX will give better performance than GM for most
MPI applications, and hardware too old for MX is fairly uncommon.