Hi Yevgeny and list,
----- Original Message -----
> From: Yevgeny Kliteynik <kliteyn_at_[hidden]>
> I'll check the MCA_BTL_OPENIB_TRANSPORT_UNKNOWN thing and get back to you.
> One question though, just to make sure we're on the same page: so the jobs
> do run OK on the older HCAs, as long as they run *only* on the older HCAs,
> right?
Yes, correct. Jobs run on the newer hosts with the newer (ConnectX) HCAs as long as they stay on that HCA type, and they run on the older (mthca) HCAs as long as they stay on that type as well. IOW, as long as a job runs on homogeneous IB hardware, it runs successfully to completion. We've also successfully used Checkpoint/Restart via the BLCR functionality, and it all seems to work well and robustly.
> Please make sure that the jobs are using only IB with "--mca btl
> openib,self" parameters.
The system is in use right now, so I will have to test this and get back to you, but I can say with certainty that we don't specify any --mca parameters unless a user needs to run Ethernet-only (to avoid the IB errors we're discussing). Otherwise, everything runs with the Open MPI 1.5.3 default behavior. The users are also all using the systemwide Open MPI installation, so this isn't an issue of a stale local configuration left over from multiple parallel installs, interfering copies of different builds, etc.
Other than the mandatory iw_cm kernel module, we are not building or using any iWARP or DAPL/uDAPL functionality. We are also not running IP on the IB network.