On 11-Jul-11 5:23 PM, Bill Johnstone wrote:
> Hi Yevgeny and list,
> ----- Original Message -----
>> From: Yevgeny Kliteynik<kliteyn_at_[hidden]>
>> I'll check the MCA_BTL_OPENIB_TRANSPORT_UNKNOWN thing and get back to you.
> Thank you.
This MCA_BTL_OPENIB_TRANSPORT_UNKNOWN thingy implies that openib
btl didn't get the transport type from verbs at all.
It works on mthcas because OMPI compares transport type on the
endpoints and sees that it is the same transport type on all of
them (never mind the fact that it is an UNKNOWN transport),
but once you have ConnectX as well, OMPI compares UNKNOWN
transport to IB and complains that these are different transports.
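
To illustrate the comparison described above, here is a simplified sketch.
It is not the actual openib btl code: the Transport enum, its values, and
the transports_compatible() helper are hypothetical names made up for this
example. It only shows why a job with all-UNKNOWN endpoints passes the check
while a mix of UNKNOWN and IB endpoints fails it.

```python
from enum import Enum


class Transport(Enum):
    # Hypothetical stand-ins for the transport types; the values are arbitrary.
    UNKNOWN = "unknown"
    IB = "ib"
    IWARP = "iwarp"


def transports_compatible(endpoint_transports):
    """Return True when every endpoint reports the same transport value.

    The check only looks for equality, so a set of endpoints that all
    report UNKNOWN is accepted just like a set that all report IB.
    """
    return len(set(endpoint_transports)) <= 1


# Only mthca HCAs: every endpoint reports UNKNOWN, so the check passes.
mthca_only = [Transport.UNKNOWN, Transport.UNKNOWN]

# mthca plus ConnectX: UNKNOWN is compared against IB, so the check fails.
mixed = [Transport.UNKNOWN, Transport.IB]

print(transports_compatible(mthca_only))
print(transports_compatible(mixed))
```
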
>> One question though, just to make sure we're on the same page: so the jobs
>> do run OK on the older HCAs, as long as they run *only* on the older HCAs, right?
> Yes, correct. They run on the newer hosts using the newer (ConnectX) HCAs as long as the jobs stay on the same (newer) HCA type, and they run on the older HCAs (mthca) so long as the jobs stay on the same HCA type as well. IOW, as long as the jobs run on homogeneous IB hardware, they run successfully to completion. We've successfully done stuff like Checkpoint/Restart using the BLCR functionality, and it all seems to work well and in a seemingly robust way.
>> Please make sure that the jobs are using only IB with "--mca btl
>> openib,self" parameters.
While I'm trying to find an old HCA somewhere, could you please
post here the output of "ibv_devinfo -v" on mthca?
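
If the full output is too long to post, a small helper along these lines
could pull out just the lines of interest. This is only a sketch: it assumes
the output labels each device with "hca_id:" and the transport type with
"transport:", and the function names transport_lines() and run_devinfo() are
made up for this example.

```python
import shutil
import subprocess


def transport_lines(devinfo_text):
    """Keep only the hca_id and transport lines from ibv_devinfo output.

    Assumes the labels "hca_id:" and "transport:" start those lines.
    """
    keep = ("hca_id:", "transport:")
    return [
        line.strip()
        for line in devinfo_text.splitlines()
        if line.strip().startswith(keep)
    ]


def run_devinfo():
    """Run `ibv_devinfo -v` if it is installed; return its stdout or None."""
    if shutil.which("ibv_devinfo") is None:
        return None
    result = subprocess.run(
        ["ibv_devinfo", "-v"], capture_output=True, text=True
    )
    return result.stdout


if __name__ == "__main__":
    output = run_devinfo()
    if output is not None:
        print("\n".join(transport_lines(output)))
```
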
> The system is in use right now, so I will have to test this and get back to you, but I can also say with certainty that we don't specify --mca parameters unless a user needs to run on Ethernet-only (to avoid the IB errors we're discussing). Otherwise, it is at the Open MPI 1.5.3 default behavior. The users are also all using the systemwide Open MPI installation, so this isn't an issue of an erroneous local configuration lying around from multiple parallel installs, or interfering copies of different builds, etc.
> Other than the mandatory iw_cm kernel module, we are not building/using any iWarp or DAPL/uDAPL functionality. We are also not running IP on the IB network.