I have an application that runs under Open MPI with InfiniBand flags set.
The application is a forecast model simulation. A frequent problem is that
the InfiniBand mezzanine cards in the servers become faulty (I don't know
why it happens so often), and the model simulation then becomes very
slow or even hangs. I have to manually remove nodes from the
hostlist one by one to find which nodes have faulty InfiniBand so that I
can run the model on the remaining nodes. Is there any way to check
during a job run which node is having communication problems over
InfiniBand or is delaying the application?