We hit a problem recently with memory errors when scaling a code to 1000 cores.
Switching to SRQ, with queue values chosen largely by guesswork, appears to let the code run.
This is a ConnectX fabric; should I switch them to XRC queues? If so, should I use the same queue size/count? Is that a safe assumption?
When should I use one queue type over the other?
Is there a way to get statistics on how your shared queues (SRQ or XRC) are being used?
For example, when running code 'not from here', we would like to be told, "hey, you are always running out of your queue of size X" or "your queue of size Y is never used".
We are kind of blind for a lot of our applications :-)
CAEN Advanced Computing