We recently hit memory errors when scaling a code to 1000 cores.
Switching to SRQ, with some guessed queue values, appears to let the code run:
S,4096,128:S,12288,128:S,65536,12
Two questions:
This is a ConnectX fabric. Should I switch them to XRC queues? And should I use the same queue sizes and counts? Is that a safe assumption?
X,4096,128:X,12288,128:X,65536,12
When should I use one queue type over the other?
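To make the comparison concrete, here is a rough sketch (my own helper, not from any MPI tool) that rewrites the SRQ spec above with another queue type and adds up the receive buffer bytes each queue asks for. It assumes every colon-separated entry is "type,buffer size in bytes,buffer count", passes any extra fields through untouched, and ignores per-buffer header overhead, so the totals are only a lower bound. The function names are hypothetical.

```python
def parse_queues(spec):
    """Split a spec like 'S,4096,128:S,65536,12' into (type, size, count, extras)."""
    queues = []
    for entry in spec.split(":"):
        fields = entry.split(",")
        # Assumption: field 0 is the queue type, field 1 the buffer size in
        # bytes, field 2 the number of buffers; anything after is kept as-is.
        queues.append((fields[0], int(fields[1]), int(fields[2]), fields[3:]))
    return queues


def with_queue_type(spec, new_type):
    """Return the same spec with every queue's type letter replaced."""
    return ":".join(
        ",".join([new_type, str(size), str(count), *extras])
        for _, size, count, extras in parse_queues(spec)
    )


def buffer_bytes(spec):
    """Return (buffer size, size * count) per queue, excluding any overhead."""
    return [(size, size * count) for _, size, count, _ in parse_queues(spec)]


if __name__ == "__main__":
    srq = "S,4096,128:S,12288,128:S,65536,12"
    print(with_queue_type(srq, "X"))
    for size, total in buffer_bytes(srq):
        print(f"queue of size {size}: {total} bytes of receive buffers")
```

This only does the arithmetic on the spec string; it says nothing about whether the same sizes and counts are actually appropriate for XRC.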
Is there a way to get statistics on how your shared queues (SRQ or XRC) are being used?
For example, when running code 'not from here', I would like to be told "hey, you are always running out of your queue of size X" or "your queue of size Y is never used".
We are kinda blind for a lot of our applications :-)
Brock Palen
www.umich.edu/~brockp
CAEN Advanced Computing
brockp_at_[hidden]
(734)936-1985