Yo folks

Does anyone have a suggestion as to what might be causing this? This is with the 1.2.4 release, if that helps. We are trying to test the cluster, so it could be a hardware problem - we just want to narrow it down if we can. Any debugging suggestions would also be welcome.


Begin forwarded message:

From: Craig Idler <cwi@lanl.gov>
Date: August 28, 2008 9:43:11 AM MDT
To: tlcc-install@lanl.gov
Cc: Trent D'Hooge <tdhooge@llnl.gov>
Subject: error on QCD run

I've seen the following error a couple of times now during a multi-node QCD run. Does this indicate an MPI driver issue or maybe an IB network problem?


Input file generated. Current time is: Thu Aug 28 00:38:47 2008 UTC
Starting executable preplat via "mpirun -np 512 ./preplat"

[0,1,452][btl_openib_component.c:1338:btl_openib_component_progress] from loa126 to: loa119 error polling HP CQ with status LOCAL QP OPERATION ERROR status number 2 for wr_id 141710328 opcode -1
mlx4: local QP operation err (QPN 8800ae, WQE index bfab0000, vendor syndrome 6f, opcode = 5e)
mpirun noticed that job rank 0 with PID 10676 on node loa031 exited on signal 15 (Terminated).
510 additional processes aborted (not shown)
mpirun finished with code 36608


Thanks for any insight.