Date: August 28, 2008 9:43:11 AM MDT
Subject: error on QCD run
I've seen the following error a couple of times now during a multi-node QCD run. Does this indicate an MPI driver issue, or maybe an IB network problem?
--------
Input file generated. Current time is: Thu Aug 28 00:38:47 2008 UTC
Starting executable preplat via "mpirun -np 512 ./preplat"
[0,1,452][btl_openib_component.c:1338:btl_openib_component_progress] from loa126 to: loa119 error polling HP CQ with status LOCAL QP OPERATION ERROR status number 2 for wr_id 141710328 opcode -1
mlx4: local QP operation err (QPN 8800ae, WQE index bfab0000, vendor syndrome 6f, opcode = 5e)
mpirun noticed that job rank 0 with PID 10676 on node loa031 exited on signal 15 (Terminated).
510 additional processes aborted (not shown)
mpirun finished with code 36608
--------
Thanks for any insight.
Craig