Open MPI Development Mailing List Archives

Subject: [OMPI devel] Fwd: error on QCD run
From: Ralph Castain (rhc_at_[hidden])
Date: 2008-08-28 12:02:29


Yo folks

Does anyone have a suggestion as to what might be causing this? This
is with the 1.2.4 release, if that helps. We are trying to test the
cluster, so it could be a hardware problem; we just want to narrow it
down if we can. Any debugging suggestions would also be welcome.

Thanks
Ralph

Begin forwarded message:

> From: Craig Idler <cwi_at_[hidden]>
> Date: August 28, 2008 9:43:11 AM MDT
> To: tlcc-install_at_[hidden]
> Cc: Trent D'Hooge <tdhooge_at_[hidden]>
> Subject: error on QCD run
>
> I've seen the following error a couple of times now during a QCD
> multi-node run. Does this indicate an MPI driver issue or maybe an IB
> network problem?
>
> --------
>
> Input file generated. Current time is: Thu Aug 28 00:38:47 2008 UTC
> Starting executable preplat via "mpirun -np 512 ./preplat"
>
> [0,1,452][btl_openib_component.c:1338:btl_openib_component_progress]
> from loa126 to: loa119 error polling HP CQ with status LOCAL QP
> OPERATION ERROR status number 2 for wr_id 141710328 opcode -1
> mlx4: local QP operation err (QPN 8800ae, WQE index bfab0000, vendor
> syndrome 6f, opcode = 5e)
> mpirun noticed that job rank 0 with PID 10676 on node loa031 exited
> on signal 15 (Terminated).
> 510 additional processes aborted (not shown)
> mpirun finished with code 36608
>
> --------
>
> Thanks for any insight.
>
> Craig
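
A small aid for reading the log above: the "mpirun finished with code 36608" line looks like a raw wait()-style status rather than a plain exit code. That is an assumption about how the wrapper script reports it, not something the log confirms. Under that assumption, the sketch below splits the value into an exit code and, by the usual shell convention, a signal number. The function name decode_status is hypothetical.

```python
import signal

def decode_status(status):
    """Split a 16-bit wait()-style status into (exit_code, signal_number).

    Assumes the value is a raw wait status whose high byte holds the
    exit code; this is an assumption about the wrapper script's output.
    """
    exit_code = (status >> 8) & 0xFF  # high byte: exit code of the process
    # Shell convention: an exit code above 128 means "killed by signal (code - 128)".
    signum = exit_code - 128 if exit_code > 128 else None
    return exit_code, signum

exit_code, signum = decode_status(36608)
name = signal.Signals(signum).name if signum is not None else "none"
print(exit_code, signum, name)
```

Read this way, 36608 is exit code 143, i.e. 128 + 15, which agrees with the "exited on signal 15 (Terminated)" line. That would make the code a consequence of mpirun tearing the job down after the first failure, not an independent symptom; the QP error reported from loa126 to loa119 is the line to chase.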