Hi,
I've been investigating an intermittent segmentation fault that seems to occur during a collective communication when using the openib BTL (see the thread "[OMPI users] [openib] segfault when using openib btl").
During my tests, I came across two other errors reported by Open MPI 1.4.2:
1/
[[12770,1],0][btl_openib_component.c:3227:handle_wc] from bn0103 to: bn0122 error polling LP CQ with status LOCAL LENGTH ERROR status number 1 for wr_id 560618664 opcode 1 vendor error 105 qp_idx 3
2/
[[992,1],6][btl_openib_component.c:3227:handle_wc] from pbn04 to: pbn05 error polling LP CQ with status REMOTE ACCESS ERROR status number 10 for wr_id 162858496 opcode 1 vendor error 136 qp_idx 0
[[992,1],5][btl_openib_component.c:3227:handle_wc] from pbn05 to: pbn04 error polling HP CQ with status WORK REQUEST FLUSHED ERROR status number 5 for wr_id 485900928 opcode 0 vendor error 249 qp_idx 0
--------------------------------------------------------------------------
The OpenFabrics stack has reported a network error event. Open MPI will try to continue, but your job may end up failing.
Local host: p'"
MPI process PID: 20743
Error number: 3 (IBV_EVENT_QP_ACCESS_ERR)
This error may indicate connectivity problems within the fabric; please contact your system administrator.
--------------------------------------------------------------------------
I'd like to know what these two errors mean and where they come from.
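In case it helps with triage, here is a minimal Python sketch that pulls the individual fields out of the handle_wc lines quoted above. The regular expression is inferred only from the three sample lines in this message, not from the Open MPI source, and the names parse_wc_errors and VERBS_STATUS are hypothetical. The status numbers appear to correspond to libibverbs' enum ibv_wc_status values, so the sketch lists the three seen here; treat that mapping as an assumption to check against your own verbs.h.

```python
import re

# Field layout inferred from the sample lines in this post (assumption).
# Whitespace around "qp_idx" is matched loosely so that lines wrapped by a
# mail client or terminal are still recognised.
WC_LINE = re.compile(
    r"(?P<proc>\[\[\d+,\d+\],\d+\])"
    r"\[(?P<file>[^:\]]+):(?P<line>\d+):(?P<func>\w+)\] "
    r"from (?P<src>\S+) to: (?P<dst>\S+) "
    r"error polling (?P<cq>LP|HP) CQ with status (?P<status>.+?) "
    r"status number (?P<status_num>\d+) for wr_id (?P<wr_id>\d+) "
    r"opcode (?P<opcode>\d+) vendor error (?P<vendor_err>\d+)"
    r"\s+qp_idx\s+(?P<qp_idx>\d+)"
)

# Assumed correspondence with enum ibv_wc_status in libibverbs; only the
# three values that occur in the log above are listed.
VERBS_STATUS = {
    1: "IBV_WC_LOC_LEN_ERR",
    5: "IBV_WC_WR_FLUSH_ERR",
    10: "IBV_WC_REM_ACCESS_ERR",
}

INT_FIELDS = ("line", "status_num", "wr_id", "opcode", "vendor_err", "qp_idx")


def parse_wc_errors(text):
    """Return one dict per handle_wc error line found in text."""
    records = []
    for match in WC_LINE.finditer(text):
        rec = match.groupdict()
        for key in INT_FIELDS:
            rec[key] = int(rec[key])
        # Fall back to "unknown" for status numbers not listed above.
        rec["verbs_name"] = VERBS_STATUS.get(rec["status_num"], "unknown")
        records.append(rec)
    return records


# Log excerpt copied from this message, including the awkward line wraps.
SAMPLE = (
    "[[12770,1],0][btl_openib_component.c:3227:handle_wc] from bn0103 to: "
    "bn0122 error polling LP CQ with status LOCAL LENGTH ERROR status number 1 "
    "for wr_id 560618664 opcode 1 vendor error 105 qp_idx 3\n"
    "[[992,1],6][btl_openib_component.c:3227:handle_wc] from pbn04 to: pbn05 "
    "error polling LP CQ with status REMOTE ACCESS ERROR status number 10 "
    "for wr_id 162858496 opcode 1 vendor error 136 qp_idx\n"
    "0[[992,1],5][btl_openib_component.c:3227:handle_wc] from pbn05 to: pbn04 "
    "error polling HP CQ with status WORK REQUEST FLUSHED ERROR status number 5 "
    "for wr_id 485900928 opcode 0 vendor error 249\n"
    "qp_idx 0\n"
)

if __name__ == "__main__":
    for rec in parse_wc_errors(SAMPLE):
        print(rec["src"], "->", rec["dst"], rec["cq"], rec["status"],
              rec["status_num"], rec["verbs_name"], "qp_idx", rec["qp_idx"])
```

Run on the excerpt above, this yields three records, which makes it easier to see which host pair and which queue pair index each error belongs to.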
Thanks for your help,
Eloi
--
Eloi Gaudry
Free Field Technologies
Company Website: http://www.fft.be
Company Phone: +32 10 487 959