
Open MPI User's Mailing List Archives


Subject: [OMPI users] HP CQ with status LOCAL LENGTH ERROR
From: Sangamesh B (forum.san_at_[hidden])
Date: 2008-12-29 04:40:13


Hello all,

MPI-Blast-PIO 1.5.0 is installed with Open MPI 1.2.8 and the Intel 10
compilers on Rocks 4.3, with Voltaire InfiniBand and the Voltaire Grid
Stack OFA roll.

An 8-process parallel job is submitted through SGE:

$ cat sge_submit.sh
#!/bin/bash
#$ -N OMPI-Blast-Job
#$ -S /bin/bash
#$ -cwd
#$ -e err.$JOB_ID.$JOB_NAME
#$ -o out.$JOB_ID.$JOB_NAME
#$ -pe orte 8
export LD_LIBRARY_PATH=/opt/openmpi_intel/1.2.8/lib:/opt/intel/cce/10.1.018/lib:/opt/gridengine/lib/lx26-amd64
#$ -V

/opt/openmpi_intel/1.2.8/bin/mpirun -np $NSLOTS \
    /opt/apps/mpiblast-150-pio_OMPI/bin/mpiblast -p blastp \
    -d Mtub_CDC1551_.faa -i 586_seq.fasta -o test8.out
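A quick way to see whether the failure is specific to the openib (InfiniBand) transport is to rerun the same command with the BTL pinned to TCP only, then to InfiniBand only. This is a diagnostic sketch using standard Open MPI MCA parameters; the output file names test8_tcp.out and test8_ib.out are placeholders, everything else matches the submit script above.

```shell
# Force TCP-only transport: if this run succeeds, the problem is
# confined to the openib BTL / InfiniBand path.
/opt/openmpi_intel/1.2.8/bin/mpirun --mca btl tcp,self -np $NSLOTS \
    /opt/apps/mpiblast-150-pio_OMPI/bin/mpiblast -p blastp \
    -d Mtub_CDC1551_.faa -i 586_seq.fasta -o test8_tcp.out

# Conversely, pin the job to InfiniBand only to reproduce the error
# deterministically.
/opt/openmpi_intel/1.2.8/bin/mpirun --mca btl openib,self -np $NSLOTS \
    /opt/apps/mpiblast-150-pio_OMPI/bin/mpiblast -p blastp \
    -d Mtub_CDC1551_.faa -i 586_seq.fasta -o test8_ib.out
```

If the TCP run completes while the openib run fails, that narrows the fault to the IB stack or the openib BTL settings rather than to mpiBLAST itself.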

Every time, it fails with the following error message:

$ cat err.117.OMPI-Blast-Job
[0,1,7][btl_openib_component.c:1371:btl_openib_component_progress]
from compute-0-5.local to: compute-0-11.local error polling HP CQ with
status LOCAL LENGTH ERROR status number 1 for wr_id 11990008 opcode 42
4 0.481518 Bailing out with signal 15
[compute-0-5.local:25702] MPI_ABORT invoked on rank 4 in communicator
MPI_COMM_WORLD with errorcode 0
5 0.487255 Bailing out with signal 15
[compute-0-5.local:25703] MPI_ABORT invoked on rank 5 in communicator
MPI_COMM_WORLD with errorcode 0
6 0.658543 Bailing out with signal 15
[compute-0-5.local:25704] MPI_ABORT invoked on rank 6 in communicator
MPI_COMM_WORLD with errorcode 0
0 0.481974 Bailing out with signal 15
[compute-0-11.local:25698] MPI_ABORT invoked on rank 0 in communicator
MPI_COMM_WORLD with errorcode 0
1 0.660788 Bailing out with signal 15
[compute-0-11.local:25699] MPI_ABORT invoked on rank 1 in communicator
MPI_COMM_WORLD with errorcode 0
2 0.67406 Bailing out with signal 15
[compute-0-11.local:25700] MPI_ABORT invoked on rank 2 in communicator
MPI_COMM_WORLD with errorcode 0
3 0.680739 Bailing out with signal 15
[compute-0-11.local:25701] MPI_ABORT invoked on rank 3 in communicator
MPI_COMM_WORLD with errorcode 0
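Since the error names a specific sender/receiver pair (compute-0-5 to compute-0-11), the HCA and fabric state on those two nodes can be checked with the standard OFED tools before digging into Open MPI itself. A diagnostic sketch, to be run on each of the two nodes:

```shell
# Check HCA port state on each node named in the error:
# ports should report "State: Active".
ibstat

# Show device, firmware version, and supported MTU for each HCA;
# mismatched MTU or old firmware across nodes is a common culprit.
ibv_devinfo

# List the openib BTL parameters this Open MPI build recognizes,
# in case a receive-buffer size needs tuning.
/opt/openmpi_intel/1.2.8/bin/ompi_info --param btl openib
```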

This happens only with mpiBLAST; parallel GROMACS jobs run without problems.

Could you let me know why this error appears and how to resolve it? Is
it caused by the Rocks Grid Stack OFA roll?

Thanks,
Sangamesh