
Open MPI User's Mailing List Archives


From: Arif Ali (aali_at_[hidden])
Date: 2007-01-19 18:19:09


-----Original Message-----
From: Gleb Natapov [mailto:glebn_at_[hidden]]
Sent: Fri 19/01/2007 18:33
To: Arif Ali
Cc: Open MPI Users; Galen Shipman; Brad Benton; Pavel Shamis; Russell Slack; Barry Evans
Subject: Re: [OMPI users] OpenMPI/OpenIB/IMB hangs[Scanned]
 
On Fri, Jan 19, 2007 at 05:51:49PM +0000, Arif Ali wrote:
> >>I tried the nightly snapshot of OpenMPI-1.2b4r13137, which failed
> >>miserably.
> >>
> >
> >Can you describe what happened there? Is it failing in a different way?
> >
> Here's the output
>
> #---------------------------------------------------
> # Intel (R) MPI Benchmark Suite V2.3, MPI-1 part
> #---------------------------------------------------
> # Date : Fri Jan 19 17:33:52 2007
> # Machine : ppc64
> # System : Linux
> # Release : 2.6.16.21-0.8-ppc64
> # Version : #1 SMP Mon Jul 3 18:25:39 UTC 2006
>
> #
> # Minimum message length in bytes: 0
> # Maximum message length in bytes: 4194304
> #
> # MPI_Datatype : MPI_BYTE
> # MPI_Datatype for reductions : MPI_FLOAT
> # MPI_Op : MPI_SUM
> #
> #
>
> # List of Benchmarks to run:
>
> # PingPong
> # PingPing
> # Sendrecv
> # Exchange
> # Allreduce
> # Reduce
> # Reduce_scatter
> # Allgather
> # Allgatherv
> # Alltoall
> # Bcast
> # Barrier
>
> #---------------------------------------------------
> # Benchmarking PingPong
> # #processes = 2
> # ( 58 additional processes waiting in MPI_Barrier)
> #---------------------------------------------------
> #bytes #repetitions t[usec] Mbytes/sec
> 0 1000 1.76 0.00
> 1 1000 1.88 0.51
> 2 1000 1.89 1.01
> 4 1000 1.91 2.00
> 8 1000 1.88 4.05
> 16 1000 2.02 7.55
> 32 1000 2.05 14.88
> [0,1,4][btl_openib_component.c:1153:btl_openib_component_progress] from
> node03 to: node02 error polling HP CQ with status REMOTE ACCESS ERROR
> status number 10 for wr_id 268969528 opcode 128
> [0,1,28][btl_openib_component.c:1153:btl_openib_component_progress] from
> node09 to: node02 error polling HP CQ with status REMOTE ACCESS ERROR
> status number 10 for wr_id 268906808 opcode 128
> [0,1,58][btl_openib_component.c:1153:btl_openib_component_progress] from
> node16 to: node02 error polling HP CQ with status REMOTE ACCESS ERROR
> status number 10 for wr_id 268919352 opcode 256614836
> [0,1,0][btl_openib_component.c:1153:btl_openib_component_progress] from
> node02 to: node03 error polling HP CQ with status WORK REQUEST FLUSHED
> ERROR status number 5 for wr_id 276070200 opcode 0
> [0,1,59][btl_openib_component.c:1153:btl_openib_component_progress] from
> node16 to: node02 error polling HP CQ with status REMOTE ACCESS ERROR
> status number 10 for wr_id 268919352 opcode 256614836
> mpirun noticed that job rank 0 with PID 0 on node node02 exited on
> signal 15 (Terminated).
> 55 additional processes aborted (not shown)
Does this happen with btl_openib_flags=1? Does this also happen without
that setting? This doesn't happen with OpenMPI-1.2b3, right?

That's correct; I tried all the flags that were suggested, and a few more, which I listed in previous mails.

Yes, correct, this doesn't happen with 1.2b3.
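
[Editor's note: for readers following this thread, a minimal sketch of the mpirun invocation being discussed. The hostfile name and process count are assumptions; only the MCA parameter itself comes from the thread. In Open MPI, btl_openib_flags is a bitmask of allowed protocols (1 = send/receive only), so setting it to 1 disables RDMA put/get on the openib BTL, which is a common way to isolate RDMA-related completion-queue errors like the REMOTE ACCESS ERROR shown above.]

```shell
# Sketch only: hostfile name and -np count are assumptions.
# btl_openib_flags=1 restricts the openib BTL to send/receive
# (no RDMA put/get), useful for isolating RDMA-related CQ errors.
mpirun --mca btl openib,self --mca btl_openib_flags 1 \
       -np 60 --hostfile myhosts ./IMB-MPI1
```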