Thanks for your reply,
but the program runs fine over the TCP interconnect with the same data size, and also over IB with a small data size, say 1 MB. So I don't think the problem is in Open MPI in general; it seems to be something in the IB logic, which probably doesn't work well with threads.
I also tried the program with MPI_THREAD_SERIALIZED, but in vain.
When is version 1.3 scheduled to be released? Would it fix such issues?
Correct me if I am wrong.
-Neeraj
On Wed, 31 Oct 2007 05:31:32 -0700 Open MPI Users wrote:
THREAD_MULTIPLE support does not work in the 1.2 series. Try turning
it off.
On Oct 30, 2007, at 12:17 AM, Neeraj Chourasia wrote:
> Hi folks,
>
> I have been seeing some nasty behaviour in MPI_Send/Recv
> with a large dataset (8 MB) when OpenMP and Open MPI are used
> together with the IB interconnect. Attached is a program.
>
> The code first calls MPI_Init_thread(), followed by the OpenMP
> thread creation API. The program works fine if communication goes
> in one direction only [thread 0 of process 0 sending some data to
> any thread of process 1], but it hangs if both sides try to send
> some data (8 MB) over the IB interconnect.
>
> Interestingly, the program works fine if we send
> short data (1 MB or below).
>
> I see this with
>
> openmpi-1.2 or openmpi-1.2.4 (compiled with --enable-mpi-threads)
> ofed 1.2
> 2.6.9-42.4sp.XCsmp
> icc (Intel Compiler)
>
> compiled as
> mpicc -O3 -openmp temp.c
> run as
> mpirun -np 2 -hostfile nodelist a.out
>
> The error I am getting is:
>
> ----------------------------------------------------------------------
>
> [0,1,1][btl_openib_component.c:1199:btl_openib_component_progress] from n129 to: n115 error polling LP CQ with status LOCAL PROTOCOL ERROR status number 4 for wr_id 6391728 opcode 0
> [0,1,1][btl_openib_component.c:1199:btl_openib_component_progress] from n129 to: n115 error polling LP CQ with status WORK REQUEST FLUSHED ERROR status number 5 for wr_id 7058304 opcode 128
> [0,1,0][btl_openib_component.c:1199:btl_openib_component_progress] from n115 to: n129 [0,1,0][btl_openib_component.c:1199:btl_openib_component_progress] from n115 to: n129 error polling LP CQ with status WORK REQUEST FLUSHED ERROR status number 5 for wr_id 6854256 opcode 128
> error polling LP CQ with status LOCAL LENGTH ERROR status number 1 for wr_id 6920112 opcode 0
>
> ----------------------------------------------------------------------
>
>
> Is anyone else seeing something similar? Any ideas for workarounds?
> As a point of reference, the program works fine if we force
> Open MPI to select the TCP interconnect using --mca btl tcp,self.
>
> -Neeraj
>
>
> _______________________________________________
> users mailing list
> users@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
--
Jeff Squyres
Cisco Systems
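
A note on the status numbers in the quoted error output (1, 4 and 5): they appear to line up with the first entries of enum ibv_wc_status in libibverbs (<infiniband/verbs.h>). The sketch below is only a reading aid for that log, not part of Open MPI or of the attached program. The function names wc_status_name() and wc_status_is_flush() are made up for illustration, and the table deliberately stops at the highest code seen in the log.

```c
#include <stddef.h>

/* First entries of enum ibv_wc_status, in numeric order.
 * The table is intentionally partial: it covers only codes 0..5,
 * which is enough for the statuses that appear in the quoted log. */
static const char *const wc_status_names[] = {
    "IBV_WC_SUCCESS",        /* 0 */
    "IBV_WC_LOC_LEN_ERR",    /* 1: printed as LOCAL LENGTH ERROR */
    "IBV_WC_LOC_QP_OP_ERR",  /* 2 */
    "IBV_WC_LOC_EEC_OP_ERR", /* 3 */
    "IBV_WC_LOC_PROT_ERR",   /* 4: printed as LOCAL PROTOCOL ERROR */
    "IBV_WC_WR_FLUSH_ERR",   /* 5: printed as WORK REQUEST FLUSHED ERROR */
};

/* Hypothetical helper: map a status number from the log to its enum name.
 * Returns "unknown" for anything outside the partial table above. */
const char *wc_status_name(int status)
{
    size_t count = sizeof(wc_status_names) / sizeof(wc_status_names[0]);
    if (status < 0 || (size_t)status >= count) {
        return "unknown";
    }
    return wc_status_names[status];
}

/* Hypothetical helper: 1 if the status is a flush error, else 0.
 * Flush errors are reported for work requests that were still
 * outstanding when the queue pair entered the error state, so they
 * are usually a consequence of an earlier failure, not its cause. */
int wc_status_is_flush(int status)
{
    return status == 5;
}
```

Read this way, wc_status_name(4) and wc_status_name(1) name the entries worth looking at first (the local protection and local length errors), while the two status 5 lines are most likely fallout from the queue pair already being in the error state.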