
Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] Re: Re: OpenMP and OpenMPI Issue
From: Jack Galloway (jackg_at_[hidden])
Date: 2012-07-20 13:33:06


Jeff Squyres <jsquyres <at> cisco.com> writes:

>
> On Oct 31, 2007, at 9:52 PM, Neeraj Chourasia wrote:
>
> > but the program is running on the TCP interconnect with the same
> > data size, and also on IB with a small data size, say 1 MB. So I don't
> > think the problem is in Open MPI; it has something to do with the IB
> > logic, which probably doesn't work well with threads.
>
> Open MPI's TCP code nominally supports threads, but I'd be surprised if it
> works consistently (i.e., it has not been tested thoroughly). The
> Open MPI IB code definitely does not yet work with threads.
>
> > I also tried the program with MPI_THREAD_SERIALIZED, but in vain.
>
> Open MPI currently treats this no differently than THREAD_SINGLE;
> the problem is that you'll still have multiple threads
> calling MPI simultaneously in your program.
>
> > When is version 1.3 scheduled to be released? Would it fix
> > such issues?
>
> No. We had been planning to make THREAD_MULTIPLE support available
> in the 1.3 series, but there honestly has not been enough customer
> demand for it to justify putting in the resources / spending the
> time to finish it in Open MPI. THREAD_MULTIPLE is still on the
> long-term roadmap, but it will not be included in the
> 1.4 series.
>
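
For concreteness, the behavior described above (requesting one thread level and
being handed back a lower one) looks roughly like this in C. This is an
illustrative sketch, not code from either application:

/* Ask MPI for full thread support and check what it actually provides.
 * The levels are ordered: MPI_THREAD_SINGLE < MPI_THREAD_FUNNELED <
 * MPI_THREAD_SERIALIZED < MPI_THREAD_MULTIPLE. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int requested = MPI_THREAD_MULTIPLE;
    int provided, rank;

    MPI_Init_thread(&argc, &argv, requested, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0 && provided < requested) {
        /* The library granted a lower level, so concurrent MPI calls
         * from different threads are not safe. */
        printf("warning: requested thread level %d, got %d\n",
               requested, provided);
    }

    MPI_Finalize();
    return 0;
}

With anything less than MPI_THREAD_MULTIPLE, the usual workaround is to make
sure that only one thread ever calls MPI.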

This is an old thread, and I'm curious whether there is support for this now. I'm
running a large hybrid MPI/OpenMP code that is having trouble over our InfiniBand
network. The problem is fairly large (it uses about 18 GB), and part way in I get
the following errors:

[[929,1],0][btl_openib_component.c:3238:handle_wc] from tebow to: tebow416 error
polling LP CQ with status LOCAL LENGTH ERROR status number 1 for wr_id 103761776
opcode 128 vendor error 105 qp_idx 3
--------------------------------------------------------------------------
mpirun has exited due to process rank 0 with PID 29873 on
node tebow exiting without calling "finalize". This may
have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------
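
If the root cause is the same one discussed above (MPI being called from more
than one OpenMP thread), one restructuring I'm considering is to keep MPI calls
out of the parallel regions entirely, so that only MPI_THREAD_FUNNELED-level
support is needed. A rough C sketch of that pattern (illustrative only, not my
actual code):

/* Funneled-style hybrid pattern: OpenMP threads do the compute, but MPI is
 * only called outside the parallel region, i.e. by a single thread. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided, rank, i;
    const int n = 1000000;
    double local_sum = 0.0, global_sum = 0.0;

    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Threaded compute: no MPI calls inside this region. */
    #pragma omp parallel for reduction(+:local_sum)
    for (i = 0; i < n; i++)
        local_sum += 1.0 / (double)(i + 1);

    /* Communication happens on a single thread only. */
    MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0,
               MPI_COMM_WORLD);

    if (rank == 0)
        printf("global sum = %f\n", global_sum);

    MPI_Finalize();
    return 0;
}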

This seems very similar to the question that originated this thread, and since
we're now on version 1.4.5, I was wondering whether there is any better help for
this (compiler options, run-time flags, or anything else), or whether someone has
encountered this problem and solved it.
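
One thing I can try in the meantime, mostly to confirm that the IB path is the
culprit, is to exclude the openib BTL so the job falls back to TCP, e.g.:

  mpirun --mca btl ^openib <rest of the usual command line>

But at this problem size I would much rather get it working over InfiniBand, so
any pointers are appreciated.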

Thanks,
Jack