
Subject: Re: [OMPI users] weird problem with passing data between nodes
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2008-06-21 09:22:00

Yes, this does sound like the classic "assuming MPI buffering" case.
Check out the magazine column that I wrote a long time ago about this topic.

It's #1 on the top 10 list of All-Time Favorite Evils to Avoid in
Parallel. :-)
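
To make the pattern concrete, here is a minimal sketch (hypothetical,
not zach's actual code) of the kind of exchange that hits this evil:
both ranks call MPI_Send before MPI_Recv, which only completes while
the message is small enough for the MPI implementation to buffer it
eagerly.

  /* Hypothetical sketch of the "assuming MPI buffering" deadlock:
   * both ranks send first and receive second.  This works only while
   * the message fits in the implementation's eager buffers; above
   * that threshold both MPI_Send calls block forever. */
  #include <mpi.h>
  #include <stdlib.h>

  int main(int argc, char **argv)
  {
      int rank, n = 100000;            /* large n => likely to hang */
      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      double *sendbuf = calloc(n, sizeof(double));
      double *recvbuf = calloc(n, sizeof(double));
      int peer = (rank == 0) ? 1 : 0;  /* assumes exactly 2 ranks */

      /* Both ranks send first... */
      MPI_Send(sendbuf, n, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD);
      /* ...and only then receive: deadlock for large n. */
      MPI_Recv(recvbuf, n, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD,
               MPI_STATUS_IGNORE);

      free(sendbuf);
      free(recvbuf);
      MPI_Finalize();
      return 0;
  }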

One comment on Mattijs's email: please don't use bsend. Bsend is
evil. :-)
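
Instead of bsend, the usual fixes for this kind of pairwise exchange
are MPI_Sendrecv, or nonblocking MPI_Isend/MPI_Irecv followed by a
wait, neither of which depends on the implementation buffering the
message. Another hedged sketch (same hypothetical two-rank setup as
above, not the poster's actual code):

  /* Hypothetical sketch of a buffering-independent exchange between
   * two ranks: the send and the matching receive are in flight at the
   * same time, so it completes for any message size. */
  #include <mpi.h>
  #include <stdlib.h>

  int main(int argc, char **argv)
  {
      int rank, n = 100000;
      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      double *sendbuf = calloc(n, sizeof(double));
      double *recvbuf = calloc(n, sizeof(double));
      int peer = (rank == 0) ? 1 : 0;   /* assumes exactly 2 ranks */

      /* Option 1: combined send+receive, no ordering to get wrong. */
      MPI_Sendrecv(sendbuf, n, MPI_DOUBLE, peer, 0,
                   recvbuf, n, MPI_DOUBLE, peer, 0,
                   MPI_COMM_WORLD, MPI_STATUS_IGNORE);

      /* Option 2: nonblocking; the buffers must stay valid until
       * MPI_Waitall returns (Mattijs's caveat below). */
      MPI_Request req[2];
      MPI_Irecv(recvbuf, n, MPI_DOUBLE, peer, 1, MPI_COMM_WORLD, &req[0]);
      MPI_Isend(sendbuf, n, MPI_DOUBLE, peer, 1, MPI_COMM_WORLD, &req[1]);
      MPI_Waitall(2, req, MPI_STATUSES_IGNORE);

      free(sendbuf);
      free(recvbuf);
      MPI_Finalize();
      return 0;
  }

Either form removes the dependence on the eager-message threshold,
which is why the original code works below a certain vector size and
hangs above it.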

On Jun 13, 2008, at 5:27 AM, Mattijs Janssens wrote:

> Sounds like a typical deadlock situation: all processors are waiting
> for one another.
>
> Not a specialist, but from what I know, if the messages are small
> enough they'll be offloaded to kernel/hardware and there is no
> deadlock. That's why it might work for small messages and/or certain
> MPI implementations.
>
> Solutions:
> - come up with a global communication schedule such that whenever one
>   processor sends, the receiver is receiving.
> - use mpi_bsend. Might be slower.
> - use mpi_isend/mpi_irecv (but then you'll have to make sure the
>   buffers stay valid for the duration of the communication).
> On Friday 13 June 2008 01:55, zach wrote:
>> I have a weird problem that shows up when I use LAM or OpenMPI but
>> not mpich.
>>
>> I have a parallelized code working on a really large matrix. It
>> partitions the matrix column-wise and ships the pieces off to
>> processors, so any given processor is working on a matrix with the
>> same number of rows as the original but a reduced number of columns.
>> As part of the algorithm, each processor needs to send a single
>> column vector entry from its own matrix to the adjacent processor,
>> and vice versa.
>> I have found that depending on the number of rows of the matrix
>> (that is, the size of the vector being sent with MPI_Send/MPI_Recv),
>> the simulation will hang. Only when I reduce this dimension below a
>> certain maximum will the sim run properly. I have also found that
>> this magic number differs depending on the system I am using, e.g.
>> my home quad-core box or a remote cluster.
>>
>> As I mentioned, I have not had this issue with mpich. I would like
>> to understand why it is happening rather than just defect over to
>> mpich to get by.
>>
>> Any help would be appreciated!
>> zach

Jeff Squyres
Cisco Systems