
Subject: Re: [OMPI users] weird problem with passing data between nodes
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2008-06-21 09:22:00


Yes, this does sound like the classic "assuming MPI buffering" case.
Check out this magazine column that I wrote a long time ago about this
topic:

     http://cw.squyres.com/columns/2004-08-CW-MPI-Mechanic.pdf

It's #1 on the top 10 list of All-Time Favorite Evils to Avoid in
Parallel. :-)
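
To make that concrete, here is a minimal sketch of the "assuming MPI
buffering" pattern (not taken from zach's code; the message size, datatype,
and two-rank layout are assumptions). Both ranks post a blocking MPI_Send
before their MPI_Recv; small messages usually go out eagerly, so the code
appears to work, but once the message exceeds the implementation's eager
limit both sends block waiting for a matching receive and the program hangs:

    /* deadlock.c: both ranks send first, receive second.  Correct only
       if MPI buffers the outgoing message, which the standard does not
       guarantee. */
    #include <mpi.h>
    #include <stdlib.h>

    #define N 100000    /* large enough to exceed a typical eager limit */

    int main(int argc, char **argv)
    {
        int rank, peer;
        double *sendbuf, *recvbuf;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        peer = (rank == 0) ? 1 : 0;      /* assumes exactly 2 ranks */

        sendbuf = malloc(N * sizeof(double));
        recvbuf = malloc(N * sizeof(double));

        /* For large N, both ranks block here: each MPI_Send waits for
           the peer to post a receive, but the peer is stuck in its own
           MPI_Send. */
        MPI_Send(sendbuf, N, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD);
        MPI_Recv(recvbuf, N, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);

        free(sendbuf);
        free(recvbuf);
        MPI_Finalize();
        return 0;
    }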

One comment on Mattijs's email: please don't use bsend. Bsend is
evil. :-)
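
For a simple pairwise exchange, MPI_Sendrecv is usually the easiest
deadlock-free alternative: the library pairs up the send and the receive
internally, and there is no attached buffer to manage as with MPI_Bsend.
A sketch, written as a drop-in replacement for the Send/Recv pair in the
example above (again an illustration, not code from this thread):

    /* Exchange one vector with the peer in a single call; safe for any
       message size, with no reliance on MPI-side buffering. */
    MPI_Sendrecv(sendbuf, N, MPI_DOUBLE, peer, 0,   /* outgoing */
                 recvbuf, N, MPI_DOUBLE, peer, 0,   /* incoming */
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);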

On Jun 13, 2008, at 5:27 AM, Mattijs Janssens wrote:

> Sounds like a typical deadlock situation: all processors are waiting
> for one another.
>
> I'm not a specialist, but from what I know, if the messages are small
> enough they'll be offloaded to the kernel/hardware and there is no
> deadlock. That's why it might work for small messages and/or certain
> MPI implementations.
>
> Solutions:
> - come up with a global communication schedule such that whenever one
>   processor sends, the receiver is receiving.
> - use mpi_bsend. Might be slower.
> - use mpi_isend, mpi_irecv (but then you'll have to make sure the
>   buffers stay valid for the duration of the communication).
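
A minimal sketch of the mpi_isend / mpi_irecv option just mentioned (the
buffer, count, and peer names are placeholders, not code from this thread;
the key constraint is that neither buffer may be touched until MPI_Waitall
returns):

    /* Post the receive and the send without blocking, then wait for both.
       sendbuf and recvbuf must stay valid and untouched until MPI_Waitall
       returns. */
    MPI_Request reqs[2];

    MPI_Irecv(recvbuf, N, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(sendbuf, N, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &reqs[1]);
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);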
>
> On Friday 13 June 2008 01:55, zach wrote:
>> I have a weird problem that shows up when I use LAM or Open MPI, but
>> not MPICH.
>>
>> I have a parallelized code working on a really large matrix. It
>> partitions the matrix column-wise and ships the columns off to
>> processors, so any given processor is working on a matrix with the
>> same number of rows as the original but a reduced number of columns.
>> As part of the algorithm, each processor needs to send a single
>> column vector entry from its own matrix to the adjacent processor,
>> and vice versa.
>>
>> I have found that depending on the number of rows of the matrix (that
>> is, the size of the vector being sent with MPI_Send and MPI_Recv),
>> the simulation will hang. Only when I reduce this dimension below a
>> certain maximum will the sim run properly. I have also found that
>> this magic number differs depending on the system I am using, e.g. my
>> home quad-core box or a remote cluster.
>>
>> As I mentioned, I have not had this issue with MPICH. I would like to
>> understand why it is happening rather than just defect over to MPICH
>> to get by.
>>
>> Any help would be appreciated!
>> zach

-- 
Jeff Squyres
Cisco Systems