Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] mpi job is blocked
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2012-09-25 03:52:34


+1

Additionally, if you're trying to debug your machines/network/setup, you might want to use something simpler, like the ring programs in the examples/ directory.
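
One way to narrow this down is to run a small 2-process job on every pair of machines: the tests quoted below cover A+B and A+C, but not B+C on its own. Here is a minimal sketch that only builds and prints the mpirun command lines (it does not launch anything). The host names A, B and C are the placeholders from the original post, and ./ring_c is assumed to be a compiled copy of a ring example; substitute your own host names and binary.

```python
import itertools
import shlex


def pairwise_mpirun_commands(hosts, program="./ring_c"):
    """Return one 2-process mpirun command line per unordered pair of hosts.

    `program` is a hypothetical path to a small MPI test binary; adjust it
    to wherever your test program actually lives.
    """
    commands = []
    for first, second in itertools.combinations(hosts, 2):
        # -np 2 starts two processes; --host restricts them to this pair.
        argv = ["mpirun", "-np", "2", "--host", f"{first},{second}", program]
        commands.append(shlex.join(argv))
    return commands


if __name__ == "__main__":
    # Dry run: print the commands so they can be inspected or pasted
    # into a shell one at a time.
    for command in pairwise_mpirun_commands(["A", "B", "C"]):
        print(command)
```

If one pair hangs while the others complete, that points at the connection between those two machines rather than at the MPI program itself.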

On Sep 25, 2012, at 9:43 AM, jody wrote:

> Hi Richard
>
> When a collective call hangs, this usually means that one (or more)
> processes did not reach that call.
> Are you sure that all processes reach the allreduce statement?
>
> If something like this happens to me, I insert print statements just
> before the MPI call so I can see which processes made
> it to that point and which ones did not.
>
> Hope this helps a bit
> Jody
>
> On Tue, Sep 25, 2012 at 8:20 AM, Richard <codemonkee_at_[hidden]> wrote:
>> I have 3 computers running the same Linux system. I have set up the MPI cluster
>> over ssh connections.
>> I have tested a very simple MPI program, and it works on the cluster.
>>
>> To make my story clear, I will call the three computers A, B and C.
>>
>> 1) If I run the job with 2 processes on A and B, it works.
>> 2) If I run the job with 3 processes on A, B and C, it is blocked.
>> 3) If I run the job with 2 processes on A and C, it works.
>> 4) If I run the job with all 3 processes on A, it works.
>>
>> Using gdb, I found the line at which it is blocked:
>>
>> #7 0x00002ad8a283043e in PMPI_Allreduce (sendbuf=0x7fff09c7c578,
>> recvbuf=0x7fff09c7c570, count=1, datatype=0x627180, op=0x627780,
>> comm=0x627380)
>> at pallreduce.c:105
>> 105 err = comm->c_coll.coll_allreduce(sendbuf, recvbuf, count,
>>
>> It seems that there is a communication problem between some of the computers, but
>> the above series of tests cannot tell me exactly what
>> it is. Can anyone help me? Thanks.
>>
>> Richard
>>
>>
>>
>>
>> _______________________________________________
>> users mailing list
>> users_at_[hidden]
>> http://www.open-mpi.org/mailman/listinfo.cgi/users

-- 
Jeff Squyres
jsquyres_at_[hidden]
For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/