Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] mpi job is blocked
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2012-09-25 05:54:53


Have you disabled firewalls on your nodes (e.g., iptables)?
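
Two quick checks can help narrow this down before touching the MPI code: which local interface (if any) owns the unfamiliar address from the error message, and whether a plain TCP connection to another node succeeds at all. The following is a minimal sketch, assuming a Linux/POSIX system. The function names iface_for_ipv4 and probe_tcp are hypothetical helpers, not part of Open MPI, and the port passed to probe_tcp is an assumption: it must be one where something is known to listen (for example sshd), since the TCP BTL picks its own ports.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <errno.h>
#include <ifaddrs.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: find the local interface that carries an IPv4 address.
 * Returns 1 and copies the interface name into name on a match,
 * 0 if no local interface has that address,
 * -1 on error (bad address string or getifaddrs failure). */
int iface_for_ipv4(const char *ipv4, char *name, size_t name_len)
{
    assert(ipv4 != NULL && name != NULL && name_len > 0);

    struct in_addr want;
    if (inet_pton(AF_INET, ipv4, &want) != 1)
        return -1;

    struct ifaddrs *list = NULL;
    if (getifaddrs(&list) != 0)
        return -1;

    int found = 0;
    for (struct ifaddrs *it = list; it != NULL; it = it->ifa_next) {
        /* Skip interfaces without an address and non-IPv4 entries. */
        if (it->ifa_addr == NULL || it->ifa_addr->sa_family != AF_INET)
            continue;
        const struct sockaddr_in *sin = (const struct sockaddr_in *)it->ifa_addr;
        if (sin->sin_addr.s_addr == want.s_addr) {
            snprintf(name, name_len, "%s", it->ifa_name);
            found = 1;
            break;
        }
    }
    freeifaddrs(list);
    return found;
}

/* Hypothetical helper: attempt a blocking TCP connect to ipv4:port.
 * Returns 0 on success, EINVAL for a malformed address, otherwise the
 * errno of the failing call (ECONNREFUSED is error 111 on Linux). */
int probe_tcp(const char *ipv4, unsigned short port)
{
    assert(ipv4 != NULL);

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons(port);
    if (inet_pton(AF_INET, ipv4, &addr.sin_addr) != 1)
        return EINVAL;

    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0)
        return errno;

    int rc = 0;
    if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) != 0)
        rc = errno;   /* saved before close() can overwrite errno */
    close(fd);
    return rc;
}
```

Calling iface_for_ipv4("192.168.122.1", buf, sizeof buf) on each node would show whether that address belongs to one of the machines. If it turns out to sit on a virtual bridge (libvirt's default network commonly puts 192.168.122.1 on virbr0), Open MPI's btl_tcp_if_include or btl_tcp_if_exclude MCA parameters can keep the TCP BTL off that interface. Note that probe_tcp blocks, so a firewall that silently drops packets may make it wait a long time before returning.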

On Sep 25, 2012, at 11:08 AM, Richard wrote:

> Sometimes, but not always, the following message appears when I run the ring program.
> I do not recognize the IP address 192.168.122.1; it is not in my list of hosts.
>
>
> [[53402,1],6][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect] connect() to 192.168.122.1 failed: Connection refused (111)
>
> At 2012-09-25 16:53:50,Richard <codemonkee_at_[hidden]> wrote:
>
> When I try the ring program, the first pass around the ring is fine, but the second pass blocks at some node.
> Here is the output:
>
> Process 0 sending 10 to 1, tag 201 (3 processes in ring)
> Process 0 sent to 1
> rank 1, message 10,start===========
> rank 1, message 10,end-------------
> rank 2, message 10,start===========
> Process 0 decremented value: 9
> rank 0, message 9,start===========
> rank 0, message 9,end-------------
> rank 2, message 10,end-------------
> rank 1, message 9,start===========
>
> I added some printf statements to ring_c.c as follows:
> 60 printf("rank %d, message %d,start===========\n", rank, message);
> 61 MPI_Send(&message, 1, MPI_INT, next, tag, MPI_COMM_WORLD);
> 62 printf("rank %d, message %d,end-------------\n", rank, message);
>
>
>
> At 2012-09-25 16:30:01,Richard <codemonkee_at_[hidden]> wrote:
> Hi Jody,
> Thanks for your suggestion; you are right. With the ring example, the same thing happens.
> I added a printf statement, and it seems that all three processes reach the line
> calling "PMPI_Allreduce". Any further suggestions?
>
> Thanks.
> Richard
>
>
>
> Message: 12
> Date: Tue, 25 Sep 2012 09:43:09 +0200
> From: jody <jody.xha_at_[hidden]>
> Subject: Re: [OMPI users] mpi job is blocked
> To: Open MPI Users <users_at_[hidden]>
> Message-ID: <CAKbzMGfL0tXDYU82HksoHrwh34CbpwbKmrKwC5DcDBT7A7wTxw_at_[hidden]>
> Content-Type: text/plain; charset=ISO-8859-1
>
> Hi Richard
>
> When a collective call hangs, this usually means that one (or more)
> processes did not reach this command.
> Are you sure that all processes reach the allreduce statement?
>
> If something like this happens to me, I insert print statements just
> before the MPI call so I can see which processes made
> it to this point and which ones did not.
>
> Hope this helps a bit
> Jody
>
> On Tue, Sep 25, 2012 at 8:20 AM, Richard <codemonkee_at_[hidden]> wrote:
> > I have 3 computers running the same Linux system. I have set up the MPI cluster
> > over ssh connections.
> > I have tested a very simple MPI program, and it works on the cluster.
> >
> > To make my story clear, I will call the three computers A, B and C.
> >
> > 1) If I run the job with 2 processes on A and B, it works.
> > 2) If I run the job with 3 processes on A, B and C, it is blocked.
> > 3) If I run the job with 2 processes on A and C, it works.
> > 4) If I run the job with all 3 processes on A, it works.
> >
> > Using gdb, I found the line where it blocks:
> >
> > #7 0x00002ad8a283043e in PMPI_Allreduce (sendbuf=0x7fff09c7c578,
> > recvbuf=0x7fff09c7c570, count=1, datatype=0x627180, op=0x627780,
> > comm=0x627380)
> > at pallreduce.c:105
> > 105 err = comm->c_coll.coll_allreduce(sendbuf, recvbuf, count,
> >
> > There seems to be a communication problem between some of the computers, but
> > the tests above do not tell me exactly what it is. Can anyone help me? Thanks.
> >
> > Richard
> >
> > _______________________________________________
> > users mailing list
> > users_at_[hidden]
> > http://www.open-mpi.org/mailman/listinfo.cgi/users

-- 
Jeff Squyres
jsquyres_at_[hidden]
For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/