I have three computers running the same Linux system, and I have set up an MPI cluster on them using SSH connections.
I tested a very simple MPI program on the cluster, and it works in some configurations but not in others.
To keep things clear, I will call the three computers A, B and C.
1) If I run the job with 2 processes on A and B, it works.
2) If I run the job with 3 processes on A, B and C, it hangs.
3) If I run the job with 2 processes on A and C, it works.
4) If I run the job with all 3 processes on A, it works.
Using gdb, I found the line where it hangs:
#7 0x00002ad8a283043e in PMPI_Allreduce (sendbuf=0x7fff09c7c578, recvbuf=0x7fff09c7c570, count=1, datatype=0x627180, op=0x627780, comm=0x627380)
105 err = comm->c_coll.coll_allreduce(sendbuf, recvbuf, count,
It seems that there is a communication problem between some of the computers, but the tests above do not tell me
exactly what it is. Can anyone help me? Thanks.