The system I use is a PS3 cluster with 16 PS3s and a PowerPC machine
as the head node, all connected by a high-speed switch.
The main loop contains point-to-point communication calls (MPI_Send
and MPI_Recv) with a message size of about 40 KB, and a lot of
computation that takes a long time (about 1 sec). The co-processor in
the PS3 handles the computation and the main processor handles the
point-to-point communication, so computation and communication can
overlap. The communication functions should return much faster than
the computation.
My problem is that after some iterations, the time spent in the
communication functions on a PS3 increases sharply, and the whole
cluster falls out of sync. When I decrease the computing time, the
problem disappears. I am very confused by this.
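
For scale, here is a rough estimate of how long one 40 KB message
should take on the wire. It is only a sketch: the 1 Gbit/s link speed
is an assumption (the post only says "high speed switch"), the helper
name transfer_seconds is made up for illustration, and latency and TCP
overhead are ignored.

```c
/* Ideal time, in seconds, to move `bytes` bytes over a link that
 * carries `bits_per_second` bits per second. Ignores latency,
 * protocol overhead and contention, so it is a lower bound only. */
double transfer_seconds(double bytes, double bits_per_second)
{
    return bytes * 8.0 / bits_per_second;
}

/* Figures taken from the description above. */
static const double MESSAGE_BYTES   = 40.0 * 1024.0; /* about 40 KB */
static const double COMPUTE_SECONDS = 1.0;           /* about 1 sec */

/* ASSUMPTION: 1 Gbit/s link; the real switch speed is not stated. */
static const double ASSUMED_LINK_BPS = 1e9;

/* Fraction of one compute phase that the ideal transfer would occupy. */
double transfer_fraction_of_compute(void)
{
    return transfer_seconds(MESSAGE_BYTES, ASSUMED_LINK_BPS) / COMPUTE_SECONDS;
}
```

Under that assumption a single message needs roughly a third of a
millisecond, which is a tiny fraction of the 1 sec compute phase. So
raw bandwidth alone would not explain communication calls that become
slow.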
I think some mechanism in Open MPI is causing this. Has anyone seen
this behavior before?
I use "mpirun --mca btl tcp,self -np 17 --hostfile ...". Is there
something I should add?