Thank you for your suggestion. I tried this solution, but
it doesn't work. In fact, the headnode doesn't participate in the computing or
communication; it only mallocs a large block of memory, and when the loop on every PS3
is over, the headnode gathers the data from every PS3.
The strange thing is that sometimes the program works
well, but after rebooting the system, without any change to the program, it doesn't
work, so I think there should be some mechanism in OpenMPI that can be configured to
let the program work well.
Try -np 16 and not running on the head node.
On Jul 6, 2009, at 7:08 PM, Zou, Lin (GE, Research, Consultant)
The system I use is a PS3 cluster
with 16 PS3s and a PowerPC as a headnode; they are connected by a high speed
There are point-to-point
communication functions (MPI_Send and MPI_Recv), the data size is about 40KB, and a lot
of computation which takes a long time (about 1 sec) in a loop. The co-processor in the PS3 takes care of the
computation and the main processor takes care of the point-to-point communication, so
the computing and communication can overlap. The communication functions
should return much faster than the computing.
My question is
that after some iterations, the time consumed by the communication
functions on a PS3 increases
heavily, and the whole cluster falls out of sync. When I decrease the
computing time, the problem disappears. I am very confused about
I think there
is a mechanism in OpenMPI that causes this; has anyone seen this situation before?
I use "mpirun --mca btl tcp,self -np 17 --hostfile
...", is there something I should add?
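
One way to narrow down where the communication calls start to slow down, as described above, might be to record the wall-clock time of each send/receive per iteration on every node and then look for the first iteration that is far above the early baseline. The following is only a minimal sketch of that bookkeeping in plain Python, not Open MPI code: the names timed and first_slow_iteration, the warmup of 5 iterations, and the factor of 10 are all assumptions made for illustration. In the real program the same timing could presumably be taken around MPI_Send and MPI_Recv with MPI_Wtime.

```python
import statistics
import time


def timed(call, *args):
    """Run call(*args) and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = call(*args)
    return result, time.perf_counter() - start


def first_slow_iteration(comm_times, warmup=5, factor=10.0):
    """Return the index of the first iteration after the warmup whose
    communication time exceeds factor * median(first `warmup` times).

    Returns None if there are too few samples or nothing exceeds the
    threshold. `warmup` and `factor` are arbitrary illustrative defaults.
    """
    if len(comm_times) <= warmup:
        return None
    baseline = statistics.median(comm_times[:warmup])
    for i in range(warmup, len(comm_times)):
        if comm_times[i] > factor * baseline:
            return i
    return None


if __name__ == "__main__":
    # Made-up timings in seconds, only to show the call; not measured data.
    sample = [0.002] * 8 + [0.9, 1.1]
    print(first_slow_iteration(sample))
```

Comparing the reported iteration across the PS3s could show whether one node slows down first and the others follow, or whether all of them degrade at the same time.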