I tried your suggestion of inserting MPI_Barrier every few iterations, but it didn't help; in fact, the program became even slower.
I want to try tracing the communication activity. Can you give me some more details about how to use mpitrace?
Thank you for your attention.
From: users-bounces_at_[hidden] [mailto:users-bounces_at_[hidden]] On Behalf Of Jeff Squyres
Sent: July 7, 2009 20:42
To: Open MPI Users
Subject: Re: [OMPI users] Configuration problem or network problem?
You might want to use a tracing library to see where exactly your synchronization issues are occurring. It may depend on the
communication pattern between your nodes and the timing between them.
Additionally, the performance characteristics of your network switch(es) may come into play here: are there retransmissions, timeouts, etc.?
It can sometimes be helpful to insert an MPI_BARRIER every few iterations just to keep all processes well-synchronized. It seems counter-intuitive, but sometimes waiting a short time in a barrier can increase overall throughput (rather than waiting progressively longer times in poorly-synchronized blocking communications).
On Jul 6, 2009, at 11:33 PM, Zou, Lin (GE, Research, Consultant) wrote:
> Thank you for your suggestion. I tried this solution, but it doesn't
> work. In fact, the headnode doesn't participate in the computation or
> communication; it only mallocs a large block of memory, and when the
> loop on every PS3 is over, the headnode gathers the data from every PS3.
> The strange thing is that sometimes the program works well, but
> after a reboot, without any change to the program, it doesn't
> work, so I think there should be some mechanism in Open MPI that can
> be configured to let the program work well.
> From: users-bounces_at_[hidden] [mailto:users-bounces_at_[hidden]]
> On Behalf Of Doug Reeder
> Sent: July 7, 2009 10:49
> To: Open MPI Users
> Subject: Re: [OMPI users] Configuration problem or network problem?
> Try -np 16 and not running on the head node.
> Doug Reeder
> On Jul 6, 2009, at 7:08 PM, Zou, Lin (GE, Research, Consultant) wrote:
>> Hi all,
>> The system I use is a PS3 cluster, with 16 PS3s and a PowerPC as
>> a headnode, connected by a high-speed switch.
>> The program uses point-to-point communication functions (MPI_Send and
>> MPI_Recv) with a data size of about 40 KB, plus a lot of computation
>> that takes a long time (about 1 sec) in each loop iteration. The
>> co-processor in the PS3 takes care of the computation and the main
>> processor takes care of the point-to-point communication, so the
>> computation and communication can overlap. The communication functions
>> should return much faster than the computing function.
>> My problem is that after some iterations, the time consumed by the
>> communication functions on a PS3 increases heavily, and the whole
>> cluster falls out of sync. When I decrease the computing time,
>> the problem disappears. I am very confused about this.
>> I think there is a mechanism in Open MPI that causes this. Has
>> anyone encountered this situation before?
>> I use "mpirun --mca btl tcp, self -np 17 --hostfile ...". Is there
>> something I should add?