Thanks for the answer.
I have some questions, because I am using some programs for profiling.
When you say that the cost of the allreduce will rise, do you mean only
the time, or also the flops of this operation? Is some additional work
added to the allreduce, or is it only about time? During profiling I
want to count the flops, so a small difference in timing because of
debug mode and the choice of allreduce algorithm is not a big deal, but
if the flops change as well then it is bad for me.
When I executed a program in debug mode I saw that Open MPI uses several
algorithms. I looked at your code and saw that rank 0 is not always
the root cpu (if I understood correctly). Finally, do you have an opinion
on the best way to find out which algorithm is used in a collective
communication and which cpu is the root of that communication?
> Today's Topics:
> 1. Re: using specific algorithm for collective communication,
> and knowing the root cpu? (George Bosilca)
> Message: 1
> Date: Tue, 3 Nov 2009 12:09:18 -0500
> From: George Bosilca <bosilca_at_[hidden]>
> Subject: Re: [OMPI users] using specific algorithm for collective
> communication, and knowing the root cpu?
> To: Open MPI Users <users_at_[hidden]>
> Message-ID: <E59919B2-42C1-49AF-803A-AB4450609A44_at_[hidden]>
> You can add the following MCA parameters either on the command line or
> in the $(HOME)/.openmpi/mca-params.conf file.
> On Nov 2, 2009, at 08:52 , George Markomanolis wrote:
>> Dear all,
>> I would like to ask about collective communication. With debug mode
>> enabled, I can see a lot of information during execution, such as
>> which algorithm is used. My question is how to use a specific
>> algorithm (the simplest, I suppose). I am profiling some applications
>> and I want to simulate them with another program, so I must be able
>> to know, for example, what mpi_allreduce is doing. I saw many
>> algorithms that depend on the message size and the number of
>> processors, so I would like to ask:
>> 1) How do I tell Open MPI to use a simple algorithm for
>> allreduce (is there a way to request the simplest algorithm for
>> all collective communications?). Basically I would like to know
>> the root cpu for every collective communication. What are the
>> disadvantages of requesting the simplest algorithm?
> coll_tuned_use_dynamic_rules=1 to allow you to manually set the
> algorithms to be used.
> coll_tuned_allreduce_algorithm=*something between 0 and 5* to describe
> the algorithm to be used. For the simplest algorithm I guess you will
> want to use 1 (star based fan-in fan-out).
> The main disadvantage is that the cost of the allreduce will rise,
> which will negatively impact the overall performance of the application.
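
A minimal sketch of the two ways to pass these settings, written as a small Python helper. The two parameter names and the 0 to 5 range come from the reply above; the helper function names, the `./app` program name and the process count are hypothetical and only there for illustration.

```python
def mca_settings(algorithm):
    """Return the two MCA parameters quoted above as a name -> value dict."""
    # The reply above gives the valid range as 0 to 5.
    if not 0 <= algorithm <= 5:
        raise ValueError("allreduce algorithm must be between 0 and 5")
    return {
        "coll_tuned_use_dynamic_rules": "1",
        "coll_tuned_allreduce_algorithm": str(algorithm),
    }


def as_command_line(settings, program, nprocs):
    """Build an mpirun argument list with one '--mca name value' per setting."""
    args = ["mpirun", "-np", str(nprocs)]
    for name, value in settings.items():
        args += ["--mca", name, value]
    args.append(program)
    return args


def as_conf_file(settings):
    """Render the same settings as 'name = value' lines for mca-params.conf."""
    return "".join(f"{name} = {value}\n" for name, value in settings.items())


if __name__ == "__main__":
    settings = mca_settings(1)  # 1 is the algorithm suggested above
    print(" ".join(as_command_line(settings, "./app", 4)))  # "./app" is a placeholder
    print(as_conf_file(settings), end="")
```

Either form should have the same effect; the file is convenient when every run of a profiling campaign has to use the same algorithm.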
>> 2) Is there any overhead because I installed Open MPI in debug
>> mode, even if I just run a program without any --mca flag?
> There is a lot of overhead because you compiled in debug mode. We do a
> lot of extra tracking of internally allocated memory, checks on most/all
> internal objects and so on. Based on previous results I would say your
> latency increases by about 2-3 microseconds, but the impact on the
> bandwidth is minimal.
>> 3) How could you describe allreduce in words? Can we say that the
>> root cpu does a reduce and then a broadcast? I mean, is that right for
>> your implementation? I saw that which cpu is the root depends on the
>> algorithm, so is it possible to use an algorithm for which I will
>> know every time that the cpu with rank 0 is the root?
> Exactly, allreduce = reduce + bcast (and btw this is what the
> basic algorithm will do). However, there is no root in an allreduce, as
> all processors execute symmetric work. Of course, if one sees the
> allreduce as a reduce followed by a broadcast, then one has to select a
> root (I guess we pick rank 0 in our implementation).
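
The "reduce + bcast" description above can be sketched as a toy model in plain Python. This is not Open MPI's code and does not use MPI at all: the function names and the per-rank operation counter are assumptions made for illustration. For comparison it also models recursive doubling, a textbook symmetric allreduce, to show that the number of reduction operations each rank performs can depend on the algorithm while the result stays the same.

```python
import operator


def allreduce_reduce_bcast(values, op=operator.add, root=0):
    """Model allreduce as a linear reduce to `root` followed by a broadcast.

    values[i] is the contribution of rank i. Returns (results, ops), where
    results[i] is the value rank i ends up with and ops[i] is the number of
    reduction operations rank i performed.
    """
    nprocs = len(values)
    ops = [0] * nprocs
    acc = values[root]
    # Fan-in: the root combines every other contribution itself.
    for rank in range(nprocs):
        if rank != root:
            acc = op(acc, values[rank])
            ops[root] += 1
    # Fan-out: every rank receives the root's result, no extra arithmetic.
    return [acc] * nprocs, ops


def allreduce_recursive_doubling(values, op=operator.add):
    """Model a symmetric allreduce; the rank count must be a power of two."""
    nprocs = len(values)
    if nprocs & (nprocs - 1):
        raise ValueError("this sketch needs a power-of-two number of ranks")
    current = list(values)
    ops = [0] * nprocs
    distance = 1
    while distance < nprocs:
        # Each rank exchanges with its partner and reduces locally.
        current = [op(current[r], current[r ^ distance]) for r in range(nprocs)]
        ops = [count + 1 for count in ops]
        distance *= 2
    return current, ops


if __name__ == "__main__":
    data = [1, 2, 3, 4]
    print(allreduce_reduce_bcast(data))        # all the arithmetic on rank 0
    print(allreduce_recursive_doubling(data))  # arithmetic spread over all ranks
```

In this model the reduce + bcast variant does all of its arithmetic on the chosen root, whereas the symmetric variant has every rank do some, so a per-process flop count taken during profiling would differ between the two even though each rank ends with the same value.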
>> Thanks a lot,