
Open MPI Development Mailing List Archives


Subject: [OMPI devel] RE : RE : RE : Implementation of MPI_Iprobe
From: Sébastien Boisvert (sebastien.boisvert.3_at_[hidden])
Date: 2011-09-28 13:21:43


> ________________________________________
> From: devel-bounces_at_[hidden] [devel-bounces_at_[hidden]] on behalf of Jeff Squyres [jsquyres_at_[hidden]]
> Sent: September 28, 2011 11:18
> To: Open MPI Developers
> Subject: Re: [OMPI devel] RE : RE : Implementation of MPI_Iprobe
>
> On Sep 28, 2011, at 10:04 AM, George Bosilca wrote:
>
>>> Why not use pre-posted non-blocking receives and MPI_WAIT_ANY?
>>
>> That's not very scalable either… Might work for 256 processes, but that's about it.
>
> Just get a machine with oodles of RAM and you'll be fine.
>
> ;-)

Hello,

Each of my 256 cores has 3 GB of memory, thus my computation has 768 GB of distributed memory.

So memory is not a problem at all.

I only see starvation in the slave mode RAY_SLAVE_MODE_EXTENSION in Ray. And when there is starvation, the memory usage is only
~1.6 GB per core.

Today, I implemented some profiling in my code to check where the granularity is too large in processData(), which calls call_RAY_SLAVE_MODE_EXTENSION().

I consider anything at or above 128 microseconds to be too long for my computation.
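For reference, the check is roughly the following (a minimal sketch of what I added, not Ray's actual profiler, which also records the file/line call stack shown below; the wrapper name and the function-pointer signature are made up):

#include <sys/time.h>
#include <stdint.h>
#include <cstdio>

/** microsecond timestamp with the same resolution as the numbers printed below */
static uint64_t getMicroseconds(){
        struct timeval tv;
        gettimeofday(&tv,NULL);
        return (uint64_t)tv.tv_sec*1000000+(uint64_t)tv.tv_usec;
}

/** hypothetical wrapper: time one slave-mode call and warn when it is too long */
void timeSlaveCall(void (*slaveMethod)(),int slaveMode){
        uint64_t start=getMicroseconds();
        slaveMethod();
        uint64_t elapsed=getMicroseconds()-start;
        if(elapsed>=128)
                printf("Warning, SlaveMode= %i GranularityInMicroseconds= %i\n",
                        slaveMode,(int)elapsed);
}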

This is what I found so far:

[1,3]<stdout>:Warning, SlaveMode= RAY_SLAVE_MODE_EXTENSION GranularityInMicroseconds= 16106
[1,3]<stdout>:Number of calls in the stack: 20
[1,3]<stdout>:0 1317227196433984 microseconds +0 from previous (0.00%) in extendSeeds inside code/assembler/SeedExtender.cpp at line 47
[1,3]<stdout>:1 1317227196433985 microseconds +1 from previous (0.01%) in extendSeeds inside code/assembler/SeedExtender.cpp at line 72
[1,3]<stdout>:2 1317227196433985 microseconds +0 from previous (0.00%) in extendSeeds inside code/assembler/SeedExtender.cpp at line 144
[1,3]<stdout>:3 1317227196433985 microseconds +0 from previous (0.00%) in extendSeeds inside code/assembler/SeedExtender.cpp at line 221
[1,3]<stdout>:4 1317227196433985 microseconds +0 from previous (0.00%) in doChoice inside code/assembler/SeedExtender.cpp at line 351
[1,3]<stdout>:5 1317227196433985 microseconds +0 from previous (0.00%) in doChoice inside code/assembler/SeedExtender.cpp at line 389
[1,3]<stdout>:6 1317227196433986 microseconds +1 from previous (0.01%) in doChoice inside code/assembler/SeedExtender.cpp at line 441
[1,3]<stdout>:7 1317227196433986 microseconds +0 from previous (0.00%) in doChoice inside code/assembler/SeedExtender.cpp at line 775
[1,3]<stdout>:8 1317227196433987 microseconds +1 from previous (0.01%) in storeExtensionAndGetNextOne inside code/assembler/SeedExtender.cpp at line 934
[1,3]<stdout>:9 1317227196433988 microseconds +1 from previous (0.01%) in storeExtensionAndGetNextOne inside code/assembler/SeedExtender.cpp at line 960
[1,3]<stdout>:10 1317227196442360 microseconds +8372 from previous (51.98%) in storeExtensionAndGetNextOne inside code/assembler/SeedExtender.cpp at line 989
[1,3]<stdout>:11 1317227196442651 microseconds +291 from previous (1.81%) in storeExtensionAndGetNextOne inside code/assembler/SeedExtender.cpp at line 993
[1,3]<stdout>:12 1317227196442654 microseconds +3 from previous (0.02%) in storeExtensionAndGetNextOne inside code/assembler/SeedExtender.cpp at line 1002
[1,3]<stdout>:13 1317227196442655 microseconds +1 from previous (0.01%) in resetStructures inside code/assembler/ExtensionData.cpp at line 72
[1,3]<stdout>:14 1317227196442656 microseconds +1 from previous (0.01%) in resetStructures inside code/assembler/ExtensionData.cpp at line 76
[1,3]<stdout>:15 1317227196447138 microseconds +4482 from previous (27.83%) in resetStructures inside code/assembler/ExtensionData.cpp at line 80
[1,3]<stdout>:16 1317227196450084 microseconds +2946 from previous (18.29%) in doChoice inside code/assembler/SeedExtender.cpp at line 883
[1,3]<stdout>:17 1317227196450087 microseconds +3 from previous (0.02%) in doChoice inside code/assembler/SeedExtender.cpp at line 886
[1,3]<stdout>:18 1317227196450087 microseconds +0 from previous (0.00%) in doChoice inside code/assembler/SeedExtender.cpp at line 888
[1,3]<stdout>:19 1317227196450089 microseconds +2 from previous (0.01%) in extendSeeds inside code/assembler/SeedExtender.cpp at line 229
[1,3]<stdout>:End of stack

So the problem is definitely not in Open MPI, but doing a round-robin MPI_Iprobe (rotating the source given to MPI_Iprobe at each call) still helps a lot when
the granularity exceeds 128 microseconds.
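The rotation itself is simple (a minimal sketch of the idea, not the exact code in Ray):

#include <mpi.h>

/** probe one source per call, advancing the probed source in round-robin order,
 *  instead of always handing MPI_ANY_SOURCE to MPI_Iprobe */
int probeNextSource(MPI_Comm comm,int size,int*currentSource,int*flag,MPI_Status*status){
        *currentSource=(*currentSource+1)%size;
        return MPI_Iprobe(*currentSource,MPI_ANY_TAG,comm,flag,status);
}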

That said, I do think that George's patch (with my minor modification) would provide an MPI_Iprobe that is fair to all drained messages (the round-robin behavior).

But even with the patch, nothing changes for my problem with MPI_ANY_SOURCE.

>
> I actually was thinking of his specific 256-process case. I agree that it doesn't scale arbitrarily.
>

I think it could scale arbitrarily with Open MPI ;) (and with any MPI implementation respecting MPI 2.x, for that matter).

I just need to get my granularity below 128 microseconds for all the calls in RAY_SLAVE_MODE_EXTENSION
(which is Machine::call_RAY_SLAVE_MODE_EXTENSION() in my code).

> Another approach would potentially be to break your 256 processes up into N sub-communicators of M each (where N * M = 256, obviously), and doing a doing a non-blocking receive with ANY_SOURCE and then a WAIT_ANY on all of those.
>

I am not sure that would work in my code, as my architecture looks like this:

while(running){
    receiveMessages(); // blazing fast, receives 0 or 1 message, never more; other messages wait for the next iteration!
    processMessages(); // consumes the one received message, if any; also very fast because it is done with an array mapping tags to function pointers (sketched below)
    processData(); // should be fast, but apparently call_RAY_SLAVE_MODE_EXTENSION is slowish sometimes...
    sendMessages(); // fast, sends at most 17 messages; in most cases it is either 0 or 1 message...
}
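The tag dispatch in processMessages() looks roughly like this (a simplified sketch; the member names m_inbox, m_tag_methods and getTag() are hypothetical, not the exact Ray code):

/** consume the single message received this iteration, if any;
 *  the MPI tag indexes directly into an array of member-function pointers */
void Machine::processMessages(){
        if(m_inbox.size()==0)
                return;
        Message*message=&(m_inbox[0]);
        MachineMethod handler=m_tag_methods[message->getTag()];
        (this->*handler)();
        m_inbox.clear();
}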

If I *understand* what you said correctly, doing a WAIT_ANY inside Ray's receiveMessages() would hang and/or significantly lower the speed of the loop, which is not desirable.

I like to have my loop at ~200000 iterations / 100 milliseconds. This yields a very responsive system -- everyone responds within 128 microseconds with my round-robin scheme.
The response time is 10 microseconds on guillimin.clumeq.ca and 100 (it used to be 250) on colosse.clumeq.ca if I use MPI_ANY_SOURCE
(as reported on the list, see http://www.open-mpi.org/community/lists/users/2011/09/17321.php),
but things get complicated in RAY_SLAVE_MODE_EXTENSION because of the granularity problem.

> The code gets a bit more complex, but it hypothetically extends your scalability.
>
> Or better yet, have your job mimic this idea -- a tree-based gathering system. Have not just 1 master, but N sub-masters. Individual compute processes report up to their sub-master, and the sub-master does whatever combinatorial work it can before reporting it to the ultimate master, etc.

Ray does have a MASTER_RANK, which is 0. But all the ranks, including 0, are slave ranks too.

In processData():

/** process data by calling the current master and slave methods */
void Machine::processData(){
        MachineMethod masterMethod=m_master_methods[m_master_mode];
        (this->*masterMethod)();

        MachineMethod slaveMethod=m_slave_methods[m_slave_mode];
        (this->*slaveMethod)();
}

Obviously, m_master_mode is always RAY_MASTER_MODE_DO_NOTHING for any rank that is not MASTER_RANK, which is quite simple to implement:

void Machine::call_RAY_MASTER_MODE_DO_NOTHING(){}

So, although I understand that the tree-based gathering system you describe would act as a sort of virtual network (like routing packets on the Internet), I don't think it would help here
because the computation granularity in call_RAY_SLAVE_MODE_EXTENSION() is above 128 microseconds anyway (I only discovered that today, my bad).

>
> It depends on your code and how much delegation is possible, how much data you're transferring over the network, how much fairness you want to guarantee, etc. My point is that there are a bunch of
> different options you can pursue outside of the "everyone sends to 1 master" model.
>

My communication model is more distributed than "everyone sends to 1 master".

My model is "everyone sends to everyone in a respectful way".

When I say "respectful way", I mean that rank A waits for the reply to its first message from rank B before sending anything else to rank B.

Because of that,

- Open MPI's buffers are happy,
- memory usage is happy, and
- byte transfer layers are not saturated at all and thus are happy too.

And destinations are mostly random because of my hash-based domain decomposition of genomic/biological data.
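In code, the "respectful way" rule is just per-destination bookkeeping of a single outstanding request, roughly like this (a minimal sketch with hypothetical names, not Ray's actual code):

#include <vector>

/** allow at most one request in flight per destination rank */
class OutstandingRequests{
        std::vector<bool> m_busy; // m_busy[destination] is true while a reply is pending
public:
        OutstandingRequests(int size):m_busy(size,false){}
        bool canSendTo(int destination){return !m_busy[destination];}
        void markSent(int destination){m_busy[destination]=true;}      // after sending a request to destination
        void markReplied(int destination){m_busy[destination]=false;}  // when the reply from destination arrives
};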

I will thus improve my granularity, but I nonetheless agree that George's patch should be merged into Open MPI's trunk, as fairness is always desirable in networking algorithms.

Thanks a lot !

Sébastien Boisvert
PhD student
http://boisvert.info

> --
> Jeff Squyres
> jsquyres_at_[hidden]
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
>
>
> _______________________________________________
> devel mailing list
> devel_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>