Dear MPI people,
                               I am working on a graph partitioning problem, where we have an undirected graph of p MPI processes. The edges have weights that show how much communication processes do among themselves. The cluster has multiple nodes (each node with 8 cores) and nodes are connected through Infiniband fast network. The task is the partition the graph in k partitions where k is the number of nodes such that the edge cut is minimized (across partitions/nodes edges or communication). I wrote a distributed algorithm for that. I will compare its quality with the algorithms from graph community, but right now I am writing the algorithm for MPI applications. I have seen that with 10% sparse undirected graph, I got upto 80% reduction in edge cut (for k=2).

I checked the algorithm very deeply and I have seen that the optimal partitions are residing on different nodes (I saw the IPs). Based on the NetPipe benchmark, I wrote a test application where I randomly generate undirected graphs. I run that application before doing the reduction of edge cut and after reduction of edge cut. The processes that communicate more come to the same nodes. In the algorithm, each process gets a new rank based on communication requirement. The problem I am getting is that the overall execution time of the test application should be less than the application that runs without performing edge cut reduction but it does not happen. Here is the code for test application (self explanatory) and the output of the program. Please help me diagnosing the logical bug. We can discuss more.

MPI_Barrier(comm);
   
    t0 = MPI_Wtime();
   
    for(int j=0;j<10;j++)
    {
     
    for(int i=0; i<comm_size; i++)
    {
    int target = comminfo[i].rank;
    int comm_amount = comminfo[i].comm;
    if(comm_amount > 0)
    {
        buff = new Node[MAX_COMM*NTRIALS];
        MPI_Irecv (buff, MAX_COMM * NTRIALS * sizeof(Node)/sizeof(char), MPI_CHAR, MPI_ANY_SOURCE, 4600, comm, &recv_requests[i]);
        recv_buffers.insert(recv_buffers.end(), buff);
    }
    }

    for(int i=0; i<comm_size; i++)
    {
    int target = comminfo[i].rank;
    int comm_amount = comminfo[i].comm;
    if(comm_amount > 0)
    {
        buff = new Node[comm_amount/2*NTRIALS];
        MPI_Isend((void*)buff, comm_amount/2 * NTRIALS * sizeof(Node)/sizeof(char), MPI_CHAR, target, 4600, comm, &send_requests[i]);
        send_buffers.insert(send_buffers.end(), buff);
    }
    }
   
    MPI_Waitall(comm_size, send_requests, send_status);
   
    MPI_Waitall(comm_size, recv_requests, recv_status);
   
    }
   
    t1 = MPI_Wtime();
   
    MPI_Allreduce(&t0, &min_t0, 1, MPI_DOUBLE, MPI_MIN, comm);
    MPI_Allreduce(&t1, &max_t1, 1, MPI_DOUBLE, MPI_MAX, comm);
   
    diff = (max_t1 - min_t0) / NTRIALS;

  retirm diff; // this value will be printed in main function....


.................. the output of the program is



mpprun INFO: Starting openmpi run on 2 nodes (16 ranks)...

P0 >> Without balancing: Execution took: 0.0081014822 secs

Edge_Cut_Before: 46864

Edge_Cut_After: 20097

Balancing took: 1.0993268490 secs

P0 >> After balancing: Execution took: 0.0095374639 secs



Please help me.

best regards,

Mudassar