Hi
I've got a really strange problem:
I've got an application which creates intercommunicators between a
master and some workers.
When I run it on our cluster with 11 processes, it works;
when I run it with 12 processes, it hangs inside MPI_Intercomm_create().
This is the hostfile:
squid_0.uzh.ch slots=3 max-slots=3
squid_1.uzh.ch slots=2 max-slots=2
squid_2.uzh.ch slots=1 max-slots=1
squid_3.uzh.ch slots=1 max-slots=1
triops.uzh.ch slots=8 max-slots=8
Actually, all squid_X nodes have 4 cores, but with the settings above I
managed to reduce the number of processes needed to trigger the failure.
So with all available squid cores and 3 triops cores it works,
but with 4 triops cores it hangs.
On the other hand, if I use all 16 squid cores (but no triops cores),
it works, too.
If I start the application from another workstation instead of triops,
I see a similar pattern of MPI_Intercomm_create() failures.
Note that with the above hostfile, a simple HelloMPI also works with 14
or more processes.
The frustrating thing is that this exact same code has worked before!
Does anybody have an explanation?
Thank you
I managed to simplify the application:
#include <stdio.h>
#include "mpi.h"

int main(int iArgC, char *apArgV[]) {
    int iResult = 0;
    int iNumProcs = 0;
    int iID = -1;

    MPI_Init(&iArgC, &apArgV);
    MPI_Comm_size(MPI_COMM_WORLD, &iNumProcs);
    MPI_Comm_rank(MPI_COMM_WORLD, &iID);

    /* iKey is used as the split color: 0 for the master, 1 for the workers */
    int iKey;
    if (iID == 0) {
        iKey = 0;
    } else {
        iKey = 1;
    }

    MPI_Comm commInter1;
    MPI_Comm commInter2;
    MPI_Comm commIntra;
    /* the world rank is the ordering key, so each group keeps world order */
    MPI_Comm_split(MPI_COMM_WORLD, iKey, iID, &commIntra);

    int iRankM;
    MPI_Comm_rank(commIntra, &iRankM);
    printf("Local rank: %d\n", iRankM);

    switch (iKey) {
    case 0:
        /* master side: remote leader is world rank 1 (first worker) */
        printf("Creating intercomm 1 for Master (%d)\n", iID);
        MPI_Intercomm_create(commIntra, 0, MPI_COMM_WORLD, 1, 01, &commInter2);
        break;
    case 1:
        /* worker side: remote leader is world rank 0 (the master) */
        printf("Creating intercomm 1 for FH (%d)\n", iID);
        MPI_Intercomm_create(commIntra, 0, MPI_COMM_WORLD, 0, 01, &commInter1);
    }

    printf("finalizing\n");
    MPI_Finalize();
    printf("exiting with %d\n", iResult);
    return iResult;
}