
Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] MPI_Intercomm_create hangs
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2012-01-28 08:19:48


Strange -- this almost implies a race condition somewhere.

I don't see anything wrong with your application (other than that it doesn't free the communicators, but that's not an error).
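
(For completeness, freeing them would look something like the sketch below -- just a pattern, using the communicator names from your quoted program; each side frees the intercomm it created, plus the split intracomm, before MPI_Finalize:)

```c
/* Sketch only: freeing the communicators from the quoted program.
   Each rank frees the intercomm handle it actually filled in,
   then the intracomm from MPI_Comm_split, before finalizing. */
if (iKey == 0) {
    MPI_Comm_free(&commInter2);   /* master side */
} else {
    MPI_Comm_free(&commInter1);   /* worker side */
}
MPI_Comm_free(&commIntra);
MPI_Finalize();
```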

Edgar -- the intercomm code is yours. Could you have a look?

On Jan 23, 2012, at 11:03 AM, jody wrote:

> Hi
> I've got a really strange problem:
>
> I've got an application which creates intercommunicators between a
> master and some workers.
>
> When I run it on our cluster with 11 processes, it works;
> when I run it with 12 processes, it hangs inside MPI_Intercomm_create().
>
> This is the hostfile:
> squid_0.uzh.ch slots=3 max-slots=3
> squid_1.uzh.ch slots=2 max-slots=2
> squid_2.uzh.ch slots=1 max-slots=1
> squid_3.uzh.ch slots=1 max-slots=1
> triops.uzh.ch slots=8 max-slots=8
>
> Actually all squid_X have 4 cores, but I managed to reduce the number of
> processes needed for failure by making the above settings.
>
> So with all available squid cores and 3 triops cores it works,
> but with 4 triops cores it hangs.
>
> On the other hand, if I use all 16 squid cores (but no triops cores)
> it works, too.
>
> If I start the application not from triops, but from another workstation,
> I get a similar pattern of Intercomm_create failures.
>
> Note that with the above hostfile, a simple HelloMPI also works with 14
> or more processes.
>
> The frustrating thing is that this exact same code has worked before!
>
> Does anybody have an explanation?
> Thank You
>
> I managed to simplify the application:
>
> #include <stdio.h>
> #include "mpi.h"
>
> int main(int iArgC, char *apArgV[]) {
>     int iResult = 0;
>     int iNumProcs = 0;
>     int iID = -1;
>
>     MPI_Init(&iArgC, &apArgV);
>
>     MPI_Comm_size(MPI_COMM_WORLD, &iNumProcs);
>     MPI_Comm_rank(MPI_COMM_WORLD, &iID);
>
>     int iKey;
>     if (iID == 0) {
>         iKey = 0;
>     } else {
>         iKey = 1;
>     }
>
>     MPI_Comm commInter1;
>     MPI_Comm commInter2;
>     MPI_Comm commIntra;
>
>     MPI_Comm_split(MPI_COMM_WORLD, iKey, iID, &commIntra);
>
>     int iRankM;
>     MPI_Comm_rank(commIntra, &iRankM);
>     printf("Local rank: %d\n", iRankM);
>
>     switch (iKey) {
>     case 0:
>         printf("Creating intercomm 1 for Master (%d)\n", iID);
>         MPI_Intercomm_create(commIntra, 0, MPI_COMM_WORLD, 1, 1, &commInter2);
>         break;
>     case 1:
>         printf("Creating intercomm 1 for FH (%d)\n", iID);
>         MPI_Intercomm_create(commIntra, 0, MPI_COMM_WORLD, 0, 1, &commInter1);
>     }
>
>     printf("finalizing\n");
>     MPI_Finalize();
>
>     printf("exiting with %d\n", iResult);
>     return iResult;
> }
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users

-- 
Jeff Squyres
jsquyres_at_[hidden]
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/