
Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] how to make a process start and then join a MPI group
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2008-07-28 15:44:55


Ok, good.

One thing to check is that you have passed different values for the
"high" argument to MPI_Intercomm_merge in the parent group and the
children group.

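For example, using the names from your snippets ("kid" from
MPI_Comm_spawn on the parent side, "parent" from MPI_Comm_get_parent
on the child side), a minimal sketch:

  /* parent side: high = 0, so the parent's ranks come first
     in the merged intracommunicator */
  MPI_Comm merged;
  MPI_Intercomm_merge(kid, 0, &merged);

  /* child side: high = 1, so the children rank after the parent */
  MPI_Comm merged;
  MPI_Intercomm_merge(parent, 1, &merged);

Keep in mind that the merge is collective over both groups, so every
parent and child process has to make the call.
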
On Jul 28, 2008, at 3:42 PM, Mark Borgerding wrote:

> I should've been clearer. I have observed the same behavior under
> both of those versions.
> I was not using the two versions in the same cluster.
>
> -- Mark
>
>
> Jeff Squyres wrote:
>> Are you mixing both v1.2.4 and v1.2.5 in a single MPI job? That
>> may have unintended side-effects -- we unfortunately do not
>> guarantee binary compatibility between any of our releases.
>>
>>
>> On Jul 28, 2008, at 10:16 AM, Mark Borgerding wrote:
>>
>>> I am using version 1.2.4 (Fedora 9) and 1.2.5 (CentOS 5.2)
>>>
>>>
>>> A little clarification:
>>> The children do not actually wake up when the parent *sends* data
>>> to them, but only after the parent tries to receive data from the
>>> merged intercomm.
>>>
>>>
>>> Here is the timeline:
>>>
>>> ...
>>> parent's call to MPI_Comm_spawn returns
>>> parent calls MPI_Intercomm_merge
>>> children's calls to MPI_Init return
>>> children call MPI_Intercomm_merge
>>> parent MPI_Intercomm_merge returns
>>> (long pause inserted via parent sleep)
>>> parent sends data to kid 1
>>> (long pause inserted via parent sleep)
>>> parent starts to receive data from kid 1
>>> all children's calls to MPI_Intercomm_merge return
>>>
>>>
>>> -- Mark
>>>
>>> Aurélien Bouteiller wrote:
>>>> Ok, I'll check to see what happens. Which version of Open MPI are
>>>> you using?
>>>>
>>>> Aurelien
>>>>
>>>> On Jul 27, 2008, at 11:13 PM, Mark Borgerding wrote:
>>>>
>>>>> I got something working, but I'm not 100% sure why.
>>>>>
>>>>> The children woke up and returned from their calls to
>>>>> MPI_Intercomm_merge only after
>>>>> the parent used the intercomm to send some data to the children
>>>>> via MPI_Send.
>>>>>
>>>>>
>>>>>
>>>>> Mark Borgerding wrote:
>>>>>> Perhaps I am doing something wrong. The children's calls to
>>>>>> MPI_Intercomm_merge never return.
>>>>>>
>>>>>> Here's the chronology (with 2 children):
>>>>>>
>>>>>> parent calls MPI_Init
>>>>>> parent calls MPI_Comm_spawn
>>>>>> child calls MPI_Init
>>>>>> child calls MPI_Init
>>>>>> parent's call to MPI_Comm_spawn returns
>>>>>> (long pause inserted)
>>>>>> parent calls MPI_Intercomm_merge
>>>>>> child MPI_Init returns
>>>>>> child calls MPI_Intercomm_merge
>>>>>> child MPI_Init returns
>>>>>> child calls MPI_Intercomm_merge
>>>>>> parent MPI_Intercomm_merge returns
>>>>>> ... but the child processes never return from the
>>>>>> MPI_Intercomm_merge function.
>>>>>>
>>>>>>
>>>>>> Here are some code snippets:
>>>>>>
>>>>>> ############# parent:
>>>>>>
>>>>>> MPI_Init(NULL,NULL);
>>>>>>
>>>>>> const int nkids=2;
>>>>>> int errs[nkids];
>>>>>> MPI_Comm kid;
>>>>>> cerr << "parent calls MPI_Comm_spawn" << endl;
>>>>>> CHECK_MPI_CODE( MPI_Comm_spawn("test_mpi", NULL, nkids,
>>>>>> MPI_INFO_NULL, 0, MPI_COMM_WORLD, &kid, errs) );
>>>>>> cerr << "parent call to MPI_Comm_spawn returns" << endl;
>>>>>> for (int k=0;k<nkids;++k)
>>>>>> CHECK_MPI_CODE( errs[k] );
>>>>>>
>>>>>> MPI_Comm allmpi;
>>>>>> cerr << "(long pause)" << endl;
>>>>>> sleep(3);
>>>>>> cerr << "parent calls MPI_Intercomm_merge\n";
>>>>>> CHECK_MPI_CODE( MPI_Intercomm_merge( kid, 0, &allmpi) );
>>>>>> cerr << "parent MPI_Intercomm_merge returns\n";
>>>>>>
>>>>>> ############### child:
>>>>>>
>>>>>> fprintf(stderr,"child calls MPI_Init \n");
>>>>>> CHECK_MPI_CODE( MPI_Init(NULL,NULL) );
>>>>>> fprintf(stderr,"child MPI_Init returns\n");
>>>>>>
>>>>>> MPI_Comm parent;
>>>>>> CHECK_MPI_CODE( MPI_Comm_get_parent(&parent) );
>>>>>>
>>>>>> fprintf(stderr,"child calls MPI_Intercomm_merge \n");
>>>>>> MPI_Comm allmpi;
>>>>>> CHECK_MPI_CODE( MPI_Intercomm_merge( parent, 1, &allmpi) );
>>>>>> fprintf(stderr,"child call to MPI_Intercomm_merge returns\n");
>>>>>> (the above line never gets executed)
>>>>>>
>>>>>>
>>>>>>
>>>>>> Aurélien Bouteiller wrote:
>>>>>>> MPI_Intercomm_merge is what you are looking for.
>>>>>>>
>>>>>>> Aurelien
>>>>>>> On Jul 26, 2008, at 1:23 PM, Mark Borgerding wrote:
>>>>>>>
>>>>>>>> Okay, so I've gotten a little bit closer.
>>>>>>>>
>>>>>>>> I'm using MPI_Comm_spawn to start several child
>>>>>>>> processes. The problem is that the children are in their own
>>>>>>>> group, separate from the parent (just like the
>>>>>>>> documentation says). I want to merge the children's group
>>>>>>>> with the parent group so I can efficiently Send/Recv data
>>>>>>>> between them.
>>>>>>>>
>>>>>>>> Is this possible?
>>>>>>>>
>>>>>>>> Plan B: I guess if there is no elegant way to merge all those
>>>>>>>> processes into one group, I can connect sockets and make
>>>>>>>> intercomms to talk from the parent directly to each child.
>>>>>>>>
>>>>>>>> -- Mark
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Mark Borgerding wrote:
>>>>>>>>> I am writing a code module that plugs into a larger
>>>>>>>>> application framework. That framework loads my code module
>>>>>>>>> as a shared object.
>>>>>>>>> So I do not control how the first process gets started, but
>>>>>>>>> I still want it to be able to start and participate in an
>>>>>>>>> MPI group.
>>>>>>>>>
>>>>>>>>> Here's roughly what I want to happen (I think):
>>>>>>>>>
>>>>>>>>> framework app running (not under my control)
>>>>>>>>> -> framework loads mycode.so shared object into its process
>>>>>>>>> -> mycode.so starts mpi programs on several hosts
>>>>>>>>> (e.g. via system call to mpiexec )
>>>>>>>>> -> initial mycode.so process participates in the
>>>>>>>>> group it just started (e.g. it shows up in MPI_Comm_group,
>>>>>>>>> can use MPI_Send, MPI_Recv, etc.)
>>>>>>>>>
>>>>>>>>> Can this be done?
>>>>>>>>> I am running under CentOS 5.2
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Mark
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> * Dr. Aurélien Bouteiller
>>>> * Sr. Research Associate at Innovative Computing Laboratory
>>>> * University of Tennessee
>>>> * 1122 Volunteer Boulevard, suite 350
>>>> * Knoxville, TN 37996
>>>> * 865 974 6321
>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> Mark Borgerding
>>> 3dB Labs, Inc
>>> Innovate. Develop. Deliver.
>>>
>>
>>
>
>
> --
> Mark Borgerding
> 3dB Labs, Inc
> Innovate. Develop. Deliver.
>

-- 
Jeff Squyres
Cisco Systems