
Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] How to specify hosts for MPI_Comm_spawn
From: Matt Hughes (matt.c.hughes+ompi_at_[hidden])
Date: 2008-07-29 17:01:18


I've found that I always have to use mpirun to start my spawner
process, due to the exact problem you are having: the need to give
OMPI a hosts file! The singleton functionality seems to be lacking
somehow... it won't let you spawn on arbitrary hosts. I have not
tested whether this is fixed in the 1.3 series.

Try
mpiexec -np 1 -H op2-1,op2-2 spawner op2-2

mpiexec should start the first process on op2-1, and the spawn call
should start the second on op2-2. If you don't use the Info object to
set the hostname specifically, then on 1.2.x it will automatically
start on op2-2. With 1.3, the spawn call will start processes
starting with the first item in the host list.
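For reference, here is a minimal sketch of the Info-based placement mentioned
above. The host name "op2-2" is taken from this thread and "./child" is a
placeholder binary; substitute your own. This is just an illustration of the
"host" Info key, not a tested program:

```c
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Comm children;
    MPI_Info info;
    int errs[2];

    MPI_Init(&argc, &argv);

    /* Pin the spawned children to a specific host via the "host" Info key. */
    MPI_Info_create(&info);
    MPI_Info_set(info, "host", "op2-2");

    /* Spawn two copies of ./child; rank 0 of MPI_COMM_SELF is the root. */
    MPI_Comm_spawn("./child", MPI_ARGV_NULL, 2, info, 0,
                   MPI_COMM_SELF, &children, errs);

    MPI_Info_free(&info);
    MPI_Finalize();
    return 0;
}
```

Note that on 1.2.x this only works if op2-2 is already part of the
allocation, i.e. you launched the spawner with mpiexec and a -H or
-hostfile argument that includes op2-2.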

mch

2008/7/29 Mark Borgerding <markb_at_[hidden]>:
> Yes. The host names are listed in the host file.
> e.g.
> "op2-1 slots=8"
> and there is an IP address for op2-1 in the /etc/hosts file.
> I've read the FAQ. Everything in there seems to assume I am starting the
> process group with mpirun or one of its brothers. This is not the case.
>
> I've created and attached a sample source file that demonstrates my problem.
> It participates in an MPI group in one of two ways: either from mpiexec or
> via MPI_Comm_spawn.
>
> Case 1 works: I can run it on the remote node op2-1 by using mpiexec
> mpiexec -np 3 -H op2-1 spawner
>
> Case 2 works: I can run it on the current host with MPI_Comm_spawn
> ./spawner `hostname`
>
> Case 3 does not work: I cannot use MPI_Comm_spawn to start a group on a
> remote node.
> ./spawner op2-1
>
> The output from case 3 is:
> <QUOTE>
> I am going to spawn 2 children on op2-1
> --------------------------------------------------------------------------
> Some of the requested hosts are not included in the current allocation for
> the
> application:
> ./spawner
> The requested hosts were:
> op2-1
>
> Verify that you have mapped the allocated resources properly using the
> --host specification.
> --------------------------------------------------------------------------
> [gardner:32745] [0,0,0] ORTE_ERROR_LOG: Out of resource in file
> base/rmaps_base_support_fns.c at line 225
> [gardner:32745] [0,0,0] ORTE_ERROR_LOG: Out of resource in file rmaps_rr.c
> at line 478
> [gardner:32745] [0,0,0] ORTE_ERROR_LOG: Out of resource in file
> base/rmaps_base_map_job.c at line 210
> [gardner:32745] [0,0,0] ORTE_ERROR_LOG: Out of resource in file rmgr_urm.c
> at line 372
> [gardner:32745] [0,0,0] ORTE_ERROR_LOG: Out of resource in file
> communicator/comm_dyn.c at line 608
>
> </QUOTE>
>
> Ralph Castain wrote:
>>
>> OMPI doesn't care what your hosts are named - many of us use names that
>> have no numeric pattern or any other discernible pattern to them.
>>
>> OMPI_MCA_rds_hostfile should point to a file that contains a list of the
>> hosts - have you ensured that it does, and that the hostfile format is
>> correct? Check the FAQ on the open-mpi.org site:
>>
>> http://www.open-mpi.org/faq/?category=running#simple-spmd-run
>>
>> There are several explanations there pertaining to hostfiles.
>>
>>
>> On Jul 29, 2008, at 11:57 AM, Mark Borgerding wrote:
>>
>>> I listed the node names in the file named by ompi_info --param rds
>>> hostfile -- no luck.
>>> I also tried copying that file to another location and setting
>>> OMPI_MCA_rds_hostfile_path -- no luck.
>>>
>>> The remote hosts are named op2-1 and op2-2. Could this be another case
>>> of the problem I saw a few days ago where the hostnames were assumed to
>>> contain a numeric pattern?
>>>
>>> -- Mark
>>>
>>>
>>>
>>> Ralph Castain wrote:
>>>>
>>>> For the 1.2 release, I believe you will find the enviro param is
>>>> OMPI_MCA_rds_hostfile_path - you can check that with "ompi_info".
>>>>
>>>>
>>>> On Jul 29, 2008, at 11:10 AM, Mark Borgerding wrote:
>>>>
>>>>> Umm ... what -hostfile file?
>>>>>
>>>>> I am not starting anything via mpiexec/orterun so there is no
>>>>> "-hostfile" argument AFAIK.
>>>>> Is there some other way to communicate this? An environment variable or
>>>>> mca param?
>>>>>
>>>>>
>>>>> -- Mark
>>>>>
>>>>>
>>>>> Ralph Castain wrote:
>>>>>>
>>>>>> Are the hosts where you want the children to go in your -hostfile
>>>>>> file? All of the hosts you intend to use have to be in that file, even if
>>>>>> they don't get used until the comm_spawn.
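Concretely, the setup Ralph describes might look like the following sketch
(host names follow the ones used in this thread; slot counts are examples):

```shell
# The hostfile must list every host that will ever be used,
# including hosts that are only targeted later by MPI_Comm_spawn.
cat > myhosts <<'EOF'
op2-1 slots=8
op2-2 slots=8
EOF

# You would then launch the spawner with something like:
#   mpiexec -np 1 --hostfile myhosts ./spawner op2-2
cat myhosts
```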
>>>>>>
>>>>>>
>>>>>> On Jul 29, 2008, at 9:08 AM, Mark Borgerding wrote:
>>>>>>
>>>>>>> I've tried lots of different values for the "host" key in the info
>>>>>>> handle.
>>>>>>> I've tried hardcoding the hostname+ip entries in the /etc/hosts file
>>>>>>> -- no luck. I cannot get my MPI_Comm_spawn children to go anywhere else on
>>>>>>> the network.
>>>>>>>
>>>>>>> mpiexec can start groups on the other machines just fine. It seems
>>>>>>> like there is some initialization that is done by orterun but not by
>>>>>>> MPI_Comm_spawn.
>>>>>>>
>>>>>>> Is there a document that describes how the default process management
>>>>>>> works?
>>>>>>> I do not have infiniband, myrinet or any specialized rte, just ssh.
>>>>>>> All the machines are CentOS 5.2 (openmpi 1.2.5)
>>>>>>>
>>>>>>>
>>>>>>> -- Mark
>>>>>>>
>>>>>>> Ralph Castain wrote:
>>>>>>>>
>>>>>>>> The string "localhost" may not be recognized in the 1.2 series for
>>>>>>>> comm_spawn. Do a "hostname" and use that string instead - should work.
>>>>>>>>
>>>>>>>> Ralph
>>>>>>>>
>>>>>>>> On Jul 28, 2008, at 10:38 AM, Mark Borgerding wrote:
>>>>>>>>
>>>>>>>>> When I add the info parameter in MPI_Comm_spawn, I get the error
>>>>>>>>> "Some of the requested hosts are not included in the current
>>>>>>>>> allocation for the application:
>>>>>>>>> [...]
>>>>>>>>> Verify that you have mapped the allocated resources properly using
>>>>>>>>> the
>>>>>>>>> --host specification."
>>>>>>>>>
>>>>>>>>> Here is a snippet of my code that causes the error:
>>>>>>>>>
>>>>>>>>> MPI_Info info;
>>>>>>>>> MPI_Info_create( &info );
>>>>>>>>> MPI_Info_set(info,"host","localhost");
>>>>>>>>> MPI_Comm_spawn( cmd , MPI_ARGV_NULL , nkids , info , 0 ,
>>>>>>>>> MPI_COMM_SELF , &kid , errs );
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Mark Borgerding wrote:
>>>>>>>>>>
>>>>>>>>>> Thanks, I don't know how I missed that. Perhaps I got thrown off
>>>>>>>>>> by
>>>>>>>>>> "Portable programs not requiring detailed control over process
>>>>>>>>>> locations should use MPI_INFO_NULL."
>>>>>>>>>>
>>>>>>>>>> If there were a computing equivalent of Maslow's Hierarchy of
>>>>>>>>>> Needs, functioning would be more fundamental than portability :)
>>>>>>>>>>
>>>>>>>>>> -- Mark
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Ralph Castain wrote:
>>>>>>>>>>>
>>>>>>>>>>> Take a look at the man page for MPI_Comm_spawn. It should explain
>>>>>>>>>>> that you need to create an MPI_Info key that has the key of "host" and a
>>>>>>>>>>> value that contains a comma-delimited list of hosts to be used for the child
>>>>>>>>>>> processes.
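A fragment showing what that man-page advice looks like in code. The host
names are examples, and cmd, nkids, kid, and errs stand in for the variables
already used in the spawner code earlier in this thread:

```c
/* Pass several hosts for the children in one "host" key. */
MPI_Info info;
MPI_Info_create(&info);
MPI_Info_set(info, "host", "op2-1,op2-2");   /* comma-delimited, no spaces */
MPI_Comm_spawn(cmd, MPI_ARGV_NULL, nkids, info, 0,
               MPI_COMM_SELF, &kid, errs);
MPI_Info_free(&info);
```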
>>>>>>>>>>>
>>>>>>>>>>> Hope that helps
>>>>>>>>>>> Ralph
>>>>>>>>>>>
>>>>>>>>>>> On Jul 28, 2008, at 8:54 AM, Mark Borgerding wrote:
>>>>>>>>>>>
>>>>>>>>>>>> How does openmpi decide which hosts are used with
>>>>>>>>>>>> MPI_Comm_spawn? All the docs I've found talk about specifying hosts on the
>>>>>>>>>>>> mpiexec/mpirun command and so are not applicable.
>>>>>>>>>>>> I am unable to spawn on anything but localhost (which makes for
>>>>>>>>>>>> a pretty uninteresting cluster).
>>>>>>>>>>>>
>>>>>>>>>>>> When I run
>>>>>>>>>>>> ompi_info --param rds hostfile
>>>>>>>>>>>> It reports MCA rds: parameter
>>>>>>>>>>>> "rds_hostfile_path" (current value:
>>>>>>>>>>>> "/usr/lib/openmpi/1.2.5-gcc/etc/openmpi-default-hostfile")
>>>>>>>>>>>> I tried changing that file but it has no effect.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> I am using
>>>>>>>>>>>> openmpi 1.2.5
>>>>>>>>>>>> CentOS 5.2
>>>>>>>>>>>> ethernet TCP
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> -- Mark
>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>> users mailing list
>>>>>>>>>>>> users_at_[hidden]
>>>>>>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Mark Borgerding
>>>>>>>>> 3dB Labs, Inc
>>>>>>>>> Innovate. Develop. Deliver.
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
>
> #include <stdio.h>
> #include <stdlib.h>
> #include <string.h>
> #include <mpi.h>
>
> /*
> *(new BSD license)
> *
> Copyright (c) 2008 Mark Borgerding
>
> All rights reserved.
>
> Redistribution and use in source and binary forms, with or without
> modification, are permitted provided that the following conditions are met:
>
> * Redistributions of source code must retain the above copyright notice,
> this list of conditions and the following disclaimer.
> * Redistributions in binary form must reproduce the above copyright
> notice, this list of conditions and the following disclaimer in the
> documentation and/or other materials provided with the distribution.
> * Neither the author nor the names of any contributors may be used to
> endorse or promote products derived from this software without specific
> prior written permission.
> THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS
> IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO,
> THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
> PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR
> CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
> EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
> PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS;
> OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY,
> WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR
> OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF
> ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
> *
> */
>
> int main(int argc, char ** argv)
> {
>     MPI_Comm parent;
>     MPI_Comm allmpi;
>     MPI_Info info;
>     MPI_Comm icom;
>     MPI_Status status;
>     int rank, size, length, count;
>     char name[256];
>
>     MPI_Init(NULL, NULL);
>     MPI_Comm_get_parent(&parent);
>
>     if (parent == MPI_COMM_NULL) {
>         MPI_Comm_size(MPI_COMM_WORLD, &size);
>         if (size > 1) {
>             fprintf(stderr, "I think I was started by orterun\n");
>             MPI_Comm_dup(MPI_COMM_WORLD, &allmpi);
>         } else {
>             if (argc < 2) {
>                 fprintf(stderr, "please provide a host argument (will be "
>                         "placed in MPI_Info for MPI_Comm_spawn)\n");
>                 return 1;
>             }
>             fprintf(stderr, "I am going to spawn 2 children on %s\n", argv[1]);
>             int errs[2];
>
>             MPI_Info_create(&info);
>             MPI_Info_set(info, "host", argv[1]);
>             MPI_Comm_spawn(argv[0], MPI_ARGV_NULL, 2, info, 0,
>                            MPI_COMM_WORLD, &icom, errs);
>             MPI_Intercomm_merge(icom, 0, &allmpi);
>             MPI_Info_free(&info);
>         }
>     } else {
>         fprintf(stderr, "I was started by MPI_Comm_spawn\n");
>         MPI_Intercomm_merge(parent, 1, &allmpi);
>     }
>
>     MPI_Comm_rank(allmpi, &rank);
>     MPI_Comm_size(allmpi, &size);
>     MPI_Get_processor_name(name, &length);
>     fprintf(stderr, "Hello my name is %s. I am %d of %d\n", name, rank, size);
>
>     if (rank == 0) {
>         int k;
>         float buf[128];
>         memset(buf, 0, sizeof(buf));
>         fprintf(stderr, "rank zero sending data to all others\n");
>         for (k = 1; k < size; ++k)
>             MPI_Send(buf, 128, MPI_FLOAT, k, 42, allmpi);
>         fprintf(stderr, "rank zero receiving data from all others\n");
>
>         for (k = 1; k < size; ++k) {
>             MPI_Recv(buf, 128, MPI_FLOAT, k, 42, allmpi, &status);
>             MPI_Get_count(&status, MPI_FLOAT, &count);
>             if (count != 128) {
>                 fprintf(stderr, "short read from %d (count=%d)\n", k, count);
>                 exit(1);
>             }
>         }
>     } else {
>         float buf[128];
>         MPI_Recv(buf, 128, MPI_FLOAT, 0, 42, allmpi, &status);
>         MPI_Get_count(&status, MPI_FLOAT, &count);
>         if (count != 128) {
>             fprintf(stderr, "short read from 0 (count=%d)\n", count);
>             exit(1);
>         }
>         MPI_Send(buf, 128, MPI_FLOAT, 0, 42, allmpi);
>     }
>     fprintf(stderr, "Exiting %s (%d of %d)\n", name, rank, size);
>
>     MPI_Comm_free(&allmpi);
>     MPI_Finalize();
>     return 0;
> }
>
>
>