
Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] MPI::Intracomm::Spawn and cluster configuration
From: Ralph Castain (rhc_at_[hidden])
Date: 2012-08-31 17:37:26


I see - well, I hope to work on it this weekend and may get it fixed. If I do, I can provide you with a patch for the 1.6 series that you can use until the actual release is issued, if that helps.

On Aug 31, 2012, at 2:33 PM, Brian Budge <brian.budge_at_[hidden]> wrote:

> Hi Ralph -
>
> This is true, but we may not know until well into the process whether
> we need MPI at all. We have an SMP/NUMA mode that is designed to run
> faster on a single machine. We also may build our application on
> machines where there is no MPI, and we simply don't build the code
> that runs the MPI functionality in that case. We have scripts all
> over the place that need to start this application, and it would be
> much easier to be able to simply run the program than to figure out
> when or if mpirun needs to be starting the program.
>
> Before, we went so far as to fork and exec a full mpirun when we run
> in clustered mode. This resulted in an additional process running,
> and we had to use sockets to get the data to the new master process.
> I very much like the idea of being able to have our process become the
> MPI master instead, so I have been very excited about your work around
> this singleton fork/exec under the hood.
>
> Once I get my new infrastructure designed to work with mpirun -n 1 +
> spawn, I will try some previous openmpi versions to see if I can find
> a version with this singleton functionality intact.
>
> Thanks again,
> Brian
>
> On Thu, Aug 30, 2012 at 4:51 PM, Ralph Castain <rhc_at_[hidden]> wrote:
>> not off the top of my head. However, as noted earlier, there is absolutely no advantage to a singleton vs mpirun start - all the singleton does is immediately fork/exec "mpirun" to support the rest of the job. In both cases, you have a daemon running the job - only difference is in the number of characters the user types to start it.
>>
>>
>> On Aug 30, 2012, at 8:44 AM, Brian Budge <brian.budge_at_[hidden]> wrote:
>>
>>> In the event that I need to get this up-and-running soon (I do need
>>> something working within 2 weeks), can you recommend an older version
>>> where this is expected to work?
>>>
>>> Thanks,
>>> Brian
>>>
>>> On Tue, Aug 28, 2012 at 4:58 PM, Brian Budge <brian.budge_at_[hidden]> wrote:
>>>> Thanks!
>>>>
>>>> On Tue, Aug 28, 2012 at 4:57 PM, Ralph Castain <rhc_at_[hidden]> wrote:
>>>>> Yeah, I'm seeing the hang as well when running across multiple machines. Let me dig a little and get this fixed.
>>>>>
>>>>> Thanks
>>>>> Ralph
>>>>>
>>>>> On Aug 28, 2012, at 4:51 PM, Brian Budge <brian.budge_at_[hidden]> wrote:
>>>>>
>>>>>> Hmmm, I went to the build directories of openmpi for my two machines,
>>>>>> went into the orte/test/mpi directory and made the executables on both
>>>>>> machines. I set the hostsfile in the env variable on the "master"
>>>>>> machine.
>>>>>>
>>>>>> Here's the output:
>>>>>>
>>>>>> OMPI_MCA_orte_default_hostfile=/home/budgeb/p4/pseb/external/install/openmpi-1.6.1/orte/test/mpi/hostsfile
>>>>>> ./simple_spawn
>>>>>> Parent [pid 97504] starting up!
>>>>>> 0 completed MPI_Init
>>>>>> Parent [pid 97504] about to spawn!
>>>>>> Parent [pid 97507] starting up!
>>>>>> Parent [pid 97508] starting up!
>>>>>> Parent [pid 30626] starting up!
>>>>>> ^C
>>>>>> zsh: interrupt OMPI_MCA_orte_default_hostfile= ./simple_spawn
>>>>>>
>>>>>> I had to ^C to kill the hung process.
>>>>>>
>>>>>> When I run using mpirun:
>>>>>>
>>>>>> OMPI_MCA_orte_default_hostfile=/home/budgeb/p4/pseb/external/install/openmpi-1.6.1/orte/test/mpi/hostsfile
>>>>>> mpirun -np 1 ./simple_spawn
>>>>>> Parent [pid 97511] starting up!
>>>>>> 0 completed MPI_Init
>>>>>> Parent [pid 97511] about to spawn!
>>>>>> Parent [pid 97513] starting up!
>>>>>> Parent [pid 30762] starting up!
>>>>>> Parent [pid 30764] starting up!
>>>>>> Parent done with spawn
>>>>>> Parent sending message to child
>>>>>> 1 completed MPI_Init
>>>>>> Hello from the child 1 of 3 on host budgeb-sandybridge pid 97513
>>>>>> 0 completed MPI_Init
>>>>>> Hello from the child 0 of 3 on host budgeb-interlagos pid 30762
>>>>>> 2 completed MPI_Init
>>>>>> Hello from the child 2 of 3 on host budgeb-interlagos pid 30764
>>>>>> Child 1 disconnected
>>>>>> Child 0 received msg: 38
>>>>>> Child 0 disconnected
>>>>>> Parent disconnected
>>>>>> Child 2 disconnected
>>>>>> 97511: exiting
>>>>>> 97513: exiting
>>>>>> 30762: exiting
>>>>>> 30764: exiting
>>>>>>
>>>>>> As you can see, I'm using Open MPI v1.6.1, freshly installed on both
>>>>>> machines with the default configure options.
>>>>>>
>>>>>> Thanks for all your help.
>>>>>>
>>>>>> Brian
>>>>>>
>>>>>> On Tue, Aug 28, 2012 at 4:39 PM, Ralph Castain <rhc_at_[hidden]> wrote:
>>>>>>> Looks to me like it didn't find your executable - could be a question of where it exists relative to where you are running. If you look in your OMPI source tree at the orte/test/mpi directory, you'll see an example program "simple_spawn.c" there. Just "make simple_spawn" and execute that with your default hostfile set - does it work okay?
>>>>>>>
>>>>>>> It works fine for me, hence the question.
>>>>>>>
>>>>>>> Also, what OMPI version are you using?
>>>>>>>
>>>>>>> On Aug 28, 2012, at 4:25 PM, Brian Budge <brian.budge_at_[hidden]> wrote:
>>>>>>>
>>>>>>>> I see. Okay. So, I just tried removing the check for universe size,
>>>>>>>> and set the universe size to 2. Here's my output:
>>>>>>>>
>>>>>>>> LD_LIBRARY_PATH=/home/budgeb/p4/pseb/external/lib.dev:/usr/local/lib
>>>>>>>> OMPI_MCA_orte_default_hostfile=`pwd`/hostsfile ./master_exe
>>>>>>>> [budgeb-interlagos:29965] [[4156,0],0] ORTE_ERROR_LOG: Fatal in file
>>>>>>>> base/plm_base_receive.c at line 253
>>>>>>>> [budgeb-interlagos:29963] [[4156,1],0] ORTE_ERROR_LOG: The specified
>>>>>>>> application failed to start in file dpm_orte.c at line 785
>>>>>>>>
>>>>>>>> The corresponding run with mpirun still works.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Brian
>>>>>>>>
>>>>>>>> On Tue, Aug 28, 2012 at 2:46 PM, Ralph Castain <rhc_at_[hidden]> wrote:
>>>>>>>>> I see the issue - it's here:
>>>>>>>>>
>>>>>>>>>> MPI_Attr_get(MPI_COMM_WORLD, MPI_UNIVERSE_SIZE, &puniverseSize, &flag);
>>>>>>>>>>
>>>>>>>>>> if(!flag) {
>>>>>>>>>> std::cerr << "no universe size" << std::endl;
>>>>>>>>>> return -1;
>>>>>>>>>> }
>>>>>>>>>> universeSize = *puniverseSize;
>>>>>>>>>> if(universeSize == 1) {
>>>>>>>>>> std::cerr << "cannot start slaves... not enough nodes" << std::endl;
>>>>>>>>>> }
>>>>>>>>>
>>>>>>>>> The universe size is set to 1 on a singleton because the attribute gets set at the beginning of time - we haven't any way to go back and change it. The sequence of events explains why. The singleton starts up and sets its attributes, including universe_size. It also spins off an orte daemon to act as its own private "mpirun" in case you call comm_spawn. At this point, however, no hostfile has been read - the singleton is just an MPI proc doing its own thing, and the orte daemon is just sitting there on "stand-by".
>>>>>>>>>
>>>>>>>>> When your app calls comm_spawn, then the orte daemon gets called to launch the new procs. At that time, it (not the original singleton!) reads the hostfile to find out how many nodes are around, and then does the launch.
>>>>>>>>>
>>>>>>>>> You are trying to check the number of nodes from within the singleton, which won't work - it has no way of discovering that info.
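The sequence just described can be sketched as a shell session. This is a minimal sketch, assuming an Open MPI install; the hostfile path and host names are illustrative, and the actual launches are left commented so the sketch stands alone:

```shell
# Write an illustrative hostfile and point Open MPI at it via the MCA envar.
cat > /tmp/demo_hostsfile <<'EOF'
nodeA
nodeB
EOF
export OMPI_MCA_orte_default_hostfile=/tmp/demo_hostsfile

# Singleton start: inside the process, MPI_UNIVERSE_SIZE is fixed at 1,
# although the private orte daemon will still read the hostfile later,
# at comm_spawn time:
#   ./master_exe
# mpirun start: the universe-size attribute reflects the allocation up front:
#   mpirun -n 1 ./master_exe
```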
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Aug 28, 2012, at 2:38 PM, Brian Budge <brian.budge_at_[hidden]> wrote:
>>>>>>>>>
>>>>>>>>>>> echo hostsfile
>>>>>>>>>> localhost
>>>>>>>>>> budgeb-sandybridge
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Brian
>>>>>>>>>>
>>>>>>>>>> On Tue, Aug 28, 2012 at 2:36 PM, Ralph Castain <rhc_at_[hidden]> wrote:
>>>>>>>>>>> Hmmm...what is in your "hostsfile"?
>>>>>>>>>>>
>>>>>>>>>>> On Aug 28, 2012, at 2:33 PM, Brian Budge <brian.budge_at_[hidden]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi Ralph -
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks for confirming this is possible. I'm trying this and currently
>>>>>>>>>>>> failing. Perhaps there's something I'm missing in the code to make
>>>>>>>>>>>> this work. Here are the two instantiations and their outputs:
>>>>>>>>>>>>
>>>>>>>>>>>>> LD_LIBRARY_PATH=/home/budgeb/p4/pseb/external/lib.dev:/usr/local/lib OMPI_MCA_orte_default_hostfile=`pwd`/hostsfile ./master_exe
>>>>>>>>>>>> cannot start slaves... not enough nodes
>>>>>>>>>>>>
>>>>>>>>>>>>> LD_LIBRARY_PATH=/home/budgeb/p4/pseb/external/lib.dev:/usr/local/lib OMPI_MCA_orte_default_hostfile=`pwd`/hostsfile mpirun -n 1 ./master_exe
>>>>>>>>>>>> master spawned 1 slaves...
>>>>>>>>>>>> slave responding...
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> The code:
>>>>>>>>>>>>
>>>>>>>>>>>> //master.cpp
>>>>>>>>>>>> #include <mpi.h>
>>>>>>>>>>>> #include <boost/filesystem.hpp>
>>>>>>>>>>>> #include <iostream>
>>>>>>>>>>>> #include <cstring>   // for memcpy
>>>>>>>>>>>> #include <alloca.h>  // for alloca
>>>>>>>>>>>>
>>>>>>>>>>>> int main(int argc, char **args) {
>>>>>>>>>>>> int worldSize, universeSize, *puniverseSize, flag;
>>>>>>>>>>>>
>>>>>>>>>>>> MPI_Comm everyone; //intercomm
>>>>>>>>>>>> boost::filesystem::path curPath =
>>>>>>>>>>>> boost::filesystem::absolute(boost::filesystem::current_path());
>>>>>>>>>>>>
>>>>>>>>>>>> std::string toRun = (curPath / "slave_exe").string();
>>>>>>>>>>>>
>>>>>>>>>>>> int ret = MPI_Init(&argc, &args);
>>>>>>>>>>>>
>>>>>>>>>>>> if(ret != MPI_SUCCESS) {
>>>>>>>>>>>> std::cerr << "failed init" << std::endl;
>>>>>>>>>>>> return -1;
>>>>>>>>>>>> }
>>>>>>>>>>>>
>>>>>>>>>>>> MPI_Comm_size(MPI_COMM_WORLD, &worldSize);
>>>>>>>>>>>>
>>>>>>>>>>>> if(worldSize != 1) {
>>>>>>>>>>>> std::cerr << "too many masters" << std::endl;
>>>>>>>>>>>> }
>>>>>>>>>>>>
>>>>>>>>>>>> MPI_Attr_get(MPI_COMM_WORLD, MPI_UNIVERSE_SIZE, &puniverseSize, &flag);
>>>>>>>>>>>>
>>>>>>>>>>>> if(!flag) {
>>>>>>>>>>>> std::cerr << "no universe size" << std::endl;
>>>>>>>>>>>> return -1;
>>>>>>>>>>>> }
>>>>>>>>>>>> universeSize = *puniverseSize;
>>>>>>>>>>>> if(universeSize == 1) {
>>>>>>>>>>>> std::cerr << "cannot start slaves... not enough nodes" << std::endl;
>>>>>>>>>>>> }
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> char *buf = (char*)alloca(toRun.size() + 1);
>>>>>>>>>>>> memcpy(buf, toRun.c_str(), toRun.size());
>>>>>>>>>>>> buf[toRun.size()] = '\0';
>>>>>>>>>>>>
>>>>>>>>>>>> MPI_Comm_spawn(buf, MPI_ARGV_NULL, universeSize-1, MPI_INFO_NULL,
>>>>>>>>>>>> 0, MPI_COMM_SELF, &everyone,
>>>>>>>>>>>> MPI_ERRCODES_IGNORE);
>>>>>>>>>>>>
>>>>>>>>>>>> std::cerr << "master spawned " << universeSize-1 << " slaves..."
>>>>>>>>>>>> << std::endl;
>>>>>>>>>>>>
>>>>>>>>>>>> MPI_Finalize();
>>>>>>>>>>>>
>>>>>>>>>>>> return 0;
>>>>>>>>>>>> }
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> //slave.cpp
>>>>>>>>>>>> #include <mpi.h>
>>>>>>>>>>>> #include <iostream>  // for std::cerr
>>>>>>>>>>>>
>>>>>>>>>>>> int main(int argc, char **args) {
>>>>>>>>>>>> int size;
>>>>>>>>>>>> MPI_Comm parent;
>>>>>>>>>>>> MPI_Init(&argc, &args);
>>>>>>>>>>>>
>>>>>>>>>>>> MPI_Comm_get_parent(&parent);
>>>>>>>>>>>>
>>>>>>>>>>>> if(parent == MPI_COMM_NULL) {
>>>>>>>>>>>> std::cerr << "slave has no parent" << std::endl;
>>>>>>>>>>>> }
>>>>>>>>>>>> MPI_Comm_remote_size(parent, &size);
>>>>>>>>>>>> if(size != 1) {
>>>>>>>>>>>> std::cerr << "parent size is " << size << std::endl;
>>>>>>>>>>>> }
>>>>>>>>>>>>
>>>>>>>>>>>> std::cerr << "slave responding..." << std::endl;
>>>>>>>>>>>>
>>>>>>>>>>>> MPI_Finalize();
>>>>>>>>>>>>
>>>>>>>>>>>> return 0;
>>>>>>>>>>>> }
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Any ideas? Thanks for any help.
>>>>>>>>>>>>
>>>>>>>>>>>> Brian
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, Aug 22, 2012 at 9:03 AM, Ralph Castain <rhc_at_[hidden]> wrote:
>>>>>>>>>>>>> It really is just that simple :-)
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Aug 22, 2012, at 8:56 AM, Brian Budge <brian.budge_at_[hidden]> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Okay. Is there a tutorial or FAQ for setting everything up? Or is it
>>>>>>>>>>>>>> really just that simple? I don't need to run a copy of the orte
>>>>>>>>>>>>>> server somewhere?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> if my current ip is 192.168.0.1,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 0 > echo 192.168.0.11 > /tmp/hostfile
>>>>>>>>>>>>>> 1 > echo 192.168.0.12 >> /tmp/hostfile
>>>>>>>>>>>>>> 2 > export OMPI_MCA_orte_default_hostfile=/tmp/hostfile
>>>>>>>>>>>>>> 3 > ./mySpawningExe
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> At this point, mySpawningExe will be the master, running on
>>>>>>>>>>>>>> 192.168.0.1, and I can have spawned, for example, childExe on
>>>>>>>>>>>>>> 192.168.0.11 and 192.168.0.12? Or childExe1 on 192.168.0.11 and
>>>>>>>>>>>>>> childExe2 on 192.168.0.12?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks for the help.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Brian
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Wed, Aug 22, 2012 at 7:15 AM, Ralph Castain <rhc_at_[hidden]> wrote:
>>>>>>>>>>>>>>> Sure, that's still true on all 1.3 or above releases. All you need to do is set the hostfile envar so we pick it up:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> OMPI_MCA_orte_default_hostfile=<foo>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Aug 21, 2012, at 7:23 PM, Brian Budge <brian.budge_at_[hidden]> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hi. I know this is an old thread, but I'm curious if there are any
>>>>>>>>>>>>>>>> tutorials describing how to set this up? Is this still available on
>>>>>>>>>>>>>>>> newer open mpi versions?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>> Brian
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Fri, Jan 4, 2008 at 7:57 AM, Ralph Castain <rhc_at_[hidden]> wrote:
>>>>>>>>>>>>>>>>> Hi Elena
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I'm copying this to the user list just to correct a mis-statement on my part
>>>>>>>>>>>>>>>>> in an earlier message that went there. I had stated that a singleton could
>>>>>>>>>>>>>>>>> comm_spawn onto other nodes listed in a hostfile by setting an environmental
>>>>>>>>>>>>>>>>> variable that pointed us to the hostfile.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> This is incorrect in the 1.2 code series. That series does not allow
>>>>>>>>>>>>>>>>> singletons to read a hostfile at all. Hence, any comm_spawn done by a
>>>>>>>>>>>>>>>>> singleton can only launch child processes on the singleton's local host.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> This situation has been corrected for the upcoming 1.3 code series. For the
>>>>>>>>>>>>>>>>> 1.2 series, though, you will have to do it via an mpirun command line.
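A minimal sketch of that 1.2-series workaround, with illustrative file, host, and executable names (the mpirun line is commented since it assumes an Open MPI install):

```shell
# List every node that the master and its spawned children may use.
cat > /tmp/my_hostfile <<'EOF'
host1 slots=2
host2 slots=2
EOF

# mpirun - not the singleton - reads the hostfile on the 1.2 series,
# so a comm_spawn from the master can then reach the other nodes:
#   mpirun -n 1 -hostfile /tmp/my_hostfile ./my_master.exe
```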
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Sorry for the confusion - I sometimes have too many code families to keep
>>>>>>>>>>>>>>>>> straight in this old mind!
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Ralph
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On 1/4/08 5:10 AM, "Elena Zhebel" <ezhebel_at_[hidden]> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Hello Ralph,
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Thank you very much for the explanations.
>>>>>>>>>>>>>>>>>> But I still do not get it running...
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> For the case
>>>>>>>>>>>>>>>>>> mpirun -n 1 -hostfile my_hostfile -host my_master_host my_master.exe
>>>>>>>>>>>>>>>>>> everything works.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> For the case
>>>>>>>>>>>>>>>>>> ./my_master.exe
>>>>>>>>>>>>>>>>>> it does not.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I did:
>>>>>>>>>>>>>>>>>> - create my_hostfile and put it in the $HOME/.openmpi/components/
>>>>>>>>>>>>>>>>>> my_hostfile :
>>>>>>>>>>>>>>>>>> bollenstreek slots=2 max_slots=3
>>>>>>>>>>>>>>>>>> octocore01 slots=8 max_slots=8
>>>>>>>>>>>>>>>>>> octocore02 slots=8 max_slots=8
>>>>>>>>>>>>>>>>>> clstr000 slots=2 max_slots=3
>>>>>>>>>>>>>>>>>> clstr001 slots=2 max_slots=3
>>>>>>>>>>>>>>>>>> clstr002 slots=2 max_slots=3
>>>>>>>>>>>>>>>>>> clstr003 slots=2 max_slots=3
>>>>>>>>>>>>>>>>>> clstr004 slots=2 max_slots=3
>>>>>>>>>>>>>>>>>> clstr005 slots=2 max_slots=3
>>>>>>>>>>>>>>>>>> clstr006 slots=2 max_slots=3
>>>>>>>>>>>>>>>>>> clstr007 slots=2 max_slots=3
>>>>>>>>>>>>>>>>>> - setenv OMPI_MCA_rds_hostfile_path my_hostfile (I put it in .tcshrc and
>>>>>>>>>>>>>>>>>> then source .tcshrc)
>>>>>>>>>>>>>>>>>> - in my_master.cpp I did
>>>>>>>>>>>>>>>>>> MPI_Info info1;
>>>>>>>>>>>>>>>>>> MPI_Info_create(&info1);
>>>>>>>>>>>>>>>>>> char* hostname =
>>>>>>>>>>>>>>>>>> "clstr002,clstr003,clstr005,clstr006,clstr007,octocore01,octocore02";
>>>>>>>>>>>>>>>>>> MPI_Info_set(info1, "host", hostname);
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> _intercomm = intracomm.Spawn("./childexe", argv1, _nProc, info1, 0,
>>>>>>>>>>>>>>>>>> MPI_ERRCODES_IGNORE);
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> - After I call the executable, I've got this error message
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> bollenstreek: > ./my_master
>>>>>>>>>>>>>>>>>> number of processes to run: 1
>>>>>>>>>>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>>>>>>>>>>> Some of the requested hosts are not included in the current allocation for
>>>>>>>>>>>>>>>>>> the application:
>>>>>>>>>>>>>>>>>> ./childexe
>>>>>>>>>>>>>>>>>> The requested hosts were:
>>>>>>>>>>>>>>>>>> clstr002,clstr003,clstr005,clstr006,clstr007,octocore01,octocore02
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Verify that you have mapped the allocated resources properly using the
>>>>>>>>>>>>>>>>>> --host specification.
>>>>>>>>>>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>>>>>>>>>>> [bollenstreek:21443] [0,0,0] ORTE_ERROR_LOG: Out of resource in file
>>>>>>>>>>>>>>>>>> base/rmaps_base_support_fns.c at line 225
>>>>>>>>>>>>>>>>>> [bollenstreek:21443] [0,0,0] ORTE_ERROR_LOG: Out of resource in file
>>>>>>>>>>>>>>>>>> rmaps_rr.c at line 478
>>>>>>>>>>>>>>>>>> [bollenstreek:21443] [0,0,0] ORTE_ERROR_LOG: Out of resource in file
>>>>>>>>>>>>>>>>>> base/rmaps_base_map_job.c at line 210
>>>>>>>>>>>>>>>>>> [bollenstreek:21443] [0,0,0] ORTE_ERROR_LOG: Out of resource in file
>>>>>>>>>>>>>>>>>> rmgr_urm.c at line 372
>>>>>>>>>>>>>>>>>> [bollenstreek:21443] [0,0,0] ORTE_ERROR_LOG: Out of resource in file
>>>>>>>>>>>>>>>>>> communicator/comm_dyn.c at line 608
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Did I miss something?
>>>>>>>>>>>>>>>>>> Thanks for help!
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Elena
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>>>>>>>> From: Ralph H Castain [mailto:rhc_at_[hidden]]
>>>>>>>>>>>>>>>>>> Sent: Tuesday, December 18, 2007 3:50 PM
>>>>>>>>>>>>>>>>>> To: Elena Zhebel; Open MPI Users <users_at_[hidden]>
>>>>>>>>>>>>>>>>>> Cc: Ralph H Castain
>>>>>>>>>>>>>>>>>> Subject: Re: [OMPI users] MPI::Intracomm::Spawn and cluster configuration
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On 12/18/07 7:35 AM, "Elena Zhebel" <ezhebel_at_[hidden]> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Thanks a lot! Now it works!
>>>>>>>>>>>>>>>>>>> The solution is to use mpirun -n 1 -hostfile my.hosts *.exe and pass an
>>>>>>>>>>>>>>>>>>> MPI_Info key to the Spawn function!
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> One more question: is it necessary to start my "master" program with
>>>>>>>>>>>>>>>>>>> mpirun -n 1 -hostfile my_hostfile -host my_master_host my_master.exe ?
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> No, it isn't necessary - assuming that my_master_host is the first host
>>>>>>>>>>>>>>>>>> listed in your hostfile! If you are only executing one my_master.exe (i.e.,
>>>>>>>>>>>>>>>>>> you gave -n 1 to mpirun), then we will automatically map that process onto
>>>>>>>>>>>>>>>>>> the first host in your hostfile.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> If you want my_master.exe to go on someone other than the first host in the
>>>>>>>>>>>>>>>>>> file, then you have to give us the -host option.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Are there other possibilities for easy start?
>>>>>>>>>>>>>>>>>>> I would say just to run ./my_master.exe, but then the master process
>>>>>>>>>>>>>>>>>>> doesn't know about the hosts available on the network.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> You can set the hostfile parameter in your environment instead of on the
>>>>>>>>>>>>>>>>>> command line. Just set OMPI_MCA_rds_hostfile_path = my.hosts.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> You can then just run ./my_master.exe on the host where you want the master
>>>>>>>>>>>>>>>>>> to reside - everything should work the same.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Just as an FYI: the name of that environmental variable is going to change
>>>>>>>>>>>>>>>>>> in the 1.3 release, but everything will still work the same.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Hope that helps
>>>>>>>>>>>>>>>>>> Ralph
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Thanks and regards,
>>>>>>>>>>>>>>>>>>> Elena
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>>>>>>>>> From: Ralph H Castain [mailto:rhc_at_[hidden]]
>>>>>>>>>>>>>>>>>>> Sent: Monday, December 17, 2007 5:49 PM
>>>>>>>>>>>>>>>>>>> To: Open MPI Users <users_at_[hidden]>; Elena Zhebel
>>>>>>>>>>>>>>>>>>> Cc: Ralph H Castain
>>>>>>>>>>>>>>>>>>> Subject: Re: [OMPI users] MPI::Intracomm::Spawn and cluster configuration
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On 12/17/07 8:19 AM, "Elena Zhebel" <ezhebel_at_[hidden]> wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Hello Ralph,
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Thank you for your answer.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> I'm using OpenMPI 1.2.3. , compiler glibc232, Linux Suse 10.0.
>>>>>>>>>>>>>>>>>>>> My "master" executable runs only on the one local host, then it spawns
>>>>>>>>>>>>>>>>>>>> "slaves" (with MPI::Intracomm::Spawn).
>>>>>>>>>>>>>>>>>>>> My question was: how to determine the hosts where these "slaves" will be
>>>>>>>>>>>>>>>>>>>> spawned?
>>>>>>>>>>>>>>>>>>>> You said: "You have to specify all of the hosts that can be used by
>>>>>>>>>>>>>>>>>>>> your job in the original hostfile." How can I specify the hostfile? I
>>>>>>>>>>>>>>>>>>>> cannot find it in the documentation.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Hmmm...sorry about the lack of documentation. I always assumed that the MPI
>>>>>>>>>>>>>>>>>>> folks in the project would document such things since it has little to do
>>>>>>>>>>>>>>>>>>> with the underlying run-time, but I guess that fell through the cracks.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> There are two parts to your question:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> 1. how to specify the hosts to be used for the entire job. I believe that
>>>>>>>>>>>>>>>>>>> is somewhat covered here:
>>>>>>>>>>>>>>>>>>> http://www.open-mpi.org/faq/?category=running#simple-spmd-run
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> That FAQ tells you what a hostfile should look like, though you may already
>>>>>>>>>>>>>>>>>>> know that. Basically, we require that you list -all- of the nodes that both
>>>>>>>>>>>>>>>>>>> your master and slave programs will use.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> 2. how to specify which nodes are available for the master, and which for
>>>>>>>>>>>>>>>>>>> the slave.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> You would specify the host for your master on the mpirun command line with
>>>>>>>>>>>>>>>>>>> something like:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> mpirun -n 1 -hostfile my_hostfile -host my_master_host my_master.exe
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> This directs Open MPI to map that specified executable on the specified
>>>>>>>>>>>>>>>>>>> host - note that my_master_host must have been in my_hostfile.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Inside your master, you would create an MPI_Info key "host" that has a
>>>>>>>>>>>>>>>>>>> value consisting of a string "host1,host2,host3" identifying the hosts
>>>>>>>>>>>>>>>>>>> you want your slave to execute upon. Those hosts must have been included
>>>>>>>>>>>>>>>>>>> in my_hostfile. Include that key in the MPI_Info array passed to your
>>>>>>>>>>>>>>>>>>> Spawn.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> We don't currently support providing a hostfile for the slaves (as opposed
>>>>>>>>>>>>>>>>>>> to the host-at-a-time string above). This may become available in a future
>>>>>>>>>>>>>>>>>>> release - TBD.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Hope that helps
>>>>>>>>>>>>>>>>>>> Ralph
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Thanks and regards,
>>>>>>>>>>>>>>>>>>>> Elena
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>>>>>>>>>> From: users-bounces_at_[hidden] [mailto:users-bounces_at_[hidden]] On
>>>>>>>>>>>>>>>>>>>> Behalf Of Ralph H Castain
>>>>>>>>>>>>>>>>>>>> Sent: Monday, December 17, 2007 3:31 PM
>>>>>>>>>>>>>>>>>>>> To: Open MPI Users <users_at_[hidden]>
>>>>>>>>>>>>>>>>>>>> Cc: Ralph H Castain
>>>>>>>>>>>>>>>>>>>> Subject: Re: [OMPI users] MPI::Intracomm::Spawn and cluster
>>>>>>>>>>>>>>>>>>>> configuration
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On 12/12/07 5:46 AM, "Elena Zhebel" <ezhebel_at_[hidden]> wrote:
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> I'm working on a MPI application where I'm using OpenMPI instead of
>>>>>>>>>>>>>>>>>>>>> MPICH.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> In my "master" program I call the function MPI::Intracomm::Spawn, which
>>>>>>>>>>>>>>>>>>>>> spawns "slave" processes. It is not clear to me how to spawn the "slave"
>>>>>>>>>>>>>>>>>>>>> processes over the network. Currently the "master" creates "slaves" on
>>>>>>>>>>>>>>>>>>>>> the same host.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> If I use 'mpirun --hostfile openmpi.hosts' then processes are spawned
>>>>>>>>>>>>>>>>>>>>> over the network as expected. But now I need to spawn processes over
>>>>>>>>>>>>>>>>>>>>> the network from my own executable using MPI::Intracomm::Spawn; how
>>>>>>>>>>>>>>>>>>>>> can I achieve it?
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> I'm not sure from your description exactly what you are trying to do,
>>>>>>>>>>>>>>>>>>>> nor in what environment this is all operating within or what version of
>>>>>>>>>>>>>>>>>>>> Open MPI you are using. Setting aside the environment and version issue,
>>>>>>>>>>>>>>>>>>>> I'm guessing that you are running your executable over some specified
>>>>>>>>>>>>>>>>>>>> set of hosts, but want to provide a different hostfile that specifies
>>>>>>>>>>>>>>>>>>>> the hosts to be used for the "slave" processes. Correct?
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> If that is correct, then I'm afraid you can't do that in any version of
>>>>>>>>>>>>>>>>>>>> Open MPI today. You have to specify all of the hosts that can be used
>>>>>>>>>>>>>>>>>>>> by your job in the original hostfile. You can then specify a subset of
>>>>>>>>>>>>>>>>>>>> those hosts to be used by your original "master" program, and then
>>>>>>>>>>>>>>>>>>>> specify a different subset to be used by the "slaves" when calling Spawn.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> But the system requires that you tell it -all- of the hosts that are
>>>>>>>>>>>>>>>>>>>> going to be used at the beginning of the job.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> At the moment, there is no plan to remove that requirement, though there
>>>>>>>>>>>>>>>>>>>> has been occasional discussion about doing so at some point in the
>>>>>>>>>>>>>>>>>>>> future. No promises that it will happen, though - managed environments,
>>>>>>>>>>>>>>>>>>>> in particular, currently object to the idea of changing the allocation
>>>>>>>>>>>>>>>>>>>> on-the-fly. We may, though, make a provision for purely hostfile-based
>>>>>>>>>>>>>>>>>>>> environments (i.e., unmanaged) at some time in the future.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Ralph
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Thanks in advance for any help.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Elena
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>>>>>>>>>>> users mailing list
>>>>>>>>>>>>>>>>>>>>> users_at_[hidden]
>>>>>>>>>>>>>>>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>>
>>>
>>
>>
>