To be honest, I'm totally lost in the naming scheme, which is why I was confused about the RFC you're referring to. We had an MCA parameter to start a VM, so I thought a VM was some kind of special virtualized environment and not the entire ORTE. Based on the behavior of the trunk and the RFC you referred to, it seems that ORTE is now a VM (and only that). What is the real truth? Why did we need orte_vm_launch, and why did this need suddenly disappear?
I'm really amazed. Open MPI is the only MPI library that does everything in reverse, and all of it blessed by the community. We had features that no other MPI implementation supported (but that were in the MPI standard), and we removed them (sic). Meanwhile, the other MPI implementations implemented them.
Thus, their feature list grows while ours shrinks. Clearly all successful projects should take inspiration from our growth strategy.
PS: Thanks for the --host fix. We have run into another issue: a job that terminates abnormally (MPI_Abort or a segfault) leaves daemons behind. Usually this is not very bothersome, but now, with the new VM, our entire cluster fills up with useless processes, to the point where after a while we have to reboot the machines to free up PIDs.
On Dec 14, 2011, at 10:08, Ralph Castain wrote:
> On Dec 13, 2011, at 9:10 PM, George Bosilca wrote:
>> Today I noticed a drastic change between the trunk and 1.5 in how ORTE deals with the hostfile.
>> 1. 1.5 and prior used the hostfile as a suggestion, a list of candidates from which to pick the requested number of daemons during the launch. The current trunk spawns daemons on all the nodes listed in the hostfile, and then spawns the apps on only some of them.
> It was in the RFC about the revised mapping system, George, and was discussed multiple times on the telecons. I even raised this specific point at least twice on those telecons.
>> 2. If a default hostfile is provided and --host is specified, 1.5 and prior use the --host list to restrict the environment to the requested nodes. The current trunk seems to ignore the --host option if a default hostfile is available.
> I'll check that one - we should limit the operation to the --host list.
>> In my configuration the hostfile is system-wide, specified in /etc via orte_default_hostfile. It contains all the nodes in the cluster, and users are expected to use --host to restrict their mpirun to a specific subset.
>> This seems like quite a significant change. I would have expected an RFC.
>> devel mailing list
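To make the two behaviors in the quoted thread concrete, here is a minimal Python sketch of the node selection George describes: a system-wide default hostfile acts as a pool of candidates, --host (if given) restricts that pool, and daemons are started either only on the nodes the job needs (1.5 behavior) or on the whole pool while the apps use a subset (trunk VM behavior). The names select_nodes, launch_vm and slots_per_node are hypothetical and do not correspond to ORTE's actual code; rejecting --host nodes that are absent from the hostfile is likewise an assumption.

```python
from typing import List, Optional, Tuple


def select_nodes(default_hostfile: List[str],
                 host_opt: Optional[List[str]],
                 num_procs: int,
                 slots_per_node: int = 1,
                 launch_vm: bool = False) -> Tuple[List[str], List[str]]:
    """Hypothetical sketch: return (daemon_nodes, app_nodes)."""
    pool = list(default_hostfile)
    if host_opt is not None:
        allowed = set(host_opt)
        # Assumption: --host may only name nodes present in the hostfile.
        unknown = allowed - set(default_hostfile)
        if unknown:
            raise ValueError(f"--host nodes not in hostfile: {sorted(unknown)}")
        # Keep hostfile order, drop every node not named in --host.
        pool = [n for n in default_hostfile if n in allowed]

    # Ceiling division: how many nodes the requested processes occupy.
    needed = -(-num_procs // slots_per_node)
    if needed > len(pool):
        raise ValueError("not enough nodes for the requested processes")

    app_nodes = pool[:needed]
    # VM style: daemons on every node in the pool; otherwise only where apps run.
    daemon_nodes = pool if launch_vm else app_nodes
    return daemon_nodes, app_nodes


if __name__ == "__main__":
    hosts = ["n01", "n02", "n03", "n04"]
    print(select_nodes(hosts, ["n03", "n01"], 2))
    print(select_nodes(hosts, None, 2, launch_vm=True))
```

With launch_vm=False the daemon list never grows beyond the nodes the job uses, which is the 1.5 behavior; with launch_vm=True every node in the (filtered) pool gets a daemon, which is why leftover daemons after an abnormal termination can accumulate across the whole cluster.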