Another approach I've seen used is to insert a resource-manager agent above each Open MPI process (be it a runtime process or an application process). Of course, it depends on how you collect resource usage and enforce your resource-limitation policy.
In the case I'm referring to, the agent was implemented as a Unix process, and all "application" processes needed to be direct children of one such agent process. Here, "application" processes include all Open MPI runtime processes (the orteds) as well as all user application processes. The trick was to have ORTE's deployment system launch all orteds under a resource-manager agent, using the batch scheduler's usual launching mechanism, and then to insert another such agent into the command line, so that the node orteds launched one resource-manager agent per application process.
That's a lot of processes, but it could work without changing much of the code base, if your setup is similar and you can launch as many resource agents per node as you want.
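To make the idea concrete, here is a minimal sketch of such an agent wrapper. It is purely hypothetical (a real agent would report to the scheduler's accounting system rather than print to stderr), but it shows the core mechanism: the wrapped command, whether an orted or an application process, runs as a direct child of the agent, so its resource usage can be collected with wait4(2).

```python
#!/usr/bin/env python3
"""Hypothetical per-process resource-manager agent (sketch only).

Usage: agent.py <command> [args...]
The wrapped process is a direct child of this agent, which lets the
agent collect its resource usage on exit via os.wait4()."""
import os
import sys


def run_under_agent(argv):
    """Fork, exec argv in the child, and return (exitcode, rusage)."""
    pid = os.fork()
    if pid == 0:
        # Child: replace ourselves with the wrapped command
        # (an orted, or a user application process).
        os.execvp(argv[0], argv)
    # Parent (the agent): wait for the direct child and collect
    # its resource usage in one call.
    _, status, rusage = os.wait4(pid, 0)
    return os.waitstatus_to_exitcode(status), rusage


if __name__ == "__main__":
    code, ru = run_under_agent(sys.argv[1:])
    print(f"child exited {code}: user {ru.ru_utime:.2f}s "
          f"sys {ru.ru_stime:.2f}s maxrss {ru.ru_maxrss}",
          file=sys.stderr)
    sys.exit(code)
```

In the deployment described above, the batch scheduler would launch `agent.py orted ...` on each node, and the orted's launch command lines would in turn be prefixed with `agent.py`, so every application process also sits under its own agent.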
On May 4, 2011, at 19:59, Ralph Castain wrote:
> In that case, why not just directly launch the processes without the orted? We do it with slurm and even have the ability to do it with torque - so it could be done.
> See the orte/mca/ess/slurmd component for an example of how to do so.
> On May 4, 2011, at 4:55 PM, Tony Lam wrote:
>> Hi Thomas,
>> We need to track job resource usage in our resource manager for
>> accounting and resource policy enforcement; sharing a single orted
>> process among multiple jobs makes the tracking much more complicated.
>> We don't enforce other restrictions, and I'd appreciate any suggestions
>> on how to resolve or work around this.
>> We have thought about mapping all processes from an mpirun into a
>> single job to simplify job resource tracking, but that would require widespread changes in our software.
>> On 05/04/11 15:34, Thomas Herault wrote:
>>> Could you explain why you would like one orted on top of each MPI process?
>>> There are some situations, like resource usage limitation / accounting, that are possible to solve without changing the one daemon per node deployment.
>>> Or do you enforce other kinds of restrictions on the orted process? Why wouldn't it be able to launch more than one MPI process, or why would that not be desirable?
>>> On May 4, 2011, at 15:51, Tony Lam wrote:
>>>> I understand a single orted is shared by all MPI processes from the same communicator on each execution host. Does anyone see a problem for MPI/OMPI if each process has its own orted? My guess is that it is less efficient in terms of MPI communication and memory footprint, but for simplicity of our integration with OMPI, launching one orted for each MPI process is much easier to do.
>>>> I would appreciate it if someone could confirm whether or not this setup will work.
>>>> devel mailing list