Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] [slurm-dev] slurm-dev Memory accounting issues with mpirun (was Re: Open-MPI build of NAMD launched from srun over 20% slowed than with mpirun)
From: Ralph Castain (rhc_at_[hidden])
Date: 2013-08-11 16:17:13


I can't speak to what you get from sacct, but I can say that things will definitely be different when launched directly via srun vs indirectly thru mpirun. The reason is that mpirun uses srun to launch the orte daemons, which then fork/exec all the application processes under them (as opposed to launching those app procs thru srun). This means two things:

1. Slurm has no direct knowledge of, or visibility into, the application procs themselves when they are launched by mpirun. Slurm only sees the ORTE daemons. I'm sure that Slurm rolls up all the resources used by those daemons and their children, so the totals should include them.

2. Since all Slurm can do is roll everything up, the resources shown in sacct will include those used by the daemons and mpirun as well as the application procs. Slurm doesn't include its own daemons or the slurmctld in its accounting, so the two numbers will be significantly different. If you are attempting to limit overall resource usage, you may need to leave some slack for the daemons and mpirun.
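
To make the roll-up in the two points above concrete, here is a minimal Python sketch. It is only an illustration of summing memory over a process tree, not how Slurm's accounting plugins are actually implemented. The process table, the pids, and the KB figures are all made up for the example.

```python
from collections import defaultdict

# Hypothetical process table: pid -> (parent pid, resident memory in KB).
# None of these numbers come from a real job.
PROCS = {
    100: (1, 2_000),     # ORTE daemon, the only process the launcher started
    101: (100, 90_000),  # application proc, fork/exec'd by the daemon
    102: (100, 88_000),  # application proc, fork/exec'd by the daemon
}


def rolled_up_rss(root, procs):
    """Sum the memory of `root` and every descendant of `root`."""
    children = defaultdict(list)
    for pid, (ppid, _rss) in procs.items():
        children[ppid].append(pid)

    total = 0
    stack = [root]
    while stack:
        pid = stack.pop()
        total += procs[pid][1]
        stack.extend(children[pid])
    return total


total = rolled_up_rss(100, PROCS)  # daemon plus application procs
app_only = total - PROCS[100][1]   # what the application alone used
print(total, app_only)
```

The gap between `total` and `app_only` is the slack you would need to leave for the daemon when setting a limit.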

You should also see an extra "step" in the mpirun-launched job as mpirun itself generally takes the first step, and the launch of the daemons occupies a second step.

As for the strange numbers you are seeing, it looks to me like you are hitting a mismatch of unsigned vs signed values. When adding them up, that could cause all kinds of erroneous behavior.
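
As a rough illustration of that kind of mismatch, the Python sketch below reinterprets the same bits as signed and as unsigned. The 32-bit width is purely an assumption for the example; I don't know which field width, if any, is actually involved in the accounting code, and the helper names are made up.

```python
import ctypes


def as_signed32(value):
    """Reinterpret the low 32 bits of `value` as a signed 32-bit integer."""
    return ctypes.c_int32(value & 0xFFFFFFFF).value


def as_unsigned32(value):
    """Reinterpret the low 32 bits of `value` as an unsigned 32-bit integer."""
    return value & 0xFFFFFFFF


# A sum that fits in an unsigned 32-bit field but not in a signed one
# comes out negative when it is read back as signed.
big_total = 3_000_000_000
print(as_signed32(big_total))

# Going the other way, a small negative number read as unsigned looks huge.
print(as_unsigned32(-523362))
```

Either direction of the mix-up gives values that look nothing like the real usage once they are added together.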

On Aug 6, 2013, at 11:55 PM, Christopher Samuel <samuel_at_[hidden]> wrote:

> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> On 07/08/13 16:19, Christopher Samuel wrote:
>
>> Anyone seen anything similar, or any ideas on what could be going
>> on?
>
> Sorry, this was with:
>
> # ACCOUNTING
> JobAcctGatherType=jobacct_gather/linux
> JobAcctGatherFrequency=30
>
> Since those initial tests we've started enforcing memory limits (the
> system is not yet in full production) and found that this causes jobs
> to get killed.
>
> We tried the cgroups gathering method, but jobs still die with mpirun
> and now the numbers don't seem to be right for mpirun or srun either:
>
> mpirun (killed):
>
> [samuel_at_barcoo-test Mem]$ sacct -j 94564 -o JobID,MaxRSS,MaxVMSize
> JobID            MaxRSS  MaxVMSize
> ------------ ---------- ----------
> 94564
> 94564.batch    -523362K          0
> 94564.0         394525K          0
>
> srun:
>
> [samuel_at_barcoo-test Mem]$ sacct -j 94565 -o JobID,MaxRSS,MaxVMSize
> JobID            MaxRSS  MaxVMSize
> ------------ ---------- ----------
> 94565
> 94565.batch       998K          0
> 94565.0         88663K          0
>
>
> All the best,
> Chris
> - --
> Christopher Samuel Senior Systems Administrator
> VLSCI - Victorian Life Sciences Computation Initiative
> Email: samuel_at_[hidden] Phone: +61 (0)3 903 55545
> http://www.vlsci.org.au/ http://twitter.com/vlsci
>
> -----BEGIN PGP SIGNATURE-----
> Version: GnuPG v1.4.11 (GNU/Linux)
> Comment: Using GnuPG with Thunderbird - http://www.enigmail.net/
>
> iEYEARECAAYFAlIB73wACgkQO2KABBYQAh+kwACfYnMbONcpxD2lsM5i4QDw5r93
> KpMAn2hPUxMJ62u2gZIUGl5I0bQ6lllk
> =jYrC
> -----END PGP SIGNATURE-----