
Subject: Re: [OMPI users] Mapping ranks to hosts (from MPI error messages)
From: Gus Correa (gus_at_[hidden])
Date: 2014-03-27 17:31:20


On 03/27/2014 04:10 PM, Reuti wrote:
> Hi,
>
> Am 27.03.2014 um 20:15 schrieb Gus Correa:
>
>> <snip>
>>> Awesome, but now here is my concern.
>> If we have OpenMPI-based applications launched as batch jobs
>> via a batch scheduler like SLURM, PBS, LSF, etc.
>> (which decides the placement of the app and dispatches it to the compute hosts),
>> then will including "--report-bindings --bind-to-core" cause problems?
>
> Do all of them have an internal bookkeeping of granted cores to slots -
> i.e. not only the number of scheduled slots per job per node, but also
> which core was granted to which job? Whether Open MPI reads this
> information would be the next question then.
>
>
>> I don't know all resource managers and schedulers.
>>
>> I use Torque+Maui here.
>> OpenMPI is built with Torque support,
>> and will use the nodes and cpus/cores provided by Torque.
>
> Same question here.
>

Hi Reuti

On Torque the answer is "it depends".
If you configure it with cpuset enabled (which is *not* the default)
then the job can run only on those cpus/cores listed under
/dev/cpuset/bla/bla/job_number/bla/bla.
Otherwise, processes and threads are free to run on any cores inside
the nodes Torque assigned to the job.
Either way, process placement and binding are deferred to MPI.
What I like about this is that they (Torque and OMPI)
coexist without interfering with each other.
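For what it's worth, here is roughly how one can check whether a job
got a cpuset; the job id and the exact mount point below are made up
for illustration and vary with the Torque version and site setup:

```shell
# Hypothetical job id and cpuset path -- adjust for your site.
JOBID="123456.master"
CPUSET="/dev/cpuset/torque/$JOBID"
if [ -d "$CPUSET" ]; then
    # the 'cpus' file lists the cores the job is confined to
    echo "cores granted to $JOBID: $(cat "$CPUSET/cpus")"
else
    echo "no cpuset for $JOBID (Torque probably built without cpuset support)"
fi
```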

My quick reading of some Slurm documents suggested that
it is configured by default with cpuset enabled,
and if I understood right "srun" does core binding by default as well
(which you can override with other types of binding).
However, I don't clearly understand the interplay
between srun and mpirun.
Does srun perhaps replace mpirun,
and take over process placement and binding?
Or do they coexist in harmony?
I am not a Slurm user, though, so what I wrote
above are just wild guesses and may be completely wrong.
In any case, this discouraged me a bit from trying Slurm.
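To make the guess above concrete: my reading of the docs is that
something like the lines below controls srun's binding, but the flag
spellings are from the documentation only (newer releases also accept
--cpu-bind), and ./mpi_app is a placeholder binary -- verify on a real
Slurm cluster:

```shell
if command -v srun >/dev/null 2>&1; then
    HAVE_SRUN=yes
    srun --ntasks=2 --cpu_bind=cores   ./mpi_app   # Slurm binds one task per core
    srun --ntasks=2 --cpu_bind=none    ./mpi_app   # Slurm steps aside, MPI can bind
    srun --ntasks=2 --cpu_bind=verbose ./mpi_app   # report the binding applied
else
    HAVE_SRUN=no
    echo "srun not found; this host is not running under Slurm"
fi
```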

IMHO, the resource manager has no business
enforcing process/thread placement and binding,
and at a minimum it should have an option to
turn that off at the user's request and let MPI and other tools do it.
As you certainly know, besides MPI (OMPI in particular),
OpenMP has its own mechanisms for thread binding as well,
and so do hwloc, taskset, numactl, etc.
I think these are the natural baby-sitters of processes,
threads, cpus, cores, NUMA domains, etc.
The resource manager should keep baby-sitting the jobs and the
coarse-grained resources, as it always did.
Otherwise those children will be spoiled by too much attention.
One tool for each small task, keep it simple:
aren't these the principles that made Unix's success and longevity?
However, this baby-sitting job may be well paid,
hence there are more and more people applying for it.
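Just to illustrate that each of those layers has its own binding knob
(the binary names and the core id below are arbitrary examples, not a
recipe):

```shell
# OpenMP 4.0 thread binding is driven purely by environment variables:
export OMP_PLACES=cores OMP_PROC_BIND=close
echo "OpenMP will bind threads '$OMP_PROC_BIND' over '$OMP_PLACES'"

# OS-level pinning with taskset (util-linux), when it is installed;
# core 0 is an arbitrary choice:
if command -v taskset >/dev/null 2>&1; then
    taskset -c 0 echo "this command ran restricted to core 0"
fi
```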

Gus Correa

>
>> My understanding is that Torque delegates to OpenMPI
>> the process placement and binding
>> (beyond the list of nodes/cpus available for the job).
>>
>> My guess is that OpenPBS behaves the same as Torque.
>>
>> SLURM and SGE/OGE *probably* have pretty much the same behavior.
>
> SGE/OGE: no, any binding request is only a soft request.
> UGE: here you can request a hard binding. But I have no clue whether
> this information is read by Open MPI too.
>
> If in doubt: use only complete nodes for each job
> (which is often done for massively parallel jobs anyway).
>
> -- Reuti
>
>
>> A cursory reading of the SLURM web page suggested to me that it
>> does core binding by default, but don't quote me on that.
>>
>> I don't know what LSF does, but I would guess there is a
>> way to do the appropriate bindings, either at the resource
>> manager level, or at the OpenMPI level (or a combination of both).
>>
>>
>> Certainly I can test this, but concerned there may be a case where inclusion of
>> --bind-to-core would cause an unexpected problem I did not account for.
>>>
>>> --john
>>>
>>
>> Well, testing and failing is part of this game!
>> Would the GE manager buy that? :)
>>
>> I hope this helps,
>> Gus Correa
>>
>>>
>>> -----Original Message-----
>>> From: users [mailto:users-bounces_at_[hidden]] On Behalf Of Gus Correa
>>> Sent: Thursday, March 27, 2014 2:06 PM
>>> To: Open MPI Users
>>> Subject: Re: [OMPI users] Mapping ranks to hosts (from MPI error messages)
>>>
>>> Hi John
>>>
>>> Take a look at the mpiexec/mpirun options:
>>>
>>> -report-bindings (this one should report what you want)
>>>
>>> and maybe also:
>>>
>>> -bycore, -bysocket, -bind-to-core, -bind-to-socket, ...
>>>
>>> and similar, if you want more control on where your MPI processes run.
>>>
>>> "man mpiexec" is your friend!
>>>
>>> I hope this helps,
>>> Gus Correa
>>>
>>> On 03/27/2014 01:53 PM, Sasso, John (GE Power & Water, Non-GE) wrote:
>>>> When a piece of software built against OpenMPI fails, I will see an
>>>> error referring to the rank of the MPI task which incurred the failure.
>>>> For example:
>>>>
>>>> MPI_ABORT was invoked on rank 1236 in communicator MPI_COMM_WORLD
>>>>
>>>> with errorcode 1.
>>>>
>>>> Unfortunately, I do not have access to the software code, just the
>>>> installation directory tree for OpenMPI. My question is: Is there a
>>>> flag that can be passed to mpirun, or an environment variable set,
>>>> which would reveal the mapping of ranks to the hosts they are on?
>>>>
>>>> I do understand that one could have multiple MPI ranks running on the
>>>> same host, but finding a way to determine which rank ran on what host
>>>> would go a long way in help troubleshooting problems which may be
>>>> central to the host. Thanks!
>>>>
>>>> --john
>>>>
>>>>
>>>>
>>>> _______________________________________________
>>>> users mailing list
>>>> users_at_[hidden]
>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>>
>>>
>>> _______________________________________________
>>> users mailing list
>>> users_at_[hidden]
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>
>>
>> _______________________________________________
>> users mailing list
>> users_at_[hidden]
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>