
Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] Mapping ranks to hosts (from MPI error messages)
From: Reuti (reuti_at_[hidden])
Date: 2014-03-27 16:10:17


Hi,

Am 27.03.2014 um 20:15 schrieb Gus Correa:

> <snip>
>> Awesome, but now here is my concern.
>> If we have OpenMPI-based applications launched as batch jobs
>> via a batch scheduler like SLURM, PBS, LSF, etc.
>> (which decides the placement of the app and dispatches it to the compute hosts),
>> then will including "--report-bindings --bind-to-core" cause problems?

Do all of them keep an internal bookkeeping of the cores granted to each slot - i.e. not only the number of scheduled slots per job per node, but also which core was granted to which job? The next question would then be whether Open MPI reads this information.

> I don't know all resource managers and schedulers.
>
> I use Torque+Maui here.
> OpenMPI is built with Torque support, and will use the nodes and cpus/cores provided by Torque.

Same question here.

> My understanding is that Torque delegates to OpenMPI the process placement and binding (beyond the list of nodes/cpus available for
> the job).
>
> My guess is that OpenPBS behaves the same as Torque.
>
> SLURM and SGE/OGE *probably* have pretty much the same behavior.

SGE/OGE: no, any binding request is only a soft request.
UGE: here you can request a hard binding. But I have no clue whether this information is read by Open MPI too.

If in doubt: use only complete nodes for each job (which is often done for massively parallel jobs anyway).
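If in doubt about what actually happened on a node, the placement and binding can also be checked empirically: Open MPI's mpirun exports each process's rank in the OMPI_COMM_WORLD_RANK environment variable, and on Linux os.sched_getaffinity reports the cores a process is allowed to run on. A minimal probe along these lines (a sketch in Python; the script name is arbitrary):

```python
# bindprobe.py - print this process's rank, host, and CPU binding.
# Open MPI's mpirun exports OMPI_COMM_WORLD_RANK to every process it
# launches, so no MPI calls are needed here; os.sched_getaffinity
# (Linux) reports the cores the kernel may schedule this process on.
import os
import socket

def probe() -> str:
    rank = os.environ.get("OMPI_COMM_WORLD_RANK", "?")
    cpus = sorted(os.sched_getaffinity(0))
    return "rank %s on %s bound to cpus %s" % (rank, socket.gethostname(), cpus)

if __name__ == "__main__":
    print(probe())
```

Launched under the batch system, e.g. `mpirun -bind-to-core -np 4 python bindprobe.py`, one line per rank shows on which host each rank landed and which cores it was actually pinned to.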

-- Reuti

> A cursory reading of the SLURM web page suggested to me that it
> does core binding by default, but don't quote me on that.
>
> I don't know what LSF does, but I would guess there is a
> way to do the appropriate bindings, either at the resource manager level, or at the OpenMPI level (or a combination of both).
>
>
>> Certainly I can test this, but I am concerned there may be a case where inclusion of
>> --bind-to-core would cause an unexpected problem I did not account for.
>>
>> --john
>>
>
> Well, testing and failing is part of this game!
> Would the GE manager buy that? :)
>
> I hope this helps,
> Gus Correa
>
>>
>> -----Original Message-----
>> From: users [mailto:users-bounces_at_[hidden]] On Behalf Of Gus Correa
>> Sent: Thursday, March 27, 2014 2:06 PM
>> To: Open MPI Users
>> Subject: Re: [OMPI users] Mapping ranks to hosts (from MPI error messages)
>>
>> Hi John
>>
>> Take a look at the mpiexec/mpirun options:
>>
>> -report-bindings (this one should report what you want)
>>
>> and maybe also:
>>
>> -bycore, -bysocket, -bind-to-core, -bind-to-socket, ...
>>
>> and similar, if you want more control on where your MPI processes run.
>>
>> "man mpiexec" is your friend!
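For intuition about what the mapping options above do, here is a conceptual model (not Open MPI's actual implementation) of the two orderings: -bycore places consecutive ranks on consecutive cores, filling one socket before moving to the next, while -bysocket round-robins ranks across sockets. The sizes below are illustrative:

```python
def map_ranks(nprocs, n_sockets, cores_per_socket, policy):
    """Return one (socket, core) pair per rank, in rank order."""
    if policy == "bycore":
        # Fill cores 0, 1, ... of socket 0, then move to socket 1, ...
        return [(r // cores_per_socket, r % cores_per_socket) for r in range(nprocs)]
    if policy == "bysocket":
        # Alternate sockets: rank r goes to socket r mod n_sockets.
        return [(r % n_sockets, r // n_sockets) for r in range(nprocs)]
    raise ValueError("unknown policy: %s" % policy)

# 4 ranks on 2 sockets x 2 cores each:
# bycore   -> [(0, 0), (0, 1), (1, 0), (1, 1)]
# bysocket -> [(0, 0), (1, 0), (0, 1), (1, 1)]
```

The difference matters for memory-bandwidth-bound codes: -bysocket spreads ranks over memory controllers, while -bycore keeps neighboring ranks cache-close.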
>>
>> I hope this helps,
>> Gus Correa
>>
>> On 03/27/2014 01:53 PM, Sasso, John (GE Power & Water, Non-GE) wrote:
>>> When a piece of software built against OpenMPI fails, I will see an
>>> error referring to the rank of the MPI task which incurred the failure.
>>> For example:
>>>
>>> MPI_ABORT was invoked on rank 1236 in communicator MPI_COMM_WORLD
>>>
>>> with errorcode 1.
>>>
>>> Unfortunately, I do not have access to the software code, just the
>>> installation directory tree for OpenMPI. My question is: Is there a
>>> flag that can be passed to mpirun, or an environment variable set,
>>> which would reveal the mapping of ranks to the hosts they are on?
>>>
>>> I do understand that one could have multiple MPI ranks running on the
>>> same host, but finding a way to determine which rank ran on what host
>>> would go a long way in helping troubleshoot problems which may be
>>> specific to the host. Thanks!
>>>
>>> --john
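Since multiple ranks can share a host, inverting the per-rank report into a host-to-ranks table makes host-specific failures easier to spot. A small sketch - the "rank host" line format here is made up for illustration (the actual -report-bindings output differs and varies between Open MPI versions):

```python
from collections import defaultdict

def ranks_by_host(lines):
    """Group 'rank host' report lines (hypothetical format) by host."""
    table = defaultdict(list)
    for line in lines:
        rank, host = line.split()
        table[host].append(int(rank))
    # Sort each host's rank list for stable, readable output.
    return {h: sorted(r) for h, r in table.items()}

report = ["0 node01", "2 node02", "1 node01", "3 node02"]
print(ranks_by_host(report))  # {'node01': [0, 1], 'node02': [2, 3]}
```

With such a table in hand, an aborting rank from an MPI_ABORT message can be traced straight to the node it ran on.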
>>>
>>>
>>>
>>> _______________________________________________
>>> users mailing list
>>> users_at_[hidden]
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>
>>
>>
>