
Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] rankfiles on really big nodes broken?
From: Ralph Castain (rhc_at_[hidden])
Date: 2012-01-20 15:25:25


I don't see anything in the code that limits the number of procs in a rankfile. Are the attached rankfiles the ones you are trying to use? I'm wondering if there is a syntax error that is causing the problem. It would help if you could provide the complete error message output.
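
In case it helps rule out a syntax problem, here is a minimal sketch of a rankfile sanity check in Python. It assumes lines of the form shown in the mpirun man page examples (rank N=host slot=socket:cores, or slot=cpu-range) and that blank lines and lines starting with # can be skipped. The function name check_rankfile and both patterns are purely illustrative, not anything taken from the Open MPI code, and the real parser may well accept more forms than this does.

```python
import re

# Assumed line shape, modelled on the mpirun man page examples:
#   rank 0=aa slot=1:0-2
#   rank 1=bb slot=0:0,1
#   rank 2=cc slot=1-2
RANK_LINE = re.compile(
    r"^rank\s+(?P<rank>\d+)\s*=\s*(?P<host>\S+)\s+slot\s*=\s*(?P<slot>\S+)$"
)
# Slot spec: optional "socket:" prefix, then comma-separated ids or ranges.
SLOT_SPEC = re.compile(r"^(\d+:)?\d+(-\d+)?(,\d+(-\d+)?)*$")


def check_rankfile(text):
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    seen = set()
    for lineno, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # assumption: blank and comment lines are ignorable
        m = RANK_LINE.match(line)
        if m is None:
            problems.append(f"line {lineno}: not of the form 'rank N=host slot=...'")
            continue
        rank = int(m.group("rank"))
        if rank in seen:
            problems.append(f"line {lineno}: rank {rank} listed twice")
        seen.add(rank)
        slot = m.group("slot")
        if SLOT_SPEC.match(slot) is None:
            problems.append(f"line {lineno}: unrecognised slot spec '{slot}'")
    return problems
```

Running check_rankfile(open(path).read()) on each attached file and printing the returned list would show whether any line falls outside these assumed forms. A clean result only means the file matches this sketch, not that mpirun will accept it.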

At one time, there was a limit on the number of procs on a node - nothing to do with rankfiles. That was fixed, though, and there is no real limit any more. I don't recall the precise release in the 1.5 series where it changed - you might try updating to 1.5.4, as I'm sure the limit doesn't exist there.
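
For retesting on the big node, a rankfile can be generated rather than written by hand. The sketch below is only an illustration under stated assumptions: it takes the 16 sockets and 128 cores from your message to mean 8 cores per socket, numbers them socket:core starting from zero, and uses a made-up host name "bignode". The helper name make_rankfile is hypothetical as well.

```python
def make_rankfile(host, sockets=16, cores_per_socket=8):
    """Build rankfile text that pins one rank per core, filling socket by socket.

    Assumption: sockets and cores are numbered from 0 and addressed as
    slot=<socket>:<core>, as in the mpirun man page examples.
    """
    lines = []
    for rank in range(sockets * cores_per_socket):
        # Rank r lands on socket r // cores_per_socket, core r % cores_per_socket.
        socket, core = divmod(rank, cores_per_socket)
        lines.append(f"rank {rank}={host} slot={socket}:{core}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # "bignode" is a placeholder; substitute the real node name.
    with open("rankfile128", "w") as fh:
        fh.write(make_rankfile("bignode"))
```

The resulting file would then be passed to mpirun with the -rf (--rankfile) option. If the numbering on your machine differs from what is assumed here, the slot values need adjusting accordingly.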

On Jan 20, 2012, at 12:43 PM, Paul Kapinos wrote:

> Hello, Open MPI developer!
>
> Now we have a really nice toy: 2 TB RAM, 16 sockets, 128 cores.
> (4 smaller Bull S6010 machines coupled by BCS chips into a single-image machine)
>
> On such a big box, process pinning is vital.
>
> So we tried to use the Open MPI capabilities to pin the processes. But it seems that the rankfile infrastructure does not work properly: we always get an "Error: Invalid argument" message on the 128-core node, even when the rankfile is OK.
> On a smaller node (up to 32 cores / 64 threads) the very same rankfile (with the node name changed, of course) works well.
>
> I believe a machine of this size is a bit too big for the pinning infrastructure right now. A bug?
>
> Best wishes,
>
> Paul Kapinos
>
> P.S. See the attached .tgz for some logs.
>
> ------------------------------------------------------------------------------
> Rankfiles
> Rankfiles provide a means for specifying detailed information about how process ranks should be mapped to nodes and how they should be bound. Consider the following:
> ....
> ------------------------------------------------------------------------------
> Open RTE: 1.5.3
> Open RTE SVN revision: r24532
> Open RTE release date: Mar 16, 2011
> OPAL: 1.5.3
> OPAL SVN revision: r24532
> OPAL release date: Mar 16, 2011
> Ident string: 1.5.3
>
>
>
> --
> Dipl.-Inform. Paul Kapinos - High Performance Computing,
> RWTH Aachen University, Center for Computing and Communication
> Seffenter Weg 23, D 52074 Aachen (Germany)
> Tel: +49 241/80-24915
> <rankfiles128.tgz>