Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] rankfiles on really big nodes broken?
From: Paul Kapinos (kapinos_at_[hidden])
Date: 2012-01-23 09:05:12


Hello Ralph,
Yes, the rankfiles in rankfiles128.tgz are the ones that were used,
and the linuxbsc*.txt files contain the output produced.

It would surprise me if rankfile3 were incorrect: the very same files
(except for the node name, of course), rankfile1 and rankfile2, worked on
smaller machines; cf. runme.sh, the rankfile* files and the output files.

The behaviour "it works on a small box but does not work on the thick box"
was the source of my assumption that there is an error somewhere.

For the complete error message on the thick node, see the linuxbsc269.txt file.
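
For readers without the attachment, the kind of rankfile discussed here can be sketched as follows. This is only an illustration, not the content of rankfiles128.tgz: the host name "bignode" is hypothetical, the figure of 8 cores per socket is simply 128 cores divided by 16 sockets, and the attached files may be laid out differently. The line format "rank N=host slot=socket:core" is the one described in the rankfile section of the mpirun man page.

```python
def make_rankfile(host, sockets, cores_per_socket):
    """Return rankfile text that pins one rank per core, filling sockets in order.

    Each line has the form "rank <N>=<host> slot=<socket>:<core>".
    """
    lines = []
    rank = 0
    for socket in range(sockets):
        for core in range(cores_per_socket):
            lines.append(f"rank {rank}={host} slot={socket}:{core}")
            rank += 1
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # "bignode" is a placeholder host name; 16 x 8 matches the 128-core box.
    print(make_rankfile("bignode", 16, 8), end="")
```

The output could be saved to a file and passed to mpirun with the -rf (--rankfile) option; running the same generator with a smaller socket count gives the equivalent file for the smaller nodes.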

Updating to a newer 1.5.x is a good idea, but it is always a bit
tedious... Will 1.5.5 be released soon?

Best wishes,
Paul Kapinos

Ralph Castain wrote:
> I don't see anything in the code that limits the number of procs in a rankfile.
> Are the attached rankfiles the ones you are trying to use?
> I'm wondering if there is a syntax error that is causing the problem.
> It would help if you could provide the complete error message output.
>
> At one time, there was a limit on the number of procs on a node -
> nothing to do with rankfile. That was fixed, though, and there
> is no real limit any more. I don't recall the precise release number
> where it changed in the 1.5 series - you might try updating
> to 1.5.4 as I'm sure it doesn't exist there.
>
> On Jan 20, 2012, at 12:43 PM, Paul Kapinos wrote:
>
>> Hello, Open MPI developers!
>>
>> Now we have a really nice toy: 2 TB RAM, 16 sockets, 128 cores.
>> (Four smaller Bull S6010 systems coupled by BCS chips into a single-image machine.)
>>
>> On such a big box, process pinning is vital.
>>
>> So we tried to use the Open MPI capabilities to pin the processes. But it seems that the rankfile infrastructure does not work properly: we always get an "Error: Invalid argument" message on the 128-core node, even if the rankfile was OK.
>> On a smaller node (up to 32 cores / 64 threads) the very same rankfile (with the node name changed, of course) works well.
>>
>> I believe this machine size is a bit too big for the pinning infrastructure at the moment. A bug?
>>
>> Best wishes,
>>
>> Paul Kapinos
>>
>> P.S. see the attached .tgz for some logzz
>>
>> ------------------------------------------------------------------------------
>> Rankfiles
>> Rankfiles provide a means for specifying detailed information about how process ranks should be mapped to nodes and how they should be bound. Consider the following:
>> ....
>> ------------------------------------------------------------------------------
>> Open RTE: 1.5.3
>> Open RTE SVN revision: r24532
>> Open RTE release date: Mar 16, 2011
>> OPAL: 1.5.3
>> OPAL SVN revision: r24532
>> OPAL release date: Mar 16, 2011
>> Ident string: 1.5.3
>>
>>
>>
>> --
>> Dipl.-Inform. Paul Kapinos - High Performance Computing,
>> RWTH Aachen University, Center for Computing and Communication
>> Seffenter Weg 23, D 52074 Aachen (Germany)
>> Tel: +49 241/80-24915
>> <rankfiles128.tgz>
>> _______________________________________________
>> users mailing list
>> users_at_[hidden]
>> http://www.open-mpi.org/mailman/listinfo.cgi/users

-- 
Dipl.-Inform. Paul Kapinos   -   High Performance Computing,
RWTH Aachen University, Center for Computing and Communication
Seffenter Weg 23,  D 52074  Aachen (Germany)
Tel: +49 241/80-24915