Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] OpenMPI at scale on Cray XK7
From: Derbunovich Andrei (abderbunovich_at_[hidden])
Date: 2013-04-24 09:01:43

Thank you, everybody, for the suggestions and comments.

I used a relatively small number of nodes (4400). It looks like the
main issue was that I didn't disable dynamic component opening in my
Open MPI build while keeping the MPI installation directory on a network
file system. Oh my god!

I haven't checked the suggestion about using the debruijn routed component yet.
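
For anyone hitting the same startup problem: with dynamic component opening enabled, every process loads many small component files from the install directory at startup, which is slow when that directory sits on a network file system. A minimal sketch of a build that avoids this, assuming a source build of Open MPI; the install prefix is a placeholder and not a path from this thread:

```shell
# Sketch only: build Open MPI with its components compiled into the
# libraries instead of being opened one by one via dlopen() at startup.
# /path/to/install is a placeholder; substitute your own prefix.
./configure --prefix=/path/to/install --disable-dlopen
make all install
```

Whether this alone brings launch time down to the figures quoted below will depend on the system; it only removes the per-process component loading from the shared file system.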


-----Original Message-----
From: users-bounces_at_[hidden] [mailto:users-bounces_at_[hidden]] On
Behalf Of Ralph Castain
Sent: Tuesday, April 23, 2013 10:07 PM
To: Open MPI Users
Subject: Re: [OMPI users] OpenMPI at scale on Cray XK7

On Apr 23, 2013, at 10:45 AM, Nathan Hjelm <hjelmn_at_[hidden]> wrote:

> On Tue, Apr 23, 2013 at 10:17:46AM -0700, Ralph Castain wrote:
>> On Apr 23, 2013, at 10:09 AM, Nathan Hjelm <hjelmn_at_[hidden]> wrote:
>>> On Tue, Apr 23, 2013 at 12:21:49PM +0400, ???????????????????? ???????????? wrote:
>>>> Hi,
>>>> Nathan, could you please advise what the expected startup time is for an
>>>> Open MPI job at such scale (128K ranks)? I'm interested in
>>>> 1) time from mpirun start to completion of MPI_Init()
>>> It takes less than a minute to run:
>>> mpirun -n 131072 /bin/true
>>>> 2) time from MPI_Init() start to completion of MPI_Init()
>>> A simple MPI application took about 1.25 mins to run. If you
>>> want to see our setup you can take a look at
>>>> From my experience, for a 52800-rank job
>>>> 1) took around 20 min
>>>> 2) took around 12 min
>>>> that actually looks like a hang.
>>> How many nodes? I have never seen launch times that bad on Cielo. You
>>> could try adding -mca routed debruijn -novm and see if that helps. It will
>>> reduce the amount of communication between compute nodes and the login
>> I believe the debruijn module was turned off a while ago due to a bug
>> that wasn't fixed. However, try using
> Was it turned off or was the priority lowered? If it was lowered then
> -mca routed debruijn should work. The -novm is to avoid the bug (as I
> understand it). I am working on fixing the bug now in the hope it will be
> ready for 1.7.2.

Pretty sure it is ompi_ignored and thus, not in the tarball
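
The two timings asked about in this thread (launching `/bin/true` versus a simple MPI program) can be compared with a small wrapper. This is a rough sketch, not something posted by the participants: `time_launch` is a hypothetical helper name, `./mpi_hello` is a hypothetical MPI program, and the resolution is whole seconds.

```shell
#!/bin/sh
# time_launch: run a command, discard its output, and report the
# wall-clock seconds it took together with its exit status.
time_launch() {
    start=$(date +%s)
    "$@" >/dev/null 2>&1
    status=$?
    end=$(date +%s)
    echo "elapsed_s=$((end - start)) exit=$status"
    return "$status"
}

# Example use (assumes mpirun is on PATH inside an active allocation):
#   time_launch mpirun -n 131072 /bin/true      # launch cost only
#   time_launch mpirun -n 131072 ./mpi_hello    # hypothetical MPI program
```

The difference between the two runs gives a rough idea of the time spent in MPI startup and shutdown, as opposed to process launch itself.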

> -Nathan