Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] OpenMPI at scale on Cray XK7
From: Дербунович Андрей (abderbunovich_at_[hidden])
Date: 2013-04-23 04:21:49


Nathan, could you please advise what the expected startup time is for an Open
MPI job at such scale (128K ranks)? I'm interested in
1) the time from mpirun start to completion of MPI_Init()
2) the time from entry into MPI_Init() to completion of MPI_Init()

From my experience with a 52800-rank job,
1) took around 20 min
2) took around 12 min
which actually looks more like a hang.
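
In case it is useful, below is a minimal sketch of how such intervals can be
instrumented; this is illustrative only, not our actual harness. Interval 2 is
just the difference between the timestamps taken around MPI_Init(), while
interval 1 requires comparing the pre-MPI_Init timestamp against the time
recorded in the batch script when mpirun is invoked (e.g. with "date +%s.%N").

#include <mpi.h>
#include <stdio.h>
#include <time.h>

int main(int argc, char **argv)
{
    struct timespec t0, t1;

    /* Wall-clock time just before MPI_Init(); comparing this absolute
       timestamp against the time mpirun was invoked (recorded in the
       batch script) gives interval 1. */
    clock_gettime(CLOCK_REALTIME, &t0);

    MPI_Init(&argc, &argv);

    /* Wall-clock time right after MPI_Init() returns; the difference
       from t0 is interval 2. */
    clock_gettime(CLOCK_REALTIME, &t1);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {
        double dt = (t1.tv_sec - t0.tv_sec)
                  + (t1.tv_nsec - t0.tv_nsec) / 1e9;
        printf("rank 0: entered MPI_Init at %ld.%09ld, MPI_Init took %.1f s\n",
               (long)t0.tv_sec, (long)t0.tv_nsec, dt);
    }

    MPI_Finalize();
    return 0;
}

clock_gettime() is used rather than MPI_Wtime() only because the first
timestamp has to be taken before MPI is initialized.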

Any advice on how to improve the startup times of large-scale jobs would be
very much appreciated.

Best regards,

-----Original Message-----
From: users-bounces_at_[hidden] [mailto:users-bounces_at_[hidden]] On
Behalf Of Nathan Hjelm
Sent: Tuesday, April 23, 2013 2:47 AM
To: Open MPI Users
Subject: Re: [OMPI users] OpenMPI at scale on Cray XK7

On Mon, Apr 22, 2013 at 03:17:16PM -0700, Mike Clark wrote:
> Hi,
> I am trying to run Open MPI on the Cray XK7 system at Oak Ridge National
> Lab (Titan), and am running into an issue whereby MPI_Init seems to hang
> indefinitely, but this issue only arises at large scale, e.g., when
> running on 18560 compute nodes (with two MPI processes per node). The
> application runs successfully on 4600 nodes, and we are currently trying
> to test a 9000-node job to see whether it fails or runs.
> We are launching our job using something like the following:
>
> # mpirun command
> mpicmd="$OMP_DIR/bin/mpirun --prefix $OMP_DIR -np 37120 --npernode 2 \
>         --bind-to core --bind-to numa $app $args"
>
> # Print and run the command
> echo $mpicmd
> $mpicmd >& $output
>
> Are there any issues that I should be aware of when running Open MPI on
> 37120 processes or when running on the Cray Gemini interconnect?

We have only tested Open MPI up to 131072 ranks on 8192 nodes. Have you
tried running DDT on the process to see where it is hung up?

I have a Titan account so I can help with debugging. I would like to get
this issue fixed in 1.7.2.
