Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] OpenMPI at scale on Cray XK7
From: Ralph Castain (rhc_at_[hidden])
Date: 2013-04-24 06:53:34

On Apr 23, 2013, at 8:45 PM, Mike Clark <mclark_at_[hidden]> wrote:

> Hi,
> Just to follow up on this. We have managed to get OpenMPI to run at large
> scale; to do so, we had to use aprun instead of OpenMPI's mpirun
> command.

In general, direct launch will be faster than going through mpirun. However, Pasha and I were able to eliminate most of the penalty; you might want to talk to him about how to do it.

> While this has allowed us to now run at the full scale of Titan, we have
> found a huge drop in MPI_Alltoall performance when running at 18K
> processors. E.g., performance per node has decreased by a factor of 200
> versus running at 4.6K nodes. Is there any obvious explanation for this
> that I could have overlooked such as a buffer size or option that needs to
> be set (configure option or environment variable) when running at such
> large scale? We are running inter-communicator one-way sending if this
> makes any difference.

Regardless of how it is launched, each process reads the default MCA parameter file to get any tuning info, so that wouldn't explain the difference. There is a difference in connection time under direct launch, since processes have to create direct interprocess sockets for any connection setup that is required. I don't recall whether the Cray's transport requires such support (IB does). When launched by mpirun, this time is reduced because the messages are routed along pre-existing connections.

However, once that handshake is established, it shouldn't impact performance.
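
To rule out a stray tuning setting, the MCA parameter files mentioned above can be inspected directly. The following is a minimal sketch, assuming the usual locations: the system-wide file under the install prefix (`etc/openmpi-mca-params.conf`) and the per-user file (`$HOME/.openmpi/mca-params.conf`). `OMP_DIR` is the prefix variable from the original mpirun command; the `/opt/openmpi` fallback is only a placeholder.

```shell
# OMP_DIR is the install prefix used in the original job script;
# the fallback path is a placeholder assumption, not a known location.
prefix="${OMP_DIR:-/opt/openmpi}"

# Print the active (non-comment, non-blank) settings of each parameter file.
for f in "$prefix/etc/openmpi-mca-params.conf" "$HOME/.openmpi/mca-params.conf"; do
    if [ -f "$f" ]; then
        echo "== $f"
        grep -v -e '^[[:space:]]*#' -e '^[[:space:]]*$' "$f" || true
    else
        echo "== $f (not present)"
    fi
done
```

The script only reads the files, so it is safe to run on a login node before submitting a job.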

You might check to see that you got the BTLs (byte transfer layer components) you were expecting. It's possible that the info exchange (modex) during MPI_Init isn't communicating cleanly, so a higher-speed transport disqualified itself. The mechanism is very different when launched via mpirun vs direct-launched.
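
One way to check which BTLs were selected is sketched below. It assumes a small test run is acceptable; `app` and `nprocs` are hypothetical placeholders, and the script only prints the aprun command rather than running it. MCA parameters can be passed through `OMPI_MCA_<name>` environment variables, which matters under direct launch because there is no mpirun command line to carry `--mca` options.

```shell
# Hypothetical placeholders: substitute your own binary and process count.
app=./my_app
nprocs=64

# List the BTL components this Open MPI build contains (if ompi_info is on PATH).
if command -v ompi_info >/dev/null 2>&1; then
    ompi_info | grep "MCA btl" || true
fi

# Ask the BTL framework for verbose output, so that component selection
# is logged during MPI_Init.
export OMPI_MCA_btl_base_verbose=100

# Direct-launch command: -n is the total process count, -N the processes per node.
launch_cmd="aprun -n $nprocs -N 2 $app"
echo "$launch_cmd"
```

Comparing the verbose output of a small aprun job with that of the same job under mpirun should show whether a transport is dropping out in one launch mode only.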

> Yours optimistically,
> Mike.
> On 4/22/13 3:17 PM, "Mike Clark" <mclark_at_[hidden]> wrote:
>> Hi,
>> I am trying to run OpenMPI on the Cray XK7 system at Oak Ridge National
>> Lab (Titan), and am running into an issue whereby MPI_Init seems to hang
>> indefinitely, but this issue only arises at large scale, e.g., when
>> running on 18560 compute nodes (with two MPI processes per node). The
>> application runs successfully on 4600 nodes, and we are currently trying
>> to test a 9000 node job to see if this fails or runs.
>> We are launching our job using something like the following
>> # mpirun command
>> mpicmd="$OMP_DIR/bin/mpirun --prefix $OMP_DIR -np 37120 --npernode 2
>> --bind-to core --bind-to numa $app $args"
>> # Print and Run the Command
>> echo $mpicmd
>> $mpicmd >& $output
>> Are there any issues that I should be aware of when running OpenMPI on
>> 37120 processes or when running on the Cray Gemini Interconnect?
>> We are using OpenMPI 1.7.1 (1.7.x is required for Cray Gemini support)
>> and gcc 4.7.2.
>> Thanks,
>> Mike.