I have finally solved the issue, or, more accurately, discovered my oversight. And it's a mistake that will have me mad at myself for a while. I'm new to MPI, though, and not at all versed in the MPP communications of LS-DYNA, so it was an oversight easily made.
The key to fixing the entire situation was the test input file I was using. LS-DYNA accepts input files that contain all of the data telling LS-DYNA what to do with the simulation, so I would invoke MPI as such: mpirun -np 16 mppLSDYNA input=myfile.k . That's not related to the issue, but it's important to differentiate between the MPI program input and the LS-DYNA application input.
Anyhow, I made up a simple collision simulation in LS-DYNA to use as a test file (~15 kB) because our typical jobs have very large files (50-150 MB) that have very long run times (often 7+ days). Therefore I chose a simple analysis that could be executed fast so I could see data from all parts of it and how OpenMPI behaved during the entire simulation... and that's where the problem was.
(I have read in various places that MPI_Allreduce is LS-DYNA's heavy hitter in the MPI communications, and that is why I hypothesize the following:) The MPP communications of LS-DYNA do an MPI_Allreduce to coordinate on EVERY, or very nearly every, iteration of the program. My test case ran so fast that it was completing 5000 iterations within a single second on a single core (I found this out very recently, minutes ago in fact, when I was testing mpirun with only two cores locally). And that was where my network tie-ups were happening.
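A rough back-of-the-envelope model of that hypothesis, as a minimal sketch only. The 5000 iterations per second comes from my test above; the 100 microsecond cost per MPI_Allreduce over GigE, the ideal compute scaling, and the 5 iterations per second for a "large" job are all assumptions I made up for illustration, not measurements.

```python
def comm_fraction(single_core_iters_per_sec, ranks, allreduce_latency_s):
    """Fraction of each iteration spent in one collective call.

    Assumes compute time divides perfectly across ranks and that every
    iteration performs exactly one collective of fixed latency.
    """
    compute_s = 1.0 / single_core_iters_per_sec / ranks
    return allreduce_latency_s / (compute_s + allreduce_latency_s)


# ASSUMPTION: per-call MPI_Allreduce cost over GigE; not a measured value.
ASSUMED_LATENCY_S = 100e-6

# Tiny test job: 5000 iterations/sec on one core (from the test run).
tiny = comm_fraction(5000, 16, ASSUMED_LATENCY_S)

# HYPOTHETICAL large job: 5 iterations/sec on one core (made-up figure).
large = comm_fraction(5, 16, ASSUMED_LATENCY_S)

print(f"tiny job, 16 ranks:  {tiny:.1%} of each iteration in the collective")
print(f"large job, 16 ranks: {large:.1%} of each iteration in the collective")
```

Under these assumptions the tiny job is dominated by waiting on the collective, while a job with much more work per iteration barely notices it. That matches what I saw, but the actual latency would have to be measured to say anything firm.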
I started measuring throughputs of the 16 core, 8 core, 4 core, and 2 core jobs over the network and was shocked to see that 16 cores was capping my network out at 120 Mbits/sec. 8 cores was also using 120 Mbits/sec, 4 cores used 75 Mbits/sec and 2 cores used around 30 or 40 Mbits/sec.
Needless to say, it finally clicked in my brain a few minutes ago: the communications were just happening too often, not taking a long time each. So I started up a 16 core job of our standard issue file. I had the right idea initially, because subroutines taking a long time is typically the worrisome issue, but very repetitive, iterative programs need to coordinate on a continuous and rapid basis. The job I started up typically takes 100-120 hours when run on 8 cores (SMP). When I started it as a 16 core OpenMPI job, LS-DYNA gave me an estimate of 43 hours! This earns OpenMPI some great respect; it is quite a powerful program once set up correctly.
As a side note, the throughput of this job was around 17 Mbits/sec.
All in all, easily fixed, just a few days of frustration. Thank you all again for all of your help. It was paramount in enabling me to discover the issue. Thanks again.
--- On Wed, 7/14/10, Eugene Loh <email@example.com> wrote:
From: Eugene Loh <firstname.lastname@example.org>
Subject: [OMPI users] LS-DYNA profiling [was: OpenMPI Hangs, No Error]
To: "Open MPI Users" <email@example.com>
Date: Wednesday, July 14, 2010, 2:26 PM
I started today reading e-mail quickly and out of order. So, I'm going
back to an earlier message now, but still with the new Subject heading,
which better reflects where you are in your progress. I'm extracting
some questions from this thread, from bottom/old to top/new:
1) What tools to use? As others have commented, the issue may have
less to do with the kernel or inefficient data movement and more to do
with what's going on in your application -- that is, some processes are
reaching synchronization points earlier than others. For
application-level tools, there is an OMPI FAQ category at
and specifically a list
of some tools you could "download for free". Anyhow, sounds like
you're getting traction with Sun Studio -- and personally I think
that's a good choice. :^)
2) Spread processes on multiple nodes or colocate on one node? It
depends, but I agree with Ralph that it's not surprising if running on
a single node is faster, presumably due to faster communication
performance. (But, I'd temper his statement that it would *always* be
3) I agree with David and Ralph that you're probably spending more
time waiting on messages than actually moving data. If you use Sun
Studio with Sun ClusterTools (now Oracle Message Passing Toolkit),
it'll also break apart "MPI wait" time from "MPI work" time (time spent
moving data) and you could see this rather clearly. But, any of the
MPI tracing tools will be able to show you other indications of this
problem. Most graphically, you might look at an MPI timeline. E.g.,
you might see one or more functions stuck in a long MPI_Barrier call,
with other processes lingering in computation before the last process
enters the barrier and all processes are released from that barrier.
Or, one function stuck in a long MPI_Recv, released when the message
arrives, as indicated by a message line joining the sending and
4) I could imagine the distributed memory version of LS-DYNA running
faster than the shared-memory version since it has better data locality.
5) The I/O portion might be very, very significant. Maybe processes
are waiting on other processes that are not computing but doing a lot
of I/O. You say runs are faster when all processes are on the same
node versus when they're on different nodes. One experiment would be
to compare all processes running on node A versus running all processes
on node B versus distributing processes among both A and B. This would
be a way of discriminating between "all on one node is faster than
distributed" versus "one node is faster than another". Alternatively,
if you look at the timeline and see one process stuck in a long MPI call,
look at the process it's waiting on. Does the call stack of the
"laggard" suggest what that laggard is doing? Computation or I/O? You
might need to know a little about LS-DYNA to tell.
6) Functions with "_" at the end are probably Fortran names, which
should make sense for LS-DYNA (which I think is basically Fortran).
7) Regarding two sets of functions for "everything." I think this is
no surprise. There are "inclusive" and "exclusive" times. If you look
at "inclusive" times (time spent in a function and all its callees),
the time for a function will be almost exactly equal to the time spent
in a wrapper that calls that function. I think this is what you're
seeing. If you were to look instead at "exclusive" times (time spent
in a function, excluding time spent in its callees), the difference
between a "real" function and a wrapper that calls it would be very
clear. If the numbers are percentages, then it's clear they are
"inclusive" since they add up to well over 100%.
8) Regarding your earlier problems using Studio: no, you shouldn't
need to build OMPI specially for Studio use. (I'm interested in
hearing more about the problems you encountered and how you resolved
them. You can send to me off line since I suspect most of the mail
list would not be interested.)
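To make point 7 concrete, here is a minimal sketch of how inclusive and exclusive times relate for a wrapper and the function it calls. The names my_recv_ and mpi_recv are from your profile; the caller solver_, the call structure, and all of the timings are hypothetical values chosen only for illustration.

```python
# HYPOTHETICAL exclusive times in seconds (time in the function itself,
# not counting anything it calls). These are made-up numbers.
exclusive = {"solver_": 40.0, "my_recv_": 0.1, "mpi_recv": 25.0}

# ASSUMED call structure: solver_ calls the wrapper, the wrapper calls MPI.
callees = {"solver_": ["my_recv_"], "my_recv_": ["mpi_recv"], "mpi_recv": []}


def inclusive(fn):
    """Inclusive time: the function's own time plus that of all its callees."""
    return exclusive[fn] + sum(inclusive(c) for c in callees[fn])


total = inclusive("solver_")
for fn in exclusive:
    incl_pct = 100.0 * inclusive(fn) / total
    excl_pct = 100.0 * exclusive[fn] / total
    print(f"{fn:10s} inclusive {incl_pct:5.1f}%   exclusive {excl_pct:5.1f}%")
```

In this toy profile the wrapper and the MPI call show nearly identical inclusive percentages, and the inclusive column sums to well over 100%, while the exclusive column separates the two clearly.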
I hope at least some of those comments are helpful. Good luck.
Robert Walters wrote:
|I think I forgot to mention earlier that the application
I am using is pre-compiled. It is a finite element software called
LS-DYNA. It is not open source and I likely cannot obtain the code it
uses for MPP. This version I am using was specifically compiled, by the
parent company, for OpenMPI 1.4.2 MPP operations.
I recently installed the Sun Studio 12.1 to attempt to analyze the
situation. It seems to work partially. It will record various processes
individually, which is cryptic. The function it fails on, though, is
the MPI Tracing. It errors that "no MPI tracing data file in
experiment, MPI Timeline and MPI Charts will not be available".
Sometime during the analysis (about 10,000 iterations later), the
VT_MAX_FLUSHES complains that there are too many i/o flushes and it's
not happy. I've increased this number in the environment variable and
killed the analysis before it had a chance to error but still no MPI
Trace data is recorded. Not sure if you guys have heard of that
happening or know any way to fix it...Did OpenMPI need to be
configured/built for Sun Studio use?
I also noticed that from the data I do get back, there are two sets of
functions for everything. There is mpi_recv and then my_recv_, both
with the same % utilization time. The mpi one comes from your program's
library and the my_recv_ one comes from my program. Is that typical or
should the program I'm using be saying mpi_recv only? This data may be
enough to help me see what's wrong so I will pass it along. Keep in
mind this is percent time of total run time and not percent of MPI
communication. I attached the information in a picture rather than me
attempting to format a nice table in this nasty e-mail application.
I blacked out items that are related to LS-DYNA but afterward I just
realized that I think every function with an _ at the end represents a
command issuing from LS-DYNA.
These are my big spenders. The processes I did not include are in the
bottom 4%. The processes that would be above these were the LS-DYNA
applications at 100%. Like I mentioned earlier, there are two instances
of every MPI command, and they carry the same percent usage. It's
curious that this version, built for OpenMPI, uses different functions.
Just for a little more background info, OpenMPI is being launched from
a local hard drive on each machine, but the LS-DYNA job files, and
related data output files, are on a mounted drive on that machine,
where the mounted drive is located on a different machine also in the
cluster. We were thinking that might be an issue but it isn't writing
enough data for me to think that would significantly decrease MPP
I would like to make one last mention. That is that OpenMPI running 8
cores on a single node, with all the communication, works flawlessly.
It works much faster than the Shared Memory Parallel (SMP) version of
LS-DYNA that we currently use, scaled to 8 cores. LS-DYNA seems to
be approximately 25% faster (don't quote me on that) when using the
OpenMPI installation than when using the standard SMP, which is
awesome. My point being that OpenMPI seems to be working fine, even
with the screwy mounted drive. This leads me to continue to point at
Anyhow, let me know if anything seems weird on the OpenMPI
communication subroutines. I don't have any numbers to lean on from
David Zhang <firstname.lastname@example.org>
Subject: Re: [OMPI users] OpenMPI Hangs, No Error
To: "Open MPI Users" <email@example.com>
Date: Tuesday, July 13, 2010, 9:42 PM
Like Ralph says, the slow down may not
be coming from the kernel, but rather on waiting for messages. What
MPI send/recv commands are you using?
On Tue, Jul 13, 2010 at 11:53 AM,
Ralph Castain <firstname.lastname@example.org>
I'm afraid that having 2 cores on a single
machine will always outperform having 1 core on each machine if any
communication is involved.
The most likely thing that is happening is that OMPI
is polling waiting for messages to arrive. You might look closer at
your code to try and optimize it better so that number-crunching can
get more attention.
On Jul 13, 2010, at 12:22 PM, Robert Walters wrote:
|Following up. The sysadmin opened ports for machine to
machine communication and OpenMPI is running successfully with no
errors in connectivity_c, hello_c, or ring_c. Since then, I have started to
implement our MPP software (finite element analysis) that we have, and
upon running a simple, 1 core on machine1, 1 core on machine2, job, I
notice it is considerably slower than a 2 core job on a single machine.
A quick look at top shows me kernel usage is almost twice what cpu
usage is! On a 16 core job, (8 cores per node so 2 nodes total) test,
OpenMPI was consuming ~65% of the cpu for kernel related items rather
than number-crunching related items...Granted, we are running on GigE,
but this is a finite element code we are running with no heavy data
transfer within it. I'm looking into benchmarking tools, but my
sysadmin is not very open to installing third party software. Do you
have any suggestions for what I can use that would be "big name" or
guaranteed safe tools I can use to figure out what's causing the hold
up with all the kernel usage? I'm pretty sure it's network traffic but I
have no way of telling (as far as I know because I'm not a Linux whiz)
with the standard tools in RHEL.