Jeff & George,
> Hum; interesting. I can't think of any reason why that would be a
> problem offhand. The mca_btl_sm_component_progress() function is the
> shared memory
> opal_progress() and mca_bml_r2_progress() are likely mainly
> dispatching off to this
> Does OSS interfere with shared memory between processes in any
> way? (I'm not enough of a kernel guy to know what the ramifications
> of ptrace and
Open|SS shouldn't interfere with shared memory. We use the pthread
library to access some TLS, but no shared memory...
> There might be one reason to slowdown the application quite a bit.
> If the fact that you're using timer interact with the libevent (the
> library we're using to internally manage any kind of events), then
> we might end-up in the situation where we call the poll for every
> iteration in the event library. And this is really expensive.
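To make the concern in the quoted paragraph concrete, here is a minimal Python sketch of a progress loop that either makes a poll system call on every iteration or only on every Nth one. It is not Open MPI's or libevent's actual logic; `progress_loop` and `poll_every` are hypothetical names used only for illustration, and a POSIX system is assumed.

```python
import os
import select


def progress_loop(iterations, poll_every):
    """Run a toy progress loop and return how many poll() calls it made.

    poll_every=1 models the suspected bad case (a system call on every
    iteration); larger values model polling only occasionally.
    """
    read_fd, write_fd = os.pipe()          # an idle descriptor to poll on
    poller = select.poll()
    poller.register(read_fd, select.POLLIN)
    poll_calls = 0
    try:
        for i in range(iterations):
            if i % poll_every == 0:
                poller.poll(0)             # non-blocking poll, still a syscall
                poll_calls += 1
            # ... cheap user-space progress work would go here ...
    finally:
        os.close(read_fd)
        os.close(write_fd)
    return poll_calls


if __name__ == "__main__":
    print("poll every iteration:", progress_loop(100000, 1))
    print("poll every 100th iteration:", progress_loop(100000, 100))
```

Timing the two calls (for example with `time.perf_counter`) on the machine in question gives a rough feel for the per-iteration cost of the extra system call.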
I did contemplate the notion that maybe we were getting into the
"progress monitoring" part of OpenMPI every time the timer interrupted
the process (1000s of times per second). Can either of you see any
mechanism by which that might happen?
> A quick way to figure out if this is that case is to run Open MPI
> without support for shared memory (--mca btl ^sm). This way we will
> call poll on a regular basis anyway, and if there is no difference
> between a normal run and a OSS one, we know at least where to start
> looking ...
I ran SMG2000 on an 8-CPU Yellowrail node in the two configurations
and recorded the wall/cpu clock times as reported by SMG2000 itself:
"mpirun -np 8 smg2000 -n 32 64 64"
Struct Interface, wall clock time = 0.042348 seconds
Struct Interface, cpu clock time = 0.040000 seconds
SMG Setup, wall clock time = 0.732441 seconds
SMG Setup, cpu clock time = 0.730000 seconds
SMG Solve, wall clock time = 6.881814 seconds
SMG Solve, cpu clock time = 6.880000 seconds
"mpirun --mca btl ^sm -np 8 smg2000 -n 64 64 64"
Struct Interface, wall clock time = 0.059137 seconds
Struct Interface, cpu clock time = 0.060000 seconds
SMG Setup, wall clock time = 0.931437 seconds
SMG Setup, cpu clock time = 0.930000 seconds
SMG Solve, wall clock time = 9.107343 seconds
SMG Solve, cpu clock time = 9.110000 seconds
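For a quick comparison of the two runs, the wall clock lines above can be turned into per-phase ratios. The sketch below is plain Python; the helper names (`wall_times`, `ratios`) are made up for illustration. Note that the two command lines as quoted differ in their first "-n" value (32 versus 64), so the ratios may reflect more than just the transport change.

```python
import re

# Wall clock lines copied from the two SMG2000 runs quoted above.
SM_ENABLED = """\
Struct Interface, wall clock time = 0.042348 seconds
SMG Setup, wall clock time = 0.732441 seconds
SMG Solve, wall clock time = 6.881814 seconds
"""

SM_DISABLED = """\
Struct Interface, wall clock time = 0.059137 seconds
SMG Setup, wall clock time = 0.931437 seconds
SMG Solve, wall clock time = 9.107343 seconds
"""

LINE = re.compile(r"^(.*?), wall clock time\s*=\s*([0-9.]+) seconds$", re.M)


def wall_times(report):
    """Map each phase name to its wall clock time in seconds."""
    return {phase: float(seconds) for phase, seconds in LINE.findall(report)}


def ratios(base, other):
    """Return other/base for every phase present in base."""
    return {phase: other[phase] / base[phase] for phase in base}


if __name__ == "__main__":
    slowdown = ratios(wall_times(SM_ENABLED), wall_times(SM_DISABLED))
    for phase, factor in slowdown.items():
        print(f"{phase}: {factor:.2f}x")
```

On these numbers the run without the sm BTL is roughly 1.3x to 1.4x slower in each phase, which is small next to the slowdown seen inside Open|SS.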
But running the application with the "--mca btl ^sm" option inside
Open|SS also results in an extreme slowdown, i.e. it makes no
difference whether the shared memory transport is enabled or not.
Open|SS reports time spent as follows (in case this helps pinpoint
what is going on inside OpenMPI):
time in seconds. Function (defining location)
364.050000 btl_openib_component_progress (libmpi.so.0)
165.890000 mthca_poll_cq (libmthca-rdmav2.so)
122.090000 pthread_spin_lock (libpthread.so.0)
90.790000 opal_progress (libopen-pal.so.0)
48.230000 mca_bml_r2_progress (libmpi.so.0)
30.880000 ompi_request_wait_all (libmpi.so.0)
9.780000 pthread_spin_unlock (libpthread.so.0)
4.910000 mthca_free_srq_wqe (libmthca-rdmav2.so)
4.910000 mthca_unlock_cqs (libmthca-rdmav2.so)
4.730000 mthca_lock_cqs (libmthca-rdmav2.so)
0.890000 __poll (libc.so.6)
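To see where the time goes in relative terms, the table above can be summed per library. This is only a sketch over the numbers as pasted; `parse_profile` and `share_by_library` are hypothetical helper names, and the first function name is written out as btl_openib_component_progress on the assumption that this is the intended spelling.

```python
# Rows copied from the Open|SS report above: seconds, function, (library).
PROFILE = """\
364.050000 btl_openib_component_progress (libmpi.so.0)
165.890000 mthca_poll_cq (libmthca-rdmav2.so)
122.090000 pthread_spin_lock (libpthread.so.0)
90.790000 opal_progress (libopen-pal.so.0)
48.230000 mca_bml_r2_progress (libmpi.so.0)
30.880000 ompi_request_wait_all (libmpi.so.0)
9.780000 pthread_spin_unlock (libpthread.so.0)
4.910000 mthca_free_srq_wqe (libmthca-rdmav2.so)
4.910000 mthca_unlock_cqs (libmthca-rdmav2.so)
4.730000 mthca_lock_cqs (libmthca-rdmav2.so)
0.890000 __poll (libc.so.6)
"""


def parse_profile(text):
    """Return a list of (seconds, function, library) tuples."""
    rows = []
    for line in text.splitlines():
        seconds, function, library = line.split()
        rows.append((float(seconds), function, library.strip("()")))
    return rows


def share_by_library(rows):
    """Return each library's fraction of the total reported time."""
    total = sum(seconds for seconds, _, _ in rows)
    shares = {}
    for seconds, _, library in rows:
        shares[library] = shares.get(library, 0.0) + seconds / total
    return shares


if __name__ == "__main__":
    rows = parse_profile(PROFILE)
    ranked = sorted(share_by_library(rows).items(), key=lambda kv: -kv[1])
    for library, share in ranked:
        print(f"{library:22s} {share:6.1%}")
```

On these figures roughly half of the reported time falls in libmpi.so.0 and about a fifth in libmthca-rdmav2.so, while __poll in libc.so.6 is about a tenth of a percent.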
Does this help at all?
-- Bill Hachfeld, The Open|SpeedShop Project