Whenever this has happened, we have found the code to contain a real
deadlock. Users just never saw it until their messages crossed the
eager->rendezvous threshold.
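To illustrate the failure mode: the classic latent deadlock is two ranks that both call MPI_Send before MPI_Recv. While the message fits under the eager limit, the library buffers the send and the program happens to complete; once the message is large enough to use the rendezvous protocol, both sends block waiting for the matching receive. This is a minimal sketch of that pattern (not code from this thread; COUNT and the buffer sizes are illustrative):

```c
/* Sketch of the unsafe pattern: both ranks MPI_Send before MPI_Recv.
 * With a small COUNT the message goes eager (buffered) and the run
 * completes; raise COUNT past the rendezvous threshold and both ranks
 * block in MPI_Send -- the latent deadlock becomes visible. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define COUNT 512   /* small: likely eager; try something much larger
                       to push the message into rendezvous */

int main(int argc, char **argv)
{
    int rank, size, peer;
    double *sendbuf, *recvbuf;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (size != 2) {
        if (rank == 0) fprintf(stderr, "run with exactly 2 ranks\n");
        MPI_Abort(MPI_COMM_WORLD, 1);
    }
    peer    = 1 - rank;
    sendbuf = malloc(COUNT * sizeof(double));
    recvbuf = malloc(COUNT * sizeof(double));

    /* Unsafe: correctness depends on the MPI library buffering the
     * send.  The MPI standard does not guarantee that. */
    MPI_Send(sendbuf, COUNT, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD);
    MPI_Recv(recvbuf, COUNT, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD,
             MPI_STATUS_IGNORE);

    if (rank == 0) printf("completed (message went eager)\n");
    free(sendbuf);
    free(recvbuf);
    MPI_Finalize();
    return 0;
}
```

The portable fix is MPI_Sendrecv, or posting the receive first with MPI_Irecv before the send; changing the eager limit only moves the threshold.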
Yes, you can disable shared memory with:
mpirun --mca btl ^sm
Or you can try increasing the eager limit. You can check its current value with:
ompi_info --param btl sm
MCA btl: parameter "btl_sm_eager_limit" (current value:
You can modify this limit at run time, I think (I can't test it right
now); it should just be:
mpirun --mca btl_sm_eager_limit 40960
I think that, when tweaking these values, you can also use environment
variables instead of putting everything on the mpirun command line:
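For example (Open MPI reads any MCA parameter from an environment variable named OMPI_MCA_<param>; the values below are just the ones discussed above):

```shell
# Equivalent to passing --mca <param> <value> on the mpirun line:
export OMPI_MCA_btl_sm_eager_limit=40960   # raise the sm eager limit (bytes)
export OMPI_MCA_btl=^sm                    # or: disable the sm btl entirely

# mpirun -np 8192 ./your_app   # launch as usual; mpirun inherits these
```

This is handy when the launch line is buried in a batch script and you only want to change tuning parameters between runs.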
Center for Advanced Computing
On Dec 5, 2008, at 12:22 PM, Justin wrote:
> We are currently using OpenMPI 1.3 on Ranger for large processor
> jobs (8K+). Our code appears to be occasionally deadlocking at
> random within point-to-point communication (see stack trace below).
> This code has been tested against many different MPI versions and, as
> far as we know, it does not contain a deadlock. However, in the past
> we have run into problems with shared memory optimizations within MPI
> causing deadlocks. We can usually avoid these by setting a few
> environment variables to either increase the size of the shared
> memory buffers or disable shared memory optimizations altogether.
> Does OpenMPI have any known deadlocks that might be causing ours?
> If so, are there any workarounds? Also, how do we disable shared
> memory within OpenMPI?
> Here is an example of where processors are hanging:
> #0 0x00002b2df3522683 in mca_btl_sm_component_progress () from /
> #1 0x00002b2df2cb46bf in mca_bml_r2_progress () from /opt/apps/
> #2 0x00002b2df0032ea4 in opal_progress () from /opt/apps/intel10_1/
> #3 0x00002b2ded0d7622 in ompi_request_default_wait_some () from /
> #4 0x00002b2ded109e34 in PMPI_Waitsome () from /opt/apps/intel10_1/