
Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] Deadlock on large numbers of processors
From: Brock Palen (brockp_at_[hidden])
Date: 2008-12-05 13:36:59


Whenever this happens we have found the code to have a deadlock. Users
never saw it until they crossed the eager->rendezvous threshold.
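
Just to illustrate the pattern (a toy sketch, not taken from your code):
two ranks that both call MPI_Send before posting a receive only complete
because small messages are buffered eagerly; once the message is larger
than the eager limit the sends switch to rendezvous and both ranks block
forever.

/* toy_deadlock.c -- hypothetical illustration, not the poster's code.
 * Each rank sends to its partner before posting a receive. Below the
 * eager limit the sends are buffered and return; above it they wait
 * for a matching receive that is never posted, so both ranks hang. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, size;
    int count = (argc > 1) ? atoi(argv[1]) : 8192;  /* message size in bytes */
    char *sendbuf, *recvbuf;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    sendbuf = malloc(count);
    recvbuf = malloc(count);

    int partner = rank ^ 1;          /* pair up ranks 0-1, 2-3, ... */
    if (partner < size) {
        /* Unsafe ordering: both sides send first, then receive. */
        MPI_Send(sendbuf, count, MPI_CHAR, partner, 0, MPI_COMM_WORLD);
        MPI_Recv(recvbuf, count, MPI_CHAR, partner, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
    }

    if (rank == 0)
        printf("completed with %d-byte messages\n", count);

    free(sendbuf);
    free(recvbuf);
    MPI_Finalize();
    return 0;
}

Run it with a small message size and it finishes; push the size past the
eager limit and both ranks sit in MPI_Send, which is usually the sign the
application is relying on eager buffering.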

Yes, you can disable shared memory with:

mpirun --mca btl ^sm

Or you can try increasing the eager limit.

ompi_info --param btl sm

MCA btl: parameter "btl_sm_eager_limit" (current value: "4096")

You can modify this limit at run time; I think (I can't test it right
now) it is just:

mpirun --mca btl_sm_eager_limit 40960

I think that when tweaking these values you can also use environment
variables in place of putting it all on the mpirun line:

export OMPI_MCA_btl_sm_eager_limit=40960

See:
http://www.open-mpi.org/faq/?category=tuning

Brock Palen
www.umich.edu/~brockp
Center for Advanced Computing
brockp_at_[hidden]
(734)936-1985

On Dec 5, 2008, at 12:22 PM, Justin wrote:

> Hi,
>
> We are currently using OpenMPI 1.3 on Ranger for large processor
> jobs (8K+). Our code appears to be occasionally deadlocking at
> random within point-to-point communication (see stack trace below).
> This code has been tested on many different MPI versions and as far
> as we know it does not contain a deadlock. However, in the past we
> have run into problems with shared memory optimizations within MPI
> causing deadlocks. We can usually avoid these by setting a few
> environment variables to either increase the size of shared memory
> buffers or disable shared memory optimizations altogether. Does
> OpenMPI have any known deadlocks that might be causing our
> deadlocks? If so, are there any workarounds? Also, how do we
> disable shared memory within OpenMPI?
>
> Here is an example of where processors are hanging:
>
> #0 0x00002b2df3522683 in mca_btl_sm_component_progress () from /opt/apps/intel10_1/openmpi/1.3/lib/openmpi/mca_btl_sm.so
> #1 0x00002b2df2cb46bf in mca_bml_r2_progress () from /opt/apps/intel10_1/openmpi/1.3/lib/openmpi/mca_bml_r2.so
> #2 0x00002b2df0032ea4 in opal_progress () from /opt/apps/intel10_1/openmpi/1.3/lib/libopen-pal.so.0
> #3 0x00002b2ded0d7622 in ompi_request_default_wait_some () from /opt/apps/intel10_1/openmpi/1.3//lib/libmpi.so.0
> #4 0x00002b2ded109e34 in PMPI_Waitsome () from /opt/apps/intel10_1/openmpi/1.3//lib/libmpi.so.0
>
>
> Thanks,
> Justin
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
>