On Feb 4, 2013, at 10:55 AM, Bharath Ramesh <bramesh_at_[hidden]> wrote:
> I am trying to debug a really weird issue. I have a simple
> MPI hello world application (attached) that hangs when I run
> it on our cluster using 256 nodes with 16 cores each. The
> cluster uses QDR IB.
> I am able to run the test over ethernet by excluding openib
> from the btl. However, what is weird is that xhpl completes
> without any error on the same set of 256 nodes with 16 cores
> each. I have also tried the Pallas MPI Benchmark; it behaves
> like hello world and ends up hanging when I run it on 256
> nodes.
Sorry for the delay; I was on travel all last week and fell behind.
I'm not sure I can parse your scenario description. Are you saying:
- hello world over IB hangs at 256*16 procs
- hello world over TCP works at 256*16 procs
- xhpl over TCP works at 256*16 procs
- IMB over ?TCP|IB? hangs at 256*16 procs
> When I attach gdb to the MPI processes and look at the
> backtraces, I see that close to ~1000 of the MPI processes
> are stuck in MPI_Send while the others are waiting in
> MPI_Finalize. I have checked that the ulimit setting for
> locked memory is unlimited and that the number of open files
> per process is 131072. The default MPI stack provided on the
> system is openmpi-1.6.1. I compiled openmpi-1.6.3 in my home
> directory and the behavior remains the same.
> I would appreciate any help in debugging this issue.
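As a side note, the per-rank backtraces described above can be collected non-interactively on each node, which makes it easier to survey all ~4000 processes. A sketch (the binary name "hello_world" is an assumption; substitute your executable's name):

```shell
# Sketch: dump a backtrace of every thread of each matching MPI process
# on this node into a per-PID file, without an interactive gdb session.
# "hello_world" is an assumed process name -- replace it with yours.
for pid in $(pgrep hello_world); do
    gdb -batch -p "$pid" -ex "thread apply all bt" > "bt.$pid" 2>&1
done
```

Grepping the resulting bt.* files for MPI_Send vs. MPI_Finalize then gives a quick count of how many ranks are stuck where.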
Can you try the 1.6.4rc? http://www.open-mpi.org/software/ompi/v1.6/