Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Problems on large clusters
From: Gilbert Grosdidier (Gilbert.Grosdidier_at_[hidden])
Date: 2011-06-21 12:04:21


Bonjour Thorsten,

  Could you please be a little more specific about the cluster
itself?

  G.

On 21 June 2011 at 17:46, Thorsten Schuett wrote:

> Hi,
>
> I am running Open MPI 1.5.3 on an IB cluster and I have problems starting
> jobs on larger node counts. With small numbers of tasks, it usually works.
> But now the startup has failed three times in a row using 255 nodes. I am
> using 255 nodes with one MPI task per node, and the mpiexec command looks
> as follows:
>
> mpiexec --mca btl self,openib --mca mpi_leave_pinned 0 ./a.out
>
> After ten minutes, I pulled a stack trace on all nodes and killed the job,
> because there was no progress. Below you will find the stack trace
> generated with gdb's "thread apply all bt". The backtrace looks basically
> the same on all nodes. It seems to hang in MPI_Init.
>
> Any help is appreciated,
>
> Thorsten
>
> Thread 3 (Thread 46914544122176 (LWP 28979)):
> #0 0x00002b6ee912d9a2 in select () from /lib64/libc.so.6
> #1 0x00002b6eeabd928d in service_thread_start (context=<value optimized out>) at btl_openib_fd.c:427
> #2 0x00002b6ee835e143 in start_thread () from /lib64/libpthread.so.0
> #3 0x00002b6ee9133b8d in clone () from /lib64/libc.so.6
> #4 0x0000000000000000 in ?? ()
>
> Thread 2 (Thread 46916594338112 (LWP 28980)):
> #0 0x00002b6ee912b8b6 in poll () from /lib64/libc.so.6
> #1 0x00002b6eeabd7b8a in btl_openib_async_thread (async=<value optimized out>) at btl_openib_async.c:419
> #2 0x00002b6ee835e143 in start_thread () from /lib64/libpthread.so.0
> #3 0x00002b6ee9133b8d in clone () from /lib64/libc.so.6
> #4 0x0000000000000000 in ?? ()
>
> Thread 1 (Thread 47755361533088 (LWP 28978)):
> #0 0x00002b6ee9133fa8 in epoll_wait () from /lib64/libc.so.6
> #1 0x00002b6ee87745db in epoll_dispatch (base=0xb79050, arg=0xb558c0, tv=<value optimized out>) at epoll.c:215
> #2 0x00002b6ee8773309 in opal_event_base_loop (base=0xb79050, flags=<value optimized out>) at event.c:838
> #3 0x00002b6ee875ee92 in opal_progress () at runtime/opal_progress.c:189
> #4 0x0000000039f00001 in ?? ()
> #5 0x00002b6ee87979c9 in std::ios_base::Init::~Init () at ../../.././libstdc++-v3/src/ios_init.cc:123
> #6 0x00007fffc32c8cc8 in ?? ()
> #7 0x00002b6ee9d20955 in orte_grpcomm_bad_get_proc_attr (proc=<value optimized out>, attribute_name=0x2b6ee88e5780 " \020322351n+", val=0x2b6ee875ee92, size=0x7fffc32c8cd0) at grpcomm_bad_module.c:500
> #8 0x00002b6ee86dd511 in ompi_modex_recv_key_value (key=<value optimized out>, source_proc=<value optimized out>, value=0xbb3a00, dtype=14 '\016') at runtime/ompi_module_exchange.c:125
> #9 0x00002b6ee86d7ea1 in ompi_proc_set_arch () at proc/proc.c:154
> #10 0x00002b6ee86db1b0 in ompi_mpi_init (argc=15, argv=0x7fffc32c92f8, requested=<value optimized out>, provided=0x7fffc32c917c) at runtime/ompi_mpi_init.c:699
> #11 0x00007fffc32c8e88 in ?? ()
> #12 0x00002b6ee77f8348 in ?? ()
> #13 0x00007fffc32c8e60 in ?? ()
> #14 0x00007fffc32c8e20 in ?? ()
> #15 0x0000000009efa994 in ?? ()
> #16 0x0000000000000000 in ?? ()

--
*---------------------------------------------------------------------*
   Gilbert Grosdidier                 Gilbert.Grosdidier_at_[hidden]
   LAL / IN2P3 / CNRS                 Phone : +33 1 6446 8909
   Faculté des Sciences, Bat. 200     Fax   : +33 1 6446 8546
   B.P. 34, F-91898 Orsay Cedex (FRANCE)
*---------------------------------------------------------------------*