Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Problems on large clusters
From: Addepalli, Srirangam V (srirangam.v.addepalli_at_[hidden])
Date: 2011-06-21 12:01:06

Hello Thorsten,
What type of IB interface do you have (QLogic)? I often run into similar issues when running 256-core jobs. It usually happens when I hit a node with IB problems, nothing related to Open MPI. If you are using QLogic PSM, try running a ping-pong example to check that all nodes are reachable; a minimal sketch follows below.
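If it helps, this is roughly what such a ping-pong check can look like (a minimal sketch of my own, not the example shipped with PSM): rank 0 exchanges a token with every other rank in turn, so a node with a bad IB link shows up as the first peer that never answers.

/* pingpong.c - build with mpicc, launch with one task per node */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, peer, token = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {
        for (peer = 1; peer < size; peer++) {
            /* ping: send a token to the peer; pong: wait for it back */
            MPI_Send(&token, 1, MPI_INT, peer, 0, MPI_COMM_WORLD);
            MPI_Recv(&token, 1, MPI_INT, peer, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("rank 0 <-> rank %d ok\n", peer);
            fflush(stdout);
        }
    } else {
        MPI_Recv(&token, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        MPI_Send(&token, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}

The last "ok" line printed before a hang tells you which node to look at.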

From: users-bounces_at_[hidden] [users-bounces_at_[hidden]] On Behalf Of Thorsten Schuett [schuett_at_[hidden]]
Sent: Tuesday, June 21, 2011 10:46 AM
To: users_at_[hidden]
Subject: [OMPI users] Problems on large clusters


I am running Open MPI 1.5.3 on an IB cluster and have problems starting jobs
at larger node counts. With small numbers of tasks it usually works, but now
the startup has failed three times in a row using 255 nodes. I am using 255
nodes with one MPI task per node, and the mpiexec command looks as follows:

mpiexec --mca btl self,openib --mca mpi_leave_pinned 0 ./a.out

After ten minutes there was still no progress, so I pulled a stack trace on
all nodes and killed the job. Below you will find the stack trace generated
with gdb's "thread apply all bt". The backtrace looks basically the same on
all nodes; it seems to hang in MPI_Init.
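For reference, on each node the trace was collected roughly like this (the
pgrep shorthand to find the PID of a.out is just illustrative; adjust to your
binary name):

gdb -p $(pgrep -u $USER a.out) -batch -ex "thread apply all bt"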

Any help is appreciated,


Thread 3 (Thread 46914544122176 (LWP 28979)):
#0 0x00002b6ee912d9a2 in select () from /lib64/
#1 0x00002b6eeabd928d in service_thread_start (context=<value optimized out>)
at btl_openib_fd.c:427
#2 0x00002b6ee835e143 in start_thread () from /lib64/
#3 0x00002b6ee9133b8d in clone () from /lib64/
#4 0x0000000000000000 in ?? ()

Thread 2 (Thread 46916594338112 (LWP 28980)):
#0 0x00002b6ee912b8b6 in poll () from /lib64/
#1 0x00002b6eeabd7b8a in btl_openib_async_thread (async=<value optimized
out>) at btl_openib_async.c:419
#2 0x00002b6ee835e143 in start_thread () from /lib64/
#3 0x00002b6ee9133b8d in clone () from /lib64/
#4 0x0000000000000000 in ?? ()

Thread 1 (Thread 47755361533088 (LWP 28978)):
#0 0x00002b6ee9133fa8 in epoll_wait () from /lib64/
#1 0x00002b6ee87745db in epoll_dispatch (base=0xb79050, arg=0xb558c0,
tv=<value optimized out>) at epoll.c:215
#2 0x00002b6ee8773309 in opal_event_base_loop (base=0xb79050, flags=<value
optimized out>) at event.c:838
#3 0x00002b6ee875ee92 in opal_progress () at runtime/opal_progress.c:189
#4 0x0000000039f00001 in ?? ()
#5 0x00002b6ee87979c9 in std::ios_base::Init::~Init () at
#6 0x00007fffc32c8cc8 in ?? ()
#7 0x00002b6ee9d20955 in orte_grpcomm_bad_get_proc_attr (proc=<value
optimized out>, attribute_name=0x2b6ee88e5780 " \020322351n+",
val=0x2b6ee875ee92, size=0x7fffc32c8cd0) at grpcomm_bad_module.c:500
#8 0x00002b6ee86dd511 in ompi_modex_recv_key_value (key=<value optimized
out>, source_proc=<value optimized out>, value=0xbb3a00, dtype=14 '\016') at
#9 0x00002b6ee86d7ea1 in ompi_proc_set_arch () at proc/proc.c:154
#10 0x00002b6ee86db1b0 in ompi_mpi_init (argc=15, argv=0x7fffc32c92f8,
requested=<value optimized out>, provided=0x7fffc32c917c) at
#11 0x00007fffc32c8e88 in ?? ()
#12 0x00002b6ee77f8348 in ?? ()
#13 0x00007fffc32c8e60 in ?? ()
#14 0x00007fffc32c8e20 in ?? ()
#15 0x0000000009efa994 in ?? ()
#16 0x0000000000000000 in ?? ()