Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] rcu_sched stalls on CPU
From: Simon DeDeo (simon.dedeo_at_[hidden])
Date: 2013-02-27 11:20:57


We've resolved this issue, which appears to have been an early warning of a large-scale hardware failure. Twelve hours later the machine was unable to power on or complete its self-test.

We are now running on a new machine, and the same jobs are finishing normally -- relying solely on blocking communication, with no need to worry about the buffering differences between MPI_Send, MPI_Ssend and MPI_Isend.

Simon

Research Fellow
Santa Fe Institute
http://santafe.edu/~simon

On 25 Feb 2013, at 4:04 PM, Simon DeDeo <simon.dedeo_at_[hidden]> wrote:

> I have been having some trouble tracing the source of a CPU stall with Open MPI on Gentoo.
>
> My code is very simple: each process does a Monte Carlo run, saves some data to disk, and sends back a single MPI_DOUBLE to node zero, which picks the best value from all the computations (including the one it did itself).
>
> For some reason, this can cause CPUs to "stall" (see the dmesg output below) -- this stall actually causes the system to crash and reboot, which seems pretty crazy.
>
> My best guess is that some of the nodes greater than zero have already called MPI_Send, but node zero has not finished its own computation yet, and so has not posted the matching MPI_Recv. They get mad waiting? This happens when I give the Monte Carlo runs large numbers, and so the variance in end time is larger.
>
> However, the behavior seems a bit extreme, and I am wondering if something more subtle is going on. My sysadmin was trying to fix something on the machine the last time it crashed, and it trashed the kernel! So I am also in the sysadmin doghouse.
>
> Any help or advice greatly appreciated! Is it likely to be an MPI_Send/MPI_Recv problem, or is there something else going on?
>
> [ 1273.079260] INFO: rcu_sched detected stalls on CPUs/tasks: { 12 13} (detected by 17, t=60002 jiffies)
> [ 1273.079272] Pid: 2626, comm: cluster Not tainted 3.6.11-gentoo #10
> [ 1273.079275] Call Trace:
> [ 1273.079277] <IRQ> [<ffffffff81099b87>] rcu_check_callbacks+0x5a7/0x600
> [ 1273.079294] [<ffffffff8103fae3>] update_process_times+0x43/0x80
> [ 1273.079298] [<ffffffff8106d796>] tick_sched_timer+0x76/0xc0
> [ 1273.079303] [<ffffffff8105329e>] __run_hrtimer.isra.33+0x4e/0x100
> [ 1273.079306] [<ffffffff81053adb>] hrtimer_interrupt+0xeb/0x220
> [ 1273.079311] [<ffffffff8101fd94>] smp_apic_timer_interrupt+0x64/0xa0
> [ 1273.079316] [<ffffffff81515f07>] apic_timer_interrupt+0x67/0x70
> [ 1273.079317] <EOI>
>
> Simon
>
> Research Fellow
> Santa Fe Institute
> http://santafe.edu/~simon
>
>