
Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] debugs for jobs not starting
From: Michael Di Domenico (mdidomenico4_at_[hidden])
Date: 2012-10-12 10:03:02


i turned on the daemon debugs for orted and noticed this difference:
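(For context, a sketch of one way to enable daemon debug output like the logs below; `orte_debug_daemons` is an MCA parameter in Open MPI 1.6, and the `OMPI_MCA_*` environment spelling is the usual way to pass MCA settings through srun:)

```shell
# Sketch: enable orted daemon debugging before launching through slurm.
# orte_debug_daemons is the parameter behind lines like
# "orted_cmd: received add_local_procs" in the output below.
export OMPI_MCA_orte_debug_daemons=1
export OMPI_MCA_orte_debug=1
srun -n 1024 --ntasks-per-node 12 xhpl
```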

---- i get this on all the good nodes (ones that actually started xhpl)

Daemon was launched on node08 - beginning to initialize
[node08:21230] [[64354,0],1] orted_cmd: received add_local_procs
[node08:21230] [[64354,0],0] orted_recv: received sync+nidmap from
local proc [[64354,1],84]
[node08:21230] [[64354,0],0] orted_recv: received sync+nidmap from
local proc [[64354,1],85]
[node08:21230] [[64354,0],0] orted_recv: received sync+nidmap from
local proc [[64354,1],86]
[node08:21230] [[64354,0],0] orted_recv: received sync+nidmap from
local proc [[64354,1],87]
[node08:21230] [[64354,0],0] orted_recv: received sync+nidmap from
local proc [[64354,1],88]
[node08:21230] [[64354,0],0] orted_recv: received sync+nidmap from
local proc [[64354,1],89]
[node08:21230] [[64354,0],0] orted_recv: received sync+nidmap from
local proc [[64354,1],90]
[node08:21230] [[64354,0],0] orted_recv: received sync+nidmap from
local proc [[64354,1],91]
[node08:21230] [[64354,0],0] orted_recv: received sync+nidmap from
local proc [[64354,1],92]
[node08:21230] [[64354,0],0] orted_recv: received sync+nidmap from
local proc [[64354,1],93]
[node08:21230] [[64354,0],0] orted_recv: received sync+nidmap from
local proc [[64354,1],94]
[node08:21230] [[64354,0],0] orted_recv: received sync+nidmap from
local proc [[64354,1],95]
[node08:21230] [[64354,0],1] orted: up and running - waiting for commands!
[node08:21230] procdir: /tmp/openmpi-sessions-user_at_node08_0/28/1/1
[node08:21230] jobdir: /tmp/openmpi-sessions-user_at_node08_/44228/1
[node08:21230] top: openmpi-sessions-user_at_node08_0
[node08:21230] tmp: /tmp
[...repeats the above five lines a bunch of times...]

---- i get this on the nodes that do not start xhpl

Daemon was launched on node06 - beginning to initialize
[node06:11230] [[46344,0],1] orted: up and running - waiting for commands!
[node06:11230] procdir: /tmp/openmpi-sessions-user_at_node06_0/28/1/1
[node06:11230] jobdir: /tmp/openmpi-sessions-user_at_node06_/44228/1
[node06:11230] top: openmpi-sessions-user_at_node06_0
[node06:11230] tmp: /tmp
[...above lines only come out once...]

On Fri, Oct 12, 2012 at 9:27 AM, Michael Di Domenico
<mdidomenico4_at_[hidden]> wrote:
> what isn't working is when i fire off an MPI job with over 800 ranks,
> they don't all actually start up a process
>
> e.g., if i do srun -n 1024 --ntasks-per-node 12 xhpl
>
> and then do a 'pgrep xhpl | wc -l', on all of the allocated nodes, not
> all of them have actually started xhpl
>
> most will read 12 started processes, but an inconsistent list of nodes
> will fail to actually start xhpl and stall the whole job
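(That per-node check can be scripted; a sketch, assuming a slurm allocation and ssh access to the compute nodes — the node names and counts in the sample data are illustrative, not from a real run:)

```shell
# Collect one "host count" line per allocated node (assumes passwordless ssh):
#
#   for host in $(scontrol show hostnames "$SLURM_JOB_NODELIST"); do
#       echo "$host $(ssh "$host" pgrep -c xhpl)"
#   done > counts.txt
#
# Then flag nodes that did not start all 12 ranks. Sample data stands in
# for a real collection here:
counts='node05 12
node06 0
node07 12'
echo "$counts" | awk '$2 != 12 { print $1 }'   # prints only the stalled nodes
```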
>
> if i look at all the nodes allocated to my job, it does start the orte
> process though
>
> what i need to figure out, is why the orte process starts, but fails
> to actually start xhpl on some of the nodes
>
> unfortunately, the list of nodes that don't start xhpl during my runs
> changes each time and no hardware errors are being detected. if i
> cancel the job and restart the job over and over, eventually one will
> actually kick off and run to completion.
>
> if i run the process outside of slurm just using openmpi, it seems to
> behave correctly, so i'm leaning towards a slurm/openmpi interaction
> problem.
>
> what i'd like to do is instrument a debug in openmpi that will tell me
> what openmpi is waiting on in order to kick off the xhpl binary
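(One place to start, as a sketch: these verbosity parameters exist in Open MPI 1.6, though exactly what they print under srun may differ from an mpirun launch:)

```shell
# Turn up verbosity on the launch frameworks: plm starts the orted daemons,
# odls is the piece of orted that forks the application processes.
export OMPI_MCA_plm_base_verbose=5
export OMPI_MCA_odls_base_verbose=5
srun -n 1024 --ntasks-per-node 12 xhpl
```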
>
> i'm testing to see whether it's a psm related problem now, i'll check
> back if i can narrow the scope a little more
>
> On Thu, Oct 11, 2012 at 10:21 PM, Ralph Castain <rhc_at_[hidden]> wrote:
>> I'm afraid I'm confused - I don't understand what is and isn't working. What
>> "next process" isn't starting?
>>
>>
>> On Thu, Oct 11, 2012 at 9:41 AM, Michael Di Domenico
>> <mdidomenico4_at_[hidden]> wrote:
>>>
>>> adding some additional info
>>>
>>> i did an strace on an orted process where xhpl failed to start. i did
>>> this after the mpirun execution, so i probably missed some output, but
>>> it keeps scrolling:
>>>
>>> poll([{fd=4, events=POLLIN},{fd=7, events=POLLIN},{fd=8,
>>> events=POLLIN},{fd=10, events=POLLIN},{fd=12, events=POLLIN},{fd=13,
>>> events=POLLIN},{fd=14, events=POLLIN},{fd=15, events=POLLIN},{fd=16,
>>> events=POLLIN}], 9, 1000) = 0 (Timeout)
>>>
>>> i didn't see anything useful in /proc under those file descriptors,
>>> but perhaps i missed something i don't know to look for
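(One way to resolve those descriptors, as a sketch; this is meant to run on the node with the stuck orted — the fallback to the current shell's pid is only there so the commands run anywhere:)

```shell
# Resolve what each polled fd points at (socket, pipe, file) via /proc.
pid=$(pgrep -o orted || echo $$)   # oldest orted, or fall back to this shell
ls -l /proc/"$pid"/fd              # each entry is a symlink naming its target
```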
>>>
>>> On Thu, Oct 11, 2012 at 12:06 PM, Michael Di Domenico
>>> <mdidomenico4_at_[hidden]> wrote:
>>> > to add a little more detail, it looks like xhpl is not actually
>>> > starting on all nodes when i kick off the mpirun
>>> >
>>> > each time i cancel and restart the job, the nodes that do not start
>>> > change, so i can't call it a bad node
>>> >
>>> > if i disable infiniband with --mca btl self,sm,tcp on occasion i can
>>> > get xhpl to actually run, but it's not consistent
>>> >
>>> > i'm going to check my ethernet network and make sure there are no
>>> > problems there (could this be an OOB error with mpirun?). on the nodes
>>> > that fail to start xhpl, i do see the orte process, but nothing in the
>>> > logs about why it failed to launch xhpl
>>> >
>>> >
>>> >
>>> > On Thu, Oct 11, 2012 at 11:49 AM, Michael Di Domenico
>>> > <mdidomenico4_at_[hidden]> wrote:
>>> >> I'm trying to diagnose an MPI job (in this case xhpl), that fails to
>>> >> start when the rank count gets fairly high into the thousands.
>>> >>
>>> >> My symptom is the jobs fires up via slurm, and I can see all the xhpl
>>> >> processes on the nodes, but it never kicks over to the next process.
>>> >>
>>> >> My question is, what debugs should I turn on to tell me what the
>>> >> system might be waiting on?
>>> >>
>>> >> I've checked a bunch of things, but I'm probably overlooking something
>>> >> trivial (which is par for me).
>>> >>
>>> >> I'm using Open MPI 1.6.1 and Slurm 2.4.2 on CentOS 6.3, with
>>> >> Infiniband/PSM.
>>> _______________________________________________
>>> users mailing list
>>> users_at_[hidden]
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users