Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] debugs for jobs not starting
From: Ralph Castain (rhc.openmpi_at_[hidden])
Date: 2012-10-12 13:23:37


Something doesn't make sense here. If you direct-launch with srun, there is no orted involved. The orted only gets launched if you start with mpirun.

Did you configure --with-pmi and point it to where that include file resides? Otherwise, the procs will all think they are singletons.
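
For reference, something along these lines is what I mean (the prefix is just illustrative; point --with-pmi at wherever Slurm's pmi.h is installed on your system):

    ./configure --prefix=/opt/openmpi-1.6.1 --with-pmi=/usr

Afterwards, 'ompi_info | grep pmi' should list the pmi-based components (e.g. ess pmi) if the support actually got built in.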

Sent from my iPhone

On Oct 12, 2012, at 7:27 AM, Michael Di Domenico <mdidomenico4_at_[hidden]> wrote:

> What isn't working is that when I fire off an MPI job with over 800 ranks,
> not all of the ranks actually start up a process.
>
> e.g., if I do srun -n 1024 --ntasks-per-node 12 xhpl
>
> and then do a 'pgrep xhpl | wc -l' on all of the allocated nodes, not
> all of them have actually started xhpl.
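>
> (roughly speaking, that per-node check is something like the following;
> the exact loop and hostname expansion aren't important:
>
>   for n in $(scontrol show hostnames $SLURM_JOB_NODELIST); do
>     echo "$n: $(ssh $n pgrep -c xhpl)"
>   done
> )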
>
> Most will show 12 started processes, but an inconsistent set of nodes
> fails to actually start xhpl, which stalls the whole job.
>
> If I look at all the nodes allocated to my job, though, the orted
> process does start on each of them.
>
> What I need to figure out is why the orted process starts but fails
> to actually start xhpl on some of the nodes.
>
> Unfortunately, the list of nodes that don't start xhpl changes with
> each run, and no hardware errors are being detected. If I cancel and
> restart the job over and over, eventually one run will actually kick
> off and run to completion.
>
> If I run the job outside of Slurm, using just Open MPI, it seems to
> behave correctly, so I'm leaning towards a Slurm/Open MPI interaction
> problem.
>
> What I'd like to do is turn on some debugging in Open MPI that will
> tell me what it is waiting on before it kicks off the xhpl binary.
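>
> (concretely, what I have in mind is exporting some of the generic OMPI
> verbosity knobs in the job environment before the srun, for example:
>
>   export OMPI_MCA_ess_base_verbose=5
>   export OMPI_MCA_plm_base_verbose=5
>   export OMPI_MCA_odls_base_verbose=5
>   srun -n 1024 --ntasks-per-node 12 xhpl
>
> the verbosity levels are arbitrary, and the plm/odls ones presumably
> only matter if an orted is actually involved in the launch)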
>
> I'm testing now to see whether it's a PSM-related problem; I'll check
> back if I can narrow the scope a little more.
>
> On Thu, Oct 11, 2012 at 10:21 PM, Ralph Castain <rhc_at_[hidden]> wrote:
>> I'm afraid I'm confused - I don't understand what is and isn't working. What
>> "next process" isn't starting?
>>
>>
>> On Thu, Oct 11, 2012 at 9:41 AM, Michael Di Domenico
>> <mdidomenico4_at_[hidden]> wrote:
>>>
>>> Adding some additional info:
>>>
>>> I did an strace on an orted process where xhpl failed to start. I did
>>> this after the mpirun execution, so I probably missed some output, but
>>> it keeps scrolling with:
>>>
>>> poll([{fd=4, events=POLLIN},{fd=7, events=POLLIN},{fd=8,
>>> events=POLLIN},{fd=10, events=POLLIN},{fd=12, events=POLLIN},{fd=13,
>>> events=POLLIN},{fd=14, events=POLLIN},{fd=15, events=POLLIN},{fd=16,
>>> events=POLLIN}], 9, 1000) = 0 (Timeout)
>>>
>>> I didn't see anything useful in /proc under those file descriptors,
>>> but perhaps I missed something I don't know to look for.
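>>>
>>> (by that I mean looking at ls -l /proc/<orted pid>/fd to see what each
>>> descriptor in that poll list points at; the pid is whatever the stuck
>>> orted is on that node)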
>>>
>>> On Thu, Oct 11, 2012 at 12:06 PM, Michael Di Domenico
>>> <mdidomenico4_at_[hidden]> wrote:
>>>> To add a little more detail, it looks like xhpl is not actually
>>>> starting on all nodes when I kick off the mpirun.
>>>>
>>>> Each time I cancel and restart the job, the nodes that do not start
>>>> change, so I can't blame a single bad node.
>>>>
>>>> If I disable InfiniBand with --mca btl self,sm,tcp, on occasion I can
>>>> get xhpl to actually run, but it's not consistent.
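>>>>
>>>> (for clarity, that's either on the command line:
>>>>
>>>>   mpirun --mca btl self,sm,tcp ... ./xhpl
>>>>
>>>> or, equivalently, exported in the environment so it also applies to a
>>>> direct srun launch:
>>>>
>>>>   export OMPI_MCA_btl=self,sm,tcp
>>>> )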
>>>>
>>>> I'm going to check my Ethernet network and make sure there are no
>>>> problems there (could this be an OOB error with mpirun?). On the nodes
>>>> that fail to start xhpl, I do see the orted process, but nothing in the
>>>> logs about why it failed to launch xhpl.
>>>>
>>>>
>>>>
>>>> On Thu, Oct 11, 2012 at 11:49 AM, Michael Di Domenico
>>>> <mdidomenico4_at_[hidden]> wrote:
>>>>> I'm trying to diagnose an MPI job (in this case xhpl) that fails to
>>>>> start when the rank count gets fairly high, into the thousands.
>>>>>
>>>>> My symptom is that the job fires up via Slurm, and I can see all the
>>>>> xhpl processes on the nodes, but it never kicks over to the next process.
>>>>>
>>>>> My question is: what debug options should I turn on to tell me what
>>>>> the system might be waiting on?
>>>>>
>>>>> I've checked a bunch of things, but I'm probably overlooking something
>>>>> trivial (which is par for me).
>>>>>
>>>>> I'm using Open MPI 1.6.1 and Slurm 2.4.2 on CentOS 6.3, with
>>>>> InfiniBand/PSM.