Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] debugs for jobs not starting
From: Ralph Castain (rhc.openmpi_at_[hidden])
Date: 2012-10-12 13:23:37

Something doesn't make sense here. If you direct-launch with srun, there is no orted involved. The orted only gets launched if you start with mpirun.

Did you configure --with-pmi and point to where that include file resides? Otherwise, the procs will all think they are singletons.
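For reference, a PMI-enabled Open MPI 1.6 build is typically configured along these lines (a sketch; the install prefixes below are illustrative, and --with-pmi should point at whatever prefix holds Slurm's pmi.h on your system):

    # build Open MPI with Slurm, PMI, and PSM support (paths are assumptions)
    ./configure --prefix=/opt/openmpi-1.6.1 \
                --with-slurm \
                --with-pmi=/usr \
                --with-psm
    make all install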

Sent from my iPhone

On Oct 12, 2012, at 7:27 AM, Michael Di Domenico <mdidomenico4_at_[hidden]> wrote:

> What isn't working: when I fire off an MPI job with over 800 ranks,
> not all of the ranks actually start a process. For example, if I do
> 'srun -n 1024 --ntasks-per-node 12 xhpl' and then run 'pgrep xhpl | wc -l'
> on all of the allocated nodes, not all of them have actually started
> xhpl. Most will report 12 started processes, but an inconsistent set
> of nodes fails to actually start xhpl, which stalls the whole job.
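One quick way to collect that count from every node at once (a sketch, not from the thread; it assumes pdsh is installed, that it accepts Slurm-style host ranges, and that SLURM_JOB_NODELIST is set inside the allocation; older Slurms export SLURM_NODELIST instead):

    # print one process count per allocated node; nodes reporting
    # fewer than 12 are the ones that stalled
    pdsh -w "$SLURM_JOB_NODELIST" 'pgrep -c xhpl'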
> If I look at all the nodes allocated to my job, the orte process does
> start on each of them, though. What I need to figure out is why the
> orte process starts but then fails to start xhpl on some of the nodes.
> Unfortunately, the set of nodes that don't start xhpl changes with
> each run, and no hardware errors are being detected. If I cancel and
> restart the job over and over, eventually one run will actually kick
> off and run to completion.
> If I run the job outside of Slurm, using just Open MPI, it seems to
> behave correctly, so I'm leaning towards a Slurm/Open MPI interaction
> problem. What I'd like to do is turn on a debug in Open MPI that will
> tell me what Open MPI is waiting on before it kicks off the xhpl binary.
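When the job is launched through mpirun, the usual knobs for that are ORTE's verbosity parameters (a sketch; the verbosity level 5 is arbitrary, and these flags do nothing under srun direct launch, which is part of Ralph's point above):

    # keep the orted daemons attached and have them report launch activity;
    # odls verbosity shows the daemons fork/exec'ing the app processes
    mpirun --debug-daemons \
           --mca plm_base_verbose 5 \
           --mca odls_base_verbose 5 \
           -n 1024 xhpl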
> I'm testing now to see whether it's a PSM-related problem; I'll check
> back if I can narrow the scope a little more.
> On Thu, Oct 11, 2012 at 10:21 PM, Ralph Castain <rhc_at_[hidden]> wrote:
>> I'm afraid I'm confused - I don't understand what is and isn't working. What
>> "next process" isn't starting?
>> On Thu, Oct 11, 2012 at 9:41 AM, Michael Di Domenico
>> <mdidomenico4_at_[hidden]> wrote:
>>> Adding some additional info: I did an strace on an orted process
>>> where xhpl failed to start. I did this after the mpirun execution,
>>> so I probably missed some output, but this keeps scrolling:
>>> poll([{fd=4, events=POLLIN},{fd=7, events=POLLIN},{fd=8,
>>> events=POLLIN},{fd=10, events=POLLIN},{fd=12, events=POLLIN},{fd=13,
>>> events=POLLIN},{fd=14, events=POLLIN},{fd=15, events=POLLIN},{fd=16,
>>> events=POLLIN}], 9, 1000) = 0 (Timeout)
>>> I didn't see anything useful in /proc under those file descriptors,
>>> but perhaps I missed something I don't know to look for.
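A sketch of one way to see what each polled descriptor actually is (assumes lsof is installed and a single orted on the node; on a busy node you would substitute the specific pid):

    # map the orted's open fds to pipes, sockets, and files
    lsof -p $(pgrep orted)
    ls -l /proc/$(pgrep orted)/fd

For an orted idling in that poll loop, the descriptors are typically its listening sockets, OOB TCP connections, and the pipes to its local children, so a steady stream of timeouts just means it is waiting for someone to talk to it.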
>>> On Thu, Oct 11, 2012 at 12:06 PM, Michael Di Domenico
>>> <mdidomenico4_at_[hidden]> wrote:
>>>> To add a little more detail: it looks like xhpl is not actually
>>>> starting on all nodes when I kick off the mpirun. Each time I
>>>> cancel and restart the job, the nodes that do not start change, so
>>>> I can't call it a bad node. If I disable InfiniBand with '--mca btl
>>>> self,sm,tcp', on occasion I can get xhpl to actually run, but it's
>>>> not consistent.
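A side note on that btl setting (an inference from how Open MPI 1.6 selects transports, not something confirmed in this thread): on a PSM build, point-to-point traffic normally goes through the cm PML and the psm MTL rather than the btls, so restricting the btl list may not take PSM out of the picture by itself. Forcing the ob1 PML pins the job to the listed btls:

    # force the btl path (ob1) so the self/sm/tcp restriction actually applies
    mpirun --mca pml ob1 --mca btl self,sm,tcp -n 1024 xhpl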
>>>> I'm going to check my Ethernet network and make sure there are no
>>>> problems there (could this be an OOB error with mpirun?). On the
>>>> nodes that fail to start xhpl, I do see the orte process, but there
>>>> is nothing in the logs about why it failed to launch xhpl.
>>>> On Thu, Oct 11, 2012 at 11:49 AM, Michael Di Domenico
>>>> <mdidomenico4_at_[hidden]> wrote:
>>>>> I'm trying to diagnose an MPI job (in this case xhpl) that fails
>>>>> to start when the rank count gets fairly high, into the thousands.
>>>>> My symptom is that the job fires up via Slurm, and I can see all
>>>>> the xhpl processes on the nodes, but it never kicks over to the
>>>>> next process.
>>>>> My question is, what debugs should I turn on to tell me what the
>>>>> system might be waiting on?
>>>>> I've checked a bunch of things, but I'm probably overlooking something
>>>>> trivial (which is par for me).
>>>>> I'm using Open MPI 1.6.1 and Slurm 2.4.2 on CentOS 6.3, with
>>>>> InfiniBand/PSM.
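Following up Ralph's --with-pmi question at the top of the thread, two quick checks on an existing install (a sketch; exact output varies by Open MPI and Slurm version):

    # pmi components should appear here if PMI support was compiled in
    ompi_info | grep -i pmi
    # list the MPI launch plugins this Slurm was built with
    srun --mpi=list

If ompi_info shows no pmi components, each srun-launched rank comes up as a singleton, which would match jobs that never get past startup.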