
Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] srun and openmpi
From: Michael Di Domenico (mdidomenico4_at_[hidden])
Date: 2011-01-25 12:53:06


Thanks. We're only seeing it on machines with Ethernet as the only
interconnect. Fortunately for us that only equates to one small
machine, but it's still annoying. Unfortunately, I don't have enough
knowledge to dive into the code to help fix it, but I can certainly
help test.

On Mon, Jan 24, 2011 at 1:41 PM, Nathan Hjelm <hjelmn_at_[hidden]> wrote:
> I am seeing similar issues on our slurm clusters. We are looking into the
> issue.
>
> -Nathan
> HPC-3, LANL
>
> On Tue, 11 Jan 2011, Michael Di Domenico wrote:
>
>> Any ideas on what might be causing this one?  Or at least what
>> additional debug information someone might need?
>>
>> On Fri, Jan 7, 2011 at 4:03 PM, Michael Di Domenico
>> <mdidomenico4_at_[hidden]> wrote:
>>>
>>> I'm still testing the Slurm integration, which seems to work fine so
>>> far.  However, I just upgraded another cluster to openmpi-1.5 and
>>> slurm 2.1.15, but this machine has no InfiniBand.
>>>
>>> If I salloc the nodes and mpirun the command, it seems to run and
>>> complete fine; however, if I srun the command I get:
>>>
>>> [btl_tcp_endpoint:486] mca_btl_tcp_endpoint_recv_connect_ack received
>>> unexpected process identifier
>>>
>>> The job does not seem to run, and exhibits two behaviors:
>>> running a single process per node, the job runs and does not present
>>> the error (srun -N40 --ntasks-per-node=1);
>>> running multiple processes per node, the job spits out the error but
>>> does not run (srun -n40 --ntasks-per-node=8).
>>>
>>> I copied the configs from the other machine, so (I think) everything
>>> should be configured correctly (but I can't rule it out).
>>>
>>> I saw (and reported) a similar error to the above with the 1.4-dev
>>> branch (see the mailing list) and Slurm; I can't say whether they're
>>> related, though.
>>>
>>>
>>> On Mon, Jan 3, 2011 at 3:00 PM, Jeff Squyres <jsquyres_at_[hidden]> wrote:
>>>>
>>>> Yo Ralph --
>>>>
>>>> I see this was committed
>>>> https://svn.open-mpi.org/trac/ompi/changeset/24197.  Do you want to add a
>>>> blurb in README about it, and/or have this executable compiled as part of
>>>> the PSM MTL and then installed into $bindir (maybe named ompi-psm-keygen)?
>>>>
>>>> Right now, it's only compiled as part of "make check" and not installed,
>>>> right?
>>>>
>>>>
>>>>
>>>> On Dec 30, 2010, at 5:07 PM, Ralph Castain wrote:
>>>>
>>>>> Run the program only once - it can be in the prolog of the job if you
>>>>> like. The output value needs to be in the env of every rank.
>>>>>
>>>>> You can reuse the value as many times as you like - it doesn't have to
>>>>> be unique for each job. There is nothing magic about the value itself.
>>>>>
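[The prolog approach described above can be sketched as a Slurm TaskProlog script: any line a task prolog prints in the form "export NAME=value" is injected into each task's environment by slurmd. This is only an illustrative sketch; the key value below is made up, and in practice it would be generated once per job by the keygen program discussed elsewhere in this thread.]

```shell
#!/bin/sh
# Hypothetical Slurm TaskProlog sketch: any "export NAME=value" line this
# script prints is added to each task's environment by slurmd.
# The key value below is purely illustrative, not a real generated key.
KEY="0099b3eaa2c1547e-afb287789133a954"
echo "export OMPI_MCA_orte_precondition_transports=${KEY}"
```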
>>>>> On Dec 30, 2010, at 2:11 PM, Michael Di Domenico wrote:
>>>>>
>>>>>> How early does this need to run?  Can I run it as part of a task
>>>>>> prolog, or does it need to be in the shell env for each rank?  And
>>>>>> does it need to run on one node or all the nodes in the job?
>>>>>>
>>>>>> On Thu, Dec 30, 2010 at 8:54 PM, Ralph Castain <rhc_at_[hidden]>
>>>>>> wrote:
>>>>>>>
>>>>>>> Well, I couldn't do it as a patch - it proved too complicated, as
>>>>>>> the PSM system looks for the value early in the boot procedure.
>>>>>>>
>>>>>>> What I can do is give you the attached key generator program. It
>>>>>>> outputs the envar required to run your program. So if you run the attached
>>>>>>> program and then export the output into your environment, you should be
>>>>>>> okay. Looks like this:
>>>>>>>
>>>>>>> $ ./psm_keygen
>>>>>>>
>>>>>>> OMPI_MCA_orte_precondition_transports=0099b3eaa2c1547e-afb287789133a954
>>>>>>> $
>>>>>>>
>>>>>>> You compile the program with the usual mpicc.
>>>>>>>
>>>>>>> Let me know if this solves the problem (or not).
>>>>>>> Ralph
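[The export step described above can be sketched as a self-contained shell snippet. Assumption: psm_keygen prints a single NAME=value line as shown in the message; the key value here is copied from that example output and is purely illustrative.]

```shell
# Sketch of exporting the keygen output into the environment before srun.
# In a real job this would be: keygen_output=$(./psm_keygen)
# Here the output is simulated so the sketch is self-contained.
keygen_output="OMPI_MCA_orte_precondition_transports=0099b3eaa2c1547e-afb287789133a954"
export "$keygen_output"
# Every rank launched from this environment now sees the same key.
echo "$OMPI_MCA_orte_precondition_transports"
```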
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Dec 30, 2010, at 11:18 AM, Michael Di Domenico wrote:
>>>>>>>
>>>>>>>> Sure, I'll give it a go.
>>>>>>>>
>>>>>>>> On Thu, Dec 30, 2010 at 5:53 PM, Ralph Castain <rhc_at_[hidden]>
>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> Ah, yes - that is going to be a problem. The PSM key gets generated
>>>>>>>>> by mpirun as it is shared info - i.e., every proc has to get the same value.
>>>>>>>>>
>>>>>>>>> I can create a patch that will do this for the srun direct-launch
>>>>>>>>> scenario, if you want to try it. Would be later today, though.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Dec 30, 2010, at 10:31 AM, Michael Di Domenico wrote:
>>>>>>>>>
>>>>>>>>>> Well, maybe not hooray yet.  I might have jumped the gun a bit;
>>>>>>>>>> it's looking like srun works in general, but perhaps not with PSM.
>>>>>>>>>>
>>>>>>>>>> With PSM I get this error (at least now I know what I changed):
>>>>>>>>>>
>>>>>>>>>> Error obtaining unique transport key from ORTE
>>>>>>>>>> (orte_precondition_transports not present in the environment)
>>>>>>>>>> PML add procs failed
>>>>>>>>>> --> Returned "Error" (-1) instead of "Success" (0)
>>>>>>>>>>
>>>>>>>>>> Turn off PSM and srun works fine
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Thu, Dec 30, 2010 at 5:13 PM, Ralph Castain <rhc_at_[hidden]>
>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Hooray!
>>>>>>>>>>>
>>>>>>>>>>> On Dec 30, 2010, at 9:57 AM, Michael Di Domenico wrote:
>>>>>>>>>>>
>>>>>>>>>>>> I think I take it all back.  I just tried it again and it
>>>>>>>>>>>> seems to work now.  I'm not sure what I changed (between my
>>>>>>>>>>>> first and this msg), but it does appear to work now.
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, Dec 30, 2010 at 4:31 PM, Michael Di Domenico
>>>>>>>>>>>> <mdidomenico4_at_[hidden]> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Yes, that's true, error messages help.  I was hoping there
>>>>>>>>>>>>> was some documentation to see what I've done wrong.  I can't
>>>>>>>>>>>>> easily cut and paste errors from my cluster.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Here's a snippet (hand-typed) of the error message, but it
>>>>>>>>>>>>> does look like a rank communications error:
>>>>>>>>>>>>>
>>>>>>>>>>>>> ORTE_ERROR_LOG: A message is attempting to be sent to a process
>>>>>>>>>>>>> whose
>>>>>>>>>>>>> contact information is unknown in file rml_oob_send.c at line
>>>>>>>>>>>>> 145.
>>>>>>>>>>>>> *** MPI_INIT failure message (snipped) ***
>>>>>>>>>>>>> orte_grpcomm_modex failed
>>>>>>>>>>>>> --> Returned "A message is attempting to be sent to a process
>>>>>>>>>>>>> whose contact information is unknown" (-117) instead of
>>>>>>>>>>>>> "Success" (0)
>>>>>>>>>>>>>
>>>>>>>>>>>>> This msg repeats for each rank, and ultimately hangs the
>>>>>>>>>>>>> srun, which I have to Ctrl-C to terminate.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I have MpiPorts defined in my Slurm config, and running srun
>>>>>>>>>>>>> with --resv-ports does show the SLURM_RESV_PORTS environment
>>>>>>>>>>>>> variable getting passed to the shell.
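[The port-reservation setup being described typically amounts to a slurm.conf fragment along these lines; the port range below is illustrative and not taken from the original message.]

```
# slurm.conf fragment (sketch; the port range is illustrative)
MpiParams=ports=12000-12999
```

Jobs would then be launched with srun --resv-ports so that SLURM_RESV_PORTS is set for each step.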
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Thu, Dec 23, 2010 at 8:09 PM, Ralph Castain
>>>>>>>>>>>>> <rhc_at_[hidden]> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I'm not sure there is any documentation yet - not much clamor
>>>>>>>>>>>>>> for it. :-/
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> It would really help if you included the error message.
>>>>>>>>>>>>>> Otherwise, all I can do is guess, which wastes both of our time :-(
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> My best guess is that the port reservation didn't get passed
>>>>>>>>>>>>>> down to the MPI procs properly - but that's just a guess.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Dec 23, 2010, at 12:46 PM, Michael Di Domenico wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Can anyone point me towards the most recent documentation for
>>>>>>>>>>>>>>> using
>>>>>>>>>>>>>>> srun and openmpi?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I followed what I found on the web, enabling the MpiPorts
>>>>>>>>>>>>>>> config in Slurm and using the --resv-ports switch, but I'm
>>>>>>>>>>>>>>> getting an error from Open MPI during setup.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I'm using Slurm 2.1.15 and Open MPI 1.5 w/PSM.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I'm sure I'm missing a step.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>>>>> users mailing list
>>>>>>>>>>>>>>> users_at_[hidden]
>>>>>>>>>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Jeff Squyres
>>>> jsquyres_at_[hidden]
>>>> For corporate legal information go to:
>>>> http://www.cisco.com/web/about/doing_business/legal/cri/
>>>>
>>>>
>>>
>>
>