Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] srun and openmpi
From: Nathan Hjelm (hjelmn_at_[hidden])
Date: 2011-01-25 14:59:15


We are seeing a similar problem on our InfiniBand machines. After some investigation I discovered that we were not setting up our Slurm environment correctly (ref: https://computing.llnl.gov/linux/slurm/mpi_guide.html#open_mpi). Are you setting the ports in your slurm.conf and executing srun with --resv-ports?
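
For reference, the relevant pieces look something like this (the port range is site-specific and the executable name is just a placeholder):

# slurm.conf
MpiParams=ports=12000-12999

$ srun --resv-ports -n 16 ./my_mpi_app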

I have yet to see if this fixes the problem for LANL. Waiting on a sysadmin to modify the slurm.conf.

-Nathan
HPC-3, LANL

On Tue, 25 Jan 2011, Michael Di Domenico wrote:

> Thanks. We're only seeing it on machines with Ethernet as the sole
> interconnect. Fortunately for us that only equates to one small
> machine, but it's still annoying. Unfortunately, I don't have enough
> knowledge to dive into the code to help fix it, but I can certainly
> help test.
>
> On Mon, Jan 24, 2011 at 1:41 PM, Nathan Hjelm <hjelmn_at_[hidden]> wrote:
>> I am seeing similar issues on our slurm clusters. We are looking into the
>> issue.
>>
>> -Nathan
>> HPC-3, LANL
>>
>> On Tue, 11 Jan 2011, Michael Di Domenico wrote:
>>
>>> Any ideas on what might be causing this one?  Or at least what
>>> additional debug information someone might need?
>>>
>>> On Fri, Jan 7, 2011 at 4:03 PM, Michael Di Domenico
>>> <mdidomenico4_at_[hidden]> wrote:
>>>>
>>>> I'm still testing the Slurm integration, which seems to work fine so
>>>> far.  However, I just upgraded another cluster to Open MPI 1.5 and
>>>> Slurm 2.1.15, but this machine has no InfiniBand.
>>>>
>>>> If I salloc the nodes and mpirun the command, it seems to run and
>>>> complete fine.  However, if I srun the command I get:
>>>>
>>>> [btl_tcp_endpoint:486] mca_btl_tcp_endpoint_recv_connect_ack received
>>>> unexpected process identifier
>>>>
>>>> The job does not run reliably; it exhibits two behaviors:
>>>> running a single process per node, the job runs and does not present
>>>> the error (srun -N40 --ntasks-per-node=1);
>>>> running multiple processes per node, the job spits out the error but
>>>> does not run (srun -n40 --ntasks-per-node=8).
>>>>
>>>> I copied the configs from the other machine, so (I think) everything
>>>> should be configured correctly (but I can't rule it out).
>>>>
>>>> I saw (and reported) a similar error with the 1.4-dev branch and
>>>> Slurm (see the mailing list); I can't say whether they're related
>>>> or not, though.
>>>>
>>>>
>>>> On Mon, Jan 3, 2011 at 3:00 PM, Jeff Squyres <jsquyres_at_[hidden]> wrote:
>>>>>
>>>>> Yo Ralph --
>>>>>
>>>>> I see this was committed
>>>>> https://svn.open-mpi.org/trac/ompi/changeset/24197.  Do you want to add a
>>>>> blurb in README about it, and/or have this executable compiled as part of
>>>>> the PSM MTL and then installed into $bindir (maybe named ompi-psm-keygen)?
>>>>>
>>>>> Right now, it's only compiled as part of "make check" and not installed,
>>>>> right?
>>>>>
>>>>>
>>>>>
>>>>> On Dec 30, 2010, at 5:07 PM, Ralph Castain wrote:
>>>>>
>>>>>> Run the program only once - it can be in the prolog of the job if you
>>>>>> like. The output value needs to be in the env of every rank.
>>>>>>
>>>>>> You can reuse the value as many times as you like - it doesn't have to
>>>>>> be unique for each job. There is nothing magic about the value itself.
>>>>>>
>>>>>> On Dec 30, 2010, at 2:11 PM, Michael Di Domenico wrote:
>>>>>>
>>>>>>> How early does this need to run? Can I run it as part of a task
>>>>>>> prolog, or does it need to be set in the shell env for each rank?  And
>>>>>>> does it need to run on one node or on all the nodes in the job?
>>>>>>>
>>>>>>> On Thu, Dec 30, 2010 at 8:54 PM, Ralph Castain <rhc_at_[hidden]>
>>>>>>> wrote:
>>>>>>>>
>>>>>>>> Well, I couldn't do it as a patch - it proved too complicated, as the
>>>>>>>> PSM system looks for the value early in the boot procedure.
>>>>>>>>
>>>>>>>> What I can do is give you the attached key generator program. It
>>>>>>>> outputs the environment variable required to run your program. If you
>>>>>>>> run it and then export the output into your environment, you should be
>>>>>>>> okay. It looks like this:
>>>>>>>>
>>>>>>>> $ ./psm_keygen
>>>>>>>>
>>>>>>>> OMPI_MCA_orte_precondition_transports=0099b3eaa2c1547e-afb287789133a954
>>>>>>>> $
>>>>>>>>
>>>>>>>> You compile the program with the usual mpicc.
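>>>>>>>>
>>>>>>>> So in practice it would be something like this before you launch (the
>>>>>>>> app name below is just a placeholder; srun forwards your environment
>>>>>>>> to the ranks, so they all see the same key):
>>>>>>>>
>>>>>>>> $ export $(./psm_keygen)
>>>>>>>> $ srun --resv-ports -n 40 ./your_mpi_app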
>>>>>>>>
>>>>>>>> Let me know if this solves the problem (or not).
>>>>>>>> Ralph
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Dec 30, 2010, at 11:18 AM, Michael Di Domenico wrote:
>>>>>>>>
>>>>>>>>> Sure, I'll give it a go.
>>>>>>>>>
>>>>>>>>> On Thu, Dec 30, 2010 at 5:53 PM, Ralph Castain <rhc_at_[hidden]>
>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>> Ah, yes - that is going to be a problem. The PSM key gets generated
>>>>>>>>>> by mpirun as it is shared info - i.e., every proc has to get the same value.
>>>>>>>>>>
>>>>>>>>>> I can create a patch that will do this for the srun direct-launch
>>>>>>>>>> scenario, if you want to try it. Would be later today, though.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Dec 30, 2010, at 10:31 AM, Michael Di Domenico wrote:
>>>>>>>>>>
>>>>>>>>>>> Well, maybe not hooray yet.  I might have jumped the gun a bit;
>>>>>>>>>>> it's looking like srun works in general, but perhaps not with PSM.
>>>>>>>>>>>
>>>>>>>>>>> With PSM I get this error (at least now I know what I changed):
>>>>>>>>>>>
>>>>>>>>>>> Error obtaining unique transport key from ORTE
>>>>>>>>>>> (orte_precondition_transports not present in the environment)
>>>>>>>>>>> PML add procs failed
>>>>>>>>>>> --> Returned "Error" (-1) instead of "Success" (0)
>>>>>>>>>>>
>>>>>>>>>>> If I turn off PSM, srun works fine.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Dec 30, 2010 at 5:13 PM, Ralph Castain <rhc_at_[hidden]>
>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Hooray!
>>>>>>>>>>>>
>>>>>>>>>>>> On Dec 30, 2010, at 9:57 AM, Michael Di Domenico wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> I think I take it all back.  I just tried it again and it seems
>>>>>>>>>>>>> to work now.  I'm not sure what I changed (between my first
>>>>>>>>>>>>> message and this one), but it does appear to work now.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Thu, Dec 30, 2010 at 4:31 PM, Michael Di Domenico
>>>>>>>>>>>>> <mdidomenico4_at_[hidden]> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Yes, that's true, error messages help.  I was hoping there was
>>>>>>>>>>>>>> some documentation to show what I've done wrong.  I can't easily
>>>>>>>>>>>>>> cut and paste errors from my cluster.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Here's a snippet (hand-typed) of the error message; it does look
>>>>>>>>>>>>>> like a rank communication error:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> ORTE_ERROR_LOG: A message is attempting to be sent to a process
>>>>>>>>>>>>>> whose contact information is unknown in file rml_oob_send.c at
>>>>>>>>>>>>>> line 145.
>>>>>>>>>>>>>> *** MPI_INIT failure message (snipped) ***
>>>>>>>>>>>>>> orte_grpcomm_modex failed
>>>>>>>>>>>>>> --> Returned "A message is attempting to be sent to a process
>>>>>>>>>>>>>> whose contact information is unknown" (-117) instead of
>>>>>>>>>>>>>> "Success" (0)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> This message repeats for each rank, and ultimately the srun
>>>>>>>>>>>>>> hangs, which I have to Ctrl-C to terminate.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I have MpiPorts defined in my slurm config, and running srun
>>>>>>>>>>>>>> with --resv-ports does show the SLURM_RESV_PORTS environment
>>>>>>>>>>>>>> variable getting passed to the shell.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Thu, Dec 23, 2010 at 8:09 PM, Ralph Castain
>>>>>>>>>>>>>> <rhc_at_[hidden]> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I'm not sure there is any documentation yet - not much clamor
>>>>>>>>>>>>>>> for it. :-/
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> It would really help if you included the error message.
>>>>>>>>>>>>>>> Otherwise, all I can do is guess, which wastes both of our time :-(
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> My best guess is that the port reservation didn't get passed
>>>>>>>>>>>>>>> down to the MPI procs properly - but that's just a guess.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Dec 23, 2010, at 12:46 PM, Michael Di Domenico wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Can anyone point me towards the most recent documentation for
>>>>>>>>>>>>>>>> using srun and Open MPI?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I followed what I found on the web about enabling the MpiPorts
>>>>>>>>>>>>>>>> config in Slurm and using the --resv-ports switch, but I'm
>>>>>>>>>>>>>>>> getting an error from Open MPI during setup.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I'm using Slurm 2.1.15 and Open MPI 1.5 with PSM.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I'm sure I'm missing a step.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks