
Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] srun and openmpi
From: Michael Di Domenico (mdidomenico4_at_[hidden])
Date: 2010-12-30 13:18:17


Sure, I'll give it a go.

On Thu, Dec 30, 2010 at 5:53 PM, Ralph Castain <rhc_at_[hidden]> wrote:
> Ah, yes - that is going to be a problem. The PSM key gets generated by mpirun as it is shared info - i.e., every proc has to get the same value.
>
> I can create a patch that will do this for the srun direct-launch scenario, if you want to try it. It would be later today, though.
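A minimal sketch of the idea behind such a fix, under stated assumptions: every rank needs the same transport key before MPI_Init, so the key can be computed once per job and exported before srun starts the tasks. The sketch assumes the key is read from an `OMPI_MCA_orte_precondition_transports` environment variable (Open MPI's usual `OMPI_MCA_` prefix plus the parameter named in the PSM error quoted further down) and that a valid key is two 16-digit hex halves joined by a dash. `make_transport_key` is a hypothetical helper, not part of Open MPI or Slurm.

```python
import hashlib
import os
import sys

# Assumption: the MCA parameter from the error is read from this env variable.
ENV_NAME = "OMPI_MCA_orte_precondition_transports"


def make_transport_key(seed: str) -> str:
    """Derive a 128-bit key from `seed`, formatted as 16 hex digits, '-', 16 hex digits.

    The same seed always yields the same key, so deriving it from the job ID
    gives every rank in that job an identical value.
    """
    digest = hashlib.sha256(seed.encode("utf-8")).hexdigest()  # 64 lowercase hex chars
    return f"{digest[0:16]}-{digest[16:32]}"


if __name__ == "__main__":
    # Hypothetical usage: seed from the Slurm job ID, else from argv, else "0".
    job_id = os.environ.get("SLURM_JOB_ID") or (sys.argv[1] if len(sys.argv) > 1 else "0")
    print(f"export {ENV_NAME}={make_transport_key(job_id)}")
```

In a batch script one might save this under a hypothetical name such as `make_key.py` and run `eval "$(python3 make_key.py)"` before `srun`, relying on srun passing the submitting environment to the tasks (its usual default). Deriving the key from the job ID only makes it unique per job; it is not meant to be secret.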
>
>
> On Dec 30, 2010, at 10:31 AM, Michael Di Domenico wrote:
>
>> Well, maybe not hooray yet. I might have jumped the gun a bit; it's
>> looking like srun works in general, but perhaps not with PSM.
>>
>> With PSM I get this error (at least now I know what I changed):
>>
>> Error obtaining unique transport key from ORTE
>> (orte_precondition_transports not present in the environment)
>> PML add procs failed
>> --> Returned "Error" (-1) instead of "Success" (0)
>>
>> Turn off PSM and srun works fine
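To see whether a transport key actually reaches each rank, a small per-rank check can be launched with srun. This is a sketch that assumes the key lives in `OMPI_MCA_orte_precondition_transports` (the `OMPI_MCA_` prefix plus the parameter named in the error above) and that a well-formed key is two 16-digit hex halves joined by a dash. `check_transport_key` is a hypothetical helper; `SLURM_PROCID` is used only to label the output.

```python
import os
import re

# Assumptions: variable name and "16 hex digits - 16 hex digits" key format.
ENV_NAME = "OMPI_MCA_orte_precondition_transports"
KEY_RE = re.compile(r"[0-9a-fA-F]{16}-[0-9a-fA-F]{16}")


def check_transport_key(environ) -> str:
    """Return a one-line diagnosis of the transport key found in `environ`."""
    value = environ.get(ENV_NAME)
    if value is None:
        return f"missing: {ENV_NAME} not set"
    if not KEY_RE.fullmatch(value):
        return f"malformed: {ENV_NAME}={value!r}"
    return f"ok: {value}"


if __name__ == "__main__":
    rank = os.environ.get("SLURM_PROCID", "?")  # Slurm's global task rank
    print(f"rank {rank}: {check_transport_key(os.environ)}")
```

Run it with something like `srun -n 4 python3 check_key.py` (hypothetical script name); if the key is being propagated, every rank should report `ok` with the same value.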
>>
>>
>> On Thu, Dec 30, 2010 at 5:13 PM, Ralph Castain <rhc_at_[hidden]> wrote:
>>> Hooray!
>>>
>>> On Dec 30, 2010, at 9:57 AM, Michael Di Domenico wrote:
>>>
>>>> I think I take it all back. I just tried it again and it seems to
>>>> work now. I'm not sure what I changed between my first message and
>>>> this one, but it does appear to work now.
>>>>
>>>> On Thu, Dec 30, 2010 at 4:31 PM, Michael Di Domenico
>>>> <mdidomenico4_at_[hidden]> wrote:
>>>>> Yes, that's true, error messages help. I was hoping there was some
>>>>> documentation showing what I've done wrong. I can't easily cut and
>>>>> paste errors from my cluster.
>>>>>
>>>>> Here's a hand-typed snippet of the error message; it does look
>>>>> like a rank communications error:
>>>>>
>>>>> ORTE_ERROR_LOG: A message is attempting to be sent to a process whose
>>>>> contact information is unknown in file rml_oob_send.c at line 145.
>>>>> *** MPI_INIT failure message (snipped) ***
>>>>> orte_grpcomm_modex failed
>>>>> --> Returned "A message is attempting to be sent to a process whose
>>>>> contact information is unknown" (-117) instead of "Success" (0)
>>>>>
>>>>> This message repeats for each rank and ultimately hangs the srun, which I
>>>>> have to terminate with Ctrl-C.
>>>>>
>>>>> I have mpiports defined in my slurm config, and running srun with
>>>>> --resv-ports does show the SLURM_RESV_PORTS environment variable
>>>>> getting passed to the shell.
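One way to check whether the port reservation is reaching each MPI process, rather than only the interactive shell, is to have every task print the ports it sees. This sketch assumes the variable is named `SLURM_RESV_PORTS`, as in the message above, and that it holds a range such as `12000-12003`, possibly as a comma-separated list; both the name and the format should be checked against the Slurm documentation for the version in use. `parse_port_spec` is a hypothetical helper.

```python
import os
from typing import List


def parse_port_spec(spec: str) -> List[int]:
    """Expand a port spec such as "12000-12003" or "12000,12005-12006" into ports."""
    ports: List[int] = []
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate an empty spec or stray commas
        if "-" in part:
            lo, hi = (int(x) for x in part.split("-", 1))
            if lo > hi:
                raise ValueError(f"bad port range: {part!r}")
            ports.extend(range(lo, hi + 1))  # inclusive range
        else:
            ports.append(int(part))
    return ports


if __name__ == "__main__":
    rank = os.environ.get("SLURM_PROCID", "?")  # Slurm's global task rank
    # Assumed variable name, taken from the message above.
    ports = parse_port_spec(os.environ.get("SLURM_RESV_PORTS", ""))
    if ports:
        print(f"rank {rank}: reserved ports {ports}")
    else:
        print(f"rank {rank}: no reserved ports visible")
```

Launched through srun with `--resv-ports`, a task that prints "no reserved ports visible" would suggest the reservation is not being passed down to that process.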
>>>>>
>>>>>
>>>>> On Thu, Dec 23, 2010 at 8:09 PM, Ralph Castain <rhc_at_[hidden]> wrote:
>>>>>> I'm not sure there is any documentation yet - not much clamor for it. :-/
>>>>>>
>>>>>> It would really help if you included the error message. Otherwise, all I can do is guess, which wastes time for both of us :-(
>>>>>>
>>>>>> My best guess is that the port reservation didn't get passed down to the MPI procs properly - but that's just a guess.
>>>>>>
>>>>>>
>>>>>> On Dec 23, 2010, at 12:46 PM, Michael Di Domenico wrote:
>>>>>>
>>>>>>> Can anyone point me towards the most recent documentation for using
>>>>>>> srun and openmpi?
>>>>>>>
>>>>>>> I followed what I found on the web: enabling the MpiPorts config
>>>>>>> in Slurm and using the --resv-ports switch. But I'm getting an error
>>>>>>> from Open MPI during setup.
>>>>>>>
>>>>>>> I'm using Slurm 2.1.15 and Open MPI 1.5 with PSM.
>>>>>>>
>>>>>>> I'm sure I'm missing a step.
>>>>>>>
>>>>>>> Thanks
>>>>>>> _______________________________________________
>>>>>>> users mailing list
>>>>>>> users_at_[hidden]
>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>>
>>>
>>
>
>
>