
Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] srun and openmpi
From: Ralph Castain (rhc_at_[hidden])
Date: 2011-04-29 10:30:06


Hi Michael

I'm told that the Qlogic contacts we used to have are no longer there. Since you obviously are a customer, can you ping them and ask (a) what that error message means, and (b) what's wrong with the values I computed?

You can also just send them my way, if that would help. We just need someone to explain the requirements on that precondition value.

Thanks
Ralph

On Apr 29, 2011, at 8:12 AM, Ralph Castain wrote:

>
> On Apr 29, 2011, at 8:05 AM, Michael Di Domenico wrote:
>
>> On Fri, Apr 29, 2011 at 10:01 AM, Michael Di Domenico
>> <mdidomenico4_at_[hidden]> wrote:
>>> On Fri, Apr 29, 2011 at 4:52 AM, Ralph Castain <rhc_at_[hidden]> wrote:
>>>> Hi Michael
>>>>
>>>> Please see the attached updated patch to try for 1.5.3. I mistakenly freed the envar after adding it to the environ :-/
>>>
>>> The patch works great; I can now see the precondition environment
>>> variable if I do
>>>
>>> mpirun -n 2 -host node1 <prog>
>>>
>>> and my <prog> runs just fine. However, if I do
>>>
>>> srun --resv-ports -n 2 -w node1 <prog>
>>>
>>> I get
>>>
>>> [node1:16780] PSM EP connect error (unknown connect error):
>>> [node1:16780] node1
>>> [node1:16780] PSM EP connect error (Endpoint could not be reached):
>>> [node1:16780] node1
>>>
>>> PML add procs failed
>>> --> Returned "Error" (-1) instead of "Success" (0)
>>>
>>> I did notice a difference in the precondition env variable between the two runs
>>>
>>> mpirun -n 2 -host node1 <prog>
>>>
>>> sets precondition_transports=fbc383997ee1b668-00d40f1401d2e827 (which
>>> changes with each run, i.e., it appears random)
>
> I didn't change anything about the way mpirun works, so this is expected.
>
>>
>>>
>>> srun --resv-ports -n 2 -w node1 <prog>
>>
>> this should have been "srun --resv-ports -n 1 -w node1 <prog>"; I
>> can't run a 2-rank job, I get the PML error above
>>
>>>
>>> sets precondition_transports=0000184500000000-0000000100000000 (which
>>> doesn't seem to change run to run)
>
> The value would indeed look quite different. Since I can't use a random value (so each proc can compute the same result), I simply used the SLURM_JOBID and SLURM_STEPID. I would therefore have expected that the first field (based on the jobid) would remain the same, and the second would change each time you did an "srun" within the same job.
>
> I'm afraid I don't know the significance of the fields, so I can't say why psm can't make the connection. I'll have to ping someone more knowledgeable to see why those values aren't acceptable.
>
>
>>>
>>
>> _______________________________________________
>> users mailing list
>> users_at_[hidden]
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>