Open MPI Development Mailing List Archives

From: Josh Hursey (jjhursey_at_[hidden])
Date: 2006-07-05 18:35:45


I agree with Ralph: this code should work fine (we do the same thing
internally in orte_ras_base_node_query()). You could try adding a 'dump' of
the GPR to make sure that the node segment has information in it. Add a call
like the following to your function:

  orte_gpr.dump_segment(NULL);

or better yet

  orte_gpr.dump_segment(ORTE_NODE_SEGMENT);

That should print out the node segment your function would be reading from.
The problem may lie elsewhere, and this output will help us pinpoint it.
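
To make the points in this thread concrete (a successful "get" can still
return zero entries, a dump shows what the segment holds, and the returned
values should be released), here is a minimal, self-contained sketch. It does
not use the real ORTE API: mock_value_t, mock_segment_t, mock_dump_segment(),
mock_get() and count_nodes() are all hypothetical stand-ins invented for
illustration, and the real registry types and release mechanism will differ.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical stand-in for a registry value; NOT orte_gpr_value_t. */
typedef struct {
    const char *node_name;
} mock_value_t;

/* Hypothetical stand-in for the node segment. */
typedef struct {
    const char **names;
    size_t len;
} mock_segment_t;

/* Stand-in for a segment dump: print what a later "get" would see. */
void mock_dump_segment(const mock_segment_t *seg, FILE *out)
{
    size_t i;

    fprintf(out, "segment holds %lu entries\n", (unsigned long)seg->len);
    for (i = 0; i < seg->len; i++) {
        fprintf(out, "  [%lu] %s\n", (unsigned long)i, seg->names[i]);
    }
}

/* Stand-in for a registry "get". Returns 0 on success, including the
 * case where the segment is empty (cnt is then 0 and values is NULL).
 * On success the caller owns the returned array and each element. */
int mock_get(const mock_segment_t *seg, size_t *cnt, mock_value_t ***values)
{
    size_t i;
    mock_value_t **out;

    assert(seg != NULL && cnt != NULL && values != NULL);
    *cnt = 0;
    *values = NULL;
    if (seg->len == 0) {
        return 0;               /* success, but nothing stored */
    }
    out = malloc(seg->len * sizeof(*out));
    if (out == NULL) {
        return -1;
    }
    for (i = 0; i < seg->len; i++) {
        out[i] = malloc(sizeof(*out[i]));
        if (out[i] == NULL) {
            while (i > 0) {     /* undo the partial allocation */
                free(out[--i]);
            }
            free(out);
            return -1;
        }
        out[i]->node_name = seg->names[i];
    }
    *cnt = seg->len;
    *values = out;
    return 0;
}

/* Dump the segment, count its entries, then release everything the
 * "get" handed back. Returns -1 on failure so that an error is not
 * confused with a successful lookup that found zero nodes. */
int count_nodes(const mock_segment_t *seg)
{
    size_t cnt = 0, i;
    mock_value_t **values = NULL;

    mock_dump_segment(seg, stderr);     /* debugging aid */
    if (mock_get(seg, &cnt, &values) != 0) {
        return -1;
    }
    for (i = 0; i < cnt; i++) {
        free(values[i]);
    }
    free(values);
    return (int)cnt;
}
```

In the real function, the dump would be the orte_gpr.dump_segment() call
suggested above, and the returned values would presumably be released through
Open MPI's own mechanism rather than plain free(). Returning a distinct error
value is a design choice of this sketch; the original function returns 0 in
both the error case and the empty case.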

Cheers,
Josh

> I'm running this on my Mac, where I expected to get back only
> localhost. I upgraded to 1.0.2 a little while back; until then I had
> been using one of the alphas (I think it was alpha 9, but I can't be
> sure), and at that point this function returned '1' on my Mac.
>
> -- Nathan
> Correspondence
> ---------------------------------------------------------------------
> Nathan DeBardeleben, Ph.D.
> Los Alamos National Laboratory
> Parallel Tools Team
> High Performance Computing Environments
> phone: 505-667-3428
> email: ndebard_at_[hidden]
> ---------------------------------------------------------------------
>
>
>
> Ralph H Castain wrote:
>> Rc=0 indicates that the "get" function was successful, so this means
>> that there were no nodes on the NODE_SEGMENT. Were you running this in
>> an environment where nodes had been allocated to you? Or were you
>> expecting to find only "localhost" on the segment?
>>
>> I'm not entirely sure, but I don't believe there have been significant
>> changes in 1.0.2 for some time. My guess is that something has changed
>> on your system as opposed to in the OpenMPI code you're using. Did you
>> do an update recently and then begin seeing this behavior? Your
>> revision level is 1000+ behind the current repository, so my guess is
>> that you haven't updated for a while. Since 1.0.2 is under maintenance
>> for bugs only, that shouldn't be a problem. I'm just trying to
>> understand why your function is doing something different if the
>> OpenMPI code you're using hasn't changed.
>>
>> Ralph
>>
>>
>>
>> On 7/5/06 2:40 PM, "Nathan DeBardeleben" <ndebard_at_[hidden]> wrote:
>>
>>
>>>> Open MPI: 1.0.2
>>>> Open MPI SVN revision: r9571
>>>>
>>> The rc value returned by the 'get' call is '0'.
>>> All I'm doing is calling init with my own daemon name, it's coming up
>>> fine, then I immediately call this to figure out how many nodes are
>>> associated with this machine.
>>>
>>> -- Nathan
>>>
>>>
>>>
>>> Ralph H Castain wrote:
>>>
>>>> Hi Nathan
>>>>
>>>> Could you tell us which version of the code you are using, and print
>>>> out the rc value that was returned by the "get" call? I see nothing
>>>> obviously wrong with the code, but much depends on what happened
>>>> prior to this call too.
>>>>
>>>> BTW: you might want to release the memory stored in the returned
>>>> values - it could represent a substantial memory leak.
>>>>
>>>> Ralph
>>>>
>>>>
>>>>
>>>> On 7/5/06 9:28 AM, "Nathan DeBardeleben" <ndebard_at_[hidden]> wrote:
>>>>
>>>>
>>>>
>>>>> I used to use this code to get the number of nodes in a cluster /
>>>>> machine / whatever:
>>>>>
>>>>>
>>>>>> int
>>>>>> get_num_nodes(void)
>>>>>> {
>>>>>>     int rc;
>>>>>>     size_t cnt;
>>>>>>     orte_gpr_value_t **values;
>>>>>>
>>>>>>     rc = orte_gpr.get(ORTE_GPR_KEYS_OR|ORTE_GPR_TOKENS_OR,
>>>>>>                       ORTE_NODE_SEGMENT, NULL, NULL, &cnt,
>>>>>>                       &values);
>>>>>>
>>>>>>     if(rc != ORTE_SUCCESS) {
>>>>>>         return 0;
>>>>>>     }
>>>>>>
>>>>>>     return cnt;
>>>>>> }
>>>>>>
>>>>>>
>>>>> This now returns '0' on my Mac when it used to return 1. Is this not
>>>>> an acceptable way of doing this? Is there a cleaner / better way
>>>>> these days?
>>>>>
>>>>>
>>>> _______________________________________________
>>>> devel mailing list
>>>> devel_at_[hidden]
>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel