Open MPI logo

Open MPI Development Mailing List Archives

  |   Home   |   Support   |   FAQ   |   all Development mailing list

Subject: Re: [OMPI devel] [OMPI svn] svn:open-mpi r29917 - trunk/ompi/mca/rte/orte
From: Ralph Castain (rhc_at_[hidden])
Date: 2013-12-15 14:36:10


Sure you can - just find the ompi_proc_t without setting the thread lock since (as you point out) it is already being held. Or we could pass the ompi_proc_t into the call, if necessary. Either way should work.

On Dec 15, 2013, at 11:30 AM, George Bosilca <bosilca_at_[hidden]> wrote:

> The current code is correct. If the goal is to continue to retrieve the proc_name in a call that is asking for something else, I don’t see how you can remove this circular dependency between setting and using the ompi_procs.
>
> George.
>
> On Dec 15, 2013, at 13:52 , Ralph Castain <rhc_at_[hidden]> wrote:
>
>> Okay - then let's correct the code rather than lose the debug. I'll fix that problem.
>>
>> Thanks
>> Ralph
>>
>> On Dec 15, 2013, at 9:54 AM, George Bosilca <bosilca_at_[hidden]> wrote:
>>
>>> I understand your reasons but the code as it was in the trunk is not correct. In most of the cases when you reach one of the ompi_rte_db_fetch calls, you are setting up an ompi_proc … which means you own the ompi_proc_lock mutex. As the ompi_rte_db_fetch was calling back into the proc infrastructure to find a proc, it was deadlocking on acquiring the ompi_proc_lock mutex as locking this mutex it is the first thing ompi_proc_find is doing.
>>>
>>> A quick grep indicates that most places where the proc_hostname is used are capable of handling NULL, so avoiding a deadlock in exchange for few hostname replaced by NULL in the output seemed like a acceptable approach to me.
>>>
>>> George.
>>>
>>> On Dec 15, 2013, at 12:18 , Ralph Castain <rhc_at_[hidden]> wrote:
>>>
>>>> This actually creates a bit of a problem. The reason we did this was because the OMPI-layer "show-help" calls want to report the hostname of the proc. Since we don't retrieve that info by default, the show-help calls all fail due to a NULL pointer.
>>>>
>>>> Nathan tried wrapping all the show-help calls with a modex-fetch of hostname, but that isn't a good solution as the fetch might fail depending on the problem we are trying to report.
>>>>
>>>> We also noted that the modex recv's current implemented all fetched the complete RTE-level info whenever any info was requested for that proc. So the fetch of the hostname was a very low cost operation.
>>>>
>>>> So we decided to always ensure we load the hostname info if it hasn't already been done, thus keeping the show-help operations functional.
>>>>
>>>> Make sense? Or do you have an alternative method?
>>>> Ralph
>>>>
>>>>
>>>> On Dec 15, 2013, at 8:54 AM, svn-commit-mailer_at_[hidden] wrote:
>>>>
>>>>> Author: bosilca (George Bosilca)
>>>>> Date: 2013-12-15 11:54:01 EST (Sun, 15 Dec 2013)
>>>>> New Revision: 29917
>>>>> URL: https://svn.open-mpi.org/trac/ompi/changeset/29917
>>>>>
>>>>> Log:
>>>>> Don't be greedy, just do what we asked for.
>>>>>
>>>>> Text files modified:
>>>>> trunk/ompi/mca/rte/orte/rte_orte_module.c | 15 ---------------
>>>>> 1 files changed, 0 insertions(+), 15 deletions(-)
>>>>>
>>>>> Modified: trunk/ompi/mca/rte/orte/rte_orte_module.c
>>>>> ==============================================================================
>>>>> --- trunk/ompi/mca/rte/orte/rte_orte_module.c Sun Dec 15 11:49:27 2013 (r29916)
>>>>> +++ trunk/ompi/mca/rte/orte/rte_orte_module.c 2013-12-15 11:54:01 EST (Sun, 15 Dec 2013) (r29917)
>>>>> @@ -153,11 +153,6 @@
>>>>> if (OPAL_SUCCESS != (rc = opal_db.fetch((opal_identifier_t*)nm, key, data, type))) {
>>>>> return rc;
>>>>> }
>>>>> - /* update the hostname */
>>>>> - proct = ompi_proc_find(nm);
>>>>> - if (NULL == proct->proc_hostname) {
>>>>> - opal_db.fetch_pointer((opal_identifier_t*)nm, ORTE_DB_HOSTNAME, (void**)&proct->proc_hostname, OPAL_STRING);
>>>>> - }
>>>>> return OMPI_SUCCESS;
>>>>> }
>>>>>
>>>>> @@ -171,11 +166,6 @@
>>>>> if (OPAL_SUCCESS != (rc = opal_db.fetch_pointer((opal_identifier_t*)nm, key, data, type))) {
>>>>> return rc;
>>>>> }
>>>>> - /* update the hostname */
>>>>> - proct = ompi_proc_find(nm);
>>>>> - if (NULL == proct->proc_hostname) {
>>>>> - opal_db.fetch_pointer((opal_identifier_t*)nm, ORTE_DB_HOSTNAME, (void**)&proct->proc_hostname, OPAL_STRING);
>>>>> - }
>>>>> return OMPI_SUCCESS;
>>>>> }
>>>>>
>>>>> @@ -191,11 +181,6 @@
>>>>> OPAL_SCOPE_GLOBAL, key, kvs))) {
>>>>> return rc;
>>>>> }
>>>>> - /* update the hostname */
>>>>> - proct = ompi_proc_find(nm);
>>>>> - if (NULL == proct->proc_hostname) {
>>>>> - opal_db.fetch_pointer((opal_identifier_t*)nm, ORTE_DB_HOSTNAME, (void**)&proct->proc_hostname, OPAL_STRING);
>>>>> - }
>>>>> return OMPI_SUCCESS;
>>>>> }
>>>>>
>>>>> _______________________________________________
>>>>> svn mailing list
>>>>> svn_at_[hidden]
>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/svn
>>>>
>>>> _______________________________________________
>>>> devel mailing list
>>>> devel_at_[hidden]
>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>
>>> _______________________________________________
>>> devel mailing list
>>> devel_at_[hidden]
>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>
>> _______________________________________________
>> devel mailing list
>> devel_at_[hidden]
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>
> _______________________________________________
> devel mailing list
> devel_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/devel