
Open MPI Development Mailing List Archives


Subject: Re: [OMPI devel] [OMPI svn] svn:open-mpi r21548
From: Ralph Castain (rhc_at_[hidden])
Date: 2009-07-01 17:45:09


On Jul 1, 2009, at 3:28 PM, George Bosilca wrote:

> I think I know why it didn't cause problems with SLURM and TORQUE.
> The routing was wrong, so the message was at one point forwarded to
> the HNP. As the HNP has direct connections to all other processes,
> it was able to deliver the message correctly. The only visible
> impact was two extra hops for all messages directed to the last
> daemon, which should have only a minimal impact on performance.

Aha! Good analysis - thanks! We weren't looking at startup performance
today, just running big jobs to test for MPI issues.
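
A toy model of the detour George describes may help; every name and
constant below is a hypothetical illustration, not ORTE code. A stale
routing entry sends traffic for the last daemon to the wrong parent,
which has no route and punts to the HNP; the HNP reaches every process
directly, so the message still arrives, just two hops late:

#include <stdio.h>

#define NUM_DAEMONS  8   /* hypothetical job size */
#define HNP          0   /* daemon 0 acts as the HNP */
#define WRONG_PARENT 3   /* where the stale table entry points */

/* Count hops from src to dst.  With 'broken' set, the route for the
 * last daemon first goes to the wrong parent, which has no entry for
 * dst and forwards to the HNP; the HNP delivers directly. */
static int count_hops(int src, int dst, int broken)
{
    int here = src, hops = 0;

    while (here != dst) {
        if (!broken || dst != NUM_DAEMONS - 1) {
            here = dst;           /* correct table: direct delivery */
        } else if (here == src) {
            here = WRONG_PARENT;  /* stale entry: wrong first hop */
        } else if (here != HNP) {
            here = HNP;           /* no route here: punt to the HNP */
        } else {
            here = dst;           /* HNP reaches everyone directly */
        }
        hops++;
    }
    return hops;
}

int main(void)
{
    printf("correct routing: %d hop(s)\n", count_hops(1, NUM_DAEMONS - 1, 0));
    printf("broken routing:  %d hop(s)\n", count_hops(1, NUM_DAEMONS - 1, 1));
    return 0;
}

Running it prints 1 hop for the correct table and 3 for the broken
one - the two extra jumps noted above.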

>
> Based on the content of the email related to the commit, I think
> this will fix the problem. Unfortunately, our svn server seems to be
> having some trouble right now (i.e., it doesn't respond at all), so I
> can't test it. I'll do so as soon as the svn server is back online.

Okay, let me know. I'll test some more here.

Thanks again for catching it.
Ralph

>
> Thanks,
> george.
>
> On Wed, 1 Jul 2009, Ralph Castain wrote:
>
>> I believe this is now fixed with r21582 - let me know if it now
>> works for you. Sorry for the problem. It was indeed miscounting the
>> number of daemons in the system, though apparently this wasn't
>> causing problems for SLURM and Torque (still investigating why,
>> since it should have). Unfortunately, just changing the index caused
>> shared memory to think everyone was remote, so the fix was a tad
>> more involved - though not particularly difficult.
>> Ralph
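
A minimal sketch of that shared-memory failure mode, with hypothetical
structures standing in for the real ORTE nidmap: locality is decided by
comparing a peer's recorded daemon vpid against one's own, so vpids
stored one slot off make even same-node peers look remote:

#include <stdbool.h>
#include <stdio.h>

#define NUM_NODES 4   /* hypothetical cluster size */

/* daemon vpid recorded for each node in a toy nidmap */
static int nidmap[NUM_NODES];

/* Fill the nidmap from the unpacked vpid array; 'shift' models the
 * indexing bug that stores each vpid one slot off. */
static void unpack_vpids(const int *vpids, int shift)
{
    for (int i = 0; i < NUM_NODES; i++) {
        nidmap[i] = (i + shift < NUM_NODES) ? vpids[i + shift] : -1;
    }
}

/* A peer is local iff its node's recorded daemon matches my daemon. */
static bool peer_is_local(int my_daemon_vpid, int peer_node)
{
    return nidmap[peer_node] == my_daemon_vpid;
}

int main(void)
{
    const int vpids[NUM_NODES] = {0, 1, 2, 3};
    const int my_node = 2;    /* node this process runs on */
    const int my_daemon = 2;  /* vpid learned from the local daemon */

    unpack_vpids(vpids, 0);   /* correct indexing */
    printf("correct: same-node peer is %s\n",
           peer_is_local(my_daemon, my_node) ? "local" : "remote");

    unpack_vpids(vpids, 1);   /* off-by-one indexing */
    printf("shifted: same-node peer is %s\n",
           peer_is_local(my_daemon, my_node) ? "local" : "remote");
    return 0;
}

With the correct indexing the same-node peer is reported local; with
the shifted indexing the lookup returns a different vpid, so the
shared-memory path would never be chosen.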
>> On Wed, Jul 1, 2009 at 2:06 PM, Ralph Castain <rhc_at_[hidden]>
>> wrote:
>> Hmmm...I'll take a look. It seems to be working for me under
>> Torque and SLURM, though I cannot vouch for the tree launch. The
>> problem with letting the index start at 0 is that it breaks other
>> things, so I'll have to see about fixing the routing schemes or
>> find some compromise.
>>
>> Thanks for the heads up.
>> Ralph
>> On Wed, Jul 1, 2009 at 1:49 PM, George Bosilca
>> <bosilca_at_[hidden]> wrote:
>> Ralph,
>>
>> This commit breaks several components in OMPI, mainly the routing
>> schemes and the tree launch. The problematic part is the reduction
>> of the number of declared daemons in the second part of the commit,
>> where you changed the lower boundary of the for loop from 0 to 1.
>> As a result, the number of daemons was decreased by one (I guess in
>> order to exclude the HNP), which is not something that the routing
>> implementations tolerate.
>>
>> Setting the loop boundary back to 0 seems to fix all problems.
>> Please reconsider your patch.
>>
>> george.
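
The miscount itself is easy to reproduce in isolation. A minimal
sketch, with hypothetical names in place of orte_nidmap and
ORTE_VPID_INVALID: since the HNP's node sits in slot 0, starting the
counting loop at i=1 reports one daemon too few, and any routing
structure sized from that count comes up short:

#include <stdio.h>

#define NUM_NODES 4   /* hypothetical job: HNP + 3 daemons */
#define INVALID  -1   /* stand-in for ORTE_VPID_INVALID */

/* daemon vpids per node; slot 0 belongs to the HNP's node */
static const int vpids[NUM_NODES] = {0, 1, 2, 3};

/* Count daemons the way the unpack code does, starting at 'start'. */
static int count_daemons(int start)
{
    int n = 0;
    for (int i = start; i < NUM_NODES; i++) {
        if (INVALID != vpids[i]) {
            n++;
        }
    }
    return n;
}

int main(void)
{
    printf("loop from 0: %d daemons\n", count_daemons(0)); /* 4: correct */
    printf("loop from 1: %d daemons\n", count_daemons(1)); /* 3: one short */
    return 0;
}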
>>
>> On Fri, 26 Jun 2009, rhc_at_[hidden] wrote:
>>
>> Author: rhc
>> Date: 2009-06-26 18:07:25 EDT (Fri, 26 Jun 2009)
>> New Revision: 21548
>> URL: https://svn.open-mpi.org/trac/ompi/changeset/21548
>>
>> Log:
>> Cleanup some indexing bugs so that shared memory can function
>>
>> Text files modified:
>> trunk/orte/util/nidmap.c | 12 +++++++-----
>> 1 files changed, 7 insertions(+), 5 deletions(-)
>>
>> Modified: trunk/orte/util/nidmap.c
>>
>> ============================================================================
>> --- trunk/orte/util/nidmap.c (original)
>> +++ trunk/orte/util/nidmap.c 2009-06-26 18:07:25 EDT (Fri, 26 Jun 2009)
>> @@ -341,10 +341,10 @@
>>
>>      /* pack every nodename individually */
>>      for (i=1; i < orte_node_pool->size; i++) {
>> +        if (NULL == (node = (orte_node_t*)opal_pointer_array_get_item(orte_node_pool, i))) {
>> +            continue;
>> +        }
>>          if (!orte_keep_fqdn_hostnames) {
>> -            if (NULL == (node = (orte_node_t*)opal_pointer_array_get_item(orte_node_pool, i))) {
>> -                continue;
>> -            }
>>              nodename = strdup(node->name);
>>              if (NULL != (ptr = strchr(nodename, '.'))) {
>>                  *ptr = '\0';
>> @@ -553,6 +553,8 @@
>>              ORTE_ERROR_LOG(rc);
>>              return rc;
>>          }
>> +        /* set the daemon to 0 */
>> +        node->daemon = 0;
>>
>>      /* loop over nodes and unpack the raw nodename */
>>      for (i=1; i < num_nodes; i++) {
>> @@ -570,7 +572,7 @@
>>          }
>>      }
>>
>> -    /* unpack the daemon names */
>> +    /* unpack the daemon vpids */
>>      vpids = (orte_vpid_t*)malloc(num_nodes * sizeof(orte_vpid_t));
>>      n=num_nodes;
>>      if (ORTE_SUCCESS != (rc = opal_dss.unpack(&buf, vpids, &n, ORTE_VPID))) {
>> @@ -581,7 +583,7 @@
>>       * daemons in the system
>>       */
>>      num_daemons = 0;
>> -    for (i=0; i < num_nodes; i++) {
>> +    for (i=1; i < num_nodes; i++) {
>>          if (NULL != (ndptr = (orte_nid_t*)opal_pointer_array_get_item(&orte_nidmap, i))) {
>>              ndptr->daemon = vpids[i];
>>              if (ORTE_VPID_INVALID != vpids[i]) {
>>
>> "We must accept finite disappointment, but we must never lose
>> infinite
>> hope."
>> Martin Luther King
>>
>
> "We must accept finite disappointment, but we must never lose infinite
> hope."
> Martin Luther
> King_______________________________________________