Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] MPI process dies with a route error when using dynamic process calls to connect more than 2 clients to a server with InfiniBand
From: Philippe (philmpi_at_[hidden])
Date: 2010-08-24 11:02:59


Awesome, I'll give it a spin! With the parameters as below?

p.

On Tue, Aug 24, 2010 at 10:47 AM, Ralph Castain <rhc_at_[hidden]> wrote:
> I think I have this working now - try anything on or after r23647
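A checkout sketch for grabbing that revision; the Subversion URL is Open MPI's historical trunk location and the build steps are the usual ones for a trunk checkout, both assumptions rather than anything quoted in the thread:

    # Sketch: check out and build the fixed revision of the dev trunk.
    svn checkout -r 23647 http://svn.open-mpi.org/svn/ompi/trunk ompi-trunk
    cd ompi-trunk
    ./autogen.sh && ./configure --prefix=$HOME/ompi && make all install
    # (autogen.pl instead of autogen.sh on some trunk vintages)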
>
>
> On Aug 23, 2010, at 1:36 PM, Philippe wrote:
>
>> Sure. I took a guess at ppn and nodes for the case where 2 processes
>> are on the same node... I don't claim these are the right values ;-)
>>
>>
>>
>> c0301b10e1 ~/mpi> env|grep OMPI
>> OMPI_MCA_orte_nodes=c0301b10e1
>> OMPI_MCA_orte_rank=0
>> OMPI_MCA_orte_ppn=2
>> OMPI_MCA_orte_num_procs=2
>> OMPI_MCA_oob_tcp_static_ports_v6=10000-11000
>> OMPI_MCA_ess=generic
>> OMPI_MCA_orte_jobid=9999
>> OMPI_MCA_oob_tcp_static_ports=10000-11000
>> c0301b10e1 ~/hpa/benchmark/mpi> ./ben1 1 1 1
>> [c0301b10e1:22827] [[0,9999],0] assigned port 10001
>> [c0301b10e1:22827] [[0,9999],0] accepting connections via event library
>> minsize=1 maxsize=1 delay=1.000000
>>
>> <no more output after that>
>>
>>
>> c0301b10e1 ~/mpi> env|grep OMPI
>> OMPI_MCA_orte_nodes=c0301b10e1
>> OMPI_MCA_orte_rank=1
>> OMPI_MCA_orte_ppn=2
>> OMPI_MCA_orte_num_procs=2
>> OMPI_MCA_oob_tcp_static_ports_v6=10000-11000
>> OMPI_MCA_ess=generic
>> OMPI_MCA_orte_jobid=9999
>> OMPI_MCA_oob_tcp_static_ports=10000-11000
>> c0301b10e1 ~/hpa/benchmark/mpi> ./ben1 1 1 1
>> [c0301b10e1:22830] [[0,9999],1] assigned port 10002
>> [c0301b10e1:22830] [[0,9999],1] accepting connections via event library
>> [c0301b10e1:22830] [[0,9999],1]-[[0,0],0] mca_oob_tcp_send_nb: tag 15 size 189
>> [c0301b10e1:22830] [[0,9999],1]-[[0,0],0]
>> mca_oob_tcp_peer_try_connect: connecting port 10002 to:
>> 10.4.72.110:10000
>> [c0301b10e1:22830] [[0,9999],1]-[[0,0],0]
>> mca_oob_tcp_peer_complete_connect: connection failed: Connection
>> refused (111) - retrying
>> [c0301b10e1:22830] [[0,9999],1]-[[0,0],0]
>> mca_oob_tcp_peer_try_connect: connecting port 10002 to:
>> 10.4.72.110:10000
>> [c0301b10e1:22830] [[0,9999],1]-[[0,0],0]
>> mca_oob_tcp_peer_complete_connect: connection failed: Connection
>> refused (111) - retrying
>> [c0301b10e1:22830] [[0,9999],1]-[[0,0],0]
>> mca_oob_tcp_peer_try_connect: connecting port 10002 to:
>> 10.4.72.110:10000
>> [c0301b10e1:22830] [[0,9999],1]-[[0,0],0]
>> mca_oob_tcp_peer_complete_connect: connection failed: Connection
>> refused (111) - retrying
>>
>> <repeats..>
>>
>>
>> Thanks!
>> p.
>>
>>
>> On Mon, Aug 23, 2010 at 3:24 PM, Ralph Castain <rhc_at_[hidden]> wrote:
>>> Can you send me the values you are using for the relevant envars? That way I can try to replicate here
>>>
>>>
>>> On Aug 23, 2010, at 1:15 PM, Philippe wrote:
>>>
>>>> I took a look at the code, but I'm afraid I don't see anything wrong.
>>>>
>>>> p.
>>>>
>>>> On Thu, Aug 19, 2010 at 2:32 PM, Ralph Castain <rhc_at_[hidden]> wrote:
>>>>> Yes, that is correct - we reserve the first port in the range for a daemon,
>>>>> should one exist.
>>>>> The problem is clearly that get_node_rank is returning the wrong value for
>>>>> the second process (your rank=1). If you want to dig deeper, look at the
>>>>> orte/mca/ess/generic code where it generates the nidmap and pidmap. There is
>>>>> a bug down there somewhere that gives the wrong answer when ppn > 1.
>>>>>
>>>>>
>>>>> On Thu, Aug 19, 2010 at 12:12 PM, Philippe <philmpi_at_[hidden]> wrote:
>>>>>>
>>>>>> Ralph,
>>>>>>
>>>>>> Somewhere in ./orte/mca/oob/tcp/oob_tcp.c, there is this code and comment:
>>>>>>
>>>>>>                orte_node_rank_t nrank;
>>>>>>                /* do I know my node_local_rank yet? */
>>>>>>                if (ORTE_NODE_RANK_INVALID !=
>>>>>>                        (nrank = orte_ess.get_node_rank(ORTE_PROC_MY_NAME)) &&
>>>>>>                    (nrank+1) < opal_argv_count(mca_oob_tcp_component.tcp4_static_ports)) {
>>>>>>                    /* any daemon takes the first entry, so we start
>>>>>>                     * with the second */
>>>>>>
>>>>>> which seems consistent with process #0 listening on 10001. The question
>>>>>> would be why process #1 attempts to connect to port 10000, then? Or
>>>>>> maybe it's totally unrelated :-)
>>>>>>
>>>>>> By the way, if I trick process #1 into opening the connection to 10001
>>>>>> by shifting the range, I now get this error and the process terminates
>>>>>> immediately:
>>>>>>
>>>>>> [c0301b10e1:03919] [[0,9999],1]-[[0,0],0]
>>>>>> mca_oob_tcp_peer_recv_connect_ack: received unexpected process
>>>>>> identifier [[0,9999],0]
>>>>>>
>>>>>> good luck with the surgery and wishing you a prompt recovery!
>>>>>>
>>>>>> p.
>>>>>>
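To make the arithmetic concrete, here is a sketch of the port pick implied by the oob_tcp.c snippet above. It models only the arithmetic, not the actual parsing of the range string; the daemon reserving the first entry is taken from the code comment.

    BASE=10000                  # first port of the static range 10000-11000
    NRANK=0                     # node-local rank from orte_ess.get_node_rank()
    PORT=$((BASE + NRANK + 1))  # daemon takes entry 0, so rank N gets entry N+1
    echo "rank $NRANK listens on port $PORT"   # rank 0 -> 10001, rank 1 -> 10002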
>>>>>> On Thu, Aug 19, 2010 at 2:02 PM, Ralph Castain <rhc_at_[hidden]> wrote:
>>>>>>> Something doesn't look right - here is what the algorithm attempts to
>>>>>>> do: given a port range of 10000-12000, the lowest-ranked process on the
>>>>>>> node should open port 10000. The next lowest rank on the node will open
>>>>>>> 10001, etc.
>>>>>>> So it looks to me like there is some confusion in the local rank
>>>>>>> algorithm. I'll have to look at the generic module - there must be a bug
>>>>>>> in it somewhere. This might take a couple of days as I have surgery
>>>>>>> tomorrow morning, so please forgive the delay.
>>>>>>>
>>>>>>> On Thu, Aug 19, 2010 at 11:13 AM, Philippe <philmpi_at_[hidden]> wrote:
>>>>>>>>
>>>>>>>> Ralph,
>>>>>>>>
>>>>>>>> I'm able to use the generic module when the processes are on different
>>>>>>>> machines.
>>>>>>>>
>>>>>>>> What would the values of the envars be when two processes are on the
>>>>>>>> same machine (hopefully talking over SHM)?
>>>>>>>>
>>>>>>>> I've played with combinations of nodelist and ppn, but no luck. I get
>>>>>>>> errors like:
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> [c0301b10e1:03172] [[0,9999],1] -> [[0,0],0] (node: c0301b10e1)
>>>>>>>> oob-tcp: Number of attempts to create TCP connection has been
>>>>>>>> exceeded.  Can not communicate with peer
>>>>>>>> [c0301b10e1:03172] [[0,9999],1] ORTE_ERROR_LOG: Unreachable in file
>>>>>>>> grpcomm_hier_module.c at line 303
>>>>>>>> [c0301b10e1:03172] [[0,9999],1] ORTE_ERROR_LOG: Unreachable in file
>>>>>>>> base/grpcomm_base_modex.c at line 470
>>>>>>>> [c0301b10e1:03172] [[0,9999],1] ORTE_ERROR_LOG: Unreachable in file
>>>>>>>> grpcomm_hier_module.c at line 484
>>>>>>>>
>>>>>>>> --------------------------------------------------------------------------
>>>>>>>> It looks like MPI_INIT failed for some reason; your parallel process is
>>>>>>>> likely to abort.  There are many reasons that a parallel process can
>>>>>>>> fail during MPI_INIT; some of which are due to configuration or
>>>>>>>> environment
>>>>>>>> problems.  This failure appears to be an internal failure; here's some
>>>>>>>> additional information (which may only be relevant to an Open MPI
>>>>>>>> developer):
>>>>>>>>
>>>>>>>>  orte_grpcomm_modex failed
>>>>>>>>  --> Returned "Unreachable" (-12) instead of "Success" (0)
>>>>>>>>
>>>>>>>> --------------------------------------------------------------------------
>>>>>>>> *** The MPI_Init() function was called before MPI_INIT was invoked.
>>>>>>>> *** This is disallowed by the MPI standard.
>>>>>>>> *** Your MPI job will now abort.
>>>>>>>> [c0301b10e1:3172] Abort before MPI_INIT completed successfully; not
>>>>>>>> able to guarantee that all other processes were killed!
>>>>>>>>
>>>>>>>>
>>>>>>>> Maybe a related question is how to assign the TCP port range, and how
>>>>>>>> is it used? When the processes are on different machines, I use the
>>>>>>>> same range and that's OK as long as the range is free. But when the
>>>>>>>> processes are on the same node, what value should the range be for
>>>>>>>> each process? My range is 10000-12000 (for both processes), and I see
>>>>>>>> that the process with rank #0 listens on port 10001 while the process
>>>>>>>> with rank #1 tries to connect to port 10000.
>>>>>>>>
>>>>>>>> Thanks so much!
>>>>>>>> p. still here... still trying... ;-)
>>>>>>>>
>>>>>>>> On Tue, Jul 27, 2010 at 12:58 AM, Ralph Castain <rhc_at_[hidden]> wrote:
>>>>>>>>> Use what hostname returns - don't worry about IP addresses as we'll
>>>>>>>>> discover them.
>>>>>>>>>
>>>>>>>>> On Jul 26, 2010, at 10:45 PM, Philippe wrote:
>>>>>>>>>
>>>>>>>>>> Thanks a lot!
>>>>>>>>>>
>>>>>>>>>> Now, for the envar "OMPI_MCA_orte_nodes", what do I put exactly? Our
>>>>>>>>>> nodes have a short/long name (it's RHEL 5.x, so the command hostname
>>>>>>>>>> returns the long name) and at least 2 IP addresses.
>>>>>>>>>>
>>>>>>>>>> p.
>>>>>>>>>>
>>>>>>>>>> On Tue, Jul 27, 2010 at 12:06 AM, Ralph Castain <rhc_at_[hidden]> wrote:
>>>>>>>>>>> Okay, fixed in r23499. Thanks again...
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Jul 26, 2010, at 9:47 PM, Ralph Castain wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Doh - yes it should! I'll fix it right now.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks!
>>>>>>>>>>>>
>>>>>>>>>>>> On Jul 26, 2010, at 9:28 PM, Philippe wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Ralph,
>>>>>>>>>>>>>
>>>>>>>>>>>>> I was able to test the generic module, and it seems to be working.
>>>>>>>>>>>>>
>>>>>>>>>>>>> One question, though: the function orte_ess_generic_component_query
>>>>>>>>>>>>> in "orte/mca/ess/generic/ess_generic_component.c" calls getenv with
>>>>>>>>>>>>> the argument "OMPI_MCA_env", which seems to cause the module to
>>>>>>>>>>>>> fail to load. Shouldn't it be "OMPI_MCA_ess"?
>>>>>>>>>>>>>
>>>>>>>>>>>>> .....
>>>>>>>>>>>>>
>>>>>>>>>>>>>   /* only pick us if directed to do so */
>>>>>>>>>>>>>   if (NULL != (pick = getenv("OMPI_MCA_env")) &&
>>>>>>>>>>>>>                0 == strcmp(pick, "generic")) {
>>>>>>>>>>>>>       *priority = 1000;
>>>>>>>>>>>>>       *module = (mca_base_module_t *)&orte_ess_generic_module;
>>>>>>>>>>>>>
>>>>>>>>>>>>> ...
>>>>>>>>>>>>>
>>>>>>>>>>>>> p.
>>>>>>>>>>>>>
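A stopgap sketch for builds predating the fix: since the buggy check reads "OMPI_MCA_env", exporting the misspelled name alongside the real one should let the component load. This is an inference from the snippet above, not something tried in the thread.

    export OMPI_MCA_ess=generic   # the correct selector
    export OMPI_MCA_env=generic   # satisfies the misspelled getenv() check pre-r23499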
>>>>>>>>>>>>> On Thu, Jul 22, 2010 at 5:53 PM, Ralph Castain <rhc_at_[hidden]> wrote:
>>>>>>>>>>>>>> Dev trunk looks okay right now - I think you'll be fine using it.
>>>>>>>>>>>>>> My new component -might- work with 1.5, but probably not with
>>>>>>>>>>>>>> 1.4. I haven't checked either of them.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Anything at r23478 or above will have the new module. Let me know
>>>>>>>>>>>>>> how it works for you. I haven't tested it myself, but am pretty
>>>>>>>>>>>>>> sure it should work.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Jul 22, 2010, at 3:22 PM, Philippe wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Ralph,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thank you so much!!
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I'll give it a try and let you know.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I know it's a tough question, but how stable is the dev trunk?
>>>>>>>>>>>>>>> Can I just grab the latest and run, or am I better off taking
>>>>>>>>>>>>>>> your changes and copying them back into a stable release? (If
>>>>>>>>>>>>>>> so, which one? 1.4? 1.5?)
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> p.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Thu, Jul 22, 2010 at 3:50 PM, Ralph Castain <rhc_at_[hidden]> wrote:
>>>>>>>>>>>>>>>> It was easier for me to just construct this module than to
>>>>>>>>>>>>>>>> explain how to do so :-)
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I will commit it this evening (couple of hours from now), as
>>>>>>>>>>>>>>>> that is our standard practice. You'll need to use the
>>>>>>>>>>>>>>>> developer's trunk, though, to use it.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Here are the envars you'll need to provide:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Each process needs to get the same following values:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> * OMPI_MCA_ess=generic
>>>>>>>>>>>>>>>> * OMPI_MCA_orte_num_procs=<number of MPI procs>
>>>>>>>>>>>>>>>> * OMPI_MCA_orte_nodes=<a comma-separated list of nodenames
>>>>>>>>>>>>>>>>   where MPI procs reside>
>>>>>>>>>>>>>>>> * OMPI_MCA_orte_ppn=<number of procs/node>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Note that I have assumed this last value is a constant for
>>>>>>>>>>>>>>>> simplicity. If that isn't the case, let me know - you could
>>>>>>>>>>>>>>>> instead provide it as a comma-separated list of values with an
>>>>>>>>>>>>>>>> entry for each node.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> In addition, you need to provide the following value that will
>>>>>>>>>>>>>>>> be unique to each process:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> * OMPI_MCA_orte_rank=<MPI rank>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Finally, you have to provide a range of static TCP ports for
>>>>>>>>>>>>>>>> use by the processes. Pick any range that you know will be
>>>>>>>>>>>>>>>> available across all the nodes. You then need to ensure that
>>>>>>>>>>>>>>>> each process sees the following envar:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> * OMPI_MCA_oob_tcp_static_ports=6000-6010  <== obviously,
>>>>>>>>>>>>>>>>   replace this with your range
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> You will need a port range that is at least equal to the ppn
>>>>>>>>>>>>>>>> for the job (each proc on a node will take one of the provided
>>>>>>>>>>>>>>>> ports).
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> That should do it. I compute everything else I need from those
>>>>>>>>>>>>>>>> values.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Does that work for you?
>>>>>>>>>>>>>>>> Ralph
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
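For reference, a minimal same-node launch sketch that pulls the recipe above together, matching the transcripts at the top of the thread. The binary name ben1 and its arguments come from those transcripts; the script itself is an illustration, not something posted in the thread.

    #!/bin/sh
    # Values every process must share, per the recipe above.
    export OMPI_MCA_ess=generic
    export OMPI_MCA_orte_jobid=9999                   # jobid used in the transcripts
    export OMPI_MCA_orte_num_procs=2
    export OMPI_MCA_orte_ppn=2                        # 2 procs on this one node
    export OMPI_MCA_orte_nodes=$(hostname)            # "use what hostname returns"
    export OMPI_MCA_oob_tcp_static_ports=10000-11000  # any free range, width >= ppn

    # Per-process value: the MPI rank.
    OMPI_MCA_orte_rank=0 ./ben1 1 1 1 &
    OMPI_MCA_orte_rank=1 ./ben1 1 1 1 &
    wait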