
Open MPI Development Mailing List Archives


Subject: Re: [OMPI devel] Intercomm Merge
From: Suraj Prabhakaran (suraj.prabhakaran_at_[hidden])
Date: 2013-09-25 08:00:36


Dear Ralph,

I am sorry, but I think I missed setting the plm verbosity to 5 last time. Here is the output of the complete program, with and without -novm, for the following mpiexec commands:

mpiexec -mca state_base_verbose 10 -mca errmgr_base_verbose 10 -mca plm_base_verbose 5 -mca btl tcp,sm,self -np 2 ./addhosttest
mpiexec -mca state_base_verbose 10 -mca errmgr_base_verbose 10 -mca plm_base_verbose 5 -mca btl tcp,sm,self -novm -np 2 ./addhosttest

Here you can see that although I spawn only one process on grsacc18, something also happens on grsacc19.

Sorry and thanks!
Suraj


On Sep 24, 2013, at 8:24 PM, Ralph Castain wrote:

> What I find puzzling is that I don't see any output indicating that you went thru the Torque launcher to launch the daemons - not a peep of debug output. This makes me suspicious that something else is going on. Are you sure you sent me all the output?
>
> Try adding -novm to your mpirun cmd line and let's see if that mode works
>
> On Sep 24, 2013, at 9:06 AM, Suraj Prabhakaran <suraj.prabhakaran_at_[hidden]> wrote:
>
>> Hi Ralph,
>>
>> So here is what I do. I spawn just a "single" process on a new node, one that is not in the $PBS_NODEFILE list.
>> My $PBS_NODEFILE list contains
>> grsacc20
>> grsacc19
>>
>> I then start the app with just 2 processes, so each host gets one process, and they are successfully spawned through Torque (via tm_spawn()). Open MPI will have stored grsacc20 and grsacc19 in its list of hosts with launchids 0 and 1, respectively.
>> I then use the add-host info and spawn ONE new process on a new host, "grsacc18", through MPI_Comm_spawn. From what I saw in the code, the launchid of this new host is -1, since Open MPI does not know about it and it is not listed in $PBS_NODEFILE. Since without the launchid Torque would not know where to spawn, I simply retrieve the correct launchid of this host from a file just before tm_spawn() and use it. This is the only modification I made to Open MPI.
>> So the host "grsacc18" gets a new launchid = 2 and is used to spawn the process through Torque. This worked perfectly up to and including 1.6.5.
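>> In other words, the change amounts to something like the following (a rough sketch only; the map file, its format, and the helper are made up for illustration, and this is not the actual Open MPI plm/tm source):
>>
>> /* Before tm_spawn() is called for a host whose launchid is -1 (i.e. a host
>>  * added via "add-host" that is not in $PBS_NODEFILE), look the real Torque
>>  * node id up from a map file written out of band and use that instead. */
>> #include <stdio.h>
>> #include <string.h>
>>
>> /* hypothetical helper: returns the Torque launchid for hostname, or -1 */
>> static int lookup_launch_id(const char *hostname)
>> {
>>     FILE *fp = fopen("/tmp/added_nodes.map", "r");    /* placeholder path */
>>     char name[256];
>>     int id, found = -1;
>>
>>     if (NULL == fp) return -1;
>>     while (2 == fscanf(fp, "%255s %d", name, &id)) {  /* lines: "<host> <id>" */
>>         if (0 == strcmp(name, hostname)) { found = id; break; }
>>     }
>>     fclose(fp);
>>     return found;
>> }
>>
>> /* ...and just before the tm_spawn() call in the Torque launcher:
>>  *
>>  *     if (launchid < 0) {
>>  *         launchid = lookup_launch_id(hostname);
>>  *     }
>>  *     rc = tm_spawn(argc, argv, env, launchid, &task_id, &event);
>>  */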
>>
>> As the outputs show, although I spawn only a single process on grsacc18, I have no idea why Open MPI also tries to spawn something on grsacc19. Of course, without PBS/Torque involved, everything works fine.
>> I have attached the simple test code. Please modify the hostnames and the executable path before use.
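>> In essence, the test boils down to something like this (a simplified sketch, not the attached file itself; the hostname and the executable path are placeholders):
>>
>> #include <mpi.h>
>>
>> int main(int argc, char **argv)
>> {
>>     MPI_Comm parent, intercomm, merged;
>>     MPI_Info info;
>>
>>     MPI_Init(&argc, &argv);
>>     MPI_Comm_get_parent(&parent);
>>
>>     if (MPI_COMM_NULL != parent) {
>>         /* child: merge with the parent side and finish */
>>         MPI_Intercomm_merge(parent, 1, &merged);
>>         MPI_Comm_free(&merged);
>>     } else {
>>         /* parent: add a host that is not in $PBS_NODEFILE and spawn ONE process there */
>>         MPI_Info_create(&info);
>>         MPI_Info_set(info, "add-host", "grsacc18");
>>         MPI_Comm_spawn("./addhosttest", MPI_ARGV_NULL, 1, info, 0,
>>                        MPI_COMM_WORLD, &intercomm, MPI_ERRCODES_IGNORE);
>>         MPI_Intercomm_merge(intercomm, 0, &merged);
>>         MPI_Info_free(&info);
>>         MPI_Comm_free(&merged);
>>         MPI_Comm_free(&intercomm);
>>     }
>>
>>     MPI_Finalize();
>>     return 0;
>> }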
>>
>> Best,
>> Suraj
>>
>> <addhosttest.c>
>>
>>
>> On Sep 24, 2013, at 4:59 PM, Ralph Castain wrote:
>>
>>> I'm going to need a little help here. The problem is that you launch two new daemons, and one of them exits immediately because it thinks it lost the connection back to mpirun - before it even gets a chance to create it.
>>>
>>> Can you give me a little more info as to exactly what you are doing? Perhaps send me your test code?
>>>
>>> On Sep 24, 2013, at 7:48 AM, Suraj Prabhakaran <suraj.prabhakaran_at_[hidden]> wrote:
>>>
>>>> Hi Ralph,
>>>>
>>>> Output attached in a file.
>>>> Thanks a lot!
>>>>
>>>> Best,
>>>> Suraj
>>>>
>>>> <output.rtf>
>>>>
>>>> On Sep 24, 2013, at 4:11 PM, Ralph Castain wrote:
>>>>
>>>>> Afraid I don't see the problem offhand - can you add the following to your cmd line?
>>>>>
>>>>> -mca state_base_verbose 10 -mca errmgr_base_verbose 10
>>>>>
>>>>> Thanks
>>>>> Ralph
>>>>>
>>>>> On Sep 24, 2013, at 6:35 AM, Suraj Prabhakaran <suraj.prabhakaran_at_[hidden]> wrote:
>>>>>
>>>>>> Hi Ralph,
>>>>>>
>>>>>> I always got this output from any MPI job that ran on our nodes. There seems to be a problem somewhere, but it never stopped the applications from running. Anyway, I ran it again now with only TCP and excluded InfiniBand, and I get the same output again, except that this time the openib-related error is no longer there. Printing out the log again.
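>>>>>> For reference, the run was roughly:
>>>>>>
>>>>>> mpiexec -mca plm_base_verbose 5 -mca btl tcp,sm,self -np 2 ./addhosttest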
>>>>>>
>>>>>> [grsacc20:04578] [[6160,0],0] plm:base:receive processing msg
>>>>>> [grsacc20:04578] [[6160,0],0] plm:base:receive job launch command from [[6160,1],0]
>>>>>> [grsacc20:04578] [[6160,0],0] plm:base:receive adding hosts
>>>>>> [grsacc20:04578] [[6160,0],0] plm:base:receive calling spawn
>>>>>> [grsacc20:04578] [[6160,0],0] plm:base:receive done processing commands
>>>>>> [grsacc20:04578] [[6160,0],0] plm:base:setup_job
>>>>>> [grsacc20:04578] [[6160,0],0] plm:base:setup_vm
>>>>>> [grsacc20:04578] [[6160,0],0] plm:base:setup_vm add new daemon [[6160,0],2]
>>>>>> [grsacc20:04578] [[6160,0],0] plm:base:setup_vm assigning new daemon [[6160,0],2] to node grsacc18
>>>>>> [grsacc20:04578] [[6160,0],0] plm:tm: launching vm
>>>>>> [grsacc20:04578] [[6160,0],0] plm:tm: final top-level argv:
>>>>>> orted -mca ess tm -mca orte_ess_jobid 403701760 -mca orte_ess_vpid <template> -mca orte_ess_num_procs 3 -mca orte_hnp_uri "403701760.0;tcp://192.168.222.20:35163" -mca plm_base_verbose 5 -mca btl tcp,sm,self
>>>>>> [grsacc20:04578] [[6160,0],0] plm:tm: launching on node grsacc19
>>>>>> [grsacc20:04578] [[6160,0],0] plm:tm: executing:
>>>>>> orted -mca ess tm -mca orte_ess_jobid 403701760 -mca orte_ess_vpid 1 -mca orte_ess_num_procs 3 -mca orte_hnp_uri "403701760.0;tcp://192.168.222.20:35163" -mca plm_base_verbose 5 -mca btl tcp,sm,self
>>>>>> [grsacc20:04578] [[6160,0],0] plm:tm: launching on node grsacc18
>>>>>> [grsacc20:04578] [[6160,0],0] plm:tm: executing:
>>>>>> orted -mca ess tm -mca orte_ess_jobid 403701760 -mca orte_ess_vpid 2 -mca orte_ess_num_procs 3 -mca orte_hnp_uri "403701760.0;tcp://192.168.222.20:35163" -mca plm_base_verbose 5 -mca btl tcp,sm,self
>>>>>> [grsacc20:04578] [[6160,0],0] plm:tm:launch: finished spawning orteds
>>>>>> [grsacc19:28821] mca:base:select:( plm) Querying component [rsh]
>>>>>> [grsacc19:28821] [[6160,0],1] plm:rsh_lookup on agent ssh : rsh path NULL
>>>>>> [grsacc19:28821] mca:base:select:( plm) Query of component [rsh] set priority to 10
>>>>>> [grsacc19:28821] mca:base:select:( plm) Selected component [rsh]
>>>>>> [grsacc19:28821] [[6160,0],1] plm:rsh_setup on agent ssh : rsh path NULL
>>>>>> [grsacc19:28821] [[6160,0],1] plm:base:receive start comm
>>>>>> [grsacc19:28821] [[6160,0],1] plm:base:receive stop comm
>>>>>> [grsacc18:16717] mca:base:select:( plm) Querying component [rsh]
>>>>>> [grsacc18:16717] [[6160,0],2] plm:rsh_lookup on agent ssh : rsh path NULL
>>>>>> [grsacc18:16717] mca:base:select:( plm) Query of component [rsh] set priority to 10
>>>>>> [grsacc18:16717] mca:base:select:( plm) Selected component [rsh]
>>>>>> [grsacc18:16717] [[6160,0],2] plm:rsh_setup on agent ssh : rsh path NULL
>>>>>> [grsacc18:16717] [[6160,0],2] plm:base:receive start comm
>>>>>> [grsacc20:04578] [[6160,0],0] plm:base:orted_report_launch from daemon [[6160,0],2]
>>>>>> [grsacc20:04578] [[6160,0],0] plm:base:orted_report_launch from daemon [[6160,0],2] on node grsacc18
>>>>>> [grsacc20:04578] [[6160,0],0] plm:base:orted_report_launch completed for daemon [[6160,0],2] at contact 403701760.2;tcp://192.168.222.18:44229
>>>>>> [grsacc20:04578] [[6160,0],0] plm:base:launch_apps for job [6160,2]
>>>>>> [grsacc20:04578] [[6160,0],0] plm:base:receive processing msg
>>>>>> [grsacc20:04578] [[6160,0],0] plm:base:receive update proc state command from [[6160,0],2]
>>>>>> [grsacc20:04578] [[6160,0],0] plm:base:receive got update_proc_state for job [6160,2]
>>>>>> [grsacc20:04578] [[6160,0],0] plm:base:receive got update_proc_state for vpid 0 state RUNNING exit_code 0
>>>>>> [grsacc20:04578] [[6160,0],0] plm:base:receive done processing commands
>>>>>> [grsacc20:04578] [[6160,0],0] plm:base:launch wiring up iof for job [6160,2]
>>>>>> [grsacc20:04578] [[6160,0],0] plm:base:receive processing msg
>>>>>> [grsacc20:04578] [[6160,0],0] plm:base:receive done processing commands
>>>>>> [grsacc20:04578] [[6160,0],0] plm:base:launch registered event
>>>>>> [grsacc20:04578] [[6160,0],0] plm:base:launch sending dyn release of job [6160,2] to [[6160,1],0]
>>>>>> [grsacc20:04578] [[6160,0],0] plm:base:orted_cmd sending orted_exit commands
>>>>>> [grsacc19:28815] [[6160,0],1] plm:base:receive stop comm
>>>>>> [grsacc20:04578] [[6160,0],0] plm:base:receive stop comm
>>>>>> -bash-4.1$ [grsacc18:16717] [[6160,0],2] plm:base:receive stop comm
>>>>>>
>>>>>> Best,
>>>>>> Suraj
>>>>>> On Sep 24, 2013, at 3:24 PM, Ralph Castain wrote:
>>>>>>
>>>>>>> Your output shows that it launched your apps, but they exited. The error is reported here, though it appears we aren't flushing the message out before exiting due to a race condition:
>>>>>>>
>>>>>>>> [grsacc20:04511] 1 more process has sent help message help-mpi-btl-openib.txt / no active ports found
>>>>>>>
>>>>>>> Here is the full text:
>>>>>>> [no active ports found]
>>>>>>> WARNING: There is at least non-excluded one OpenFabrics device found,
>>>>>>> but there are no active ports detected (or Open MPI was unable to use
>>>>>>> them). This is most certainly not what you wanted. Check your
>>>>>>> cables, subnet manager configuration, etc. The openib BTL will be
>>>>>>> ignored for this job.
>>>>>>>
>>>>>>> Local host: %s
>>>>>>>
>>>>>>> Looks like at least one node being used doesn't have an active Infiniband port on it?
>>>>>>>
>>>>>>>
>>>>>>> On Sep 24, 2013, at 6:11 AM, Suraj Prabhakaran <suraj.prabhakaran_at_[hidden]> wrote:
>>>>>>>
>>>>>>>> Hi Ralph,
>>>>>>>>
>>>>>>>> I tested it with the trunk at r29228 and I still have the following problem. Now it even spawns the daemon on the new node through Torque, but then suddenly quits. The following is the output. Can you please have a look?
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>> Suraj
>>>>>>>>
>>>>>>>> [grsacc20:04511] [[6253,0],0] plm:base:receive processing msg
>>>>>>>> [grsacc20:04511] [[6253,0],0] plm:base:receive job launch command from [[6253,1],0]
>>>>>>>> [grsacc20:04511] [[6253,0],0] plm:base:receive adding hosts
>>>>>>>> [grsacc20:04511] [[6253,0],0] plm:base:receive calling spawn
>>>>>>>> [grsacc20:04511] [[6253,0],0] plm:base:receive done processing commands
>>>>>>>> [grsacc20:04511] [[6253,0],0] plm:base:setup_job
>>>>>>>> [grsacc20:04511] [[6253,0],0] plm:base:setup_vm
>>>>>>>> [grsacc20:04511] [[6253,0],0] plm:base:setup_vm add new daemon [[6253,0],2]
>>>>>>>> [grsacc20:04511] [[6253,0],0] plm:base:setup_vm assigning new daemon [[6253,0],2] to node grsacc18
>>>>>>>> [grsacc20:04511] [[6253,0],0] plm:tm: launching vm
>>>>>>>> [grsacc20:04511] [[6253,0],0] plm:tm: final top-level argv:
>>>>>>>> orted -mca ess tm -mca orte_ess_jobid 409796608 -mca orte_ess_vpid <template> -mca orte_ess_num_procs 3 -mca orte_hnp_uri "409796608.0;tcp://192.168.222.20:53097" -mca plm_base_verbose 6
>>>>>>>> [grsacc20:04511] [[6253,0],0] plm:tm: launching on node grsacc19
>>>>>>>> [grsacc20:04511] [[6253,0],0] plm:tm: executing:
>>>>>>>> orted -mca ess tm -mca orte_ess_jobid 409796608 -mca orte_ess_vpid 1 -mca orte_ess_num_procs 3 -mca orte_hnp_uri "409796608.0;tcp://192.168.222.20:53097" -mca plm_base_verbose 6
>>>>>>>> [grsacc20:04511] [[6253,0],0] plm:tm: launching on node grsacc18
>>>>>>>> [grsacc20:04511] [[6253,0],0] plm:tm: executing:
>>>>>>>> orted -mca ess tm -mca orte_ess_jobid 409796608 -mca orte_ess_vpid 2 -mca orte_ess_num_procs 3 -mca orte_hnp_uri "409796608.0;tcp://192.168.222.20:53097" -mca plm_base_verbose 6
>>>>>>>> [grsacc20:04511] [[6253,0],0] plm:tm:launch: finished spawning orteds
>>>>>>>> [grsacc19:28754] mca:base:select:( plm) Querying component [rsh]
>>>>>>>> [grsacc19:28754] [[6253,0],1] plm:rsh_lookup on agent ssh : rsh path NULL
>>>>>>>> [grsacc19:28754] mca:base:select:( plm) Query of component [rsh] set priority to 10
>>>>>>>> [grsacc19:28754] mca:base:select:( plm) Selected component [rsh]
>>>>>>>> [grsacc19:28754] [[6253,0],1] plm:rsh_setup on agent ssh : rsh path NULL
>>>>>>>> [grsacc19:28754] [[6253,0],1] plm:base:receive start comm
>>>>>>>> [grsacc19:28754] [[6253,0],1] plm:base:receive stop comm
>>>>>>>> [grsacc18:16648] mca:base:select:( plm) Querying component [rsh]
>>>>>>>> [grsacc18:16648] [[6253,0],2] plm:rsh_lookup on agent ssh : rsh path NULL
>>>>>>>> [grsacc18:16648] mca:base:select:( plm) Query of component [rsh] set priority to 10
>>>>>>>> [grsacc18:16648] mca:base:select:( plm) Selected component [rsh]
>>>>>>>> [grsacc18:16648] [[6253,0],2] plm:rsh_setup on agent ssh : rsh path NULL
>>>>>>>> [grsacc18:16648] [[6253,0],2] plm:base:receive start comm
>>>>>>>> [grsacc20:04511] [[6253,0],0] plm:base:orted_report_launch from daemon [[6253,0],2]
>>>>>>>> [grsacc20:04511] [[6253,0],0] plm:base:orted_report_launch from daemon [[6253,0],2] on node grsacc18
>>>>>>>> [grsacc20:04511] [[6253,0],0] plm:base:orted_report_launch completed for daemon [[6253,0],2] at contact 409796608.2;tcp://192.168.222.18:47974
>>>>>>>> [grsacc20:04511] [[6253,0],0] plm:base:launch_apps for job [6253,2]
>>>>>>>> [grsacc20:04511] 1 more process has sent help message help-mpi-btl-openib.txt / no active ports found
>>>>>>>> [grsacc20:04511] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
>>>>>>>> [grsacc20:04511] 1 more process has sent help message help-mpi-btl-base.txt / btl:no-nics
>>>>>>>> [grsacc20:04511] [[6253,0],0] plm:base:receive processing msg
>>>>>>>> [grsacc20:04511] [[6253,0],0] plm:base:receive update proc state command from [[6253,0],2]
>>>>>>>> [grsacc20:04511] [[6253,0],0] plm:base:receive got update_proc_state for job [6253,2]
>>>>>>>> [grsacc20:04511] [[6253,0],0] plm:base:receive got update_proc_state for vpid 0 state RUNNING exit_code 0
>>>>>>>> [grsacc20:04511] [[6253,0],0] plm:base:receive done processing commands
>>>>>>>> [grsacc20:04511] [[6253,0],0] plm:base:launch wiring up iof for job [6253,2]
>>>>>>>> [grsacc20:04511] [[6253,0],0] plm:base:receive processing msg
>>>>>>>> [grsacc20:04511] [[6253,0],0] plm:base:receive done processing commands
>>>>>>>> [grsacc20:04511] [[6253,0],0] plm:base:launch registered event
>>>>>>>> [grsacc20:04511] [[6253,0],0] plm:base:launch sending dyn release of job [6253,2] to [[6253,1],0]
>>>>>>>> [grsacc20:04511] [[6253,0],0] plm:base:orted_cmd sending orted_exit commands
>>>>>>>> [grsacc19:28747] [[6253,0],1] plm:base:receive stop comm
>>>>>>>> [grsacc20:04511] [[6253,0],0] plm:base:receive stop comm
>>>>>>>> -bash-4.1$ [grsacc18:16648] [[6253,0],2] plm:base:receive stop comm
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Sep 23, 2013, at 1:55 AM, Ralph Castain wrote:
>>>>>>>>
>>>>>>>>> Found a bug in the Torque support - we were trying to connect to the MOM again, which would hang (I imagine). I pushed a fix to the trunk (r29227) and scheduled it to come to 1.7.3 if you want to try it again.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Sep 22, 2013, at 4:21 PM, Suraj Prabhakaran <suraj.prabhakaran_at_[hidden]> wrote:
>>>>>>>>>
>>>>>>>>>> Dear Ralph,
>>>>>>>>>>
>>>>>>>>>> This is the output I get when I execute with the verbose option.
>>>>>>>>>>
>>>>>>>>>> [grsacc20:21012] [[23526,0],0] plm:base:receive processing msg
>>>>>>>>>> [grsacc20:21012] [[23526,0],0] plm:base:receive job launch command from [[23526,1],0]
>>>>>>>>>> [grsacc20:21012] [[23526,0],0] plm:base:receive adding hosts
>>>>>>>>>> [grsacc20:21012] [[23526,0],0] plm:base:receive calling spawn
>>>>>>>>>> [grsacc20:21012] [[23526,0],0] plm:base:receive done processing commands
>>>>>>>>>> [grsacc20:21012] [[23526,0],0] plm:base:setup_job
>>>>>>>>>> [grsacc20:21012] [[23526,0],0] plm:base:setup_vm
>>>>>>>>>> [grsacc20:21012] [[23526,0],0] plm:base:setup_vm add new daemon [[23526,0],2]
>>>>>>>>>> [grsacc20:21012] [[23526,0],0] plm:base:setup_vm assigning new daemon [[23526,0],2] to node grsacc17/1-4
>>>>>>>>>> [grsacc20:21012] [[23526,0],0] plm:base:setup_vm add new daemon [[23526,0],3]
>>>>>>>>>> [grsacc20:21012] [[23526,0],0] plm:base:setup_vm assigning new daemon [[23526,0],3] to node grsacc17/0-5
>>>>>>>>>> [grsacc20:21012] [[23526,0],0] plm:tm: launching vm
>>>>>>>>>> [grsacc20:21012] [[23526,0],0] plm:tm: final top-level argv:
>>>>>>>>>> orted -mca ess tm -mca orte_ess_jobid 1541799936 -mca orte_ess_vpid <template> -mca orte_ess_num_procs 4 -mca orte_hnp_uri "1541799936.0;tcp://192.168.222.20:49049" -mca plm_base_verbose 5
>>>>>>>>>> [warn] opal_libevent2021_event_base_loop: reentrant invocation. Only one event_base_loop can run on each event_base at once.
>>>>>>>>>> [grsacc20:21012] [[23526,0],0] plm:base:orted_cmd sending orted_exit commands
>>>>>>>>>> [grsacc20:21012] [[23526,0],0] plm:base:receive stop comm
>>>>>>>>>>
>>>>>>>>>> Says something?
>>>>>>>>>>
>>>>>>>>>> Best,
>>>>>>>>>> Suraj
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Sep 22, 2013, at 9:45 PM, Ralph Castain wrote:
>>>>>>>>>>
>>>>>>>>>>> I'll still need to look at the intercomm_create issue, but I just tested both the trunk and current 1.7.3 branch for "add-host" and both worked just fine. This was on my little test cluster which only has rsh available - no Torque.
>>>>>>>>>>>
>>>>>>>>>>> You might add "-mca plm_base_verbose 5" to your cmd line to get some debug output as to the problem.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Sep 21, 2013, at 5:48 PM, Ralph Castain <rhc_at_[hidden]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Sep 21, 2013, at 4:54 PM, Suraj Prabhakaran <suraj.prabhakaran_at_[hidden]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Dear all,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Really, thanks a lot for your efforts. I also downloaded the trunk to check whether it works for my case, and as of revision 29215 it works for the original case I reported. Although it works, I still see the following in the output. Does it mean anything?
>>>>>>>>>>>>> [grsacc17][[13611,1],0][btl_openib_proc.c:157:mca_btl_openib_proc_create] [btl_openib_proc.c:157] ompi_modex_recv failed for peer [[13611,2],0]
>>>>>>>>>>>>
>>>>>>>>>>>> Yes - it means we don't quite have this right yet :-(
>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> However, on another topic relevant to my use case, I have another problem to report. I am having problems using the "add-host" info with MPI_Comm_spawn() when Open MPI is compiled with support for the Torque resource manager. This problem is completely new in the 1.7 series; it worked perfectly up to and including 1.6.5.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Basically, I am working on implementing dynamic resource management facilities in the Torque/Maui batch system. Through a new tm call, an application can get new resources for a job.
>>>>>>>>>>>>
>>>>>>>>>>>> FWIW: you'll find that we added an API to the orte RAS framework to support precisely that operation. It allows an application to request that we dynamically obtain additional resources during execution (e.g., as part of a Comm_spawn call via an info_key). We originally implemented this with Slurm, but you could add the calls into the Torque component as well if you like.
>>>>>>>>>>>>
>>>>>>>>>>>> This is in the trunk now - will come over to 1.7.4
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>> I want to use MPI_Comm_spawn() to spawn new processes on the new hosts. With my extended Torque/Maui batch system, I was able to use the "add-host" info argument to MPI_Comm_spawn() to spawn new processes on these hosts without any trouble. Since Open MPI and Torque refer to the hosts through nodeids, I made sure that Open MPI uses the correct nodeids for these new hosts.
>>>>>>>>>>>>> Up to and including 1.6.5 this worked perfectly fine, except that due to the Intercomm_merge problem I could not run a real application to completion.
>>>>>>>>>>>>>
>>>>>>>>>>>>> While this is now fixed in the trunk, I found, however, that when using the "add-host" info argument everything collapses after printing the following error.
>>>>>>>>>>>>>
>>>>>>>>>>>>> [warn] opal_libevent2021_event_base_loop: reentrant invocation. Only one event_base_loop can run on each event_base at once.
>>>>>>>>>>>>
>>>>>>>>>>>> I'll take a look - probably some stale code that hasn't been updated yet for async ORTE operations
>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Because of this, I am still not really able to run my application! I also compiled Open MPI without any Torque/PBS support and just used the "add-host" argument normally. Again, this worked perfectly in 1.6.5, but in the 1.7 series it only works after printing the following error.
>>>>>>>>>>>>>
>>>>>>>>>>>>> [grsacc17][[13731,1],0][btl_openib_proc.c:157:mca_btl_openib_proc_create] [btl_openib_proc.c:157] ompi_modex_recv failed for peer [[13731,2],0]
>>>>>>>>>>>>> [grsacc17][[13731,1],1][btl_openib_proc.c:157:mca_btl_openib_proc_create] [btl_openib_proc.c:157] ompi_modex_recv failed for peer [[13731,2],0]
>>>>>>>>>>>>
>>>>>>>>>>>> Yeah, the 1.7 series doesn't have the reentrant test in it - so we "illegally" re-enter libevent. The error again means we don't have Intercomm_create correct just yet.
>>>>>>>>>>>>
>>>>>>>>>>>> I'll see what I can do about this and get back to you
>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> In short, with PBS/Torque support it fails, and without PBS/Torque support it runs, but only after spitting out the above lines.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I would really appreciate some help on this, since I need these features to actually test my case, and (at least in my short experience) no other MPI implementation seems friendly to such dynamic scenarios.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks a lot!
>>>>>>>>>>>>>
>>>>>>>>>>>>> Best,
>>>>>>>>>>>>> Suraj
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Sep 20, 2013, at 4:58 PM, Jeff Squyres (jsquyres) wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Just to close my end of this loop: as of trunk r29213, it all works for me. Thanks!
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Sep 18, 2013, at 12:52 PM, Ralph Castain <rhc_at_[hidden]> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks George - much appreciated
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Sep 18, 2013, at 9:49 AM, George Bosilca <bosilca_at_[hidden]> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> The test case was broken. I just pushed a fix.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> George.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Sep 18, 2013, at 16:49 , Ralph Castain <rhc_at_[hidden]> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Hangs with any np > 1
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> However, I'm not sure if that's an issue with the test vs the underlying implementation
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Sep 18, 2013, at 7:40 AM, "Jeff Squyres (jsquyres)" <jsquyres_at_[hidden]> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Does it hang when you run with -np 4?
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Sent from my phone. No type good.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Sep 18, 2013, at 4:10 PM, "Ralph Castain" <rhc_at_[hidden]> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Strange - it works fine for me on my Mac. However, I see one difference - I only run it with np=1
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Sep 18, 2013, at 2:22 AM, Jeff Squyres (jsquyres) <jsquyres_at_[hidden]> wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On Sep 18, 2013, at 9:33 AM, George Bosilca <bosilca_at_[hidden]> wrote:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> 1. sm doesn't work between spawned processes. So you must have another network enabled.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> I know :-). I have tcp available as well (OMPI will abort if you only run with sm,self because the comm_spawn will fail with unreachable errors -- I just tested/proved this to myself).
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> 2. Don't use the test case attached to my email; I left an xterm-based spawn and the debugging in, so it can't work without xterm support. Instead, try the test case from the trunk, the one committed by Ralph.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> I didn't see any "xterm" strings in there, but ok. :-) I ran with orte/test/mpi/intercomm_create.c, and that hangs for me as well:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> -----
>>>>>>>>>>>>>>>>>>>> ❯❯❯ mpicc intercomm_create.c -o intercomm_create
>>>>>>>>>>>>>>>>>>>> ❯❯❯ mpirun -np 4 intercomm_create
>>>>>>>>>>>>>>>>>>>> b: MPI_Intercomm_create( intra, 0, intra, MPI_COMM_NULL, 201, &inter) [rank 4]
>>>>>>>>>>>>>>>>>>>> b: MPI_Intercomm_create( intra, 0, intra, MPI_COMM_NULL, 201, &inter) [rank 5]
>>>>>>>>>>>>>>>>>>>> b: MPI_Intercomm_create( intra, 0, intra, MPI_COMM_NULL, 201, &inter) [rank 6]
>>>>>>>>>>>>>>>>>>>> b: MPI_Intercomm_create( intra, 0, intra, MPI_COMM_NULL, 201, &inter) [rank 7]
>>>>>>>>>>>>>>>>>>>> c: MPI_Intercomm_create( MPI_COMM_WORLD, 0, intra, 0, 201, &inter) [rank 4]
>>>>>>>>>>>>>>>>>>>> c: MPI_Intercomm_create( MPI_COMM_WORLD, 0, intra, 0, 201, &inter) [rank 5]
>>>>>>>>>>>>>>>>>>>> c: MPI_Intercomm_create( MPI_COMM_WORLD, 0, intra, 0, 201, &inter) [rank 6]
>>>>>>>>>>>>>>>>>>>> c: MPI_Intercomm_create( MPI_COMM_WORLD, 0, intra, 0, 201, &inter) [rank 7]
>>>>>>>>>>>>>>>>>>>> a: MPI_Intercomm_create( ab_intra, 0, ac_intra, 0, 201, &inter) (0)
>>>>>>>>>>>>>>>>>>>> a: MPI_Intercomm_create( ab_intra, 0, ac_intra, 0, 201, &inter) (0)
>>>>>>>>>>>>>>>>>>>> a: MPI_Intercomm_create( ab_intra, 0, ac_intra, 0, 201, &inter) (0)
>>>>>>>>>>>>>>>>>>>> a: MPI_Intercomm_create( ab_intra, 0, ac_intra, 0, 201, &inter) (0)
>>>>>>>>>>>>>>>>>>>> [hang]
>>>>>>>>>>>>>>>>>>>> -----
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Similarly, on my Mac, it hangs with no output:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> -----
>>>>>>>>>>>>>>>>>>>> ❯❯❯ mpicc intercomm_create.c -o intercomm_create
>>>>>>>>>>>>>>>>>>>> ❯❯❯ mpirun -np 4 intercomm_create
>>>>>>>>>>>>>>>>>>>> [hang]
>>>>>>>>>>>>>>>>>>>> -----
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> George.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> On Sep 18, 2013, at 07:53 , "Jeff Squyres (jsquyres)" <jsquyres_at_[hidden]> wrote:
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> George --
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> When I build the SVN trunk (r29201) on 64 bit linux, your attached test case hangs:
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> -----
>>>>>>>>>>>>>>>>>>>>>> ❯❯❯ mpicc intercomm_create.c -o intercomm_create
>>>>>>>>>>>>>>>>>>>>>> ❯❯❯ mpirun -np 4 intercomm_create
>>>>>>>>>>>>>>>>>>>>>> b: MPI_Intercomm_create( intra, 0, intra, MPI_COMM_NULL, 201, &inter) [rank 4]
>>>>>>>>>>>>>>>>>>>>>> b: MPI_Intercomm_create( intra, 0, intra, MPI_COMM_NULL, 201, &inter) [rank 5]
>>>>>>>>>>>>>>>>>>>>>> b: MPI_Intercomm_create( intra, 0, intra, MPI_COMM_NULL, 201, &inter) [rank 6]
>>>>>>>>>>>>>>>>>>>>>> b: MPI_Intercomm_create( intra, 0, intra, MPI_COMM_NULL, 201, &inter) [rank 7]
>>>>>>>>>>>>>>>>>>>>>> a: MPI_Intercomm_create( ab_intra, 0, ac_intra, 0, 201, &inter) (0)
>>>>>>>>>>>>>>>>>>>>>> a: MPI_Intercomm_create( ab_intra, 0, ac_intra, 0, 201, &inter) (0)
>>>>>>>>>>>>>>>>>>>>>> a: MPI_Intercomm_create( ab_intra, 0, ac_intra, 0, 201, &inter) (0)
>>>>>>>>>>>>>>>>>>>>>> a: MPI_Intercomm_create( ab_intra, 0, ac_intra, 0, 201, &inter) (0)
>>>>>>>>>>>>>>>>>>>>>> c: MPI_Intercomm_create( MPI_COMM_WORLD, 0, intra, 0, 201, &inter) [rank 4]
>>>>>>>>>>>>>>>>>>>>>> c: MPI_Intercomm_create( MPI_COMM_WORLD, 0, intra, 0, 201, &inter) [rank 5]
>>>>>>>>>>>>>>>>>>>>>> c: MPI_Intercomm_create( MPI_COMM_WORLD, 0, intra, 0, 201, &inter) [rank 6]
>>>>>>>>>>>>>>>>>>>>>> c: MPI_Intercomm_create( MPI_COMM_WORLD, 0, intra, 0, 201, &inter) [rank 7]
>>>>>>>>>>>>>>>>>>>>>> [hang]
>>>>>>>>>>>>>>>>>>>>>> -----
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> On my Mac, it hangs without printing anything:
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> -----
>>>>>>>>>>>>>>>>>>>>>> ❯❯❯ mpicc intercomm_create.c -o intercomm_create
>>>>>>>>>>>>>>>>>>>>>> ❯❯❯ mpirun -np 4 intercomm_create
>>>>>>>>>>>>>>>>>>>>>> [hang]
>>>>>>>>>>>>>>>>>>>>>> -----
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> On Sep 18, 2013, at 1:48 AM, George Bosilca <bosilca_at_[hidden]> wrote:
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Here is a quick (and definitely not the cleanest) patch that addresses the MPI_Intercomm issue at the MPI level. It should be applied after removal of r29166.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> I also added the corrected test case stressing the corner cases by doing barriers at every inter-comm creation and doing a clean disconnect.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>