Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] Debugger problem with srun and openmpi 1.5 (hang in OMPI)
From: Nikolay Piskun (Nikolay.Piskun_at_[hidden])
Date: 2011-02-10 15:19:40


Thanks much, looks like this should work. The patch is one line:
--------------------------------------------------------------
 diff -c ompi_debuggers.c ompi_debuggers.c.old
*** ompi_debuggers.c Thu Feb 10 15:13:07 2011
--- ompi_debuggers.c.old Fri Jan 22 09:21:23 2010
***************
*** 222,228 ****
      mpimsgq_dll_locations = tmp1;
      mpidbg_dll_locations = tmp2;
  
! if (ORTE_DISABLE_FULL_SUPPORT || orte_standalone_operation) {
          /* spin until debugger attaches and releases us */
          while (MPIR_debug_gate == 0) {
  #if defined(__WINDOWS__)
--- 222,228 ----
      mpimsgq_dll_locations = tmp1;
      mpidbg_dll_locations = tmp2;
  
! if (ORTE_DISABLE_FULL_SUPPORT) {
          /* spin until debugger attaches and releases us */
          while (MPIR_debug_gate == 0) {
  #if defined(__WINDOWS__)
----------------------------------------------------------------
 What would be the best way to put it in?
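
For readers following the thread, the control flow that the one-line change selects can be sketched in isolation. This is only an illustration under stated assumptions, not the Open MPI source: every name prefixed with sketch_ is a hypothetical stand-in (for ORTE_DISABLE_FULL_SUPPORT, orte_standalone_operation and MPIR_debug_gate respectively), and fake_debugger_poll merely simulates a debugger attaching after a few polls.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the real symbols named in the patch. */
#define SKETCH_DISABLE_FULL_SUPPORT 0       /* plays the role of ORTE_DISABLE_FULL_SUPPORT */
static bool sketch_standalone_operation;    /* plays the role of orte_standalone_operation */
static volatile int sketch_debug_gate;      /* plays the role of MPIR_debug_gate */

typedef enum { WAIT_SPIN_ON_GATE, WAIT_FOR_HNP_MESSAGE } sketch_wait_mode;

/* Mirrors the patched condition: spin on the gate when full ORTE support is
 * compiled out OR when the job was launched by something other than orterun. */
static sketch_wait_mode sketch_choose_wait_mode(void)
{
    if (SKETCH_DISABLE_FULL_SUPPORT || sketch_standalone_operation) {
        return WAIT_SPIN_ON_GATE;
    }
    return WAIT_FOR_HNP_MESSAGE;
}

/* Spin until the gate becomes nonzero, calling poll() once per iteration.
 * Returns the number of polls performed before release. */
static int sketch_spin_until_released(void (*poll)(void))
{
    int polls = 0;
    while (sketch_debug_gate == 0) {
        poll();
        ++polls;
    }
    return polls;
}

/* Simulated debugger: opens the gate once the countdown reaches zero. */
static int sketch_polls_before_attach = 3;
static void fake_debugger_poll(void)
{
    if (--sketch_polls_before_attach == 0) {
        sketch_debug_gate = 1;
    }
}
```

With sketch_standalone_operation set to true, sketch_choose_wait_mode() picks the spin path, so a rank started by srun would no longer block waiting for a message that only orterun would send. In a real process the poll callback would be a short sleep, and the debugger itself would write the gate variable.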

--
Nikolay Piskun
Director of Continuing Engineering
TotalView Technologies, Rogue Wave Software company
mailto:nikolay_at_[hidden]   phone: 508-652-7739
24 Prime Parkway, Natick, MA 01760
http://www.totalviewtech.com
________________________________________
From: devel-bounces_at_[hidden] [devel-bounces_at_[hidden]] On Behalf Of Ralph Castain [rhc_at_[hidden]]
Sent: Thursday, February 10, 2011 12:42 PM
To: Open MPI Developers
Subject: Re: [OMPI devel] Debugger problem with srun and openmpi 1.5 (hang in OMPI)
FWIW: there already is a flag in ORTE that gets set when procs are launched by a non-orterun entity: orte_standalone_operation. So all you would have to do is add an appropriate check for that flag to be true.
On Feb 10, 2011, at 9:18 AM, Jeff Squyres wrote:
> I think what Ralph was trying to say is that Open MPI doesn't (currently) support running parallel debuggers when only srun is used (and mpirun is not).
>
> We'd certainly be open to someone submitting a patch to enable this functionality, though!
>
>
> On Feb 10, 2011, at 8:02 AM, Nikolay Piskun wrote:
>
>> Actually, in SLURM 2.2.0, which I am using now, there is support for parallel debuggers: srun does provide the needed info, fills in the proc_table, and sets up all the debug variables correctly. The only problem I see so far is the one I described. Maybe the solution would be to check whether the job was started by something other than orterun, and then, or alternatively, check MPIR_debug_gate before waiting for the signal.
>>
>> Nikolay Piskun | Director of Continuing Engineering | Totalview Technologies |
>> Rogue Wave Software Inc  |  24 Prime Parkway, Natick, MA 01760 | p 508-652-7739|
>> nikolay.piskun_at_[hidden]
>> www.roguewave.com
>>
>> From: devel-bounces_at_[hidden] [mailto:devel-bounces_at_[hidden]] On Behalf Of Ralph Castain
>> Sent: Thursday, February 10, 2011 10:47 AM
>> To: Open MPI Developers
>> Subject: Re: [OMPI devel] Debugger problem with srun and openmpi 1.5 (hang in OMPI)
>>
>> If you srun a job, then there is no "mpirun" to provide a proc_table. So running a job directly via srun means you cannot run TV on it.
>>
>>
>> On Feb 10, 2011, at 8:34 AM, Nikolay Piskun wrote:
>>
>>
>>
>>   Hi,
>> I am trying to use TotalView with srun and have hit an interesting problem. It looks like if OMPI is started by "srun --mpi=ompi", the MPI job hangs in the ompi_wait_for_debugger() subroutine. What happens, I think, is that OMPI was compiled without ORTE_DISABLE_FULL_SUPPORT, and as a result rank 0 is waiting for a message from the HNP (by the way, what is HNP?) that was supposed to be sent by orterun. The problem is that orterun was never invoked, because MPI was initiated by srun, not orterun. So what is the solution? Should we always compile OMPI with ORTE_DISABLE_FULL_SUPPORT=true for anything that uses a different starter, such as srun from SLURM?
>> Thanks
>> Nikolay
>>
>> Nikolay Piskun | Director of Continuing Engineering | Totalview Technologies |
>> Rogue Wave Software Inc  |  24 Prime Parkway, Natick, MA 01760 | p 508-652-7739|
>> nikolay.piskun_at_[hidden]
>> www.roguewave.com
>>
>> _______________________________________________
>> devel mailing list
>> devel_at_[hidden]
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>
>
> --
> Jeff Squyres
> jsquyres_at_[hidden]
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
>
>