Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] HRM problem
From: TERRY DONTJE (terry.dontje_at_[hidden])
Date: 2012-04-24 06:52:18


On 4/24/2012 6:19 AM, Syed Ahsan Ali wrote:
> I am not familiar with attaching debugger to the processes. Other
> things you asked are as follows:
The easiest way is to get Totalview or Allinea (both are parallel debuggers)
and attach them to the job. However, both cost money. Another option is padb;
look at http://padb.pittman.org.uk (this is probably your best bet).
Lastly, on a node that has a running process, find the pid of that
process and attach gdb or dbx to it, for example with "gdb -p <pid>", where
<pid> is the process id of one of the processes. Once in the debugger, run the
"where" command (this will give you the stack of the process).
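As a rough sketch of the gdb route, assuming gdb is installed on the compute nodes and you are logged in as the user who owns the job, something like the following prints the stack of every thread for a list of pids. The function name dump_stacks and the binary name "hrm" are made up for illustration; -p, -batch and -ex are standard gdb options.

```shell
#!/bin/sh
# Sketch only: print stack traces for a list of process ids.
# GDB can be overridden; "hrm" in the usage line is a hypothetical binary name.
GDB="${GDB:-gdb}"

dump_stacks() {
    # $@: process ids to inspect
    for pid in "$@"; do
        echo "=== pid $pid ==="
        # -p attaches to a running process, -batch exits once the commands
        # are done, -ex runs one gdb command. "thread apply all bt" prints
        # the stack of every thread, like "where" does for the current one.
        "$GDB" -p "$pid" -batch -ex "thread apply all bt"
    done
}

# Usage (on a node that is running the job):
#   dump_stacks $(pgrep -u "$USER" hrm)
```

If most of the processes sit in MPI_Recv, MPI_Wait or a similar call while one is somewhere else, that odd one is usually a good place to start looking.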
> Is this the first time you've run it (with Open MPI? with any MPI?)
> *No. We have been running this and other models, but this problem has
> arisen now.*
OK, so from the above, are you saying HRM has worked with Open MPI on the
same cluster before? If so, what has changed?
> How many processes is the job using? Are you oversubscribing your
> processors? *I have tried running on the cluster with 184 cores as well
> as on 8 cores of the same server.*
So the hang happens even on a single server, with no network involved?
Does the job get past MPI_Init?
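One way to rule the network out on a single node, sketched here under the assumption that the sm (shared memory) and self BTLs are available in your 1.4 series install, is to restrict Open MPI to those two transports. run_local_only and ./hrm are placeholder names, not anything from your setup.

```shell
#!/bin/sh
# Sketch only: launch a job that may use nothing but loopback and shared memory.
# MPIRUN can be overridden; ./hrm in the usage line is a hypothetical binary name.
MPIRUN="${MPIRUN:-mpirun}"

run_local_only() {
    # $1: number of processes; remaining arguments: the program and its arguments
    np="$1"
    shift
    # "--mca btl self,sm" limits point-to-point traffic to the self and
    # shared-memory transports, so neither TCP nor InfiniBand is used.
    "$MPIRUN" --mca btl self,sm -np "$np" "$@"
}

# Usage (on one server):
#   run_local_only 8 ./hrm
```

If the job still hangs when run this way, the interconnect is unlikely to be the cause.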
> What version of Open MPI are you using? *Open MPI 1.4.2*
> Have you tested all network connections? *Yes.*
> It might help us to know the size of cluster you are running and
> what type of network? *The cluster has 32 Dell PowerEdge blade server
> nodes, and connectivity is Gigabit Ethernet and InfiniBand.*

--td
>
>
> On Tue, Apr 24, 2012 at 3:02 PM, TERRY DONTJE <terry.dontje_at_[hidden]> wrote:
>
> To determine if an MPI process is waiting for a message, do what
> Rayson suggested: attach a debugger to the processes and see if
> any of them are stuck in MPI, either blocked inside an MPI_Recv or
> MPI_Wait call or looping on an MPI_Test call.
>
> Other things to consider.
> Is this the first time you've run it (with Open MPI? with any MPI?)?
> How many processes is the job using? Are you oversubscribing
> your processors?
> What version of Open MPI are you using?
> Have you tested all network connections?
> It might help us to know the size of cluster you are running and
> what type of network?
>
> --td
>
> On 4/24/2012 2:42 AM, Syed Ahsan Ali wrote:
>> Dear Rayson,
>>
>> That is a numerical model written by the national weather
>> service of a country. The logs of the model show every detail
>> about the simulation progress. I have checked on the remote nodes
>> as well: the application binary is running, but the logs show no
>> progress; it is just waiting at one point. The input data is
>> correct and everything else looks fine. How can I check if the MPI
>> task is waiting for a message?
>> Ahsan
>>
>> On Tue, Apr 24, 2012 at 11:03 AM, Rayson Ho <raysonlogin_at_[hidden]> wrote:
>>
>> Seems like there's a bug in the application. Did you or someone else
>> write it, or did you get it from an ISV?
>>
>> You can log onto one of the nodes, attach a debugger, and see if the
>> MPI task is waiting for a message (looping in one of the MPI receive
>> functions)...
>>
>> Rayson
>>
>> =================================
>> Open Grid Scheduler / Grid Engine
>> http://gridscheduler.sourceforge.net/
>>
>> Scalable Grid Engine Support Program
>> http://www.scalablelogic.com/
>>
>>
>> On Tue, Apr 24, 2012 at 12:49 AM, Syed Ahsan Ali <ahsanshah01_at_[hidden]> wrote:
>> > Dear All,
>> >
>> > I am having a problem running an application on a Dell cluster. The
>> > model starts well but shows no further progress; it is just stuck. I
>> > have checked the systems, and there is no apparent hardware error.
>> > Other Open MPI applications are running well on the same cluster. I
>> > have tried running the application on cores of the same server as
>> > well, but the problem is the same: the application just doesn't move
>> > further. The same application is also running well on a backup
>> > cluster. Please help.
>> >
>> >
>> > Thanks and Best Regards
>> >
>> > Ahsan
>> >
>> > _______________________________________________
>> > users mailing list
>> > users_at_[hidden] <mailto:users_at_[hidden]>
>> > http://www.open-mpi.org/mailman/listinfo.cgi/users
>>
>>
>>
>> --
>> ==================================================
>> Open Grid Scheduler - The Official Open Source Grid Engine
>> http://gridscheduler.sourceforge.net/
>>
>

-- 
Terry D. Dontje | Principal Software Engineer
Developer Tools Engineering | +1.781.442.2631
Oracle *- Performance Technologies*
95 Network Drive, Burlington, MA 01803
Email terry.dontje_at_[hidden]