Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] HRM problem
From: Syed Ahsan Ali (ahsanshah01_at_[hidden])
Date: 2012-04-24 06:19:57


I am not familiar with attaching a debugger to the processes. The answers to
your other questions are as follows:

Is this the first time you've run it (with Open MPI? with any MPI?)
*No. We have been running this and other models, but this problem has only
arisen now.*

How many processes is the job using? Are you oversubscribing your
processors?
*I have tried running on the cluster, which has 184 cores, as well as on 8
cores of the same server.*

What version of Open MPI are you using?
*Open MPI 1.4.2*

Have you tested all network connections?
*Yes.*

It might help us to know the size of cluster you are running and what
type of network?
*The cluster has 32 nodes (Dell PowerEdge blade servers), and connectivity
is Gigabit Ethernet and InfiniBand.*
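
For anyone unfamiliar with the debugger step suggested in the quoted
messages below, here is a rough sketch of one way to do it, not a definitive
recipe. It assumes gdb is installed on the compute node and that you are
allowed to attach to the process (same user, or root). The helper names
(mpi_calls_in_backtrace, backtrace_of) are hypothetical, made up for this
illustration. The script asks gdb for a backtrace of each PID given on the
command line and reports whether an MPI_Recv, MPI_Wait, or MPI_Test frame
appears. The PMPI_ prefix is accepted too, since MPI libraries often show
these functions under their profiling names.

```python
import re
import subprocess
import sys

# MPI calls named in this thread. The optional leading "P" covers the PMPI_
# profiling names under which these functions often appear in a backtrace.
MPI_FRAME = re.compile(r"\bP?(MPI_(?:Recv|Wait|Test)\w*)\b")


def mpi_calls_in_backtrace(backtrace):
    """Return the MPI receive/wait/test calls found, in order, without repeats."""
    found = []
    for match in MPI_FRAME.finditer(backtrace):
        name = match.group(1)
        if name not in found:
            found.append(name)
    return found


def backtrace_of(pid):
    """Attach gdb to a running process, print all thread backtraces, detach."""
    result = subprocess.run(
        ["gdb", "-p", str(pid), "-batch", "-ex", "thread apply all bt"],
        capture_output=True,
        text=True,
        check=False,  # gdb may fail (no permission, PID gone); just report nothing
    )
    return result.stdout


if __name__ == "__main__":
    for arg in sys.argv[1:]:
        calls = mpi_calls_in_backtrace(backtrace_of(int(arg)))
        if calls:
            print(arg, "is inside:", ", ".join(calls))
        else:
            print(arg, "shows no MPI receive/wait/test frame")
```

Run it on a node as `python3 check_mpi_wait.py PID [PID ...]` (the script
name is hypothetical), using the PIDs of the model's processes as shown by ps
or top. If every rank reports the same MPI call on repeated runs a few
seconds apart, that suggests the job is waiting for a message that never
arrives, though it does not by itself show which rank failed to send it.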

On Tue, Apr 24, 2012 at 3:02 PM, TERRY DONTJE <terry.dontje_at_[hidden]> wrote:

> To determine if an MPI process is waiting for a message, do what Rayson
> suggested: attach a debugger to the processes and see if any of them are
> stuck in MPI, either blocked inside an MPI_Recv or MPI_Wait call or looping
> on an MPI_Test call.
>
> Other things to consider:
> Is this the first time you've run it (with Open MPI? with any MPI?)?
> How many processes is the job using? Are you oversubscribing your
> processors?
> What version of Open MPI are you using?
> Have you tested all network connections?
> It might help us to know the size of cluster you are running and what
> type of network?
>
> --td
>
> On 4/24/2012 2:42 AM, Syed Ahsan Ali wrote:
>
> Dear Rayson,
>
> It is a numerical model written by the national weather service of a
> country. The model's logs show every detail of the simulation's progress.
> I have checked the remote nodes as well: the application binary is
> running, but the logs show no progress; it is just waiting at one point.
> The input data is correct and everything else appears fine. How can I
> check whether the MPI task is waiting for a message?
> Ahsan
>
> On Tue, Apr 24, 2012 at 11:03 AM, Rayson Ho <raysonlogin_at_[hidden]> wrote:
>
>> Seems like there's a bug in the application. Did you or someone else
>> write it, or did you get it from an ISV?
>>
>> You can log onto one of the nodes, attach a debugger, and see if the
>> MPI task is waiting for a message (looping in one of the MPI receive
>> functions)...
>>
>> Rayson
>>
>> =================================
>> Open Grid Scheduler / Grid Engine
>> http://gridscheduler.sourceforge.net/
>>
>> Scalable Grid Engine Support Program
>> http://www.scalablelogic.com/
>>
>>
>> On Tue, Apr 24, 2012 at 12:49 AM, Syed Ahsan Ali <ahsanshah01_at_[hidden]>
>> wrote:
>> > Dear All,
>> >
>> > I am having a problem running an application on a Dell cluster. The
>> > model starts well, but no further progress is shown; it is just stuck.
>> > I have checked the systems, and there is no apparent hardware error.
>> > Other Open MPI applications are running well on the same cluster. I
>> > have also tried running the application on cores of the same server,
>> > but the problem is the same: the application just doesn't move any
>> > further. The same application is running well on a backup cluster.
>> > Please help.
>> >
>> >
>> > Thanks and Best Regards
>> >
>> > Ahsan
>> >
>> > _______________________________________________
>> > users mailing list
>> > users_at_[hidden]
>> > http://www.open-mpi.org/mailman/listinfo.cgi/users
>>
>>
>>
>> --
>> ==================================================
>> Open Grid Scheduler - The Official Open Source Grid Engine
>> http://gridscheduler.sourceforge.net/
>>
> --
> Terry D. Dontje | Principal Software Engineer
> Developer Tools Engineering | +1.781.442.2631
> Oracle *- Performance Technologies*
> 95 Network Drive, Burlington, MA 01803
> Email terry.dontje_at_[hidden]