Open MPI User's Mailing List Archives

From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2007-11-06 21:57:09


You might want to run your app through a memory-checking debugger to
see if anything obvious shows up.

Also, check that your core file size limit is greater than zero (e.g.,
set it to "unlimited"). Then run again and see whether you get corefiles;
if your app is silently dumping core, they would give you a clue as to
what is going on.
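
As a rough sketch of the corefile check (not something specific to Open
MPI), the snippet below reads the core file size limit of the current
process and raises the soft limit as far as the hard limit allows. The
function names are made up for illustration. It uses Python's standard
"resource" module, which is available on Unix-like systems only.

```python
import resource


def core_limit_status():
    """Return (soft, hard, enabled) for this process's core file size limit."""
    soft, hard = resource.getrlimit(resource.RLIMIT_CORE)
    # A soft limit of 0 suppresses corefiles; RLIM_INFINITY means "unlimited".
    enabled = soft == resource.RLIM_INFINITY or soft > 0
    return soft, hard, enabled


def raise_core_limit():
    """Raise the soft limit to the hard limit and return the new (soft, hard)."""
    _, hard = resource.getrlimit(resource.RLIMIT_CORE)
    # An unprivileged process may raise its soft limit up to the hard limit.
    resource.setrlimit(resource.RLIMIT_CORE, (hard, hard))
    return resource.getrlimit(resource.RLIMIT_CORE)


if __name__ == "__main__":
    print("before:", core_limit_status())
    print("after: ", raise_core_limit())
```

Note that the limit is per-process and inherited by child processes, so
it has to be in effect in the environment that actually launches your
MPI processes on every node; in a shell this is usually done with
"ulimit -c unlimited".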

These are the areas where I typically start with parallel debugging;
hopefully this is at least somewhat helpful...

On Nov 1, 2007, at 2:59 PM, Karsten Bolding wrote:

> This is not OpenMPI specific - but maybe somebody on the list can
> give a hint.
>
> I start a parallel job with:
> mpirun -np 19 -nolocal -machinefile machinefile bin/getm_prod_IFORT.0096x0096
>
> everything starts OK and the simulation carries on 2+ hours of
> wall clock time - then suddenly without a trace in the logfile:
>
> 19:48:46.172 n= 1800
> 2003-09-01 05:06:00: reading 2D boundary data ...
> 19:49:21.710 n= 1900
> 19:49:50.490 n= 2000
>
> or in any system logfiles, the simulation stops and all related
> processes on the nodes stop.
>
> If I re-run, the simulation does not stop at the same time.
>
> Does anybody have a clue where I should search?
>
> I use a 4 machine/dual P/dual core cluster connected via GBit/s
> ethernet.
>
> Karsten
>
> PS: If I use MPICH I get the same problem.
>
>
> --
> ----------------------------------------------------------------------
> Karsten Bolding Bolding & Burchard Hydrodynamics
> Strandgyden 25 Phone: +45 64422058
> DK-5466 Asperup Fax: +45 64422068
> Denmark Email: karsten_at_[hidden]
>
> http://www.findvej.dk/Strandgyden25,5466,11,3
> ----------------------------------------------------------------------
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users

-- 
Jeff Squyres
Cisco Systems