Open MPI User's Mailing List Archives

From: Karsten Bolding (karsten_at_[hidden])
Date: 2007-11-01 14:59:16


This is not OpenMPI specific - but maybe somebody on the list can give a
hint.

I start a parallel job with:
mpirun -np 19 -nolocal -machinefile machinefile bin/getm_prod_IFORT.0096x0096

Everything starts OK and the simulation carries on for 2+ hours of
wall clock time - then suddenly, without a trace in the logfile:

    19:48:46.172 n= 1800
            2003-09-01 05:06:00: reading 2D boundary data ...
    19:49:21.710 n= 1900
    19:49:50.490 n= 2000

or in any system logfiles, the simulation stops and all related processes
on the nodes stop.

If I re-run, the simulation does not stop at the same time.

Does anybody have a clue where I should look?

I use a cluster of 4 machines (dual processor, dual core) connected via Gbit/s Ethernet.

Karsten

PS: If I use MPICH I get the same problem.

-- 
----------------------------------------------------------------------
Karsten Bolding                    Bolding & Burchard Hydrodynamics
Strandgyden 25                     Phone: +45 64422058
DK-5466 Asperup                    Fax:   +45 64422068
Denmark                            Email: karsten_at_[hidden]
http://www.findvej.dk/Strandgyden25,5466,11,3
----------------------------------------------------------------------