If I cannot ssh to a worker node, does that mean my program cannot work correctly?
I can run it with 32 nodes * 4 cores/node parallel processes. But with a larger job,
128 nodes * 1 CPU/node, it is killed by signal 9.
Could this be the reason?
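
One way to see how much memory each process uses without ssh access to the nodes is to have each process report on itself. The following is only a minimal sketch, assuming Linux worker nodes (it reads /proc/self/status) and assuming the job is launched by Open MPI's mpirun, which sets OMPI_COMM_WORLD_RANK in each process's environment. The function names parse_status and report_memory are made up for illustration. A C or C++ program could read the same file in the same way.

```python
import os
import socket


def parse_status(text):
    """Extract VmRSS (current) and VmHWM (peak) resident memory, in kB,
    from the contents of a Linux /proc/<pid>/status file."""
    usage = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        if key in ("VmRSS", "VmHWM"):
            # The value looks like "   1024 kB"; keep the number only.
            usage[key] = int(rest.split()[0])
    return usage


def report_memory():
    """Build a one-line memory report for the calling process."""
    with open("/proc/self/status") as f:
        usage = parse_status(f.read())
    # Assumption: Open MPI's launcher exports the rank in this variable.
    rank = os.environ.get("OMPI_COMM_WORLD_RANK", "?")
    return "host=%s rank=%s pid=%d VmRSS=%s kB VmHWM=%s kB" % (
        socket.gethostname(),
        rank,
        os.getpid(),
        usage.get("VmRSS"),
        usage.get("VmHWM"),
    )


if __name__ == "__main__":
    print(report_memory())
```

Calling report_memory() periodically from each process and writing the line to stdout or a per-rank log file would show whether memory grows before the job is killed.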
> Date: Wed, 13 Apr 2011 05:59:10 -0700
> From: email@example.com
> To: firstname.lastname@example.org
> Subject: Re: [OMPI users] OMPI monitor each process behavior
>
> On 4/12/2011 8:55 PM, Jack Bryan wrote:
> >
> > I need to monitor the memory usage of each parallel process on a linux
> > Open MPI cluster.
> >
> > But, top, ps command cannot help here because they only show the head
> > node information.
> >
> > I need to follow the behavior of each process on each cluster node.
> Did you consider ganglia et al?
> >
> > I cannot use ssh to access each node.
> How can MPI run?
> >
> > The program takes 8 hours to finish.
> >
>
> --
> Tim Prince
> _______________________________________________
> users mailing list
> email@example.com
> http://www.open-mpi.org/mailman/listinfo.cgi/users