Is gamess calling fork(), perchance? Perhaps through a system() or
similar call?
On Mar 5, 2009, at 3:50 AM, Thomas Exner wrote:
> Dear Jeff:
> Thank you very much for your reply. Unfortunately, overloading is
> not the problem. The phenomenon also appears if we use only two
> processes on the 8-core machines. When I run the jobs over two nodes,
> the calculation is doing nothing anymore after a couple of minutes.
> The strange thing is that this only happens on infiniband and only
> with MPI-2 libraries (Open MPI and MVAPICH2). MVAPICH1 is running
> reasonably fine at the moment. Perhaps the first two MPI
> implementations have something in common which could trigger the
> problems.
> Thanks again.
> Jeff Squyres wrote:
> > Sorry for the delay in replying -- INBOX deluge makes me miss
> > emails on the users list sometimes.
> > I'm unfortunately not familiar with gamess -- have you checked with
> > their support lists or documentation?
> > Note that Open MPI's IB progression engine will spin hard to make
> > progress for message passing. Specifically, if you have processes
> > that are "blocking" in message passing calls, those processes will
> > be spinning trying to make progress (vs. actually blocking in the
> > kernel). So if you overload your hosts -- meaning that you run more
> > Open MPI jobs than there are cores -- you could well experience a
> > slowdown in overall performance because every MPI job will be
> > competing for CPU cycles.
> > On Feb 24, 2009, at 4:57 AM, Thomas Exner wrote:
> >> Dear all:
> >> Because I am new to this list, I would like to introduce myself:
> >> I am Thomas Exner. Please excuse any silly questions -- I am only
> >> a chemist.
> >> And now my problem, with which I have been fiddling around for
> >> almost a week: I try to use gamess with openmpi on infiniband.
> >> There is a good description on how to compile it with MPI, and it
> >> can be done, even if it is not easy. But then at run time
> >> everything gets weird. The specialty of gamess is that it runs
> >> twice as many MPI processes as are used for the computation. The
> >> second half is used as data servers, handling data requests but
> >> with very little CPU load. Each one of these data servers is
> >> connected to a specific compute job. Therefore, these two jobs
> >> have to be run on the same node. On one node everything is fine
> >> (2x4-core machines in my case), because all the jobs are
> >> guaranteed to run on this node. If I try two nodes, at the
> >> beginning everything is also fine: 8 compute jobs and 8 data
> >> servers are running on each node.
> >> But after a short while, the entire set of processes (16) on the
> >> first node starts to accumulate CPU time, with nothing useful
> >> happening. The second node's processes go entirely to sleep. Is
> >> it possible that the compute jobs have for some reason been
> >> transferred to the first node? This would explain the load of 16
> >> on the first node and 0 on the second node, because 16 compute
> >> jobs (100% CPU load) and 16 data servers (almost 0% load) are
> >> running there, respectively. The strange thing is also that the
> >> same version runs fine on gigabit and myrinet.
> >> It would be great if somebody could help me with that. If you
> >> need more information, I will be happy to share it with you.
> >> Thanks very much.
> >> Thomas
> >> _______________________________________________
> >> users mailing list
> >> users_at_[hidden]
> >> http://www.open-mpi.org/mailman/listinfo.cgi/users