Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] MPI loop problem
From: Julia He (springwater4he_at_[hidden])
Date: 2009-08-18 14:18:53


The OpenMPI version is

[julia.he_at_bob bin]$ mpirun --version
mpirun (Open MPI) 1.2.8

Report bugs to http://www.open-mpi.org/community/help/

The platform is

[julia.he_at_bob bin]$ uname -a
Linux bob.csi.cuny.edu 2.6.18-92.1.13.el5 #1 SMP Wed Sep 24 19:32:05 EDT 2008 x86_64 x86_64 x86_64 GNU/Linux

my_sub is a modification of the radiative transfer code 6S (http://6s.ltdri.org/). The 6S code takes angles, atmospheric conditions, altitude, etc. as inputs and returns the top-of-atmosphere reflectance as output. The code I provided is pseudocode because 6S consists of many subroutines and the main program alone has 3219 lines.

What I need is to use MPI to parallelize the jobs, so that each computing node computes one set of inputs. But I found that the returned values were not correct after 570 instances. So I passed the same inputs to every computing node, but the problem still exists: the first 570 returned values are correct (and identical, in this case), but after 570 the returned values are NaN.
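
For readers following along, the work distribution described above might look roughly like the sketch below. It is not the poster's actual program: n_sets, get_inputs, and the body of my_sub are hypothetical stand-ins with made-up values, and the round-robin assignment is just one possible scheme. It assumes the Fortran MPI bindings from mpif.h and compilation with an MPI wrapper such as mpif90.

```fortran
program distribute_inputs
  implicit none
  include 'mpif.h'
  integer, parameter :: n_sets = 800   ! hypothetical total number of input sets
  integer :: ierr, rank, nprocs, k
  real :: A, B, C, re

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)

  ! Round-robin assignment: this process handles sets
  ! rank+1, rank+1+nprocs, rank+1+2*nprocs, ...
  do k = rank + 1, n_sets, nprocs
    call get_inputs(k, A, B, C)
    call my_sub(A, B, C, re)
    print *, rank, k, re
  end do

  call MPI_Finalize(ierr)

contains

  ! Hypothetical helper: returns the k-th set of inputs (made-up values).
  subroutine get_inputs(k, A, B, C)
    integer, intent(in) :: k
    real, intent(out)   :: A, B, C
    A = real(k)
    B = 2.0
    C = 3.0
  end subroutine get_inputs

  ! Hypothetical stand-in for the real my_sub (modified 6S code).
  subroutine my_sub(A, B, C, re)
    real, intent(in)  :: A, B, C
    real, intent(out) :: re
    re = A + B + C
  end subroutine my_sub

end program distribute_inputs
```

Each process calls my_sub independently and no data is exchanged between processes, so in a layout like this MPI only determines which input sets a process handles.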

Can someone give me a hint? Our system administrator can't help with programming. I wonder whether some setting in MPI limits the number of times a computation can run. I know it sounds weird, but I have no clue why, with the same inputs, the returned value would be garbage after 570 instances.

Julia

--- On Tue, 8/18/09, Ralph Castain <rhc_at_[hidden]> wrote:

From: Ralph Castain <rhc_at_[hidden]>
Subject: Re: [OMPI users] MPI loop problem
To: "Open MPI Users" <users_at_[hidden]>
Date: Tuesday, August 18, 2009, 10:32 AM

Sorry, but there is no way to answer this question with what is given. What is "my_sub" doing? Which version of OpenMPI are you talking about, and on what platform?

On Tue, Aug 18, 2009 at 8:28 AM, Julia He <springwater4he_at_[hidden]> wrote:

Hi,

I found that a subroutine call inside a loop does not return the correct value after a certain number of iterations. To simplify the problem, the inputs to the subroutine are held constant, so the output should be the same for every iteration on every computing node. It is a Fortran program; after the initialization, the program goes like this:

do i = 1, N
  call my_sub(A, B, C, re)
  print *, mypn, A, B, C, re
end do

where re is the output value of my_sub, and A, B, and C are its inputs.
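
To narrow down where the NaN first appears, the loop above could be instrumented along the following lines. This is only a sketch: my_sub here is a hypothetical stand-in with the same argument list (the real routine is the modified 6S code), mypn is set to a constant instead of being obtained from MPI, and the input values are made up. It relies on the fact that a NaN is the only value that compares unequal to itself.

```fortran
program check_nan
  implicit none
  integer, parameter :: N = 40      ! iterations per process, as in the example
  integer :: i, mypn
  real :: A, B, C, re

  mypn = 0                          ! assumption: the MPI rank in the real program
  A = 1.0; B = 2.0; C = 3.0         ! made-up constant inputs

  do i = 1, N
    call my_sub(A, B, C, re)
    ! A NaN is the only value that is not equal to itself.
    if (re /= re) then
      print *, 'rank', mypn, ': first NaN at iteration', i
      exit
    end if
  end do

contains

  ! Hypothetical stand-in for the real my_sub (modified 6S code).
  subroutine my_sub(A, B, C, re)
    real, intent(in)  :: A, B, C
    real, intent(out) :: re
    re = A + B + C
  end subroutine my_sub

end program check_nan
```

With the stand-in my_sub no NaN is ever produced, so the program prints nothing; with the real routine it would report the first failing iteration on each process. Aggressive floating-point optimization flags (for example -ffast-math in gfortran) can make the re /= re test unreliable, so the check is best compiled without them.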

570 is the number of correct iterations. If the combined number of instances does not exceed 570, the output is fine. For example, if I request 10 computing nodes and N is 40, that gives 10*40=400 instances, and the output is fine. But if the combined number exceeds 570, the first 570 are fine and the rest return NaN. For example, if the number of computing nodes is 20 and N is 40, which gives 20*40=800 instances, then the first 570 are fine but the rest are NaN.

Does anyone know what might cause the problem? I googled it but couldn't find a clue about where to start. Please also let me know what else you need in order to debug the problem.

Thanks.

Julia

_______________________________________________

users mailing list

users_at_[hidden]

http://www.open-mpi.org/mailman/listinfo.cgi/users
