Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] error depends on the number of processors
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2010-03-24 13:22:25


On Mar 23, 2010, at 12:06 PM, Junwei Huang wrote:

> I am still using LAM/MPI on an old cluster and wonder if I can get
> some help from this mail list.

Please upgrade to Open MPI if possible. :-)

> Here is the problem. I am using an 18-node cluster; each node has 2
> CPUs and each CPU supports up to 2 threads, so I assume I can use
> 18*4 processors. When running the following code, an error message
> always pops up for np=30 or np=60.

Depending on your CPU type and application behavior, using hyperthreads may be more of a hindrance than a help.

> But it works fine for np=12 and np=1. The error message is always the
> same, something like: one of the processor n15, exit with (0), ip
> 192......,
>
> Here is the part of the code where n15 exits. All other PEs can
> finish writing their files, except PE15. Then I see the error message
> about n15, and the file written by PE15 is incomplete. A quick
> question here: is PE15 necessarily run on node 15 of the cluster?
> I would appreciate it if anyone could share experience in debugging
> errors like this.
>
> code:
> ....
> // my_rank is the processor ID; each PE opens a different file
> sprintf(p_obsfile,"%s%d",obsfile,my_rank);

If each MPI process is opening a separate file, then it may not be a file issue that is causing the problem. For example, if each process opens /dev/null, do you have the same problem?

> if ((fp=fopen(p_obsfile,"w"))==NULL)
> printf("PE_%d: The file %s cannot be opened\n",my_rank,p_obsfile);

I do note that you don't have an escape clause here -- if you fail to open the file, you still fall through and try to write to the file.
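As a minimal sketch of what such an escape clause could look like: the helper names open_rank_file and write_values below are hypothetical (they are not from your code), and the 4096-byte path buffer is an assumption. The point is only that a failed fopen() returns early instead of reaching fprintf(). In the real MPI program, the caller would probably react to a -1 return by calling MPI_Abort() or otherwise shutting down cleanly.

```c
#include <stdio.h>

/* Hypothetical helper: build "<obsfile><my_rank>" and open it for writing.
 * Returns NULL (after reporting the problem) if the name does not fit in
 * the buffer or the file cannot be opened. */
FILE *open_rank_file(const char *obsfile, int my_rank,
                     char *path, size_t path_len)
{
    int n = snprintf(path, path_len, "%s%d", obsfile, my_rank);
    if (n < 0 || (size_t)n >= path_len) {
        fprintf(stderr, "PE_%d: file name too long\n", my_rank);
        return NULL;
    }

    FILE *fp = fopen(path, "w");
    if (fp == NULL) {
        fprintf(stderr, "PE_%d: The file %s cannot be opened\n",
                my_rank, path);
    }
    return fp;
}

/* Write num values of U, one per line.
 * Returns 0 on success and -1 on any open, write, or close error. */
int write_values(const char *obsfile, int my_rank,
                 const double *U, int num)
{
    char path[4096]; /* assumed to be large enough for the file name */
    FILE *fp = open_rank_file(obsfile, my_rank, path, sizeof path);
    if (fp == NULL) {
        return -1; /* the escape clause: never fall through to fprintf */
    }

    for (int j = 0; j < num; j++) {
        if (fprintf(fp, "%f\n", U[j]) < 0) {
            fclose(fp);
            return -1;
        }
    }

    /* fclose can also fail, e.g. when buffered data cannot be flushed */
    return fclose(fp) == 0 ? 0 : -1;
}
```

Checking the return values of fprintf() and fclose() as well may help explain an incomplete file, for example if the disk fills up partway through.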

> for (int id=loc*my_rank;id<loc*(my_rank+1);id++){ // loc=TotalNum/NumofPE
> // call a function to calculate U; the function will return the
> // finishing message
> // no communication is needed among processors
> for (int j=0;j<NUM;j++)
> fprintf (fp, "%f\n",U[j]); //output updated U
> }

I think you just want to try standard debugging techniques here. For example, are you going beyond the end of the U array? Perhaps try running your app through valgrind or under a debugger. Do you get corefiles from the run?
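One cheap way to test the "beyond the end of U" theory, short of a full valgrind run, is to route reads of U through a checked accessor while debugging. This is only a sketch: checked_get is a hypothetical helper, and it assumes you know the allocated length of U (called len here), which the posted snippet does not show.

```c
#include <stdio.h>

/* Hypothetical debugging aid: read U[j] only if j is inside the array.
 * len is the number of elements actually allocated for U (an assumption;
 * the posted code does not show how U is allocated).
 * Returns 0 and stores the value in *out, or returns -1 if out of range. */
int checked_get(const double *U, size_t len, size_t j, double *out)
{
    if (U == NULL || out == NULL || j >= len) {
        fprintf(stderr, "index %lu is outside U (length %lu)\n",
                (unsigned long)j, (unsigned long)len);
        return -1;
    }
    *out = U[j];
    return 0;
}
```

If the inner loop runs to NUM but U was allocated with fewer than NUM elements, a check like this reports the first bad index instead of leaving the process to fail somewhere later.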

-- 
Jeff Squyres
jsquyres_at_[hidden]
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/