Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] OpenMPI fails to run with -np larger than 10
From: Seyyed Mohtadin Hashemi (haadah_at_[hidden])
Date: 2012-04-13 12:36:42


That fixed the issue, but it raises a big question about why this
happened.

I'm pretty sure it's not a system memory issue: the node with the least RAM
has 8 GB, which I would think is more than enough.

Do you think that adjusting btl_sm_eager_limit, mpool_sm_min_size, and
mpool_sm_max_size could help fix the problem? (I found these at
http://www.open-mpi.org/faq/?category=sm ) I ask because, compared to -np 10,
the performance of -np 18 is worse when running with the command you
suggested. I'll try playing around with the parameters and see what works.
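
A rough sketch of the two variants to compare: the workaround from this
thread (shared-memory BTL excluded) and a run that keeps sm but overrides
the three parameters named above. The paths are the placeholders used
elsewhere in this thread, and the numeric values are assumptions picked
only for illustration, not recommended settings. The script only prints
the command lines so they can be checked before anything is launched.

```shell
# Common part of the command; the paths are placeholders from this thread.
BASE="mpirun --hostfile path/hostfile -np 18"
APP="path/mdrun_mpi -s path/topol.tpr -o path/output.trr"

# Variant 1: the suggested workaround, excluding the shared-memory BTL.
NO_SM="$BASE --mca btl ^sm $APP"

# Variant 2: keep sm, but override its sizing parameters.
# These numbers are illustrative assumptions, not tuned values.
EAGER_LIMIT=8192        # bytes
MPOOL_MIN=268435456     # bytes (256 MiB)
MPOOL_MAX=1073741824    # bytes (1 GiB)
TUNED_SM="$BASE --mca btl_sm_eager_limit $EAGER_LIMIT"
TUNED_SM="$TUNED_SM --mca mpool_sm_min_size $MPOOL_MIN"
TUNED_SM="$TUNED_SM --mca mpool_sm_max_size $MPOOL_MAX $APP"

# Print both command lines for review; copy one to the shell to run it.
echo "$NO_SM"
echo "$TUNED_SM"
```

Whether all three parameters exist, and what their defaults are, depends on
the installed release; "ompi_info --param btl sm" and "ompi_info --param
mpool sm" should list what the local build actually accepts.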

On Fri, Apr 13, 2012 at 5:44 PM, Ralph Castain <rhc_at_[hidden]> wrote:

> Afraid I have no idea how those packages were built, what release they
> correspond to, etc. I would suggest sticking with the tarballs.
>
> Your output indicates a problem with shared memory when you completely
> fill the machine. Could be a couple of things, like running out of memory -
> but for now, try adding -mca btl ^sm to your cmd line. Should work.
>
>
> On Apr 13, 2012, at 5:09 AM, Seyyed Mohtadin Hashemi wrote:
>
> Hi,
>
> Sorry that it took so long to answer; I didn't get any reply mails and
> had to check the digest for the reply.
>
> Anyway, when I compiled from scratch I did use the tarballs from
> open-mpi.org. GROMACS is not the problem (or at least I don't think so);
> I just used it as a check to see if I could run parallel jobs. I am now
> using the OSU benchmarks because I can't be sure that the problem is not
> with GROMACS.
>
> On the new installation I have not installed (or compiled) OMPI from the
> official tarballs but rather installed the "openmpi-bin, openmpi-common,
> libopenmpi1.3, openmpi-checkpoint, and libopenmpi-dev" packages using
> apt-get.
>
> As for the simple examples (i.e. ring_c, hello_c, and connectivity_c
> extracted from the official 1.4.2 tarball), I get the exact same behavior
> as with GROMACS/OSU bench.
>
>> I suspect you'll have to ask someone familiar with GROMACS about that
>> specific package. As for testing OMPI, can you run the codes in the
>> examples directory - e.g., "hello" and "ring"? I assume you are downloading
>> and installing OMPI from our tarballs?
>>
>> On Apr 12, 2012, at 7:04 AM, Seyyed Mohtadin Hashemi wrote:
>>
>>> Hello,
>>>
>>> I have a very peculiar problem: I have a micro cluster with three nodes
>>> (18 cores total); the nodes are clones of each other, connected to a
>>> frontend via Ethernet, and all run Debian squeeze. When I run parallel
>>> jobs I can use up to "-np 10"; if I go further the job crashes. I have
>>> primarily done tests with GROMACS (because that is what I will be
>>> running) but have also used OSU Micro-Benchmarks 3.5.2.
>>>
>>> For a simple parallel job I use: "path/mpirun -hostfile path/hostfile
>>> -np XX -d -display-map path/mdrun_mpi -s path/topol.tpr -o path/output.trr"
>>>
>>> (path is global) For -np XX of 10 or smaller it works; however, as soon
>>> as I use 11 or larger the whole thing crashes. The terminal dump is
>>> attached to this mail: when_working.txt is for "-np 10", when_crash.txt
>>> is for "-np 12", and OpenMPI_info.txt is the output from
>>> "path/mpirun --bynode --hostfile path/hostfile --tag-output ompi_info -v
>>> ompi full --parsable"
>>>
>>> I have tried OpenMPI v1.4.2 all the way up to beta v1.5.5, and all
>>> yield the same result.
>>>
>>> The output files are from a new install I did today: I formatted all
>>> nodes, started from a fresh minimal install of Squeeze, used "apt-get
>>> install gromacs gromacs-openmpi", and installed all dependencies. Then I
>>> ran two jobs using the parameters described above. I also did one with
>>> OSU bench (data is not included); it also crashed with "-np" larger
>>> than 10.
>>>
>>> I hope somebody can help figure out what is wrong and how I can fix it.
>>>
>>> Best regards,
>>> Mohtadin

-- 
De venligste hilsner/I am, yours most sincerely
Seyyed Mohtadin Hashemi