Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] OpenMPI fails to run with -np larger than 10
From: Seyyed Mohtadin Hashemi (haadah_at_[hidden])
Date: 2012-04-16 04:57:28


I recompiled everything from scratch with GCC 4.4.5 and 4.7 using the OMPI
1.4.5 tarball.

I did some tests, and it does not seem that I can make it work. I tried
these:

btl_sm_num_fifos 4
btl_sm_free_list_num 1000
btl_sm_free_list_max 1000000
mpool_sm_min_size 1500000000
mpool_sm_max_size 7500000000

but nothing helped. I started out by varying one parameter at a time from
its default up to 1000000 (except btl_sm_num_fifos, which I only varied up
to 100, and mpool_sm_min_size and mpool_sm_max_size, which I varied from
67 MB [the default was set to 67xxxxxx] to 7.5 GB) to see what reactions I
could get. When running with -np 10 everything worked, but as soon as I
went to -np 11 it crashed with the same old error.
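
For reference, MCA parameters like the ones listed above can be passed to
mpirun as "--mca <name> <value>" pairs. The following is only a minimal
dry-run sketch: it builds and prints such a command line without launching
anything. The hostfile path and the ./hello_c test program are placeholders
chosen for illustration, not the exact command used in these tests.

```shell
#!/bin/sh
# Dry-run sketch: assemble an mpirun command line that sets the
# shared-memory MCA parameters from the list above. Nothing is launched;
# the command is only printed.
# HOSTFILE and APP are hypothetical placeholders, not taken from the thread.
HOSTFILE="path/hostfile"
APP="./hello_c"
NP=11

# Each "--mca <name> <value>" pair sets one MCA parameter for this run.
MCA_ARGS=""
for kv in \
    "btl_sm_num_fifos 4" \
    "btl_sm_free_list_num 1000" \
    "btl_sm_free_list_max 1000000" \
    "mpool_sm_min_size 1500000000" \
    "mpool_sm_max_size 7500000000"
do
    MCA_ARGS="$MCA_ARGS --mca $kv"
done

CMD="mpirun --hostfile $HOSTFILE -np $NP$MCA_ARGS $APP"
echo "$CMD"
```

The same values can alternatively be exported as OMPI_MCA_<name>
environment variables or placed in $HOME/.openmpi/mca-params.conf, which
makes it easier to change one parameter at a time between runs.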

On Fri, Apr 13, 2012 at 6:41 PM, Ralph Castain <rhc_at_[hidden]> wrote:

>
> On Apr 13, 2012, at 10:36 AM, Seyyed Mohtadin Hashemi wrote:
>
> That fixed the issue, but it raises a big question about why this
> happened.
>
> I'm pretty sure it's not a system memory issue; the node with the least RAM
> has 8 GB, which I would think is more than enough.
>
> Do you think that adjusting btl_sm_eager_limit, mpool_sm_min_size, and
> mpool_sm_max_size can help fix the problem? (Found this at
> http://www.open-mpi.org/faq/?category=sm ) Compared to -np 10, the
> performance of -np 18 is worse when running with the command you
> suggested. I'll try playing around with the parameters and see what works.
>
>
> Yes, performance will definitely be worse - I was just trying to isolate
> the problem. I would play a little with those sizes and see what you can
> do. Our shared memory person is pretty much unavailable for the next two
> weeks, but the rest of us will at least try to get you working.
>
> We typically do run with more than 10 ppn, so I know the base sm code
> works at that scale. However, those nodes usually have 32Gbytes of RAM, and
> the default sm params are scaled accordingly.
>
>
>
> On Fri, Apr 13, 2012 at 5:44 PM, Ralph Castain <rhc_at_[hidden]> wrote:
>
>> Afraid I have no idea how those packages were built, what release they
>> correspond to, etc. I would suggest sticking with the tarballs.
>>
>> Your output indicates a problem with shared memory when you completely
>> fill the machine. Could be a couple of things, like running out of memory -
>> but for now, try adding -mca btl ^sm to your cmd line. Should work.
>>
>>
>> On Apr 13, 2012, at 5:09 AM, Seyyed Mohtadin Hashemi wrote:
>>
>> Hi,
>>
>> Sorry that it took so long to answer, I didn't get any return mails and
>> had to check the digest for reply.
>>
>> Anyway, when I compiled from scratch I did use the tarballs from
>> open-mpi.org. GROMACS is not the problem (or at least I don't think so);
>> I just used it as a check to see if I could run parallel jobs. I am now
>> using OSU benchmarks because I can't be sure that the problem is not with
>> GROMACS.
>>
>> On the new installation I have not installed (nor compiled) OMPI from the
>> official tarballs but rather installed the "openmpi-bin, openmpi-common,
>> libopenmpi1.3, openmpi-checkpoint, and libopenmpi-dev" packages using
>> apt-get.
>>
>> As for the simple examples (i.e. ring_c, hello_c, and connectivity_c
>> extracted from the 1.4.2 official tarball), I get the exact same behavior
>> as with GROMACS/OSU bench.
>>
>>> I suspect you'll have to ask someone familiar with GROMACS about that
>>> specific package. As for testing OMPI, can you run the codes in the
>>> examples directory - e.g., "hello" and "ring"? I assume you are downloading
>>> and installing OMPI from our tarballs?
>>>
>>> On Apr 12, 2012, at 7:04 AM, Seyyed Mohtadin Hashemi wrote:
>>>
>>> > Hello,
>>> >
>>> > I have a very peculiar problem: I have a micro cluster with three nodes
>>> > (18 cores total); the nodes are clones of each other, connected to a
>>> > frontend via Ethernet, with Debian squeeze as the OS on all nodes. When
>>> > I run parallel jobs I can use up to "-np 10"; if I go further the job
>>> > crashes. I have primarily done tests with GROMACS (because that is what
>>> > I will be running) but have also used OSU Micro-Benchmarks 3.5.2.
>>> >
>>> > For a simple parallel job I use: "path/mpirun -hostfile path/hostfile
>>> > -np XX -d -display-map path/mdrun_mpi -s path/topol.tpr -o path/output.trr"
>>> >
>>> > (path is global) For "-np XX" smaller than or equal to 10 it works;
>>> > however, as soon as I use 11 or larger the whole thing crashes. The
>>> > terminal dump is attached to this mail: when_working.txt is for "-np 10",
>>> > when_crash.txt is for "-np 12", and OpenMPI_info.txt is the output from
>>> > "path/mpirun --bynode --hostfile path/hostfile --tag-output ompi_info -v
>>> > ompi full -parsable"
>>> >
>>> > I have tried OpenMPI v1.4.2 all the way up to beta v1.5.5, and all
>>> > yield the same result.
>>> >
>>> > The output files are from a new install I did today: I formatted all
>>> > nodes, started from a fresh minimal install of Squeeze, and used
>>> > "apt-get install gromacs gromacs-openmpi" and installed all
>>> > dependencies. Then I ran two jobs using the parameters described above.
>>> > I also did one with OSU bench (data is not included); it also crashed
>>> > with "-np" larger than 10.
>>> >
>>> > I hope somebody can help figure out what is wrong and how I can fix it.
>>> >
>>> > Best regards,
>>> > Mohtadin
>>> >
>>> > <Archive.zip>
>>
>>
>
>
> --
> De venligste hilsner/I am, yours most sincerely
> Seyyed Mohtadin Hashemi
>
>
>

-- 
De venligste hilsner/I am, yours most sincerely
Seyyed Mohtadin Hashemi