
Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] OpenMPI 1.3.1 rpm build error
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2009-03-02 08:50:14


I'm pretty sure that this particular VT compile issue has already been
fixed in the 1.3 series.

Lenny -- can you try the latest OMPI 1.3.1 nightly tarball to verify?

On Mar 1, 2009, at 4:54 PM, Lenny Verkhovsky wrote:

> We saw the same compilation problem;
> the workaround for us was to configure without VT (see ./configure --help
> for the relevant option).
> I hope the VT guys will fix it at some point.
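> (In case it helps: for us that meant adding something like
> --enable-contrib-no-build=vt to the configure line. I'm citing that option
> from memory, so please double-check the exact name in ./configure --help.)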
>
> Lenny.
>
> On Mon, Feb 23, 2009 at 11:48 PM, Jeff Squyres <jsquyres_at_[hidden]>
> wrote:
>> It would be interesting to see what happens with the 1.3 build.
>>
>> It's hard to interpret the output of your user's test program without
>> knowing exactly what that printf means...
>>
>>
>> On Feb 23, 2009, at 4:44 PM, Jim Kusznir wrote:
>>
>>> I haven't had time to do the Open MPI build from the nightly yet, but
>>> my user has run some more tests and now has a simple program and
>>> algorithm to "break" Open MPI. His notes:
>>>
>>> Hey, just FYI, I can reproduce the error readily in a simple test case.
>>> My "way to break MPI" is as follows: the master proc calls MPI_Send 1000
>>> times to each child, then waits for an "I got it" ack from each child.
>>> Each child receives 1000 numbers from the master, then sends "I got it"
>>> back to the master.
>>> Running this on 25 nodes causes it to break about 60% of the time;
>>> interestingly, it usually breaks on the same process number each time.
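>>>
>>> A minimal sketch of such a test (illustrative only, not the exact
>>> program; the tag values and the printf are made up, and it assumes
>>> plain blocking MPI_Send/MPI_Recv over MPI_COMM_WORLD):
>>>
>>> #include <mpi.h>
>>> #include <stdio.h>
>>>
>>> int main(int argc, char **argv) {
>>>     int rank, size, ack = 1;
>>>     MPI_Init(&argc, &argv);
>>>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>>     MPI_Comm_size(MPI_COMM_WORLD, &size);
>>>     if (rank == 0) {
>>>         /* master: send 1000 numbers to each child... */
>>>         for (int child = 1; child < size; child++)
>>>             for (int i = 0; i < 1000; i++)
>>>                 MPI_Send(&i, 1, MPI_INT, child, 0, MPI_COMM_WORLD);
>>>         /* ...then wait for an "I got it" ack from each child */
>>>         for (int child = 1; child < size; child++)
>>>             MPI_Recv(&ack, 1, MPI_INT, child, 1, MPI_COMM_WORLD,
>>>                      MPI_STATUS_IGNORE);
>>>     } else {
>>>         int val;
>>>         /* child: receive 1000 numbers from the master... */
>>>         for (int i = 0; i < 1000; i++) {
>>>             MPI_Recv(&val, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
>>>                      MPI_STATUS_IGNORE);
>>>             /* the real program also prints a timestamp here */
>>>             printf("rank: %d received %d\n", rank, i);
>>>         }
>>>         /* ...then send the "I got it" ack back to the master */
>>>         MPI_Send(&ack, 1, MPI_INT, 0, 1, MPI_COMM_WORLD);
>>>     }
>>>     MPI_Finalize();
>>>     return 0;
>>> }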
>>>
>>> Ah, it looks like if I let it sit for about 5 minutes, sometimes it
>>> will work. From my log:
>>> rank: 23 Mon Feb 23 13:29:44 2009 received 816
>>> rank: 23 Mon Feb 23 13:29:44 2009 received 817
>>> rank: 23 Mon Feb 23 13:29:44 2009 received 818
>>> rank: 23 Mon Feb 23 13:33:08 2009 received 819
>>> rank: 23 Mon Feb 23 13:33:08 2009 received 820
>>>
>>> Any thoughts on this problem?
>>> (this is the only reason I'm currently working on upgrading Open MPI)
>>>
>>> --Jim
>>>
>>> On Fri, Feb 20, 2009 at 1:59 PM, Jeff Squyres <jsquyres_at_[hidden]>
>>> wrote:
>>>>
>>>> There won't be an official SRPM until 1.3.1 is released.
>>>>
>>>> But to test whether 1.3.1 is on track to deliver a proper solution for
>>>> you, can you try a nightly tarball, perhaps in conjunction with our
>>>> "buildrpm.sh" script?
>>>>
>>>>
>>>>
>>>> https://svn.open-mpi.org/source/xref/ompi_1.3/contrib/dist/linux/buildrpm.sh
>>>>
>>>> It should build a trivial SRPM for you from the tarball. You'll likely
>>>> need to get the specfile, too, and put it in the same dir as
>>>> buildrpm.sh. The specfile is in the same SVN directory:
>>>>
>>>>
>>>>
>>>> https://svn.open-mpi.org/source/xref/ompi_1.3/contrib/dist/linux/openmpi.spec
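>>>>
>>>> (Roughly: put buildrpm.sh and openmpi.spec in the same directory,
>>>> download the nightly tarball there, and run something like
>>>> "./buildrpm.sh <nightly tarball filename>". I'm recalling the
>>>> invocation from memory, so check the comments at the top of the script
>>>> if it complains about arguments.)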
>>>>
>>>>
>>>>
>>>> On Feb 20, 2009, at 3:51 PM, Jim Kusznir wrote:
>>>>
>>>>> That works, as long as I can still build the RPM for it and install
>>>>> it via rpm.
>>>>> I'm running it on a ROCKS cluster, so it needs to be an RPM to get
>>>>> pushed out to the compute nodes.
>>>>>
>>>>> --Jim
>>>>>
>>>>> On Fri, Feb 20, 2009 at 11:30 AM, Jeff Squyres
>>>>> <jsquyres_at_[hidden]>
>>>>> wrote:
>>>>>>
>>>>>> On Feb 20, 2009, at 2:20 PM, Jim Kusznir wrote:
>>>>>>
>>>>>>> I just went to www.open-mpi.org, went to download, then source rpm.
>>>>>>> Looks like it was actually 1.3-1. Here's the src.rpm that I pulled in:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> http://www.open-mpi.org/software/ompi/v1.3/downloads/openmpi-1.3-1.src.rpm
>>>>>>
>>>>>> Ah, gotcha. Yes, that's 1.3.0, SRPM version 1. We didn't make up
>>>>>> this nomenclature. :-(
>>>>>>
>>>>>>> The reason for this upgrade is that a user seems to have found some
>>>>>>> bug, possibly in the Open MPI code, that results in an MPI_Send()
>>>>>>> message occasionally getting lost. He's managed to reproduce it
>>>>>>> multiple times, and we can't find anything in his code that could
>>>>>>> cause it... He's got logs of the MPI_Send() going out, but the
>>>>>>> matching MPI_Recv() never gets anything, thus killing his code.
>>>>>>> We're currently running 1.2.8 with OFED support (haven't tried
>>>>>>> turning off OFED, etc. yet).
>>>>>>
>>>>>> Ok. 1.3.x is much mo' betta' than 1.2 in many ways. We could probably
>>>>>> help track down the problem, but if you're willing to upgrade to
>>>>>> 1.3.x, it'll hopefully just make the problem go away.
>>>>>>
>>>>>> Can you try a 1.3.1 nightly tarball?
>>>>>>
>>>>>> --
>>>>>> Jeff Squyres
>>>>>> Cisco Systems
>>>>>>
>>>>
>>>>
>>>> --
>>>> Jeff Squyres
>>>> Cisco Systems
>>>>
>>
>>
>> --
>> Jeff Squyres
>> Cisco Systems
>>
>

-- 
Jeff Squyres
Cisco Systems