Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] OpenMPI 1.3.1 rpm build error
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2009-02-23 16:48:20


It would be interesting to see what happens with the 1.3 build.

It's hard to interpret the output of your user's test program without
knowing exactly what that printf means...

On Feb 23, 2009, at 4:44 PM, Jim Kusznir wrote:

> I haven't had time to do the openmpi build from the nightly yet, but
> my user has run some more tests and now has a simple program and
> algorithm to "break" openmpi. His notes:
>
> Hey, just FYI, I can reproduce the error readily in a simple test case.
> My "way to break MPI" is as follows: the master proc runs MPI_Send 1000
> times to each child, then waits for an "I got it" ack from each child.
> Each child receives 1000 numbers from the master, then sends "I got
> it" to the master.
> Running this on 25 nodes causes it to break about 60% of the time.
> Interestingly, it usually breaks on the same process number each time.
>
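
To make the message flow in that description concrete, here is a minimal sketch that models it without MPI. It is not the user's program: Python threads and queues stand in for ranks, `inbox.put` plays the role of MPI_Send and `inbox.get` the role of the matching receive. The names `child` and `run`, the choice of 24 children for "25 nodes", and the send order (all 1000 values to one child before moving to the next) are assumptions. With no real interconnect involved, it cannot reproduce the reported stall; it only shows the expected happy path.

```python
import queue
import threading

N_MESSAGES = 1000  # from the description: 1000 sends per child
N_CHILDREN = 24    # assumption: 25 processes = 1 master + 24 children


def child(inbox, acks, child_id, n_messages):
    """Receive n_messages numbers from the master, then send one ack."""
    received = [inbox.get() for _ in range(n_messages)]
    acks.put((child_id, len(received)))  # the "I got it" message


def run(n_children=N_CHILDREN, n_messages=N_MESSAGES):
    """Run the master side; return {child_id: count_received}."""
    inboxes = [queue.Queue() for _ in range(n_children)]
    acks = queue.Queue()
    workers = [
        threading.Thread(target=child, args=(inboxes[i], acks, i, n_messages))
        for i in range(n_children)
    ]
    for worker in workers:
        worker.start()

    # Master: send every value to every child first ...
    for inbox in inboxes:
        for value in range(n_messages):
            inbox.put(value)

    # ... and only then wait for one ack per child.
    results = dict(acks.get(timeout=30) for _ in range(n_children))
    for worker in workers:
        worker.join()
    return results


if __name__ == "__main__":
    counts = run()
    print(all(count == N_MESSAGES for count in counts.values()))
```

In the real test, an ack that never arrives at the master corresponds to `acks.get` timing out here.
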
> Ah. It looks like if I let it sit for about 5 minutes, it will sometimes
> work. From my log:
> rank: 23 Mon Feb 23 13:29:44 2009 recieved 816
> rank: 23 Mon Feb 23 13:29:44 2009 recieved 817
> rank: 23 Mon Feb 23 13:29:44 2009 recieved 818
> rank: 23 Mon Feb 23 13:33:08 2009 recieved 819
> rank: 23 Mon Feb 23 13:33:08 2009 recieved 820
>
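
One way to read that log excerpt is to measure the time between consecutive lines for the same rank. The sketch below does that for the five lines quoted above. It assumes each line has the form "rank: <rank> <ctime-style timestamp> recieved <n>" and that the trailing number is a message index; neither is confirmed in the thread. The function names and the 60-second threshold are hypothetical, and the spelling "recieved" is kept exactly as the log prints it.

```python
from datetime import datetime

# Log lines copied from the message above.
LOG = """\
rank: 23 Mon Feb 23 13:29:44 2009 recieved 816
rank: 23 Mon Feb 23 13:29:44 2009 recieved 817
rank: 23 Mon Feb 23 13:29:44 2009 recieved 818
rank: 23 Mon Feb 23 13:33:08 2009 recieved 819
rank: 23 Mon Feb 23 13:33:08 2009 recieved 820
"""


def parse_line(line):
    """Return (rank, timestamp, index) for one log line (assumed format)."""
    fields = line.split()
    rank = int(fields[1])
    stamp = datetime.strptime(" ".join(fields[2:7]), "%a %b %d %H:%M:%S %Y")
    return rank, stamp, int(fields[8])


def find_stalls(log_text, threshold_s=60):
    """Return (rank, prev_index, next_index, gap_seconds) for every gap
    longer than threshold_s between consecutive lines of the same rank."""
    stalls = []
    last_seen = {}
    for line in log_text.splitlines():
        if not line.strip():
            continue
        rank, stamp, index = parse_line(line)
        if rank in last_seen:
            prev_stamp, prev_index = last_seen[rank]
            gap = (stamp - prev_stamp).total_seconds()
            if gap > threshold_s:
                stalls.append((rank, prev_index, index, gap))
        last_seen[rank] = (stamp, index)
    return stalls


if __name__ == "__main__":
    for rank, before, after, gap in find_stalls(LOG):
        print(f"rank {rank}: {gap:.0f} s between messages {before} and {after}")
```

On the quoted lines this reports a single gap of 204 seconds (about three and a half minutes) on rank 23, between 818 and 819.
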
> Any thoughts on this problem?
> (this is the only reason I'm currently working on upgrading openmpi)
>
> --Jim
>
> On Fri, Feb 20, 2009 at 1:59 PM, Jeff Squyres <jsquyres_at_[hidden]>
> wrote:
>> There won't be an official SRPM until 1.3.1 is released.
>>
>> But to test if 1.3.1 is on-track to deliver a proper solution to you, can
>> you try a nightly tarball, perhaps in conjunction with our "buildrpm.sh"
>> script?
>>
>>
>> https://svn.open-mpi.org/source/xref/ompi_1.3/contrib/dist/linux/buildrpm.sh
>>
>> It should build a trivial SRPM for you from the tarball. You'll likely need
>> to get the specfile, too, and put it in the same dir as buildrpm.sh. The
>> specfile is in the same SVN directory:
>>
>>
>> https://svn.open-mpi.org/source/xref/ompi_1.3/contrib/dist/linux/openmpi.spec
>>
>>
>>
>> On Feb 20, 2009, at 3:51 PM, Jim Kusznir wrote:
>>
>>> As long as I can still build the rpm for it and install it via rpm.
>>> I'm running it on a ROCKS cluster, so it needs to be an RPM to get
>>> pushed out to the compute nodes.
>>>
>>> --Jim
>>>
>>> On Fri, Feb 20, 2009 at 11:30 AM, Jeff Squyres
>>> <jsquyres_at_[hidden]> wrote:
>>>>
>>>> On Feb 20, 2009, at 2:20 PM, Jim Kusznir wrote:
>>>>
>>>>> I just went to www.open-mpi.org, went to download, then source rpm.
>>>>> Looks like it was actually 1.3-1. Here's the src.rpm that I pulled in:
>>>>>
>>>>>
>>>>> http://www.open-mpi.org/software/ompi/v1.3/downloads/openmpi-1.3-1.src.rpm
>>>>
>>>> Ah, gotcha. Yes, that's 1.3.0, SRPM version 1. We didn't make up this
>>>> nomenclature. :-(
>>>>
>>>>> The reason for this upgrade is that a user seems to have found a bug,
>>>>> possibly in the OpenMPI code, that occasionally results in an
>>>>> MPI_Send() message getting lost. He's managed to reproduce it multiple
>>>>> times, and we can't find anything in his code that can cause it. He's
>>>>> got logs of MPI_Send() going out, but the matching MPI_Recv() never
>>>>> getting anything, thus killing his code. We're currently running
>>>>> 1.2.8 with OFED support (haven't tried turning off OFED, etc. yet).
>>>>
>>>> Ok. 1.3.x is much mo' betta' than 1.2 in many ways. We could probably
>>>> help track down the problem, but if you're willing to upgrade to
>>>> 1.3.x, it'll hopefully just make the problem go away.
>>>>
>>>> Can you try a 1.3.1 nightly tarball?
>>>>
>>>> --
>>>> Jeff Squyres
>>>> Cisco Systems
>>>>
>>>> _______________________________________________
>>>> users mailing list
>>>> users_at_[hidden]
>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>>
>>
>>
>> --
>> Jeff Squyres
>> Cisco Systems
>>

-- 
Jeff Squyres
Cisco Systems