
Subject: Re: [OMPI users] 64-bit version of openmpi-1.6.5a1r28554 hangs
From: Jeff Squyres (jsquyres) (jsquyres_at_[hidden])
Date: 2013-05-29 17:29:26


Siegmar --

I'm a bit confused by your final table:

> local machine                | -host
>                              | sunpc1 | linpc1 | rs1
> -----------------------------+--------+--------+-------
> sunpc1 (Solaris 10, x86_64)  |   ok   | hangs  | hangs
> linpc1 (Solaris 10, x86_64)  | hangs  |   ok   |   ok
> rs1 (Solaris 10, sparc)      | hangs  |   ok   |   ok

Is linpc1 a Linux machine or Solaris machine?

Ralph and I talked about this on the phone, and it seems like sunpc1 is just wrong somehow -- its behavior doesn't jibe with the error message you sent.

Can you verify that all 3 versions were built exactly the same way (e.g., debug or not debug)?
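
One quick way to check is ompi_info; it reports the internal debug setting, so something like

    ompi_info | grep -i debug

run from each of the three installations should show whether they were all configured with --enable-debug. (Just a suggestion for a cross-check; the exact output line may vary a bit between versions.)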

On May 29, 2013, at 10:31 AM, Siegmar Gross <Siegmar.Gross_at_[hidden]> wrote:

> Hello Ralph,
>
>> Could you please clarify - are you mixing 32 and 64 bit versions
>> in your runs that have a problem?
>
> No, I have four different versions on each machine.
>
> tyr fd1026 1250 ls -ld /usr/local/openmpi-1.6.5_*
> drwxr-xr-x 7 root root 512 May 23 14:00 /usr/local/openmpi-1.6.5_32_cc
> drwxr-xr-x 7 root root 512 May 23 13:55 /usr/local/openmpi-1.6.5_32_gcc
> drwxr-xr-x 7 root root 512 May 23 10:12 /usr/local/openmpi-1.6.5_64_cc
> drwxr-xr-x 7 root root 512 May 23 12:21 /usr/local/openmpi-1.6.5_64_gcc
>
> "/usr/local" is a link to machine specific files on a NFS server.
>
> lrwxrwxrwx 1 root root 25 Jan 10 07:47 local -> /export2/prog/SunOS_sparc
> lrwxrwxrwx 1 root root 26 Oct 5 2012 local -> /export2/prog/SunOS_x86_64
> ...
>
> I can choose a package in my file "$HOME/.cshrc".
>
> tyr fd1026 1251 more .cshrc
> ...
> #set MPI = openmpi-1.6.5_32_cc
> #set MPI = openmpi-1.6.5_32_gcc
> #set MPI = openmpi-1.6.5_64_cc
> #set MPI = openmpi-1.6.5_64_gcc
> ...
> source /opt/global/cshrc
> ...
>
>
> "/opt/global/cshrc" determines the processor architecture and operating
> system and calls package-specific initialization files.
>
> tyr fd1026 1258 more /opt/global/mpi.csh
> ...
> case openmpi-1.6.5_32_cc:
> case openmpi-1.6.5_32_gcc:
> case openmpi-1.6.5_64_cc:
> case openmpi-1.6.5_64_gcc:
> ...
> if (($MPI == openmpi-1.7_32_cc) || ($MPI == openmpi-1.9_32_cc) || \
> ($MPI == ompi-java_32_cc) || ($MPI == ompi-java_32_gcc) || \
> ($MPI == openmpi-1.7_32_gcc) || ($MPI == openmpi-1.9_32_gcc)) then
> if ($JDK != jdk1.7.0_07-32) then
> echo " "
> echo "In '${MPI}' funktioniert 'mpijavac' nur mit"
> echo "'jdk1.7.0_07-32'. Waehlen Sie bitte das entsprechende"
> echo "Paket in der Datei '${HOME}/.cshrc' aus und melden Sie"
> echo "sich ab und wieder an, wenn Sie 'mpiJava' benutzen"
> echo "wollen."
> echo " "
> endif
> endif
> ...
> setenv OPENMPI_HOME ${DIRPREFIX_PROG}/$MPI
> ...
> set path = ( $path ${OPENMPI_HOME}/bin )
> ...
>
> mpi.csh sets all necessary environment variables for the selected
> package. If I select a different package in "$HOME/.cshrc", I must
> log out and log in again, so that I never mix environments for
> different packages, because my home directory and "/opt/global" are
> the same on all machines (they are provided via an NFS server).
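>
> As a cross-check that every host really resolves the same package,
> something like the following can be run from any machine (only a
> sketch; "which" and "ompi_info" do the real work, the loop itself is
> just for illustration):
>
> foreach h (sunpc1 linpc1 rs1)
>     echo $h
>     ssh $h 'which mpiexec; which orted; ompi_info | grep "Open MPI:"'
> end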
>
>
>> If that isn't the case, then the error message is telling you that
>> the system thinks you are mixing optimized and debug versions -
>> i.e., one node is using an optimized version of OMPI and another
>> is using a debug version. This also isn't allowed.
>
> I build my packages by copying and pasting the commands from a file.
> All configure commands use "--enable-debug" (three different
> architectures, each built with two different compilers).
>
> tyr openmpi-1.6.5 1263 grep -- enable-debug README-OpenMPI-1.6.5
> --enable-debug \
> --enable-debug \
> --enable-debug \
> --enable-debug \
> --enable-debug \
> --enable-debug \
> tyr openmpi-1.6.5 1264
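>
> For reference, a sketch of what one of those configure lines looks
> like (the prefix and compiler variables shown here are illustrative
> assumptions, not copied from README-OpenMPI-1.6.5):
>
> ../openmpi-1.6.5a1r28554/configure \
>     --prefix=/usr/local/openmpi-1.6.5_64_gcc \
>     --enable-debug \
>     CC=gcc CXX=g++ F77=gfortran FC=gfortran \
>     CFLAGS=-m64 CXXFLAGS=-m64 FFLAGS=-m64 FCFLAGS=-m64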
>
>
>> If you check and find those two conditions are okay, then I suspect
>> you are hitting the Solaris "bit rot" problem that we've talked
>> about before - and are unlikely to be able to fix any time soon.
>
> sunpc1 hello_1 113 mpiexec -mca btl ^udapl -np 4 -host sunpc1 hello_1_mpi
> Process 2 of 4 running on sunpc1
> ...
>
>
> sunpc1 hello_1 114 mpiexec -mca btl ^udapl -np 4 -host linpc1 hello_1_mpi
> [sunpc1:05035] [[4165,0],0] ORTE_ERROR_LOG: Buffer type (described vs
> non-described) mismatch - operation not allowed in file
> ../../../../../openmpi-1.6.5a1r28554/orte/mca/grpcomm/bad/grpcomm_bad_module.c
> at line 841
> ^Cmpiexec: killing job...
>
>
> I get the following table when I use each machine as the local machine
> and run the command with each of the hosts.
>
>
> local machine                | -host
>                              | sunpc1 | linpc1 | rs1
> -----------------------------+--------+--------+-------
> sunpc1 (Solaris 10, x86_64)  |   ok   | hangs  | hangs
> linpc1 (Solaris 10, x86_64)  | hangs  |   ok   |   ok
> rs1 (Solaris 10, sparc)      | hangs  |   ok   |   ok
>
>
>
> It seems that I have a problem on Solaris x86_64 with gcc-4.8.0 when
> I use a 64-bit version of Open MPI. I have no problems with a 64-bit
> version of Open MPI built with Sun C, or with any 32-bit version of
> Open MPI. Do you have any idea what I can do to track down the
> problem and find a solution?
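>
> If it helps, I could rerun the failing command with more verbosity,
> for example (assuming the usual MCA verbosity parameters also apply
> to this code path):
>
> mpiexec -mca btl ^udapl -mca plm_base_verbose 5 \
>     -mca grpcomm_base_verbose 5 -np 4 -host linpc1 hello_1_mpi
>
> and take a stack trace of the hanging mpiexec with "pstack <pid>" on
> Solaris.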
>
>
> Kind regards
>
> Siegmar
>
>
>
>> On May 24, 2013, at 12:02 AM, Siegmar Gross <Siegmar.Gross_at_[hidden]> wrote:
>>
>>> Hi
>>>
>>> I installed openmpi-1.6.5a1r28554 on "openSuSE Linux 12.1", "Solaris 10
>>> x86_64", and "Solaris 10 sparc" with gcc-4.8.0 and "Sun C 5.12" in 32-
>>> and 64-bit versions. Unfortunately I have a problem with the 64-bit
>>> version if I build Open MPI with gcc. The program hangs and I have
>>> to terminate it with <Ctrl-c>.
>>>
>>>
>>> sunpc1 hello_1 144 mpiexec -mca btl ^udapl -np 4 \
>>> -host sunpc1,linpc1,rs0 hello_1_mpi
>>> [sunpc1:15576] [[16182,0],0] ORTE_ERROR_LOG: Buffer type (described vs
>>> non-described) mismatch - operation not allowed in file
>>> ../../../../../openmpi-1.6.5a1r28554/orte/mca/grpcomm/bad/grpcomm_bad_module.c
>>> at line 841
>>> ^Cmpiexec: killing job...
>>>
>>> sunpc1 hello_1 145 which mpiexec
>>> /usr/local/openmpi-1.6.5_64_gcc/bin/mpiexec
>>> sunpc1 hello_1 146
>>>
>>>
>>> I have no problems with the 64-bit version if I compile Open MPI
>>> with Sun C. Both 32-bit versions (compiled with "cc" or "gcc") work
>>> as expected as well.
>>>
>>> sunpc1 hello_1 106 mpiexec -mca btl ^udapl -np 4 \
>>> -host sunpc1,linpc1,rs0 hello_1_mpi
>>> Process 2 of 4 running on rs0.informatik.hs-fulda.de
>>> Process 0 of 4 running on sunpc1
>>> Process 3 of 4 running on sunpc1
>>> Process 1 of 4 running on linpc1
>>> Now 3 slave tasks are sending greetings.
>>> Greetings from task 3:
>>> message type: 3
>>> msg length: 116 characters
>>> message:
>>> hostname: sunpc1
>>> operating system: SunOS
>>> release: 5.10
>>> processor: i86pc
>>> ...
>>>
>>> sunpc1 hello_1 107 which mpiexec
>>> /usr/local/openmpi-1.6.5_64_cc/bin/mpiexec
>>>
>>>
>>>
>>> sunpc1 hello_1 106 mpiexec -mca btl ^udapl -np 4 \
>>> -host sunpc1,linpc1,rs0 hello_1_mpi
>>> Process 2 of 4 running on rs0.informatik.hs-fulda.de
>>> Process 3 of 4 running on sunpc1
>>> Process 0 of 4 running on sunpc1
>>> Process 1 of 4 running on linpc1
>>> ...
>>>
>>> sunpc1 hello_1 107 which mpiexec
>>> /usr/local/openmpi-1.6.5_32_gcc/bin/mpiexec
>>>
>>>
>>> I would be grateful if somebody could fix the problem with the
>>> 64-bit gcc version. Thank you very much in advance for any help.
>>>
>>>
>>> Kind regards
>>>
>>> Siegmar
>>>

-- 
Jeff Squyres
jsquyres_at_[hidden]
For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/