Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] some mpi processes "disappear" on a cluster of servers
From: Andrea Negri (negri.andre_at_[hidden])
Date: 2012-09-07 05:38:45


George,

I have made some modifications to the code; however, this is the first
part of my zmp_inp:
! ZEUSMP2 CONFIGURATION FILE
 &GEOMCONF LGEOM = 2,
            LDIMEN = 2 /
 &PHYSCONF LRAD = 0,
            XHYDRO = .TRUE.,
            XFORCE = .TRUE.,
            XMHD = .false.,
            XTOTNRG = .false.,
            XGRAV = .false.,
            XGRVFFT = .false.,
            XPTMASS = .false.,
            XISO = .false.,
            XSUBAV = .false.,
            XVGRID = .false.,
!- - - - - - - - - - - - - - - - - - -
            XFIXFORCE = .TRUE.,
            XFIXFORCE2 = .TRUE.,
!- - - - - - - - - - - - - - - - - - -
            XSOURCEENERGY = .TRUE.,
            XSOURCEMASS = .TRUE.,
!- - - - - - - - - - - - - - - - - - -
            XRADCOOL = .TRUE.,
            XA_RGB_WINDS = .TRUE.,
            XSNIa = .TRUE./
!=====================================
 &IOCONF XASCII = .false.,
            XA_MULT = .false.,
            XHDF = .TRUE.,
            XHST = .TRUE.,
            XRESTART = .TRUE.,
            XTSL = .false.,
            XDPRCHDF = .TRUE.,
            XTTY = .TRUE. ,
            XAGRID = .false. /
 &PRECONF SMALL_NO = 1.0D-307,
            LARGE_NO = 1.0D+307 /
 &ARRAYCONF IZONES = 100,
            JZONES = 125,
            KZONES = 1,
            MAXIJK = 125/
 &mpitop ntiles(1)=5,ntiles(2)=2,ntiles(3)=1,periodic=2*.false.,.true. /
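
For what it's worth, if I read the ZEUS-MP conventions correctly, the number
of MPI processes has to match ntiles(1)*ntiles(2)*ntiles(3), i.e. 5*2*1 = 10
here, so the run is launched with something like (the executable name is only
illustrative):

   mpirun -np 10 ./zeusmp2.x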

I have run some tests, and currently I'm able to perform a run with
10 processes on 10 nodes, i.e. I use only one of the two CPUs in each
node. It now crashes after 6 hours instead of after 20 minutes!
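
A quick way to watch the memory on each node while the job runs is
something like this (the node names are placeholders):

   for node in node01 node02; do ssh $node free -m; done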

2012/9/6 <users-request_at_[hidden]>:
> Send users mailing list submissions to
> users_at_[hidden]
>
> To subscribe or unsubscribe via the World Wide Web, visit
> http://www.open-mpi.org/mailman/listinfo.cgi/users
> or, via email, send a message with subject or body 'help' to
> users-request_at_[hidden]
>
> You can reach the person managing the list at
> users-owner_at_[hidden]
>
> When replying, please edit your Subject line so it is more specific
> than "Re: Contents of users digest..."
>
>
> Today's Topics:
>
> 1. Re: error compiling openmpi-1.6.1 on Windows 7 (Siegmar Gross)
> 2. Re: OMPI 1.6.x Hang on khugepaged 100% CPU time (Yong Qin)
> 3. Regarding the Pthreads (seshendra seshu)
> 4. Re: some mpi processes "disappear" on a cluster of servers
> (George Bosilca)
> 5. SIGSEGV in OMPI 1.6.x (Yong Qin)
> 6. Re: error compiling openmpi-1.6.1 on Windows 7 (Siegmar Gross)
> 7. Re: Infiniband performance Problem and stalling
> (Yevgeny Kliteynik)
> 8. Re: SIGSEGV in OMPI 1.6.x (Jeff Squyres)
> 9. Re: Regarding the Pthreads (Jeff Squyres)
> 10. Re: python-mrmpi() failed (Jeff Squyres)
> 11. Re: MPI_Cart_sub periods (Jeff Squyres)
> 12. Re: error compiling openmpi-1.6.1 on Windows 7 (Shiqing Fan)
>
>
> ----------------------------------------------------------------------
>
> Message: 1
> Date: Wed, 5 Sep 2012 17:43:50 +0200 (CEST)
> From: Siegmar Gross <Siegmar.Gross_at_[hidden]>
> Subject: Re: [OMPI users] error compiling openmpi-1.6.1 on Windows 7
> To: fan_at_[hidden]
> Cc: users_at_[hidden]
> Message-ID: <201209051543.q85FhoBa021975_at_[hidden]>
> Content-Type: TEXT/plain; charset=ISO-8859-1
>
> Hi Shiqing,
>
>> Could you try set OPENMPI_HOME env var to the root of the Open MPI dir?
>> This env is a backup option for the registry.
>
> It solves one problem but there is a new problem now :-((
>
>
> Without OPENMPI_HOME: Wrong pathname to help files.
>
> D:\...\prog\mpi\small_prog>mpiexec init_finalize.exe
> --------------------------------------------------------------------------
> Sorry! You were supposed to get help about:
> invalid if_inexclude
> But I couldn't open the help file:
> D:\...\prog\mpi\small_prog\..\share\openmpi\help-mpi-btl-tcp.txt:
> No such file or directory. Sorry!
> --------------------------------------------------------------------------
> ...
>
>
>
> With OPENMPI_HOME: It nearly uses the correct directory. Unfortunately
> the pathname contains the character " in the wrong place so that it
> couldn't find the available help file.
>
> set OPENMPI_HOME="c:\Program Files (x86)\openmpi-1.6.1"
>
> D:\...\prog\mpi\small_prog>mpiexec init_finalize.exe
> --------------------------------------------------------------------------
> Sorry! You were supposed to get help about:
> no-hostfile
> But I couldn't open the help file:
> "c:\Program Files (x86)\openmpi-1.6.1"\share\openmpi\help-hostfile.txt: Invalid argument. Sorry
> !
> --------------------------------------------------------------------------
> [hermes:04964] [[12187,0],0] ORTE_ERROR_LOG: Not found in file ..\..\openmpi-1.6.1\orte\mca\ras\base
> \ras_base_allocate.c at line 200
> [hermes:04964] [[12187,0],0] ORTE_ERROR_LOG: Not found in file ..\..\openmpi-1.6.1\orte\mca\plm\base
> \plm_base_launch_support.c at line 99
> [hermes:04964] [[12187,0],0] ORTE_ERROR_LOG: Not found in file ..\..\openmpi-1.6.1\orte\mca\plm\proc
> ess\plm_process_module.c at line 996
>
>
>
> It looks like the environment variable can also solve my
> problem in the 64-bit environment.
>
> D:\g...\prog\mpi\small_prog>mpicc init_finalize.c
>
> Microsoft (R) C/C++-Optimierungscompiler Version 16.00.40219.01 für x64
> ...
>
>
> The process hangs without OPENMPI_HOME.
>
> D:\...\prog\mpi\small_prog>mpiexec init_finalize.exe
> ^C
>
>
> With OPENMPI_HOME:
>
> set OPENMPI_HOME="c:\Program Files\openmpi-1.6.1"
>
> D:\...\prog\mpi\small_prog>mpiexec init_finalize.exe
> --------------------------------------------------------------------------
> Sorry! You were supposed to get help about:
> no-hostfile
> But I couldn't open the help file:
> "c:\Program Files\openmpi-1.6.1"\share\openmpi\help-hostfile.txt: Invalid argument. S
> orry!
> --------------------------------------------------------------------------
> [hermes:05248] [[10367,0],0] ORTE_ERROR_LOG: Not found in file ..\..\openmpi-1.6.1\orte\mc
> a\ras\base\ras_base_allocate.c at line 200
> [hermes:05248] [[10367,0],0] ORTE_ERROR_LOG: Not found in file ..\..\openmpi-1.6.1\orte\mc
> a\plm\base\plm_base_launch_support.c at line 99
> [hermes:05248] [[10367,0],0] ORTE_ERROR_LOG: Not found in file ..\..\openmpi-1.6.1\orte\mc
> a\plm\process\plm_process_module.c at line 996
>
>
> At least the program doesn't block any longer. Do you have any ideas
> how this new problem can be solved?
>
>
> Kind regards
>
> Siegmar
>
>
>
>> On 2012-09-05 1:02 PM, Siegmar Gross wrote:
>> > Hi Shiqing,
>> >
>> >>>> D:\...\prog\mpi\small_prog>mpiexec init_finalize.exe
>> >>>> ---------------------------------------------------------------------
>> >>>> Sorry! You were supposed to get help about:
>> >>>> invalid if_inexclude
>> >>>> But I couldn't open the help file:
>> >>>> D:\...\prog\mpi\small_prog\..\share\openmpi\help-mpi-btl-tcp.txt:
>> >>>> No such file or directory. Sorry!
>> >>>> ---------------------------------------------------------------------
>> >>> ...
>> >>>> Why does "mpiexec" look for the help file relative to my current
>> >>>> program and not relative to itself? The file is part of the
>> >>>> package.
>> >>> Do you know how I can solve this problem?
>> >> I have a similar issue with a message from tcp, but it's not about
>> >> finding the file; it's something else, and it doesn't affect the execution
>> >> of the application. Could you make sure help-mpi-btl-tcp.txt is actually
>> >> in the path D:\...\prog\mpi\small_prog\..\share\openmpi\?
>> > That wouldn't be a good idea because I have MPI programs in different
>> > directories so that I would have to install all help files in several
>> > places (<my_directory>/../share/openmpi/help*.txt). All help files are
>> > available in the installation directory of Open MPI.
>> >
>> > dir "c:\Program Files (x86)\openmpi-1.6.1\bin\mpiexec.exe"
>> > ...
>> > 29.08.2012 10:59 38.912 mpiexec.exe
>> > ...
>> > dir "c:\Program Files (x86)\openmpi-1.6.1\bin\..\share\openmpi\help-mpi-btl-tcp.txt"
>> > ...
>> > 03.04.2012 16:30 631 help-mpi-btl-tcp.txt
>> > ...
>> >
>> > I don't know if "mpiexec" or my program "init_finalize" is responsible
>> > for the error message but whoever is responsible shouldn't use the path
>> > to my program but the prefix_dir from MPI to find the help files. Perhaps
>> > you can change the behaviour in the Open MPI source code.
>> >
>> >
>> >>>> I can also compile in 64-bit mode but the program hangs.
>> >>> Do you have any ideas why the program hangs? Thank you very much for any
>> >>> help in advance.
>> >> To be honest, I don't know. I couldn't reproduce it. Did you try
>> >> installing with the binary installer? Does it behave the same?
>> > I like to have different versions of Open MPI which I activate via
>> > a batch file so that I can still run my program in an old version if
>> > something goes wrong in a new one. I have no entries in the system
>> > environment or registry so that I can even run different versions in
>> > different command windows without problems (everything is only known
>> > within the command window in which I have run my batch file). It seems
>> > that you put something in the registry when I use your installer.
>> > Perhaps you remember an earlier email where I had to uninstall an old
>> > version because the environment in my own installation was wrong
>> > as long as your installation was active. Nevertheless I can give it
>> > a try. Perhaps I find out if you set more than just the path to your
>> > binaries. Do you know if there is something similar to "truss" or
>> > "strace" in the UNIX world so that I can see where the program hangs?
>> > Thank you very much for your help in advance.
>> >
>> >
>> > Kind regards
>> >
>> > Siegmar
>> >
>>
>>
>> --
>> ---------------------------------------------------------------
>> Shiqing Fan
>> High Performance Computing Center Stuttgart (HLRS)
>> Tel: ++49(0)711-685-87234 Nobelstrasse 19
>> Fax: ++49(0)711-685-65832 70569 Stuttgart
>> http://www.hlrs.de/organization/people/shiqing-fan/
>> email: fan_at_[hidden]
>>
>
>
>
>
> ------------------------------
>
> Message: 2
> Date: Wed, 5 Sep 2012 09:07:35 -0700
> From: Yong Qin <yong.qin_at_[hidden]>
> Subject: Re: [OMPI users] OMPI 1.6.x Hang on khugepaged 100% CPU time
> To: kliteyn_at_[hidden]
> Cc: Open MPI Users <users_at_[hidden]>
> Message-ID:
> <CADEJBEWq0Rzfi_uKx8U4Uz4tjz=vJzn1=RDtPhPYuL04cv9T7A_at_[hidden]>
> Content-Type: text/plain; charset=ISO-8859-1
>
> Yes, so far this has only been observed in VASP and a specific dataset.
>
> Thanks,
>
> On Wed, Sep 5, 2012 at 4:52 AM, Yevgeny Kliteynik
> <kliteyn_at_[hidden]> wrote:
>> On 9/4/2012 7:21 PM, Yong Qin wrote:
>>> On Tue, Sep 4, 2012 at 5:42 AM, Yevgeny Kliteynik
>>> <kliteyn_at_[hidden]> wrote:
>>>> On 8/30/2012 10:28 PM, Yong Qin wrote:
>>>>> On Thu, Aug 30, 2012 at 5:12 AM, Jeff Squyres<jsquyres_at_[hidden]> wrote:
>>>>>> On Aug 29, 2012, at 2:25 PM, Yong Qin wrote:
>>>>>>
>>>>>>> This issue has been observed on OMPI 1.6 and 1.6.1 with openib btl but
>>>>>>> not on 1.4.5 (tcp btl is always fine). The application is VASP and
>>>>>>> only one specific dataset is identified during the testing, and the OS
>>>>>>> is SL 6.2 with kernel 2.6.32-220.23.1.el6.x86_64. The issue is that
>>>>>>> when a certain type of load is put on OMPI 1.6.x, khugepaged thread
>>>>>>> always runs with 100% CPU load, and it looks to me like that OMPI is
>>>>>>> waiting for some memory to be available thus appears to be hung.
>>>>>>> Reducing the per node processes would sometimes ease the problem a bit
>>>>>>> but not always. So I did some further testing by playing around with
>>>>>>> the kernel transparent hugepage support.
>>>>>>>
>>>>>>> 1. Disable transparent hugepage support completely (echo never
>>>>>>>> /sys/kernel/mm/redhat_transparent_hugepage/enabled). This would allow
>>>>>>> the program to progress as normal (as in 1.4.5). Total run time for an
>>>>>>> iteration is 3036.03 s.
>>>>>>
>>>>>> I'll admit that we have not tested using transparent hugepages. I wonder if there's some kind of bad interaction going on here...
>>>>>
>>>>> The transparent hugepage is "transparent", which means it is
>>>>> automatically applied to all applications unless it is explicitly told
>>>>> otherwise. I highly suspect that it is not working properly in this
>>>>> case.
>>>>
>>>> Like Jeff said - I don't think we've ever tested OMPI with transparent
>>>> huge pages.
>>>>
>>>
>>> Thanks. But have you tested OMPI under RHEL 6 or its variants (CentOS
>>> 6, SL 6)? THP is on by default in RHEL 6, so whether you want it or
>>> not, it's there.
>>
>> Interesting. Indeed, THP is on by default in RHEL 6.x.
>> I run OMPI 1.6.x constantly on RHEL 6.2, and I've never seen this problem.
>>
>> I'm checking it with OFED folks, but I doubt that there are any dedicated
>> tests for THP.
>>
>> So do you see it only with a specific application and only on a specific
>> data set? Wonder if I can somehow reproduce it in-house...
>>
>> -- YK
>
>
> ------------------------------
>
> Message: 3
> Date: Wed, 5 Sep 2012 20:23:05 +0200
> From: seshendra seshu <seshu199_at_[hidden]>
> Subject: [OMPI users] Regarding the Pthreads
> To: Open MPI Users <users_at_[hidden]>
> Message-ID:
> <CAJ_xm3AYtMt22NgjtY67TuwOpZxev0ZYSW4fEYGxKA=2yVdG9Q_at_[hidden]>
> Content-Type: text/plain; charset="iso-8859-1"
>
> Hi,
> I am learning pthreads and trying to implement them in my
> quicksort program.
> My problem is that I am unable to understand how to implement pthreads on
> the data received at a node from the master. (In detail: in my program the
> master divides the data and sends it to the slaves, and each slave sorts
> the received data independently and sends it back to the master after
> the sorting is done. Now I am having trouble implementing pthreads at
> the slaves, i.e. how to implement pthreads in order to share the data among
> the cores in each slave, sort it, and send it back to the master.)
> Could anyone help with this problem by providing some suggestions
> and clues?
>
> Thanking you very much.
>
> --
> WITH REGARDS
> M.L.N.Seshendra
>
> ------------------------------
>
> Message: 4
> Date: Thu, 6 Sep 2012 02:40:19 +0200
> From: George Bosilca <bosilca_at_[hidden]>
> Subject: Re: [OMPI users] some mpi processes "disappear" on a cluster
> of servers
> To: Open MPI Users <users_at_[hidden]>
> Message-ID: <F6F521B2-DF90-4827-8ABF-ABE0F3599CF5_at_[hidden]>
> Content-Type: text/plain; charset=us-ascii
>
> Andrea,
>
> As suggested by the previous answers, I guess the size of your problem is too large for the memory available on the nodes. I can run ZeusMP without any issues up to 64 processes, both over Ethernet and InfiniBand. I tried the 1.6 and the current trunk, and both perform as expected.
>
> What is the content of your zmp_inp file?
>
> george.
>
> On Sep 1, 2012, at 16:01 , Andrea Negri <negri.andre_at_[hidden]> wrote:
>
>> I have tried to run with a single process (i.e. the entire grid is
>> contained by one process) and the command free -m on the compute
>> node returns
>>
>>                    total       used       free     shared    buffers     cached
>> Mem:                3913       1540       2372          0         49       1234
>> -/+ buffers/cache:              257       3656
>> Swap:               1983          0       1983
>>
>>
>> while top returns
>> top - 16:01:09 up 4 days, 5:56, 1 user, load average: 0.53, 0.16, 0.10
>> Tasks: 63 total, 3 running, 60 sleeping, 0 stopped, 0 zombie
>> Cpu(s): 49.4% us, 0.7% sy, 0.0% ni, 49.9% id, 0.0% wa, 0.0% hi, 0.0% si
>> Mem: 4007720k total, 1577968k used, 2429752k free, 50664k buffers
>> Swap: 2031608k total, 0k used, 2031608k free, 1263844k cached
>> _______________________________________________
>> users mailing list
>> users_at_[hidden]
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
>
>
>
> ------------------------------
>
> Message: 5
> Date: Wed, 5 Sep 2012 21:06:12 -0700
> From: Yong Qin <yong.qin_at_[hidden]>
> Subject: [OMPI users] SIGSEGV in OMPI 1.6.x
> To: Open MPI Users <users_at_[hidden]>
> Message-ID:
> <CADEJBEVFcsyh5WnK=3YJ6w7b2AASrF7YC4uiMCVAqia-J6CDBg_at_[hidden]>
> Content-Type: text/plain; charset=ISO-8859-1
>
> Hi,
>
> While debugging a mysterious crash of a code, I was able to trace down
> to a SIGSEGV in OMPI 1.6 and 1.6.1. The offending code is in
> opal/mca/memory/linux/malloc.c. Please see the following gdb log.
>
> (gdb) c
> Continuing.
>
> Program received signal SIGSEGV, Segmentation fault.
> opal_memory_ptmalloc2_int_free (av=0x2fd0637, mem=0x203a746f74512000)
> at malloc.c:4385
> 4385 nextsize = chunksize(nextchunk);
> (gdb) l
> 4380 Consolidate other non-mmapped chunks as they arrive.
> 4381 */
> 4382
> 4383 else if (!chunk_is_mmapped(p)) {
> 4384 nextchunk = chunk_at_offset(p, size);
> 4385 nextsize = chunksize(nextchunk);
> 4386 assert(nextsize > 0);
> 4387
> 4388 /* consolidate backward */
> 4389 if (!prev_inuse(p)) {
> (gdb) bt
> #0 opal_memory_ptmalloc2_int_free (av=0x2fd0637,
> mem=0x203a746f74512000) at malloc.c:4385
> #1 0x00002ae6b18ea0c0 in opal_memory_ptmalloc2_free (mem=0x2fd0637)
> at malloc.c:3511
> #2 0x00002ae6b18ea736 in opal_memory_linux_free_hook
> (__ptr=0x2fd0637, caller=0x203a746f74512000) at hooks.c:705
> #3 0x0000000001412fcc in for_dealloc_allocatable ()
> #4 0x00000000007767b1 in ALLOC::dealloc_d2 (array=@0x2fd0647,
> name=@0x6f6e6f69006f6e78, routine=Cannot access memory at address 0x0
> ) at alloc.F90:1357
> #5 0x000000000082628c in M_LDAU::hubbard_term (scell=..., nua=@0xd5,
> na=@0xd5, isa=..., xa=..., indxua=..., maxnh=@0xcf4ff, maxnd=@0xcf4ff,
> lasto=..., iphorb=...,
> numd=..., listdptr=..., listd=..., numh=..., listhptr=...,
> listh=..., nspin=@0xcf4ff00000002, dscf=..., eldau=@0x0, deldau=@0x0,
> fa=..., stress=..., h=...,
> first=@0x0, last=@0x0) at ldau.F:752
> #6 0x00000000006cd532 in M_SETUP_HAMILTONIAN::setup_hamiltonian
> (first=@0x0, last=@0x0, iscf=@0x2) at setup_hamiltonian.F:199
> #7 0x000000000070e257 in M_SIESTA_FORCES::siesta_forces
> (istep=@0xf9a4d07000000000) at siesta_forces.F:90
> #8 0x000000000070e475 in siesta () at siesta.F:23
> #9 0x000000000045e47c in main ()
>
> Can anybody shed some light here on what could be wrong?
>
> Thanks,
>
> Yong Qin
>
>
> ------------------------------
>
> Message: 6
> Date: Thu, 6 Sep 2012 07:48:34 +0200 (CEST)
> From: Siegmar Gross <Siegmar.Gross_at_[hidden]>
> Subject: Re: [OMPI users] error compiling openmpi-1.6.1 on Windows 7
> To: fan_at_[hidden]
> Cc: users_at_[hidden]
> Message-ID: <201209060548.q865mYkE023698_at_[hidden]>
> Content-Type: TEXT/plain; charset=ISO-8859-1
>
> Hi Shiqing,
>
> I have solved the problem with the double quotes in OPENMPI_HOME but
> there is still something wrong.
>
> set OPENMPI_HOME="c:\Program Files (x86)\openmpi-1.6.1"
>
> mpicc init_finalize.c
> Cannot open configuration file "c:\Program Files (x86)\openmpi-1.6.1"/share/openmpi\mpicc-wrapper-data.txt
> Error parsing data file mpicc: Not found
>
>
> Everything is OK if you remove the double quotes which Windows
> automatically adds.
>
> set OPENMPI_HOME=c:\Program Files (x86)\openmpi-1.6.1
>
> mpicc init_finalize.c
> Microsoft (R) 32-Bit C/C++-Optimierungscompiler Version 16.00.40219.01 für 80x86
> ...
>
> mpiexec init_finalize.exe
> --------------------------------------------------------------------------
> WARNING: An invalid value was given for btl_tcp_if_exclude. This
> value will be ignored.
>
> Local host: hermes
> Value: 127.0.0.1/8
> Message: Did not find interface matching this subnet
> --------------------------------------------------------------------------
>
> Hello!
>
>
> I get the output from my program but also a warning from Open MPI.
> The new value for the loopback device was introduced a short time
> ago when I had problems with the loopback device on Solaris
> (it used "lo0" instead of your default "lo"). How can I avoid this
> message? The 64-bit version of my program still hangs.
>
>
> Kind regards
>
> Siegmar
>
>
>> > Could you try set OPENMPI_HOME env var to the root of the Open MPI dir?
>> > This env is a backup option for the registry.
>>
>> It solves one problem but there is a new problem now :-((
>>
>>
>> Without OPENMPI_HOME: Wrong pathname to help files.
>>
>> D:\...\prog\mpi\small_prog>mpiexec init_finalize.exe
>> --------------------------------------------------------------------------
>> Sorry! You were supposed to get help about:
>> invalid if_inexclude
>> But I couldn't open the help file:
>> D:\...\prog\mpi\small_prog\..\share\openmpi\help-mpi-btl-tcp.txt:
>> No such file or directory. Sorry!
>> --------------------------------------------------------------------------
>> ...
>>
>>
>>
>> With OPENMPI_HOME: It nearly uses the correct directory. Unfortunately
>> the pathname contains the character " in the wrong place so that it
>> couldn't find the available help file.
>>
>> set OPENMPI_HOME="c:\Program Files (x86)\openmpi-1.6.1"
>>
>> D:\...\prog\mpi\small_prog>mpiexec init_finalize.exe
>> --------------------------------------------------------------------------
>> Sorry! You were supposed to get help about:
>> no-hostfile
>> But I couldn't open the help file:
>> "c:\Program Files (x86)\openmpi-1.6.1"\share\openmpi\help-hostfile.txt: Invalid argument. Sorry
>> !
>> --------------------------------------------------------------------------
>> [hermes:04964] [[12187,0],0] ORTE_ERROR_LOG: Not found in file ..\..\openmpi-1.6.1\orte\mca\ras\base
>> \ras_base_allocate.c at line 200
>> [hermes:04964] [[12187,0],0] ORTE_ERROR_LOG: Not found in file ..\..\openmpi-1.6.1\orte\mca\plm\base
>> \plm_base_launch_support.c at line 99
>> [hermes:04964] [[12187,0],0] ORTE_ERROR_LOG: Not found in file ..\..\openmpi-1.6.1\orte\mca\plm\proc
>> ess\plm_process_module.c at line 996
>>
>>
>>
>> It looks like the environment variable can also solve my
>> problem in the 64-bit environment.
>>
>> D:\g...\prog\mpi\small_prog>mpicc init_finalize.c
>>
>> Microsoft (R) C/C++-Optimierungscompiler Version 16.00.40219.01 für x64
>> ...
>>
>>
>> The process hangs without OPENMPI_HOME.
>>
>> D:\...\prog\mpi\small_prog>mpiexec init_finalize.exe
>> ^C
>>
>>
>> With OPENMPI_HOME:
>>
>> set OPENMPI_HOME="c:\Program Files\openmpi-1.6.1"
>>
>> D:\...\prog\mpi\small_prog>mpiexec init_finalize.exe
>> --------------------------------------------------------------------------
>> Sorry! You were supposed to get help about:
>> no-hostfile
>> But I couldn't open the help file:
>> "c:\Program Files\openmpi-1.6.1"\share\openmpi\help-hostfile.txt: Invalid argument. S
>> orry!
>> --------------------------------------------------------------------------
>> [hermes:05248] [[10367,0],0] ORTE_ERROR_LOG: Not found in file ..\..\openmpi-1.6.1\orte\mc
>> a\ras\base\ras_base_allocate.c at line 200
>> [hermes:05248] [[10367,0],0] ORTE_ERROR_LOG: Not found in file ..\..\openmpi-1.6.1\orte\mc
>> a\plm\base\plm_base_launch_support.c at line 99
>> [hermes:05248] [[10367,0],0] ORTE_ERROR_LOG: Not found in file ..\..\openmpi-1.6.1\orte\mc
>> a\plm\process\plm_process_module.c at line 996
>>
>>
>> At least the program doesn't block any longer. Do you have any ideas
>> how this new problem can be solved?
>>
>>
>> Kind regards
>>
>> Siegmar
>>
>>
>>
>> > On 2012-09-05 1:02 PM, Siegmar Gross wrote:
>> > > Hi Shiqing,
>> > >
>> > >>>> D:\...\prog\mpi\small_prog>mpiexec init_finalize.exe
>> > >>>> ---------------------------------------------------------------------
>> > >>>> Sorry! You were supposed to get help about:
>> > >>>> invalid if_inexclude
>> > >>>> But I couldn't open the help file:
>> > >>>> D:\...\prog\mpi\small_prog\..\share\openmpi\help-mpi-btl-tcp.txt:
>> > >>>> No such file or directory. Sorry!
>> > >>>> ---------------------------------------------------------------------
>> > >>> ...
>> > >>>> Why does "mpiexec" look for the help file relative to my current
>> > >>>> program and not relative to itself? The file is part of the
>> > >>>> package.
>> > >>> Do you know how I can solve this problem?
>> > >> I have a similar issue with a message from tcp, but it's not about
>> > >> finding the file; it's something else, and it doesn't affect the execution
>> > >> of the application. Could you make sure help-mpi-btl-tcp.txt is actually
>> > >> in the path D:\...\prog\mpi\small_prog\..\share\openmpi\?
>> > > That wouldn't be a good idea because I have MPI programs in different
>> > > directories so that I would have to install all help files in several
>> > > places (<my_directory>/../share/openmpi/help*.txt). All help files are
>> > > available in the installation directory of Open MPI.
>> > >
>> > > dir "c:\Program Files (x86)\openmpi-1.6.1\bin\mpiexec.exe"
>> > > ...
>> > > 29.08.2012 10:59 38.912 mpiexec.exe
>> > > ...
>> > > dir "c:\Program Files (x86)\openmpi-1.6.1\bin\..\share\openmpi\help-mpi-btl-tcp.txt"
>> > > ...
>> > > 03.04.2012 16:30 631 help-mpi-btl-tcp.txt
>> > > ...
>> > >
>> > > I don't know if "mpiexec" or my program "init_finalize" is responsible
>> > > for the error message but whoever is responsible shouldn't use the path
>> > > to my program but the prefix_dir from MPI to find the help files. Perhaps
>> > > you can change the behaviour in the Open MPI source code.
>> > >
>> > >
>> > >>>> I can also compile in 64-bit mode but the program hangs.
>> > >>> Do you have any ideas why the program hangs? Thank you very much for any
>> > >>> help in advance.
>> > >> To be honest, I don't know. I couldn't reproduce it. Did you try
>> > >> installing with the binary installer? Does it behave the same?
>> > > I like to have different versions of Open MPI which I activate via
>> > > a batch file so that I can still run my program in an old version if
>> > > something goes wrong in a new one. I have no entries in the system
>> > > environment or registry so that I can even run different versions in
>> > > different command windows without problems (everything is only known
>> > > within the command window in which I have run my batch file). It seems
>> > > that you put something in the registry when I use your installer.
>> > > Perhaps you remember an earlier email where I had to uninstall an old
>> > > version because the environment in my own installation was wrong
>> > > as long as your installation was active. Nevertheless I can give it
>> > > a try. Perhaps I find out if you set more than just the path to your
>> > > binaries. Do you know if there is something similar to "truss" or
>> > > "strace" in the UNIX world so that I can see where the program hangs?
>> > > Thank you very much for your help in advance.
>> > >
>> > >
>> > > Kind regards
>> > >
>> > > Siegmar
>> > >
>> >
>> >
>> > --
>> > ---------------------------------------------------------------
>> > Shiqing Fan
>> > High Performance Computing Center Stuttgart (HLRS)
>> > Tel: ++49(0)711-685-87234 Nobelstrasse 19
>> > Fax: ++49(0)711-685-65832 70569 Stuttgart
>> > http://www.hlrs.de/organization/people/shiqing-fan/
>> > email: fan_at_[hidden]
>> >
>>
>>
>
>
>
>
> ------------------------------
>
> Message: 7
> Date: Thu, 06 Sep 2012 11:03:04 +0300
> From: Yevgeny Kliteynik <kliteyn_at_[hidden]>
> Subject: Re: [OMPI users] Infiniband performance Problem and stalling
> To: Randolph Pullen <randolph_pullen_at_[hidden]>, OpenMPI Users
> <users_at_[hidden]>
> Message-ID: <504858B8.3050202_at_[hidden]>
> Content-Type: text/plain; charset=UTF-8
>
> On 9/3/2012 4:14 AM, Randolph Pullen wrote:
>> No RoCE, Just native IB with TCP over the top.
>
> Sorry, I'm confused - still not clear what is "Melanox III HCA 10G card".
> Could you run "ibstat" and post the results?
>
> What is the expected BW on your cards?
> Could you run "ib_write_bw" between two machines?
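> For example, something like this (the host name is a placeholder):
>
>   node1$ ib_write_bw
>   node2$ ib_write_bw node1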
>
> Also, please see below.
>
>> No, I haven't used 1.6; I was trying to stick with the standards on the Mellanox disk.
>> Is there a known problem with 1.4.3?
>>
>> ------------------------------
>> *From:* Yevgeny Kliteynik <kliteyn_at_[hidden]>
>> *To:* Randolph Pullen <randolph_pullen_at_[hidden]>; Open MPI Users <users_at_[hidden]>
>> *Sent:* Sunday, 2 September 2012 10:54 PM
>> *Subject:* Re: [OMPI users] Infiniband performance Problem and stalling
>>
>> Randolph,
>>
>> Some clarification on the setup:
>>
>> "Melanox III HCA 10G cards" - are those ConnectX 3 cards configured to Ethernet?
>> That is, when you're using openib BTL, you mean RoCE, right?
>>
>> Also, have you had a chance to try some newer OMPI release?
>> Any 1.6.x would do.
>>
>>
>> -- YK
>>
>> On 8/31/2012 10:53 AM, Randolph Pullen wrote:
>> > (reposted with consolidated information)
>> > I have a test rig comprising 2 i7 systems 8GB RAM with Melanox III HCA 10G cards
>> > running Centos 5.7 Kernel 2.6.18-274
>> > Open MPI 1.4.3
>> > MLNX_OFED_LINUX-1.5.3-1.0.0.2 (OFED-1.5.3-1.0.0.2):
>> > On a Cisco 24 pt switch
>> > Normal performance is:
>> > $ mpirun --mca btl openib,self -n 2 -hostfile mpi.hosts PingPong
>> > results in:
>> > Max rate = 958.388867 MB/sec Min latency = 4.529953 usec
>> > and:
>> > $ mpirun --mca btl tcp,self -n 2 -hostfile mpi.hosts PingPong
>> > Max rate = 653.547293 MB/sec Min latency = 19.550323 usec
>> > NetPipeMPI results show a max of 7.4 Gb/s at 8388605 bytes which seems fine.
>> > log_num_mtt =20 and log_mtts_per_seg params =2
>> > My application exchanges about a gig of data between the processes with 2 sender and 2 consumer processes on each node with 1 additional controller process on the starting node.
>> > The program splits the data into 64K blocks and uses non blocking sends and receives with busy/sleep loops to monitor progress until completion.
>> > Each process owns a single buffer for these 64K blocks.
>> > My problem is that I see better performance under IPoIB than I do on native IB (RDMA_CM).
>> > My understanding is that IPoIB is limited to about 1G/s, so I am at a loss to know why it is faster.
>> > These 2 configurations are equivalent (about 8-10 seconds per cycle):
>> > mpirun --mca btl_openib_flags 2 --mca mpi_leave_pinned 1 --mca btl tcp,self -H vh2,vh1 -np 9 --bycore prog
>> > mpirun --mca btl_openib_flags 3 --mca mpi_leave_pinned 1 --mca btl tcp,self -H vh2,vh1 -np 9 --bycore prog
>
> When you say "--mca btl tcp,self", it means that openib btl is not enabled.
> Hence "--mca btl_openib_flags" is irrelevant.
>
>> > And this one produces similar run times but seems to degrade with repeated cycles:
>> > mpirun --mca btl_openib_eager_limit 64 --mca mpi_leave_pinned 1 --mca btl openib,self -H vh2,vh1 -np 9 --bycore prog
>
> You're running 9 ranks on two machines, but you're using IB for intra-node communication.
> Is it intentional? If not, you can add the "sm" btl to improve performance.
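> For instance, the same command line with sm added would look something like:
>
>   mpirun --mca btl openib,sm,self -H vh2,vh1 -np 9 --bycore prog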
>
> -- YK
>
>> > Other btl_openib_flags settings result in much lower performance.
>> > Changing the first of the above configs to use openib results in a 21-second run time at best. Sometimes it takes up to 5 minutes.
>> > In all cases, openib runs in twice the time it takes TCP, except if I push the small message max to 64K and force short messages. Then the openib times are the same as TCP and no faster.
>> > With openib:
>> > - Repeated cycles during a single run seem to slow down with each cycle
>> > (usually by about 10 seconds).
>> > - On occasions it seems to stall indefinitely, waiting on a single receive.
>> > I'm still at a loss as to why. I can't find any errors logged during the runs.
>> > Any ideas appreciated.
>> > Thanks in advance,
>> > Randolph
>> >
>> >
>> > _______________________________________________
>> > users mailing list
>> > users_at_[hidden] <mailto:users_at_[hidden]>
>> > http://www.open-mpi.org/mailman/listinfo.cgi/users
>>
>>
>>
>
>
>
> ------------------------------
>
> Message: 8
> Date: Thu, 6 Sep 2012 08:01:01 -0400
> From: Jeff Squyres <jsquyres_at_[hidden]>
> Subject: Re: [OMPI users] SIGSEGV in OMPI 1.6.x
> To: Open MPI Users <users_at_[hidden]>
> Message-ID: <256DA22F-F9AC-4746-ACD9-501F8208E718_at_[hidden]>
> Content-Type: text/plain; charset=us-ascii
>
> If you run into a segv in this code, it almost certainly means that you have heap corruption somewhere. FWIW, that has *always* been what it meant when I've run into segv's in any code under opal/mca/memory/linux/. Meaning: my user code did something wrong, it created heap corruption, and then later some malloc() or free() caused a segv in this area of the code.
>
> This code is the same ptmalloc memory allocator that has shipped in glibc for years. I'll be hard-pressed to say that any code is 100% bug free :-), but I'd be surprised if there is a bug in this particular chunk of code.
>
> I'd run your code through valgrind or some other memory-checking debugger and see if that can shed any light on what's going on.
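>
> For example, a typical invocation might look like this (the application name
> is a placeholder):
>
>   mpirun -np 4 valgrind --tool=memcheck --log-file=vg.%p.out ./your_app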
>
>
> On Sep 6, 2012, at 12:06 AM, Yong Qin wrote:
>
>> Hi,
>>
>> While debugging a mysterious crash of a code, I was able to trace down
>> to a SIGSEGV in OMPI 1.6 and 1.6.1. The offending code is in
>> opal/mca/memory/linux/malloc.c. Please see the following gdb log.
>>
>> (gdb) c
>> Continuing.
>>
>> Program received signal SIGSEGV, Segmentation fault.
>> opal_memory_ptmalloc2_int_free (av=0x2fd0637, mem=0x203a746f74512000)
>> at malloc.c:4385
>> 4385 nextsize = chunksize(nextchunk);
>> (gdb) l
>> 4380 Consolidate other non-mmapped chunks as they arrive.
>> 4381 */
>> 4382
>> 4383 else if (!chunk_is_mmapped(p)) {
>> 4384 nextchunk = chunk_at_offset(p, size);
>> 4385 nextsize = chunksize(nextchunk);
>> 4386 assert(nextsize > 0);
>> 4387
>> 4388 /* consolidate backward */
>> 4389 if (!prev_inuse(p)) {
>> (gdb) bt
>> #0 opal_memory_ptmalloc2_int_free (av=0x2fd0637,
>> mem=0x203a746f74512000) at malloc.c:4385
>> #1 0x00002ae6b18ea0c0 in opal_memory_ptmalloc2_free (mem=0x2fd0637)
>> at malloc.c:3511
>> #2 0x00002ae6b18ea736 in opal_memory_linux_free_hook
>> (__ptr=0x2fd0637, caller=0x203a746f74512000) at hooks.c:705
>> #3 0x0000000001412fcc in for_dealloc_allocatable ()
>> #4 0x00000000007767b1 in ALLOC::dealloc_d2 (array=@0x2fd0647,
>> name=@0x6f6e6f69006f6e78, routine=Cannot access memory at address 0x0
>> ) at alloc.F90:1357
>> #5 0x000000000082628c in M_LDAU::hubbard_term (scell=..., nua=@0xd5,
>> na=@0xd5, isa=..., xa=..., indxua=..., maxnh=@0xcf4ff, maxnd=@0xcf4ff,
>> lasto=..., iphorb=...,
>> numd=..., listdptr=..., listd=..., numh=..., listhptr=...,
>> listh=..., nspin=@0xcf4ff00000002, dscf=..., eldau=@0x0, deldau=@0x0,
>> fa=..., stress=..., h=...,
>> first=@0x0, last=@0x0) at ldau.F:752
>> #6 0x00000000006cd532 in M_SETUP_HAMILTONIAN::setup_hamiltonian
>> (first=@0x0, last=@0x0, iscf=@0x2) at setup_hamiltonian.F:199
>> #7 0x000000000070e257 in M_SIESTA_FORCES::siesta_forces
>> (istep=@0xf9a4d07000000000) at siesta_forces.F:90
>> #8 0x000000000070e475 in siesta () at siesta.F:23
>> #9 0x000000000045e47c in main ()
>>
>> Can anybody shed some light here on what could be wrong?
>>
>> Thanks,
>>
>> Yong Qin
>> _______________________________________________
>> users mailing list
>> users_at_[hidden]
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
>
> --
> Jeff Squyres
> jsquyres_at_[hidden]
> For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
>
>
>
>
> ------------------------------
>
> Message: 9
> Date: Thu, 6 Sep 2012 08:03:06 -0400
> From: Jeff Squyres <jsquyres_at_[hidden]>
> Subject: Re: [OMPI users] Regarding the Pthreads
> To: Open MPI Users <users_at_[hidden]>
> Message-ID: <7FD0702A-4A29-4FF6-A80A-170D2002F862_at_[hidden]>
> Content-Type: text/plain; charset=iso-8859-1
>
> Your question is somewhat outside the scope of this list. Perhaps people may chime in with some suggestions, but that's more of a threading question than an MPI question.
>
> Be warned that you need to call MPI_Init_thread (not MPI_Init) with MPI_THREAD_MULTIPLE in order to get true multi-threaded support in Open MPI. And we only support that on the TCP and shared memory transports if you built Open MPI with threading support enabled.
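>
> A minimal sketch of that initialization in C:
>
>   int provided;
>   MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
>   if (provided < MPI_THREAD_MULTIPLE) {
>       /* full thread support is unavailable; fall back or abort */
>   }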
>
>
> On Sep 5, 2012, at 2:23 PM, seshendra seshu wrote:
>
>> Hi,
>> I am learning pthreads and trying to implement them in my quicksort program.
>> My problem is that I am unable to understand how to implement pthreads on the data received at a node from the master. (In detail: in my program the master divides the data and sends it to the slaves, and each slave sorts the received data independently and sends it back to the master after the sorting is done. Now I am having trouble implementing pthreads at the slaves, i.e. how to implement pthreads in order to share the data among the cores in each slave, sort it, and send it back to the master.)
>> Could anyone help with this problem by providing some suggestions and clues?
>>
>> Thanking you very much.
>>
>> --
>> WITH REGARDS
>> M.L.N.Seshendra
>> _______________________________________________
>> users mailing list
>> users_at_[hidden]
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
>
> --
> Jeff Squyres
> jsquyres_at_[hidden]
> For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
>
>
>
>
> ------------------------------
>
> Message: 10
> Date: Thu, 6 Sep 2012 08:05:30 -0400
> From: Jeff Squyres <jsquyres_at_[hidden]>
> Subject: Re: [OMPI users] python-mrmpi() failed
> To: Open MPI Users <users_at_[hidden]>
> Message-ID: <E8AEFB84-8702-432C-9FB0-0C34451B02D8_at_[hidden]>
> Content-Type: text/plain; charset=us-ascii
>
> On Sep 4, 2012, at 3:09 PM, mariana Vargas wrote:
>
>> I'm new to this. I have some codes that use MPI for Python, and I
>> just installed openmpi, mrmpi, and mpi4py in my home directory (on a cluster
>> account) without apparent errors. I tried to perform this simple
>> test in Python and I get the following error related to Open MPI;
>> could you help me figure out what is going on? I attach as much
>> information as possible...
>
> I think I know what's happening here.
>
> It's a complicated linker issue that we've discussed before -- I'm not sure whether it was on this users list or the OMPI developers list.
>
> The short version is that you should remove your prior Open MPI installation, and then rebuild Open MPI with the --disable-dlopen configure switch. See if that fixes the problem.
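>
> For example (the prefix path is only illustrative):
>
>   ./configure --prefix=$HOME/openmpi-1.6 --disable-dlopen
>   make all install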
>
>> Thanks.
>>
>> Mariana
>>
>>
>> From a python console
>> >>> from mrmpi import mrmpi
>> >>> mr=mrmpi()
>> [ferrari:23417] mca: base: component_find: unable to open /home/
>> mvargas/lib/openmpi/mca_paffinity_hwloc: /home/mvargas/lib/openmpi/
>> mca_paffinity_hwloc.so: undefined symbol: opal_hwloc_topology (ignored)
>> [ferrari:23417] mca: base: component_find: unable to open /home/
>> mvargas/lib/openmpi/mca_carto_auto_detect: /home/mvargas/lib/openmpi/
>> mca_carto_auto_detect.so: undefined symbol:
>> opal_carto_base_graph_get_host_graph_fn (ignored)
>> [ferrari:23417] mca: base: component_find: unable to open /home/
>> mvargas/lib/openmpi/mca_carto_file: /home/mvargas/lib/openmpi/
>> mca_carto_file.so: undefined symbol:
>> opal_carto_base_graph_get_host_graph_fn (ignored)
>> [ferrari:23417] mca: base: component_find: unable to open /home/
>> mvargas/lib/openmpi/mca_shmem_mmap: /home/mvargas/lib/openmpi/
>> mca_shmem_mmap.so: undefined symbol: opal_show_help (ignored)
>> [ferrari:23417] mca: base: component_find: unable to open /home/
>> mvargas/lib/openmpi/mca_shmem_posix: /home/mvargas/lib/openmpi/
>> mca_shmem_posix.so: undefined symbol: opal_show_help (ignored)
>> [ferrari:23417] mca: base: component_find: unable to open /home/
>> mvargas/lib/openmpi/mca_shmem_sysv: /home/mvargas/lib/openmpi/
>> mca_shmem_sysv.so: undefined symbol: opal_show_help (ignored)
>> --------------------------------------------------------------------------
>> It looks like opal_init failed for some reason; your parallel process is
>> likely to abort. There are many reasons that a parallel process can
>> fail during opal_init; some of which are due to configuration or
>> environment problems. This failure appears to be an internal failure;
>> here's some additional information (which may only be relevant to an
>> Open MPI developer):
>>
>> opal_shmem_base_select failed
>> --> Returned value -1 instead of OPAL_SUCCESS
>> --------------------------------------------------------------------------
>> [ferrari:23417] [[INVALID],INVALID] ORTE_ERROR_LOG: Error in file
>> runtime/orte_init.c at line 79
>> --------------------------------------------------------------------------
>> It looks like MPI_INIT failed for some reason; your parallel process is
>> likely to abort. There are many reasons that a parallel process can
>> fail during MPI_INIT; some of which are due to configuration or
>> environment
>> problems. This failure appears to be an internal failure; here's some
>> additional information (which may only be relevant to an Open MPI
>> developer):
>>
>> ompi_mpi_init: orte_init failed
>> --> Returned "Error" (-1) instead of "Success" (0)
>> --------------------------------------------------------------------------
>> *** An error occurred in MPI_Init
>> *** on a NULL communicator
>> *** MPI_ERRORS_ARE_FATAL: your MPI job will now abort
>> [ferrari:23417] Local abort before MPI_INIT completed successfully;
>> not able to aggregate error messages, and not able to guarantee that
>> all other processes were killed!
>>
>>
>>
>> echo $PATH
>>
>> /home/mvargas/idl/pro/LibsSDSSS/idlutilsv5_4_15/bin:/usr/local/itt/
>> idl70/bin:/opt/local/bin:/home/mvargas/bin:/home/mvargas/lib:/home/
>> mvargas/lib/openmpi/:/home/mvargas:/home/vargas/bin/:/home/mvargas/idl/
>> pro/LibsSDSSS/idlutilsv5_4_15/bin:/usr/local/itt/idl70/bin:/opt/local/
>> bin:/home/mvargas/bin:/home/mvargas/lib:/home/mvargas/lib/openmpi/:/
>> home/mvargas:/home/vargas/bin/:/usr/lib64/qt3.3/bin:/usr/kerberos/bin:/
>> usr/local/bin:/bin:/usr/bin:/opt/pbs/bin:/opt/pbs/lib/xpbs/bin:/opt/
>> envswitcher/bin:/opt/pvm3/lib:/opt/pvm3/lib/LINUX64:/opt/pvm3/bin/
>> LINUX64:/opt/c3-4/
>>
>> echo $LD_LIBRARY_PATH
>> /usr/local/mpich2/lib:/home/mvargas/lib:/home/mvargas/:/home/mvargas/
>> lib64:/home/mvargas/lib/openmpi/:/usr/lib64/openmpi/1.4-gcc/lib/:/user/
>> local/:/usr/local/mpich2/lib:/home/mvargas/lib:/home/mvargas/:/home/
>> mvargas/lib64:/home/mvargas/lib/openmpi/:/usr/lib64/openmpi/1.4-gcc/
>> lib/:/user/local/:
>>
>> Version: openmpi-1.6
>>
>>
>>
>> mpirun --bynode --tag-output ompi_info -v ompi full --parsable
>> [1,0]<stdout>:package:Open MPI mvargas_at_ferrari Distribution
>> [1,0]<stdout>:ompi:version:full:1.6
>> [1,0]<stdout>:ompi:version:svn:r26429
>> [1,0]<stdout>:ompi:version:release_date:May 10, 2012
>> [1,0]<stdout>:orte:version:full:1.6
>> [1,0]<stdout>:orte:version:svn:r26429
>> [1,0]<stdout>:orte:version:release_date:May 10, 2012
>> [1,0]<stdout>:opal:version:full:1.6
>> [1,0]<stdout>:opal:version:svn:r26429
>> [1,0]<stdout>:opal:version:release_date:May 10, 2012
>> [1,0]<stdout>:mpi-api:version:full:2.1
>> [1,0]<stdout>:ident:1.6
>>
>>
>> eth0 Link encap:Ethernet HWaddr 00:30:48:95:99:CC
>> inet addr:192.168.2.1 Bcast:192.168.2.255 Mask:255.255.255.0
>> inet6 addr: fe80::230:48ff:fe95:99cc/64 Scope:Link
>> UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
>> RX packets:4739875255 errors:0 dropped:1636 overruns:0 frame:0
>> TX packets:5196871012 errors:0 dropped:0 overruns:0 carrier:0
>> collisions:0 txqueuelen:1000
>> RX bytes:4959384349297 (4.5 TiB) TX bytes:3933641883577 (3.5
>> TiB)
>> Memory:ef300000-ef320000
>>
>> eth1 Link encap:Ethernet HWaddr 00:30:48:95:99:CD
>> inet addr:128.2.116.104 Bcast:128.2.119.255 Mask:
>> 255.255.248.0
>> inet6 addr: fe80::230:48ff:fe95:99cd/64 Scope:Link
>> UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
>> RX packets:2645952109 errors:0 dropped:13353 overruns:0 frame:0
>> TX packets:2974763570 errors:0 dropped:0 overruns:0 carrier:0
>> collisions:0 txqueuelen:1000
>> RX bytes:2024044043824 (1.8 TiB) TX bytes:3390935387820 (3.0
>> TiB)
>> Memory:ef400000-ef420000
>>
>> lo Link encap:Local Loopback
>> inet addr:127.0.0.1 Mask:255.0.0.0
>> inet6 addr: ::1/128 Scope:Host
>> UP LOOPBACK RUNNING MTU:16436 Metric:1
>> RX packets:143359307 errors:0 dropped:0 overruns:0 frame:0
>> TX packets:143359307 errors:0 dropped:0 overruns:0 carrier:0
>> collisions:0 txqueuelen:0
>> RX bytes:80413513464 (74.8 GiB) TX bytes:80413513464 (74.8
>> GiB)
>>
>>
>>
>>
>> _______________________________________________
>> users mailing list
>> users_at_[hidden]
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>> <files.tar.gz>
>
>
> --
> Jeff Squyres
> jsquyres_at_[hidden]
> For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
>
>
>
>
> ------------------------------
>
> Message: 11
> Date: Thu, 6 Sep 2012 10:23:04 -0400
> From: Jeff Squyres <jsquyres_at_[hidden]>
> Subject: Re: [OMPI users] MPI_Cart_sub periods
> To: Open MPI Users <users_at_[hidden]>
> Message-ID: <346C2878-A5A6-4043-B890-09DAB68807F2_at_[hidden]>
> Content-Type: text/plain; charset=iso-8859-1
>
> John --
>
> This cartesian stuff always makes my head hurt. :-)
>
> You seem to have hit on a bona-fide bug. I have fixed the issue in our SVN trunk and will get the fix moved over to the v1.6 and v1.7 branches.
>
> Thanks for the report!
>
>
> On Aug 29, 2012, at 5:32 AM, Craske, John wrote:
>
>> Hello,
>>
>> We are partitioning a two-dimensional Cartesian communicator into
>> two one-dimensional subgroups. In this situation we have found
>> that both one-dimensional communicators inherit the period
>> logical of the first dimension of the original two-dimensional
>> communicator when using Open MPI. Using MPICH each
>> one-dimensional communicator inherits the period corresponding to
>> the dimensions specified in REMAIN_DIMS, as expected. Could this
>> be a bug, or are we making a mistake? The relevant calls we make in a
>> Fortran code are
>>
>> CALL MPI_CART_CREATE(MPI_COMM_WORLD, 2, (/ NDIMX, NDIMY /), (/ .True., .False. /), .TRUE.,
>> COMM_CART_2D, IERROR)
>>
>> CALL MPI_CART_SUB(COMM_CART_2D, (/ .True., .False. /), COMM_CART_X, IERROR)
>> CALL MPI_CART_SUB(COMM_CART_2D, (/ .False., .True. /), COMM_CART_Y, IERROR)
>>
>> Following these requests,
>>
>> CALL MPI_CART_GET(COMM_CART_X, MAXDIM_X, DIMS_X, PERIODS_X, COORDS_X, IERROR)
>> CALL MPI_CART_GET(COMM_CART_Y, MAXDIM_Y, DIMS_Y, PERIODS_Y, COORDS_Y, IERROR)
>>
>> will result in
>>
>> PERIODS_X = T
>> PERIODS_Y = T
>>
>> If, on the other hand we define the two-dimensional communicator
>> using PERIODS = (/ .False., .True. /), we find
>>
>> PERIODS_X = F
>> PERIODS_Y = F
>>
>> Your advice on the matter would be greatly appreciated.
>>
>> Regards,
>>
>> John.
>>
>> _______________________________________________
>> users mailing list
>> users_at_[hidden]
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
>
> --
> Jeff Squyres
> jsquyres_at_[hidden]
> For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
>
>
>
>
> ------------------------------
>
> Message: 12
> Date: Thu, 06 Sep 2012 16:58:03 +0200
> From: Shiqing Fan <fan_at_[hidden]>
> Subject: Re: [OMPI users] error compiling openmpi-1.6.1 on Windows 7
> To: Siegmar Gross <Siegmar.Gross_at_[hidden]>
> Cc: users_at_[hidden]
> Message-ID: <5048B9FB.3070408_at_[hidden]>
> Content-Type: text/plain; charset=ISO-8859-1; format=flowed
>
> Hi Siegmar,
>
> Glad to hear that it's working for you.
>
> The warning message is because the loopback adapter is excluded by
> default, but this adapter is actually not installed on Windows.
>
> One solution might be installing the loopback adapter on Windows. It's
> very easy and takes only a few minutes.
>
> Or it may be possible to suppress this message inside Open MPI, but
> I'm not sure how that can be done.
>
>
> Regards,
> Shiqing
>
>
> On 2012-09-06 7:48 AM, Siegmar Gross wrote:
>> Hi Shiqing,
>>
>> I have solved the problem with the double quotes in OPENMPI_HOME but
>> there is still something wrong.
>>
>> set OPENMPI_HOME="c:\Program Files (x86)\openmpi-1.6.1"
>>
>> mpicc init_finalize.c
>> Cannot open configuration file "c:\Program Files (x86)\openmpi-1.6.1"/share/openmpi\mpicc-wrapper-data.txt
>> Error parsing data file mpicc: Not found
>>
>>
>> Everything is OK if you remove the double quotes which Windows
>> automatically adds.
>>
>> set OPENMPI_HOME=c:\Program Files (x86)\openmpi-1.6.1
>>
>> mpicc init_finalize.c
>> Microsoft (R) 32-Bit C/C++-Optimierungscompiler Version 16.00.40219.01 für 80x86
>> ...
>>
>> mpiexec init_finalize.exe
>> --------------------------------------------------------------------------
>> WARNING: An invalid value was given for btl_tcp_if_exclude. This
>> value will be ignored.
>>
>> Local host: hermes
>> Value: 127.0.0.1/8
>> Message: Did not find interface matching this subnet
>> --------------------------------------------------------------------------
>>
>> Hello!
>>
>>
>> I get the output from my program but also a warning from Open MPI.
>> The new value for the loopback device was introduced a short time
>> ago when I had problems with the loopback device on Solaris
>> (it used "lo0" instead of your default "lo"). How can I avoid this
>> message? The 64-bit version of my program still hangs.
>>
>>
>> Kind regards
>>
>> Siegmar
>>
>>
>>>> Could you try set OPENMPI_HOME env var to the root of the Open MPI dir?
>>>> This env is a backup option for the registry.
>>> It solves one problem but there is a new problem now :-((
>>>
>>>
>>> Without OPENMPI_HOME: Wrong pathname to help files.
>>>
>>> D:\...\prog\mpi\small_prog>mpiexec init_finalize.exe
>>> --------------------------------------------------------------------------
>>> Sorry! You were supposed to get help about:
>>> invalid if_inexclude
>>> But I couldn't open the help file:
>>> D:\...\prog\mpi\small_prog\..\share\openmpi\help-mpi-btl-tcp.txt:
>>> No such file or directory. Sorry!
>>> --------------------------------------------------------------------------
>>> ...
>>>
>>>
>>>
>>> With OPENMPI_HOME: It nearly uses the correct directory. Unfortunately
>>> the pathname contains the character " in the wrong place so that it
>>> couldn't find the available help file.
>>>
>>> set OPENMPI_HOME="c:\Program Files (x86)\openmpi-1.6.1"
>>>
>>> D:\...\prog\mpi\small_prog>mpiexec init_finalize.exe
>>> --------------------------------------------------------------------------
>>> Sorry! You were supposed to get help about:
>>> no-hostfile
>>> But I couldn't open the help file:
>>> "c:\Program Files (x86)\openmpi-1.6.1"\share\openmpi\help-hostfile.txt: Invalid argument. Sorry
>>> !
>>> --------------------------------------------------------------------------
>>> [hermes:04964] [[12187,0],0] ORTE_ERROR_LOG: Not found in file ..\..\openmpi-1.6.1\orte\mca\ras\base
>>> \ras_base_allocate.c at line 200
>>> [hermes:04964] [[12187,0],0] ORTE_ERROR_LOG: Not found in file ..\..\openmpi-1.6.1\orte\mca\plm\base
>>> \plm_base_launch_support.c at line 99
>>> [hermes:04964] [[12187,0],0] ORTE_ERROR_LOG: Not found in file ..\..\openmpi-1.6.1\orte\mca\plm\proc
>>> ess\plm_process_module.c at line 996
>>>
>>>
>>>
>>> It looks like the environment variable can also solve my
>>> problem in the 64-bit environment.
>>>
>>> D:\g...\prog\mpi\small_prog>mpicc init_finalize.c
>>>
>>> Microsoft (R) C/C++-Optimierungscompiler Version 16.00.40219.01 für x64
>>> ...
>>>
>>>
>>> The process hangs without OPENMPI_HOME.
>>>
>>> D:\...\prog\mpi\small_prog>mpiexec init_finalize.exe
>>> ^C
>>>
>>>
>>> With OPENMPI_HOME:
>>>
>>> set OPENMPI_HOME="c:\Program Files\openmpi-1.6.1"
>>>
>>> D:\...\prog\mpi\small_prog>mpiexec init_finalize.exe
>>> --------------------------------------------------------------------------
>>> Sorry! You were supposed to get help about:
>>> no-hostfile
>>> But I couldn't open the help file:
>>> "c:\Program Files\openmpi-1.6.1"\share\openmpi\help-hostfile.txt: Invalid argument. S
>>> orry!
>>> --------------------------------------------------------------------------
>>> [hermes:05248] [[10367,0],0] ORTE_ERROR_LOG: Not found in file ..\..\openmpi-1.6.1\orte\mc
>>> a\ras\base\ras_base_allocate.c at line 200
>>> [hermes:05248] [[10367,0],0] ORTE_ERROR_LOG: Not found in file ..\..\openmpi-1.6.1\orte\mc
>>> a\plm\base\plm_base_launch_support.c at line 99
>>> [hermes:05248] [[10367,0],0] ORTE_ERROR_LOG: Not found in file ..\..\openmpi-1.6.1\orte\mc
>>> a\plm\process\plm_process_module.c at line 996
>>>
>>>
>>> At least the program doesn't block any longer. Do you have any ideas
>>> how this new problem can be solved?
>>>
>>>
>>> Kind regards
>>>
>>> Siegmar
>>>
>>>
>>>
>>>> On 2012-09-05 1:02 PM, Siegmar Gross wrote:
>>>>> Hi Shiqing,
>>>>>
>>>>>>>> D:\...\prog\mpi\small_prog>mpiexec init_finalize.exe
>>>>>>>> ---------------------------------------------------------------------
>>>>>>>> Sorry! You were supposed to get help about:
>>>>>>>> invalid if_inexclude
>>>>>>>> But I couldn't open the help file:
>>>>>>>> D:\...\prog\mpi\small_prog\..\share\openmpi\help-mpi-btl-tcp.txt:
>>>>>>>> No such file or directory. Sorry!
>>>>>>>> ---------------------------------------------------------------------
>>>>>>> ...
>>>>>>>> Why does "mpiexec" look for the help file relative to my current
>>>>>>>> program and not relative to itself? The file is part of the
>>>>>>>> package.
>>>>>>> Do you know how I can solve this problem?
>>>>>> I have a similar issue with a message from tcp, but it's not about
>>>>>> finding the file; it's something else, and it doesn't affect the execution
>>>>>> of the application. Could you make sure help-mpi-btl-tcp.txt is actually
>>>>>> in the path D:\...\prog\mpi\small_prog\..\share\openmpi\?
>>>>> That wouldn't be a good idea because I have MPI programs in different
>>>>> directories so that I would have to install all help files in several
>>>>> places (<my_directory>/../share/openmpi/help*.txt). All help files are
>>>>> available in the installation directory of Open MPI.
>>>>>
>>>>> dir "c:\Program Files (x86)\openmpi-1.6.1\bin\mpiexec.exe"
>>>>> ...
>>>>> 29.08.2012 10:59 38.912 mpiexec.exe
>>>>> ...
>>>>> dir "c:\Program Files (x86)\openmpi-1.6.1\bin\..\share\openmpi\help-mpi-btl-tcp.txt"
>>>>> ...
>>>>> 03.04.2012 16:30 631 help-mpi-btl-tcp.txt
>>>>> ...
>>>>>
>>>>> I don't know if "mpiexec" or my program "init_finalize" is responsible
>>>>> for the error message but whoever is responsible shouldn't use the path
>>>>> to my program but the prefix_dir from MPI to find the help files. Perhaps
>>>>> you can change the behaviour in the Open MPI source code.
>>>>>
>>>>>
>>>>>>>> I can also compile in 64-bit mode but the program hangs.
>>>>>>> Do you have any ideas why the program hangs? Thank you very much for any
>>>>>>> help in advance.
>>>>>> To be honest, I don't know. I couldn't reproduce it. Did you try
>>>>>> installing with the binary installer? Does it behave the same?
>>>>> I like to have different versions of Open MPI which I activate via
>>>>> a batch file so that I can still run my program in an old version if
>>>>> something goes wrong in a new one. I have no entries in the system
>>>>> environment or registry so that I can even run different versions in
>>>>> different command windows without problems (everything is only known
>>>>> within the command window in which I have run my batch file). It seems
>>>>> that you put something in the registry when I use your installer.
>>>>> Perhaps you remember an earlier email where I had to uninstall an old
>>>>> version because the environment in my own installation was wrong
>>>>> as long as your installation was active. Nevertheless I can give it
>>>>> a try. Perhaps I find out if you set more than just the path to your
>>>>> binaries. Do you know if there is something similar to "truss" or
>>>>> "strace" in the UNIX world so that I can see where the program hangs?
>>>>> Thank you very much for your help in advance.
>>>>>
>>>>>
>>>>> Kind regards
>>>>>
>>>>> Siegmar
>>>>>
>>>>
>>>> --
>>>> ---------------------------------------------------------------
>>>> Shiqing Fan
>>>> High Performance Computing Center Stuttgart (HLRS)
>>>> Tel: ++49(0)711-685-87234 Nobelstrasse 19
>>>> Fax: ++49(0)711-685-65832 70569 Stuttgart
>>>> http://www.hlrs.de/organization/people/shiqing-fan/
>>>> email: fan_at_[hidden]
>>>>
>>>
>>
>
>
> --
> ---------------------------------------------------------------
> Shiqing Fan
> High Performance Computing Center Stuttgart (HLRS)
> Tel: ++49(0)711-685-87234 Nobelstrasse 19
> Fax: ++49(0)711-685-65832 70569 Stuttgart
> http://www.hlrs.de/organization/people/shiqing-fan/
> email: fan_at_[hidden]
>
>
>
> ------------------------------
>
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
> End of users Digest, Vol 2345, Issue 1
> **************************************