Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] openmpi with xgrid
From: Klymak Jody (jklymak_at_[hidden])
Date: 2009-08-15 10:05:24


On 15-Aug-09, at 1:03 AM, Alan wrote:

> Thanks Warner,
>
> This is frustrating... I read the ticket. Six months already, and the
> fix has slipped through two releases... Frankly, I am very skeptical
> that this will be fixed in 1.3.4. I really hope so, but when will
> 1.3.4 be released?
>
> I have to decide between going with 1.2.x, with possible disruptions
> to my configuration (I use Fink), or waiting.
>
> And I have offered to test any nightly snapshot that claims to fix
> this bug.

Hi Alan,

It's not too hard to get PBS/Torque up and running.

The OS X-specific (?) issues I had (a rough sketch of these settings
follows the list):

1) /etc/hosts had to have the server explicitly listed on each of the
nodes.
2) $usecp had to be set in the MOM config (mom_priv/config).
3) $restricted had to be set on the nodes in mom_priv/config to accept
calls from the server.
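
For what it's worth, the settings end up looking something like this
(hostnames, addresses, and paths below are just placeholders; substitute
your own server and nodes):

  # /etc/hosts on every node: list the PBS server explicitly
  192.168.1.10    pbsmaster.local pbsmaster

  # mom_priv/config on every node
  $pbsserver   pbsmaster.local      # the usual pointer at the server
  $restricted  pbsmaster.local      # accept connections from the server
  # copy output files locally instead of going through rcp/scp
  $usecp       *.local:/Users /Users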

I think that is it. Of course if you are already using xgrid on these
machines for other uses it won't play well with PBS, but otherwise all
you are missing is the cute tachometer display.
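
In the meantime, if you just need 1.3.3 to run at all, you may be able
to steer mpirun away from the broken Xgrid launcher by picking another
launch component explicitly. I haven't tried this with the Fink build
myself, but something along these lines might do it ("mpiapp" standing
in for your own hello-world binary, as in your example below):

  # see which process-launch (plm) components were built
  /sw/bin/ompi_info | grep " plm"

  # force the rsh/ssh launcher instead of xgrid
  /sw/bin/om-mpirun --mca plm rsh -np 2 mpiapp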

Cheers, Jody

> Cheers,
> Alan
>
> On Fri, Aug 14, 2009 at 17:20, Warner Yuen <wyuen_at_[hidden]> wrote:
> Hi Alan,
>
> Xgrid support is currently broken in the latest version of Open MPI;
> see the ticket below. However, I believe that Xgrid still works with
> one of the earlier 1.2 releases. I don't recall for sure, but I think
> it's Open MPI 1.2.3.
>
> #1777: Xgrid support is broken in the v1.3 series
> ---------------------+------------------------------------------------------
>  Reporter: jsquyres   |       Owner: brbarret
>      Type: defect     |      Status: accepted
>  Priority: major      |   Milestone: Open MPI 1.3.4
>   Version: trunk      |  Resolution:
>  Keywords:            |
> ---------------------+------------------------------------------------------
> Changes (by bbenton):
>
> * milestone: Open MPI 1.3.3 => Open MPI 1.3.4
>
>
> Warner Yuen
> Scientific Computing
> Consulting Engineer
> Apple, Inc.
> email: wyuen_at_[hidden]
> Tel: 408.718.2859
>
>
>
>
> On Aug 14, 2009, at 6:21 AM, users-request_at_[hidden] wrote:
>
>
> Message: 1
> Date: Fri, 14 Aug 2009 14:21:30 +0100
> From: Alan <alanwilter_at_[hidden]>
> Subject: [OMPI users] openmpi with xgrid
> To: users_at_[hidden]
> Message-ID:
> <cf58c8d00908140621v18d384f2wef97ee80ca3ded0c_at_[hidden]>
> Content-Type: text/plain; charset="utf-8"
>
>
> Hi there,
> I saw http://www.open-mpi.org/community/lists/users/2007/08/3900.php.
>
> I use Fink, so I changed the openmpi.info file in order to build
> openmpi with Xgrid support.
>
> As you can see:
> amadeus[2081]:~/Downloads% /sw/bin/ompi_info
> Package: Open MPI root_at_amadeus.local Distribution
> Open MPI: 1.3.3
> Open MPI SVN revision: r21666
> Open MPI release date: Jul 14, 2009
> Open RTE: 1.3.3
> Open RTE SVN revision: r21666
> Open RTE release date: Jul 14, 2009
> OPAL: 1.3.3
> OPAL SVN revision: r21666
> OPAL release date: Jul 14, 2009
> Ident string: 1.3.3
> Prefix: /sw
> Configured architecture: x86_64-apple-darwin9
> Configure host: amadeus.local
> Configured by: root
> Configured on: Fri Aug 14 12:58:12 BST 2009
> Configure host: amadeus.local
> Built by:
> Built on: Fri Aug 14 13:07:46 BST 2009
> Built host: amadeus.local
> C bindings: yes
> C++ bindings: yes
> Fortran77 bindings: yes (single underscore)
> Fortran90 bindings: yes
> Fortran90 bindings size: small
> C compiler: gcc
> C compiler absolute: /sw/var/lib/fink/path-prefix-10.6/gcc
> C++ compiler: g++
> C++ compiler absolute: /sw/var/lib/fink/path-prefix-10.6/g++
> Fortran77 compiler: gfortran
> Fortran77 compiler abs: /sw/bin/gfortran
> Fortran90 compiler: gfortran
> Fortran90 compiler abs: /sw/bin/gfortran
> C profiling: yes
> C++ profiling: yes
> Fortran77 profiling: yes
> Fortran90 profiling: yes
> C++ exceptions: no
> Thread support: posix (mpi: no, progress: no)
> Sparse Groups: no
> Internal debug support: no
> MPI parameter check: runtime
> Memory profiling support: no
> Memory debugging support: no
> libltdl support: yes
> Heterogeneous support: no
> mpirun default --prefix: no
> MPI I/O support: yes
> MPI_WTIME support: gettimeofday
> Symbol visibility support: yes
> FT Checkpoint support: no (checkpoint thread: no)
> MCA backtrace: execinfo (MCA v2.0, API v2.0, Component v1.3.3)
> MCA paffinity: darwin (MCA v2.0, API v2.0, Component v1.3.3)
> MCA carto: auto_detect (MCA v2.0, API v2.0, Component v1.3.3)
> MCA carto: file (MCA v2.0, API v2.0, Component v1.3.3)
> MCA maffinity: first_use (MCA v2.0, API v2.0, Component v1.3.3)
> MCA timer: darwin (MCA v2.0, API v2.0, Component v1.3.3)
> MCA installdirs: env (MCA v2.0, API v2.0, Component v1.3.3)
> MCA installdirs: config (MCA v2.0, API v2.0, Component v1.3.3)
> MCA dpm: orte (MCA v2.0, API v2.0, Component v1.3.3)
> MCA pubsub: orte (MCA v2.0, API v2.0, Component v1.3.3)
> MCA allocator: basic (MCA v2.0, API v2.0, Component v1.3.3)
> MCA allocator: bucket (MCA v2.0, API v2.0, Component v1.3.3)
> MCA coll: basic (MCA v2.0, API v2.0, Component v1.3.3)
> MCA coll: hierarch (MCA v2.0, API v2.0, Component v1.3.3)
> MCA coll: inter (MCA v2.0, API v2.0, Component v1.3.3)
> MCA coll: self (MCA v2.0, API v2.0, Component v1.3.3)
> MCA coll: sm (MCA v2.0, API v2.0, Component v1.3.3)
> MCA coll: sync (MCA v2.0, API v2.0, Component v1.3.3)
> MCA coll: tuned (MCA v2.0, API v2.0, Component v1.3.3)
> MCA io: romio (MCA v2.0, API v2.0, Component v1.3.3)
> MCA mpool: fake (MCA v2.0, API v2.0, Component v1.3.3)
> MCA mpool: rdma (MCA v2.0, API v2.0, Component v1.3.3)
> MCA mpool: sm (MCA v2.0, API v2.0, Component v1.3.3)
> MCA pml: cm (MCA v2.0, API v2.0, Component v1.3.3)
> MCA pml: csum (MCA v2.0, API v2.0, Component v1.3.3)
> MCA pml: ob1 (MCA v2.0, API v2.0, Component v1.3.3)
> MCA pml: v (MCA v2.0, API v2.0, Component v1.3.3)
> MCA bml: r2 (MCA v2.0, API v2.0, Component v1.3.3)
> MCA rcache: vma (MCA v2.0, API v2.0, Component v1.3.3)
> MCA btl: self (MCA v2.0, API v2.0, Component v1.3.3)
> MCA btl: sm (MCA v2.0, API v2.0, Component v1.3.3)
> MCA btl: tcp (MCA v2.0, API v2.0, Component v1.3.3)
> MCA topo: unity (MCA v2.0, API v2.0, Component v1.3.3)
> MCA osc: pt2pt (MCA v2.0, API v2.0, Component v1.3.3)
> MCA osc: rdma (MCA v2.0, API v2.0, Component v1.3.3)
> MCA iof: hnp (MCA v2.0, API v2.0, Component v1.3.3)
> MCA iof: orted (MCA v2.0, API v2.0, Component v1.3.3)
> MCA iof: tool (MCA v2.0, API v2.0, Component v1.3.3)
> MCA oob: tcp (MCA v2.0, API v2.0, Component v1.3.3)
> MCA odls: default (MCA v2.0, API v2.0, Component v1.3.3)
> MCA ras: slurm (MCA v2.0, API v2.0, Component v1.3.3)
> MCA rmaps: rank_file (MCA v2.0, API v2.0, Component v1.3.3)
> MCA rmaps: round_robin (MCA v2.0, API v2.0, Component v1.3.3)
> MCA rmaps: seq (MCA v2.0, API v2.0, Component v1.3.3)
> MCA rml: oob (MCA v2.0, API v2.0, Component v1.3.3)
> MCA routed: binomial (MCA v2.0, API v2.0, Component v1.3.3)
> MCA routed: direct (MCA v2.0, API v2.0, Component v1.3.3)
> MCA routed: linear (MCA v2.0, API v2.0, Component v1.3.3)
> MCA plm: rsh (MCA v2.0, API v2.0, Component v1.3.3)
> MCA plm: slurm (MCA v2.0, API v2.0, Component v1.3.3)
> MCA plm: xgrid (MCA v2.0, API v2.0, Component v1.3.3)
> MCA filem: rsh (MCA v2.0, API v2.0, Component v1.3.3)
> MCA errmgr: default (MCA v2.0, API v2.0, Component v1.3.3)
> MCA ess: env (MCA v2.0, API v2.0, Component v1.3.3)
> MCA ess: hnp (MCA v2.0, API v2.0, Component v1.3.3)
> MCA ess: singleton (MCA v2.0, API v2.0, Component v1.3.3)
> MCA ess: slurm (MCA v2.0, API v2.0, Component v1.3.3)
> MCA ess: tool (MCA v2.0, API v2.0, Component v1.3.3)
> MCA grpcomm: bad (MCA v2.0, API v2.0, Component v1.3.3)
> MCA grpcomm: basic (MCA v2.0, API v2.0, Component v1.3.3)
>
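(As an aside: rather than reading the whole listing, a quick way to
confirm the Xgrid launcher component really got built is to grep for it;
it should print the "MCA plm: xgrid" line that appears above.)

  /sw/bin/ompi_info | grep -i xgrid
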
> All seemed fine. I also have the Xgrid controller and agent running on
> my laptop, but then when I tried:
>
> /sw/bin/om-mpirun -c 2 mpiapp # hello world example for mpi
> [amadeus.local:40293] [[804,0],0] ORTE_ERROR_LOG: Unknown error: 1 in file src/plm_xgrid_module.m at line 119
> [amadeus.local:40293] [[804,0],0] ORTE_ERROR_LOG: Unknown error: 1 in file src/plm_xgrid_module.m at line 153
> --------------------------------------------------------------------------
> om-mpirun was unable to start the specified application as it encountered an error.
> More information may be available above.
> --------------------------------------------------------------------------
> 2009-08-14 14:16:19.715 om-mpirun[40293:10b] *** Terminating app due to uncaught exception 'NSInvalidArgumentException', reason: '*** -[NSKVONotifying_XGConnection<0x1001164b0> finalize]: called when collecting not enabled'
> 2009-08-14 14:16:19.716 om-mpirun[40293:10b] Stack: (
> 140735390096156,
> 140735366109391,
> 140735390122388,
> 4295943988,
> 4295939168,
> 4295171139,
> 4295883300,
> 4295025321,
> 4294973498,
> 4295401605,
> 4295345774,
> 4295056598,
> 4295116412,
> 4295119970,
> 4295401605,
> 4294972881,
> 4295401605,
> 4295345774,
> 4295056598,
> 4295172615,
> 4295938185,
> 4294971936,
> 4294969401,
> 4294969340
> )
> terminate called after throwing an instance of 'NSException'
> [amadeus:40293] *** Process received signal ***
> [amadeus:40293] Signal: Abort trap (6)
> [amadeus:40293] Signal code: (0)
> [amadeus:40293] [ 0] 2 libSystem.B.dylib 0x00000000831443fa _sigtramp + 26
> [amadeus:40293] [ 1] 3 ??? 0x000000005fbfb1e8 0x0 + 1606398440
> [amadeus:40293] [ 2] 4 libstdc++.6.dylib 0x00000000827f2085 _ZN9__gnu_cxx27__verbose_terminate_handlerEv + 377
> [amadeus:40293] [ 3] 5 libobjc.A.dylib 0x0000000081811adf objc_end_catch + 280
> [amadeus:40293] [ 4] 6 libstdc++.6.dylib 0x00000000827f0425 __gxx_personality_v0 + 1259
> [amadeus:40293] [ 5] 7 libstdc++.6.dylib 0x00000000827f045b _ZSt9terminatev + 19
> [amadeus:40293] [ 6] 8 libstdc++.6.dylib 0x00000000827f054c __cxa_rethrow + 0
> [amadeus:40293] [ 7] 9 libobjc.A.dylib 0x0000000081811966 objc_exception_rethrow + 0
> [amadeus:40293] [ 8] 10 CoreFoundation 0x0000000082ef8194 _CF_forwarding_prep_0 + 5700
> [amadeus:40293] [ 9] 11 mca_plm_xgrid.so 0x00000000000ee734 orte_plm_xgrid_finalize + 4884
> [amadeus:40293] [10] 12 mca_plm_xgrid.so 0x00000000000ed460 orte_plm_xgrid_finalize + 64
> [amadeus:40293] [11] 13 libopen-rte.0.dylib 0x0000000000031c43 orte_plm_base_close + 195
> [amadeus:40293] [12] 14 mca_ess_hnp.so 0x00000000000dfa24 0x0 + 916004
> [amadeus:40293] [13] 15 libopen-rte.0.dylib 0x000000000000e2a9 orte_finalize + 89
> [amadeus:40293] [14] 16 om-mpirun 0x000000000000183a start + 4210
> [amadeus:40293] [15] 17 libopen-pal.0.dylib 0x000000000006a085 opal_event_add_i + 1781
> [amadeus:40293] [16] 18 libopen-pal.0.dylib 0x000000000005c66e opal_progress + 142
> [amadeus:40293] [17] 19 libopen-rte.0.dylib 0x0000000000015cd6 orte_trigger_event + 70
> [amadeus:40293] [18] 20 libopen-rte.0.dylib 0x000000000002467c orte_daemon_recv + 4332
> [amadeus:40293] [19] 21 libopen-rte.0.dylib 0x0000000000025462 orte_daemon_cmd_processor + 722
> [amadeus:40293] [20] 22 libopen-pal.0.dylib 0x000000000006a085 opal_event_add_i + 1781
> [amadeus:40293] [21] 23 om-mpirun 0x00000000000015d1 start + 3593
> [amadeus:40293] [22] 24 libopen-pal.0.dylib 0x000000000006a085 opal_event_add_i + 1781
> [amadeus:40293] [23] 25 libopen-pal.0.dylib 0x000000000005c66e opal_progress + 142
> [amadeus:40293] [24] 26 libopen-rte.0.dylib 0x0000000000015cd6 orte_trigger_event + 70
> [amadeus:40293] [25] 27 libopen-rte.0.dylib 0x0000000000032207 orte_plm_base_launch_failed + 135
> [amadeus:40293] [26] 28 mca_plm_xgrid.so 0x00000000000ed089 orte_plm_xgrid_spawn + 89
> [amadeus:40293] [27] 29 om-mpirun 0x0000000000001220 start + 2648
> [amadeus:40293] [28] 30 om-mpirun 0x0000000000000839 start + 113
> [amadeus:40293] [29] 31 om-mpirun 0x00000000000007fc start + 52
> [amadeus:40293] *** End of error message ***
> [1] 40293 abort /sw/bin/om-mpirun -c 2 mpiapp
>
>
> Is there anyone using Open MPI with Xgrid successfully who would be
> willing to share their experience? I am not new to Xgrid or MPI, but
> with the two combined I must say that I am in uncharted waters.
>
> Any help would be much appreciated.
>
> Many thanks in advance,
> Alan
> --
> Alan Wilter S. da Silva, D.Sc. - CCPN Research Associate
> Department of Biochemistry, University of Cambridge.
> 80 Tennis Court Road, Cambridge CB2 1GA, UK.
> http://www.bio.cam.ac.uk/~awd28<<
>
>
>
>
> --
> Alan Wilter S. da Silva, D.Sc. - CCPN Research Associate
> Department of Biochemistry, University of Cambridge.
> 80 Tennis Court Road, Cambridge CB2 1GA, UK.
> >>http://www.bio.cam.ac.uk/~awd28<<
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users