Open MPI User's Mailing List Archives

From: Ralph H Castain (rhc_at_[hidden])
Date: 2006-10-31 08:53:53


Aha! Thanks for your detailed information - that helps identify the problem.

See some thoughts below.
Ralph

On 10/31/06 3:49 AM, "hpetit_at_[hidden]" <hpetit_at_[hidden]> wrote:

> Thank you for your quick reply Ralph,
>
> As far as I know, the NODES environment variable is created when a job is
> submitted to the bjs scheduler.
> The only way I know of (but I am a bproc newbie) is to use the bjssub command.

That is correct. However, Open MPI requires that ALL of the nodes you are
going to use must be allocated in advance. In other words, you have to get
an allocation large enough to run your entire job - both the initial
application and anything you comm_spawn.

I wish I could help you with the proper bjs commands to get an allocation,
but I am not familiar with bjs and (even after multiple Google searches)
cannot find any documentation on that code. Try doing a "bjs --help" and see
what it says.
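For example - and treat this purely as a sketch, since I'm not familiar with
bjs and the flag names here are an assumption on my part - you would want
something along these lines, requesting an allocation of two nodes so it
covers both the initial process and the one you comm_spawn:

```shell
# Hypothetical sketch - the -n (node count) flag is an assumption;
# verify the exact option names with "bjssub --help" or "bjs --help".
# Request 2 nodes so the allocation covers both the initial app and
# the process it will comm_spawn, then run mpirun inside it.
bjssub -n 2 -i mpirun -np 1 main_exe
```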

>
> Then, I retried my test with the following command: "bjssub -i
> mpirun -np 1 main_exe".
>

<snip>
>
> I guess this problem comes from the way I set the parameters for the spawned
> program. Instead of giving instructions to spawn the program on a specific
> host, I should set parameters to spawn the program on a specific node.
> But I do not know how to do it.
>

What you did was fine. "host" is the correct field to set. I suspect two
possible issues:

1. The specified host may not be in the allocation. In the case you showed
here, I would expect it to be, since you specified the same host we are
already on. However, you might try running mpirun with the "--nolocal"
option - this forces mpirun to launch the processes on a machine other
than the one you are on. (Typically you are on the head node; on many bproc
machines, that node is not included in an allocation because the system
admins don't want MPI jobs running on it.)
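Concretely - a sketch based on the command you already used, just adding the
option:

```shell
# Same job as before, but --nolocal tells mpirun to place all
# application processes on allocated compute nodes rather than the
# node mpirun itself is running on (typically the head node).
bjssub -i mpirun --nolocal -np 1 main_exe
```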

2. We may have something wrong in our code for this case. I'm not sure how
well that has been tested, especially in the 1.1 code branch.
 
> Then, I have a bunch of questions:
> - when MPI is used together with bproc, is it necessary to use bjssub or bjs
> in general?

You have to use some kind of resource manager to obtain a node allocation
for your use. At our site, we use LSF - other people use bjs. Anything that
sets the NODES variable is fine.

> - I was wondering if I have to submit the spawned program to bjs? i.e., do I
> have to add 'bjssub' to the commands parameter of the MPI_Comm_spawn_multiple
> call?

You shouldn't have to do so. I suspect, however, that bjssub is not getting
a large enough allocation for your combined mpirun + spawned job. I'm not
familiar enough with bjs to know for certain.
>
> As you can see, I am still not able to spawn a program and need some more
> help.
> Do you have some examples describing how to do it?

Unfortunately, not in the 1.1 branch, nor do I have one for
comm_spawn_multiple that uses the "host" field. I can try to concoct
something over the next few days, though, and verify that our code is
working correctly.

>
> Regards.
>
> Herve
>
> Date: Mon, 30 Oct 2006 09:00:47 -0700
> From: Ralph H Castain <rhc_at_[hidden]>
> Subject: Re: [OMPI users] MPI_Comm_spawn multiple bproc support
> problem
> To: "Open MPI Users <users_at_[hidden]>" <users_at_[hidden]>
> Message-ID: <C16B6FBF.570D%rhc_at_[hidden]>
> Content-Type: text/plain; charset="ISO-8859-1"
>
> On 1.1.2, what that error is telling you is that it didn't find any nodes in
> the environment. The bproc allocator looks for an environmental variable
> NODES that contains a list of nodes assigned to you. This error indicates it
> didn't find anything.
>
> Did you get an allocation prior to running the job? Could you check to see
> if NODES appears in your environment?
>
> Ralph
>
>
>
> On 10/30/06 8:47 AM, "hpetit_at_[hidden]" <hpetit_at_[hidden]> wrote:
>
>> Hi,
>> I have a problem using MPI_Comm_spawn_multiple together with bproc.
>>
>> I want to use the MPI_Comm_spawn_multiple call to spawn a set of
>> executables, but in a bproc environment, the program crashes or hangs on
>> this call (depending on the Open MPI release used).
>>
>> I have created one test program that spawns another program on the same
>> host (cf. code listings at the end of this mail).
>>
>> * With Open MPI 1.1.2, the program crashes on the MPI_Comm_spawn_multiple
>> call:
>> <--------------------------------->
>> [myhost:17061] [0,0,0] ORTE_ERROR_LOG: Not available in file ras_bjs.c at line 253
>> main_exe: Begining of main_exe
>> main_exe: Call MPI_Init
>> main_exe: Call MPI_Comm_spawn_multiple()
>> [myhost:17061] [0,0,0] ORTE_ERROR_LOG: Not available in file ras_bjs.c at line 253
>> Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR)
>> Failing at addr:(nil)
>> [0] func:/usr/local/Mpi/openmpi-1.1.2/lib/libopal.so.0 [0xb7f70ccf]
>> [1] func:[0xffffe440]
>> [2] func:/usr/local/Mpi/openmpi-1.1.2/lib/liborte.so.0(orte_schema_base_get_node_tokens+0x7f) [0xb7fdc41f]
>> [3] func:/usr/local/Mpi/openmpi-1.1.2/lib/liborte.so.0(orte_ras_base_node_assign+0x20b) [0xb7fd230b]
>> [4] func:/usr/local/Mpi/openmpi-1.1.2/lib/liborte.so.0(orte_ras_base_allocate_nodes+0x41) [0xb7fd0371]
>> [5] func:/usr/local/Mpi/openmpi-1.1.2/lib/openmpi/mca_ras_hostfile.so [0xb7538ba8]
>> [6] func:/usr/local/Mpi/openmpi-1.1.2/lib/liborte.so.0(orte_ras_base_allocate+0xd0) [0xb7fd0470]
>> [7] func:/usr/local/Mpi/openmpi-1.1.2/lib/openmpi/mca_rmgr_urm.so [0xb754d62f]
>> [8] func:/usr/local/Mpi/openmpi-1.1.2/lib/liborte.so.0(orte_rmgr_base_cmd_dispatch+0x137) [0xb7fd9187]
>> [9] func:/usr/local/Mpi/openmpi-1.1.2/lib/openmpi/mca_rmgr_urm.so [0xb754e09e]
>> [10] func:/usr/local/Mpi/openmpi-1.1.2/lib/liborte.so.0 [0xb7fcd00e]
>> [11] func:/usr/local/Mpi/openmpi-1.1.2/lib/openmpi/mca_oob_tcp.so [0xb7585084]
>> [12] func:/usr/local/Mpi/openmpi-1.1.2/lib/openmpi/mca_oob_tcp.so [0xb7586763]
>> [13] func:/usr/local/Mpi/openmpi-1.1.2/lib/libopal.so.0(opal_event_loop+0x199) [0xb7f5f7a9]
>> [14] func:/usr/local/Mpi/openmpi-1.1.2/lib/libopal.so.0 [0xb7f60353]
>> [15] func:/lib/tls/libpthread.so.0 [0xb7ef7b63]
>> [16] func:/lib/tls/libc.so.6(__clone+0x5a) [0xb7e9518a]
>> *** End of error message ***
>> <----------------------------------------------->
>>
>> * With Open MPI 1.1.1, the program simply hangs on the MPI_Comm_spawn_multiple
>> call:
>> <--------------------------------->
>> [myhost:17187] [0,0,0] ORTE_ERROR_LOG: Not available in file ras_bjs.c at line 253
>> main_exe: Begining of main_exe
>> main_exe: Call MPI_Init
>> main_exe: Call MPI_Comm_spawn_multiple()
>> [myhost:17187] [0,0,0] ORTE_ERROR_LOG: Not available in file ras_bjs.c at line 253
>> <--------------------------------->
>>
>> * With Open MPI 1.0.2, the program also hangs on the MPI_Comm_spawn_multiple
>> call, but there is no ORTE_ERROR_LOG:
>> <--------------------------------->
>> main_exe: Begining of main_exe
>> main_exe: Call MPI_Init
>> main_exe: Call MPI_Comm_spawn_multiple()
>> <--------------------------------->
>>
>>
>> * With Open MPI 1.1.2 in a non-bproc environment, the program works just
>> fine:
>> <--------------------------------->
>> main_exe: Begining of main_exe
>> main_exe: Call MPI_Init
>> main_exe: Call MPI_Comm_spawn_multiple()
>> spawned_exe: Begining of spawned_exe
>> spawned_exe: Call MPI_Init
>> main_exe: Back from MPI_Comm_spawn_multiple() result = 0
>> main_exe: Spawned exe returned errcode = 0
>> spawned_exe: This exe does not do really much thing actually
>> main_exe: Call MPI_finalize
>> main_exe: End of main_exe
>> <--------------------------------->
>>
>> Can you help me solve this problem?
>>
>> Regards.
>>
>> Herve
>>
>>
>> The bproc release is:
>> bproc: Beowulf Distributed Process Space Version 4.0.0pre8
>> bproc: (C) 1999-2003 Erik Hendriks <erik_at_[hidden]>
>> bproc: Initializing node set. node_ct=1 id_ct=1
>>
>> The system is a Debian Sarge with a 2.6.9 kernel installed and patched with
>> bproc.
>>
>> Finally, I provide the ompi_info log for the Open MPI 1.1.2 release:
>> Open MPI: 1.1.2
>> Open MPI SVN revision: r12073
>> Open RTE: 1.1.2
>> Open RTE SVN revision: r12073
>> OPAL: 1.1.2
>> OPAL SVN revision: r12073
>> Prefix: /usr/local/Mpi/openmpi-1.1.2
>> Configured architecture: i686-pc-linux-gnu
>> Configured by: itrsat
>> Configured on: Mon Oct 23 12:55:17 CEST 2006
>> Configure host: myhost
>> Built by: setics
>> Built on: lun oct 23 13:09:47 CEST 2006
>> Built host: myhost
>> C bindings: yes
>> C++ bindings: yes
>> Fortran77 bindings: no
>> Fortran90 bindings: no
>> Fortran90 bindings size: na
>> C compiler: gcc
>> C compiler absolute: /usr/bin/gcc
>> C++ compiler: g++
>> C++ compiler absolute: /usr/bin/g++
>> Fortran77 compiler: none
>> Fortran77 compiler abs: none
>> Fortran90 compiler: none
>> Fortran90 compiler abs: none
>> C profiling: yes
>> C++ profiling: yes
>> Fortran77 profiling: no
>> Fortran90 profiling: no
>> C++ exceptions: no
>> Thread support: posix (mpi: yes, progress: yes)
>> Internal debug support: no
>> MPI parameter check: runtime
>> Memory profiling support: no
>> Memory debugging support: no
>> libltdl support: yes
>> MCA memory: ptmalloc2 (MCA v1.0, API v1.0, Component v1.1.2)
>> MCA paffinity: linux (MCA v1.0, API v1.0, Component v1.1.2)
>> MCA maffinity: first_use (MCA v1.0, API v1.0, Component v1.1.2)
>> MCA timer: linux (MCA v1.0, API v1.0, Component v1.1.2)
>> MCA allocator: basic (MCA v1.0, API v1.0, Component v1.0)
>> MCA allocator: bucket (MCA v1.0, API v1.0, Component v1.0)
>> MCA coll: basic (MCA v1.0, API v1.0, Component v1.1.2)
>> MCA coll: hierarch (MCA v1.0, API v1.0, Component v1.1.2)
>> MCA coll: self (MCA v1.0, API v1.0, Component v1.1.2)
>> MCA coll: sm (MCA v1.0, API v1.0, Component v1.1.2)
>> MCA coll: tuned (MCA v1.0, API v1.0, Component v1.1.2)
>> MCA io: romio (MCA v1.0, API v1.0, Component v1.1.2)
>> MCA mpool: sm (MCA v1.0, API v1.0, Component v1.1.2)
>> MCA pml: ob1 (MCA v1.0, API v1.0, Component v1.1.2)
>> MCA bml: r2 (MCA v1.0, API v1.0, Component v1.1.2)
>> MCA rcache: rb (MCA v1.0, API v1.0, Component v1.1.2)
>> MCA btl: self (MCA v1.0, API v1.0, Component v1.1.2)
>> MCA btl: sm (MCA v1.0, API v1.0, Component v1.1.2)
>> MCA btl: tcp (MCA v1.0, API v1.0, Component v1.0)
>> MCA topo: unity (MCA v1.0, API v1.0, Component v1.1.2)
>> MCA osc: pt2pt (MCA v1.0, API v1.0, Component v1.0)
>> MCA gpr: null (MCA v1.0, API v1.0, Component v1.1.2)
>> MCA gpr: proxy (MCA v1.0, API v1.0, Component v1.1.2)
>> MCA gpr: replica (MCA v1.0, API v1.0, Component v1.1.2)
>> MCA iof: proxy (MCA v1.0, API v1.0, Component v1.1.2)
>> MCA iof: svc (MCA v1.0, API v1.0, Component v1.1.2)
>> MCA ns: proxy (MCA v1.0, API v1.0, Component v1.1.2)
>> MCA ns: replica (MCA v1.0, API v1.0, Component v1.1.2)
>> MCA oob: tcp (MCA v1.0, API v1.0, Component v1.0)
>> MCA ras: bjs (MCA v1.0, API v1.0, Component v1.1.2)
>> MCA ras: dash_host (MCA v1.0, API v1.0, Component v1.1.2)
>> MCA ras: hostfile (MCA v1.0, API v1.0, Component v1.1.2)
>> MCA ras: localhost (MCA v1.0, API v1.0, Component v1.1.2)
>> MCA ras: lsf_bproc (MCA v1.0, API v1.0, Component v1.1.2)
>> MCA ras: poe (MCA v1.0, API v1.0, Component v1.1.2)
>> MCA ras: slurm (MCA v1.0, API v1.0, Component v1.1.2)
>> MCA rds: hostfile (MCA v1.0, API v1.0, Component v1.1.2)
>> MCA rds: resfile (MCA v1.0, API v1.0, Component v1.1.2)
>> MCA rmaps: round_robin (MCA v1.0, API v1.0, Component v1.1.2)
>> MCA rmgr: proxy (MCA v1.0, API v1.0, Component v1.1.2)
>> MCA rmgr: urm (MCA v1.0, API v1.0, Component v1.1.2)
>> MCA rml: oob (MCA v1.0, API v1.0, Component v1.1.2)
>> MCA pls: bproc (MCA v1.0, API v1.0, Component v1.1.2)
>> MCA pls: bproc_orted (MCA v1.0, API v1.0, Component v1.1.2)
>> MCA pls: fork (MCA v1.0, API v1.0, Component v1.1.2)
>> MCA pls: rsh (MCA v1.0, API v1.0, Component v1.1.2)
>> MCA pls: slurm (MCA v1.0, API v1.0, Component v1.1.2)
>> MCA sds: bproc (MCA v1.0, API v1.0, Component v1.1.2)
>> MCA sds: env (MCA v1.0, API v1.0, Component v1.1.2)
>> MCA sds: pipe (MCA v1.0, API v1.0, Component v1.1.2)
>> MCA sds: seed (MCA v1.0, API v1.0, Component v1.1.2)
>> MCA sds: singleton (MCA v1.0, API v1.0, Component v1.1.2)
>> MCA sds: slurm (MCA v1.0, API v1.0, Component v1.1.2)
>> MCA soh: bproc (MCA v1.0, API v1.0, Component v1.1.2)
>>
>> Here below, the code listings:
>> * main_exe.c
>> <------------------------------------------------------------------->
>> #include "mpi.h"
>> #include <stdlib.h>
>> #include <stdio.h>
>> #include <unistd.h>
>> int gethostname(char *nom, size_t lg);
>>
>> int main( int argc, char **argv ) {
>>
>> /*
>> * MPI_Comm_spawn_multiple parameters
>> */
>> int result, count, root;
>> int maxprocs;
>> char **commands;
>> MPI_Info infos;
>> int errcodes;
>>
>> MPI_Comm intercomm, newintracomm;
>> int rank;
>> char hostname[80];
>> int len = sizeof(hostname); /* was uninitialized before being passed to gethostname */
>>
>> printf( "main_exe: Begining of main_exe\n");
>> printf( "main_exe: Call MPI_Init\n");
>> MPI_Init( &argc, &argv );
>> MPI_Comm_rank( MPI_COMM_WORLD, &rank );
>>
>> /*
>> * MPI_Comm_spawn_multiple parameters
>> */
>> count = 1;
>> maxprocs = 1;
>> root = rank;
>>
>> commands = malloc (sizeof (char *));
>> commands[0] = calloc (80, sizeof (char ));
>> sprintf (commands[0], "./spawned_exe");
>>
>> MPI_Info_create( &infos );
>>
>> /* set proc/cpu info */
>> result = MPI_Info_set( infos, "soft", "0:1" );
>>
>> /* set host info */
>> result = gethostname ( hostname, len);
>> if ( -1 == result ) {
>> printf ("main_exe: Problem in gethostname\n");
>> }
>> result = MPI_Info_set( infos, "host", hostname );
>>
>> printf( "main_exe: Call MPI_Comm_spawn_multiple()\n");
>> result = MPI_Comm_spawn_multiple( count,
>> commands,
>> MPI_ARGVS_NULL,
>> &maxprocs,
>> &infos,
>> root,
>> MPI_COMM_WORLD,
>> &intercomm,
>> &errcodes );
>> printf( "main_exe: Back from MPI_Comm_spawn_multiple() result = %d\n",
>> result);
>> printf( "main_exe: Spawned exe returned errcode = %d\n", errcodes );
>>
>> MPI_Intercomm_merge( intercomm, 0, &newintracomm );
>>
>> /* Synchronisation with spawned exe */
>> MPI_Barrier( newintracomm );
>>
>> free( commands[0] );
>> free( commands );
>> MPI_Comm_free( &newintracomm );
>>
>> printf( "main_exe: Call MPI_finalize\n");
>> MPI_Finalize( );
>>
>> printf( "main_exe: End of main_exe\n");
>> return 0;
>> }
>>
>> <------------------------------------------------------------------->
>>
>> * spawned_exe.c
>> <------------------------------------------------------------------->
>>
>> #include "mpi.h"
>> #include <stdio.h>
>>
>> int main( int argc, char **argv ) {
>> MPI_Comm parent, newintracomm;
>>
>> printf ("spawned_exe: Begining of spawned_exe\n");
>> printf( "spawned_exe: Call MPI_Init\n");
>> MPI_Init( &argc, &argv );
>>
>> MPI_Comm_get_parent ( &parent );
>> MPI_Intercomm_merge ( parent, 1, &newintracomm );
>>
>> printf( "spawned_exe: This exe does not do really much thing actually\n"
>> );
>>
>> /* Synchronisation with main exe */
>> MPI_Barrier( newintracomm );
>>
>> MPI_Comm_free( &newintracomm );
>>
>> printf( "spawned_exe: Call MPI_finalize\n");
>> MPI_Finalize( );
>>
>> printf( "spawned_exe: End of spawned_exe\n");
>> return 0;
>> }
>>
>> <------------------------------------------------------------------->
>>
>
>
>
>
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users