Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] dropping a pls module into an Open MPI build
From: Ralph H Castain (rhc_at_[hidden])
Date: 2008-01-23 09:19:38


Hi Dean

Had to ponder this for a while. I'm not entirely sure of the source of the
problem, but one suspicion has to do with the name of the module. Open MPI
ships with a module named "rsh" in the PLS framework. The MCA uses the
module name in its loading process.

If you insert another module with the identical "rsh" name, then the MCA is
likely to have a problem. Your output indicates that the PLS module has a
NULL function pointer somewhere. My suspicion is that the MCA is seeing two
modules with identical names and getting confused.

Recompiling the existing rsh module and "dropping" that into the precompiled
Open MPI distribution will definitely create that conflict!

Note that you will always get through open and init - the problem is that
select is getting confused by the conflicting names and returning a NULL set
of function pointers.
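
To make the suspected failure mode concrete, here is a toy model of a name-keyed component lookup. It is only a sketch under stated assumptions: `toy_component_t`, `toy_count_by_name`, `toy_select` and `toy_launch_ok` are hypothetical names invented for illustration, and none of this is the actual MCA selection code. It shows why two entries called "rsh" leave a by-name lookup with no unambiguous answer, and why a caller should check the result before calling through it.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a loaded component; not an Open MPI type. */
typedef struct {
    const char *name;       /* name the toy registry matches on */
    int (*launch)(void);    /* NULL models a module with no usable entry point */
} toy_component_t;

/* Count how many registered components share the given name. */
static size_t toy_count_by_name(const toy_component_t *list, size_t n,
                                const char *name)
{
    size_t hits = 0;
    for (size_t i = 0; i < n; ++i) {
        if (strcmp(list[i].name, name) == 0) {
            ++hits;
        }
    }
    return hits;
}

/* Return the component with this name only if exactly one entry matches
   and its launch pointer is non-NULL; otherwise return NULL. */
static const toy_component_t *toy_select(const toy_component_t *list, size_t n,
                                         const char *name)
{
    const toy_component_t *found = NULL;
    for (size_t i = 0; i < n; ++i) {
        if (strcmp(list[i].name, name) != 0) {
            continue;
        }
        if (found != NULL) {
            return NULL;    /* duplicate name: the choice is ambiguous */
        }
        found = &list[i];
    }
    if (found != NULL && found->launch == NULL) {
        return NULL;        /* never hand back an uncallable module */
    }
    return found;
}

/* Trivial launch function used only for the demonstration. */
static int toy_launch_ok(void) { return 0; }
```

With a list holding two "rsh" entries and one "dean_rsh" entry, `toy_select` returns NULL for "rsh" and a valid pointer for "dean_rsh". The "Failing at address: 0x0" line in the trace below is consistent with a call through a NULL pointer that was not checked this way, though that remains a suspicion rather than a confirmed diagnosis.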

Try renaming your "drop-in" module to something else - maybe "dean_rsh". You
need to do that in the component structure - here is the rsh one from
orte/mca/pls/rsh/pls_rsh_component.c:

/*
 * Instantiate the public struct with all of our public information
 * and pointers to our public functions in it
 */

orte_pls_rsh_component_t mca_pls_rsh_component = {
    {
    /* First, the mca_component_t struct containing meta information
       about the component itself */

    {
        /* Indicate that we are a pls v1.3.0 component (which also
           implies a specific MCA version) */

        ORTE_PLS_BASE_VERSION_1_3_0,

        /* Component name and version */

        "rsh", /***THIS IS WHERE YOU NEED TO PUT YOUR MODULE'S NAME ***/
        ORTE_MAJOR_VERSION,
        ORTE_MINOR_VERSION,
        ORTE_RELEASE_VERSION,

        /* Component open and close functions */

        orte_pls_rsh_component_open,
        orte_pls_rsh_component_close
    },

    /* Next the MCA v1.0.0 component meta data */

    {
        /* The component is checkpoint ready */
        MCA_BASE_METADATA_PARAM_CHECKPOINT
    },

    /* Initialization / querying functions */

    orte_pls_rsh_component_init
    }
};

Obviously, the names of the component structures and functions would be
unique to your module as well.
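
As a rough illustration of that renaming, the sketch below uses simplified stand-in types so it compiles on its own. `sketch_component_t`, `dean_rsh_component_open` and `dean_rsh_component_close` are hypothetical names. A real module would keep the `orte_pls_*` struct type, the version macros and the metadata block shown above, changing only the name string and the identifiers. The exported symbol name `mca_pls_dean_rsh_component` follows the `mca_<framework>_<name>_component` pattern visible in the rsh example; whether the library file must be renamed to match (for example `mca_pls_dean_rsh`) is an assumption worth checking against your source tree.

```c
#include <string.h>

/* Hypothetical, simplified stand-in for the real component struct;
   the actual ORTE type carries version and metadata fields as well. */
typedef struct {
    const char *name;       /* component name used by the loader */
    int (*open)(void);      /* component open function */
    int (*close)(void);     /* component close function */
} sketch_component_t;

/* Functions renamed so they cannot clash with the shipped rsh module. */
static int dean_rsh_component_open(void)  { return 0; }
static int dean_rsh_component_close(void) { return 0; }

/* Exported struct: both the symbol and the name string are unique. */
sketch_component_t mca_pls_dean_rsh_component = {
    "dean_rsh",                 /* was "rsh" in the shipped component */
    dean_rsh_component_open,
    dean_rsh_component_close
};
```

The renamed module would then presumably be requested with `-mca pls dean_rsh` instead of `-mca pls rsh`.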

Hope that helps.
Ralph

On 1/18/08 12:17 PM, "Dean Dauger, Ph. D." <d_at_[hidden]> wrote:

> Hello,
>
> I'm developing an mca_pls module, intending to drop it into a
> preexisting Open MPI build (in its lib/openmpi directory) and have
> orterun pick it up, but orterun kept crashing on me even though it
> correctly calls my module. To help isolate the issue I separately
> recompiled the mca_pls_rsh module from a given Open MPI source
> checkout and dropping that didn't work either. Any pointers?
>
> To give an idea of what's going on here's an example attempt to run
> on two local processors:
>
> dauger$ orterun -mca pls rsh -mca pls_base_verbose 10 --debug-devel --np 2 --host localhost "/Users/dauger/Documents/ompi-trunk/pingpong"
> [Rotarran-X-5.local:04475] connect_uni: connection not allowed
> [Rotarran-X-5.local:04475] mca: base: components_open: Looking for
> pls components
> [Rotarran-X-5.local:04475] mca: base: components_open: distilling pls
> components
> [Rotarran-X-5.local:04475] mca: base: components_open: including pls
> components
> [Rotarran-X-5.local:04475] mca: base: components_open: rsh -->
> included
> [Rotarran-X-5.local:04475] mca: base: components_open: opening pls
> components
> [Rotarran-X-5.local:04475] mca: base: components_open: found loaded
> component rsh
> [Rotarran-X-5.local:04475] mca: base: components_open: component rsh
> open function successful
> [Rotarran-X-5.local:04475] orte:base:select: querying component rsh
> [Rotarran-X-5.local:04475] [0,0,0] setting up session dir with
> [Rotarran-X-5.local:04475] universe default-universe-4475
> [Rotarran-X-5.local:04475] user dauger
> [Rotarran-X-5.local:04475] host Rotarran-X-5.local
> [Rotarran-X-5.local:04475] jobid 0
> [Rotarran-X-5.local:04475] procid 0
> [Rotarran-X-5.local:04475] procdir: /var/folders/oE/oENz6Cd+FTCWQbRGkntLLU+++TI/-Tmp-//openmpi-sessions-dauger_at_Rotarran-X-5.local_0/default-universe-4475/0/0
> [Rotarran-X-5.local:04475] jobdir: /var/folders/oE/oENz6Cd+FTCWQbRGkntLLU+++TI/-Tmp-//openmpi-sessions-dauger_at_Rotarran-X-5.local_0/default-universe-4475/0
> [Rotarran-X-5.local:04475] unidir: /var/folders/oE/oENz6Cd+FTCWQbRGkntLLU+++TI/-Tmp-//openmpi-sessions-dauger_at_Rotarran-X-5.local_0/default-universe-4475
> [Rotarran-X-5.local:04475] top: openmpi-sessions-dauger_at_Rotarran-X-5.local_0
> [Rotarran-X-5.local:04475] tmp: /var/folders/oE/oENz6Cd+FTCWQbRGkntLLU+++TI/-Tmp-/
> [Rotarran-X-5.local:04475] [0,0,0] contact_file /var/folders/oE/oENz6Cd+FTCWQbRGkntLLU+++TI/-Tmp-//openmpi-sessions-dauger_at_Rotarran-X-5.local_0/default-universe-4475/universe-setup.txt
> [Rotarran-X-5.local:04475] [0,0,0] wrote setup file
> [Rotarran-X-5:04475] *** Process received signal ***
> [Rotarran-X-5:04475] Signal: Bus error (10)
> [Rotarran-X-5:04475] Signal code: (2)
> [Rotarran-X-5:04475] Failing at address: 0x0
> [ 1] [0xbffff828, 0x00000000] (-P-)
> [ 2] (orterun + 0x457) [0xbffff8b8, 0x00001d07]
> [ 3] (main + 0x18) [0xbffff8d8, 0x000018ae]
> [ 4] (start + 0x36) [0xbffff8fc, 0x0000186a]
> [ 5] [0x00000000, 0x0000000d] (FP-)
> [Rotarran-X-5:04475] *** End of error message ***
> Bus error
>
> pingpong was compiled with the existing Open MPI, and it runs with
> the built-in rsh module, but not when I replace the pls_rsh module
> with a recompiled one. When I add printf's in the pls_rsh module in
> its _open and _init, I can show each of its subroutines return
> without problem, but _launch is not yet called. I'm running Mac OS X
> 10.5.1, which ships with Open MPI at /usr, on a MacBook Pro with an
> Intel Core Duo. ("Rotarran X.5" is the name of the computer.) I
> first attempted the 1.3.0 source code via svn, then went back to the
> 1.2.3 source code from Open MPI, but both gave the above bus error.
> Then I went to Apple's copy of Open MPI 1.2.3 at opensource.apple.com
> guessing Apple changed things, but that still doesn't work. I've
> tried their take on ./configure options too to no avail. Other than
> debugging orterun, what else can I try?
>
> Thanks in advance,
> Dean