Open MPI Development Mailing List Archives

This page is part of a frozen web archive of this mailing list.

You can still navigate around this archive, but know that no new mails have been added to it since July of 2016.

Subject: Re: [OMPI devel] New MOSIX components draft
From: Ralph Castain (rhc_at_[hidden])
Date: 2012-04-01 09:59:34


I suspect the problem is here:

/**
+ * MOSIX BTL component.
+ */
+struct mca_btl_base_component_t {
+ mca_btl_base_component_2_0_0_t super; /**< base BTL component */
+ mca_btl_mosix_module_t mosix_module; /**< local module */
+};
+typedef struct mca_btl_base_component_t mca_btl_mosix_component_t;
+
+OMPI_MODULE_DECLSPEC extern mca_btl_mosix_component_t mca_btl_mosix_component;
+

You redefined the mca_btl_base_component_t struct. What we usually do is define a new struct:

struct mca_btl_mosix_component_t {
        mca_btl_base_component_t super; /**< base BTL component */
        mca_btl_mosix_module_t mosix_module; /**< local module */
};
typedef struct mca_btl_mosix_component_t mca_btl_mosix_component_t;

You can then extend that component struct with your additional info, leaving the base component to hold the required minimal elements.
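
To illustrate the pattern, here is a minimal sketch in plain C. The types below (base_component_t, base_module_t, mosix_component_t) and their fields are hypothetical stand-ins, not the real Open MPI definitions. It only shows the general idea: generic code reads fields through a pointer to the base struct, so the base has to be embedded unchanged as the first member of the derived struct.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a framework's base component type.
 * Generic code only ever knows about this layout. */
typedef struct base_component {
    const char *name;      /* read by generic framework code */
    int (*open)(void);     /* optional hook, unused in this sketch */
} base_component_t;

/* Hypothetical stand-in for a module that points back at its component. */
typedef struct base_module {
    base_component_t *component;
} base_module_t;

/* Derived component: the base is embedded unchanged as the FIRST member,
 * and component-specific fields are appended after it. */
typedef struct mosix_component {
    base_component_t super;
    int extra_state;       /* illustrative component-specific data */
} mosix_component_t;

static mosix_component_t mosix_component = {
    .super = { .name = "mosix", .open = NULL },
    .extra_state = 42,
};

/* Generic code: reaches the name through the base layout only. */
const char *component_name(const base_module_t *module)
{
    return module->component->name;
}

/* Component-specific code: recover the derived struct from the base
 * pointer. This is valid only because 'super' sits at offset 0. */
mosix_component_t *as_mosix(base_component_t *c)
{
    assert(offsetof(mosix_component_t, super) == 0);
    return (mosix_component_t *)c;
}

/* Wire a module to the derived component and look up its name
 * the way generic code would. */
const char *demo_lookup(void)
{
    base_module_t module = { .component = &mosix_component.super };
    return component_name(&module);
}
```

Calling demo_lookup() goes through the same kind of pointer chain as btl->btl_component->... in the quoted bml_r2.c hunk. If the base type had a different layout from the one the generic code was compiled against, that read could land on the wrong bytes.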

On Apr 1, 2012, at 1:59 AM, Alex Margolin wrote:

> I traced the problem to the BML component:
> Index: ompi/mca/bml/r2/bml_r2.c
> ===================================================================
> --- ompi/mca/bml/r2/bml_r2.c (revision 26191)
> +++ ompi/mca/bml/r2/bml_r2.c (working copy)
> @@ -105,6 +105,8 @@
> }
> }
> if (NULL == btl_names_argv || NULL == btl_names_argv[i]) {
> + printf("\n\nR1: %p\n\n", btl->btl_component->btl_version.mca_component_name);
> + printf("\n\nR2: %s\n\n", btl->btl_component->btl_version.mca_component_name);
> opal_argv_append_nosize(&btl_names_argv,
> btl->btl_component->btl_version.mca_component_name);
> }
>
> I get (whitespace removed) for a normal run:
> R1: 0x7f820e3c31d8
> R2: self
> R1: 0x7f820e13c598
> R2: tcp
> ... and for my module:
> R1: 0x38
> - and then the segmentation fault.
> I guess it has something to do with the way I initialize my component - I'll resume debugging after lunch.
>
> Alex
>
> On 03/31/2012 07:04 PM, Alex Margolin wrote:
>>
>> P.S. I get the following error - I'm pretty sure my BTL is to blame here:
>>
>> alex_at_singularity:~/huji/benchmarks/simple$ mpirun -mca btl_base_verbose 100 -mca btl self,mosix hello
>> [singularity:10838] mca: base: component_find: unable to open /usr/local/lib/openmpi/mca_mpool_sm: libmca_common_sm.so.0: cannot open shared object file: No such file or directory (ignored)
>> [singularity:10838] mca: base: components_open: Looking for btl components
>> [singularity:10838] mca: base: components_open: opening btl components
>> [singularity:10838] mca: base: components_open: found loaded component mosix
>> [singularity:10838] mca: base: components_open: component mosix register function successful
>> [singularity:10838] mca: base: components_open: component mosix open function successful
>> [singularity:10838] mca: base: components_open: found loaded component self
>> [singularity:10838] mca: base: components_open: component self has no register function
>> [singularity:10838] mca: base: components_open: component self open function successful
>> [singularity:10838] mca: base: component_find: unable to open /usr/local/lib/openmpi/mca_coll_sm: libmca_common_sm.so.0: cannot open shared object file: No such file or directory (ignored)
>> [singularity:10838] select: initializing btl component mosix
>> [singularity:10838] select: init of component mosix returned success
>> [singularity:10838] select: initializing btl component self
>> [singularity:10838] select: init of component self returned success
>> [singularity:10838] *** Process received signal ***
>> [singularity:10838] Signal: Segmentation fault (11)
>> [singularity:10838] Signal code: Address not mapped (1)
>> [singularity:10838] Failing at address: 0x30
>> [singularity:10838] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x36420) [0x7fa94a3cd420]
>> [singularity:10838] [ 1] /lib/x86_64-linux-gnu/libc.so.6(+0x84391) [0x7fa94a41b391]
>> [singularity:10838] [ 2] /lib/x86_64-linux-gnu/libc.so.6(__strdup+0x16) [0x7fa94a41b086]
>> [singularity:10838] [ 3] /usr/local/lib/libmpi.so.0(opal_argv_append_nosize+0xf7) [0x7fa94add66a4]
>> [singularity:10838] [ 4] /usr/local/lib/openmpi/mca_bml_r2.so(+0x1cf5) [0x7fa946177cf5]
>> [singularity:10838] [ 5] /usr/local/lib/openmpi/mca_bml_r2.so(+0x1e50) [0x7fa946177e50]
>> [singularity:10838] [ 6] /usr/local/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_add_procs+0x12f) [0x7fa946382b6d]
>> [singularity:10838] [ 7] /usr/local/lib/libmpi.so.0(ompi_mpi_init+0x909) [0x7fa94acd1549]
>> [singularity:10838] [ 8] /usr/local/lib/libmpi.so.0(MPI_Init+0x16c) [0x7fa94ad033ec]
>> [singularity:10838] [ 9] /home/alex/huji/benchmarks/simple/hello(_ZN3MPI4InitERiRPPc+0x23) [0x409e2d]
>> [singularity:10838] [10] /home/alex/huji/benchmarks/simple/hello(main+0x22) [0x408f66]
>> [singularity:10838] [11] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xed) [0x7fa94a3b830d]
>> [singularity:10838] [12] /home/alex/huji/benchmarks/simple/hello() [0x408e89]
>> [singularity:10838] *** End of error message ***
>> --------------------------------------------------------------------------
>> mpirun noticed that process rank 0 with PID 10838 on node singularity exited on signal 11 (Segmentation fault).
>> --------------------------------------------------------------------------
>> alex_at_singularity:~/huji/benchmarks/simple$ mpirun -mca btl self,tcp hello
>> [singularity:10841] mca: base: component_find: unable to open /usr/local/lib/openmpi/mca_mpool_sm: libmca_common_sm.so.0: cannot open shared object file: No such file or directory (ignored)
>> [singularity:10841] mca: base: component_find: unable to open /usr/local/lib/openmpi/mca_coll_sm: libmca_common_sm.so.0: cannot open shared object file: No such file or directory (ignored)
>> Hello world!
>> alex_at_singularity:~/huji/benchmarks/simple$
>