Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] New MOSIX components draft
From: Alex Margolin (alex.margolin_at_[hidden])
Date: 2012-04-02 09:47:44


I found the problem(s) - it was more than just the type redefinition,
but I fixed that too. I also added some code in btl/base to detect and
prevent a similar problem in the future; a simplified sketch of that
check appears below. A newer version of my MOSIX patch (odls + btl +
fix) is attached. The BTL still doesn't work, though, and when I try
to run it under valgrind it fails with some Open MPI internal errors,
which are most likely unrelated to my patch. I'll keep working on it,
but maybe someone who knows this part of the code should take a look...
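
The detection logic is roughly the following (a minimal sketch rather
than the exact patch - the helper name and the length bound are
illustrative):

#include <ctype.h>
#include "ompi/mca/btl/btl.h"

/* Illustrative bound; the real code should use the framework's own
 * component-name length limit. */
#define BTL_NAME_CHECK_MAX 64

/* Return 0 if the component's name is a printable, NUL-terminated
 * string, or -1 if it looks like garbage - e.g. because the struct
 * layout was clobbered by a redefined base component type. */
static int btl_base_component_name_ok(const mca_btl_base_component_t *comp)
{
    const char *name = comp->btl_version.mca_component_name;
    int i;

    if (NULL == name) {
        return -1;
    }
    for (i = 0; i < BTL_NAME_CHECK_MAX; ++i) {
        if ('\0' == name[i]) {
            return (i > 0) ? 0 : -1;    /* an empty name is suspect too */
        }
        if (!isprint((unsigned char) name[i])) {
            return -1;                  /* garbage byte: layout mismatch */
        }
    }
    return -1;                          /* no terminator within bounds */
}

The idea is simply to catch a clobbered component struct (like the one
caused by my type redefinition) at registration time, instead of
segfaulting later in bml_r2.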

alex_at_singularity:~/huji/benchmarks/simple$ mpirun -mca btl self,mosix -n 2 valgrind simple
==22752== Memcheck, a memory error detector
==22752== Copyright (C) 2002-2010, and GNU GPL'd, by Julian Seward et al.
==22752== Using Valgrind-3.6.1-Debian and LibVEX; rerun with -h for copyright info
==22752== Command: simple
==22752==
==22753== Memcheck, a memory error detector
==22753== Copyright (C) 2002-2010, and GNU GPL'd, by Julian Seward et al.
==22753== Using Valgrind-3.6.1-Debian and LibVEX; rerun with -h for copyright info
==22753== Command: simple
==22753==
==22753== Invalid read of size 8
==22753== at 0x5ACBE0D: _wordcopy_fwd_dest_aligned (wordcopy.c:205)
==22753== by 0x5AC5A6B: __GI_memmove (memmove.c:76)
==22753== by 0x5ACD000: argz_insert (argz-insert.c:55)
==22753== by 0x520A39A: lt_argz_insert (in /usr/local/lib/libmpi.so.0.0.0)
==22753== by 0x520A537: lt_argz_insertinorder (in /usr/local/lib/libmpi.so.0.0.0)
==22753== by 0x520A808: lt_argz_insertdir (in /usr/local/lib/libmpi.so.0.0.0)
==22753== by 0x520A985: list_files_by_dir (in /usr/local/lib/libmpi.so.0.0.0)
==22753== by 0x520AA0A: foreachfile_callback (in /usr/local/lib/libmpi.so.0.0.0)
==22753== by 0x52086AA: foreach_dirinpath (in /usr/local/lib/libmpi.so.0.0.0)
==22753== by 0x520AADB: lt_dlforeachfile (in /usr/local/lib/libmpi.so.0.0.0)
==22753== by 0x52162DD: find_dyn_components (mca_base_component_find.c:319)
==22753== by 0x5215EB6: mca_base_component_find (mca_base_component_find.c:186)
==22753== Address 0x68d9570 is 32 bytes inside a block of size 38 alloc'd
==22753== at 0x4C28F9F: malloc (vg_replace_malloc.c:236)
==22753== by 0x52071CA: lt__malloc (in /usr/local/lib/libmpi.so.0.0.0)
==22753== by 0x520A73D: lt_argz_insertdir (in /usr/local/lib/libmpi.so.0.0.0)
==22753== by 0x520A985: list_files_by_dir (in /usr/local/lib/libmpi.so.0.0.0)
==22753== by 0x520AA0A: foreachfile_callback (in /usr/local/lib/libmpi.so.0.0.0)
==22753== by 0x52086AA: foreach_dirinpath (in /usr/local/lib/libmpi.so.0.0.0)
==22753== by 0x520AADB: lt_dlforeachfile (in /usr/local/lib/libmpi.so.0.0.0)
==22753== by 0x52162DD: find_dyn_components (mca_base_component_find.c:319)
==22753== by 0x5215EB6: mca_base_component_find (mca_base_component_find.c:186)
==22753== by 0x5219AA3: mca_base_components_open (mca_base_components_open.c:129)
==22753== by 0x5246183: opal_paffinity_base_open (paffinity_base_open.c:129)
==22753== by 0x523C013: opal_init (opal_init.c:361)
==22753==
==22752== Invalid read of size 8
==22752== at 0x5ACBE0D: _wordcopy_fwd_dest_aligned (wordcopy.c:205)
==22752== by 0x5AC5A6B: __GI_memmove (memmove.c:76)
==22752== by 0x5ACD000: argz_insert (argz-insert.c:55)
==22752== by 0x520A39A: lt_argz_insert (in /usr/local/lib/libmpi.so.0.0.0)
==22752== by 0x520A537: lt_argz_insertinorder (in /usr/local/lib/libmpi.so.0.0.0)
==22752== by 0x520A808: lt_argz_insertdir (in /usr/local/lib/libmpi.so.0.0.0)
==22752== by 0x520A985: list_files_by_dir (in /usr/local/lib/libmpi.so.0.0.0)
==22752== by 0x520AA0A: foreachfile_callback (in /usr/local/lib/libmpi.so.0.0.0)
==22752== by 0x52086AA: foreach_dirinpath (in /usr/local/lib/libmpi.so.0.0.0)
==22752== by 0x520AADB: lt_dlforeachfile (in /usr/local/lib/libmpi.so.0.0.0)
==22752== by 0x52162DD: find_dyn_components (mca_base_component_find.c:319)
==22752== by 0x5215EB6: mca_base_component_find (mca_base_component_find.c:186)
==22752== Address 0x68d9570 is 32 bytes inside a block of size 38 alloc'd
==22752== at 0x4C28F9F: malloc (vg_replace_malloc.c:236)
==22752== by 0x52071CA: lt__malloc (in /usr/local/lib/libmpi.so.0.0.0)
==22752== by 0x520A73D: lt_argz_insertdir (in /usr/local/lib/libmpi.so.0.0.0)
==22752== by 0x520A985: list_files_by_dir (in /usr/local/lib/libmpi.so.0.0.0)
==22752== by 0x520AA0A: foreachfile_callback (in /usr/local/lib/libmpi.so.0.0.0)
==22752== by 0x52086AA: foreach_dirinpath (in /usr/local/lib/libmpi.so.0.0.0)
==22752== by 0x520AADB: lt_dlforeachfile (in /usr/local/lib/libmpi.so.0.0.0)
==22752== by 0x52162DD: find_dyn_components (mca_base_component_find.c:319)
==22752== by 0x5215EB6: mca_base_component_find (mca_base_component_find.c:186)
==22752== by 0x5219AA3: mca_base_components_open (mca_base_components_open.c:129)
==22752== by 0x5246183: opal_paffinity_base_open (paffinity_base_open.c:129)
==22752== by 0x523C013: opal_init (opal_init.c:361)
==22752==
[singularity:22753] mca: base: component_find: unable to open /usr/local/lib/openmpi/mca_mpool_sm: libmca_common_sm.so.0: cannot open shared object file: No such file or directory (ignored)
[singularity:22752] mca: base: component_find: unable to open /usr/local/lib/openmpi/mca_mpool_sm: libmca_common_sm.so.0: cannot open shared object file: No such file or directory (ignored)
[singularity:22753] mca: base: component_find: unable to open /usr/local/lib/openmpi/mca_coll_sm: libmca_common_sm.so.0: cannot open shared object file: No such file or directory (ignored)
[singularity:22752] mca: base: component_find: unable to open /usr/local/lib/openmpi/mca_coll_sm: libmca_common_sm.so.0: cannot open shared object file: No such file or directory (ignored)
==22753== Warning: invalid file descriptor 207618048 in syscall open()
==22752== Warning: invalid file descriptor 207618048 in syscall open()
--------------------------------------------------------------------------
At least one pair of MPI processes are unable to reach each other for
MPI communications. This means that no Open MPI device has indicated
that it can be used to communicate between these processes. This is
an error; Open MPI requires that all MPI processes be able to reach
each other. This error can sometimes be the result of forgetting to
specify the "self" BTL.

   Process 1 ([[59806,1],0]) is on host: singularity
   Process 2 ([[59806,1],1]) is on host: singularity
   BTLs attempted: self

Your MPI job is now going to abort; sorry.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
MPI_INIT has failed because at least one MPI process is unreachable
from another. This *usually* means that an underlying communication
plugin -- such as a BTL or an MTL -- has either not loaded or not
allowed itself to be used. Your MPI job will now abort.

You may wish to try to narrow down the problem;

  * Check the output of ompi_info to see which BTL/MTL plugins are
    available.
  * Run your application with MPI_THREAD_SINGLE.
  * Set the MCA parameter btl_base_verbose to 100 (or mtl_base_verbose,
    if using MTL-based communications) to see exactly which
    communication plugins were considered and/or discarded.
--------------------------------------------------------------------------
==22752== Use of uninitialised value of size 8
==22752== at 0x5A8631B: _itoa_word (_itoa.c:195)
==22752== by 0x5A8AE43: vfprintf (vfprintf.c:1622)
==22752== by 0x5AADB83: vasprintf (vasprintf.c:64)
==22752== by 0x524C150: opal_show_help_vstring (show_help.c:309)
==22752== by 0x51786F1: orte_show_help (show_help.c:648)
==22752== by 0x50B4693: backend_fatal_aggregate (errhandler_predefined.c:205)
==22752== by 0x50B4A9B: backend_fatal (errhandler_predefined.c:329)
==22752== by 0x50B3FAF: ompi_mpi_errors_are_fatal_comm_handler (errhandler_predefined.c:68)
==22752== by 0x50B38FA: ompi_errhandler_invoke (errhandler_invoke.c:41)
==22752== by 0x50FD446: PMPI_Init (pinit.c:95)
==22752== by 0x40A128: MPI::Init(int&, char**&) (in /home/alex/huji/benchmarks/simple/simple)
==22752== by 0x409118: main (in /home/alex/huji/benchmarks/simple/simple)
==22752==
==22752== Conditional jump or move depends on uninitialised value(s)
==22752== at 0x5A86325: _itoa_word (_itoa.c:195)
==22752== by 0x5A8AE43: vfprintf (vfprintf.c:1622)
==22752== by 0x5AADB83: vasprintf (vasprintf.c:64)
==22752== by 0x524C150: opal_show_help_vstring (show_help.c:309)
==22752== by 0x51786F1: orte_show_help (show_help.c:648)
==22752== by 0x50B4693: backend_fatal_aggregate (errhandler_predefined.c:205)
==22752== by 0x50B4A9B: backend_fatal (errhandler_predefined.c:329)
==22752== by 0x50B3FAF: ompi_mpi_errors_are_fatal_comm_handler (errhandler_predefined.c:68)
==22752== by 0x50B38FA: ompi_errhandler_invoke (errhandler_invoke.c:41)
==22752== by 0x50FD446: PMPI_Init (pinit.c:95)
==22752== by 0x40A128: MPI::Init(int&, char**&) (in /home/alex/huji/benchmarks/simple/simple)
==22752== by 0x409118: main (in /home/alex/huji/benchmarks/simple/simple)
==22752==
[singularity:22752] *** An error occurred in MPI_Init
[singularity:22752] *** reported by process [3919446017,0]
[singularity:22752] *** on a NULL communicator
[singularity:22752] *** Unknown error
[singularity:22752] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[singularity:22752] *** and potentially your MPI job)
--------------------------------------------------------------------------
An MPI process is aborting at a time when it cannot guarantee that all
of its peer processes in the job will be killed properly. You should
double check that everything has shut down cleanly.

   Reason: Before MPI_INIT completed
   Local host: singularity
   PID: 22752
--------------------------------------------------------------------------
==22753== Use of uninitialised value of size 8
==22753== at 0x5A8631B: _itoa_word (_itoa.c:195)
==22753== by 0x5A8AE43: vfprintf (vfprintf.c:1622)
==22753== by 0x5AADB83: vasprintf (vasprintf.c:64)
==22753== by 0x524C150: opal_show_help_vstring (show_help.c:309)
==22753== by 0x51786F1: orte_show_help (show_help.c:648)
==22753== by 0x50B4693: backend_fatal_aggregate (errhandler_predefined.c:205)
==22753== by 0x50B4A9B: backend_fatal (errhandler_predefined.c:329)
==22753== by 0x50B3FAF: ompi_mpi_errors_are_fatal_comm_handler (errhandler_predefined.c:68)
==22753== by 0x50B38FA: ompi_errhandler_invoke (errhandler_invoke.c:41)
==22753== by 0x50FD446: PMPI_Init (pinit.c:95)
==22753== by 0x40A128: MPI::Init(int&, char**&) (in /home/alex/huji/benchmarks/simple/simple)
==22753== by 0x409118: main (in /home/alex/huji/benchmarks/simple/simple)
==22753==
==22753== Conditional jump or move depends on uninitialised value(s)
==22753== at 0x5A86325: _itoa_word (_itoa.c:195)
==22753== by 0x5A8AE43: vfprintf (vfprintf.c:1622)
==22753== by 0x5AADB83: vasprintf (vasprintf.c:64)
==22753== by 0x524C150: opal_show_help_vstring (show_help.c:309)
==22753== by 0x51786F1: orte_show_help (show_help.c:648)
==22753== by 0x50B4693: backend_fatal_aggregate (errhandler_predefined.c:205)
==22753== by 0x50B4A9B: backend_fatal (errhandler_predefined.c:329)
==22753== by 0x50B3FAF: ompi_mpi_errors_are_fatal_comm_handler (errhandler_predefined.c:68)
==22753== by 0x50B38FA: ompi_errhandler_invoke (errhandler_invoke.c:41)
==22753== by 0x50FD446: PMPI_Init (pinit.c:95)
==22753== by 0x40A128: MPI::Init(int&, char**&) (in /home/alex/huji/benchmarks/simple/simple)
==22753== by 0x409118: main (in /home/alex/huji/benchmarks/simple/simple)
==22753==
==22752==
==22752== HEAP SUMMARY:
==22752== in use at exit: 730,332 bytes in 2,844 blocks
==22752== total heap usage: 4,959 allocs, 2,115 frees, 11,353,797 bytes allocated
==22752==
==22753==
==22753== HEAP SUMMARY:
==22753== in use at exit: 730,332 bytes in 2,844 blocks
==22753== total heap usage: 4,970 allocs, 2,126 frees, 11,354,058 bytes allocated
==22753==
==22752== LEAK SUMMARY:
==22752== definitely lost: 2,138 bytes in 52 blocks
==22752== indirectly lost: 7,440 bytes in 12 blocks
==22752== possibly lost: 0 bytes in 0 blocks
==22752== still reachable: 720,754 bytes in 2,780 blocks
==22752== suppressed: 0 bytes in 0 blocks
==22752== Rerun with --leak-check=full to see details of leaked memory
==22752==
==22752== For counts of detected and suppressed errors, rerun with: -v
==22752== Use --track-origins=yes to see where uninitialised values come from
==22752== ERROR SUMMARY: 47 errors from 3 contexts (suppressed: 4 from 4)
==22753== LEAK SUMMARY:
==22753== definitely lost: 2,138 bytes in 52 blocks
==22753== indirectly lost: 7,440 bytes in 12 blocks
==22753== possibly lost: 0 bytes in 0 blocks
==22753== still reachable: 720,754 bytes in 2,780 blocks
==22753== suppressed: 0 bytes in 0 blocks
==22753== Rerun with --leak-check=full to see details of leaked memory
==22753==
==22753== For counts of detected and suppressed errors, rerun with: -v
==22753== Use --track-origins=yes to see where uninitialised values come from
==22753== ERROR SUMMARY: 47 errors from 3 contexts (suppressed: 4 from 4)
-------------------------------------------------------
While the primary job terminated normally, 2 processes returned
non-zero exit codes.. Further examination may be required.
-------------------------------------------------------
[singularity:22751] 1 more process has sent help message help-mca-bml-r2.txt / unreachable proc
[singularity:22751] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[singularity:22751] 1 more process has sent help message help-mpi-runtime / mpi_init:startup:pml-add-procs-fail
[singularity:22751] 1 more process has sent help message help-mpi-errors.txt / mpi_errors_are_fatal unknown handle
[singularity:22751] 1 more process has sent help message help-mpi-runtime.txt / ompi mpi abort:cannot guarantee all killed
alex_at_singularity:~/huji/benchmarks/simple$

On 04/01/2012 04:59 PM, Ralph Castain wrote:
> I suspect the problem is here:
>
> /**
> + * MOSIX BTL component.
> + */
> +struct mca_btl_base_component_t {
> + mca_btl_base_component_2_0_0_t super; /**< base BTL component */
> + mca_btl_mosix_module_t mosix_module; /**< local module */
> +};
> +typedef struct mca_btl_base_component_t mca_btl_mosix_component_t;
> +
> +OMPI_MODULE_DECLSPEC extern mca_btl_mosix_component_t mca_btl_mosix_component;
> +
>
>
> You redefined the mca_btl_base_component_t struct. What we usually do is define a new struct:
>
> struct mca_btl_mosix_component_t {
> mca_btl_base_component_t super; /**< base BTL component */
> mca_btl_mosix_module_t mosix_module; /**< local module */
> };
> typedef struct mca_btl_mosix_component_t mca_btl_mosix_component_t;
>
> You can then overload that component with your additional info, leaving the base component to contain the required minimal elements.
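
To make the overloading pattern explicit, here is a minimal sketch
(the extra field and the cast-back helper are illustrative, not part
of the actual patch):

/* Assumes the usual OMPI BTL headers; mca_btl_mosix_module_t comes
 * from the MOSIX patch. */
#include "ompi/mca/btl/btl.h"

struct mca_btl_mosix_component_t {
    mca_btl_base_component_t super;      /* base type: must come first */
    mca_btl_mosix_module_t mosix_module; /* local module */
    int mosix_verbose;                   /* illustrative extra field */
};
typedef struct mca_btl_mosix_component_t mca_btl_mosix_component_t;

/* The framework stores and passes around only the base type... */
static void on_component_found(mca_btl_base_component_t *base)
{
    /* ...and component code casts back to reach its own fields.
     * This is well-defined because super is the first member. */
    mca_btl_mosix_component_t *mosix = (mca_btl_mosix_component_t *) base;
    (void) mosix->mosix_verbose;
}

Because super is the first member, a pointer to the derived struct and
a pointer to the base struct are interchangeable, so the framework
never needs to know about the larger layout.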
>
>
> On Apr 1, 2012, at 1:59 AM, Alex Margolin wrote:
>
>> I traced the problem to the BML component:
>> Index: ompi/mca/bml/r2/bml_r2.c
>> ===================================================================
>> --- ompi/mca/bml/r2/bml_r2.c (revision 26191)
>> +++ ompi/mca/bml/r2/bml_r2.c (working copy)
>> @@ -105,6 +105,8 @@
>> }
>> }
>> if (NULL == btl_names_argv || NULL == btl_names_argv[i]) {
>> + printf("\n\nR1: %p\n\n", btl->btl_component->btl_version.mca_component_name);
>> + printf("\n\nR2: %s\n\n", btl->btl_component->btl_version.mca_component_name);
>> opal_argv_append_nosize(&btl_names_argv,
>> btl->btl_component->btl_version.mca_component_name);
>> }
>>
>> I get (whitespace removed) for a normal run:
>> R1: 0x7f820e3c31d8
>> R2: self
>> R1: 0x7f820e13c598
>> R2: tcp
>> ... and for my module:
>> R1: 0x38
>> - and then the segmentation fault.
>> I guess it has something to do with the way I initialize my component - I'll resume debugging after lunch.
>>
>> Alex
>>
>> _______________________________________________
>> devel mailing list
>> devel_at_[hidden]
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>
> _______________________________________________
> devel mailing list
> devel_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/devel