Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] New MOSIX components draft
From: Ralph Castain (rhc.openmpi_at_[hidden])
Date: 2012-04-02 11:34:58


Looks like you failed to build the shared memory component. The system isn't seeing a comm path between procs on the same node.

Sent from my iPad

On Apr 2, 2012, at 7:47 AM, Alex Margolin <alex.margolin_at_[hidden]> wrote:

> I found the problem(s) - it was more than just the type redefinition, but I fixed that too. I also added some code to btl/base to prevent/detect a similar problem in the future. A newer version of my MOSIX patch (odls + btl + fix) is attached. The BTL still doesn't work, though, and when I try to use valgrind it fails with some Open MPI internal problems, which are most likely unrelated to my patch. I'll keep working on it, but maybe someone who knows this part of the code should look at it...
>
> alex_at_singularity:~/huji/benchmarks/simple$ mpirun -mca btl self,mosix -n 2 valgrind simple
> ==22752== Memcheck, a memory error detector
> ==22752== Copyright (C) 2002-2010, and GNU GPL'd, by Julian Seward et al.
> ==22752== Using Valgrind-3.6.1-Debian and LibVEX; rerun with -h for copyright info
> ==22752== Command: simple
> ==22752==
> ==22753== Memcheck, a memory error detector
> ==22753== Copyright (C) 2002-2010, and GNU GPL'd, by Julian Seward et al.
> ==22753== Using Valgrind-3.6.1-Debian and LibVEX; rerun with -h for copyright info
> ==22753== Command: simple
> ==22753==
> ==22753== Invalid read of size 8
> ==22753== at 0x5ACBE0D: _wordcopy_fwd_dest_aligned (wordcopy.c:205)
> ==22753== by 0x5AC5A6B: __GI_memmove (memmove.c:76)
> ==22753== by 0x5ACD000: argz_insert (argz-insert.c:55)
> ==22753== by 0x520A39A: lt_argz_insert (in /usr/local/lib/libmpi.so.0.0.0)
> ==22753== by 0x520A537: lt_argz_insertinorder (in /usr/local/lib/libmpi.so.0.0.0)
> ==22753== by 0x520A808: lt_argz_insertdir (in /usr/local/lib/libmpi.so.0.0.0)
> ==22753== by 0x520A985: list_files_by_dir (in /usr/local/lib/libmpi.so.0.0.0)
> ==22753== by 0x520AA0A: foreachfile_callback (in /usr/local/lib/libmpi.so.0.0.0)
> ==22753== by 0x52086AA: foreach_dirinpath (in /usr/local/lib/libmpi.so.0.0.0)
> ==22753== by 0x520AADB: lt_dlforeachfile (in /usr/local/lib/libmpi.so.0.0.0)
> ==22753== by 0x52162DD: find_dyn_components (mca_base_component_find.c:319)
> ==22753== by 0x5215EB6: mca_base_component_find (mca_base_component_find.c:186)
> ==22753== Address 0x68d9570 is 32 bytes inside a block of size 38 alloc'd
> ==22753== at 0x4C28F9F: malloc (vg_replace_malloc.c:236)
> ==22753== by 0x52071CA: lt__malloc (in /usr/local/lib/libmpi.so.0.0.0)
> ==22753== by 0x520A73D: lt_argz_insertdir (in /usr/local/lib/libmpi.so.0.0.0)
> ==22753== by 0x520A985: list_files_by_dir (in /usr/local/lib/libmpi.so.0.0.0)
> ==22753== by 0x520AA0A: foreachfile_callback (in /usr/local/lib/libmpi.so.0.0.0)
> ==22753== by 0x52086AA: foreach_dirinpath (in /usr/local/lib/libmpi.so.0.0.0)
> ==22753== by 0x520AADB: lt_dlforeachfile (in /usr/local/lib/libmpi.so.0.0.0)
> ==22753== by 0x52162DD: find_dyn_components (mca_base_component_find.c:319)
> ==22753== by 0x5215EB6: mca_base_component_find (mca_base_component_find.c:186)
> ==22753== by 0x5219AA3: mca_base_components_open (mca_base_components_open.c:129)
> ==22753== by 0x5246183: opal_paffinity_base_open (paffinity_base_open.c:129)
> ==22753== by 0x523C013: opal_init (opal_init.c:361)
> ==22753==
> ==22752== Invalid read of size 8
> ==22752== at 0x5ACBE0D: _wordcopy_fwd_dest_aligned (wordcopy.c:205)
> ==22752== by 0x5AC5A6B: __GI_memmove (memmove.c:76)
> ==22752== by 0x5ACD000: argz_insert (argz-insert.c:55)
> ==22752== by 0x520A39A: lt_argz_insert (in /usr/local/lib/libmpi.so.0.0.0)
> ==22752== by 0x520A537: lt_argz_insertinorder (in /usr/local/lib/libmpi.so.0.0.0)
> ==22752== by 0x520A808: lt_argz_insertdir (in /usr/local/lib/libmpi.so.0.0.0)
> ==22752== by 0x520A985: list_files_by_dir (in /usr/local/lib/libmpi.so.0.0.0)
> ==22752== by 0x520AA0A: foreachfile_callback (in /usr/local/lib/libmpi.so.0.0.0)
> ==22752== by 0x52086AA: foreach_dirinpath (in /usr/local/lib/libmpi.so.0.0.0)
> ==22752== by 0x520AADB: lt_dlforeachfile (in /usr/local/lib/libmpi.so.0.0.0)
> ==22752== by 0x52162DD: find_dyn_components (mca_base_component_find.c:319)
> ==22752== by 0x5215EB6: mca_base_component_find (mca_base_component_find.c:186)
> ==22752== Address 0x68d9570 is 32 bytes inside a block of size 38 alloc'd
> ==22752== at 0x4C28F9F: malloc (vg_replace_malloc.c:236)
> ==22752== by 0x52071CA: lt__malloc (in /usr/local/lib/libmpi.so.0.0.0)
> ==22752== by 0x520A73D: lt_argz_insertdir (in /usr/local/lib/libmpi.so.0.0.0)
> ==22752== by 0x520A985: list_files_by_dir (in /usr/local/lib/libmpi.so.0.0.0)
> ==22752== by 0x520AA0A: foreachfile_callback (in /usr/local/lib/libmpi.so.0.0.0)
> ==22752== by 0x52086AA: foreach_dirinpath (in /usr/local/lib/libmpi.so.0.0.0)
> ==22752== by 0x520AADB: lt_dlforeachfile (in /usr/local/lib/libmpi.so.0.0.0)
> ==22752== by 0x52162DD: find_dyn_components (mca_base_component_find.c:319)
> ==22752== by 0x5215EB6: mca_base_component_find (mca_base_component_find.c:186)
> ==22752== by 0x5219AA3: mca_base_components_open (mca_base_components_open.c:129)
> ==22752== by 0x5246183: opal_paffinity_base_open (paffinity_base_open.c:129)
> ==22752== by 0x523C013: opal_init (opal_init.c:361)
> ==22752==
> [singularity:22753] mca: base: component_find: unable to open /usr/local/lib/openmpi/mca_mpool_sm: libmca_common_sm.so.0: cannot open shared object file: No such file or directory (ignored)
> [singularity:22752] mca: base: component_find: unable to open /usr/local/lib/openmpi/mca_mpool_sm: libmca_common_sm.so.0: cannot open shared object file: No such file or directory (ignored)
> [singularity:22753] mca: base: component_find: unable to open /usr/local/lib/openmpi/mca_coll_sm: libmca_common_sm.so.0: cannot open shared object file: No such file or directory (ignored)
> [singularity:22752] mca: base: component_find: unable to open /usr/local/lib/openmpi/mca_coll_sm: libmca_common_sm.so.0: cannot open shared object file: No such file or directory (ignored)
> ==22753== Warning: invalid file descriptor 207618048 in syscall open()
> ==22752== Warning: invalid file descriptor 207618048 in syscall open()
> --------------------------------------------------------------------------
> At least one pair of MPI processes are unable to reach each other for
> MPI communications. This means that no Open MPI device has indicated
> that it can be used to communicate between these processes. This is
> an error; Open MPI requires that all MPI processes be able to reach
> each other. This error can sometimes be the result of forgetting to
> specify the "self" BTL.
>
> Process 1 ([[59806,1],0]) is on host: singularity
> Process 2 ([[59806,1],1]) is on host: singularity
> BTLs attempted: self
>
> Your MPI job is now going to abort; sorry.
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> MPI_INIT has failed because at least one MPI process is unreachable
> from another. This *usually* means that an underlying communication
> plugin -- such as a BTL or an MTL -- has either not loaded or not
> allowed itself to be used. Your MPI job will now abort.
>
> You may wish to try to narrow down the problem;
>
> * Check the output of ompi_info to see which BTL/MTL plugins are
> available.
> * Run your application with MPI_THREAD_SINGLE.
> * Set the MCA parameter btl_base_verbose to 100 (or mtl_base_verbose,
> if using MTL-based communications) to see exactly which
> communication plugins were considered and/or discarded.
> --------------------------------------------------------------------------
> ==22752== Use of uninitialised value of size 8
> ==22752== at 0x5A8631B: _itoa_word (_itoa.c:195)
> ==22752== by 0x5A8AE43: vfprintf (vfprintf.c:1622)
> ==22752== by 0x5AADB83: vasprintf (vasprintf.c:64)
> ==22752== by 0x524C150: opal_show_help_vstring (show_help.c:309)
> ==22752== by 0x51786F1: orte_show_help (show_help.c:648)
> ==22752== by 0x50B4693: backend_fatal_aggregate (errhandler_predefined.c:205)
> ==22752== by 0x50B4A9B: backend_fatal (errhandler_predefined.c:329)
> ==22752== by 0x50B3FAF: ompi_mpi_errors_are_fatal_comm_handler (errhandler_predefined.c:68)
> ==22752== by 0x50B38FA: ompi_errhandler_invoke (errhandler_invoke.c:41)
> ==22752== by 0x50FD446: PMPI_Init (pinit.c:95)
> ==22752== by 0x40A128: MPI::Init(int&, char**&) (in /home/alex/huji/benchmarks/simple/simple)
> ==22752== by 0x409118: main (in /home/alex/huji/benchmarks/simple/simple)
> ==22752==
> ==22752== Conditional jump or move depends on uninitialised value(s)
> ==22752== at 0x5A86325: _itoa_word (_itoa.c:195)
> ==22752== by 0x5A8AE43: vfprintf (vfprintf.c:1622)
> ==22752== by 0x5AADB83: vasprintf (vasprintf.c:64)
> ==22752== by 0x524C150: opal_show_help_vstring (show_help.c:309)
> ==22752== by 0x51786F1: orte_show_help (show_help.c:648)
> ==22752== by 0x50B4693: backend_fatal_aggregate (errhandler_predefined.c:205)
> ==22752== by 0x50B4A9B: backend_fatal (errhandler_predefined.c:329)
> ==22752== by 0x50B3FAF: ompi_mpi_errors_are_fatal_comm_handler (errhandler_predefined.c:68)
> ==22752== by 0x50B38FA: ompi_errhandler_invoke (errhandler_invoke.c:41)
> ==22752== by 0x50FD446: PMPI_Init (pinit.c:95)
> ==22752== by 0x40A128: MPI::Init(int&, char**&) (in /home/alex/huji/benchmarks/simple/simple)
> ==22752== by 0x409118: main (in /home/alex/huji/benchmarks/simple/simple)
> ==22752==
> [singularity:22752] *** An error occurred in MPI_Init
> [singularity:22752] *** reported by process [3919446017,0]
> [singularity:22752] *** on a NULL communicator
> [singularity:22752] *** Unknown error
> [singularity:22752] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> [singularity:22752] *** and potentially your MPI job)
> --------------------------------------------------------------------------
> An MPI process is aborting at a time when it cannot guarantee that all
> of its peer processes in the job will be killed properly. You should
> double check that everything has shut down cleanly.
>
> Reason: Before MPI_INIT completed
> Local host: singularity
> PID: 22752
> --------------------------------------------------------------------------
> ==22753== Use of uninitialised value of size 8
> ==22753== at 0x5A8631B: _itoa_word (_itoa.c:195)
> ==22753== by 0x5A8AE43: vfprintf (vfprintf.c:1622)
> ==22753== by 0x5AADB83: vasprintf (vasprintf.c:64)
> ==22753== by 0x524C150: opal_show_help_vstring (show_help.c:309)
> ==22753== by 0x51786F1: orte_show_help (show_help.c:648)
> ==22753== by 0x50B4693: backend_fatal_aggregate (errhandler_predefined.c:205)
> ==22753== by 0x50B4A9B: backend_fatal (errhandler_predefined.c:329)
> ==22753== by 0x50B3FAF: ompi_mpi_errors_are_fatal_comm_handler (errhandler_predefined.c:68)
> ==22753== by 0x50B38FA: ompi_errhandler_invoke (errhandler_invoke.c:41)
> ==22753== by 0x50FD446: PMPI_Init (pinit.c:95)
> ==22753== by 0x40A128: MPI::Init(int&, char**&) (in /home/alex/huji/benchmarks/simple/simple)
> ==22753== by 0x409118: main (in /home/alex/huji/benchmarks/simple/simple)
> ==22753==
> ==22753== Conditional jump or move depends on uninitialised value(s)
> ==22753== at 0x5A86325: _itoa_word (_itoa.c:195)
> ==22753== by 0x5A8AE43: vfprintf (vfprintf.c:1622)
> ==22753== by 0x5AADB83: vasprintf (vasprintf.c:64)
> ==22753== by 0x524C150: opal_show_help_vstring (show_help.c:309)
> ==22753== by 0x51786F1: orte_show_help (show_help.c:648)
> ==22753== by 0x50B4693: backend_fatal_aggregate (errhandler_predefined.c:205)
> ==22753== by 0x50B4A9B: backend_fatal (errhandler_predefined.c:329)
> ==22753== by 0x50B3FAF: ompi_mpi_errors_are_fatal_comm_handler (errhandler_predefined.c:68)
> ==22753== by 0x50B38FA: ompi_errhandler_invoke (errhandler_invoke.c:41)
> ==22753== by 0x50FD446: PMPI_Init (pinit.c:95)
> ==22753== by 0x40A128: MPI::Init(int&, char**&) (in /home/alex/huji/benchmarks/simple/simple)
> ==22753== by 0x409118: main (in /home/alex/huji/benchmarks/simple/simple)
> ==22753==
> ==22752==
> ==22752== HEAP SUMMARY:
> ==22752== in use at exit: 730,332 bytes in 2,844 blocks
> ==22752== total heap usage: 4,959 allocs, 2,115 frees, 11,353,797 bytes allocated
> ==22752==
> ==22753==
> ==22753== HEAP SUMMARY:
> ==22753== in use at exit: 730,332 bytes in 2,844 blocks
> ==22753== total heap usage: 4,970 allocs, 2,126 frees, 11,354,058 bytes allocated
> ==22753==
> ==22752== LEAK SUMMARY:
> ==22752== definitely lost: 2,138 bytes in 52 blocks
> ==22752== indirectly lost: 7,440 bytes in 12 blocks
> ==22752== possibly lost: 0 bytes in 0 blocks
> ==22752== still reachable: 720,754 bytes in 2,780 blocks
> ==22752== suppressed: 0 bytes in 0 blocks
> ==22752== Rerun with --leak-check=full to see details of leaked memory
> ==22752==
> ==22752== For counts of detected and suppressed errors, rerun with: -v
> ==22752== Use --track-origins=yes to see where uninitialised values come from
> ==22752== ERROR SUMMARY: 47 errors from 3 contexts (suppressed: 4 from 4)
> ==22753== LEAK SUMMARY:
> ==22753== definitely lost: 2,138 bytes in 52 blocks
> ==22753== indirectly lost: 7,440 bytes in 12 blocks
> ==22753== possibly lost: 0 bytes in 0 blocks
> ==22753== still reachable: 720,754 bytes in 2,780 blocks
> ==22753== suppressed: 0 bytes in 0 blocks
> ==22753== Rerun with --leak-check=full to see details of leaked memory
> ==22753==
> ==22753== For counts of detected and suppressed errors, rerun with: -v
> ==22753== Use --track-origins=yes to see where uninitialised values come from
> ==22753== ERROR SUMMARY: 47 errors from 3 contexts (suppressed: 4 from 4)
> -------------------------------------------------------
> While the primary job terminated normally, 2 processes returned
> non-zero exit codes.. Further examination may be required.
> -------------------------------------------------------
> [singularity:22751] 1 more process has sent help message help-mca-bml-r2.txt / unreachable proc
> [singularity:22751] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
> [singularity:22751] 1 more process has sent help message help-mpi-runtime / mpi_init:startup:pml-add-procs-fail
> [singularity:22751] 1 more process has sent help message help-mpi-errors.txt / mpi_errors_are_fatal unknown handle
> [singularity:22751] 1 more process has sent help message help-mpi-runtime.txt / ompi mpi abort:cannot guarantee all killed
> alex_at_singularity:~/huji/benchmarks/simple$
>
>
> On 04/01/2012 04:59 PM, Ralph Castain wrote:
>> I suspect the problem is here:
>>
>> /**
>> + * MOSIX BTL component.
>> + */
>> +struct mca_btl_base_component_t {
>> + mca_btl_base_component_2_0_0_t super; /**< base BTL component */
>> + mca_btl_mosix_module_t mosix_module; /**< local module */
>> +};
>> +typedef struct mca_btl_base_component_t mca_btl_mosix_component_t;
>> +
>> +OMPI_MODULE_DECLSPEC extern mca_btl_mosix_component_t mca_btl_mosix_component;
>> +
>>
>>
>> You redefined the mca_btl_base_component_t struct. What we usually do is define a new struct:
>>
>> struct mca_btl_mosix_component_t {
>> mca_btl_base_component_t super; /**< base BTL component */
>> mca_btl_mosix_module_t mosix_module; /**< local module */
>> };
>> typedef struct mca_btl_mosix_component_t mca_btl_mosix_component_t;
>>
>> You can then overload that component with your additional info, leaving the base component to contain the required minimal elements.
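[Editor's sketch] For readers unfamiliar with the idiom, here is a minimal, self-contained illustration of the pattern described above, using placeholder stand-in types rather than the real Open MPI definitions (none of the names below are the actual mca_btl_* symbols). Because the base component is the first member of the derived struct, a pointer to the derived object can be handed to generic framework code that only knows the base type; in C, a pointer to a struct, suitably converted, points to its first member.

/* Sketch only: base_component_t and mosix_module_t are placeholders,
 * not the real Open MPI types. */
#include <stdio.h>

typedef struct { const char *mca_component_name; } base_component_t;
typedef struct { int fd; } mosix_module_t;

/* Derived component: the base comes first, extras are appended after it. */
struct mosix_component_t {
    base_component_t super;
    mosix_module_t   mosix_module;
};

static struct mosix_component_t mosix_component = {
    .super        = { .mca_component_name = "mosix" },
    .mosix_module = { .fd = -1 },
};

int main(void)
{
    /* Generic framework code sees only the leading base part, so it
     * finds the component name at the offset it expects. */
    base_component_t *base = (base_component_t *) &mosix_component;
    printf("component name: %s\n", base->mca_component_name);
    return 0;
}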
>>
>>
>> On Apr 1, 2012, at 1:59 AM, Alex Margolin wrote:
>>
>>> I traced the problem to the BML component:
>>> Index: ompi/mca/bml/r2/bml_r2.c
>>> ===================================================================
>>> --- ompi/mca/bml/r2/bml_r2.c (revision 26191)
>>> +++ ompi/mca/bml/r2/bml_r2.c (working copy)
>>> @@ -105,6 +105,8 @@
>>> }
>>> }
>>> if (NULL == btl_names_argv || NULL == btl_names_argv[i]) {
>>> + printf("\n\nR1: %p\n\n", btl->btl_component->btl_version.mca_component_name);
>>> + printf("\n\nR2: %s\n\n", btl->btl_component->btl_version.mca_component_name);
>>> opal_argv_append_nosize(&btl_names_argv,
>>> btl->btl_component->btl_version.mca_component_name);
>>> }
>>>
>>> I get (whitespace removed) for a normal run:
>>> R1: 0x7f820e3c31d8
>>> R2: self
>>> R1: 0x7f820e13c598
>>> R2: tcp
>>> ... and for my module:
>>> R1: 0x38
>>> - and then the segmentation fault.
>>> I guess it has something to do with the way I initialize my component - I'll resume debugging after lunch.
>>>
>>> Alex
>>>
>>
>
> <mosix_components.diff>