Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] Bus error with openmpi-1.7.4rc1 on Solaris
From: Jeff Squyres (jsquyres) (jsquyres_at_[hidden])
Date: 2013-12-19 11:15:33


Siegmar --

So it looks like the net problem is fixed; good. I'll commit and CMR that.

For the DDT test, can you give us access to this machine? It might speed up debugging a lot. (I'll let Nathan reply about the var problem.)

If not, can you provide the following information about the DDT test:

1. It gets a SIGBUS at some point; can you send the full backtrace?
2. It complains about a misaligned read of a variable and shows its address. Can you print the values of all the parameters of the function so that we can see *which* one is being used for the misaligned read? (The printf uses 4 different variables, and we don't know which one is causing the misaligned read.)
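
To make item 2 concrete, here is a minimal sketch of the kind of check that would narrow it down: print each pointer argument together with whether it satisfies the alignment its type needs. The helper name `report_alignment` is hypothetical and not part of Open MPI; it only illustrates the idea.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper, not part of Open MPI.
 * Prints one line per pointer and returns 1 if `ptr` is a multiple of
 * `required` bytes, 0 otherwise. */
static int report_alignment(const char *name, const void *ptr, size_t required)
{
    uintptr_t addr = (uintptr_t) ptr;
    int ok;

    assert(required > 0);
    ok = (0 == addr % required);
    printf("%-12s %p needs %lu-byte alignment: %s\n",
           name, (void *) ptr, (unsigned long) required,
           ok ? "ok" : "MISALIGNED");
    return ok;
}
```

Calling this once per argument just before the failing statement (for example with `sizeof(void *)` for pointer-sized reads) should show which of the 4 values sits at an address that SPARC will refuse to load from.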

On Dec 19, 2013, at 8:52 AM, Siegmar Gross <Siegmar.Gross_at_[hidden]> wrote:

> Hi,
>
> First of all, thank you very much for your help.
>
> 1st patch:
>
>> Can you apply the following patch to a trunk tarball and see if it works
>> for you?
>
> 2nd patch:
>
>> Found the problem. Was accessing a boolean variable using intval. That
>> is a bug that has gone unnoticed on all platforms but thankfully Solaris
>> caught it.
>>
>> Please try the attached patch.
>
>
> I applied both patches manually to openmpi-1.9a1r29972, because
> my patch program couldn't use the patches. Unfortunately I still
> get a Bus Error. Hopefully I didn't make a mistake applying your
> patches. Therefore I show you a "diff" for my files. By the way,
> I tried to apply your patches with "patch -b -i <your file>".
> Is it necessary to use a different command?
>
>
> tyr openmpi-1.9a1r29972 161 ls -l opal/mca/base/mca_base_var.c*
> -rw-r--r-- 1 fd1026 inf 60418 Dec 19 08:35 opal/mca/base/mca_base_var.c
> -rw-r--r-- 1 fd1026 inf 60236 Dec 19 03:05 opal/mca/base/mca_base_var.c.orig
> tyr openmpi-1.9a1r29972 162 diff opal/mca/base/mca_base_var.c*
> 1685,1689c1685
> < if (MCA_BASE_VAR_TYPE_BOOL == var->mbv_type) {
> < ret = var->mbv_enumerator->string_from_value(var->mbv_enumerator, value->boolval, &tmp);
> < } else {
> < ret = var->mbv_enumerator->string_from_value(var->mbv_enumerator, value->intval, &tmp);
> < }
> ---
>> ret = var->mbv_enumerator->string_from_value(var->mbv_enumerator, value->intval, &tmp);
> tyr openmpi-1.9a1r29972 163
>
>
>
> tyr openmpi-1.9a1r29972 165 ls -l opal/util/net.c*
> -rw-r--r-- 1 fd1026 inf 12922 Dec 19 07:55 opal/util/net.c
> -rw-r--r-- 1 fd1026 inf 12675 Dec 19 03:05 opal/util/net.c.orig
> tyr openmpi-1.9a1r29972 166 diff opal/util/net.c*
> 267,271c267,268
> < struct sockaddr_in inaddr1, inaddr2;
> < /* Use temporary variables and memcpy's so that we don't
> < run into bus errors on Solaris/SPARC */
> < memcpy(&inaddr1, addr1, sizeof(inaddr1));
> < memcpy(&inaddr2, addr2, sizeof(inaddr2));
> ---
>> const struct sockaddr_in *inaddr1 = (struct sockaddr_in*) addr1;
>> const struct sockaddr_in *inaddr2 = (struct sockaddr_in*) addr2;
> 274,275c271,272
> < if((inaddr1.sin_addr.s_addr & netmask) ==
> < (inaddr2.sin_addr.s_addr & netmask)) {
> ---
>> if((inaddr1->sin_addr.s_addr & netmask) ==
>> (inaddr2->sin_addr.s_addr & netmask)) {
> 284,290c281,284
> < struct sockaddr_in6 inaddr1, inaddr2;
> < /* Use temporary variables and memcpy's so that we don't
> < run into bus errors on Solaris/SPARC */
> < memcpy(&inaddr1, addr1, sizeof(inaddr1));
> < memcpy(&inaddr2, addr2, sizeof(inaddr2));
> < struct in6_addr *a6_1 = (struct in6_addr*) &inaddr1.sin6_addr;
> < struct in6_addr *a6_2 = (struct in6_addr*) &inaddr2.sin6_addr;
> ---
>> const struct sockaddr_in6 *inaddr1 = (struct sockaddr_in6*) addr1;
>> const struct sockaddr_in6 *inaddr2 = (struct sockaddr_in6*) addr2;
>> struct in6_addr *a6_1 = (struct in6_addr*) &inaddr1->sin6_addr;
>> struct in6_addr *a6_2 = (struct in6_addr*) &inaddr2->sin6_addr;
> tyr openmpi-1.9a1r29972 167
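
The net.c change above replaces pointer casts with copies into properly aligned temporaries. The same idea in isolation, as a sketch only: the helpers `load_u32_unaligned` and `same_network_u32` are hypothetical and work on raw 32-bit values rather than on `struct sockaddr_in`, so this is not the patched Open MPI code.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: load a 32-bit value from an address that may not
 * be 4-byte aligned. memcpy copies byte by byte as far as the language is
 * concerned, so no misaligned word load is required. */
static uint32_t load_u32_unaligned(const void *src)
{
    uint32_t tmp;

    assert(NULL != src);
    memcpy(&tmp, src, sizeof(tmp));
    return tmp;
}

/* Compare two 32-bit addresses under a netmask, reading both through
 * aligned temporaries instead of casting the incoming pointers. */
static int same_network_u32(const void *addr1, const void *addr2,
                            uint32_t netmask)
{
    return (load_u32_unaligned(addr1) & netmask) ==
           (load_u32_unaligned(addr2) & netmask);
}
```

Casting the pointer and dereferencing it directly, as the `.orig` lines do, is only safe when the caller guarantees the alignment; the copy costs a few bytes of stack but removes that assumption.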
>
>
>
> Now my debug information.
>
> tyr fd1026 52 cd /usr/local/openmpi-1.9_64_cc/bin/
> tyr bin 53 /opt/solstudio12.3/bin/sparcv9/dbx ompi_info
> For information about new features see `help changes'
> To remove this message, put `dbxenv suppress_startup_message 7.9' in your .dbxrc
> Reading ompi_info
> Reading ld.so.1
> Reading libmpi.so.0.0.0
> Reading libopen-rte.so.0.0.0
> Reading libopen-pal.so.0.0.0
> Reading libsendfile.so.1
> Reading libpicl.so.1
> Reading libkstat.so.1
> Reading liblgrp.so.1
> Reading libsocket.so.1
> Reading libnsl.so.1
> Reading librt.so.1
> Reading libm.so.2
> Reading libthread.so.1
> Reading libc.so.1
> Reading libdoor.so.1
> Reading libaio.so.1
> Reading libmd.so.1
> (dbx) run -a
> Running: ompi_info -a
> (process id 10998)
> Reading libc_psr.so.1
> ...
> MCA compress: parameter "compress_base_verbose" (current value:
> "-1", data source: default, level: 8 dev/detail,
> type: int)
> Verbosity level for the compress framework (0 = no
> verbosity)
> t_at_1 (l_at_1) signal BUS (invalid address alignment) in var_value_string
> at line 1680 in file "mca_base_var.c"
> 1680 ret = asprintf (value_string, var_type_formats[var->mbv_type], value[0]);
> (dbx)
> (dbx)
> (dbx) check -all
> dbx: warning: check -all will be turned on in the next run of the process
> access checking - OFF
> memuse checking - OFF
> (dbx) run -a
> Running: ompi_info -a
> (process id 11000)
> Reading rtcapihook.so
> Reading libdl.so.1
> Reading rtcaudit.so
> Reading libmapmalloc.so.1
> Reading rtcboot.so
> Reading librtc.so
> Reading libmd_psr.so.1
> RTC: Enabling Error Checking...
> RTC: Using UltraSparc trap mechanism
> RTC: See `help rtc showmap' and `help rtc limitations' for details.
> RTC: Running program...
> Read from uninitialized (rui) on thread 1:
> Attempting to read 4 bytes at address 0xffffffff7fffd5f8
> which is 184 bytes above the current stack pointer
> Variable is 'index'
> t_at_1 (l_at_1) stopped in var_find at line 802 in file "mca_base_var.c"
> 802 return (OPAL_SUCCESS != ret) ? ret : index;
> (dbx)
>
>
> In my opinion it is the same error as before.
>
>
>
> I still get a Bus Error with "make check".
>
> tyr bin 54 cd /export2/src/openmpi-1.9/openmpi-1.9a1r29972-SunOS.sparc.64_cc/test/datatype/.libs/
> tyr .libs 55 /opt/solstudio12.3/bin/sparcv9/dbx ddt_raw
> For information about new features see `help changes'
> To remove this message, put `dbxenv suppress_startup_message 7.9' in your .dbxrc
> Reading ddt_raw
> Reading ld.so.1
> Reading libmpi.so.0.0.0
> Reading libopen-rte.so.0.0.0
> Reading libopen-pal.so.0.0.0
> Reading libsendfile.so.1
> Reading libpicl.so.1
> Reading libkstat.so.1
> Reading liblgrp.so.1
> Reading libsocket.so.1
> Reading libnsl.so.1
> Reading librt.so.1
> Reading libm.so.2
> Reading libthread.so.1
> Reading libc.so.1
> Reading libdoor.so.1
> Reading libaio.so.1
> Reading libmd.so.1
> (dbx) run
> Running: ddt_raw
> (process id 11018)
> Reading libc_psr.so.1
>
>
> #
> * TEST INVERSED VECTOR
> #
>
> t_at_1 (l_at_1) signal BUS (invalid address alignment) in opal_convertor_raw
> at line 71 in file "opal_convertor_raw.c"
> 71 DO_DEBUG( opal_output( 0, "opal_convertor_raw( %p, {%p, %u}, %lu )\n", (void*)pConvertor,
> (dbx)
>
>
> Once more I think it is the same error. I have the same problem with
> my small program.
>
>
>
>
> tyr small_prog 62 mpicc init_finalize.c
> tyr small_prog 63 /opt/solstudio12.3/bin/sparcv9/dbx \
> /usr/local/openmpi-1.9_64_cc/bin/mpiexec
> For information about new features see `help changes'
> To remove this message, put `dbxenv suppress_startup_message 7.9' in your .dbxrc
> Reading mpiexec
> Reading ld.so.1
> Reading libopen-rte.so.0.0.0
> Reading libopen-pal.so.0.0.0
> Reading libsendfile.so.1
> Reading libpicl.so.1
> Reading libkstat.so.1
> Reading liblgrp.so.1
> Reading libsocket.so.1
> Reading libnsl.so.1
> Reading librt.so.1
> Reading libm.so.2
> Reading libthread.so.1
> Reading libc.so.1
> Reading libdoor.so.1
> Reading libaio.so.1
> Reading libmd.so.1
> (dbx)
> (dbx) run -np 1 a.out
> Running: mpiexec -np 1 a.out
> (process id 11050)
> Reading libc_psr.so.1
> Reading mca_shmem_mmap.so
> Reading libmp.so.2
> Reading libscf.so.1
> Reading libuutil.so.1
> Reading libgen.so.1
> Reading mca_shmem_posix.so
> Reading mca_shmem_sysv.so
> Reading mca_ess_env.so
> Reading mca_ess_hnp.so
> Reading mca_ess_singleton.so
> Reading mca_ess_tool.so
> Reading mca_pstat_test.so
> Reading mca_state_app.so
> Reading mca_state_hnp.so
> Reading mca_state_novm.so
> Reading mca_state_orted.so
> Reading mca_state_staged_hnp.so
> Reading mca_state_staged_orted.so
> Reading mca_state_tool.so
> Reading mca_errmgr_default_app.so
> Reading mca_errmgr_default_hnp.so
> Reading mca_errmgr_default_orted.so
> Reading mca_errmgr_default_tool.so
> Reading mca_plm_rsh.so
> Reading mca_oob_tcp.so
> Reading mca_rml_oob.so
> Reading mca_routed_binomial.so
> Reading mca_routed_debruijn.so
> Reading mca_routed_direct.so
> Reading mca_routed_radix.so
> Reading mca_db_hash.so
> Reading mca_db_print.so
> Reading mca_grpcomm_bad.so
> Reading mca_ras_simulator.so
> Reading mca_rmaps_lama.so
> Reading mca_rmaps_mindist.so
> Reading mca_rmaps_ppr.so
> Reading mca_rmaps_rank_file.so
> Reading mca_rmaps_resilient.so
> Reading mca_rmaps_round_robin.so
> Reading mca_rmaps_seq.so
> Reading mca_rmaps_staged.so
> Reading mca_odls_default.so
> Reading mca_iof_hnp.so
> Reading mca_iof_mr_hnp.so
> Reading mca_iof_mr_orted.so
> Reading mca_iof_orted.so
> Reading mca_iof_tool.so
> Reading mca_filem_raw.so
> Reading mca_dfs_app.so
> Reading mca_dfs_orted.so
> Reading mca_dfs_test.so
>
> Now the program hangs.
>
> ^Cdbx: warning: Interrupt ignored but forwarded to child.
> t_at_1 (l_at_1) signal INT (Interrupt) in __pollsys at 0xffffffff7d5dc740
> 0xffffffff7d5dc740: __pollsys+0x0004: ta %icc,0x0000000000000040
> Current function is orterun
> 1049 opal_event_loop(orte_event_base, OPAL_EVLOOP_ONCE);
> (dbx)
> (dbx)
> (dbx)
> (dbx) check -all
> dbx: warning: check -all will be turned on in the next run of the process
> access checking - OFF
> memuse checking - OFF
> (dbx) run -np 1 a.out
> Running: mpiexec -np 1 a.out
> (process id 11054)
> Reading rtcapihook.so
> Reading libdl.so.1
> Reading rtcaudit.so
> Reading libmapmalloc.so.1
> Reading rtcboot.so
> Reading librtc.so
> Reading libmd_psr.so.1
> RTC: Enabling Error Checking...
> RTC: Using UltraSparc trap mechanism
> RTC: See `help rtc showmap' and `help rtc limitations' for details.
> RTC: Running program...
> Read from uninitialized (rui) on thread 1:
> Attempting to read 4 bytes at address 0xffffffff7fffd438
> which is 184 bytes above the current stack pointer
> Variable is 'index'
> t_at_1 (l_at_1) stopped in var_find at line 802 in file "mca_base_var.c"
> 802 return (OPAL_SUCCESS != ret) ? ret : index;
> (dbx)
>
>
>
> I'm sorry that you have so much trouble with me and Solaris. On the
> other hand I still hope that you can solve the problem(s). Once more
> thank you very much for your help in advance.
>
>
> Kind regards
>
> Siegmar
>

-- 
Jeff Squyres
jsquyres_at_[hidden]
For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/