Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] openmpi configuration error?
From: Ben Lash (bl10_at_[hidden])
Date: 2014-05-21 17:45:22


I know why it quit - M3EXIT was called - but thanks for looking.

On Wed, May 21, 2014 at 4:02 PM, Gus Correa <gus_at_[hidden]> wrote:

> Hi Ben
>
> One of the ranks (52) called MPI_Abort.
> This may be a bug in the code, or a problem with the setup
> (e.g. a missing or incorrect input file).
> For instance, the CCTM Wiki says:
> "AERO6 expects emissions inputs for 13 new PM species. CCTM will crash if
> any emitted PM species is not included in the emissions input file"
> I am not familiar with CCTM, so these are just guesses.
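>
> (For what it's worth, an abort with errorcode 0 usually comes from the
> application's own error/exit handler doing something like
>
>     call MPI_ABORT(MPI_COMM_WORLD, 0, ierr)
>
> rather than from inside MPI itself.)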
>
> It doesn't look like an MPI problem, though.
>
> You may want to check any other logs that the CCTM code may
> produce, for any clue on where it fails.
> Otherwise, you could compile with -g -traceback (and remove any
> optimization options in FFLAGS, FCFLAGS, CFLAGS, etc.).
> There may also be a -DDEBUG or similar flag that can be turned on
> in the CPPFLAGS, which in many models produces a more verbose log.
> This *may* tell you where it fails (source file, subroutine and line),
> and may help understand why it fails.
> If it dumps a core file, you can trace the failure point with
> a debugger.
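>
> For example, something along these lines in the build configuration
> (the variable names are generic; adapt them to CCTM's actual Makefile/config):
>
>     FFLAGS   = -g -traceback -O0   # Intel ifort: keep symbols, print a traceback, no optimization
>     CFLAGS   = -g -traceback -O0
>     CPPFLAGS = -DDEBUG             # only if the model actually has such a debug switch
>
> and then rebuild from scratch so every object file carries the debug info.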
>
>
> I hope this helps,
> Gus
>
> On 05/21/2014 03:20 PM, Ben Lash wrote:
>
>> I used a different build of netcdf 4.1.3, and the code seems to run now.
>> I have a totally different, non-MPI-related error in part of it, but
>> there's no way for the list to help with that; I mostly just wanted to report,
>> for the record, that this particular problem seems to be solved. It doesn't
>> seem to fail quite as gracefully anymore, but I'm still getting enough
>> of the error messages to know what's going on.
>>
>> MPI_ABORT was invoked on rank 52 in communicator MPI_COMM_WORLD
>> with errorcode 0.
>>
>> NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
>> You may or may not see output from other processes, depending on
>> exactly when Open MPI kills them.
>> --------------------------------------------------------------------------
>> [cn-099.davinci.rice.edu:26185] [[63355,0],4]-[[63355,1],52] mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)
>> [cn-099.davinci.rice.edu:26185] [[63355,0],4]-[[63355,1],54] mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)
>> [cn-099.davinci.rice.edu:26185] [[63355,0],4]-[[63355,1],55] mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)
>> [cn-158.davinci.rice.edu:12459] [[63355,0],1]-[[63355,1],15] mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)
>> [cn-158.davinci.rice.edu:12459] [[63355,0],1]-[[63355,1],17] mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)
>> [cn-099.davinci.rice.edu:26185] [[63355,0],4]-[[63355,1],56] mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)
>> [cn-099.davinci.rice.edu:26185] [[63355,0],4]-[[63355,1],53] mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)
>> [cn-099.davinci.rice.edu:26185] [[63355,0],4]-[[63355,1],51] mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)
>> [cn-099.davinci.rice.edu:26185] [[63355,0],4]-[[63355,1],57] mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)
>> forrtl: error (78): process killed (SIGTERM)
>> Image PC Routine Line Source
>>
>> ....
>>
>> [cn-158.davinci.rice.edu:12459] [[63355,0],1]-[[63355,1],16] mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)
>> --------------------------------------------------------------------------
>> mpiexec has exited due to process rank 49 with PID 26187 on
>> node cn-099 exiting improperly. There are two reasons this could occur:
>>
>> 1. this process did not call "init" before exiting, but others in
>> the job did. This can cause a job to hang indefinitely while it waits
>> for all processes to call "init". By rule, if one process calls "init",
>> then ALL processes must call "init" prior to termination.
>>
>> 2. this process called "init", but exited without calling "finalize".
>> By rule, all processes that call "init" MUST call "finalize" prior to
>> exiting or it will be considered an "abnormal termination"
>>
>> This may have caused other processes in the application to be
>> terminated by signals sent by mpiexec (as reported here).
>> --------------------------------------------------------------------------
>> forrtl: error (78): process killed (SIGTERM)
>> Image PC Routine Line Source
>> CCTM_V5g_Linux2_x  00000000007FEA29  Unknown  Unknown  Unknown
>> CCTM_V5g_Linux2_x  00000000007FD3A0  Unknown  Unknown  Unknown
>> CCTM_V5g_Linux2_x  00000000007BA9A2  Unknown  Unknown  Unknown
>> CCTM_V5g_Linux2_x  0000000000759288  Unknown  Unknown  Unknown
>>
>> ...
>>
>>
>>
>> On Wed, May 21, 2014 at 2:08 PM, Gus Correa <gus_at_[hidden]> wrote:
>>
>> Hi Ben
>>
>> My guess is that your sys admins may have built NetCDF
>> with parallel support (pnetcdf), and pnetcdf itself with OpenMPI,
>> which could explain the dependency.
>> Ideally, they should have built it again with the latest default
>> OpenMPI (1.6.5?)
>>
>> Check if there is a NetCDF module that either doesn't have any
>> dependence on MPI, or depends on the current Open MPI that
>> you are using (1.6.5 I think).
>> A 'module show netcdf/bla/bla'
>> on the available netcdf modules will tell.
>>
>> If the application code is as old as you said, it probably doesn't use
>> any pnetcdf. In addition, it should work even with NetCDF 3.X.Y,
>> which probably doesn't have any pnetcdf built in.
>> Newer netcdf (4.Z.W > 4.1.3) should also work, and in this case
>> pick one that requires the default OpenMPI, if available.
>>
>> Just out of curiosity, besides netcdf/4.1.3, did you load
>> openmpi/1.6.5?
>> Somehow the openmpi/1.6.5 should have been marked
>> to conflict with 1.4.4.
>> Is it?
>> Anyway, you may want to do a 'which mpiexec' to see which one is
>> taking precedence in your environment (1.6.5 or 1.4.4)
>> Probably 1.6.5.
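>>
>> For instance, a couple of quick, generic sanity checks:
>>
>>     which mpiexec        # the full path shows which install wins in your PATH
>>     mpiexec --version    # prints the Open MPI version actually being run
>>     mpif90 --showme      # shows the underlying compiler and flags the wrapper will use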
>>
>> Does the code work now, or does it continue to fail?
>>
>>
>> I hope this helps,
>> Gus Correa
>>
>>
>>
>> On 05/21/2014 02:36 PM, Ben Lash wrote:
>>
>> Yep, there it is.
>>
>> [bl10_at_login2 USlogsminus10]$ module show netcdf/4.1.3
>> -------------------------------------------------------------------
>> /opt/apps/modulefiles/netcdf/4.1.3:
>>
>> module load openmpi/1.4.4-intel
>> prepend-path PATH /opt/apps/netcdf/4.1.3/bin:/opt/apps/netcdf/4.1.3/deps/hdf5/1.8.7/bin
>> prepend-path LD_LIBRARY_PATH /opt/apps/netcdf/4.1.3/lib:/opt/apps/netcdf/4.1.3/deps/hdf5/1.8.7/lib:/opt/apps/netcdf/4.1.3/deps/szip/2.1/lib
>> prepend-path MANPATH /opt/apps/netcdf/4.1.3/share/man
>> -------------------------------------------------------------------
>>
>>
>>
>>
>> On Wed, May 21, 2014 at 1:34 PM, Douglas L Reeder <dlr1_at_[hidden]> wrote:
>>
>> Ben,
>>
>> The netcdf/4.1.3 module may be loading the openmpi/1.4.4
>> module. Can you do a 'module show' on the netcdf module file to see if
>> there is a module load openmpi command?
>>
>> Doug Reeder
>>
>> On May 21, 2014, at 12:23 PM, Ben Lash <bl10_at_[hidden]> wrote:
>>
>> I just wanted to follow up for anyone else who got a similar
>> problem - module load netcdf/4.1.3 *also* loaded openmpi/1.4.4.
>> Don't ask me why. My code doesn't seem to fail as
>> gracefully but otherwise works now. Thanks.
>>
>>
>> On Sat, May 17, 2014 at 6:02 AM, Jeff Squyres (jsquyres) <jsquyres_at_[hidden]> wrote:
>>
>> Ditto -- Lmod looks pretty cool. Thanks for the
>> heads up.
>>
>>
>> On May 16, 2014, at 6:23 PM, Douglas L Reeder <dlr1_at_[hidden]> wrote:
>>
>> > Maxime,
>> >
>> > I was unaware of Lmod. Thanks for bringing it to
>> my attention.
>> >
>> > Doug
>> > On May 16, 2014, at 4:07 PM, Maxime Boissonneault <maxime.boissonneault_at_[hidden]> wrote:
>> >
>> >> Instead of using the outdated and unmaintained Module environment, why not use Lmod: https://www.tacc.utexas.edu/tacc-projects/lmod
>> >>
>> >> It is a drop-in replacement for the Module environment that supports all of its features and much, much more, such as:
>> >> - module hierarchies
>> >> - module properties and color highlighting (we use it to highlight bioinformatics modules or tools, for example)
>> >> - module caching (very useful for a parallel filesystem with tons of modules)
>> >> - path priorities (useful to make sure personal modules take precedence over system modules)
>> >> - export of the module tree to JSON
>> >>
>> >> It works like a charm, understands both TCL and Lua modules, and is actively developed and debugged. There are literally new features every month or so. If it does not do what you want, odds are that the developer will add it shortly (I've had it happen).
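>> >>
>> >> Purely as an illustration (the path and versions below are made up, not anyone's actual site files), a netcdf modulefile in Lmod's Lua syntax that declares its MPI dependency could look like:
>> >>
>> >>     -- hypothetical /opt/apps/modulefiles/netcdf/4.1.3.lua
>> >>     load("openmpi/1.6.5")           -- pull in the MPI it was built against
>> >>     prepend_path("PATH", "/opt/apps/netcdf/4.1.3/bin")
>> >>     prepend_path("LD_LIBRARY_PATH", "/opt/apps/netcdf/4.1.3/lib")
>> >>     prepend_path("MANPATH", "/opt/apps/netcdf/4.1.3/share/man")
>> >>
>> >> so a plain 'module show netcdf/4.1.3' makes the MPI dependency obvious.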
>> >>
>> >> Maxime
>> >>
>> >> On 2014-05-16 17:58, Douglas L Reeder wrote:
>> >>> Ben,
>> >>>
>> >>> You might want to use Environment Modules (on SourceForge) to manage paths to different MPI implementations. It is fairly easy to set up and very robust for this type of problem. You would remove contentious application paths from your standard PATH and then use module to switch them in and out as needed.
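>> >>>
>> >>> For example (the module names here are only illustrative; use whatever your cluster actually provides):
>> >>>
>> >>>     module avail openmpi                            # list the MPI modules on the system
>> >>>     module swap openmpi/1.4.4-intel openmpi/1.6.5   # switch MPI stacks in one step
>> >>>     which mpif90                                    # confirm the right wrappers are now first in PATH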
>> >>>
>> >>> Doug Reeder
>> >>> On May 16, 2014, at 3:39 PM, Ben Lash <bl10_at_[hidden]> wrote:
>> >>>
>> >>>> My cluster has just upgraded to a new version of MPI, and I'm using an old one. It seems that I'm having trouble compiling due to the compiler wrapper file moving (full error here: http://pastebin.com/EmwRvCd9)
>> >>>> "Cannot open configuration file /opt/apps/openmpi/1.4.4-intel/share/openmpi/mpif90-wrapper-data.txt"
>>
>> >>>>
>> >>>> I've found the file on the cluster at /opt/apps/openmpi/retired/1.4.4-intel/share/openmpi/mpif90-wrapper-data.txt
>>
>> >>>> How do I tell the old mpi wrapper where this file is?
>> >>>> I've already corrected one link to mpich -> /opt/apps/openmpi/retired/1.4.4-intel/, which is in the lib folder of the software I'm trying to recompile (/home/bl10/CMAQv5.0.1/lib/x86_64/ifort). Thanks for any ideas. I also tried changing $pkgdatadir based on what I read here:
>> >>>> http://www.open-mpi.org/faq/?category=mpi-apps#default-wrapper-compiler-flags
>> >>>>
>> >>>> Thanks.
>> >>>>
>> >>>> --Ben L
>> >>
>> >>
>> >> --
>> >> ---------------------------------
>> >> Maxime Boissonneault
>> >> Computing Analyst - Calcul Québec, Université Laval
>> >> Ph.D. in Physics
>> >>
>>
>>
>> --
>> Jeff Squyres
>> jsquyres_at_[hidden]
>>
>> For corporate legal information go to:
>> http://www.cisco.com/web/about/doing_business/legal/cri/
>>
>>
>>
>>
>>
>> --
>> --Ben L
>>
>>
>>
>>
>> --
>> --Ben L
>>
>>
>>
>>
>>
>>
>> --
>> --Ben L
>>
>>
>>
>>
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
>

-- 
--Ben L