Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] trunk problem for large-SMP startup?
From: Ralph Castain (rhc_at_[hidden])
Date: 2009-03-04 16:35:36


I just ran a 64ppn job without a problem. A couple of possibilities come
to mind:

1. You might have a stale lib around - try blowing things away and
rebuilding.

2. There may be a problem in your specific situation. Can you provide
some info on what you are doing (e.g., what environment)?

Ralph

On Mar 4, 2009, at 2:22 PM, Ralph Castain wrote:

> I'll take a look - offhand, I don't know of anything limiting you to
> <= 64 ppn
>
>
> On Mar 4, 2009, at 1:49 PM, Eugene Loh wrote:
>
>> I have a problem starting large SMP jobs (e.g., 64 processes on a
>> single SMP) that might be related to a recent trunk change.
>> (Guessing.) Does the following ring any bells?
>>
>> ...
>> ...
>> ...
>> [burl-t5440-0:06798] [[57827,1],42] ORTE_ERROR_LOG: Not found in
>> file ess_env_module.c at line 299
>> [burl-t5440-0:06798] [[57827,1],42] ORTE_ERROR_LOG: Not found in
>> file base/grpcomm_base_modex.c at line 416
>> [burl-t5440-0:06798] [[57827,1],42] ORTE_ERROR_LOG: Not found in
>> file grpcomm_bad_module.c at line 378
>> [burl-t5440-0:06800] [[57827,1],44] ORTE_ERROR_LOG: Not found in
>> file ess_env_module.c at line 299
>> [burl-t5440-0:06800] [[57827,1],44] ORTE_ERROR_LOG: Not found in
>> file base/grpcomm_base_modex.c at line 416
>> [burl-t5440-0:06800] [[57827,1],44] ORTE_ERROR_LOG: Not found in
>> file grpcomm_bad_module.c at line 378
>> [burl-t5440-0:06797] [[57827,1],41] ORTE_ERROR_LOG: Not found in
>> file ess_env_module.c at line 299
>> [burl-t5440-0:06797] [[57827,1],41] ORTE_ERROR_LOG: Not found in
>> file base/grpcomm_base_modex.c at line 416
>> [burl-t5440-0:06797] [[57827,1],41] ORTE_ERROR_LOG: Not found in
>> file grpcomm_bad_module.c at line 378
>> --------------------------------------------------------------------------
>> It looks like MPI_INIT failed for some reason; your parallel
>> process is
>> likely to abort. There are many reasons that a parallel process can
>> fail during MPI_INIT; some of which are due to configuration or
>> environment
>> problems. This failure appears to be an internal failure; here's
>> some
>> additional information (which may only be relevant to an Open MPI
>> developer):
>>
>> orte_grpcomm_modex failed
>> --> Returned "Not found" (-13) instead of "Success" (0)
>> --------------------------------------------------------------------------
>> *** An error occurred in MPI_Init
>> *** before MPI was initialized
>> *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
>> [burl-t5440-0:6756] Abort before MPI_INIT completed successfully;
>> not able to guarantee that all other processes were killed!
>> *** An error occurred in MPI_Init
>> *** before MPI was initialized
>> *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
>> [burl-t5440-0:6757] Abort before MPI_INIT completed successfully;
>> not able to guarantee that all other processes were killed!
>> ...
>> ...
>> ...
>> <trunk-problem.tar.gz>
>
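
For context, the failure reported above occurs inside MPI_Init itself (during the grpcomm modex step), before any application communication takes place, so even a trivial MPI program launched with something like mpirun -np 64 on a single SMP node would exercise the same path. The following is a minimal illustrative sketch of such a reproducer, not the test case from Eugene's attached tarball:

/* Minimal reproducer sketch: any program that calls MPI_Init will
 * exercise the grpcomm modex step that fails in the log above.
 * Launch with, e.g.:  mpirun -np 64 ./init_test  on one SMP node. */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size;

    MPI_Init(&argc, &argv);               /* the reported failure happens here */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank); /* query this process's rank */
    MPI_Comm_size(MPI_COMM_WORLD, &size); /* query total number of ranks */
    printf("rank %d of %d initialized\n", rank, size);
    MPI_Finalize();
    return 0;
}

If MPI_Init succeeds for this sketch at the same process count, the problem is more likely specific to the original test case or its environment rather than the startup path itself.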