Dear OpenMPI developers,
I'm testing checkpoint and restart with the OpenMPI 1.4 nightly build. The test machine is an IBM Blade System connected over InfiniBand, with 4 processors per node.
At the moment I have some problems. My application is a simple communication ring between processes, with a loop whose iteration count is set by a parameter.
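To make the communication pattern concrete, here is a minimal single-process sketch of such a ring in Python. It is only a model of the pattern, not the actual ./ring source and not MPI code: the function name `simulate_ring`, the integer token, and the idea that one loop iteration equals one full lap around the ring are all assumptions made for illustration.

```python
def simulate_ring(nprocs, iterations):
    """Model a ring in which each rank forwards a token to (rank + 1) % nprocs.

    Hypothetical helper for illustration only; a real MPI program would use
    a send/receive pair between neighbouring ranks instead of this loop.
    """
    if nprocs < 1:
        raise ValueError("nprocs must be at least 1")
    if iterations < 0:
        raise ValueError("iterations must be non-negative")

    token = 0   # counts how many hops the token has made
    holder = 0  # rank 0 is assumed to start with the token
    for _ in range(iterations):      # one iteration = one full lap (assumption)
        for _ in range(nprocs):      # one hop per rank in the ring
            holder = (holder + 1) % nprocs  # pass the token to the right neighbour
            token += 1
    return token, holder
```

For example, `simulate_ring(8, 2)` models 8 processes doing 2 laps: the token makes 16 hops and ends up back on rank 0.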
First case: 8 procs over 2 nodes.
Start command:
$ mpirun -machinefile machinefile -am ft-enable-cr ./ring -t 5000000
The output is:
[node0316:20037] mca: base: components_open: Looking for filem components
[node0316:20037] mca: base: components_open: including only filem components that are checkpoint enabled
[node0316:20037] mca: base: components_open: (filem) Component rsh is Checkpointable
[node0316:20037] mca: base: components_open: opening filem components
[node0316:20037] mca: base: components_open: found loaded component rsh
[node0316:20037] mca: base: components_open: component rsh has no register function
[node0316:20037] filem:rsh: open()
[node0316:20037] filem:rsh: open: priority = 20
[node0316:20037] filem:rsh: open: verbosity = 0
[node0316:20037] filem:rsh: open: cp command = scp
[node0316:20037] filem:rsh: open: rsh command = ssh
[node0316:20037] mca: base: components_open: component rsh open function successful
[node0316:20037] mca:base:select: Auto-selecting filem components
[node0316:20037] mca:base:select:(filem) Querying component [rsh]
[node0316:20037] mca:base:select:(filem) Query of component [rsh] set priority to 20
[node0316:20037] mca:base:select:(filem) Selected component [rsh]
[node0316:20037] mca: base: components_open: Looking for snapc components
[node0316:20037] mca: base: components_open: including only snapc components that are checkpoint enabled
[node0316:20037] mca: base: components_open: (snapc) Component full is Checkpointable
[node0316:20037] mca: base: components_open: opening snapc components
[node0316:20037] mca: base: components_open: found loaded component full
[node0316:20037] mca: base: components_open: component full has no register function
[node0316:20037] snapc:full: open()
[node0316:20037] snapc:full: open: priority = 20
[node0316:20037] snapc:full: open: verbosity = 100
[node0316:20037] snapc:full: open: skip_filem = False
[node0316:20037] mca: base: components_open: component full open function successful
[node0316:20037] mca:base:select: Auto-selecting snapc components
[node0316:20037] mca:base:select:(snapc) Querying component [full]
[node0316:20037] snapc:full: component_query()
[node0316:20037] mca:base:select:(snapc) Query of component [full] set priority to 20
[node0316:20037] mca:base:select:(snapc) Selected component [full]
[node0316:20037] snapc:full: module_init(1, 1)
[node0316:20037] snapc:full: module_init: Global Snapshot Coordinator
** HANG **
The application does not start and appears to hang.
Running mpirun under strace shows the following:
poll([{fd=4, events=POLLIN}, {fd=5, events=POLLIN}, {fd=6, events=POLLIN}], 3, 1000) = 0
poll([{fd=4, events=POLLIN}, {fd=5, events=POLLIN}, {fd=6, events=POLLIN}], 3, 1000) = 0
poll([{fd=4, events=POLLIN}, {fd=5, events=POLLIN}, {fd=6, events=POLLIN}], 3, 1000) = 0
poll([{fd=4, events=POLLIN}, {fd=5, events=POLLIN}, {fd=6, events=POLLIN}], 3, 1000) = 0
poll([{fd=4, events=POLLIN}, {fd=5, events=POLLIN}, {fd=6, events=POLLIN}], 3, 1000) = 0
poll(...
mpirun just keeps polling and does nothing else.
Second case: 1 node, 4 processors (no inter-node communication over InfiniBand).
In this case, mpirun works well, but the checkpoint procedure fails:
$ ompi-checkpoint 20109
[node0316:20134] Error: Unable to get the current working directory
[node0316:20134] [[42404,0],0] ORTE_ERROR_LOG: Not found in file orte-checkpoint.c at line 395
[node0316:20134] HNP with PID 20109 Not found!
I don't understand why OpenMPI doesn't find that log file.
Any idea?
Thanks in advance.
--
Gabriele Fatigati
CINECA Systems & Tecnologies Department
Supercomputing Group
Via Magnanelli 6/3, Casalecchio di Reno (BO) Italy
www.cineca.it Tel: +39 051 6171722
g.fatigati@cineca.it