
I have an mpi4py program that hangs intermittently. How can I trace what the individual processes are doing?

I can run each process in its own terminal, for example under pdb:

mpiexec -n 4 xterm -e "python -m pdb my_program.py"

But this gets cumbersome when the issue only shows up with a large number of processes (~80 in my case). In addition, pdb makes it easy to catch exceptions, but a hang raises no exception, so I'd need an execution trace to find where it occurs.


The standard-library trace module can trace program execution line by line. To store the trace of each process separately, wrap your code in a function so it can be handed to the tracer:

def my_program(*args, **kwargs):
    # insert your code here
    pass

And then run it with trace.Trace.runfunc:

import sys
import trace

from mpi4py.MPI import COMM_WORLD

# define Trace object: trace line numbers at runtime, exclude some modules
tracer = trace.Trace(
    ignoredirs=[sys.prefix, sys.exec_prefix],
    ignoremods=[
        'inspect', 'contextlib', '_bootstrap',
        '_weakrefset', 'abc', 'posixpath', 'genericpath', 'textwrap'
    ],
    trace=1,
    count=0)

# by default the trace goes to stdout;
# redirect it to a different file for each process.
# buffering=1 makes the file line-buffered, so every traced line is
# handed to the OS immediately and is not lost when the hung job is killed
sys.stdout = open('trace_{:04d}.txt'.format(COMM_WORLD.rank), 'w', buffering=1)

tracer.runfunc(my_program)
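If you would rather not replace sys.stdout for the rest of the process, the same idea can be wrapped in a helper that redirects output only while the traced function runs. This is a sketch, not part of the original answer: the name trace_to_file is hypothetical, and it relies on contextlib.redirect_stdout, which works here because the trace module writes its output with print() and therefore to whatever sys.stdout is at that moment. For brevity it only sets ignoredirs; add an ignoremods list as above if needed.

```python
import contextlib
import sys
import trace


def trace_to_file(func, path, *args, **kwargs):
    """Run func(*args, **kwargs) under trace.Trace, writing the line trace to path.

    Hypothetical helper: unlike assigning to sys.stdout directly, the
    original stdout is restored as soon as func returns or raises.
    """
    tracer = trace.Trace(
        ignoredirs=[sys.prefix, sys.exec_prefix],  # skip stdlib/site-packages
        trace=1,
        count=0)
    # buffering=1: line-buffered, so each traced line reaches the OS
    # immediately and survives if the hung process is killed
    with open(path, 'w', buffering=1) as out, contextlib.redirect_stdout(out):
        # runfunc forwards the extra positional/keyword arguments to func
        # and returns func's return value
        return tracer.runfunc(func, *args, **kwargs)
```

Under MPI you would call it as trace_to_file(my_program, 'trace_{:04d}.txt'.format(COMM_WORLD.rank)) after importing COMM_WORLD from mpi4py.MPI. Anything my_program prints itself still ends up in the trace file while it runs, just as with the original approach.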

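Now the trace of each process is written to its own file: trace_0000.txt for rank 0, trace_0001.txt for rank 1, and so on. Use the ignoredirs and ignoremods arguments to omit low-level calls into the standard library and other modules you are not interested in.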
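After the program hangs and you kill it, the last line of each file is the last traced line that rank started executing, for example the Python line that called a blocking MPI operation. Ranks whose last line differs from the others are often a good place to start looking. A small sketch for collecting those tails, assuming the trace_XXXX.txt naming above; last_traced_lines is a hypothetical helper, not part of the trace module.

```python
import glob
import os
from collections import deque


def last_traced_lines(pattern='trace_*.txt', n=1):
    """Return {file name: [last n lines]} for every per-rank trace file.

    The default pattern matches the trace_{rank:04d}.txt naming used above;
    adjust it if the traces are written elsewhere.
    """
    result = {}
    for path in sorted(glob.glob(pattern)):
        with open(path) as f:
            # a deque with maxlen keeps only the tail while streaming the
            # file, so large traces are never held in memory all at once
            tail = deque(f, maxlen=n)
        result[os.path.basename(path)] = [line.rstrip('\n') for line in tail]
    return result


if __name__ == '__main__':
    # print each file name followed by its indented last line(s)
    for name, lines in last_traced_lines().items():
        print(name, *lines, sep='\n    ')
```

Run it from the directory containing the trace files; pass n=5 or so to see a little context before the point where each rank stopped.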
