MPI Debugging
Debugging an MPI application can be challenging:
- Processes can die from signals (null pointer dereferences, divisions by zero, ...) without any notice, and it is not easy to attach a debugger in a distributed system like MPI.
- Debugging by printing to stdout usually does not work, because nodes other than 0 do not necessarily write their output to the console.
The following code should help with both problems. It uses gdb to generate a stack trace and writes it to a file.
/*
* Copyright (C) 2005 Dominic Battre <dominic battre.de>
*
* based in parts on work from http://webcvs.kde.org/kdelibs/kdecore/kcrash.cpp
* Copyright (C) 2000 Timo Hummel <timo.hummel sap.com>
* Tom Braun <braunt fh-konstanz.de>
*
* This library is free software; you can redistribute it and/or
* modify it under the terms of the GNU Library General Public
* License as published by the Free Software Foundation; either
* version 2 of the License, or (at your option) any later version.
*
* This library is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
* Library General Public License for more details.
*
* You should have received a copy of the GNU Library General Public License
* along with this library; see the file COPYING.LIB. If not, write to
* the Free Software Foundation, Inc., 59 Temple Place - Suite 330,
* Boston, MA 02111-1307, USA.
*/
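#include <sstream>
#include <stdio.h>    // FILE, fopen, fdopen, fprintf, snprintf
#include <stdlib.h>   // exit, mkstemp
#include <string.h>   // strcpy, strerror
#include <signal.h>   // signal, SIGSEGV, ...
#include <unistd.h>   // fork, execvp, getcwd, alarm, dup2, unlink
#include <sys/wait.h> // waitpid
#include <errno.h>    // errno, program_invocation_name (GNU extension)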
/**
 * \brief opens a temporary file in write mode, copies its filename
 *        to the filename parameter and returns the file handle
 *        of the opened file
 * \param[out] filename path to opened file
 * \return file handle or NULL in case of error
 */
FILE * crashHandlerCreateTmpFile(char filename[100]) {
    // follows http://www.rocketaware.com/man/man3/mktemp.3.htm
    FILE *file;
    int fd = -1;
    strcpy(filename, "/tmp/backtraceXXXXXX");
    if ((fd = mkstemp(filename)) == -1 ||
        (file = fdopen(fd, "w+")) == NULL) {
        // remember errno before the cleanup calls can overwrite it
        int err = errno;
        if (fd != -1) {
            unlink(filename);
            close(fd);
        }
        fprintf(stderr, "cannot create %s: %s\n", filename, strerror(err));
        return NULL;
    }
    return file;
}
/**
 * creates a backtrace with gdb and writes the output to the file
 * "crashlog" in the current working directory
 */
void defaultCrashHandler(int signal) {
    // idea taken from http://webcvs.kde.org/kdelibs/kdecore/kcrash.cpp
    // the crash recursion counter ensures that crashes in the handler do
    // not lead to recursive calls
    static int crashRecursionCounter = 0;
    crashRecursionCounter++; // Nothing before this, please !
    ::signal(SIGALRM, SIG_DFL);
    alarm(3); // Kill me... (in case we deadlock in malloc)
    if (crashRecursionCounter < 3) {
        // write a file that contains nothing but the line "bt\n"
        // this is used to pass the "backtrace" command to gdb
        char btcommandfilename[100] = "";
        FILE * btcommand = crashHandlerCreateTmpFile(btcommandfilename);
        if (btcommand == NULL) _exit(255); // no way to pass commands to gdb
        fprintf( btcommand, "bt\n" );
        fclose( btcommand );
        pid_t pid = fork();
        if (pid < 0) {
            // fork failed: there is no child that could run gdb
            fprintf(stderr, "cannot fork: %s\n", strerror(errno));
            unlink(btcommandfilename);
            _exit(255);
        }
        if (pid == 0) {
            // child: parent process id as string, needed for gdb command line
            char ppid[16];
            snprintf(ppid, sizeof(ppid), "%d", (int) getppid());
            // executable
            const char *cmd = "gdb";
            // parameters for gdb; argv[0] is the program name by convention
            const char *argv[24];
            int argc = 0;
            argv[argc++] = cmd;
            argv[argc++] = "-nw";
            argv[argc++] = "-n"; // do not parse .gdbinit
            argv[argc++] = "-batch"; // terminate after completion
            argv[argc++] = "-x"; // execute commands from btcommandfilename
            argv[argc++] = btcommandfilename;
            argv[argc++] = program_invocation_name; // executable
            argv[argc++] = ppid; // process id
            argv[argc++] = 0;
            // path where to create backtrace
            char cwd[1024];
            if (getcwd(cwd, sizeof(cwd)) == NULL) strcpy(cwd, ".");
            std::ostringstream path;
            path << cwd << "/crashlog";
            // open file, where we want to pipe stdout and stderr of gdb to
            FILE *bt = fopen(path.str().c_str(), "w+");
            if ( bt == NULL ) { fprintf(stderr, "cannot open %s\n", path.str().c_str()); _exit(255); }
            int id = fileno(bt);
            dup2(id, STDOUT_FILENO);
            dup2(id, STDERR_FILENO);
            execvp(cmd, const_cast<char * const *>(argv));
            // regular program flow should not reach this point
            fprintf(bt, "error, unable to execute gdb debugger\n");
            fclose(bt);
            _exit(255);
        } else {
            fprintf(stderr, "application crashed, please see file 'crashlog'\n");
            alarm(0); // Seems we made it....
            // wait for child to exit
            waitpid(pid, NULL, 0);
            // delete file that contained the backtrace command for gdb
            unlink(btcommandfilename);
            _exit(253);
        }
    }
    // recursive crash
    exit(255);
}
/**
 * \brief activates a custom crash handler that uses gdb to generate a
 *        backtrace and write it to the file "crashlog".
 *
 * The name of the executable, which gdb needs to find the symbols for
 * the backtrace, is taken from program_invocation_name.
 */
void activateCrashHandler() {
    sigset_t mask;
    sigemptyset(&mask);
#ifdef SIGSEGV
    signal (SIGSEGV, defaultCrashHandler);
    sigaddset(&mask, SIGSEGV);
#endif
#ifdef SIGFPE
    signal (SIGFPE, defaultCrashHandler);
    sigaddset(&mask, SIGFPE);
#endif
#ifdef SIGILL
    signal (SIGILL, defaultCrashHandler);
    sigaddset(&mask, SIGILL);
#endif
#ifdef SIGABRT
    signal (SIGABRT, defaultCrashHandler);
    sigaddset(&mask, SIGABRT);
#endif
    // make sure the handled signals are not blocked
    // (without this call the mask collected above is never used)
    sigprocmask(SIG_UNBLOCK, &mask, NULL);
}
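The behaviour of signal() is not identical on all platforms, for example regarding whether a handler stays installed after it has fired. The following is a minimal sketch of the same registration with the POSIX function sigaction(). The helper name installHandler is hypothetical and not part of the code above; it only assumes a handler with the same signature as defaultCrashHandler.

```cpp
#include <csignal>
#include <cstddef>
#include <cstring>

// Hypothetical helper (not part of the original code): installs `handler`
// for each of the `count` signals in `signals` using sigaction().
// Returns true only if every registration succeeded.
bool installHandler(const int *signals, std::size_t count, void (*handler)(int)) {
    struct sigaction sa;
    std::memset(&sa, 0, sizeof(sa));
    sa.sa_handler = handler;
    sigemptyset(&sa.sa_mask);  // block no additional signals while the handler runs
    sa.sa_flags = 0;           // without SA_RESETHAND the handler stays installed
    bool ok = true;
    for (std::size_t i = 0; i < count; ++i) {
        if (sigaction(signals[i], &sa, NULL) != 0) {
            ok = false;
        }
    }
    return ok;
}
```

It could be called as installHandler(crashSignals, 4, defaultCrashHandler) with an array crashSignals holding SIGSEGV, SIGFPE, SIGILL and SIGABRT.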
This is how you can use the code:
#include <stdio.h>
#include <mpi.h>
void crash() {
    volatile int a = 99;
    volatile int b = 10;
    volatile int c = a / ( b - 10 ); // integer division by zero
    printf("%d", c);
}
int main(int argc, char **argv) {
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    activateCrashHandler();
    if ( rank == 1 ) crash();
    MPI_Finalize();
    return 0;
}
The code generates a file crashlog due to the division by zero on node 1.
This is an example of what the backtrace for a division by zero looks like:
Using host libthread_db library "/lib/libthread_db.so.1".
`system-supplied DSO at 0xffffe000' has disappeared; keeping its symbols.
0xb7b0e58e in waitpid () from /lib/libc.so.6
#0 0xb7b0e58e in waitpid () from /lib/libc.so.6
#1 0xb7baaff4 in ?? () from /lib/libc.so.6
#2 0x0804b753 in defaultCrashHandler (signal=8) at main.cpp:123
#3 <signal handler called>
#4 0x0804b858 in crash () at main.cpp:166
#5 0x0804b8c3 in main (argc=1, argv=0xbffffc64) at main.cpp:175
The interesting line is the one right after <signal handler called>:
#4 0x0804b858 in crash () at main.cpp:166
Without the crash handler the application would terminate without any output.
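For the second problem from the introduction, output of nodes other than 0 not reaching the console, one common workaround is to let every rank write to its own file. The following is only a sketch: the helper names rankLogName and openRankLog and the file name pattern are assumptions, and the rank is expected to come from MPI_Comm_rank as in the example above.

```cpp
#include <cstdio>
#include <sstream>
#include <string>

// Hypothetical helper: builds a file name that is unique per MPI rank,
// for example "debug.3.log" for prefix "debug" and rank 3.
std::string rankLogName(const std::string &prefix, int rank) {
    std::ostringstream name;
    name << prefix << "." << rank << ".log";
    return name.str();
}

// Hypothetical helper: opens the log file of this rank for appending.
// Returns NULL if the file cannot be opened.
FILE *openRankLog(const std::string &prefix, int rank) {
    FILE *log = std::fopen(rankLogName(prefix, rank).c_str(), "a");
    if (log != NULL) {
        // line buffering: each completed line is handed to the OS, so
        // little is lost if the process dies shortly afterwards
        std::setvbuf(log, NULL, _IOLBF, 0);
    }
    return log;
}
```

Each rank would then call openRankLog("debug", rank) once after MPI_Init and print its debug messages with fprintf to the returned handle.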
--
DominicBattre - 03 May 2005