r2 - 04 May 2005 - 01:51:44 - DominicBattre

MPI Debugging

Debugging an MPI application can be pretty challenging:

  • Processes can die from signals (null pointer dereferences, divisions by zero, ...) without any notice, and it is not easy to attach a debugger in a distributed system like MPI.
  • Debugging by printing to stdout usually does not work, because nodes other than 0 do not necessarily write their output to the console.
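
A common workaround for the second problem is to let every rank append to its own log file. The following is only a minimal sketch of that idea: it assumes the rank has already been obtained (for example from MPI_Comm_rank), and the helper names rankLogName and logLine as well as the file name pattern are hypothetical.

```cpp
#include <cstdio>
#include <sstream>
#include <string>

// Hypothetical helper: builds a per-rank log file name such as "debug.3.log".
std::string rankLogName(int rank) {
   std::ostringstream name;
   name << "debug." << rank << ".log";
   return name.str();
}

// Hypothetical helper: appends one line to the log file of the given rank.
// The file is opened and closed on every call, so the line has been handed
// to the operating system before the function returns.
bool logLine(int rank, const std::string &message) {
   FILE *f = std::fopen(rankLogName(rank).c_str(), "a");
   if (f == NULL) return false;
   std::fprintf(f, "[rank %d] %s\n", rank, message.c_str());
   return std::fclose(f) == 0;
}
```

Opening the file for every message is slow, but it keeps the messages written so far intact if the process is killed by a signal later on.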

The following code should help with both problems. It uses gdb to generate a stack trace.

/*
 * Copyright (C) 2005 Dominic Battre <dominic battre.de>
 *
 * based in parts on work from http://webcvs.kde.org/kdelibs/kdecore/kcrash.cpp
 * Copyright (C) 2000 Timo Hummel <timo.hummel sap.com>
 *                    Tom Braun <braunt fh-konstanz.de>
 *
 * This library is free software; you can redistribute it and/or
 * modify it under the terms of the GNU Library General Public
 * License as published by the Free Software Foundation; either
 * version 2 of the License, or (at your option) any later version.
 *
 * This library is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
 * Library General Public License for more details.
 *
 * You should have received a copy of the GNU Library General Public License
 * along with this library; see the file COPYING.LIB.  If not, write to
 * the Free Software Foundation, Inc., 59 Temple Place - Suite 330,
 * Boston, MA 02111-1307, USA.
 */

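#include <sstream>

#include <stdio.h>    // FILE, fopen, fdopen, fprintf, snprintf
#include <stdlib.h>   // mkstemp, exit
#include <string.h>   // strcpy, strerror
#include <signal.h>   // signal, SIGSEGV, SIGFPE, ...
#include <unistd.h>   // fork, execvp, getcwd, getppid, dup2, alarm, unlink
#include <sys/wait.h> // waitpid
#include <errno.h>    // errno, program_invocation_name (GNU extension)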
/**
 * \brief opens a temporary file in write mode, copies its filename
 * to the filename parameter and returns the file handle
 * of the opened file
 * \param[out] filename  path to opened file
 * \return file handle or 0 in case of error
 */
FILE * crashHandlerCreateTmpFile(char filename[100]) {
   // follows http://www.rocketaware.com/man/man3/mktemp.3.htm
   FILE *file;
   int fd = -1;

   strcpy(filename, "/tmp/backtraceXXXXXX");
   if ((fd = mkstemp(filename)) == -1 ||
         (file = fdopen(fd, "w+")) == NULL) {
      if (fd != -1) {
         unlink(filename);
         close(fd);
      }
      fprintf(stderr, "cannot create %s: %s\n", filename, strerror(errno));
      return 0;
   }
   return file;
}

/**
 * creates a backtrace with gdb and writes the output to the file
 * "crashlog" in the current working directory
 */
void defaultCrashHandler(int signal) {
   // idea taken from http://webcvs.kde.org/kdelibs/kdecore/kcrash.cpp
   
   // the crash Recursion counter ensures that exceptions in the handler do 
   // not lead to recursive calls
   static int crashRecursionCounter = 0;
   crashRecursionCounter++; // Nothing before this, please !
   
   ::signal(SIGALRM, SIG_DFL);
   alarm(3); // Kill me... (in case we deadlock in malloc)
   
   if (crashRecursionCounter < 3) {
      // write a file that contains nothing but the line "bt\n"
      // this is used to pass the "backtrace" command to gdb
      char btcommandfilename[100] = "";
      FILE * btcommand = crashHandlerCreateTmpFile(btcommandfilename);
      if (btcommand == NULL) _exit(254); // no command file, no backtrace possible
      fprintf( btcommand, "bt\n" );
      fclose( btcommand );
      
      pid_t pid = fork();

      if (pid <= 0) {
         // parent process id as string, needed for gdb command line
         char ppid[10];
         snprintf(ppid, 9, "%d", getppid());
         
         // executable
         const char *cmd = "gdb";
         
         // parameters for gdb; argv[0] is the program name by convention
         const char *argv[24];
         int argc = 0;
         argv[argc++] = cmd;
         argv[argc++] = "-nw";    // no windowing interface
         argv[argc++] = "-n";     // do not parse .gdbinit
         argv[argc++] = "-batch"; // terminate after completion
         argv[argc++] = "-x";     // execute commands from btcommandfilename
         argv[argc++] = btcommandfilename;
         argv[argc++] = program_invocation_name; // executable
         argv[argc++] = ppid;    // process id
         argv[argc++] = 0;
         
         // path where to create backtrace
         char cwd[1024];
         getcwd(cwd, sizeof(cwd)-1);
         std::ostringstream path;
         path << cwd << "/crashlog";
         
         // open file, where we want to pipe stdout and stderr of gdb to
         FILE *bt = fopen(path.str().c_str(), "w+");
         if ( bt == NULL ) { fprintf(stderr, "cannot open %s\n", path.str().c_str()); exit(-1); }
         int id = fileno(bt);
         dup2(id, STDOUT_FILENO);
         dup2(id, STDERR_FILENO);

         execvp(cmd, const_cast<char *const *>(argv));
         
         // regular program flow should not reach this point
         fprintf(bt, "error, unable to execute gdb debugger\n");
         fclose(bt);
      } else
      {
         fprintf(stderr, "application crashed, please see file 'crashlog'\n");
         alarm(0); // Seems we made it....
         // wait for child to exit
         waitpid(pid, NULL, 0);
         
         // delete file that contained the backtrace command for gdb
         unlink(btcommandfilename);
         
         _exit(253);
      }
   }
   // recursive crash
   exit(255);
}

/**
 * \brief activates a custom crash handler that uses gdb to generate a
 * backtrace and write it to the file "crashlog".
 * 
 * The name of the executable, which gdb needs to find the symbols for
 * the backtrace, is taken from program_invocation_name.
 */
void activateCrashHandler() {
   sigset_t mask;
   sigemptyset(&mask);
   #ifdef SIGSEGV
   signal (SIGSEGV, defaultCrashHandler);
   sigaddset(&mask, SIGSEGV);
   #endif
   #ifdef SIGFPE
   signal (SIGFPE, defaultCrashHandler);
   sigaddset(&mask, SIGFPE);
   #endif
   #ifdef SIGILL
   signal (SIGILL, defaultCrashHandler);
   sigaddset(&mask, SIGILL);
   #endif
   #ifdef SIGABRT
   signal (SIGABRT, defaultCrashHandler);
   sigaddset(&mask, SIGABRT);
   #endif
}
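
The signal mask assembled in activateCrashHandler() is not passed to any call. One way such a mask could be put to use is sigaction(), which blocks the signals listed in sa_mask while the handler runs. The following sketches that alternative and is not the author's code: installHandler and countingHandler are hypothetical names, and SIGUSR1/SIGUSR2 stand in for the crash signals so that the example can run without actually crashing.

```cpp
#include <signal.h>
#include <stddef.h>
#include <string.h>

// Counts handler invocations; sig_atomic_t may be written from a handler.
static volatile sig_atomic_t handlerCalls = 0;

// Hypothetical stand-in for defaultCrashHandler.
void countingHandler(int) {
   handlerCalls = handlerCalls + 1;
}

// Hypothetical helper: installs `handler` for every signal in `signals`
// and blocks all of them while the handler is running.
// Returns true if every sigaction() call succeeded.
bool installHandler(const int *signals, size_t count, void (*handler)(int)) {
   struct sigaction action;
   memset(&action, 0, sizeof(action));
   action.sa_handler = handler;
   sigemptyset(&action.sa_mask);
   for (size_t i = 0; i < count; ++i)
      sigaddset(&action.sa_mask, signals[i]);

   bool ok = true;
   for (size_t i = 0; i < count; ++i)
      if (sigaction(signals[i], &action, NULL) != 0)
         ok = false;
   return ok;
}
```

With such a helper, activateCrashHandler() could pass an array holding SIGSEGV, SIGFPE, SIGILL and SIGABRT together with defaultCrashHandler.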

This is how you can use the code:

void crash() {
   volatile int a = 99;
   volatile int b = 10;
   volatile int c = a / ( b - 10 );
   printf("%d", c);
}

#include <mpi.h>

int main(int argc, char **argv) {
   int rank;
   MPI_Init(&argc, &argv);
   MPI_Comm_rank(MPI_COMM_WORLD, &rank);
   activateCrashHandler();
   if ( rank == 1 ) crash();
   MPI_Finalize();
   return 0;
}

The code generates a file named crashlog because of the division by zero on node 1.

This is an example of what the crashlog for a division by zero looks like:
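
Because every process opens the same file name in its working directory with mode "w+", crash logs of several ranks can overwrite each other when the ranks share a directory. One possible variation is to make the name unique per process. This is a sketch under that assumption; crashLogName is a hypothetical helper, and the naming scheme is not part of the original handler.

```cpp
#include <unistd.h>
#include <cstring>
#include <sstream>
#include <string>

// Hypothetical helper: returns a name of the form "crashlog.<host>.<pid>",
// built from the host name and the process id of the calling process.
std::string crashLogName() {
   char host[256];
   if (gethostname(host, sizeof(host)) != 0)
      std::strcpy(host, "unknown");
   host[sizeof(host) - 1] = '\0'; // may be unterminated after truncation

   std::ostringstream name;
   name << "crashlog." << host << "." << getpid();
   return name.str();
}
```

In the crash handler the name would have to be computed before fork(), or be built from getppid() inside the child, so that it carries the id of the crashed process rather than that of the gdb child.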

Using host libthread_db library "/lib/libthread_db.so.1".
`system-supplied DSO at 0xffffe000' has disappeared; keeping its symbols.
0xb7b0e58e in waitpid () from /lib/libc.so.6
#0  0xb7b0e58e in waitpid () from /lib/libc.so.6
#1  0xb7baaff4 in ?? () from /lib/libc.so.6
#2  0x0804b753 in defaultCrashHandler (signal=8) at main.cpp:123
#3  <signal handler called>
#4  0x0804b858 in crash () at main.cpp:166
#5  0x0804b8c3 in main (argc=1, argv=0xbffffc64) at main.cpp:175

The interesting line is the one right after <signal handler called>, because it names the function and source line where the signal was raised:

#4  0x0804b858 in crash () at main.cpp:166
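
Picking out that frame can be automated. The following is a minimal sketch, assuming the crashlog has the layout shown above; findCrashFrame is a hypothetical helper name.

```cpp
#include <istream>
#include <sstream>
#include <string>

// Hypothetical helper: returns the first line that follows the
// "<signal handler called>" frame in a gdb backtrace, or an empty
// string if the stream contains no such frame.
std::string findCrashFrame(std::istream &log) {
   std::string line;
   while (std::getline(log, line)) {
      if (line.find("<signal handler called>") != std::string::npos) {
         if (std::getline(log, line)) return line;
         break; // marker was the last line of the log
      }
   }
   return "";
}
```

To use it on a real log, open the file with std::ifstream and pass the stream to findCrashFrame.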

Without the crash handler the application would terminate without any output.

-- DominicBattre - 03 May 2005
