Selectively parsing log files using Java

Question

I have to parse a big bunch of log files, which are in the following format.

SOME SQL STATEMENT/QUERY

DB20000I  The SQL command completed successfully.

SOME OTHER SQL STATEMENT/QUERY

DB21034E  The command was processed as an SQL statement because it was not a 
valid Command Line Processor command.

EDIT 1: The first 3 lines (including a blank line) indicate an SQL statement executed successfully, while the next three show the statement and the exception it caused. darioo's reply below, suggesting the use of grep instead of Java, works beautifully for a single line SQL statement.

EDIT 2: However, the SQL statement/query is not necessarily a single line. Sometimes it is a big CREATE PROCEDURE...END PROCEDURE block. Can this case also be handled using only Unix commands?

Now I need to parse through the entire log file and pick all occurrences of the pair of (SQL statement + error) and write them in a separate file.

Please show me how to do this!

You're writing "The first 2 lines", but I count three (one of them empty). Since whitespace is significant in regular expressions, this matters, so could you specify which interpretation is correct? Also, will the SQL statement and the longer message always occupy one line each, or could there be variations? Are there empty lines between pairs of log entries?
– Tim Pietzcker, Commented Jan 10, 2011 at 10:09
@Joel - I haven't tried anything so far. I just finished a small round of discussion and just posted my problem!
– GPX, Commented Jan 10, 2011 at 10:22
@Tim - You are correct! The whitespace does matter. 3 lines it is!
– GPX, Commented Jan 10, 2011 at 10:22
OK, so how can you tell where an SQL procedure starts? Is it the collection of non-blank lines before the line that starts with DB? (And does that line always start with DB?)
– Tim Pietzcker, Commented Jan 10, 2011 at 12:26

darioo · Accepted Answer · 2011-01-11 10:17:30Z

4

My answer is not Java-based, since this is a classic example of a problem that is much easier to solve with other tools.

All you need is grep. If you're on Windows, ports of grep are available (Cygwin includes one).

Assuming your logs are in the file log.txt, the solution to your problem is a one-liner:

grep -hE --before-context 1 "^DB2[0-9]+E" log.txt > filtered.txt

Explanation:

-h - don't print the file name
-E - interpret the pattern as an extended regular expression
--before-context 1 - print one line before each matching error message (this works only if every SQL query fits on one line)
^DB2[0-9]+E - match lines that begin with "DB2", followed by one or more digits and then "E"

The command above writes every line you need to a new file called filtered.txt.
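
Since the question asks for Java, here is a minimal sketch of the same idea, not a definitive implementation. The class name ErrorWithPreviousLine and the command-line arguments are made up for illustration. One deliberate difference from grep: instead of the line immediately before the error, it keeps the nearest non-blank line, on the assumption that a blank line separates the statement from its status message.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class ErrorWithPreviousLine {
    // Same pattern as the grep command: "DB2", one or more digits, then "E".
    private static final Pattern ERROR = Pattern.compile("^DB2[0-9]+E");

    /** Returns every error line, each preceded by the nearest non-blank line before it. */
    public static List<String> filter(List<String> lines) {
        List<String> out = new ArrayList<>();
        String previous = null; // last non-blank line seen so far
        for (String line : lines) {
            if (ERROR.matcher(line).find()) {
                if (previous != null) {
                    out.add(previous);
                }
                out.add(line);
            }
            if (!line.trim().isEmpty()) {
                previous = line;
            }
        }
        return out;
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.err.println("usage: ErrorWithPreviousLine <log file> <output file>");
            return;
        }
        // Assumption: the log is UTF-8 and small enough to read into memory at once.
        List<String> lines = Files.readAllLines(Paths.get(args[0]));
        Files.write(Paths.get(args[1]), filter(lines));
    }
}
```

Like the grep one-liner, this only handles single-line statements.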

Update: after some fumbling around, I managed to get what's needed using only standard *nix utilities. Beware, it's not pretty. The final expression:

grep -nE "^DB2[0-9]+" log.txt | cut -f 1 -d " " | gawk "/E$/{y=$0;print x, y};{x=$0}" | sed -e "s/:DB2[[:digit:]]\+[IE]//g" | gawk "{print \"sed -n \\\"\" $1+1 \",\" $2 \"p\\\" log.txt \"}" | sed -e "s/$/ >> filtered.txt/g" > run.bat

Explanation:

grep -nE "^DB2[0-9]+" log.txt - prints lines that begin with DB2... and their line number at beginning. Example:

6:DB20000I  The SQL command completed successfully.
12:DB21034E  The command was processed as an SQL statement because it was not a valid Command Line Processor command.
19:DB21034E  The command was processed as an SQL statement because it was not a valid Command Line Processor command.
26:DB21034E  The command was processed as an SQL statement because it was not a valid Command Line Processor command.
34:DB20000I  The SQL command completed successfully.
41:DB20000I  The SQL command completed successfully.
47:DB21034E  The command was processed as an SQL statement because it was not a valid Command Line Processor command.
54:DB20000I  The SQL command completed successfully.

cut -f 1 -d " " - prints only the "first column", that is, removes everything after the message code. Example:

6:DB20000I
12:DB21034E
19:DB21034E
26:DB21034E
34:DB20000I
41:DB20000I
47:DB21034E
54:DB20000I

gawk "/E$/{y=$0;print x, y};{x=$0}" - for every line that ends with "E" (an error line), print the line before it and then the error line. Example:

6:DB20000I 12:DB21034E
12:DB21034E 19:DB21034E
19:DB21034E 26:DB21034E
41:DB20000I 47:DB21034E

sed -e "s/:DB2[[:digit:]]\+[IE]//g" - removes the colon and the message code, leaving only the line numbers. Example:

6 12
12 19
19 26
41 47

gawk "{print \"sed -n \\\"\" $1+1 \",\" $2 \"p\\\" log.txt \"}" - formats above lines for sed processing and increments first line number by one. Example:

sed -n "7,12p" log.txt 
sed -n "13,19p" log.txt 
sed -n "20,26p" log.txt 
sed -n "42,47p" log.txt

sed -e "s/$/ >> filtered.txt/g" - appends >> filtered.txt to lines, for appending to final output file. Example:

sed -n "7,12p" log.txt  >> filtered.txt
sed -n "13,19p" log.txt  >> filtered.txt
sed -n "20,26p" log.txt  >> filtered.txt
sed -n "42,47p" log.txt  >> filtered.txt

> run.bat - finally, prints the last lines to a batch file named run.bat

After you run this file, the content you want will appear in filtered.txt.
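
For comparison, the range computation that the pipeline performs can be sketched in Java. This is an illustration under assumptions, not a port of the exact commands: the class and method names are hypothetical, and status lines are assumed to look like DB2<digits>I or DB2<digits>E. Unlike the pipeline, it starts the first range at line 1 when the very first status line is already an error.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class ErrorBlockRanges {
    // Any status line (informational or error), as in the first grep of the pipeline.
    private static final Pattern STATUS = Pattern.compile("^DB2[0-9]+[IE]");
    private static final Pattern ERROR = Pattern.compile("^DB2[0-9]+E");

    /** Returns 1-based {start, end} line ranges, one per error, like the generated sed commands. */
    public static List<int[]> errorRanges(List<String> lines) {
        List<int[]> ranges = new ArrayList<>();
        int previousStatus = 0; // line number of the previous status line, 0 if none yet
        for (int i = 0; i < lines.size(); i++) {
            String line = lines.get(i);
            if (!STATUS.matcher(line).find()) {
                continue;
            }
            int lineNumber = i + 1;
            if (ERROR.matcher(line).find()) {
                // The block runs from just after the previous status line up to the error line.
                ranges.add(new int[] {previousStatus + 1, lineNumber});
            }
            previousStatus = lineNumber;
        }
        return ranges;
    }

    /** Collects the lines covered by the ranges, in order. */
    public static List<String> extract(List<String> lines) {
        List<String> out = new ArrayList<>();
        for (int[] range : errorRanges(lines)) {
            out.addAll(lines.subList(range[0] - 1, range[1]));
        }
        return out;
    }
}
```

Because each range ends at the first line of the error message, a message that wraps onto a second line loses its continuation line, just as with the sed ranges.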

Update 2:

Here is another version that works on Ubuntu (previous version was written on Windows):

grep -nE "^DB2[0-9]+" log.txt | cut -f 1 -d " " | gawk '/E/{y=$0;print x, y};{x=$0}' | sed -e "s/:DB2[[:digit:]]\+[IE]//g" | gawk '{print "sed -n \""$1+1" ,"$2 "p\" log.txt" }' | sed -e "s/$/ >> filtered.txt/g" > run.sh

Two things were not working with the previous version:

for some reason, gawk '/E$/' wasn't matching (it didn't recognize that E is at the end of the line), so I just used /E/, since E won't be found anywhere else.
quoting: the double quotes around the gawk programs were changed to single quotes, because a Unix shell expands $0, $1 and $2 inside double quotes before gawk sees them; the quoting inside the last gawk expression was adjusted accordingly

edited Jan 11, 2011 at 10:17

answered Jan 10, 2011 at 9:59

darioo



13 Comments

Andreas Dolk Over a year ago

@darioo - as far as I understood the question - he does not want to filter the lines with status/error (and those messages may be multi-line), he needs pairs of SQL Message and corresponding database status/error message.

AlexR Over a year ago

Cool regex. I did not know the -h option. But I think he wants to extract the SQL statement itself, so I recommended that he use the -A (after) switch.

Mihai Toader Over a year ago

Actually, he should add a -B 1 flag. The failed query is before the error message.

darioo Over a year ago

@Toader: --before-context 1 does exactly that

GPX Over a year ago

I've added one more scenario in the original post - a case where a big block of CREATE TABLE or CREATE PROCEDURE statements precedes the line with the error. How do I detect and print the entire block responsible for the error?


Tim Pietzcker · Accepted Answer · 2011-01-10 12:37:30Z

1

Assuming that you are looking for a block of non-blank lines, followed by a blank line, followed by a block of non-blank lines the first of which starts with DB, then try:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

// subjectString holds the entire log file as a single String
Pattern regex = Pattern.compile(
    "(?:.+\\n)+    # Match one or more non-blank lines\n" +
    "\\n           # Match one blank line\n" +
    "DB(?:.+\\n)+  # Match one or more non-blank lines, the first one starting with DB", 
    Pattern.COMMENTS);
Matcher regexMatcher = regex.matcher(subjectString);
while (regexMatcher.find()) {
    // matched text: regexMatcher.group()
    // match start: regexMatcher.start()
    // match end: regexMatcher.end()
}

This assumes a blank line between each match, and assumes Unix line endings. If it's a DOS/Windows file, then replace \\n with \\r\\n.
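
The expression matches every statement/status pair, successful or not, so a second check is needed to keep only the failures. The following self-contained sketch shows one way to do that. It assumes error codes have the form DB2<digits>E as in the question's sample; the class name, the ERROR_LINE pattern and the sample log are illustrative.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BlockRegexDemo {
    // The same expression as above, written without COMMENTS mode.
    private static final Pattern ENTRY = Pattern.compile("(?:.+\\n)+\\nDB(?:.+\\n)+");
    // Assumption: error codes look like DB2<digits>E, as in the question's sample.
    private static final Pattern ERROR_LINE = Pattern.compile("^DB2[0-9]+E", Pattern.MULTILINE);

    /** Returns every statement + status entry whose status line is an error. */
    public static List<String> errorEntries(String log) {
        List<String> result = new ArrayList<>();
        Matcher m = ENTRY.matcher(log);
        while (m.find()) {
            if (ERROR_LINE.matcher(m.group()).find()) {
                result.add(m.group());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String log = "SELECT 1\n\n"
                + "DB20000I  The SQL command completed successfully.\n\n"
                + "CREATE PROCEDURE p\nBEGIN\nEND\n\n"
                + "DB21034E  The command was processed as an SQL statement because it was not a \n"
                + "valid Command Line Processor command.\n\n";
        for (String entry : errorEntries(log)) {
            System.out.print(entry);
        }
    }
}
```

The ERROR_LINE check would also fire if a line of the statement itself started with something like DB21034E, which seems unlikely but is worth knowing.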

answered Jan 10, 2011 at 12:37

Tim Pietzcker


1 Comment

Tim Pietzcker Over a year ago

What doesn't work? No matches? Wrong matches? Could you perhaps copy/paste an actual data excerpt in your question?

Riaan Cornelius · Accepted Answer · 2011-01-10 13:55:51Z

Personally, I would go about it slightly differently. Instead of finding all the errors, I would remove all the successes.

Something like this:

1. Read the log file into a String (use a read method, not readLine, as the latter drops newline characters).
2. Use the following regex with replaceAll(regex, "") on the String to remove all successful entries: (?:.+\r\n)+\r\n+DB2.+I(?:.+\r\n)+
3. Write the resulting String out to a new file.

And in code (Just call processLog with the File object for the log):

private void openAndProcessLog(){
    JFileChooser chooser = new JFileChooser();
    chooser.showOpenDialog(this);
    if (chooser.getSelectedFile() != null) {
        processLog(chooser.getSelectedFile());
    }
}

private void processLog(File logfile){
    String originalLog = readFile(logfile);
    String onlyFailures = removeAllSuccessFull(originalLog);
    System.out.println(onlyFailures);
}

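private String readFile(File file) {
    StringWriter out = new StringWriter();
    // try-with-resources closes the reader even if reading fails
    try (BufferedReader in = new BufferedReader(new FileReader(file))) {
        char[] buf = new char[10000];
        int n;
        // read raw characters so that line separators are preserved
        while ((n = in.read(buf)) >= 0) {
            out.write(buf, 0, n);
        }
    } catch (IOException e) {
        e.printStackTrace(); // report the failure instead of swallowing it
        return "";
    }
    return out.toString();
}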

private String removeAllSuccessFull(String text) {
    String sep = System.getProperty("line.separator");
    Pattern regex = Pattern.compile(
            "(?:.+"+sep+")+"+sep+"+DB2.+I(?:.+"+sep+")+");
    return regex.matcher(text).replaceAll("");
}

Dennis Williamson · Accepted Answer · 2011-01-10 15:44:42Z

1

Give this a try:

#!/usr/bin/awk -f
$1 ~ /^DB.*I$/ {lines=""; nl=""; next} # discard successes
$1 ~ /^DB.*E$/ {print lines; print $0; print "-----"; lines=""; next} # print error blocks
$0 !~ /^$/ { lines = lines nl $0; nl="\n" } # accumulate lines in block

If you don't want to strip blank lines, remove the $0 !~ /^$/.

Run it like this:

./script.awk inputfile
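
For readers who want the same logic in Java, as the question originally asked, here is a rough line-by-line equivalent. It is a sketch, not a tested port: the class and method names are made up, and it returns the blocks as strings rather than printing them with a "-----" separator.

```java
import java.util.ArrayList;
import java.util.List;

public class ErrorBlockCollector {
    /** Returns one string per failed statement: its non-blank lines followed by the error line. */
    public static List<String> errorBlocks(List<String> lines) {
        List<String> blocks = new ArrayList<>();
        List<String> pending = new ArrayList<>(); // lines seen since the last status line
        for (String line : lines) {
            String trimmed = line.trim();
            // First whitespace-separated field, like awk's $1.
            String first = trimmed.isEmpty() ? "" : trimmed.split("\\s+")[0];
            if (first.matches("DB.*I")) {
                pending.clear();                        // success: discard the statement
            } else if (first.matches("DB.*E")) {
                pending.add(line);                      // error: keep statement + message
                blocks.add(String.join("\n", pending));
                pending.clear();
            } else if (!line.isEmpty()) {
                pending.add(line);                      // accumulate non-blank lines
            }
        }
        return blocks;
    }
}
```

As in the awk script, the continuation line of a message that wraps onto a second line is treated as the start of the next block.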

answered Jan 10, 2011 at 15:44

Dennis Williamson



AlexR · Accepted Answer · 2011-01-10 10:05:32Z

-1

If you are using a Linux shell or Cygwin on Windows, I'd recommend using grep with the flags -A (after) and -B (before):

grep -A 2 "The SQL command completed successfully" mylog.log

This prints the 2 lines after each line that matches the given pattern.

If you wish to write your own, I'd recommend the following:

Iterate over the lines until you reach a line that matches your pattern. Then continue reading N lines (e.g. 2 lines) and print them somewhere. Then continue reading.

answered Jan 10, 2011 at 10:05

AlexR


1 Comment

Mihai Toader Over a year ago

It might have n success queries before having an error query.
