
I'm trying to parse values from a CSV file into a SQLite DB, but the file is quite large (~2,500,000 lines). I ran my program for a few hours, printing its progress as it went, but by my calculation the file would have taken about 100 hours to parse completely, so I stopped it.

I'm going to have to run this program as a background process at least once a week, on a new CSV file that is around 90% similar to the previous one. I have come up with a few solutions to improve my program. However, I don't know much about databases, so I have questions about each of them.

  • Is there a more efficient way to read a CSV file than what I have already?

  • Is instantiating an ObjectOutputStream and storing the serialized bean as a BLOB significantly computationally expensive? I could add the values directly instead, but I use the BLOB later, so storing it now saves me from instantiating a new one multiple times.

  • Would connection pooling, or changing the way I use the Connection in some other way be more efficient?

  • I'm setting the URL column as UNIQUE so I can use INSERT OR IGNORE, but testing this on smaller datasets (~10,000 lines) indicates no performance gain compared to dropping the table and repopulating it. Is there a faster way to add only unique values?

  • Are there any obvious mistakes I'm making? (Again, I know very little about databases)

    public class Database{

    private Connection c; // shared connection; assumed to be opened elsewhere

    public void createResultsTable(){
        Statement stmt;
        String sql = "CREATE TABLE results("
                + "ID       INTEGER     NOT NULL    PRIMARY KEY AUTOINCREMENT, "
                + "TITLE    TEXT        NOT NULL, "
                + "URL      TEXT        NOT NULL    UNIQUE, "
                ...
                ...
                + "SELLER   TEXT        NOT NULL, "
                + "BEAN     BLOB);";
        try {
            stmt = c.createStatement();
            stmt.executeUpdate(sql);
        } catch (SQLException e) { e.printStackTrace(); }
    }
    
    
    public void addCSVToDatabase(Connection conn, String src){
    
        BufferedReader reader = null;
        DBEntryBean b;
        String[] vals;
    
        try{
            reader = new BufferedReader(new InputStreamReader(new FileInputStream(src), "UTF-8"));
            for(String line; (line = reader.readLine()) != null;){
                //Each line takes the form: "title|URL|...|...|SELLER"
            vals = line.split("\\|"); // '|' is a regex metacharacter, so it must be escaped
    
                b = new DBEntryBean();
                b.setTitle(vals[0]);
                b.setURL(vals[1]);
                ...
                ...
                b.setSeller(vals[n]);
    
                insert(conn, b);
            }
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            if (reader != null) {
                try { reader.close(); } catch (IOException e) { e.printStackTrace(); }
            }
        }
    }
    
    
    public void insert(Connection conn, DBEntryBean b){
    
        PreparedStatement pstmt = null;
        String sql = "INSERT OR IGNORE INTO results("
                + "TITLE, "
                + "URL, "
                ...
                ...
                + "SELLER, "
                + "BEAN"
                + ");";
    
        try {
            pstmt = conn.prepareStatement(sql); // use the Connection that was passed in
            pstmt.setString(Constants.DB_COL_TITLE, b.getTitle());      
            pstmt.setString(Constants.DB_COL_URL, b.getURL());      
            ...
            ...
            pstmt.setString(Constants.DB_COL_SELLER, b.getSeller());
    
            // ByteArrayOutputStream baos = new ByteArrayOutputStream();
            // ObjectOutputStream oos = new ObjectOutputStream(baos);
            // oos.writeObject(b);
            // byte[] bytes = baos.toByteArray();
            // pstmt.setBytes(Constants.DB_COL_BEAN, bytes);
            pstmt.executeUpdate();
    
        } catch (SQLException e) { e.printStackTrace(); 
        } finally{
            if(pstmt != null){
                try{ pstmt.close(); }
                catch (SQLException e) { e.printStackTrace(); }
            }
    
        }
    }
    
    
    }
    
  • Ideally you don't want to be creating a new prepared statement with each line of the file. You want to reuse it. Commented Jan 7, 2017 at 4:52
  • It seems that your code currently works, and you are looking to improve it. Generally these questions are too opinionated for this site, but you might find better luck at CodeReview.SE. Remember to read their requirements as they are a bit more strict than this site. Commented Jan 7, 2017 at 4:56
  • @4castle Thanks. I moved the PreparedStatement out of the loop, and tested it on 1000 lines and gained about a 3 second improvement. So that's a start. Commented Jan 7, 2017 at 5:03
  • @4castle I'll post this in CodeReview.SE as well, I didn't know that existed. Commented Jan 7, 2017 at 5:03
  • I'm voting to close this question as off-topic because it has now been asked (i.e. cross-posted) on codereview.stackexchange.com Commented Jan 7, 2017 at 6:38

1 Answer


The biggest bottleneck in your code is that you are not batching the insert operations. You should call pstmt.addBatch() instead of pstmt.executeUpdate(), and execute the batch once you have accumulated something like 10,000 rows to insert.
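
For example, something like this (a sketch only: the BatchInsert class name, the three-column subset of your schema, and the 10,000-row batch size are mine to illustrate the idea, not taken from your code):

    import java.io.*;
    import java.sql.*;

    public class BatchInsert {

        // Batched inserts inside explicit transactions, assuming the
        // pipe-delimited "title|URL|...|SELLER" format from the question.
        public void addCSVToDatabase(Connection conn, String src) throws IOException, SQLException {
            String sql = "INSERT OR IGNORE INTO results(TITLE, URL, SELLER) VALUES(?, ?, ?);";
            conn.setAutoCommit(false); // commit per batch, not per row
            try (BufferedReader reader = new BufferedReader(
                         new InputStreamReader(new FileInputStream(src), "UTF-8"));
                 PreparedStatement pstmt = conn.prepareStatement(sql)) {
                int count = 0;
                for (String line; (line = reader.readLine()) != null; ) {
                    String[] vals = line.split("\\|");
                    pstmt.setString(1, vals[0]);               // TITLE
                    pstmt.setString(2, vals[1]);               // URL
                    pstmt.setString(3, vals[vals.length - 1]); // SELLER
                    pstmt.addBatch();
                    if (++count % 10_000 == 0) {
                        pstmt.executeBatch();
                        conn.commit();
                    }
                }
                pstmt.executeBatch(); // flush the final partial batch
                conn.commit();
            }
        }
    }

With SQLite the explicit transaction matters as much as the batching itself: in auto-commit mode every single INSERT runs in its own transaction with its own disk sync.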

On the CSV parsing side, you should really consider using a CSV library to do the parsing for you. univocity-parsers has the fastest CSV parser around, and it should get through these 2.5 million lines in less than a second. I'm the author of this library, by the way.
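
A minimal sketch of its streaming API, assuming the library is on the classpath (the CsvImport class name is just for illustration; feed each row into the batched insert above):

    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import com.univocity.parsers.csv.CsvParser;
    import com.univocity.parsers.csv.CsvParserSettings;

    public class CsvImport {

        public void parse(String src) throws IOException {
            CsvParserSettings settings = new CsvParserSettings();
            settings.getFormat().setDelimiter('|'); // the files are pipe-delimited

            CsvParser parser = new CsvParser(settings);
            parser.beginParsing(new InputStreamReader(new FileInputStream(src), "UTF-8"));

            // parseNext() streams one row at a time, so the whole file
            // never has to fit in memory.
            String[] row;
            while ((row = parser.parseNext()) != null) {
                // row[0] = TITLE, row[1] = URL, ..., row[n] = SELLER
            }
        }
    }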

String.split() is convenient, but not fast. For anything more than a few dozen rows it doesn't make sense to use it.
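
If you'd rather not add a dependency at all, even a hand-rolled splitter avoids the regex machinery behind String.split(). A sketch (the splitPipe name is mine), assuming every line has exactly fieldCount fields separated by '|' with no quoting or escaping:

    // Regex-free splitting: scans with indexOf instead of compiling a
    // pattern. Throws if a line has fewer than fieldCount fields.
    static String[] splitPipe(String line, int fieldCount) {
        String[] out = new String[fieldCount];
        int start = 0;
        for (int i = 0; i < fieldCount - 1; i++) {
            int end = line.indexOf('|', start);
            out[i] = line.substring(start, end);
            start = end + 1;
        }
        out[fieldCount - 1] = line.substring(start); // last field runs to end of line
        return out;
    }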

Hope this helps.


