
I am trying to get values from a CSV file and put them into a database, which I am managing to do without a great deal of trouble.

But I now need to write a mark back to the CSV, so that the next time I run the script it will only enter values into the DB from below that mark in the file.

Note that the CSV file on the system is automatically flushed every 24 hours, so bear in mind there might not be a mark in the CSV. So basically, put all values into the database if no mark is found.

I am planning to run this script every 30 minutes, so there could be up to 48 marks in the CSV file, unless the mark is removed and moved down the file each time.

I have been deleting the file and then recreating it in the script, so there was a new file on every run, but this somehow breaks the system, so it is not a great option.

Hope you guys can help.

Thank you.

Python Code:

import csv
import MySQLdb

mydb = MySQLdb.connect(host='localhost',
                       user='root',
                       passwd='******',
                       db='kestrel_keep')

cursor = mydb.cursor()

csv_data = csv.reader(file('data_csv.log'))

for row in csv_data:
    # One placeholder per CSV column (13 in total).
    cursor.execute('INSERT INTO `heating` VALUES '
                   '(%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)',
                   row)

# Commit the inserts and close the connection to the database.
mydb.commit()
cursor.close()

print "Done"

My CSV file format:

2013-02-21,21:42:00,-1.0,45.8,27.6,17.3,14.1,22.3,21.1,1,1,2,2
2013-02-21,21:48:00,-1.0,45.8,27.5,17.3,13.9,22.3,20.9,1,1,2,2

3 Answers


It looks like the first two fields in your MySQL table form a unique timestamp. It is possible to set up the MySQL table so that this combination must be unique, and to ignore INSERTs that would violate that uniqueness property. At a mysql> prompt enter the command:

ALTER IGNORE TABLE heating ADD UNIQUE heatingidx (thedate, thetime)    

(Change thedate and thetime to the names of the columns holding the date and time.)
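If you prefer to apply the index from a script rather than at the mysql> prompt, the same statement can be run once through MySQLdb. A minimal sketch, assuming the connection details from the question and the placeholder column names above:

import MySQLdb

mydb = MySQLdb.connect(host='localhost', user='root',
                       passwd='******', db='kestrel_keep')
cursor = mydb.cursor()

try:
    # One-time schema change; ALTER IGNORE also drops any existing
    # rows that would violate the new uniqueness constraint.
    cursor.execute('ALTER IGNORE TABLE `heating` '
                   'ADD UNIQUE heatingidx (thedate, thetime)')
except MySQLdb.MySQLError:
    pass  # e.g. the index already exists

cursor.close()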


Once you make this change to your database, you only need to change one line in your program to make MySQL ignore duplicate insertions:

cursor.execute('INSERT IGNORE INTO `heating` VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)', row)

Yes, it is a little wasteful to run INSERT IGNORE ... on lines that have already been processed, but given the frequency of your data (a row every 6 minutes?), it is not going to matter much in terms of performance.

The advantage to doing it this way is that it is now impossible to accidentally insert duplicates into your table. It also keeps the logic of your program simple and easy to read.

It also avoids having two programs write to the same CSV file at the same time. Even if your program usually succeeds without error, every so often -- maybe once in a blue moon -- your program and the other program may try to write to the file at the same time, which could result in an error or mangled data.


You can also make your program a little faster by using cursor.executemany instead of cursor.execute:

rows = list(csv_data)
cursor.executemany('''INSERT IGNORE INTO `heating` VALUES
    (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)''', rows)

is equivalent to

for row in csv_data:
    cursor.execute('INSERT IGNORE INTO `heating` VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)', row)

except that it packs all the data into one command.
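Putting the pieces together, the whole script then shrinks to a few lines. A minimal sketch, assuming the same connection details, file name, and 13-column table as in the question:

import csv
import MySQLdb

mydb = MySQLdb.connect(host='localhost', user='root',
                       passwd='******', db='kestrel_keep')
cursor = mydb.cursor()

# Read every CSV line; duplicates are skipped by the unique index.
rows = list(csv.reader(file('data_csv.log')))
cursor.executemany('''INSERT IGNORE INTO `heating` VALUES
    (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)''', rows)

mydb.commit()
cursor.close()

print "Done"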


3 Comments

@ZeroG: That's no problem. Just list all the fields needed to define a unique row. I've edited the post above to show what I mean.
Will this take into account that the date and time together need to be unique, i.e. across 2 days there are two 14:00's even though the dates will be different?
Yes. Two rows would be considered the same only if they shared the same day and the same time.

I think that a better option than "marking" the CSV file is to keep a separate file where you store the number of the last line you processed.

So if that file (the one where you store the number of the last processed line) does not exist, you process the whole CSV file. If it exists, you only process records after that line.
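In outline, the bookkeeping amounts to one read and one write of that counter file. A minimal sketch of the idea, using the same lastline and data_csv.log file names as the final code below (the full, working version follows):

import os

start_row = 0

# Resume from the line recorded by the previous run, if any.
if os.path.exists('lastline'):
    with open('lastline') as f:
        start_row = int(f.read())

processed = 0
with open('data_csv.log') as f:
    for i, line in enumerate(f):
        if i >= start_row:
            pass  # insert this line into the database here
        processed = i + 1

# Record where the next run should pick up.
with open('lastline', 'w') as f:
    f.write(str(processed))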

Final Code On Working System:

#!/usr/bin/python
import csv
import MySQLdb
import os

mydb = MySQLdb.connect(host='localhost',
                       user='root',
                       passwd='*******',
                       db='kestrel_keep')

cursor = mydb.cursor()

def get_size(fileobject):
    fileobject.seek(0, 2)  # move to the end of the file
    size = fileobject.tell()
    fileobject.seek(0)     # rewind so the reader starts from the top
    return size

logfile = open('data_csv.log', 'rb')
curr_file_size = get_size(logfile)
csv_data = csv.reader(logfile)

start_row = 0
saved_file_size = 0

# Get the file size recorded on the previous run
if os.path.exists("file_size"):
    with open("file_size") as f:
        saved_file_size = int(f.read())

# Get the number of the last processed line
if os.path.exists("lastline"):
    with open("lastline") as f:
        start_row = int(f.read())

# If the CSV file shrank since the last run, it has been flushed:
# start again from the top
if curr_file_size < saved_file_size:
    start_row = 0

cur_row = 0
for row in csv_data:
    if cur_row >= start_row:
        # One placeholder per CSV column (13 in total).
        cursor.execute('INSERT INTO `heating` VALUES '
                       '(%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)',
                       row)
        # Other processing if necessary

    cur_row += 1

mydb.commit()
cursor.close()

# Store the number of the next line to process; after the loop,
# cur_row already points one past the last processed line
with open("lastline", 'w') as f:
    f.write(str(cur_row))

# Store the current file size to detect the daily flush
with open("file_size", 'w') as f:
    f.write(str(curr_file_size))

print str(cur_row)  # not necessary, but useful for debugging
print "Done"

Edit: Final code submitted by ZeroG and now working on the system! Thank you also to Xion345 for helping.

7 Comments

I like this answer, but I can't get the row number we are putting into the lastline file above 0; even print(str(cur_row)) reveals 0. Also bear in mind that when the file is flushed at 00:01:00 the line number will not be relative to the new CSV file, so I suppose we need to check the time somewhere.
Yes, you are right, the code was wrong: you need to move the cur_row += 1 statement to the end of the for loop. As regards the flush at 00:01, you need to check the current time against the write date of the lastline file.
@ZeroG: A better idea to detect whether the file has been flushed is to store the size of the CSV file in the lastline file (in addition to the last processed line). If the file size has decreased between two subsequent executions of your script, you know that the CSV file has been flushed.
OK, so we basically have it: the script runs and adds 1 to the row count in lastline every time I run it, so the first time it is 1, the second time 2, and after the row count is 1 nothing is put in the DB. Sorry, am I missing something here?
@ZeroG : Every time the script is run, the number of the last line + 1 is written to the lastline file. For example, if there are 15 lines in the CSV file the first time you run your script, the last processed line is line 14 so 15 will be written to the lastline file. If there are 100 lines the second time you run your script, lines 0-14 will be skipped, lines 15-99 will be processed and 100 will be written to the lastline file etc...

Each CSV row seems to contain a timestamp. If these are always increasing, you could query the database for the maximum timestamp already recorded, and skip all rows before that time when reading the CSV.
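A minimal sketch of that approach, assuming (hypothetically) that the date and time columns are named thedate and thetime as in the first answer, and using MySQL's TIMESTAMP(date, time) function to combine them:

import csv
import MySQLdb

mydb = MySQLdb.connect(host='localhost', user='root',
                       passwd='******', db='kestrel_keep')
cursor = mydb.cursor()

# Newest timestamp already recorded (None when the table is empty).
cursor.execute('SELECT MAX(TIMESTAMP(thedate, thetime)) FROM `heating`')
latest = cursor.fetchone()[0]

for row in csv.reader(file('data_csv.log')):
    # The first two CSV fields form a "YYYY-MM-DD HH:MM:SS" stamp,
    # which compares correctly as a string against str(latest).
    stamp = row[0] + ' ' + row[1]
    if latest is None or stamp > str(latest):
        cursor.execute('INSERT INTO `heating` VALUES '
                       '(%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)',
                       row)

mydb.commit()
cursor.close()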

