
I have a text file that needs to be sorted by the first column, with all repeated lines merged into one and their count placed to the left of the data. The sorted and counted data should then be written to an already existing csv file.

Example text file:

, 00.000.00.000, word, 00
, 00.000.00.001, word, 00
, 00.000.00.002, word, 00
, 00.000.00.000, word, 00
, 00.000.00.002, word, 00
, 00.000.00.000, word, 00

Expected result:

, 3, 00.000.00.000, word, 00
, 1, 00.000.00.001, word, 00
, 2, 00.000.00.002, word, 00

My code:

for ip in open("list.txt"):
    with open(ip.strip()+".txt", "a") as ip_file:
        for line in open("data.txt"):
            new_line = line.split(" ")
            if "blocked" in new_line:
                if "src="+ip.strip() in new_line:
                    ip_file.write(", " + new_line[11])
                    ip_file.write(", " + new_line[12])
                    ip_file.write(", " + new_line[13])

for ip_file in os.listdir(sub_dir):
        with open(os.path.join(sub_dir, ip_file), "a") as f:
            data = f.readlines()
            data.sort(key = lambda l: float(l.split()[0]), reverse = True)

Whenever I test the code, I get the error TypeError: 'str' object is not callable, or something similar. I can't use .split(), .read(), .strip(), etc. without getting the error.

Question: How can I sort the files' contents and count repeating lines (without defining a function)?

I'm basically trying to:

sort -k1 | uniq -c | sed 's/^/,/' >> test.csv
  • Where does the error occur? I see no point where the code could try to call a str. Commented Aug 23, 2013 at 13:55
  • For counting repetitions you could use collections.Counter or itertools.groupby(). Commented Aug 23, 2013 at 13:57
  • @Alfe: I don't either but it occurs at: data = file(f).readlines() Commented Aug 23, 2013 at 14:18
  • Did you assign file to a string elsewhere in the code? That would explain the error. You shouldn't be using file anyway, just f.readlines(). Commented Aug 23, 2013 at 14:27
  • @PauloAlmeida: file was a mistake, but I edited my code. filename = ip_file and f should be what was written to ip_file in the previous block of code. Commented Aug 23, 2013 at 14:41
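
The collections.Counter suggestion from the comments could look roughly like the sketch below. It assumes every line already has the ", ip, word, number" form shown in the question; the sample list and the variable names are made up for illustration and stand in for the real file contents.

```python
from collections import Counter

# Sample lines standing in for the contents of the text file (assumed format).
sample = [
    ", 00.000.00.000, word, 00",
    ", 00.000.00.001, word, 00",
    ", 00.000.00.002, word, 00",
    ", 00.000.00.000, word, 00",
    ", 00.000.00.002, word, 00",
    ", 00.000.00.000, word, 00",
]

# Counter maps each distinct line to the number of times it occurs.
counts = Counter(line.rstrip() for line in sample)

# Sort the distinct lines and put the count in front of each one.
# Every line already starts with ", ", so the count slots in before it.
result = [", %d%s" % (counts[line], line) for line in sorted(counts)]

for row in result:
    print(row)
```

No function definition is needed, and reading from a real file would only mean replacing sample with the open file object.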

3 Answers

D = {}
with open('data.txt') as src:
    for k in src: #use dictionary to count and filter duplicate lines
        k = k.rstrip() #drop the newline so a last line without one still matches
        if k in D:
            D[k] += 1 #increase count by one if already seen.
        else:
            D[k] = 1 #initialize key with one if seen for first time.

with open('test.csv', 'a') as out: #open the csv once, in append mode
    for sk in sorted(D): #sort keys
        #each line already starts with ", ", so the count goes directly in front of it.
        print(', %d%s' % (D[sk], sk), file=out)

#Output
, 3, 00.000.00.000, word, 00
, 1, 00.000.00.001, word, 00
, 2, 00.000.00.002, word, 00    

3 Comments

Thanks! This worked perfectly. How would I convert print to csv_file.write as write only allows one argument?
I edited the print function argument to redirect the output to a file in append mode. Please let me know if that will do it for you.
That worked so perfectly. I have been working on this problem for way too long. Thanks so very much!
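
For the csv_file.write question above, one possible approach is to build the whole row as a single string first, since write takes exactly one string and does not add a newline. A minimal sketch, in which the counts dictionary and the temporary file path are stand-ins chosen for illustration:

```python
import os
import tempfile

# Hypothetical counts, in the same shape as the dictionary D in the answer.
counts = {
    ", 00.000.00.001, word, 00": 1,
    ", 00.000.00.000, word, 00": 3,
}

# A temporary path stands in for the real test.csv.
csv_path = os.path.join(tempfile.mkdtemp(), "test.csv")

with open(csv_path, "a") as csv_file:
    for line in sorted(counts):
        # write() takes one string and adds no newline, so format the row first.
        csv_file.write(", %d%s\n" % (counts[line], line))

# Read the file back to check what was written.
with open(csv_path) as csv_file:
    written = csv_file.read()
```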

How about this:

import itertools

# named "lines" rather than "input" so the built-in input() is not shadowed
lines = ''', 00.000.00.000, word, 00
, 00.000.00.001, word, 00
, 00.000.00.002, word, 00
, 00.000.00.000, word, 00
, 00.000.00.002, word, 00
, 00.000.00.000, word, 00'''.split('\n')

# groupby only merges adjacent items, so sort by the same key first
lines.sort(key=lambda line: line.split(',')[1])

for key, values in itertools.groupby(lines, lambda line: line.split(',')[1]):
  values = list(values)
  print(', %d%s' % (len(values), values[0]))

This lacks all error checking (like unfit lines etc.), but maybe you can add that yourself according to your needs. Also, the split is performed twice; once for the sorting and once for the grouping. That probably can be improved.
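
One way the double split might be avoided is to split each line once, keep the resulting key next to the original line, and then sort and group on that stored key. This is only a sketch under the same assumptions as above (well-formed lines, no error checking); the names keyed and result are illustrative.

```python
import itertools
from operator import itemgetter

lines = [
    ", 00.000.00.000, word, 00",
    ", 00.000.00.001, word, 00",
    ", 00.000.00.002, word, 00",
    ", 00.000.00.000, word, 00",
    ", 00.000.00.002, word, 00",
    ", 00.000.00.000, word, 00",
]

# Split each line exactly once and pair the key with the original line.
keyed = sorted((line.split(",")[1], line) for line in lines)

result = []
# Group on the stored key (element 0 of each pair) instead of splitting again.
for key, group in itertools.groupby(keyed, key=itemgetter(0)):
    group = list(group)
    # group[0][1] is the first original line carrying this key.
    result.append(", %d%s" % (len(group), group[0][1]))

for row in result:
    print(row)
```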


I would consider using the pandas data processing module:

import pandas as pd
# raw string so the backslashes in the Windows path are not treated as escapes
my_data = pd.read_csv(r"C:\Where My Data Lives\Data.txt", header=None)
sorted_data = my_data.sort_values(by=1)                # sort the data by column 1 (sort_index(by=...) no longer exists)
sorted_data = sorted_data.drop_duplicates(subset=[1])  # leaves only unique values, still in sorted order
counted_data = list(my_data.groupby(1).size())         # counts each unique value (groupby sorts its keys), converts to a list
sorted_data[0] = counted_data                          # inserts the counts into column 0 of the data frame

3 Comments

The only problem with using pandas is that the package has to be installed, and everything must be done from the .py script, with no work required from the user.
@hjames, I am not sure I follow your statement. Are you looking for a base Python solution (no extra modules)? This script will not require the user to do anything besides type and run the code :)
Hm, I just assumed that I would need to install the pandas package in order to import the module?
