Sort text file by first column and count repeats python

Question

I have a text file that needs to be sorted by the first column, with all repeated lines merged and their count placed to the left of the data. The sorted and counted data should then be written to an existing csv file.

Example text file:

, 00.000.00.000, word, 00
, 00.000.00.001, word, 00
, 00.000.00.002, word, 00
, 00.000.00.000, word, 00
, 00.000.00.002, word, 00
, 00.000.00.000, word, 00

Example result:

, 3, 00.000.00.000, word, 00
, 1, 00.000.00.001, word, 00
, 2, 00.000.00.002, word, 00

My code:

for ip in open("list.txt"):
    with open(ip.strip()+".txt", "a") as ip_file:
        for line in open("data.txt"):
            new_line = line.split(" ")
            if "blocked" in new_line:
                if "src="+ip.strip() in new_line:
                    ip_file.write(", " + new_line[11])
                    ip_file.write(", " + new_line[12])
                    ip_file.write(", " + new_line[13])

for ip_file in os.listdir(sub_dir):
        with open(os.path.join(sub_dir, ip_file), "a") as f:
            data = f.readlines()
            data.sort(key = lambda l: float(l.split()[0]), reverse = True)

Whenever I test the code, I get the error TypeError: 'str' object is not callable, or something similar. I can't use .split(), .read(), .strip(), etc. without getting the error.

Question: How can I sort the files' contents and count repeating lines (without defining a function)?

I'm basically trying to:

sort -k1 | uniq -c | sed 's/^/,/' >> test.csv

Where does the error occur? I see no point where the code could try to call a str.
– Alfe, Commented Aug 23, 2013 at 13:55
For counting repetitions you could use collections.Counter or itertools.groupby().
– Alfe, Commented Aug 23, 2013 at 13:57
@Alfe: I don't either, but it occurs at: data = file(f).readlines()
– hjames, Commented Aug 23, 2013 at 14:18
Did you assign file to a string elsewhere in the code? That would explain the error. You shouldn't be using file anyway, just f.readlines().
– Paulo Almeida, Commented Aug 23, 2013 at 14:27
@PauloAlmeida: file was a mistake, but I edited my code. filename = ip_file and f should be what was written to ip_file in the previous block of code.
– hjames, Commented Aug 23, 2013 at 14:41
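
The failure mode suggested in the comments above can be reproduced in a few lines. This is only a minimal sketch, assuming that a name which once referred to a callable was rebound to a string earlier in the script; the name `reader` and the value "data.txt" are hypothetical stand-ins, since the original assignment is not shown in the question.

```python
# Hypothetical reproduction: a name that referred to a callable is
# rebound to a string, so "calling" it later raises TypeError.
reader = open          # reader starts out as a callable (the builtin open)
reader = "data.txt"    # ...and is later rebound to a str by mistake

try:
    reader("data.txt")  # this attempts to call a str
except TypeError as exc:
    error_message = str(exc)

print(error_message)
```

If the traceback points at a line that looks harmless, checking whether any name on that line was assigned a string elsewhere is a reasonable first step.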

edi_allen · Accepted Answer · 2013-08-23 17:37:13Z

1

D = {}
for k in open('data.txt'):  # use a dictionary to count duplicate lines
    if k in D:
        D[k] += 1  # increment the count if the line was already seen
    else:
        D[k] = 1   # initialize the count the first time a line is seen

with open('test.csv', 'a') as out:  # open the output once, in append mode
    for sk in sorted(D):  # iterate over the unique lines in sorted order
        # each line already starts with ", ", so this yields ", <count>, <data>"
        print(', %d%s' % (D[sk], sk.rstrip()), file=out)

#Output
, 3, 00.000.00.000, word, 00
, 1, 00.000.00.001, word, 00
, 2, 00.000.00.002, word, 00
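
The same counting could also be sketched with collections.Counter, which was suggested in the comments on the question. This is an illustrative variant rather than part of the original answer: the `sample` list is an assumption standing in for the lines of data.txt, and stripping line endings before counting is an added choice so that a final line without a trailing newline is not counted separately.

```python
from collections import Counter

# Sample lines standing in for the contents of data.txt (assumption).
sample = [
    ", 00.000.00.000, word, 00\n",
    ", 00.000.00.001, word, 00\n",
    ", 00.000.00.002, word, 00\n",
    ", 00.000.00.000, word, 00\n",
    ", 00.000.00.002, word, 00\n",
    ", 00.000.00.000, word, 00",   # last line without a trailing newline
]

# Count each distinct line after stripping line endings; skip blank lines.
counts = Counter(line.rstrip() for line in sample if line.strip())

# Build ", <count><line>" rows in sorted order of the unique lines.
rows = [', %d%s' % (counts[line], line) for line in sorted(counts)]

for row in rows:
    print(row)
```

To read from the real file instead, `sample` could be replaced by an open file object, since iterating over a file yields its lines.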

edited Aug 23, 2013 at 17:37

answered Aug 23, 2013 at 16:45


3 Comments

hjames Over a year ago

Thanks! This worked perfectly. How would I convert print to csv_file.write as write only allows one argument?

edi_allen Over a year ago

I edited the print function argument to redirect the output to a file in append mode. Please let me know if that will do it for you.

hjames Over a year ago

That worked so perfectly. I have been working on this problem for way too long. Thanks so very much!

Alfe · Accepted Answer · 2013-08-23 14:08:31Z

1

How about this:

import itertools

lines = ''', 00.000.00.000, word, 00
, 00.000.00.001, word, 00
, 00.000.00.002, word, 00
, 00.000.00.000, word, 00
, 00.000.00.002, word, 00
, 00.000.00.000, word, 00'''.split('\n')

# groupby only merges adjacent items, so sort by the same key first
lines.sort(key=lambda line: line.split(',')[1])

for key, values in itertools.groupby(lines, lambda line: line.split(',')[1]):
    values = list(values)  # materialize the group so it can be counted
    print(', %d%s' % (len(values), values[0]))

This lacks all error checking (for malformed lines, etc.), but you can add that yourself according to your needs. Also, the split is performed twice: once for the sorting and once for the grouping. That can probably be improved.
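
One way the two points above might be addressed is to split each line once, keep the key next to the line, and set aside lines that do not fit. The following is a hedged sketch, not the answer's own code: the names `raw_lines`, `keyed`, `skipped`, and `rows` are made up for illustration, and the rule "a valid line has exactly four comma-separated fields" is an assumption based on the sample data.

```python
import itertools

# Sample input with two deliberately malformed entries (assumption).
raw_lines = [
    ", 00.000.00.000, word, 00",
    ", 00.000.00.001, word, 00",
    "garbage",
    ", 00.000.00.002, word, 00",
    ", 00.000.00.000, word, 00",
    ", 00.000.00.002, word, 00",
    ", 00.000.00.000, word, 00",
    "",
]

keyed = []    # (key, line) pairs for well-formed lines
skipped = []  # lines that do not have exactly four fields
for line in raw_lines:
    fields = line.split(',')   # split once per line
    if len(fields) != 4:
        skipped.append(line)
        continue
    keyed.append((fields[1].strip(), line))

# groupby only merges adjacent items, so sort by the stored key first.
keyed.sort(key=lambda pair: pair[0])

rows = []
for key, group in itertools.groupby(keyed, key=lambda pair: pair[0]):
    group = list(group)
    rows.append(', %d%s' % (len(group), group[0][1]))

for row in rows:
    print(row)
```

Keeping the skipped lines in a list rather than discarding them silently makes it easier to report or log unexpected input later.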

answered Aug 23, 2013 at 14:08


thron of three · Accepted Answer · 2013-08-23 15:02:46Z

0

I would consider using the pandas data-processing library:

import pandas as pd
my_data = pd.read_csv(r"C:\Where My Data Lives\Data.txt", header=None)  # raw string keeps the backslashes literal
sorted_data = my_data.sort_values(by=1)                # sort by column 1 (sort_index(by=...) is no longer available in current pandas)
sorted_data = sorted_data.drop_duplicates(subset=[1])  # leaves only unique values, still in sorted order
counted_data = list(my_data.groupby(1).size())         # counts each unique value (groupby sorts its keys), converts to a list
sorted_data[0] = counted_data                          # inserts the counts into the empty leading column

answered Aug 23, 2013 at 15:02


3 Comments

hjames Over a year ago

The only problem with using pandas is that the package has to be installed, and everything must be done from the Python script, with no work required from the user.

thron of three Over a year ago

@hjames, I am not sure I follow your statement. Are you looking for a base Python solution (no extra modules)? This script will not require the user to do anything besides type and run the code :)

hjames Over a year ago

Hm, I just assumed that I would need to install the pandas package in order to import the module?
