Split one column in CSV file into multiple columns while grouping the data in Python (without Pandas)

Question

I am currently learning Python and would like some help with one of my questions. I have a ";"-separated file (shown below) that I am trying to clean up so I can extract some data in Excel and CSV format.

My raw CSV file:

    COUNTRY  COUNTRY_TIME             COUNTRY_REF     PRODUCT
    FRANCE   FRANCE20180222.16.30.00  FRANCE20180221  APPLE%BOX%LYON%022018
    FRANCE   FRANCE20180222.16.30.00  FRANCE20180221  APPLE%BOX%LYON%032018
    FRANCE   FRANCE20180222.16.30.00  FRANCE20180221  APPLE%BOX%LYON%052018
    FRANCE   FRANCE20180222.16.30.00  FRANCE20180221  APPLE%BOX%LYON%062018
    FRANCE   FRANCE20180222.16.30.00  FRANCE20180221  APPLE%BOX%NICE%032018
    FRANCE   FRANCE20180222.16.30.00  FRANCE20180221  APPLE%BOX%LILLE%022018
    FRANCE   FRANCE20180222.16.30.00  FRANCE20180221  APPLE%BOX%NEM%022018
    FRANCE   FRANCE20180222.16.30.00  FRANCE20180221  APPLE%COVER%CWF%022018
    FRANCE   FRANCE20180222.16.30.00  FRANCE20180221  APPLE%COVER%FZF%022018
    FRANCE   FRANCE20180222.16.30.00  FRANCE20180221  APPLE%COVER%MX1%022018
    FRANCE   FRANCE20180222.16.30.00  FRANCE20180221  APPLE%BIGBOX%DIJON%022018
    SWEDEN   SWEDEN20180223.02.11.00  SWEDEN20180222  APPLE%SMALLBOX%BODEN%012019
    SWEDEN   SWEDEN20180223.02.11.00  SWEDEN20180222  APPLE%SMALLBOX%BODEN%022019
    SWEDEN   SWEDEN20180223.02.11.00  SWEDEN20180222  APPLE%SMALLBOX%BODEN%032018
    SWEDEN   SWEDEN20180223.02.11.00  SWEDEN20180222  APPLE%SMALLBOX%BODEN%042018
    SWEDEN   SWEDEN20180223.02.11.00  SWEDEN20180222  APPLE%SMALLBOX%BODEN%052018
    SWEDEN   SWEDEN20180223.02.11.00  SWEDEN20180222  APPLE%SMALLBOX%BODEN%062018
    SWEDEN   SWEDEN20180223.02.11.00  SWEDEN20180222  APPLE%SMALLBOX%FLEN%012019
    SWEDEN   SWEDEN20180223.02.11.00  SWEDEN20180222  APPLE%SMALLBOX%FLEN%032018
    SWEDEN   SWEDEN20180223.02.11.00  SWEDEN20180222  APPLE%SMALLBOX%FLEN%042018
    SWEDEN   SWEDEN20180223.02.11.00  SWEDEN20180222  APPLE%SMALLBOX%FLEN%052018
    SWEDEN   SWEDEN20180223.02.11.00  SWEDEN20180222  APPLE%SMALLBOX%FLEN%062018

My final expected data should look like this:

COUNTRY EXCHANGE_CODE   TOWN_CODE   MONTH_CODE
FRANCE  BOX              LYON       022018;032018;052018;062018
FRANCE  BOX              NICE       032018
FRANCE  BOX              LILLE      022018
FRANCE  BOX              NEM        022018
FRANCE  COVER            CWF        022018
FRANCE  COVER            FZF        022018
FRANCE  COVER            MX1        022018
FRANCE  BIGBOX           DIJON      022018
SWEDEN  SMALLBOX         BODEN      012019;022019;032018;042018;052018;062018
SWEDEN  SMALLBOX         FLEN       012019;032018;042018;052018;062018

I have created the script below, but I have only been able to get as far as the table shown after it.

import csv
import os
from collections import defaultdict, OrderedDict
import itertools
from operator import itemgetter

in_path = os.path.expanduser("~/Desktop/FUTURES.csv")
out_path = os.path.expanduser("~/Desktop/Finalresult.csv")

with open(in_path, 'r') as f_in, open(out_path, 'w', newline='') as f_out:
    csv_reader = csv.reader(f_in, delimiter=';')
    writer = csv.writer(f_out)

    all = []
    row = next(csv_reader)
    row.append('LFU')
    row.append('EXCHANGE_CODE')
    row.append('TOWN_CODE')
    row.append('MONTH_CODE')
    all.append(row)

    for row in csv_reader:
        if row[0] in ['FRANCE', 'SWEDEN']:

            row.append(row[3].split('%')[0])
            row.append(row[3].split('%')[1])
            row.append(row[3].split('%')[2])
            row.append(row[3].split('%')[3])
            all.append(row)

    writer.writerows(map(itemgetter(0, 5, 6, 7), all))

My current result:

COUNTRY FRUIT   EXCHANGE_CODE   TOWN_CODE   MONTH_CODE
FRANCE  APPLE   BOX                 LYON    022018
FRANCE  APPLE   BOX                 LYON    032018
FRANCE  APPLE   BOX                 LYON    052018
FRANCE  APPLE   BOX                 LYON    062018
FRANCE  APPLE   BOX                 NICE    032018
FRANCE  APPLE   BOX                 LILLE   022018
FRANCE  APPLE   BOX                 NEM     022018
FRANCE  APPLE   COVER               CWF     022018
FRANCE  APPLE   COVER               FZF     022018
FRANCE  APPLE   COVER               MX1     022018
FRANCE  APPLE   BIGBOX              DIJON   022018
SWEDEN  APPLE   SMALLBOX            BODEN   012019
SWEDEN  APPLE   SMALLBOX            BODEN   022019
SWEDEN  APPLE   SMALLBOX            BODEN   032018
SWEDEN  APPLE   SMALLBOX            BODEN   042018
SWEDEN  APPLE   SMALLBOX            BODEN   052018
SWEDEN  APPLE   SMALLBOX            BODEN   062018
SWEDEN  APPLE   SMALLBOX            FLEN    012019
SWEDEN  APPLE   SMALLBOX            FLEN    032018
SWEDEN  APPLE   SMALLBOX            FLEN    042018
SWEDEN  APPLE   SMALLBOX            FLEN    052018
SWEDEN  APPLE   SMALLBOX            FLEN    062018

I would really appreciate any help that I can get.

P.S. I don't want to use Pandas or NumPy.

You shouldn't write to row while iterating through it. This can lead to unexpected results.
– MrLeeh, Commented Apr 26, 2018 at 6:19
Is there a specific reason you don't want to use pandas? I wrote an answer with pandas but deleted it once I saw your P.S. :)
– zipa, Commented Apr 26, 2018 at 11:30
@zipa - Actually, I have already used Pandas for this, but when I made an .exe application using PyQt5 it became huge, around 500 MB, mostly because of Pandas and NumPy. Here is my code using Pandas: link
– Ashwaq, Commented Apr 26, 2018 at 11:46

Serge Ballesta · Accepted Answer · 2018-04-26 12:10:21Z

1

You cannot write one output line for each line read, because a single output line can be composed from several input lines. But if you can assume that the input file is sorted by COUNTRY, EXCHANGE_CODE and TOWN, you can simply append the new month to the previous line whenever COUNTRY, EXCHANGE_CODE and TOWN are the same.

Your code could become:

...
with open(in_path, 'r') as f_in, open(out_path, 'w', newline='') as f_out:
    csv_reader = csv.reader(f_in, delimiter=';')
    writer = csv.writer(f_out)

    all = []
    row = next(csv_reader)
    row.append('LFU')
    row.append('EXCHANGE_CODE')
    row.append('TOWN_CODE')
    row.append('MONTH_CODE')

    old = row                       # just remember it

    for row in csv_reader:
        if row[0] in ['FRANCE', 'SWEDEN']:

            row.append(row[3].split('%')[0])
            row.append(row[3].split('%')[1])
            row.append(row[3].split('%')[2])
            row.append(row[3].split('%')[3])
            if row[0] == old[0] and row[5] == old[5] and row[6] == old[6]:
                old[7] += ';' + row[7]
            else:
                all.append(old)                      # write down previous row
                old = row
    all.append(old)                                  # do not forget last row

    writer.writerows(map(itemgetter(0, 5, 6, 7), all))
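
The same "merge consecutive rows" idea can also be sketched with `itertools.groupby`, which groups adjacent items sharing a key. This is only an illustration under the same assumption as above (rows for one COUNTRY/EXCHANGE_CODE/TOWN are adjacent in the input). The helper names `parse` and `merge_consecutive` and the in-memory sample are made up for the example, and the FRANCE/SWEDEN filter is left out for brevity.

```python
import csv
import io
from itertools import groupby


def parse(row):
    """Return (country, exchange, town, month) for one input row."""
    # PRODUCT looks like APPLE%BOX%LYON%022018; the fruit name is dropped.
    _fruit, exchange, town, month = row[3].split('%')
    return row[0], exchange, town, month


def merge_consecutive(rows):
    """Yield one output row per run of rows sharing country/exchange/town."""
    parsed = (parse(row) for row in rows)
    # groupby only merges ADJACENT rows, hence the sorted-input assumption.
    for key, run in groupby(parsed, key=lambda rec: rec[:3]):
        yield [*key, ';'.join(rec[3] for rec in run)]


# Hypothetical in-memory sample standing in for the real file.
sample = io.StringIO(
    "COUNTRY;COUNTRY_TIME;COUNTRY_REF;PRODUCT\n"
    "FRANCE;FRANCE20180222.16.30.00;FRANCE20180221;APPLE%BOX%LYON%022018\n"
    "FRANCE;FRANCE20180222.16.30.00;FRANCE20180221;APPLE%BOX%LYON%032018\n"
    "FRANCE;FRANCE20180222.16.30.00;FRANCE20180221;APPLE%BOX%NICE%032018\n"
    "SWEDEN;SWEDEN20180223.02.11.00;SWEDEN20180222;APPLE%SMALLBOX%FLEN%012019\n"
)
reader = csv.reader(sample, delimiter=';')
next(reader)  # skip the header line
merged = list(merge_consecutive(reader))
for out_row in merged:
    print(out_row)
```

With a real file, `reader` would wrap the opened input file instead of `sample`, and the rows in `merged` could be passed to `csv.writer(...).writerows`.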

answered Apr 26, 2018 at 12:10


1 Comment

Ashwaq Over a year ago

Thank you Serge, this is what I was breaking my head for.

zipa · Accepted Answer · 2018-04-26 14:00:34Z

1

If you want to omit all the libraries, here is a solution without imports:

with open('smntg.csv') as fin, open('smntg_else.csv', 'w') as fout:
    header = ['COUNTRY', 'EXCHANGE_CODE', 'TOWN_CODE', 'MONTH_CODE']
    data = fin.readlines()
    # Strip line endings and skip the input header line.
    needed = list(map(str.strip, data))[1:]
    dealtWith = []
    for line in needed:
        apart = line.split(';')
        country = apart[0]
        # PRODUCT is FRUIT%EXCHANGE%TOWN%MONTH; drop the fruit.
        exchange, town, month = apart[-1].split('%')[1:]
        dealtWith.append([country, exchange, town, month])
    # Group the months by (country, exchange, town).
    packed = {tuple(dealtWith[0][:3]): [dealtWith[0][3]]}
    for item in dealtWith[1:]:
        key = tuple(item[:3])
        value = item[3]
        if key in packed:
            packed[key].append(value)
        else:
            packed[key] = [value]
    # Join each group's months with ';' and turn the groups back into rows.
    joined = {k: ';'.join(v) for k, v in packed.items()}
    finalized = [list(i) + [j] for i, j in joined.items()]
    finalized.sort()  # rows come out in alphabetical order, not input order
    commaDelimited = [','.join(fline) + '\n' for fline in finalized]
    fout.write(','.join(header) + '\n')
    fout.writelines(commaDelimited)
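
For readers who find the version above dense, here is a shorter sketch of the same dictionary-grouping idea. Unlike the code above it does use the standard-library `csv` module (still no Pandas or NumPy), and it keeps the groups in first-seen order instead of sorting them, relying on dictionaries preserving insertion order (Python 3.7+). The function name `group_months` and the in-memory sample are hypothetical, chosen only for this illustration.

```python
import csv
import io


def group_months(rows):
    """Map (country, exchange, town) -> list of months, in first-seen order."""
    groups = {}
    for row in rows:
        if not row:  # csv.reader yields [] for blank lines; skip them
            continue
        # PRODUCT looks like APPLE%BOX%LYON%022018; the fruit name is dropped.
        _fruit, exchange, town, month = row[3].split('%')
        groups.setdefault((row[0], exchange, town), []).append(month)
    return groups


# Hypothetical sample; note that the two LYON rows are NOT adjacent.
sample = io.StringIO(
    "COUNTRY;COUNTRY_TIME;COUNTRY_REF;PRODUCT\n"
    "FRANCE;FRANCE20180222.16.30.00;FRANCE20180221;APPLE%BOX%LYON%022018\n"
    "FRANCE;FRANCE20180222.16.30.00;FRANCE20180221;APPLE%BOX%NICE%032018\n"
    "FRANCE;FRANCE20180222.16.30.00;FRANCE20180221;APPLE%BOX%LYON%032018\n"
)
reader = csv.reader(sample, delimiter=';')
next(reader)  # skip the input header
groups = group_months(reader)

out = io.StringIO()  # stands in for the output file
writer = csv.writer(out, lineterminator='\n')
writer.writerow(['COUNTRY', 'EXCHANGE_CODE', 'TOWN_CODE', 'MONTH_CODE'])
for key, months in groups.items():
    writer.writerow([*key, ';'.join(months)])
result = out.getvalue()
print(result)
```

Because the grouping goes through a dictionary, this variant does not need the input to be sorted; rows of the same group are merged even when other rows sit between them.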

edited Apr 26, 2018 at 14:00

answered Apr 26, 2018 at 11:26


1 Comment

Ashwaq Over a year ago

Thank you for the code. As of now I am trying to understand how it works. (It looks a bit complicated for a novice like me :p)
