I need to parse a text table which has the following format:

-----------------------------------------
| Serial |     Name             | marks |
| Number |First | Middle | Last |       |
-----------------------------------------
| 1      | john |   s    |  doe |  56   |
| 2      | jim  |   d    |  bill|  60   |

After parsing the table, the output should be a nested dictionary with the data as lists.

TableData = {'Serial Number': [1, 2],
             'Name': {'First': ['john', 'jim'],
                      'Middle': ['s', 'd'],
                      'Last': ['doe', 'bill']},
             'marks': [56, 60]
            }

As of now I have logic to get the positions of the delimiters (|), and I can extract the text in between the delimiters.

posList = [[0,9,32,40],[0,9,16,25,32,40]]
nameList = [['Serial','Name','marks'],['Number ','First','Middle','Last','  ']]
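
For reference, a minimal sketch of how such lists might be computed from the two header lines. The helper name split_header_line is hypothetical, not the code I actually use, and unlike nameList above it strips the whitespace around each name.

```python
def split_header_line(line, delimiter="|"):
    """Return (positions, names) for one header line of the table."""
    # index of every delimiter character in the line
    positions = [i for i, ch in enumerate(line) if ch == delimiter]
    # text between each pair of neighbouring delimiters, whitespace removed
    names = [line[start + 1:end].strip()
             for start, end in zip(positions, positions[1:])]
    return positions, names


header = [
    "| Serial |     Name             | marks |",
    "| Number |First | Middle | Last |       |",
]

pos_list, name_list = [], []
for line in header:
    positions, names = split_header_line(line)
    pos_list.append(positions)
    name_list.append(names)

print(pos_list)
print(name_list)
```

Each row of positions includes the closing delimiter, so every name has a complete (start, end) interval.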

But I am having difficulty converting this to the nested dictionary structure.

  • Where are you stuck? The next step here seems to be comparing the two lists inside posList and distinguishing two cases: either the column in the second line has the same extent as in the line above (e.g. "Serial Number" between 0 and 9), or the column is divided (e.g. "Name" goes from 9 to 32, but then you have "sub-columns" 9-16, 16-25 and 25-32). Commented Jun 26, 2013 at 22:58
  • I was stuck at how to make the data structure (row-wise or column-wise). Bill's solution looks good, so I accepted this answer for now. But I was wondering if there was a different approach I could try like building a tree like structure that could parse any number of subcolumns with any level of subcolumns. Commented Jun 27, 2013 at 17:54
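
The tree-like idea from the comments could be sketched roughly as follows. This is only an illustration under stated assumptions: the names columns_at and build_tree are hypothetical, each row of pos_list is assumed to include the closing delimiter, and sub-column boundaries are assumed to line up exactly with the column above them. A column that keeps the same extent on the next header row has its labels joined; a column that is divided becomes a nested dictionary, at any depth.

```python
def columns_at(pos_list, name_list, level, start, end):
    """Yield (lo, hi, name) for the columns of one header row inside [start, end]."""
    positions = pos_list[level]
    for lo, hi, name in zip(positions, positions[1:], name_list[level]):
        if start <= lo and hi <= end:
            yield lo, hi, name.strip()


def build_tree(pos_list, name_list, level=0, start=0, end=None):
    """Build a nested dict of empty lists from any number of header rows."""
    if end is None:
        end = pos_list[level][-1]
    tree = {}
    for lo, hi, name in columns_at(pos_list, name_list, level, start, end):
        # Follow the column downwards while it is not subdivided,
        # joining the labels (e.g. 'Serial' + 'Number').
        depth = level + 1
        while depth < len(pos_list):
            children = list(columns_at(pos_list, name_list, depth, lo, hi))
            if len(children) != 1 or children[0][:2] != (lo, hi):
                break
            name = " ".join((name, children[0][2])).strip()
            depth += 1
        if depth < len(pos_list):
            # the column is divided: recurse into its sub-columns
            tree[name] = build_tree(pos_list, name_list, depth, lo, hi)
        else:
            # no header rows left: this is a leaf column
            tree[name] = []
    return tree


pos_list = [[0, 9, 32, 40], [0, 9, 16, 25, 32, 40]]
name_list = [['Serial', 'Name', 'marks'],
             ['Number ', 'First', 'Middle', 'Last', '  ']]
print(build_tree(pos_list, name_list))
# {'Serial Number': [], 'Name': {'First': [], 'Middle': [], 'Last': []}, 'marks': []}
```

Because the recursion only looks at delimiter positions, the same function should handle three or more header rows without changes.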

1 Answer

If you already know what the data structure should look like, you can skip the header lines and extract the data from the remaining rows. For example, assuming the table is stored in a text file whose path is in table_file:

table_data = {'Serial Number': [],
              'Name': {'First': [],
                       'Middle': [],
                       'Last': []},
              'marks': []}

with open(table_file, 'r') as table:
    # skip the 4 header lines: top border, two header rows, bottom border
    for _ in range(4):
        next(table)

    for row in table:
        if not row.strip():
            continue  # ignore blank lines, e.g. at the end of the file
        # drop the outer delimiters, then split the row into cells
        values = [cell.strip() for cell in row.strip().strip('|').split('|')]
        assert len(values) == 5
        table_data['Serial Number'].append(int(values[0]))
        table_data['Name']['First'].append(values[1])
        table_data['Name']['Middle'].append(values[2])
        table_data['Name']['Last'].append(values[3])
        table_data['marks'].append(int(values[4]))

EDIT: To construct the table_data dictionary itself, consider the following code. Fair warning: I tested it, and it seems to work for your example and should work for any table with two header rows. It is sloppy because I wrote it in about 10 minutes, but it could be an OK starting point to improve and expand. It also assumes you already have code for extracting pos_list and name_list.

from itertools import tee

def pairwise(iterable):
    # s -> (s0, s1), (s1, s2), (s2, s3), ...
    a, b = tee(iterable)
    next(b, None)
    return zip(a, b)

def create_table_dict(pos_list, name_list):
    # turn each row of delimiter positions into (start, end) intervals;
    # each row must include the position of the closing delimiter
    intervals = [list(pairwise(sub_list)) for sub_list in pos_list]

    # pair every interval with its stripped column name, row by row
    items = []
    for interval, name in zip(intervals, name_list):
        items.append([(i, n.strip()) for i, n in zip(interval, name)])

    names = []
    for int1, name1 in items[0]:
        past_names = []
        for int2, name2 in items[1]:
            if int1[0] == int2[0]:
                if int1[1] == int2[1]:
                    # same extent: one column labelled over two rows
                    names.append(' '.join((name1, name2)).strip())
                elif int2[1] < int1[1]:
                    # first sub-column of a divided column
                    past_names.append(name2)
            elif int1[0] < int2[0]:
                if int2[1] < int1[1]:
                    # sub-column in the middle
                    past_names.append(name2)
                elif int1[1] == int2[1]:
                    # last sub-column: record 'upper:sub1,sub2,...'
                    names.append('{0}:{1}'.format(
                        name1, ','.join(past_names + [name2])))

    # build the (possibly nested) dictionary of empty lists
    table = {}
    for name in names:
        if ':' not in name:
            table[name] = []
        else:
            upper, nested = name.split(':')
            table[upper] = {}
            for n in nested.split(','):
                table[upper][n] = []

    print(table)
    return table

2 Comments

The table parser is supposed to parse any arbitrary table with any heading that follows the table format that i posted. So we need to make the data structure as we parse the table.
I edited the post to include some somewhat sloppy code to create the table format as well.
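
Once a skeleton of empty lists exists, the data rows still have to be poured into it. A minimal sketch of that step, with hypothetical helper names leaf_lists and fill_table: it assumes the leaf lists appear in the same left-to-right order as the table columns (dictionaries keep insertion order in Python 3.7+), and it converts cells made only of digits to int, which may not suit every table.

```python
def leaf_lists(tree):
    """Yield the leaf lists of a nested dict in insertion (column) order."""
    for value in tree.values():
        if isinstance(value, dict):
            yield from leaf_lists(value)
        else:
            yield value


def fill_table(tree, rows):
    """Append the cells of each data row to the matching leaf list."""
    leaves = list(leaf_lists(tree))
    for row in rows:
        cells = [c.strip() for c in row.strip().strip("|").split("|")]
        if len(cells) != len(leaves):
            continue  # skip border lines and malformed rows
        for leaf, cell in zip(leaves, cells):
            # assumption: purely numeric cells are meant to be integers
            leaf.append(int(cell) if cell.isdigit() else cell)
    return tree


skeleton = {'Serial Number': [],
            'Name': {'First': [], 'Middle': [], 'Last': []},
            'marks': []}
rows = [
    "-----------------------------------------",
    "| 1      | john |   s    |  doe |  56   |",
    "| 2      | jim  |   d    |  bill|  60   |",
]
print(fill_table(skeleton, rows))
# {'Serial Number': [1, 2], 'Name': {'First': ['john', 'jim'], 'Middle': ['s', 'd'], 'Last': ['doe', 'bill']}, 'marks': [56, 60]}
```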
