
I printed out the composed array and saved it to a text file. It looks like this:

({
    ngram_a67e6f3205f0-n: 1,
    logreg_c120232d9faa-regParam: 0.01,
    cntVec_9c0e7831261d-vocabSize: 10000
},0.8580469779197205)
({
    ngram_a67e6f3205f0-n: 2,
    logreg_c120232d9faa-regParam: 0.01,
    cntVec_9c0e7831261d-vocabSize: 10000
},0.8880895806519427)
({
    ngram_a67e6f3205f0-n: 3,
    logreg_c120232d9faa-regParam: 0.01,
    cntVec_9c0e7831261d-vocabSize: 10000
},0.8656452460818544)

I would like to extract the data to produce a pandas DataFrame, like this:

1, 10000, 0.8580469779197205
2, 10000, 0.8880895806519427
  • You saved it to a txt file exactly like that? Commented Oct 4, 2019 at 0:48
  • Yes, the content of the file is the result of cross-validation. I printed it out, then copied it to a file. Commented Oct 4, 2019 at 0:59

2 Answers


My advice is to change the input format of your file, if possible. It would greatly simplify your life.
If this is not possible, the following code solves your problem:

import pandas as pd
import re

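# raw strings keep the regex backslashes from being read as string escapes
pattern_tuples = r'(?<=\()[^\)]*'  # everything between "(" and ")"
pattern_numbers = r'[ ,](?:\d+(?:\.\d*)?|\.\d+)(?:[eE][+-]?\d+)?'  # a number preceded by a space or comma
col_name = ['ngram', 'logreg', 'vocabSize', 'score']

with open('test.txt', 'r') as f:
    matches = re.findall(pattern_tuples, f.read())

# one row of floats per tuple; strip the leading comma before converting
arr_data = [[float(val.replace(',', '')) for val in re.findall(pattern_numbers, match)] for match in matches]
df = pd.DataFrame(arr_data, columns=col_name).astype({'ngram': 'int', 'vocabSize': 'int'})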

and gives:

   ngram  logreg  vocabSize     score
0      1    0.01      10000  0.858047
1      2    0.01      10000  0.888090
2      3    0.01      10000  0.865645

Brief explanation

  1. Read the file.
  2. Use re.findall with the regex pattern_tuples to find all the tuples in the file.
  3. For each tuple, use the regex pattern_numbers to find the 4 numeric values you are interested in. This gives you a list of lists containing your data.
  4. Load the results into a pandas DataFrame.
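To make steps 2 and 3 easier to follow, here is a small sketch that runs the same two patterns on an in-memory string instead of a file and prints the intermediate results. The sample text is copied from the question (only the first two tuples); the variable names text, tuples, tokens and rows are just for this illustration.

```python
import re

# Sample text copied from the question (first two tuples only).
text = """({
    ngram_a67e6f3205f0-n: 1,
    logreg_c120232d9faa-regParam: 0.01,
    cntVec_9c0e7831261d-vocabSize: 10000
},0.8580469779197205)
({
    ngram_a67e6f3205f0-n: 2,
    logreg_c120232d9faa-regParam: 0.01,
    cntVec_9c0e7831261d-vocabSize: 10000
},0.8880895806519427)"""

pattern_tuples = r'(?<=\()[^\)]*'
pattern_numbers = r'[ ,](?:\d+(?:\.\d*)?|\.\d+)(?:[eE][+-]?\d+)?'

# Step 2: one string per tuple, i.e. everything between "(" and ")".
tuples = re.findall(pattern_tuples, text)
print(len(tuples))

# Step 3: the raw numeric tokens still carry their leading space or comma.
tokens = re.findall(pattern_numbers, tuples[0])
print(tokens)

# Strip the comma and convert; float() ignores the leading space.
rows = [[float(tok.replace(',', '')) for tok in re.findall(pattern_numbers, t)]
        for t in tuples]
print(rows)
```

Note that the digits inside the stage identifiers (for example a67e6f3205f0) are skipped because they are not preceded by a space or a comma. If your identifiers or values are formatted differently, the pattern may need adjusting.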


Extra

Here's how you could save your CV results in json format, so you can manage them more easily:

  1. Create a cv_results list to hold the CV results.

  2. On each CV iteration you get a tuple t with the results; convert it to a dictionary and append it to cv_results.

  3. After the CV loop finishes, save the results in JSON format.


import json

cv_results = []
range_cv = range(3)  # placeholder (assumption): stands in for your real CV iterations

for _ in range_cv: # Loop CV
    # ... Calculate results of CV in t
    t = ({'ngram_a67e6f3205f0-n': 1,
       'logreg_c120232d9faa-regParam': 0.01,
       'cntVec_9c0e7831261d-vocabSize': 10000},
      0.8580469779197205) # FAKE DATA for this example

    # append the results as a dict
    cv_results.append({'res':t[0], 'score':t[1]})

# Store results in json format
with open('cv_results.json', 'w') as outfile:
    json.dump(cv_results, outfile, indent=4)

Now you can read the JSON file and access all the fields like a normal Python dictionary:

with open('cv_results.json') as json_file:
    data = json.load(json_file)

data[0]['score']
# output: 0.8580469779197205
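From there, getting to a DataFrame is short. The following is a minimal sketch, assuming data has the same shape as the list saved above (a 'res' dictionary plus a 'score' per entry). The inline data list stands in for what json.load returns, and shortening the column names by splitting on the last hyphen is my own choice for this example.

```python
import pandas as pd

# Stand-in for the list returned by json.load(json_file);
# the values are the first two results from the question.
data = [
    {"res": {"ngram_a67e6f3205f0-n": 1,
             "logreg_c120232d9faa-regParam": 0.01,
             "cntVec_9c0e7831261d-vocabSize": 10000},
     "score": 0.8580469779197205},
    {"res": {"ngram_a67e6f3205f0-n": 2,
             "logreg_c120232d9faa-regParam": 0.01,
             "cntVec_9c0e7831261d-vocabSize": 10000},
     "score": 0.8880895806519427},
]

# Flatten each entry: parameters and score side by side in one dict.
rows = [{**item["res"], "score": item["score"]} for item in data]
df = pd.DataFrame(rows)

# Keep only the part after the last hyphen, e.g. "ngram_a67e6f3205f0-n" -> "n".
df.columns = [col.split("-")[-1] for col in df.columns]
print(df)
```

Because the columns come from the dictionary keys, the result does not depend on the order in which the numbers appear in the file, which is the main advantage over the regex approach.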

2 Comments

Yes, it is powerful!
If you want to switch to JSON, I updated the answer with some advice. Good luck :) @IvanLee

Why not do:

import pandas as pd
with open('file.txt') as file:
    df = pd.DataFrame([i for i in eval(file.readline())])

eval takes a string and evaluates it as a Python expression, which is pretty nifty. That would convert each parenthetical into a single-item iterator, which is then stored in a list. The pd.DataFrame class can take a list of dictionaries with identical keys and create a DataFrame.
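One caveat: in the text from the question the dictionary keys are unquoted and each tuple spans several lines, so the text is not a Python literal as it stands. A possible workaround, sketched below under the assumption that keys contain only word characters and hyphens, is to quote the keys with a regex first and then parse each tuple. The sketch uses ast.literal_eval rather than eval, since it only accepts literals. The names record_pattern, key_pattern and records are made up for this example.

```python
import ast
import re

# Sample text copied from the question (first two tuples only).
text = """({
    ngram_a67e6f3205f0-n: 1,
    logreg_c120232d9faa-regParam: 0.01,
    cntVec_9c0e7831261d-vocabSize: 10000
},0.8580469779197205)
({
    ngram_a67e6f3205f0-n: 2,
    logreg_c120232d9faa-regParam: 0.01,
    cntVec_9c0e7831261d-vocabSize: 10000
},0.8880895806519427)"""

# Each record looks like "({ ... },score)"; DOTALL lets "." span line breaks.
record_pattern = re.compile(r'\(\{.*?\},[^)]*\)', re.DOTALL)
# Bare keys: word characters and hyphens followed by a colon (assumption).
key_pattern = re.compile(r'([\w-]+)\s*:')

records = []
for chunk in record_pattern.findall(text):
    quoted = key_pattern.sub(r'"\1":', chunk)         # wrap each key in quotes
    params, score = ast.literal_eval(quoted)          # -> (dict, float)
    records.append({**params, 'score': score})

print(records[0])
```

The resulting records list is a list of dictionaries with identical keys, so it can be passed straight to pd.DataFrame(records).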

1 Comment

Have you tried this on the given text? I don't think it works
