My advice is to change the input format of your file, if possible. It would greatly simplify your life.
If this is not possible, the following code solves your problem:
import re
import pandas as pd

# Text between "(" and ")": one tuple per CV result
pattern_tuples = r'(?<=\()[^\)]*'
# Numbers preceded by a space or comma (this skips digits inside the parameter IDs)
pattern_numbers = r'[ ,](?:\d+(?:\.\d*)?|\.\d+)(?:[eE][+-]?\d+)?'
col_name = ['ngram', 'logreg', 'vocabSize', 'score']

with open('test.txt', 'r') as f:
    matches = re.findall(pattern_tuples, f.read())

arr_data = [[float(val.replace(',', '')) for val in re.findall(pattern_numbers, match)] for match in matches]
df = pd.DataFrame(arr_data, columns=col_name).astype({'ngram': 'int', 'vocabSize': 'int'})
and gives:
   ngram  logreg  vocabSize     score
0      1    0.01      10000  0.858047
1      2    0.01      10000  0.888090
2      3    0.01      10000  0.865645
Brief explanation
- Read the file.
- re.findall with the regex pattern_tuples finds all the tuples in the file.
- For each tuple, the regex pattern_numbers finds the 4 numerical values of interest. This yields a list of lists containing your data.
- Load the results into a pandas DataFrame.
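To see what the two regexes do, here is a minimal sketch that applies them to a single line. The sample string is an assumption: it is shaped like the tuple t used in the Extra section of this answer, so your real file may look slightly different.

```python
import re

# Assumed sample line, shaped like the CV tuple t shown in this answer.
sample = (
    "({'ngram_a67e6f3205f0-n': 1, "
    "'logreg_c120232d9faa-regParam': 0.01, "
    "'cntVec_9c0e7831261d-vocabSize': 10000}, 0.8580469779197205)"
)

pattern_tuples = r'(?<=\()[^\)]*'
pattern_numbers = r'[ ,](?:\d+(?:\.\d*)?|\.\d+)(?:[eE][+-]?\d+)?'

# Step 1: everything between "(" and ")" is treated as one tuple.
tuples = re.findall(pattern_tuples, sample)

# Step 2: keep only numbers preceded by a space or comma, so the digits
# inside IDs such as "ngram_a67e6f3205f0" are not picked up.
tokens = re.findall(pattern_numbers, tuples[0])

# Each token still carries its leading space or comma; float() tolerates
# the space, and the comma is removed explicitly.
row = [float(tok.replace(',', '')) for tok in tokens]

print(tokens)
print(row)
```

Each token keeps its leading separator, which is why the original code calls replace(',', '') before converting to float.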
Extra
Here's how you could save your CV results in JSON format, so you can manage them more easily:
- Create a cv_results list to hold the CV results.
- Each CV loop iteration yields a tuple t with the results; convert it to a dictionary and append it to cv_results.
- At the end of the CV loops, save the results in JSON format.
import json

range_cv = range(3)  # placeholder: replace with your own CV iterations
cv_results = []
for _ in range_cv:  # Loop CV
    # ... Calculate results of CV in t
    t = ({'ngram_a67e6f3205f0-n': 1,
          'logreg_c120232d9faa-regParam': 0.01,
          'cntVec_9c0e7831261d-vocabSize': 10000},
         0.8580469779197205)  # FAKE DATA for this example
    # Append the results as a dict
    cv_results.append({'res': t[0], 'score': t[1]})

# Store results in JSON format
with open('cv_results.json', 'w') as outfile:
    json.dump(cv_results, outfile, indent=4)
Now you can read the JSON file and access all the fields like a normal Python dictionary:
with open('cv_results.json') as json_file:
    data = json.load(json_file)
data[0]['score']
# output: 0.8580469779197205
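Once the results are stored this way, picking the best run no longer needs any regex. The sketch below is only an illustration: the records reuse the rounded scores from the table earlier in this answer, each record keeps just the ngram parameter for brevity, and the file is written to a temporary directory (an assumption, so the example does not touch your working directory).

```python
import json
import os
import tempfile

# Illustrative records: ngram values and rounded scores taken from the
# table earlier in this answer; only one parameter is kept per record.
cv_results = [
    {"res": {"ngram_a67e6f3205f0-n": n}, "score": s}
    for n, s in [(1, 0.858047), (2, 0.888090), (3, 0.865645)]
]

# Write to a temporary directory instead of the working directory.
path = os.path.join(tempfile.mkdtemp(), "cv_results.json")
with open(path, "w") as outfile:
    json.dump(cv_results, outfile, indent=4)

# Read the file back and select the record with the highest score.
with open(path) as json_file:
    data = json.load(json_file)

best = max(data, key=lambda rec: rec["score"])
print(best["res"], best["score"])
```

Because every record is a plain dictionary, sorting or filtering by any parameter works the same way.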