I have an auto-generated bibliography file which stores my references. The citekeys in the generated file are of the form xxxxx:2009tb. Is there a way to have a program detect such a pattern and change the citekeys to the form xxxxx:2009?
- Use a Python regexp: read the file line by line, get the string, then replace it. tutorialspoint.com/python/string_replace.htm – linello, Nov 7, 2012 at 11:30
- Won't a simple replace of :2009tb with :2009 work for you? – Yevgen Yampolskiy, Nov 7, 2012 at 11:49
- It's difficult to get a pattern from only one example. Could you post, say, five to ten different occurrences of these references as they appear, and the corresponding desired outputs? – heltonbiker, Nov 7, 2012 at 12:02
2 Answers
It's not quite clear to me which expression you want to match, but you can build everything with a regex, using import re and re.sub as shown. [0-9]{4} matches exactly four digits (whereas [0-9]*4 would match any run of digits followed by a literal 4, which never matches 2009). (Edit, to incorporate suggestions)
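import re

inf = 'temp.txt'
outf = 'out.txt'
# the with block closes both files on exit, so no explicit close() is needed
with open(inf) as f, open(outf, 'w') as o:
    text = f.read()
    # capture the key and the four-digit year, and drop the trailing "tb"
    text = re.sub(r"(xxxxx:[0-9]{4})tb", r"\1", text)  # match your regex here
    o.write(text)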
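The literal xxxxx in the snippet above stands in for whatever key name the file actually contains. As a minimal sketch, assuming every citekey has the shape name:YYYY followed by exactly two lowercase letters (the question only shows one example, so this is a guess), the prefix can be generalized as below. The names CITEKEY, shorten_citekeys and shorten_citekeys_in_file are made up for this illustration.

```python
import re

# Assumed key shape: a name, a colon, a four-digit year, two lowercase letters.
CITEKEY = re.compile(r"\b(\w+:[0-9]{4})[a-z]{2}\b")

def shorten_citekeys(text):
    """Drop the two-letter suffix from every key shaped like name:2009tb."""
    return CITEKEY.sub(r"\1", text)

def shorten_citekeys_in_file(src, dst):
    """Read src, shorten the keys, and write the result to dst."""
    with open(src) as f, open(dst, "w") as o:
        o.write(shorten_citekeys(f.read()))

# Hypothetical BibTeX-style entry, used only to exercise the function.
sample = "@article{newton:2009cb,\n  title = {Gravity},\n}\n"
print(shorten_citekeys(sample))
```

One thing to check before running this on a real file: two keys that differ only in their suffix (say smith:2009aa and smith:2009ab) would both become smith:2009 and collide.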
5 Comments
- … with when dealing with files instead of open/close.
- … xxxxx:2009tb, but actually a PATTERN containing some (undefined) string followed by a colon and a year date and some letters.
- … tb and keep the date (2009 in the example), and not remove the date like your last edit suggests. It's difficult to guess what to do given the limited info provided by the question...

You actually just want to remove the two letters after the year in a reference. Supposing we could uniquely identify a reference as a colon followed by four digits and two letters, then the following regular expression would work (at least it works in this example code):
import re
s = """
according to some works (newton:2009cb), gravity is not the same that
severity (darwin:1873dc; hampton:1956tr).
"""
# raw strings keep \w and \1 intact for the regex engine
new_s = re.sub(r'(:[0-9]{4})\w{2}', r'\1', s)
print(new_s)
Explanation: match a colon : followed by four digits [0-9]{4}, followed by any two "word" characters \w{2}. The parentheses capture just the part you want to keep, and r'\1' replaces each whole match with the contents of the first (and only) group. The r prefix makes the string raw, so \1 reaches re.sub as a backreference instead of being interpreted by Python as an escape sequence (the character with octal code 1).
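One caveat about \w{2}: \w also matches digits and underscores, and nothing stops the match from ending in the middle of a longer suffix. The sketch below compares the pattern from this answer with a stricter variant; the stricter pattern and the test strings are assumptions made for illustration, not something the question specifies.

```python
import re

# Pattern from the answer: any two "word" characters after the year.
loose = r'(:[0-9]{4})\w{2}'
# Hypothetical stricter variant: exactly two lowercase letters, and the
# suffix must not be followed by another letter, digit, or underscore.
strict = r'(:[0-9]{4})[a-z]{2}(?![A-Za-z0-9_])'

for text in ["newton:2009cb", "page:200912", "key:2009abc"]:
    print(text, "->", re.sub(loose, r'\1', text), "|", re.sub(strict, r'\1', text))
```

Both patterns shorten newton:2009cb. The loose one also rewrites page:200912 and key:2009abc, while the strict one leaves them alone. Which behaviour is wanted depends on what the real keys look like.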
Hope this helps!