I have an auto-generated bibliography file which stores my references. The citekey in the generated file is of the form xxxxx:2009tb. Is there a way to have a program detect such a pattern and change the citekey to the form xxxxx:2009?

  • Use Python regular expressions: read the file line by line, get the string, then replace it: tutorialspoint.com/python/string_replace.htm Commented Nov 7, 2012 at 11:30
  • Won't a simple replace of :2009tb with :2009 work for you? Commented Nov 7, 2012 at 11:49
  • It's difficult to get a pattern from only one example. Could you post, say, five to ten different occurrences of these references as they appear, and the corresponding desired outputs? Commented Nov 7, 2012 at 12:02

2 Answers

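It's not quite clear to me which expression you want to match, but you can build everything with regex, using import re and re.sub as shown. [0-9]{4} matches exactly four digits, and the parentheses capture them so that \1 can put the year back in the replacement. (Edit, to incorporate suggestions)

import re

inf = 'temp.txt'
outf = 'out.txt'

# The with statement closes both files on exit, so no explicit close() is needed.
with open(inf) as f, open(outf, 'w') as o:
    text = f.read()  # read the whole file at once
    # Keep the four-digit year (group 1) and drop the trailing "tb".
    text = re.sub(r"xxxxx:([0-9]{4})tb", r"xxxxx:\1", text)  # match your regex here
    o.write(text)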
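
A quick check of the quantifier syntax may help here, since [0-9]*4 and [0-9]{4} are easy to confuse. This is only an illustrative sketch; the names loose and exact and the sample strings are made up for the comparison.

```python
import re

# "[0-9]*4" means: zero or more digits, then a literal "4".
loose = re.compile(r"[0-9]*4tb")
# "[0-9]{4}" means: exactly four digits.
exact = re.compile(r"[0-9]{4}tb")

# Print, for each sample, whether each pattern finds a match anywhere in it.
for sample in ["xxxxx:2009tb", "xxxxx:2004tb", "xxxxx:4tb"]:
    print(sample, bool(loose.search(sample)), bool(exact.search(sample)))
```

The first pattern only matches when the digit run happens to end in 4, which is why it misses xxxxx:2009tb.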

5 Comments

Why split the file into lines? If you are taking this approach you may as well do it for the full file.
user996018 probably wants to capture (xxxxxx), not replace it. RParadox, use with when dealing with files instead of open/close.
The OP obviously doesn't want to replace the hardcoded string xxxxx:2009tb, but actually a PATTERN containing some (undefined) string followed by a colon and a year date and some letters.
Now he wants to REMOVE tb and keep the date (2009 in the example), and not remove the date like your last edit suggests. It's difficult to guess what to do given the limited info provided by the question...
@heltonbiker I am sorry about the delay in replying. As you have suggested, the string 2009 is not hardcoded. It could be anything like rtwruyeqy:2008xc or ahdjkhjk:2005gf or djkhdkjhjk:1999gh... It's basically the author and year followed by two letters. Thanks for the responses.

You actually just want to remove the two letters after the year in a reference. Supposing we can uniquely identify a reference as a colon followed by four digits and two letters, then the following regular expression would work (at least it works in this example code):

import re

s = """
according to some works (newton:2009cb), gravity is not the same that
severity (darwin:1873dc; hampton:1956tr).
"""

# Raw strings keep the backslashes in \w and \1 intact for the regex engine.
new_s = re.sub(r'(:[0-9]{4})\w{2}', r'\1', s)
print(new_s)

Explanation: match a colon : followed by four digits [0-9]{4} followed by any two "word" characters \w{2}. The parentheses capture just the part you want to keep, and r'\1' means you are replacing each whole match with the smaller part of it that is in the first (and only) group of parentheses. The r prefix makes the string a raw string, so the backslash reaches the regex engine unchanged; without it, Python would read '\1' as an octal escape sequence (the character with code 1) rather than as a reference to group 1.
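
Since \w also matches digits and underscores, a slightly stricter variant may be safer on a real file. The sketch below assumes the suffix is always exactly two ASCII letters at the end of the key, as in the examples the asker gave in the comments (rtwruyeqy:2008xc and so on). The helper name strip_citekey_suffix and the constant CITEKEY_SUFFIX are hypothetical, not part of the original answer.

```python
import re

# Assumption: a key ends in a colon, a four-digit year, and exactly two ASCII letters.
# Group 1 keeps the colon and year; \b requires the key to end right after the letters.
CITEKEY_SUFFIX = re.compile(r"(:[0-9]{4})[A-Za-z]{2}\b")

def strip_citekey_suffix(text):
    """Return text with the two-letter suffix removed from keys like name:2009tb."""
    return CITEKEY_SUFFIX.sub(r"\1", text)

# The keys quoted by the asker in the comments.
for key in ["rtwruyeqy:2008xc", "ahdjkhjk:2005gf", "djkhdkjhjk:1999gh"]:
    print(key, "->", strip_citekey_suffix(key))
```

With this variant a string such as pages:123456 is left alone, whereas \w{2} would treat the last two digits as the suffix and cut them off.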

Hope this helps!
