2

I want to parse text from an XML file. Suppose I have some lines like this one in file.xml:

<s id="1792387-2">Castro Verde is situated in the Baixo Alentejo Subregion within a territory known locally as the Campo Branco (English: White Plains).</s>

How can I extract the following text from the above line:

Castro Verde is situated in the Baixo Alentejo Subregion within a territory known locally as the Campo Branco (English: White Plains).

After making some changes to the text, I want to get the changed text back inside the same tag, like below:

<s id="1792387-2"> Changed Text </s>

Any suggestions, please? Thanks!

2
  • What exactly is your question? Commented Aug 1, 2011 at 15:20
  • Do you want to parse the text, the XML or both? Commented Aug 1, 2011 at 15:22

3 Answers

5

LXML makes this particularly easy.

>>> from lxml import etree
>>> text = '''<s id="1792387-2">Castro Verde is situated in the Baixo Alentejo Subregion within a territory known locally as the Campo Branco (English: White Plains).</s>'''
>>> def edit(s):
...     return 'Changed Text'
... 
>>> t = etree.fromstring(text)
>>> t.text = edit(t.text)
>>> etree.tostring(t)
'<s id="1792387-2">Changed Text</s>'
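
If lxml cannot be installed, the same steps can be tried with a fallback to the standard library. This is only a sketch for Python 3: the uppercase `edit` function and the shortened sample string are placeholders I made up, and `encoding="unicode"` is passed so that `tostring` returns a `str` rather than bytes.

```python
try:
    from lxml import etree  # third-party; needs a separate install
except ImportError:
    import xml.etree.ElementTree as etree  # standard-library fallback

def edit(s):
    # Hypothetical transformation; replace with your own logic.
    return s.upper()

source = '<s id="1792387-2">Castro Verde is situated in the Baixo Alentejo Subregion.</s>'
elem = etree.fromstring(source)
original = elem.text            # the extracted text, without the tag
elem.text = edit(original)      # put the changed text back into the same element
result = etree.tostring(elem, encoding="unicode")
print(result)
```

The tag name and the `id` attribute are carried over unchanged because only `elem.text` is replaced.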

4 Comments

I'm getting a traceback: Traceback (most recent call last): File "<string>", line 1, in <fragment> builtins.ImportError: No module named lxml
@Blue Ice: lxml is not part of the Python standard library; you have to install it separately. lxml.de
If you'd like to just use the standard library (python 2.5+) you can use the ElementTree module (see my answer).
But I am working on a server, and for the moment it is not possible to install it, since I have no administrator access. Any other alternatives, please?
4

There are a couple of stdlib modules for parsing XML, but in general ElementTree is the simplest:

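from xml.etree import ElementTree
from StringIO import StringIO  # Python 2; on Python 3 use: from io import StringIO
doc = ElementTree.parse(StringIO("""<doc><s id="1792387-2">Castro…</s><s id="1792387-3">Other stuff</s></doc>"""))
for elem in doc.findall("s"):
    print "Text:", elem.text
    elem.text = "new text"
    # tostring() returns the serialized element; dump() writes to stdout itself and returns None
    print "New:", ElementTree.tostring(elem)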

And if your XML is coming from a file, you can use:

f = open("path/to/foo.xml")
doc = ElementTree.parse(f)
f.close()
… use `doc` …
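
For the file-based case, a Python 3 sketch of the full round trip (read, edit every `<s>`, write back) might look like the following. The helper name `rewrite_sentences`, the `" (edited)"` suffix, and the temporary demo file are my own illustrative assumptions, not part of the question.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

def rewrite_sentences(path, edit):
    """Apply edit() to the text of every <s> element and save the file in place."""
    tree = ET.parse(path)  # parse() also accepts a filename directly
    for elem in tree.getroot().iter("s"):
        if elem.text is not None:
            elem.text = edit(elem.text)
    tree.write(path, encoding="utf-8", xml_declaration=True)

# Demo on a throwaway file so the sketch is self-contained.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "file.xml")
    with open(path, "w", encoding="utf-8") as f:
        f.write('<doc><s id="1792387-2">Castro</s><s id="1792387-3">Other stuff</s></doc>')
    rewrite_sentences(path, lambda text: text + " (edited)")
    texts = [(s.get("id"), s.text) for s in ET.parse(path).getroot().iter("s")]
    print(texts)
```

`iter("s")` is used instead of `findall("s")` so that the sketch also works when `<s>` is the root element or is nested more deeply.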

4 Comments

Could you please have a look at the following traceback? Traceback (most recent call last): File "<string>", line 1, in <fragment> builtins.ImportError: No module named StringIO
What version of Python are you using? (python --version)
Hrm… Is it a custom or restricted installation? Because StringIO should exist. Anyway, you can try loading it from a file (as per the second portion of my answer).
Just to be sure: is it possible you have multiple versions of Python installed? In Python 3 the StringIO module is gone and the import becomes from io import StringIO. I have two Pythons on my system (2.6 and 3.2) and run into this situation from time to time.
1

Parsing XML with the xml.dom.minidom module (part of the Python standard library, http://docs.python.org/py3k/library/xml.dom.minidom.html) is my favorite:

import xml.dom.minidom
d = xml.dom.minidom.parseString("<s id=\"1792387-2\">Castro Verde is situated in the Baixo Alentejo Subregion within a territory known locally as the Campo Branco (English: White Plains).</s>")
oldText = d.childNodes[0].childNodes[0].data
d.childNodes[0].childNodes[0].data = "Changed text"
d.toxml()

But this does not help you parse the text, so I am not sure what you exactly want there.
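
If "parse the text" just means reading the string inside the tag, the minidom approach already gives you that. Here is a small Python 3 sketch, assuming the element holds a single text node; the sample string is shortened for illustration.

```python
import xml.dom.minidom

source = '<s id="1792387-2">Castro Verde is situated in the Baixo Alentejo Subregion.</s>'
doc = xml.dom.minidom.parseString(source)

node = doc.documentElement.firstChild  # the text node inside <s>
old_text = node.data                   # extracted text, without the tag
node.data = "Changed text"             # replace the text, keep tag and attributes

# Serializing the element (not the document) leaves out the XML declaration.
result = doc.documentElement.toxml()
print(old_text)
print(result)
```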

1 Comment

I want to extract the following text from the above line: Castro Verde is situated in the Baixo Alentejo Subregion within a territory known locally as the Campo Branco (English: White Plains).
