Parsing specific field in XML file in Python

Question

I have an xml file that looks like this:

<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<feed xml:base="http://data.treasury.gov:8001/Feed.svc/" xmlns:d="http://schemas.microsoft.com/ado/2007/08/dataservices" xmlns:m="http://schemas.microsoft.com/ado/2007/08/dataservices/metadata" xmlns="http://www.w3.org/2005/Atom">
  <title type="text">DailyTreasuryYieldCurveRateData</title>
  <id>http://data.treasury.gov:8001/feed.svc/DailyTreasuryYieldCurveRateData</id>
  <updated>2015-08-30T15:17:09Z</updated>
  <link rel="self" title="DailyTreasuryYieldCurveRateData" href="DailyTreasuryYieldCurveRateData" />
  <entry>
    <id>http://data.treasury.gov:8001/Feed.svc/DailyTreasuryYieldCurveRateData(6404)</id>
    <title type="text"></title>
    <updated>2015-08-30T15:17:09Z</updated>
    <author>
      <name />
    </author>
    <link rel="edit" title="DailyTreasuryYieldCurveRateDatum" href="DailyTreasuryYieldCurveRateData(6404)" />
    <category term="TreasuryDataWarehouseModel.DailyTreasuryYieldCurveRateDatum" scheme="http://schemas.microsoft.com/ado/2007/08/dataservices/scheme" />
    <content type="application/xml">
      <m:properties>
        <d:Id m:type="Edm.Int32">6404</d:Id>
        <d:NEW_DATE m:type="Edm.DateTime">2015-08-03T00:00:00</d:NEW_DATE>
        <d:BC_1MONTH m:type="Edm.Double">0.03</d:BC_1MONTH>
        <d:BC_3MONTH m:type="Edm.Double">0.08</d:BC_3MONTH>
        <d:BC_6MONTH m:type="Edm.Double">0.17</d:BC_6MONTH>
        <d:BC_1YEAR m:type="Edm.Double">0.33</d:BC_1YEAR>
        <d:BC_2YEAR m:type="Edm.Double">0.68</d:BC_2YEAR>
        <d:BC_3YEAR m:type="Edm.Double">0.99</d:BC_3YEAR>
        <d:BC_5YEAR m:type="Edm.Double">1.52</d:BC_5YEAR>
        <d:BC_7YEAR m:type="Edm.Double">1.89</d:BC_7YEAR>
        <d:BC_10YEAR m:type="Edm.Double">2.16</d:BC_10YEAR>
        <d:BC_20YEAR m:type="Edm.Double">2.55</d:BC_20YEAR>
        <d:BC_30YEAR m:type="Edm.Double">2.86</d:BC_30YEAR>
        <d:BC_30YEARDISPLAY m:type="Edm.Double">2.86</d:BC_30YEARDISPLAY>
      </m:properties>
    </content>
  </entry>
  <entry>
    <id>http://data.treasury.gov:8001/Feed.svc/DailyTreasuryYieldCurveRateData(6405)</id>
    <title type="text"></title>
    <updated>2015-08-30T15:17:09Z</updated>
    <author>
      <name />
    </author>
    <link rel="edit" title="DailyTreasuryYieldCurveRateDatum" href="DailyTreasuryYieldCurveRateData(6405)" />
    <category term="TreasuryDataWarehouseModel.DailyTreasuryYieldCurveRateDatum" scheme="http://schemas.microsoft.com/ado/2007/08/dataservices/scheme" />
    <content type="application/xml">
      <m:properties>
        <d:Id m:type="Edm.Int32">6405</d:Id>
        <d:NEW_DATE m:type="Edm.DateTime">2015-08-04T00:00:00</d:NEW_DATE>
        <d:BC_1MONTH m:type="Edm.Double">0.05</d:BC_1MONTH>
        <d:BC_3MONTH m:type="Edm.Double">0.08</d:BC_3MONTH>
        <d:BC_6MONTH m:type="Edm.Double">0.18</d:BC_6MONTH>
        <d:BC_1YEAR m:type="Edm.Double">0.37</d:BC_1YEAR>
        <d:BC_2YEAR m:type="Edm.Double">0.74</d:BC_2YEAR>
        <d:BC_3YEAR m:type="Edm.Double">1.08</d:BC_3YEAR>
        <d:BC_5YEAR m:type="Edm.Double">1.6</d:BC_5YEAR>
        <d:BC_7YEAR m:type="Edm.Double">1.97</d:BC_7YEAR>
        <d:BC_10YEAR m:type="Edm.Double">2.23</d:BC_10YEAR>
        <d:BC_20YEAR m:type="Edm.Double">2.59</d:BC_20YEAR>
        <d:BC_30YEAR m:type="Edm.Double">2.9</d:BC_30YEAR>
        <d:BC_30YEARDISPLAY m:type="Edm.Double">2.9</d:BC_30YEARDISPLAY>
      </m:properties>
    </content>
  </entry>
</feed>

How can I parse out the '2.16' for 'BC_10YEAR'? I've been looking at other examples with ElementTree and lxml and I just can't seem to match up the xml format in those examples with that of my file.

The last thing I've tried was:

from lxml import etree
doc = etree.parse(yield_xml)
memoryElem = doc.find('content')
print memoryElem.text        # element text
print memoryElem.get('type') # attribute

I get an error: AttributeError: 'NoneType' object has no attribute 'text'

Is there a simple way to do this?

mmachine · Accepted Answer · 2015-08-31 03:03:35Z

1

You can try the built-in str.split method (xml_file must hold the file's contents as a string, not an open file object):

>>> [data.split('>')[1].split('<')[0] for data in str(xml_file).split('<d:') if 'BC_10YEAR' in data][0]
'2.16'

answered Aug 31, 2015 at 3:03

mmachine


3 Comments

Colin Over a year ago

I tried 'with open('test.xml', 'rb') as xml_file: [data.split('>')[1].split('<')[0] for data in str(xml_file).split('<d:') if 'BC_10YEAR' in data][0]' but I get "IndexError: list index out of range" error. What am I doing wrong?

mmachine Over a year ago

It means your test.xml file object differs from the example above.

Colin Over a year ago

That's strange, I'm pretty sure my file has exactly what I pasted above. Anyway, I modified to this to get it to work:

with open(yield_xml, 'rb') as yield_file:
    for line in yield_file:
        if 'BC_10YEAR' in line:
            cur_yield = float(line.split('>')[1].split('<')[0])
            break
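A likely cause of the IndexError in the comments above is that str() was applied to the open file object, which yields the object's repr rather than the XML text, so no chunk contains 'BC_10YEAR' and the list is empty. The sketch below illustrates this under stated assumptions: io.StringIO stands in for the opened test.xml, the XML is a trimmed two-field fragment, and first_bc_10year is a hypothetical helper name.

```python
import io

# Trimmed fragment standing in for the contents of test.xml (assumption).
xml_text = (
    '<m:properties>\n'
    '  <d:BC_7YEAR m:type="Edm.Double">1.89</d:BC_7YEAR>\n'
    '  <d:BC_10YEAR m:type="Edm.Double">2.16</d:BC_10YEAR>\n'
    '</m:properties>\n'
)


def first_bc_10year(text):
    """Return the text of the first <d:BC_10YEAR> element, found by splitting."""
    matches = [chunk.split('>')[1].split('<')[0]
               for chunk in text.split('<d:')
               if 'BC_10YEAR' in chunk]
    return matches[0]  # raises IndexError when nothing matched


xml_file = io.StringIO(xml_text)  # stand-in for an open file object

# str() of a file-like object is its repr, which contains no XML at all.
repr_text = str(xml_file)
print('BC_10YEAR' in repr_text)

# Reading the contents first gives the split approach real text to work on.
print(first_bc_10year(xml_file.read()))
```

String splitting of this kind depends on the exact layout of the file, so a namespace-aware parser is usually the more robust choice.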

Community · Accepted Answer · 2017-05-23 10:26:39Z

0

I'd suggest using lxml's xpath() method, which provides better XPath expression support:

from lxml import etree

doc = etree.parse(yield_xml)

#register prefixes to be used in xpath
ns = {"foo": "http://www.w3.org/2005/Atom",
      "d": "http://schemas.microsoft.com/ado/2007/08/dataservices",
      "m": "http://schemas.microsoft.com/ado/2007/08/dataservices/metadata"}

#select element <d:BC_10YEAR>, and convert the value to number
result = doc.xpath("number(//foo:content/m:properties/d:BC_10YEAR)", namespaces=ns)

#print the result
print(result)
print(type(result))

Output:

2.16
<type 'float'>

In case you wonder why the XPath expression above uses foo:content instead of just content: the content element implicitly inherits the default namespace declared on the root element, and the code above maps that default namespace URI to the prefix foo. Related question: parsing xml containing default namespace to get an element value using lxml
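The same namespace point explains the AttributeError in the question: an unqualified find('content') looks for an element in no namespace, finds nothing, and returns None. The following is a minimal sketch using the standard library's xml.etree.ElementTree instead of lxml, with a trimmed one-entry copy of the feed; the prefix name atom is an arbitrary choice, just as foo is above.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the feed from the question: one entry, one field.
SAMPLE = """<feed xmlns:d="http://schemas.microsoft.com/ado/2007/08/dataservices"
      xmlns:m="http://schemas.microsoft.com/ado/2007/08/dataservices/metadata"
      xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <content type="application/xml">
      <m:properties>
        <d:BC_10YEAR m:type="Edm.Double">2.16</d:BC_10YEAR>
      </m:properties>
    </content>
  </entry>
</feed>"""

# Prefix-to-URI mapping used only by the search expressions below.
NS = {
    "atom": "http://www.w3.org/2005/Atom",
    "d": "http://schemas.microsoft.com/ado/2007/08/dataservices",
    "m": "http://schemas.microsoft.com/ado/2007/08/dataservices/metadata",
}

root = ET.fromstring(SAMPLE)

# Without a prefix the search targets a no-namespace <content>, which does not exist.
unqualified = root.find(".//content")
print(unqualified)

# With the default namespace mapped to a prefix, the element is found.
content = root.find(".//atom:content", NS)
print(content.get("type"))

# Descend to the field and convert its text to a float.
field = content.find("m:properties/d:BC_10YEAR", NS)
print(float(field.text))
```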

edited May 23, 2017 at 10:26

CommunityBot


answered Aug 31, 2015 at 8:49

har07


2 Comments

Colin Over a year ago

Thanks the code works. Unfortunately my knowledge of xml is very limited so I could not understand much of what you said. I do have a question though: how does the code differentiate between the two 'BC_10YEAR' values in the xml file? The first one is 2.16 but there is another one that's 2.23.

har07 Over a year ago

The code returns only the first one. Getting the other one, or all BC_10YEAR values, is perfectly possible with a small change to the XPath argument.
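One way to tell the two values apart is to walk every m:properties block and key each BC_10YEAR value by its NEW_DATE. The sketch below is an illustration rather than the answer's own code: it swaps in the standard library's xml.etree.ElementTree for lxml's xpath(), uses a trimmed two-entry copy of the feed, and the name yields_by_date is made up for the example.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the feed: two entries, each with a date and a 10-year yield.
SAMPLE = """<feed xmlns:d="http://schemas.microsoft.com/ado/2007/08/dataservices"
      xmlns:m="http://schemas.microsoft.com/ado/2007/08/dataservices/metadata"
      xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <content type="application/xml">
      <m:properties>
        <d:NEW_DATE m:type="Edm.DateTime">2015-08-03T00:00:00</d:NEW_DATE>
        <d:BC_10YEAR m:type="Edm.Double">2.16</d:BC_10YEAR>
      </m:properties>
    </content>
  </entry>
  <entry>
    <content type="application/xml">
      <m:properties>
        <d:NEW_DATE m:type="Edm.DateTime">2015-08-04T00:00:00</d:NEW_DATE>
        <d:BC_10YEAR m:type="Edm.Double">2.23</d:BC_10YEAR>
      </m:properties>
    </content>
  </entry>
</feed>"""

NS = {
    "d": "http://schemas.microsoft.com/ado/2007/08/dataservices",
    "m": "http://schemas.microsoft.com/ado/2007/08/dataservices/metadata",
}

root = ET.fromstring(SAMPLE)

# Map each date (YYYY-MM-DD) to that day's 10-year yield.
yields_by_date = {}
for props in root.iterfind(".//m:properties", NS):
    date = props.findtext("d:NEW_DATE", namespaces=NS)
    rate = props.findtext("d:BC_10YEAR", namespaces=NS)
    yields_by_date[date[:10]] = float(rate)

print(yields_by_date)
```

Looking up yields_by_date["2015-08-04"] then selects the second entry's value by date instead of by position.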
