0

Without any third-party libraries (such as Beautiful Soup), what is the cleanest way to parse a string in Python?

Given the text below, I'd like the value of "uber_token" to be parsed out, i.e. "123456789".

....

<form id="blah" action="/p-submi.html" method="post"><input type="hidden" id="" name="uber_token" value="123456789"/><div class="container-info">

....

Thanks!

5
  • Do you need to tokenize all the elements and attributes, or simply extract the value="XXX" part? If it's just the latter, use a regex. Commented Jun 26, 2014 at 4:18
  • I just need the value="xxx". But there are multiple instances of value="**", which may have different associated names. Commented Jun 26, 2014 at 4:20
  • If the attributes and their ordering are consistent in every element, you can use a regex for that, but why are you averse to using a library? Commented Jun 26, 2014 at 4:22
  • Note that if you need the names that accompany the values too, maybe update your question. Commented Jun 26, 2014 at 4:24
  • If each <input type="hidden" id="" name="uber_token" value="123456789"/> is on its own line, you can just search for name and parse the text between the two quotation marks after it. If it equals uber_token, then find value and parse the text between the two quotation marks after that. Commented Jun 26, 2014 at 4:58

3 Answers

2

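A regular expression is the solution, provided the name and value attributes always appear in this order.

Use the re module from the standard library: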

>>> import re
>>> s = '<form id="blah" action="/p-submi.html" method="post"><input type="hidden" id="" name="uber_token" value="123456789"/><div class="container-info"'
>>> match = re.search(r'name="uber_token" value="([0-9]+)"', s)
>>> print(match.group(1))
123456789
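The comments on the question mention several value="..." attributes with different names, and the pattern above depends on name coming directly before value. Below is a minimal sketch of a regex approach that does not depend on attribute order. The helper name find_input_value and the second input tag in the sample are hypothetical, added only for illustration, and the sketch assumes every attribute value is double-quoted.

```python
import re

# Matches one whole <input ...> tag; the attribute text is captured as group 1.
INPUT_TAG = re.compile(r'<input\b([^>]*)>', re.IGNORECASE)
# Matches attr="value" pairs inside a tag (double-quoted values only).
ATTRIBUTE = re.compile(r'([\w-]+)\s*=\s*"([^"]*)"')

def find_input_value(html, name):
    """Return the value attribute of the first <input> with the given name, or None."""
    for tag in INPUT_TAG.finditer(html):
        attrs = dict(ATTRIBUTE.findall(tag.group(1)))
        if attrs.get("name") == name:
            return attrs.get("value")
    return None

# The second <input> is a made-up example with its attributes in a different order.
s = ('<form id="blah" action="/p-submi.html" method="post">'
     '<input type="hidden" id="" name="uber_token" value="123456789"/>'
     '<input value="abc" name="other_token" type="hidden"/>')

print(find_input_value(s, "uber_token"))
print(find_input_value(s, "other_token"))
```

This is still pattern matching rather than real parsing, so it will likely break on single-quoted attributes or on a ">" inside an attribute value.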

0

Disclaimer: This answer is for quick-and-dirty scripts, and may lack in robustness and efficiency. Suggestions here should probably not be used for code that survives more than a few hours.

If you're unwilling to learn regex (and you should be willing to learn regex!), you can split on value=". Probably really inefficient, but simple is easier to debug.

values = []

with open('myfile.txt') as infile:
    for line in infile:
        candidates = line.split('value="')
        for s in candidates[1:]:  # the first token is the text before any value
            try:  # keep the value only if it is a number
                val = int(s.split('"')[0])
            except ValueError:  # not a number, skip it
                continue
            values.append(val)

If you're specifically looking at HTML or XML, the Python standard library has parsers for both (html.parser and xml.etree.ElementTree).

Then, for example, you can write code to search through the tree for a node with an attribute "name" that has value "uber_token", and get the "value" attribute from it.

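A very simple example that doesn't require learning too much about ElementTree (it assumes myfile.xml is well-formed XML):

import xml.etree.ElementTree as ET

tree = ET.parse('myfile.xml')
root = tree.getroot()

values = []

# iter() walks every element in the tree, not just the root's direct children
for element in root.iter():
    # get() returns None instead of raising KeyError when an attribute is missing
    if element.get('name') == 'uber_token':
        values.append(element.get('value'))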
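The snippet in the question is HTML with unclosed tags, which an XML parser would probably reject. As a rough sketch of the same "find the node whose name is uber_token and read its value" idea, the standard library's html.parser can be used instead (Python 3). The class name InputValueFinder is made up for this example.

```python
from html.parser import HTMLParser

class InputValueFinder(HTMLParser):
    """Collects the value attribute of every <input> with a given name."""

    def __init__(self, wanted_name):
        super().__init__()
        self.wanted_name = wanted_name
        self.values = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (attribute, value) pairs.
        attributes = dict(attrs)
        if tag == "input" and attributes.get("name") == self.wanted_name:
            self.values.append(attributes.get("value"))

html = ('<form id="blah" action="/p-submi.html" method="post">'
        '<input type="hidden" id="" name="uber_token" value="123456789"/>'
        '<div class="container-info">')

finder = InputValueFinder("uber_token")
finder.feed(html)
print(finder.values)
```

Self-closing tags such as the input above are routed through handle_starttag by default, so no separate handler is needed for them.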


0

Python comes with its own XML parsing module: https://docs.python.org/3.2/library/xml.html?highlight=xml#xml so you don't have to use any third-party parsing library. If you're unwilling or not allowed to use that, you can always drop to regex, but I'd steer clear of that when it comes to parsing XML.
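For illustration, here is a small sketch of that module applied to the question's data. It assumes the input is well-formed XML, so the sample below is a hypothetical version of the original snippet with the form and div tags closed. The lookup uses the attribute predicate syntax that ElementTree's findall supports.

```python
import xml.etree.ElementTree as ET

# Hypothetical well-formed version of the snippet: form and div are closed here.
document = (
    '<form id="blah" action="/p-submi.html" method="post">'
    '<input type="hidden" id="" name="uber_token" value="123456789"/>'
    '<div class="container-info"></div>'
    '</form>'
)

root = ET.fromstring(document)

# Select every descendant <input> whose name attribute equals uber_token.
matches = root.findall(".//input[@name='uber_token']")
values = [element.get("value") for element in matches]
print(values)
```

On real-world HTML that is not valid XML, ET.fromstring raises a ParseError, so this only fits documents known to be well formed.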
