
I want to scrape and parse a London Stock Exchange news article.

Almost the entire content of the site comes from a JSON blob that is consumed by JavaScript. Fortunately, that JSON can be easily extracted with BeautifulSoup and parsed with the json module.

But the encoding of the script is a bit funky.

The <script> tag has an id of "ng-lseg-state", which means the payload is Angular transfer state, serialized with Angular's custom HTML escaping.

For example:

&l;div class=\"news-body-content\"&g;&l;html xmlns=\"http://www.w3.org/1999/xhtml\"&g;\n&l;head&g;\n&l;meta http-equiv=\"Content-Type\" content=\"text/html; charset=UTF-8\" /&g;\n&l;title&g;&l;/title&g;\n&l;meta name=\"generator\"

I handle this with a .replace() chain:

import json

import requests
from bs4 import BeautifulSoup

url = "https://www.londonstockexchange.com/news-article/ESNT/date-for-fy-2020-results-announcement/14850033"
script = BeautifulSoup(requests.get(url).text, "lxml").find("script", {"id": "ng-lseg-state"})
article = json.loads(script.string.replace("&q;", '"'))
main_key = "G.{{api_endpoint}}/api/v1/pages?parameters=newsId%3D14850033&a;path=news-article"
article_body = article[main_key]["body"]["components"][1]["content"]["newsArticle"]["value"]
decoded_body = (
    article_body
    .replace('&l;', '<')
    .replace('&g;', '>')
    .replace('&q;', '"')
)
print(BeautifulSoup(decoded_body, "lxml").find_all("p"))

But there are still some characters that I'm not sure how to handle:

  • &a;#160;
  • &a;amp;
  • &s;

just to name a few.

So, the question is: how do I deal with the rest of these escape sequences? Or is there a parser or a reliable character mapping out there that I don't know of?

Comments:
  • Related: stackoverflow.com/questions/62127215/… Same problem, same website. Commented Apr 10, 2021 at 21:57
  • Thanks @QHarr for pointing that one out. Seems like we could all benefit from a more generic solution than a chain of .replace() methods. Commented Apr 10, 2021 at 22:06

1 Answer


Angular encodes transfer state using a dedicated escape function in its source:

export function escapeHtml(text: string): string {
  const escapedText: {[k: string]: string} = {
    '&': '&a;',
    '"': '&q;',
    '\'': '&s;',
    '<': '&l;',
    '>': '&g;',
  };
  return text.replace(/[&"'<>]/g, s => escapedText[s]);
}

export function unescapeHtml(text: string): string {
  const unescapedText: {[k: string]: string} = {
    '&a;': '&',
    '&q;': '"',
    '&s;': '\'',
    '&l;': '<',
    '&g;': '>',
  };
  return text.replace(/&[^;]+;/g, s => unescapedText[s]);
}

You can reproduce the unescapeHtml function in Python, and add html.unescape to resolve additional HTML entities:

import json
import requests
from bs4 import BeautifulSoup
import html

unescapedText = {
    '&a;': '&',
    '&q;': '"',
    '&s;': '\'',
    '&l;': '<',
    '&g;': '>',
}

def unescape(text):
    # Undo Angular's transfer-state escaping, then resolve any
    # remaining standard HTML entities (e.g. &#160;, &amp;).
    for key, value in unescapedText.items():
        text = text.replace(key, value)
    return html.unescape(text)

url = "https://www.londonstockexchange.com/news-article/ESNT/date-for-fy-2020-results-announcement/14850033"
script = BeautifulSoup(requests.get(url).text, "lxml").find("script", {
    "id": "ng-lseg-state"
})
payload = json.loads(unescape(script.string))
main_key = "G.{{api_endpoint}}/api/v1/pages?parameters=newsId%3D14850033&path=news-article"
article_body = payload[main_key]["body"]["components"][1]["content"]["newsArticle"]["value"]
print(BeautifulSoup(article_body, "lxml").find_all("p"))
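Note why the html.unescape step matters: sequences like &a;#160; and &a;amp; are double-escaped. Angular's unescape step turns &a; back into &, which leaves ordinary HTML entities such as &#160; and &amp; behind, and those are exactly what html.unescape resolves:

```python
import html

# After the "&a;" -> "&" replacement, the leftover sequences are
# standard HTML entities that html.unescape can resolve:
print(repr(html.unescape("&#160;")))  # '\xa0' (non-breaking space)
print(repr(html.unescape("&amp;")))   # '&'
```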

The escape sequences you were missing were &s; and &a;.

repl.it: https://replit.com/@bertrandmartel/AngularTransferStateDecode
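For a closer one-to-one port of Angular's regex-based unescapeHtml, Python's re.sub with a callback can perform the substitution in a single pass. This is a sketch using the same five-entry mapping as above; the pattern is deliberately narrowed to single-letter escapes so that standard entities like &#160; pass through untouched for html.unescape to handle afterwards:

```python
import html
import re

UNESCAPED = {
    "&a;": "&",
    "&q;": '"',
    "&s;": "'",
    "&l;": "<",
    "&g;": ">",
}

def unescape_transfer_state(text):
    # Single-pass equivalent of the .replace() loop: only the five
    # one-letter Angular escapes are matched, so entities such as
    # "&#160;" survive intact until html.unescape resolves them.
    decoded = re.sub(r"&[a-z];", lambda m: UNESCAPED.get(m.group(0), m.group(0)), text)
    return html.unescape(decoded)

print(unescape_transfer_state('&l;div class=&q;x&q;&g;'))  # <div class="x">
```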

