Boilerpipe Integration & Improvement
Allan Huang @ esobi Inc.

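Known Issues
• Extracted body text is empty
• Extracted body text is garbled
• Special characters are garbled
• The main body of the article is missing
• Content unrelated to the article is extracted
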
Integration
• Required parameters
   ◦ URL, or…
   ◦ Full HTML text
   ◦ href for the <base> tag
• Optional parameters
   ◦ Extractor: the Boilerpipe algorithm to use
   ◦ Output Mode: HTML Extraction, HTML Highlighting, Plain Text, JSON

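The parameter list above could be modelled roughly as follows. This is only a sketch: the class name `ExtractionRequest`, its factory methods, and the defaults (`ArticleExtractor`, plain text output) are hypothetical and not taken from the slides. It also assumes that the <base> href is mandatory only when raw HTML is supplied, and that the URL itself serves as the base otherwise.

```java
/** Hypothetical request object mirroring the required and optional parameters. */
final class ExtractionRequest {
    /** The four output modes named on the slide. */
    enum OutputMode { HTML_EXTRACTION, HTML_HIGHLIGHTING, PLAIN_TEXT, JSON }

    final String url;            // required unless html is given
    final String html;           // required unless url is given
    final String baseHref;       // href for the <base> tag
    final String extractor;      // optional: which Boilerpipe algorithm to use
    final OutputMode outputMode; // optional

    private ExtractionRequest(String url, String html, String baseHref,
                              String extractor, OutputMode outputMode) {
        boolean hasUrl = url != null && !url.isEmpty();
        boolean hasHtml = html != null && !html.isEmpty();
        boolean hasBase = baseHref != null && !baseHref.isEmpty();
        if (!hasUrl && !hasHtml) {
            throw new IllegalArgumentException("either a URL or the full HTML is required");
        }
        // Assumption: raw HTML carries no address, so relative links need an explicit base.
        if (!hasUrl && !hasBase) {
            throw new IllegalArgumentException("a <base> href is required when only HTML is given");
        }
        this.url = hasUrl ? url : null;
        this.html = hasHtml ? html : null;
        this.baseHref = hasBase ? baseHref : url;
        // Assumed defaults, chosen only for illustration.
        this.extractor = (extractor != null) ? extractor : "ArticleExtractor";
        this.outputMode = (outputMode != null) ? outputMode : OutputMode.PLAIN_TEXT;
    }

    static ExtractionRequest fromUrl(String url, String extractor, OutputMode mode) {
        return new ExtractionRequest(url, null, null, extractor, mode);
    }

    static ExtractionRequest fromHtml(String html, String baseHref,
                                      String extractor, OutputMode mode) {
        return new ExtractionRequest(null, html, baseHref, extractor, mode);
    }
}
```

A caller would write, for example, `ExtractionRequest.fromUrl("http://example.com/a.html", null, ExtractionRequest.OutputMode.JSON)` and leave the extractor at its default.
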
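Improvement
• Strengthened detection and handling of HTTP and HTML encodings
• Added support for HTTP response decompression algorithms
• Insert a <base> tag so that images with relative paths display correctly
• Upgraded to the latest version of Boilerpipe and the related nekohtml library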
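
Two of the items above, response decompression and <base> tag insertion, can be sketched with the Java standard library alone. The class and method names here (`ResponseFixups`, `decompress`, `insertBaseTag`) are illustrative assumptions, not the integration's actual code. The sketch handles only the `gzip` and `deflate` content codings and uses a simple regular expression rather than a full HTML parser.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.zip.GZIPInputStream;
import java.util.zip.InflaterInputStream;

final class ResponseFixups {
    private static final Pattern HEAD_OPEN =
            Pattern.compile("<head(\\s[^>]*)?>", Pattern.CASE_INSENSITIVE);
    private static final Pattern BASE_TAG =
            Pattern.compile("<base\\s", Pattern.CASE_INSENSITIVE);

    /** Undoes the Content-Encoding of a response body; unknown codings pass through. */
    static byte[] decompress(byte[] body, String contentEncoding) throws IOException {
        if (contentEncoding == null) {
            return body;
        }
        String coding = contentEncoding.trim().toLowerCase(Locale.ROOT);
        InputStream raw = new ByteArrayInputStream(body);
        InputStream decoded;
        if (coding.equals("gzip") || coding.equals("x-gzip")) {
            decoded = new GZIPInputStream(raw);
        } else if (coding.equals("deflate")) {
            decoded = new InflaterInputStream(raw); // expects zlib-wrapped data
        } else {
            return body;
        }
        try (InputStream in = decoded) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[8192];
            int n;
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
            return out.toByteArray();
        }
    }

    /** Adds a <base> tag right after <head> unless the page already has one. */
    static String insertBaseTag(String html, String baseHref) {
        if (BASE_TAG.matcher(html).find()) {
            return html; // keep the page's own base
        }
        String tag = "<base href=\"" + baseHref.replace("\"", "&quot;") + "\">";
        Matcher head = HEAD_OPEN.matcher(html);
        if (head.find()) {
            return html.substring(0, head.end()) + tag + html.substring(head.end());
        }
        return tag + html; // simplification: no <head> found, so prepend the tag
    }
}
```

With the <base> tag in place, a relative reference such as `<img src="a.png">` in the extracted HTML resolves against the original article address instead of the page that displays the result.
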
• Test results
   ◦ 150 news articles in total
      - 66 Traditional Chinese
      - 80 English
      - 2 Simplified Chinese
   ◦ Current success rate: 94%

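Failure Cases
• Only the HTML title is extracted, not the body text
   ◦ 2 articles: China Times (中時電子報), a Facebook timeline photo
• The main body of the article is missing
   ◦ 2 articles: UrCosme beauty forum, Youth Daily News (青年日報)
• JavaScript code or HTML escape characters are extracted
   ◦ 2 articles: Sing Pao, Hong Kong (香港成報), The Wall Street Journal
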
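Solved Cases
• Body text frequently extracted as garbled characters
   ◦ 2 articles: China Times (中時電子報) headline news
   ◦ Cause: the full HTML could not be downloaded
• Solution
   ◦ Avoid Java PushbackStream. Download the full HTML in one pass, then sample the HTML string to determine the encoding of the whole document
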
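The "download everything first, then sample" approach might look like the sketch below. The names (`HtmlCharsetSniffer`, `SAMPLE_BYTES`) and the 4096-byte sample size are assumptions for illustration. The sketch looks only at a `<meta>` charset declaration and ignores the HTTP `Content-Type` header, which a real integration would also consult.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class HtmlCharsetSniffer {
    private static final int SAMPLE_BYTES = 4096; // assumed sample size
    private static final Pattern META_CHARSET = Pattern.compile(
            "<meta[^>]*charset\\s*=\\s*[\"']?\\s*([\\w.:-]+)", Pattern.CASE_INSENSITIVE);

    /** Reads the stream to its end, so later steps never see a partial page. */
    static byte[] readFully(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[8192];
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
        }
        return out.toByteArray();
    }

    /** Samples the start of the downloaded bytes for a declared charset. */
    static Charset detect(byte[] html, Charset fallback) {
        int length = Math.min(html.length, SAMPLE_BYTES);
        // ISO-8859-1 maps every byte to one char, so the ASCII markup survives intact.
        String sample = new String(html, 0, length, StandardCharsets.ISO_8859_1);
        Matcher m = META_CHARSET.matcher(sample);
        if (m.find()) {
            try {
                return Charset.forName(m.group(1));
            } catch (IllegalArgumentException e) {
                // illegal or unsupported charset name: use the fallback
            }
        }
        return fallback;
    }

    /** Downloads the whole document, then decodes it with the detected charset. */
    static String decode(InputStream in, Charset fallback) throws IOException {
        byte[] all = readFully(in);
        return new String(all, detect(all, fallback));
    }
}
```

Because the complete byte array is kept in memory, the same bytes can be decoded again with a different charset if the first guess turns out to be wrong, which a pushback buffer of limited size does not allow.
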
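Solved Cases
• CJK special characters garbled
   ◦ 宏碁 R7 筆電 「星際爭霸戰」款限量出擊
   ◦ 朱镕基退休前后“判若两人” 非常注重晚节
   ◦ Cause: the charset Java uses for the declared name lacks these special characters
• Solution
   ◦ Traditional Chinese: use Big5-HKSCS instead of Big5
   ◦ Simplified Chinese: use GB18030 instead of GB2312
   ◦ Japanese: use Windows-31J instead of Shift_JIS
   ◦ Korean: no case found yet
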
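The substitutions listed above amount to a small lookup applied to the declared charset name before decoding. The following is a minimal sketch: the class `CjkCharsetUpgrade` and its methods are hypothetical names, it matches only the three exact names from the slide (not aliases such as `sjis`), and it falls back to the declared charset when the replacement is not available in the running JVM.

```java
import java.nio.charset.Charset;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

final class CjkCharsetUpgrade {
    // Declared name (lower case) -> broader charset to decode with.
    private static final Map<String, String> REPLACEMENT = new HashMap<>();
    static {
        REPLACEMENT.put("big5", "Big5-HKSCS");       // Traditional Chinese
        REPLACEMENT.put("gb2312", "GB18030");        // Simplified Chinese
        REPLACEMENT.put("shift_jis", "Windows-31J"); // Japanese
    }

    /** Returns the replacement charset name, or the trimmed input if none is listed. */
    static String replacementName(String declared) {
        String trimmed = declared.trim();
        String upgraded = REPLACEMENT.get(trimmed.toLowerCase(Locale.ROOT));
        return (upgraded != null) ? upgraded : trimmed;
    }

    /** Resolves a Charset, preferring the replacement when this JVM supports it. */
    static Charset resolve(String declared) {
        String name = replacementName(declared);
        if (!Charset.isSupported(name)) {
            name = declared.trim(); // replacement unavailable: decode as declared
        }
        return Charset.forName(name);
    }
}
```

For a page that declares `charset=big5`, `new String(bytes, CjkCharsetUpgrade.resolve("big5"))` would then decode with Big5-HKSCS where it is available.
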
Algorithm Comparison
• Criteria
   ◦ Structure retention
   ◦ Inner content cleaning
   ◦ Implementation
   ◦ Language dependency
   ◦ Source parameter
   ◦ Additional features and remarks

• Boilerpipe
   ◦ Structure retention: plain text only
   ◦ Inner content cleaning: uses a classifier to determine whether or not an atomic text block holds useful content
   ◦ Implementation: open source Java library
   ◦ Language dependency: should be language independent, since the text block classifier observes language-independent text…
   ◦ Source parameter: you can fetch documents yourself or use the built-in utilities to fetch them for you
   ◦ Additional features and remarks: implements many extractors with different classification rules trained on different datasets

• Alchemy API
   ◦ Structure retention: text only (has an option to include relevant hyperlinks)
   ◦ Inner content cleaning: n/a
   ◦ Implementation: commercial web API
   ◦ Language dependency: observation: returns an error for non-English content, e.g. the document contains "unsupported text…"
   ◦ Source parameter: include the whole document in the POST request or provide a URL
   ◦ Additional features and remarks: extra API call to extract the title

• Diffbot
   ◦ Structure retention: plain text or HTML
   ◦ Inner content cleaning: an option to remove inline ads
   ◦ Implementation: web API (private beta)
   ◦ Language dependency: n/a
   ◦ Source parameter: does the fetching for you via the provided URL
   ◦ Additional features and remarks: extracts relevant media, title, tags, XPath descriptor for wrappers, comments and comment count, article summary

• Readability
   ◦ Structure retention: retains original structure
   ◦ Inner content cleaning: uses hardcoded heuristics to extract content divided by ads
   ◦ Implementation: open source JavaScript bookmarklet
   ◦ Language dependency: language independent, but relies on language-dependent regular expressions to match id and class labels
   ◦ Source parameter: via browser

• Goose
   ◦ Structure retention: plain text
   ◦ Inner content cleaning: n/a
   ◦ Implementation: open source Java library
   ◦ Language dependency: language independent, but relies on language-dependent regular expressions to match id and class labels
   ◦ Source parameter: URL only (my fork enables you to fetch the document by yourself)
   ◦ Additional features and remarks: uses hardcoded heuristics to search for related images and embedded media

• Extractiv
   ◦ Structure retention: depends on the chosen output format, e.g. XML format breaks the content
   ◦ Inner content cleaning: n/a
   ◦ Implementation: commercial web API
   ◦ Language dependency: n/a
   ◦ Source parameter: include the whole document in the POST request or provide a URL
   ◦ Additional features and remarks: capable of enriching the extracted text with semantic entities and relationships

• Repustate API
   ◦ Structure retention: plain text
   ◦ Inner content cleaning: n/a
   ◦ Implementation: commercial web API
   ◦ Language dependency: n/a
   ◦ Source parameter: URL only

• Webstemmer
   ◦ Structure retention: plain text
   ◦ Inner content cleaning: n/a
   ◦ Implementation: open source Python library
   ◦ Language dependency: language independent
   ◦ Source parameter: first runs a crawler to obtain seed pages, then learns layout patterns that are later put to work to extract…
   ◦ Additional features and remarks: the only piece of software on this list that requires a cluster of similar documents obtained by crawling

• NCleaner (paper)
   ◦ Structure retention: plain text
   ◦ Inner content cleaning: uses character-level n-grams to detect content text blocks
   ◦ Implementation: open source Perl library
   ◦ Language dependency: depends on the training language
   ◦ Source parameter: arbitrary HTML document
   ◦ Additional features and remarks: reliant on the lynx browser for converting HTML to structured plain text

References
• Evaluating Text Extraction Algorithms
• List of resources: Article text extraction from HTML documents
• Feature-wise Comparison of HTML Article Text Extractors
• Overview: Extracting article text from HTML documents
• Readability for Java - Snacktory

Conclusion
• Next steps…
   ◦ The body text extracted by Boilerpipe does not include image information
   ◦ A cache mechanism for the full HTML or the extracted body text of each URL
• Q&A
