Boilerpipe Integration And Improvement

Allan Huang @ esobi Inc.

Known Issues
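 Article content comes back empty
 Article content is garbled
 Special characters are garbled
 The main article body is missing
 Content unrelated to the article is extracted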

Integration
 Required parameters
   The URL, or the full HTML text
   The href of the <base> tag
 Optional parameters
   Extractor: the Boilerpipe algorithm to use
   Output Mode: HTML Extraction, HTML Highlighting, Plain Text, JSON
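The parameter list above could be modeled as a small request object. This is only a sketch: the class name `ExtractionRequest`, its factory methods, and the defaults (`ArticleExtractor`, plain text) are assumptions made for illustration, not the API of the actual integration. `ArticleExtractor` and `DefaultExtractor` are names of extractors shipped with Boilerpipe; here they are carried as plain strings.

```java
/** Hypothetical request object; class, method, and default names are illustrative. */
final class ExtractionRequest {
    /** The four output modes named on the slide. */
    enum OutputMode { HTML_EXTRACTION, HTML_HIGHLIGHTING, PLAIN_TEXT, JSON }

    final String url;            // required unless html is given
    final String html;           // required unless url is given
    final String baseHref;       // href for the <base> tag
    final String extractor;      // optional: which Boilerpipe extractor to run
    final OutputMode outputMode; // optional

    private ExtractionRequest(String url, String html, String baseHref,
                              String extractor, OutputMode outputMode) {
        if (isBlank(url) && isBlank(html)) {
            throw new IllegalArgumentException("either url or html is required");
        }
        if (isBlank(baseHref)) {
            throw new IllegalArgumentException("baseHref is required");
        }
        this.url = url;
        this.html = html;
        this.baseHref = baseHref;
        // Assumed defaults; the slides do not say which values are used.
        this.extractor = isBlank(extractor) ? "ArticleExtractor" : extractor;
        this.outputMode = outputMode == null ? OutputMode.PLAIN_TEXT : outputMode;
    }

    static ExtractionRequest forUrl(String url, String baseHref) {
        return new ExtractionRequest(url, null, baseHref, null, null);
    }

    static ExtractionRequest forHtml(String html, String baseHref) {
        return new ExtractionRequest(null, html, baseHref, null, null);
    }

    /** Returns a copy with the optional parameters set. */
    ExtractionRequest withOptions(String extractor, OutputMode outputMode) {
        return new ExtractionRequest(url, html, baseHref, extractor, outputMode);
    }

    private static boolean isBlank(String s) {
        return s == null || s.trim().isEmpty();
    }

    public static void main(String[] args) {
        ExtractionRequest r = forUrl("http://example.com/news/1.html", "http://example.com/news/")
                .withOptions("DefaultExtractor", OutputMode.JSON);
        System.out.println(r.extractor + " / " + r.outputMode);
    }
}
```

Validating in the constructor keeps the "URL or full HTML" rule in one place, so callers cannot build a request that has neither.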

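Improvement
 Strengthened detection and handling of HTTP and HTML encodings
 Added support for HTTP response decompression algorithms
 Insert a <base> tag so that images with relative paths display correctly
 Upgraded to the latest Boilerpipe and the related nekohtml library
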
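Two of the improvements above, response decompression and <base> tag insertion, can be sketched with the Java standard library alone. The class and method names (`ResponseFixups`, `decompress`, `insertBaseTag`) are hypothetical, and the regex-based insertion is a simplification chosen for brevity; the slides do not describe how the actual integration does either step.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.zip.GZIPInputStream;
import java.util.zip.InflaterInputStream;

/** Hypothetical helpers; not taken from the actual integration. */
final class ResponseFixups {
    private static final Pattern HEAD_OPEN =
            Pattern.compile("<head(?:\\s[^>]*)?>", Pattern.CASE_INSENSITIVE);
    private static final Pattern BASE_TAG =
            Pattern.compile("<base\\b[^>]*>", Pattern.CASE_INSENSITIVE);

    /** Undoes the Content-Encoding of a response body; unknown encodings pass through. */
    static byte[] decompress(byte[] body, String contentEncoding) {
        String enc = contentEncoding == null
                ? "" : contentEncoding.trim().toLowerCase(Locale.ROOT);
        try {
            if (enc.equals("gzip") || enc.equals("x-gzip")) {
                try (InputStream in = new GZIPInputStream(new ByteArrayInputStream(body))) {
                    return in.readAllBytes();
                }
            }
            if (enc.equals("deflate")) {
                // Expects zlib-wrapped data; raw deflate streams are not handled here.
                try (InputStream in = new InflaterInputStream(new ByteArrayInputStream(body))) {
                    return in.readAllBytes();
                }
            }
            return body;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Adds a base tag right after the opening head tag unless the page already has one. */
    static String insertBaseTag(String html, String baseHref) {
        if (BASE_TAG.matcher(html).find()) {
            return html;
        }
        String tag = "<base href=\"" + baseHref.replace("\"", "&quot;") + "\">";
        Matcher head = HEAD_OPEN.matcher(html);
        if (head.find()) {
            return html.substring(0, head.end()) + tag + html.substring(head.end());
        }
        // No head element found: prepend the tag and rely on a lenient HTML parser.
        return tag + html;
    }

    public static void main(String[] args) {
        String page = "<html><head></head><body><img src=\"a.png\"></body></html>";
        System.out.println(insertBaseTag(page, "http://example.com/news/"));
    }
}
```

With the base tag in place, a relative `src` such as `a.png` resolves against the original site even when the extracted HTML is displayed from somewhere else.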
Test Results
 150 news articles in total
   66 Traditional Chinese articles
   80 English articles
   2 Simplified Chinese articles
 The current success rate is 94%

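Failure Cases
 Only the HTML title is extracted, not the article body
   2 articles: China Times (中時電子報), Facebook timeline photos
 The main article body is missing
   2 articles: UrCosme beauty forum, Youth Daily News (青年日報)
 JavaScript code or HTML escape characters are extracted
   2 articles: Sing Pao (香港成報), The Wall Street Journal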

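Solved Cases
 Article text was frequently extracted as garbled characters
   2 articles: headline news on China Times (中時電子報)
   Cause: the full HTML could not be downloaded
 Solution
   Avoid Java's PushbackInputStream; instead download the full HTML in one pass, then sample the HTML string to determine the encoding of the whole document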
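The solution above (download everything first, then sample the text to decide the encoding) might look roughly like the following. It is a minimal sketch under stated assumptions: the class name `HtmlCharsetSniffer`, the 4096-byte sample size, and the rule "a charset declared in a meta tag wins, otherwise use the fallback" are all illustrative choices, not details given in the slides.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: buffer the whole page, then sniff its declared charset. */
final class HtmlCharsetSniffer {
    private static final int SAMPLE_BYTES = 4096; // assumed sample size
    private static final Pattern META_CHARSET = Pattern.compile(
            "<meta[^>]*charset\\s*=\\s*[\"']?([A-Za-z0-9_.:-]+)", Pattern.CASE_INSENSITIVE);

    /** Reads the stream to the end so that detection never works on a partial page. */
    static byte[] readFully(InputStream in) {
        try {
            return in.readAllBytes();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Looks for a charset declaration in the first bytes of the page. */
    static Charset detect(byte[] html, Charset fallback) {
        int n = Math.min(html.length, SAMPLE_BYTES);
        // ISO-8859-1 maps every byte to one char, so ASCII markup stays readable.
        String sample = new String(html, 0, n, StandardCharsets.ISO_8859_1);
        Matcher m = META_CHARSET.matcher(sample);
        if (m.find()) {
            try {
                if (Charset.isSupported(m.group(1))) {
                    return Charset.forName(m.group(1));
                }
            } catch (IllegalArgumentException ignored) {
                // Malformed charset name: fall through to the fallback.
            }
        }
        return fallback;
    }

    /** Downloads the full page, then decodes it with the detected charset. */
    static String decode(InputStream in, Charset fallback) {
        byte[] html = readFully(in);
        return new String(html, detect(html, fallback));
    }

    public static void main(String[] args) {
        byte[] page = "<html><head><meta charset=\"UTF-8\"></head><body>ok</body></html>"
                .getBytes(StandardCharsets.UTF_8);
        System.out.println(detect(page, StandardCharsets.ISO_8859_1));
    }
}
```

In practice the fallback would typically come from the charset in the HTTP Content-Type header; because the bytes are already in memory, the page can be decoded once the charset is known without re-reading the stream.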

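Solved Cases
 CJK special characters were garbled
   宏碁 R7 筆電「星際爭霸戰」款限量出擊 (Traditional Chinese headline about a limited-edition "Star Trek" Acer R7 laptop)
   朱镕基退休前后“判若两人” 非常注重晚节 (Simplified Chinese headline about Zhu Rongji before and after retirement)
   Cause: the Java charset of the declared name lacks some special characters
 Solution
   Traditional Chinese: use Big5-HKSCS instead of Big5
   Simplified Chinese: use GB18030 instead of GB2312
   Japanese: use Windows-31J instead of Shift_JIS
   Korean: no case found yet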
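The substitutions listed above amount to a lookup from the declared charset to a superset charset. A minimal sketch follows, assuming a hypothetical helper class `CharsetUpgrade`; only the three pairs named in the slides are mapped, and aliases of those names are not handled.

```java
import java.nio.charset.Charset;
import java.util.Locale;
import java.util.Map;

/** Hypothetical helper mapping a declared charset to the superset named in the slides. */
final class CharsetUpgrade {
    private static final Map<String, String> SUPERSETS = Map.of(
            "big5", "Big5-HKSCS",       // Traditional Chinese
            "gb2312", "GB18030",        // Simplified Chinese
            "shift_jis", "windows-31j"  // Japanese
    );

    /** Returns the superset charset name, or the declared name when no mapping exists. */
    static String upgradeName(String declared) {
        String trimmed = declared.trim();
        return SUPERSETS.getOrDefault(trimmed.toLowerCase(Locale.ROOT), trimmed);
    }

    /** Resolves to the superset when this JVM supports it, else to the declared charset. */
    static Charset resolve(String declared) {
        String upgraded = upgradeName(declared);
        return Charset.isSupported(upgraded)
                ? Charset.forName(upgraded)
                : Charset.forName(declared.trim());
    }

    public static void main(String[] args) {
        System.out.println(upgradeName("Big5"));
        System.out.println(upgradeName("GB2312"));
    }
}
```

Decoding with the superset costs nothing for pages that only use the base character set, which is why the substitution can be applied unconditionally.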

Algorithm Comparison
 Comparison criteria
   Structure retainment
   Inner content cleaning
   Implementation
   Language dependency
   Source parameter
   Additional features and remarks

 Boilerpipe
   Structure retainment: plain text only
   Inner content cleaning: uses a classifier to determine whether or not the atomic text block holds useful content
   Implementation: open source java library
   Language dependency: should be language independent, since the text block classifier observes language-independent text features
   Source parameter: you can fetch documents by yourself or use built-in utilities to fetch them for you
   Additional features and remarks: implements many extractors with different classification rules trained on different datasets

 Alchemy API
   Structure retainment: text only (has an option to include relevant hyperlinks)
   Inner content cleaning: n/a
   Implementation: commercial web api
   Language dependency: observation: returns an error for non-english content, e.g. the document contains “unsupported text …
   Source parameter: include the whole document in the post request or provide an url
   Additional features and remarks: extra API call to extract the title

 Diffbot
   Structure retainment: plain text or html
   Inner content cleaning: an option to remove inline ads
   Implementation: web api (private beta)
   Language dependency: n/a
   Source parameter: does fetching for you via provided url
   Additional features and remarks: extracts relevant media, title, tags, xpath descriptor for wrappers, comments and comment count, article summary

 Readability
   Structure retainment: retains original structure
   Inner content cleaning: uses hardcoded heuristics to extract content divided by ads
   Implementation: open source javascript bookmarklet
   Language dependency: language independent, but it relies on language dependent regular expressions to match id and class labels
   Source parameter: via browser

 Goose
   Structure retainment: plain text
   Inner content cleaning: n/a
   Implementation: open source java library
   Language dependency: language independent, but it relies on language dependent regular expressions to match id and class labels
   Source parameter: url only (my fork enables you to fetch the document by yourself)
   Additional features and remarks: uses hardcoded heuristics to search for related images and embedded media

 Extractiv
   Structure retainment: depends on the chosen output format (e.g. the xml format breaks the content)
   Inner content cleaning: n/a
   Implementation: commercial web api
   Language dependency: n/a
   Source parameter: include the whole document in post request or provide an url
   Additional features and remarks: capable of enriching the extracted text with semantic entities and relationships

 Repustate API
   Structure retainment: plain text
   Inner content cleaning: n/a
   Implementation: commercial web api
   Language dependency: n/a
   Source parameter: url only

 Webstemmer
   Structure retainment: plain text
   Inner content cleaning: n/a
   Implementation: open source python library
   Source parameter: first runs a crawler to obtain seed pages, then it learns layout patterns that are later put to work to extract arbitrary html documents
   Additional features and remarks: the only piece of software on this list that requires a cluster of similar documents obtained by crawling

 NCleaner (paper)
   Structure retainment: plain text
   Inner content cleaning: uses character level n-grams to detect content text blocks
   Implementation: open source perl library
   Language dependency: depends on the training language
   Additional features and remarks: reliant on lynx browser for converting html to structured plain text

Reference
 Evaluating Text Extraction Algorithms
 List of resources: Article text extraction from HTML documents
 Feature-wise Comparison of HTML Article Text Extractors
 Overview: Extracting article text from HTML documents
 Readability for Java - Snacktory

Conclusion
 Next steps
   The article text extracted by Boilerpipe does not yet include image information
   Add a cache mechanism for the full HTML or extracted article text of each URL
 Q&A
