Wrapper Induction: Construct wrappers automatically to extract information from web sources

     Hongfei Qu
     Computing Science Department
     Simon Fraser University

     CMPT 882 Presentation
     March 28, 2001

Outline:
•   What is a wrapper
•   Wrapper Induction
•   WIEN
•   STALKER
•   Remaining Questions
•   HTML DOM Tree
•   Other Related Works
•   References




                What is a wrapper
• A wrapper is a procedure that extracts all kinds of data
  from a specific web source
• First find a vector of strings that delimit the text to be
  extracted
• Example page:
  <HTML><TITLE>Country Codes</TITLE>
  <BODY><B>Congo</B> <I>242</I><BR>
  <B>Spain</B> <I>34</I><BR>
  <HR><B>END</B></BODY></HTML>
• To extract (country, code) pairs, we find a vector of
  strings (<B>, </B>, <I>, </I>) that mark the left and
  right of the extracted text.
• execLR(wrapper(<B>, </B>, <I>, </I>), page P):
    m = 0
    while there are more occurrences in P of <B>
      m = m + 1
      for each (lk, rk) in {(<B>, </B>), (<I>, </I>)}
        scan in P to the next occurrence of lk; save position as b(m,k)
        scan in P to the next occurrence of rk; save position as e(m,k)
    return label {…, (b(m,1), e(m,1)), (b(m,2), e(m,2)), …}
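The execLR procedure above can be sketched in Python as follows. This is a minimal illustration, not the original implementation: the function name exec_lr, the list-of-pairs representation of the delimiter vector, and the choice to silently drop a tuple whose remaining delimiters cannot be found (which is what happens at <B>END</B> on this page) are all assumptions.

```python
def exec_lr(delims, page):
    """Apply an LR wrapper; delims is [(l1, r1), ..., (lK, rK)].

    Returns one row per tuple, each row a list of (begin, end) index pairs.
    """
    label, pos = [], 0
    while True:
        row = []
        for left, right in delims:
            b = page.find(left, pos)      # scan to the next occurrence of lk
            if b == -1:
                return label              # assumption: drop an incomplete tuple
            b += len(left)                # the attribute starts after the delimiter
            e = page.find(right, b)       # scan to the next occurrence of rk
            if e == -1:
                return label
            row.append((b, e))
            pos = e + len(right)
        label.append(row)


PAGE = (
    "<HTML><TITLE>Country Codes</TITLE>"
    "<BODY><B>Congo</B> <I>242</I><BR>"
    "<B>Spain</B> <I>34</I><BR>"
    "<HR><B>END</B></BODY></HTML>"
)

label = exec_lr([("<B>", "</B>"), ("<I>", "</I>")], PAGE)
pairs = [tuple(PAGE[b:e] for b, e in row) for row in label]
print(pairs)  # [('Congo', '242'), ('Spain', '34')]
```

The trailing <B>END</B> starts a third tuple, but no <I> follows it, so under the stated assumption the partial row is discarded.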




              Wrapper Induction

• Motivation: hand-coded wrappers are tedious to write and
  error-prone. What happens when web pages change?
• Wrapper induction, the automatic generation of wrappers,
  is a typical machine learning task.
• Input: a set E of example pages Pn and the corresponding
  label pages Ln
• Output: a wrapper w such that w(Pn) = Ln
• What we actually learn is a vector of delimiters, which is
  used to instantiate a wrapper class (template) that
  describes the document structure
• Free text & Web pages
• A good wrapper induction system is judged on:
    – Expressiveness: how well the wrapper handles a
      particular web site
    – Efficiency: how many examples are needed? How much
      computation is required?




                       WIEN

• The first wrapper induction system, implemented at
  U. Washington. Works for both Web pages and free text.
• WIEN defines 6 wrapper classes (templates) to express the
  structure of web sites.
• The simplest, yet still powerful, one is the LR (left-right)
  wrapper class. It uses left- and right-hand delimiters to
  extract the relevant information.
• To extract tuples with K attributes from a set of examples E,
  the learning algorithm is:
• Procedure learnLR(examples E)
    for each 1 <= k <= K
      for each u in Candl(k, E): if u is valid for the kth
      attribute in E, then lk = u and terminate the loop
    for each 1 <= k <= K
      for each u in Candr(k, E): if u is valid for the kth
      attribute in E, then rk = u and terminate the loop
    return LR wrapper(l1, r1, …, lK, rK)
• Procedure Candl(k, E) returns candidates for lk by
  enumerating the suffixes of the shortest string occurring
  to the left of each instance of attribute k
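A rough Python sketch of learnLR on the Country Codes page follows. Several things here are assumptions rather than WIEN's actual definitions: the helper names (contexts, label_of), the shortest-first enumeration order, taking right-hand candidates as prefixes of the shortest string to the right of each instance, and above all the validity test, which is simplified to "scanning for u stops exactly at the labelled position in every example".

```python
def exec_lr(delims, page):
    """Apply an LR wrapper; returns rows of (begin, end) index pairs."""
    label, pos = [], 0
    while True:
        row = []
        for left, right in delims:
            b = page.find(left, pos)
            if b == -1:
                return label
            b += len(left)
            e = page.find(right, b)
            if e == -1:
                return label
            row.append((b, e))
            pos = e + len(right)
        label.append(row)


def contexts(page, label, k):
    """Yield (text to the left, value, text to the right) per instance of attribute k."""
    spans = [span for row in label for span in row]   # spans in document order
    K = len(label[0])
    for i in range(k, len(spans), K):
        b, e = spans[i]
        prev_end = spans[i - 1][1] if i > 0 else 0
        next_begin = spans[i + 1][0] if i + 1 < len(spans) else len(page)
        yield page[prev_end:b], page[b:e], page[e:next_begin]


def learn_lr(examples):
    """examples: list of (page, label). Returns [(l1, r1), ..., (lK, rK)]."""
    K = len(examples[0][1][0])
    delims = []
    for k in range(K):
        ctx = [c for page, label in examples for c in contexts(page, label, k)]
        s_left = min((left for left, _, _ in ctx), key=len)
        s_right = min((right for _, _, right in ctx), key=len)
        cand_l = [s_left[i:] for i in range(len(s_left) - 1, -1, -1)]  # suffixes
        cand_r = [s_right[:i] for i in range(1, len(s_right) + 1)]     # prefixes
        # Simplified validity (assumption): the first match of u is at the label boundary.
        l_k = next(u for u in cand_l
                   if all(left.find(u) == len(left) - len(u) for left, _, _ in ctx))
        r_k = next(u for u in cand_r
                   if all((val + right).find(u) == len(val) for _, val, right in ctx))
        delims.append((l_k, r_k))
    return delims


def label_of(page, rows):
    """Build a label page (index spans) from the attribute values, in order."""
    label, pos = [], 0
    for row in rows:
        spans = []
        for value in row:
            b = page.index(value, pos)
            pos = b + len(value)
            spans.append((b, pos))
        label.append(spans)
    return label


PAGE = (
    "<HTML><TITLE>Country Codes</TITLE>"
    "<BODY><B>Congo</B> <I>242</I><BR>"
    "<B>Spain</B> <I>34</I><BR>"
    "<HR><B>END</B></BODY></HTML>"
)
LABEL = label_of(PAGE, [("Congo", "242"), ("Spain", "34")])
wrapper = learn_lr([(PAGE, LABEL)])
print(wrapper)
print(exec_lr(wrapper, PAGE) == LABEL)
```

Because candidates are tried shortest first, the learned delimiters need not be whole tags; what matters in this sketch is only that the learned wrapper reproduces the label page.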




                       WIEN

• Procedure Candr(k, E) returns candidates for rk by
  enumerating the prefixes of the shortest string occurring
  to the right of each instance of attribute k
• Each wrapper class has its own set of validity constraints
• Other wrapper classes:
   – HLRT: adds a head delimiter h and a tail delimiter t
   – OCLR: uses open and close delimiters to indicate the
     beginning and end of each tuple
   – HOCLRT: combination of HLRT and OCLR
   – N-LR and N-HLRT: handle nested structure
• Together, the 6 classes can handle 70% of web sites
• Which wrapper class do we choose for a web site?
• How many examples are required? PAC model
    N: number of examples
    e: accuracy parameter, 0 < e < 1
    a: confidence parameter, 0 < a < 1
  For a learned wrapper W, if we want error(W) < e with
  probability at least a, the PAC model for the LR class is:
    N >= 1/(1-a) * (2K*ln(R) - ln(1-a)), where R is the
    length of the shortest example.
• A way to terminate the learning procedure
• A loose bound compared with test results
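To get a feel for the size of the LR bound, the snippet below evaluates the expression exactly as it is written on the slide. The function name and the sample values K = 2, R = 100, a = 0.9 are illustrative assumptions. Note that the expression as written does not involve the accuracy parameter e, so it is worth checking against Kushmerick (2000) before relying on the numbers.

```python
import math


def lr_examples_needed(K, R, a):
    """Evaluate N >= 1/(1-a) * (2K*ln(R) - ln(1-a)) as written on the slide."""
    bound = (1.0 / (1.0 - a)) * (2 * K * math.log(R) - math.log(1.0 - a))
    return math.ceil(bound)   # smallest integer N that satisfies the inequality


# Hypothetical setting: 2 attributes, shortest example 100 characters long.
n = lr_examples_needed(K=2, R=100, a=0.9)
print(n)
```

The bound grows linearly in K and only logarithmically in R, which is consistent with the slide's remark that it is loose compared with test results.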




                    STALKER

• A wrapper induction project at U. Southern California.
  Only works for Web pages.
• More expressive and more efficient than WIEN.
• Treats a web page as a tree-like structure and handles
  information extraction hierarchically
• Uses disjunctions to deal with variations. Disjunctive
  rules are ordered lists of individual disjuncts. The
  wrapper successively applies each disjunct in the list
  until it finds one that matches
• Landmarks: sequences of tokens, used as arguments of
  functions such as SkipTo.
    SkipTo(<b>): start from the beginning and skip
    everything until the landmark <b> is found
    SkipTo(<b>)SkipTo(<I>)
• These functions represent the rules that extract the
  information
• Start rule: identifies the beginning of an attribute
• End rule: identifies the end of an attribute
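As a small illustration of SkipTo chains and ordered disjunctions, here is a hedged Python sketch. It treats landmarks as plain substrings rather than token sequences, and the names skip_to and apply_disjunction as well as the sample strings are assumptions made for this example, not part of STALKER.

```python
def skip_to(text, landmarks, start=0):
    """Apply SkipTo(l1)SkipTo(l2)...; return the index just past the last landmark."""
    pos = start
    for landmark in landmarks:
        hit = text.find(landmark, pos)
        if hit == -1:
            return None               # the rule does not match
        pos = hit + len(landmark)
    return pos


def apply_disjunction(text, disjuncts, start=0):
    """Try each disjunct in order and return the result of the first that matches."""
    for landmarks in disjuncts:
        pos = skip_to(text, landmarks, start)
        if pos is not None:
            return pos
    return None


record = "<p>Name:<b>Hongfei</b> <I>1111</I>"
name_start = skip_to(record, ["<b>"])          # SkipTo(<b>)
id_start = skip_to(record, ["<b>", "<I>"])     # SkipTo(<b>)SkipTo(<I>)
print(record[name_start:name_start + 7], record[id_start:id_start + 4])

# Hypothetical variation: the value is bold on some pages and italic on others.
rule = [["<b>"], ["<i>"]]
for page in ["Price: <b>5</b>", "Price: <i>7</i>"]:
    start = apply_disjunction(page, rule)
    print(page[start])
```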




                    STALKER

• These SkipTo( ) functions represent a finite state
  machine model
• Extraction rules: get information
    [diagram: state Si, transition labelled "landmark", state Sj]
• Iteration rules: handle nested structure
    [diagram: state Si, transition labelled "landmark" back to Si]
• Example document:
    <body><p>Name:<b>Hongfei</b><p>ID:<b>1111</b>
    <P>Address:<br><b>4000 Main St, Vancouver, BC, (604)333-3233
    </b><br><b>3000 Hastings St, LA, CA, 1-805-486-5675</b></body>
• Extraction hierarchy:
    Document
        Extraction rule: SkipTo(<br>) & SkipTo(</body>)
    Name   ID   List of Address
        Iteration rule: SkipTo(<b>) & SkipTo(</b>)
    St   city   province   area_code   phone
        Extraction rule: either SkipTo( ( ) or SkipTo( 1- )
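The hierarchy in the example document can be walked with a few lines of Python. This is a sketch under assumptions: landmarks are plain substrings, the helper names (skip_to, extract, iterate) are invented for illustration, the end landmark is found by scanning forward, and because the slide gives no end rule for area_code the last step only reports the text from the area_code start onward.

```python
DOC = (
    "<body><p>Name:<b>Hongfei</b><p>ID:<b>1111</b>"
    "<P>Address:<br><b>4000 Main St, Vancouver, BC, (604)333-3233"
    "</b><br><b>3000 Hastings St, LA, CA, 1-805-486-5675</b></body>"
)


def skip_to(text, landmark, start=0):
    """Return the index just past the next occurrence of landmark, or None."""
    hit = text.find(landmark, start)
    return None if hit == -1 else hit + len(landmark)


def extract(text, start_landmark, end_landmark, start=0):
    """Return (content, index after the end landmark), or None if a landmark is missing."""
    b = skip_to(text, start_landmark, start)
    if b is None:
        return None
    e = text.find(end_landmark, b)
    if e == -1:
        return None
    return text[b:e], e + len(end_landmark)


def iterate(text, start_landmark, end_landmark):
    """Apply the same start/end pair repeatedly to collect every list item."""
    items, pos = [], 0
    while (found := extract(text, start_landmark, end_landmark, pos)) is not None:
        item, pos = found
        items.append(item)
    return items


# Level 1: extraction rule SkipTo(<br>) & SkipTo(</body>) isolates the address list.
address_list, _ = extract(DOC, "<br>", "</body>")
# Level 2: iteration rule SkipTo(<b>) & SkipTo(</b>) yields one item per address.
addresses = iterate(address_list, "<b>", "</b>")
# Level 3: either SkipTo( ( ) or SkipTo( 1- ) finds where area_code begins.
tails = []
for address in addresses:
    start = skip_to(address, "(")
    if start is None:
        start = skip_to(address, "1-")
    tails.append(address[start:])
print(addresses)
print(tails)
```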




                   STALKER

• Uses a sequential covering algorithm
• STALKER(examples)
    Set setRule to be empty
    While there are more examples
      Get a disjunct D by learning from the examples
      Remove all examples covered by D
      Add D to setRule
    Return setRule
• STALKER can handle 90% of web sites and is more efficient.
• Generates imperfect rules

              Remaining Questions

• Find a more expressive model of document structure
• Select only the informative examples to learn a wrapper
  (active learning? data mining?)
• How to generate label pages automatically instead of by
  hand markup?
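The sequential covering loop from the STALKER slide can be sketched as below. The outer loop follows the slide's pseudocode; the inner learner is a toy stand-in and an assumption of this example, not STALKER's actual search: it proposes the one or two characters immediately before each labelled position as landmarks and keeps the one that covers the most remaining examples without landing on a wrong position in any example.

```python
def skip_to(text, landmark):
    """Return the index just past the first occurrence of landmark, or None."""
    hit = text.find(landmark)
    return None if hit == -1 else hit + len(landmark)


def candidates(examples, max_len=2):
    """Toy generator: the 1..max_len characters right before each target position."""
    seen = []
    for text, target in examples:
        for n in range(1, max_len + 1):
            cand = text[max(0, target - n):target]
            if cand and cand not in seen:
                seen.append(cand)
    return seen


def learn_disjunct(remaining, all_examples):
    """Pick the landmark covering most remaining examples with no wrong matches."""
    best, best_cov = None, []
    for cand in candidates(remaining):
        # Reject landmarks that match some example at the wrong position.
        if any(skip_to(t, cand) not in (None, tgt) for t, tgt in all_examples):
            continue
        cov = [ex for ex in remaining if skip_to(ex[0], cand) == ex[1]]
        if len(cov) > len(best_cov):
            best, best_cov = cand, cov
    return best, best_cov


def stalker(examples):
    """Sequential covering: learn a disjunct, drop what it covers, repeat."""
    set_rule, remaining = [], list(examples)
    while remaining:
        disjunct, covered = learn_disjunct(remaining, examples)
        if disjunct is None:
            raise ValueError("no consistent landmark found")
        remaining = [ex for ex in remaining if ex not in covered]
        set_rule.append(disjunct)
    return set_rule


A1 = "4000 Main St, Vancouver, BC, (604)333-3233"
A2 = "3000 Hastings St, LA, CA, 1-805-486-5675"
examples = [(A1, A1.index("604")), (A2, A2.index("805"))]   # start of area_code
rules = stalker(examples)
print(rules)  # ['(', '1-']
```

On these two addresses the loop ends with the same ordered disjunction as the slide's example, SkipTo( ( ) followed by SkipTo( 1- ).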




                 HTML DOM Tree

• Uses a DOM-like tree model built from the HTML tags

                    HTML
         Head                Body
         Title          LI    LI    LI

• The navigation methods are similar to those of the XML
  DOM tree. Only works for web pages.
• Uses the tree path to extract information
• Can also follow the document flow, like STALKER, to
  extract information
• Gets rid of imperfect rules and is more efficient

              Other Related Works

• TrIAs: HTML tree
• SoftMealy: the first to use disjunctive rules and a finite
  state machine model
• WHISK: works for web pages and free text, more expressive
  than WIEN, decision-making is based on limited context.
  Slower.
• SRV
• CRYSTAL
• RAPIER
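The tree-path extraction described on the HTML DOM Tree slide can be illustrated with Python's standard html.parser module. This is only a sketch: the Node and TreeBuilder classes, the select helper, and the sample page are assumptions for illustration, and the builder expects well-formed markup in which every tag is explicitly closed.

```python
from html.parser import HTMLParser


class Node:
    def __init__(self, tag, parent=None):
        self.tag, self.parent = tag, parent
        self.children, self.text = [], ""


class TreeBuilder(HTMLParser):
    """Build a minimal DOM-like tree; assumes every tag is properly closed."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.cur = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.cur)
        self.cur.children.append(node)
        self.cur = node

    def handle_endtag(self, tag):
        if self.cur.parent is not None:
            self.cur = self.cur.parent

    def handle_data(self, data):
        self.cur.text += data


def select(node, path):
    """Return every node reached by following the tag path from node."""
    if not path:
        return [node]
    matches = []
    for child in node.children:
        if child.tag == path[0]:
            matches.extend(select(child, path[1:]))
    return matches


# Hypothetical page shaped like the tree on the slide: Head/Title and Body/LI x 3.
PAGE = (
    "<html><head><title>Country Codes</title></head>"
    "<body><li>Congo 242</li><li>Spain 34</li><li>Canada 1</li></body></html>"
)
builder = TreeBuilder()
builder.feed(PAGE)
builder.close()

items = [n.text for n in select(builder.root, ["html", "body", "li"])]
title = [n.text for n in select(builder.root, ["html", "head", "title"])]
print(title, items)
```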




References
•   Nicholas Kushmerick, Wrapper induction: Efficiency and
    expressiveness, Artificial Intelligence 118, 2000
•   Ion Muslea, Steven Minton, Craig A. Knoblock, A hierarchical
    approach to wrapper induction, Conference on Autonomous Agents,
    Seattle, WA, 1999
•   S. Soderland, Learning information extraction rules for
    semi-structured and free text, Machine Learning 34, 1999
•   C. Hsu, M. Dung, Generating finite-state transducers for
    semistructured data extraction from the web, Information Systems
    23, 1998
•   M. Bauer, D. Dengler, TrIAs: An architecture for trainable
    information assistants, Workshop on AI and Information Integration,
    Madison, WI, 1998
•   D. Freitag, Information extraction from HTML: Application of a
    general machine learning approach, AAAI-98, Madison, WI, 1998



