Tutorial: Text Analytics for Security 
William Enck 
North Carolina State University 
http://www.enck.org 
enck@cs.ncsu.edu 
Tao Xie 
University of Illinois at Urbana-Champaign 
http://web.engr.illinois.edu/~taoxie/ 
taoxie@illinois.edu
What is Computer Security? 
“A computer is secure if you can depend on it and its software to behave as you expect.”
User Expectations 
•User expectations are a form of context. 
•Other forms of context for security decisions 
–Temporal context (e.g., time of day) 
–Environmental context (e.g., location) 
–Execution context 
•OS level (e.g., UID, arguments) 
•Program analysis level (e.g., control flow, data flow)
Defining User Expectations 
•User expectations are difficult to formally (and even informally) define. 
–Based on an individual’s perception, which results from past experiences and education 
–... so, we can’t be perfect 
•Starting place: look at the user interface
Why Text Analytics? 
•User interface consists of graphics and text 
–End users: includes finding, installing, and running the software (e.g., first run vs. subsequent) 
–Developers: includes API documentation, comments in code, and requirements documents 
•Goal: process natural language textual sources to aid security decisions
Outline 
•Introduction 
•Background on text analytics 
•Case Study 1: App Markets 
•Case Study 2: ACP Rules 
•Wrap-up
Challenges in Analyzing NL Data 
•Unstructured 
–Hard to parse; grammar is sometimes incorrect 
•Ambiguous: often has no defined or precise semantics (as opposed to source code) 
–Hard to understand 
•Many ways to represent similar concepts 
–Hard to extract information from 
/* We need to acquire the write IRQ lock before calling ep_unlink(). */ 
/* Lock must be acquired on entry to this function. */ 
/* Caller must hold instance lock! */
Why Analyzing NL Data is Easy(?) 
•Redundant data 
•Easy to get “good” results for simple tasks 
–Simple algorithms without much tuning effort 
•Evolution/version history readily available 
•Many techniques to borrow from text analytics: NLP, Machine Learning (ML), Information Retrieval (IR), etc.
Text Analytics 
[Figure: text analytics draws on data analysis, computational linguistics, search & databases, and knowledge representation & reasoning/tagging] 
©M. Grobelnik, D. Mladenic
Why Analyzing NL Data is Hard(?) 
•Domain-specific words/phrases and meanings 
–“Call a function” vs. call a friend 
–“Computer memory” vs. human memory 
–“This method also returns false if path is null” 
•Poor quality of text 
–Inconsistent statements 
–Grammar mistakes 
•“true if path is an absolute path; otherwise false” for the File class in the .NET Framework 
–Incomplete information
Some Major NLP/Text Analytics Tools 
•Stanford statistical NLP resources: http://nlp.stanford.edu/links/statnlp.html 
•Stanford Parser: http://nlp.stanford.edu/software/lex-parser.shtml 
•Apache UIMA: http://uima.apache.org/ 
•KDnuggets text analysis software directory: http://www.kdnuggets.com/software/text.html 
•Commercial tools, e.g., Text Miner and Text Analytics for Surveys
Dimensions in Text Analytics 
•Three major dimensions of text analytics: 
–Representations 
•…from words to partial/full parsing 
–Techniques 
•…from manual work to learning 
–Tasks 
•…from search, through (un-)supervised learning, to summarization, … 
©M. Grobelnik, D. Mladenic
Major Text Representations 
•Words (stop words, stemming) 
•Part-of-speech tags 
•Chunk parsing (chunking) 
•Semantic role labeling 
•Vector space model 
©M. Grobelnik, D. Mladenic
Words’ Properties 
•Relations among word surface forms and their senses: 
–Homonymy: same form, but different meaning (e.g. bank: river bank, financial institution) 
–Polysemy: same form, related meaning (e.g. bank: blood bank, financial institution) 
–Synonymy: different form, same meaning (e.g. singer, vocalist) 
–Hyponymy: one word denotes a subclass of another (e.g. breakfast, meal) 
•General thesaurus: WordNet, existing in many other languages (e.g. EuroWordNet) 
–http://wordnet.princeton.edu/ 
–http://www.illc.uva.nl/EuroWordNet/ 
©M. Grobelnik, D. Mladenic
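These sense relations can be queried programmatically. Below is a minimal Python sketch, assuming NLTK is installed and its WordNet corpus has been downloaded; the word pairs mirror the slide’s examples.

# Exploring word senses with WordNet via NLTK.
from nltk.corpus import wordnet as wn

# Homonymy/polysemy: "bank" has several distinct senses.
for synset in wn.synsets("bank")[:3]:
    print(synset.name(), "-", synset.definition())

# Synonymy: lemmas sharing one synset (e.g., singer/vocalist).
print(wn.synset("singer.n.01").lemma_names())

# Hyponymy: "breakfast" is a kind of "meal".
print(wn.synset("meal.n.01") in wn.synset("breakfast.n.01").hypernyms())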
Stop Words 
•Stop words are words that, from a non-linguistic point of view, carry little information 
–…they have a mainly functional role 
–…we usually remove them to help mining techniques perform better 
•Stop words are language dependent – examples: 
–English: A, ABOUT, ABOVE, ACROSS, AFTER, AGAIN, AGAINST, ALL, ALMOST, ALONE, ALONG, ALREADY, ... 
©M. Grobelnik, D. Mladenic
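A minimal stop-word removal sketch, assuming NLTK with its English stop-word list downloaded; the sample sentence is illustrative.

from nltk.corpus import stopwords

stop_set = set(stopwords.words("english"))
tokens = "the app can read all of your contacts".split()
# Keep only the content-bearing words.
print([t for t in tokens if t.lower() not in stop_set])
# -> ['app', 'read', 'contacts']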
Stemming 
•Different forms of the same word are usually problematic for text analysis, because they have different spellings but similar meanings (e.g. learns, learned, learning, …) 
•Stemming is a process of transforming a word into its stem (normalized form) 
–…stemming provides an inexpensive mechanism to merge the different forms of a word 
©M. Grobelnik, D. Mladenic
Stemming cont. 
•For English, the most widely used is the Porter stemmer, at http://www.tartarus.org/~martin/PorterStemmer/ 
•Example cascade rules used in the English Porter stemmer 
–ATIONAL -> ATE: relational -> relate 
–TIONAL -> TION: conditional -> condition 
–ENCI -> ENCE: valenci -> valence 
–ANCI -> ANCE: hesitanci -> hesitance 
–IZER -> IZE: digitizer -> digitize 
–ABLI -> ABLE: conformabli -> conformable 
–ALLI -> AL: radicalli -> radical 
–ENTLI -> ENT: differentli -> different 
–ELI -> E: vileli -> vile 
–OUSLI -> OUS: analogousli -> analogous 
©M. Grobelnik, D. Mladenic
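A minimal sketch using NLTK’s Porter implementation (assumed installed). Note that the full algorithm chains several rule steps, so outputs can be shorter than the single cascade rules above suggest (e.g., a trailing “e” may also be stripped).

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
# Different forms of "learn" collapse to one stem.
for word in ["learns", "learned", "learning"]:
    print(word, "->", stemmer.stem(word))  # all print "learn"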
Part-of-Speech Tags 
•Part-of-speech tags specify word types, enabling differentiation of word functions 
–For text analysis, part-of-speech tags are used mainly for “information extraction,” where we are interested in, e.g., named entities (“noun phrases”) 
–Another possible use is reduction of the vocabulary (features) 
•…it is known that nouns carry most of the information in text documents 
•Part-of-Speech taggers are usually learned on manually tagged data 
©M. Grobelnik, D. Mladenic
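A minimal tagging sketch with NLTK’s default English tagger, assuming the 'averaged_perceptron_tagger' resource has been downloaded.

import nltk

tokens = "The little cat sat on the mat".split()
print(nltk.pos_tag(tokens))
# [('The', 'DT'), ('little', 'JJ'), ('cat', 'NN'), ('sat', 'VBD'),
#  ('on', 'IN'), ('the', 'DT'), ('mat', 'NN')]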
Part-of-Speech Table 
http://www.englishclub.com/grammar/parts-of-speech_1.htm 
©M. Grobelnik, D. Mladenic 
http://www.clips.ua.ac.be/pages/mbsp-tags
Part-of-Speech Examples 
http://www.englishclub.com/grammar/parts-of-speech_2.htm 
©M. Grobelnik, D. Mladenic
Part-of-Speech Tags 
http://www2.sis.pitt.edu/~is2420/class-notes/2.pdf
Full Parsing 
•Parsing provides maximum structural information per sentence 
•Input: a sentence; output: a parse tree 
•For most text analysis techniques, the information in parse trees is too complex 
•Problems with full parsing: 
–Low accuracy 
–Slow 
–Domain Specific 
©M. Grobelnik, D. Mladenic
Chunk Parsing 
•Break text up into non-overlapping contiguous subsets of tokens. 
–a.k.a. partial/shallow parsing, light parsing 
•What is it useful for? 
–Entity recognition 
•people, locations, organizations 
–Studying linguistic patterns 
•gave NP 
•gave up NP in NP 
•gave NP NP 
•gave NP to NP 
–Can ignore complex structure when not relevant 
©M. Hearst
Chunk Parsing 
Goal: divide a sentence into a sequence of chunks. 
•Chunks are non-overlapping regions of a text 
[I] saw [a tall man] in [the park] 
•Chunks are non-recursive 
–A chunk cannot contain other chunks 
•Chunks are non-exhaustive 
–Not all words are included in the chunks 
©S. Bird
Chunk Parsing Techniques 
•Chunk parsers usually ignore lexical content 
•Only need to look at part-of-speech tags 
•Techniques for implementing chunk parsing 
–E.g., Regular expression matching 
©S. Bird
Regular Expression Matching 
•Define a regular expression that matches the sequences of tags in a chunk 
–A simple noun-phrase chunk regexp: 
<DT>? <JJ>* <NN.?> 
•Chunk all matching subsequences: 
The/DT little/JJ cat/NN sat/VBD on/IN the/DT mat/NN 
[The/DT little/JJ cat/NN] sat/VBD on/IN [the/DT mat/NN] 
•If matching subsequences overlap, the first one gets priority 
©S. Bird 
DT: determiner; JJ: adjective; NN: noun, singular or mass; 
VBD: verb, past tense; IN: preposition/subordinating conjunction
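The same tag pattern can be run directly with NLTK’s RegexpParser; a minimal sketch over pre-tagged tokens:

import nltk

grammar = "NP: {<DT>?<JJ>*<NN.?>}"  # the noun-phrase regexp above
parser = nltk.RegexpParser(grammar)
tagged = [("The", "DT"), ("little", "JJ"), ("cat", "NN"),
          ("sat", "VBD"), ("on", "IN"), ("the", "DT"), ("mat", "NN")]
print(parser.parse(tagged))
# (S (NP The/DT little/JJ cat/NN) sat/VBD on/IN (NP the/DT mat/NN))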
Semantic Role Labeling: Giving Semantic Labels to Phrases 
•[AGENT John] broke [THEME the window] 
•[THEME The window] broke 
•[AGENT Sotheby’s] .. offered [RECIPIENT the Dorrance heirs] [THEME a money-back guarantee] 
•[AGENT Sotheby’s] offered [THEME a money-back guarantee] to [RECIPIENT the Dorrance heirs] 
•[THEME a money-back guarantee] offered by [AGENT Sotheby’s] 
•[RECIPIENT the Dorrance heirs] will [ARGM-NEG not] be offered [THEME a money-back guarantee] 
©S.W. Yih&K. Toutanova
Semantic Role Labeling: Good for Question Answering 
Q: What was the name of the first computer system that defeated Kasparov? 
A: [PATIENT Kasparov] was defeated by [AGENT Deep Blue] [TIME in 1997]. 
Q: When was Napoleon defeated? 
Look for: [PATIENT Napoleon] [PRED defeat-synset] [ARGM-TMP *ANS*] 
©S.W. Yih&K. Toutanova
Typical Semantic Roles 
©S.W. Yih&K. Toutanova
Example Semantic Roles 
©S.W. Yih&K. Toutanova
Outline 
•Introduction 
•Background on text analytics 
•Case Study 1: App Markets 
•Case Study 2: ACP Rules 
•Wrap-up
Case Study: App Markets 
•App Markets have played an important role in the popularity of mobile devices 
•Provide users with a textual description of each application’s functionality 
Apple App Store 
Google Play 
Microsoft Windows Phone
Current Practice 
•Apple: market’s responsibility 
–Apple performs manual inspection 
•Google: user’s responsibility 
–Users approve permissions for security/privacy 
–Bouncer (static/dynamic malware analysis) 
•Windows Phone: hybrid 
–Permissions / manual inspection
Is Program Analysis Sufficient? 
•Previous approaches look at permissions, code, and runtime behaviors 
•Caveat: what does the user expect? 
–GPS Tracker: record and send location 
–Phone-call Recorder: record audio during call 
–One-Click Root: exploit vulnerability 
–Others are more subtle
Vision 
•Goal: bridge gap between user expectation and app behavior 
•WHYPER is a first step in this direction 
•Focus on permission and app descriptions 
–Limited to permissions that protect “user understandable” resources
Use Cases 
•Enhance user experience while installing apps 
•Functionality disclosure during application submission to the market 
•Complementing program analysis to ensure more appropriate justifications 
[Figure: WHYPER sits in the application market between developers and users] 
Straw man: Keyword Search 
•Confounding effects: 
–Certain keywords such as “contact” have a confounding meaning, e.g., “... displays user contacts, ...” vs. “... contact me at abc@xyz.com” 
•Semantic inference: 
–Sentences often describe a sensitive operation such as reading contacts without actually using the keyword “contact,” e.g., “share yoga exercises with your friends via email, sms”
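A small sketch of why the straw man fails on the slide’s two examples; mentions() is a hypothetical helper, not part of WHYPER.

import re

def mentions(keyword, sentence):
    return re.search(r"\b" + keyword + r"\b", sentence.lower()) is not None

# Confounding effect: flagged, although "contact" is a verb here.
print(mentions("contact", "... contact me at abc@xyz.com"))  # True
# Semantic inference: reads contacts, yet never says "contact".
print(mentions("contact",
               "share yoga exercises with your friends via email, sms"))  # False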
WHYPER Framework 
[Architecture: an app description and an app permission enter WHYPER. A preprocessor feeds an intermediate-representation generator built on an NLP parser, yielding a first-order-logic (FOL) representation; a semantic-graph generator derives semantic graphs from API docs; the semantic engine matches the two to produce an annotated description.]
Preprocessor 
•Period Handling 
–Decimals, ellipsis, shorthand notations (Mr., Dr.) 
•Sentence Boundaries 
–Tabs, bullet points, delimiters (:) 
–Symbols (*, -) and enumerated sentences 
•Named Entity Handling 
–E.g., “Pandora internet radio” 
•Abbreviation Handling 
–E.g., “Instant Message (IM)”
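A minimal sketch of this kind of preprocessing (illustrative only, not WHYPER’s implementation; the abbreviation list is a stand-in):

import re

ABBREVS = ["Mr.", "Dr.", "e.g.", "i.e."]  # hypothetical, partial list

def split_sentences(text):
    # Protect abbreviation periods so they do not end sentences.
    for abbrev in ABBREVS:
        text = text.replace(abbrev, abbrev.replace(".", "<DOT>"))
    # Treat ':', tabs, and bullet markers as sentence boundaries.
    text = re.sub(r"[:\t]|\n\s*[*\u2022-]\s*", ". ", text)
    parts = re.split(r"[.!?]+\s+", text)
    return [p.replace("<DOT>", ".").strip(" .") for p in parts if p.strip(" .")]

print(split_sentences("Ask Dr. Smith. Features:\n* record audio\n* share via SMS"))
# -> ['Ask Dr. Smith', 'Features', 'record audio', 'share via SMS']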
Intermediate Representation Generator 
[Figure: the sentence “Also you can share the yoga exercise to your friends via Email and SMS” is POS-tagged (Also/RB, you/PRP, can/MD, share/VB, the/DT, yoga/NN, exercise/NN, your/PRP, friends/NNS, Email/NNP, SMS/NNP), its typed dependencies are extracted (advmod, nsubj, aux, dobj, det, nn, prep_to, poss, prep_via, conj_and), and the result is converted into a graph rooted at “share,” linking “you,” “yoga exercise,” “friends” (owned by “you,” reached via prep_to), and “email”/“SMS” (reached via prep_via).] 
RB: adverb; PRP: pronoun; MD: verb, modal auxiliary; VB: verb, base form; DT: determiner; NN: noun, singular or mass; NNS: noun, plural; NNP: noun, proper singular 
http://www.clips.ua.ac.be/pages/mbsp-tags
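WHYPER builds this representation with the Stanford NLP toolchain; as a rough illustration, comparable typed dependencies can be extracted with spaCy (assumed installed with its small English model; spaCy’s label set differs slightly from Stanford’s, e.g. no collapsed prep_to/prep_via edges).

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Also you can share the yoga exercise to your friends via Email and SMS.")
for token in doc:
    # Print each dependency edge as label(head, dependent).
    print(f"{token.dep_}({token.head.text}, {token.text})")
# e.g., nsubj(share, you), aux(share, can), dobj(share, exercise), ...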
Semantic-Graph Generator 
•Systematic approach to infer graphs 
–Find related API documents using PScout [CCS’12] 
–Identify resource associated with permissions from the API class name 
•ContactsContract.Contacts 
–Inspect the member variables and member methods to identify actions and subordinate resources 
•ContactsContract.CommonDataKinds.Email
Semantic Engine 
[Figure: the graph for “Also you can share the yoga exercise to your friends via Email and SMS.” is matched against the permission’s semantic graph using WordNet similarity.]
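A sketch of the WordNet-similarity component in isolation (the full engine matches FOL representations against semantic graphs; this only scores word pairs). Assumes NLTK with the WordNet corpus downloaded.

from nltk.corpus import wordnet as wn

def similarity(word_a, word_b):
    # Best path similarity over all sense pairs of the two words.
    best = 0.0
    for syn_a in wn.synsets(word_a):
        for syn_b in wn.synsets(word_b):
            best = max(best, syn_a.path_similarity(syn_b) or 0.0)
    return best

print(similarity("share", "send"))  # related words score higher...
print(similarity("share", "yoga"))  # ...than unrelated ones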
Evaluation 
•Subjects 
–Permissions: READ_CONTACTS, READ_CALENDAR, RECORD_AUDIO 
–581/600* application descriptions (English only) 
–9,953 sentences 
•Research Questions 
–RQ1: What are the precision, recall, and F-Score of WHYPER in identifying permission sentences? 
–RQ2: How effective is WHYPER in identifying permission sentences, compared to keyword-based searching?
Subject Statistics 
Permission      | #N  | #S    | SP 
READ_CONTACTS   | 190 | 3,379 | 235 
READ_CALENDAR   | 191 | 2,752 | 283 
RECORD_AUDIO    | 200 | 3,822 | 245 
TOTAL           | 581 | 9,953 | 763 
(#N: number of applications; #S: number of sentences; SP: number of permission sentences)
RQ1 Results: Effectiveness 
•Out of 9,061 sentences, only 129 flagged as FPs 
•Among 581 apps, 109 apps (18.8%) contain at least one FP 
•Among 581 apps, 86 apps (14.8%) contain at least one FN 
Permission      | SI  | TP  | FP  | FN  | TN    | Prec. | Recall | F-Score | Acc. 
READ_CONTACTS   | 204 | 186 | 18  | 49  | 2,930 | 91.2  | 79.2   | 84.8    | 97.9 
READ_CALENDAR   | 288 | 241 | 47  | 42  | 2,422 | 83.7  | 85.2   | 84.5    | 96.8 
RECORD_AUDIO    | 259 | 195 | 64  | 50  | 3,470 | 75.3  | 79.6   | 77.4    | 97.0 
TOTAL           | 751 | 622 | 129 | 141 | 9,061 | 82.8  | 81.5   | 82.2    | 97.3 
(SI: sentences identified by WHYPER)
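The table’s metrics follow the standard definitions; a quick sanity check against the TOTAL row:

def prf(tp, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score

p, r, f = prf(622, 129, 141)
print(f"{100*p:.1f} {100*r:.1f} {100*f:.1f}")  # 82.8 81.5 82.2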
RQ2 Results: Comparison to Keyword-Based Search 
Permission         | Delta Prec. | Delta Recall | Delta F-score | Delta Acc. 
READ_CONTACTS      | 50.4        | 1.3          | 31.2          | 7.3 
READ_CALENDAR      | 39.3        | 1.5          | 26.4          | 9.2 
RECORD_AUDIO       | 36.9        | -6.6         | 24.3          | 6.8 
WHYPER Improvement | 41.6        | -1.2         | 27.2          | 7.7 
Keywords used by the baseline: 
Permission     | Keywords 
READ_CONTACTS  | contact, data, number, name, email 
READ_CALENDAR  | calendar, event, date, month, day, year 
RECORD_AUDIO   | record, audio, voice, capture, microphone
Results Analysis: False Positives 
•Incorrect Parsing 
–“MyLinkAdvanced provides full synchronization of all Microsoft Outlook emails (inbox, sent, outbox and drafts), contacts, calendar, tasks and notes with all Android phones via USB” 
•Synonym Analysis 
–“You can now turn recordings into ringtones.”
Results Analysis: False Negatives 
•Incorrect parsing 
–Incorrect identification of sentence boundaries and limitations of underlying NLP infrastructure 
•Limitations of Semantic Graphs 
–Manual augmentation 
•e.g., microphone (“blow into”) and call (“record”) 
•Significantly improves delta recall: from -6.6% to 0.6% 
–Future: automatic mining from user comments and forums
Broader Applicability 
•Generalization to other permissions 
–User-understandable permissions: calls, SMS 
–Problem areas 
•Location and phone identifiers (widely abused) 
•Internet (required by nearly every app)
Dataset and Paper 
•Our code and datasets are available at https://sites.google.com/site/whypermission/ 
•Rahul Pandita, Xusheng Xiao, Wei Yang, William Enck, and Tao Xie. WHYPER: Towards Automating Risk Assessment of Mobile Applications. In Proc. 22nd USENIX Security Symposium (USENIX Security 2013) http://www.enck.org/pubs/pandita-sec13.pdf
Outline 
•Introduction 
•Background on text analytics 
•Case Study 1: App Markets 
•Case Study 2: ACP Rules 
•Wrap-up
Access Control Policies (ACP) 
•Access control is often governed by security policies called Access Control Policies (ACPs) 
–These include rules to control which principals have access to which resources 
•A policy rule includes four elements 
–Subject: HCP 
–Action: edit 
–Resource: patient's account 
–Effect: deny 
ex. “The Health Care Personnel (HCP) does not have the ability to edit the patient's account.”
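The four elements map naturally onto a record type; a minimal sketch (the names are illustrative, not Text2Policy’s):

from dataclasses import dataclass

@dataclass
class ACPRule:
    subject: str   # e.g., "HCP"
    action: str    # e.g., "edit"
    resource: str  # e.g., "patient's account"
    effect: str    # "permit" or "deny"

rule = ACPRule("HCP", "edit", "patient's account", "deny")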
Access Control Vulnerabilities 
2010 CWE/SANS Top 25 Report 
1.Cross-site scripting 
2.SQL injection 
3.Classic buffer overflow 
4.Cross-site request forgery 
5.Improper access control (Authorization) 
6.... 
Improper access control causes problems (e.g., information exposures) 
•Incorrect specification 
•Incorrect enforcement
Problems of ACP Practice 
•In practice, ACPs 
–Buried in requirement documents 
–Written in NL and not checkable 
•NL documents can be large 
–Manual extraction is labor-intensive and tedious
Overview of Text2Policy 
ex. “An HCP should not change patient’s account.” 
Linguistic Analysis → An [subject: HCP] should not [action: change] [resource: patient’s account]. 
Model-Instance Construction → ACP rule: Subject = HCP; Action = UPDATE (change); Resource = patient’s account; Effect = deny 
Transformation → checkable policy
Linguistic Analysis 
•Incorporates syntactic and semantic analysis 
–Syntactic structure -> noun group, verb group, etc. 
–Semantic meaning -> subject, action, resource, negative meaning, etc. 
•Provides new techniques for model extraction 
–Identify ACP sentences 
–Infer semantic meaning
Common Techniques 
•Shallow parsing 
•Domain dictionary 
•Anaphora resolution 
ex. “An HCP can view patient’s account. He is disallowed to change the patient’s account.” 
[Figure: shallow parsing marks the subject (NP “HCP”), the main verb group (VG, mapped to UPDATE via the domain dictionary), and the object (NP/PNP “the patient’s account”); anaphora resolution links “He” back to “HCP”.] 
NP: noun phrase; VG: verb chunk; PNP: prepositional noun phrase 
http://www.clips.ua.ac.be/pages/mbsp-tags
Technical Challenges (TC) in ACP Extraction 
•TC1: Semantic Structure Variance 
–Different ways to specify the same rule 
•TC2: Negative Meaning Implicitness 
–A verb can carry negative meaning 
ACP1: An HCP cannot change patient’s account. 
ACP2: An HCP is disallowed to change patient’s account.
Semantic-Pattern Matching 
•Addresses TC1: Semantic Structure Variance 
•Compose patterns based on grammatical functions 
ex. “An HCP is disallowed to change the patient’s account.” — a passive-voice verb group followed by a to-infinitive phrase
Negative-Expression Identification 
•Addresses TC2: Negative Meaning Implicitness 
•Negative expression 
–“no” in subject: ex. “No HCP can edit patient’s account.” 
–“not” in verb group: ex. “HCP can not edit patient’s account.” / “HCP can never edit patient’s account.” 
•Negative-meaning words in main verb group 
–ex. “An HCP is disallowed to change the patient’s account.”
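A simplified sketch of these checks (the regexes and verb list are illustrative, not Text2Policy’s implementation):

import re

NEGATIVE_VERBS = {"disallow", "disallowed", "deny", "denied", "prohibit", "prohibited"}

def infer_effect(sentence):
    s = sentence.lower()
    if re.search(r"\bno\s+\w+", s):                       # "no" in subject
        return "deny"
    if re.search(r"\b(not|never|cannot)\b", s):           # "not" in verb group
        return "deny"
    if NEGATIVE_VERBS & set(s.replace(".", "").split()):  # negative main verb
        return "deny"
    return "permit"

for ex in ["No HCP can edit patient's account.",
           "HCP can never edit patient's account.",
           "An HCP is disallowed to change the patient's account."]:
    print(infer_effect(ex))  # deny, deny, deny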
Overview of Text2Policy (recap) 
ex. “An HCP should not change patient’s account.” 
Linguistic Analysis → An [subject: HCP] should not [action: change] [resource: patient’s account]. 
Model-Instance Construction → ACP rule: Subject = HCP; Action = UPDATE (change); Resource = patient’s account; Effect = deny 
Transformation → checkable policy
ACP Model-Instance Construction 
ex. “An HCP is disallowed to change the patient’s account.” 
•Identify subject, action, and resource: 
–Subject: HCP 
–Action: change 
–Resource: patient’s account 
•Infer effect: 
–Negative expression: none 
–Negative verb: disallow 
–Inferred effect: deny 
Resulting ACP rule: Subject = HCP; Action = UPDATE (change); Resource = patient’s account; Effect = deny 
•The Access Control Rule Extraction (ACRE) approach [ACSAC’14] discovers more patterns 
–Able to handle existing, unconstrained NL texts
Evaluation – RQs 
•RQ1: How effectively does Text2Policy identify ACP sentences in NL documents? 
•RQ2: How effectively does Text2Policy extract ACP rules from ACP sentences?
Evaluation – Subjects 
•iTrust open source project 
–http://agile.csc.ncsu.edu/iTrust/wiki/ 
–448 use-case sentences (37 use cases) 
–Preprocessed use cases 
•Collected ACP sentences 
–100 ACP sentences 
–From 17 sources (published papers and websites) 
•A module of an IBM application, IBMApp (financial domain) 
–25 use cases
RQ1: ACP Sentence Identification 
•Apply Text2Policy to identify ACP sentences in iTrust and IBMApp use cases 
•Text2Policy effectively identifies ACP sentences, with precision and recall above 88% 
•Precision on IBMApp use cases is higher 
–Proprietary use cases are often of higher quality than open-source use cases
Evaluation – RQ2 Accuracy of Policy Extraction 
•Apply Text2Policy to extract ACP rules from ACP sentences 
•Text2Policy effectively extracts ACP model instances with accuracy above 86%
Dataset and Paper 
•Our datasets are available at https://sites.google.com/site/asergrp/projects/text2policy 
•Xusheng Xiao, Amit Paradkar, Suresh Thummalapenta, and Tao Xie. Automated Extraction of Security Policies from Natural-Language Software Documents. In Proc. 20th ACM SIGSOFT Symposium on the Foundations of Software Engineering (FSE 2012) http://web.engr.illinois.edu/~taoxie/publications/fse12-nlp.pdf 
•John Slankas, Xusheng Xiao, Laurie Williams, and Tao Xie. Relation Extraction for Inferring Access Control Rules from Natural Language Artifacts. In Proc. 30th Annual Computer Security Applications Conference (ACSAC 2014) http://web.engr.illinois.edu/~taoxie/publications/acsac14-nlp.pdf
Outline 
•Introduction 
•Background on text analytics 
•Case Study 1: App Markets 
•Case Study 2: ACP Rules 
•Wrap-up
Take-away 
•Computing systems contain textual data that partially represents expectation context. 
•Text analytics and natural language processing offer an opportunity to automatically extract that semantic context 
–Need to be careful in the security domain (e.g., social engineering) 
–But potential for improved security decisions
Future Directions 
•Only beginning to study text analytics for security 
–Many sources of natural language text 
–Many unexplored domains 
–Use text analytics in software engineering as inspiration 
•https://sites.google.com/site/text4se/ 
•Hard problem: to what extent can we formalize “expectation context”? 
•Creation of open datasets (annotation is time intensive) 
•Apply to real-world problems
Thank you! 
William Enck 
North Carolina State University 
http://www.enck.org 
enck@cs.ncsu.edu 
Tao Xie 
University of Illinois at Urbana-Champaign 
http://web.engr.illinois.edu/~taoxie/ 
taoxie@illinois.edu 
Questions? 
Acknowledgment: We thank the authors of the original slides from which some slides in this tutorial were adapted. The work is supported in part by a Google Research Faculty Award, NSA Science of Security Lablet grants, and NSF grants CCF-1349666, CCF-1409423, CNS-1434582, CCF-1434596, CCF-1434590, CNS-1439481, CNS-1253346, and CNS-1222680.
