Intelligent Software Engineering:
Working at the Intersection of AI and
Software Engineering
DSML 2021 Keynote
Tao Xie
Department of Computer Science and Technology
Peking University
Artificial Intelligence ↔ Software Engineering
• Artificial Intelligence → Software Engineering: Intelligent Software Engineering (applying AI to software engineering)
• Software Engineering → Artificial Intelligence: Intelligence Software Engineering (software engineering for intelligence software)
Speakers
SIGSOFT Webinar: Intelligent Software
Engineering: Synergy between AI and Software
Engineering (Feb 21, 2019)
What can AI do for software engineering, and how can we as software
engineers design and build better AI systems? As AI continues to disrupt
many fields from agriculture to manufacturing, it’s important to explore
the essential connections between AI and software engineering.
1. The creation of AI software (How do we architect, build, maintain,
deploy, test, and verify AI software?)
2. The application of AI to software engineering (How can AI help
software engineers better do their jobs and advance the state of the
practice?)
3. AI and software engineering in use (How have applications blended
AI and software engineering so far?)
4. The AI landscape and its effect on software engineering (How do
related topics such as AI technology investment, ethics, data collection,
and security affect the work of software developers?)
Artificial Intelligence ↔ Software Engineering
• Artificial Intelligence → Software Engineering: Intelligent Software Engineering (applying AI to software engineering)
• Software Engineering → Artificial Intelligence: Intelligence Software Engineering (software engineering for intelligence software)
Figure created by Christian Kaestner, taken from https://ckaestne.github.io/seai/S2020/
Challenges in Autonomics: “In search of a foundation for
next-generation autonomous systems”
• How to specify autonomous system behavior in the face of unpredictability?
• How to carry out faithful analysis of system behavior with respect to rich environments that include humans, physical artifacts, and other systems?
• How to build such systems by combining executable modeling techniques from software engineering with AI and ML?
Harel, Marron, Sifakis. Autonomics: In search of a foundation for next-generation autonomous systems. Proc. Natl. Acad. Sci. USA, 2020.
Sample Requirements in Autonomous Driving System (ADS)
• Stability: ADS must assure stable control and avoid dangerous actions for the vehicle
• R1.1: ADS must avoid impossible steering angles
• Safety: ADS must avoid collision with moving or static objects along the path
• R2.1: ADS must keep a safe distance from other objects
• Compliance: the ADS must respect the traffic regulations enforced by law in a
geographical area
• R3.1: the velocity of the vehicle should be less than the speed limit
• R3.2: the vehicle should not run the red light
• R3.3: the vehicle should stay in the correct lane
• Comfort: the planned trajectory should be comfortable for the passenger
• R4.1: the vehicle’s velocity should not change too much
• R4.2: the vehicle’s acceleration should not change too much
Czarnecki, “Automated Driving System (ADS) High-Level Quality Requirements Analysis – Driving Behavior Safety,” Univ. of Waterloo, 2018.
Tuncali, Fainekos, Prokhorov, Ito, Kapinski, “Requirements-driven test generation for autonomous vehicles with machine learning components,” IEEE Transactions on Intelligent Vehicles, 2019.
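Such requirements can be encoded as executable runtime monitors. Below is a minimal, hypothetical sketch for R2.1 and R3.1; the function names, units, and the safe-distance threshold are illustrative, not taken from the cited papers.

    # Hypothetical runtime monitors for two sample ADS requirements (units: m, m/s).
    def check_r2_1(distance_to_nearest_object_m, safe_distance_m=5.0):
        """R2.1: the ADS must keep a safe distance from other objects."""
        return distance_to_nearest_object_m >= safe_distance_m

    def check_r3_1(velocity_mps, speed_limit_mps):
        """R3.1: the velocity of the vehicle should be less than the speed limit."""
        return velocity_mps < speed_limit_mps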
Microsoft's Teen Chatbot Tay
Turned into a Genocidal Racist (March 23–24, 2016)
http://www.businessinsider.com/ai-expert-explains-why-microsofts-tay-chatbot-is-so-racist-2016-3
"There are a number of precautionary
steps they [Microsoft] could have taken.
It wouldn't have been too hard to create
a blacklist of terms; or narrow the scope
of replies. They could also have simply
manually moderated Tay for the first few
days, even if that had meant slower
responses."
“businesses and other AI developers will
need to give more thought to the
protocols they design for testing and
training AIs like Tay.”
Figure created by Christian Kaestner, taken from https://ckaestne.github.io/seai/S2020/
Adversarial Machine Learning/Testing
● Generate adversarial examples derived from legitimate examples with slight modifications (imperceptible to humans) to induce misclassification
[Figure: a slight modification to a driving-scene image flips the model’s steering prediction from “turn right” to “go straight”]
Tian et al. DeepTest: Automated Testing of Deep-Neural-Network-driven Autonomous Cars. ICSE 2018.
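For intuition, here is a minimal sketch of one classic way to craft such slight modifications, the fast gradient sign method (FGSM) of Goodfellow et al.; it assumes a trained Keras classifier `model` that outputs class probabilities, and it is not the coverage-guided transformation approach of DeepTest.

    import tensorflow as tf

    def fgsm(model, x, y, eps=0.01):
        """Return x perturbed by eps * sign(d loss / d x) to induce misclassification."""
        x = tf.convert_to_tensor(x)
        with tf.GradientTape() as tape:
            tape.watch(x)
            # Assumes `model` outputs class probabilities; y holds integer labels.
            loss = tf.keras.losses.sparse_categorical_crossentropy(y, model(x))
        grad = tape.gradient(loss, x)
        # A small eps keeps the change imperceptible to humans yet can flip the prediction.
        return tf.clip_by_value(x + eps * tf.sign(grad), 0.0, 1.0)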
Example Detected Erroneous Behaviors
[Figures: driving scenes where the predicted steering is erroneous, e.g., “turn right” predicted as “go straight” and “go straight” predicted as “turn left”]
Pei et al. DeepXplore: Automated Whitebox Testing of Deep Learning Systems. SOSP 2017.
Tian et al. DeepTest: Automated Testing of Deep-Neural-Network-driven Autonomous Cars. ICSE 2018.
Lu et al. NO Need to Worry about Adversarial Examples in Object Detection in Autonomous Vehicles. CVPR’17. https://arxiv.org/abs/1707.03501
Zhou et al. DeepBillboard: Systematic Physical-World Testing of Autonomous Driving Systems. ICSE 2020.
Robustness Certification for Deep Learning Models
Transformation-Specific Smoothing (TSS)-based robustness certification:
• A general robustness certification framework against various semantic transformations
• A range of transformation-specific smoothing protocols and techniques that provide substantially better certified robustness bounds than state-of-the-art approaches on large-scale datasets
Li, Weber, Xu, Rimanic, Kailkhura, Xie, Zhang, Li. TSS: Transformation-Specific Smoothing for Robustness Certification. CCS 2021.
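As rough intuition for smoothing-based certification, below is a sketch of the randomized-smoothing idea that TSS generalizes (majority vote over randomly perturbed inputs); the `model` interface and the Gaussian perturbation are assumptions here, and TSS itself uses transformation-specific smoothing distributions.

    import numpy as np

    def smoothed_predict(model, x, n=1000, sigma=0.25):
        """Majority-vote prediction of `model` over n randomly perturbed copies of x."""
        noisy = x[None, ...] + sigma * np.random.randn(n, *x.shape)
        votes = model.predict(noisy).argmax(axis=1)  # class vote per noisy copy
        return np.bincount(votes).argmax()           # most frequent class wins
    # Certification then bounds how far a semantic transformation can move this vote.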
Figure created by Christian Kaestner, taken from https://ckaestne.github.io/seai/S2020/
DL Software Development and Deployment
[Diagram: DL software development, in which DL programs, built on DL frameworks and fed with training data, produce trained DL models]
[ISSTA’18, ISSRE’19, ESEC/FSE’19, ICSE’20]
DL Software Development and Deployment
[Diagram: after development, trained DL models are converted by deployment-related frameworks (TF Serving for server/cloud platforms; Core ML and TF Lite for mobile platforms; TF.js for browser platforms) into converted DL models for each platform. Developers face a knowledge gap here!]
[ISSTA’18, ISSRE’19, ESEC/FSE’19, ICSE’20]
Research Questions
RQ1: Popularity of DL software deployment
RQ2: Difficulty level of DL software deployment
RQ3: Challenges in DL software deployment
Chen, Cao, Liu, Wang, Xie, Liu. A Comprehensive Study on Challenges in Deploying Deep Learning Based Software. ESEC/FSE 2020.
Collect Relevant Questions from Stack Overflow (SO)
Identification of relevant questions via extraction & filtering, based on deployment-related frameworks:
• Server/cloud deployment: TF Serving, Google Cloud AI, Amazon SageMaker
• Mobile deployment: Core ML, TF Lite
• Browser deployment: TF.js

Platform        Count
Cloud/server    1,325
Mobile          1,533
Browser           165
In total        3,023
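A hypothetical sketch of the extraction step, pulling Stack Overflow questions by tag through the public Stack Exchange API; the tag below is illustrative, and the paper's exact extraction and filtering pipeline differs.

    import requests

    def fetch_question_titles(tag, pages=2):
        """Fetch Stack Overflow question titles tagged with one deployment framework."""
        titles = []
        for page in range(1, pages + 1):
            resp = requests.get(
                "https://api.stackexchange.com/2.3/questions",
                params={"site": "stackoverflow", "tagged": tag,
                        "page": page, "pagesize": 100},
            )
            titles += [q["title"] for q in resp.json().get("items", [])]
        return titles

    # e.g., fetch_question_titles("tensorflow-serving")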
RQ1: Popularity
[Charts: trends of users and of questions over time]
DL software deployment is gaining increasing attention, demonstrating the timeliness and urgency of this study.
RQ2: Difficulty Level
Metrics:
• % of questions with no accepted answer (%no acc.)
• time needed to receive an accepted answer (acc. time)
Higher %no acc. and longer acc. time indicate more difficult questions.

Topic                              %no acc.   acc. time (median)
DL software deployment             70.7%      405 min
Other aspects of DL software       62.7%      146 min
Big data [ESEC/FSE’19]             60.5%      198 min
Concurrency [ESEM’18]              43.8%       42 min
Mobile [EMSE’16]                   55.0%       55 min

Questions related to DL software deployment are difficult to resolve, motivating us to identify the challenges behind them.
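A minimal sketch of computing the two difficulty metrics over question records; the field names are illustrative, not from the study's scripts.

    from statistics import median

    def difficulty_metrics(questions):
        """questions: dicts with 'created_at' and optional 'accepted_at' (minutes)."""
        accepted = [q for q in questions if q.get("accepted_at") is not None]
        pct_no_acc = 100.0 * (len(questions) - len(accepted)) / len(questions)
        acc_times = [q["accepted_at"] - q["created_at"] for q in accepted]
        return pct_no_acc, (median(acc_times) if acc_times else None)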
RQ3: Challenges
A wide spectrum of challenges for each of the three platforms, organized as taxonomies with 72 categories of challenges in total.

Server/Cloud (100%)
• Data Processing (19.8%): procedure (1.3%); setting size/shape of input data (1.8%); setting format/datatype of input data (8.8%); parsing output (4.8%); migrating pre-processing (3.1%)
• Environment (19.4%): installing/building frameworks (7.5%); avoiding version incompatibility (4.0%); configuration of environment variables (7.9%)
• General Questions (16.7%): entire procedure of deployment (9.7%); conceptual questions (4.4%); limitations of platforms/frameworks (2.6%)
• Model Export (15.0%): procedure (4.4%); specification of model information (6.2%); export of unsupported models (3.1%); selection/usage of APIs (0.9%); model quantization (0.4%)
• Request (13.3%): authenticating client (1.8%); procedure (0.9%); setting request parameters/body (8.4%); batching request (2.2%)
• Serving (13.2%): model loading (3.1%); configuration of batching (2.6%); serving multiple models simultaneously (3.5%); bidirectional streaming (0.4%); procedure (0.4%); parsing request (1.3%); getting information of exposed model (1.8%)
• Model Update (2.6%)

Mobile (100%)
• Model Conversion (26.5%): procedure (3.9%); saving models (1.3%); conversion of unsupported models (6.1%); model quantization (4.8%); specification of model information (8.2%); selection/usage of APIs (0.9%); parsing converted models (1.3%)
• DL Integration into Projects (21.2%): importing/loading models (4.3%); build configuration (3.9%); avoiding version incompatibility (1.7%); configuration of input/output information (8.2%); thread management (2.2%); procedure (0.9%)
• General Questions (18.6%): entire procedure of deployment (13.4%); conceptual questions (4.8%); limitations of frameworks (0.4%)
• Data Processing (16.9%): procedure (1.7%); setting size/shape of input data (3.0%); setting format/datatype of input data (5.2%); parsing output (2.2%); migrating pre-processing (4.8%)
• DL Library Compilation (7.8%): usage of prebuilt libraries (0.4%); register of unsupported operators (3.0%); build configuration (2.6%); procedure (1.7%)
• Inference Speed (3.9%)
• Model Update (3.0%)
• Data Extraction (1.7%)
• Model Security (0.4%)

Browser (100%)
• Model Loading (24.0%): loading from local storage (8.0%); loading from an HTTP endpoint (2.4%); asynchronous loading (5.6%); selection/usage of APIs (2.4%); improving loading speed (0.8%); procedure (4.8%)
• Environment (19.2%): importing libraries (10.4%); avoiding version incompatibility (8.8%)
• Model Conversion (18.4%): procedure (3.2%); specification of model information (5.6%); conversion of unsupported models (4.0%); selection/usage of APIs (2.4%); saving models (3.2%)
• Data Processing (18.4%): procedure (1.6%); setting size/shape of input data (5.6%); setting format/datatype of input data (4.8%); migrating pre-processing (2.4%); data loading (4.0%)
• Inference Speed (7.2%)
• General Questions (5.6%): entire procedure of deployment (3.2%); limitations of frameworks (2.4%)
• Data Extraction (3.2%)
• Model Security (2.4%)
• Model Update (1.6%)
Common Challenges across Three Platforms
Model Conversion (Cloud/Server 15.0%, Mobile 26.5%, Browser 18.4%):
• Unsupported models
• Specification of model information
• Selection/usage of APIs
• Model quantization
Data Processing (Cloud/Server 19.8%, Mobile 16.9%, Browser 18.4%):
• Setting size/shape/format/datatype of input data
• Migrating pre-processing
• Parsing output
Unique Challenges in Client Platforms (Mobile and Browser)
• Model Security (Server/Cloud 0.0%, Mobile 0.4%, Browser 2.4%): models on client platforms are easier to obtain than those on server/cloud platforms.
• Inference Speed (Server/Cloud 0.0%, Mobile 3.9%, Browser 7.2%): client platforms have weaker computing power than server/cloud platforms.
Summary of Challenges in DL Software Deployment
• RQ1 (Popularity): DL software deployment is gaining increasing attention.
• RQ2 (Difficulty Level): questions about DL software deployment are difficult to resolve.
• RQ3 (Challenges): we built taxonomies of 72 categories, capturing the challenges in deploying DL software to server/cloud, mobile, and browser platforms.
Chen, Cao, Liu, Wang, Xie, Liu. A Comprehensive Study on Challenges in Deploying Deep Learning Based Software. ESEC/FSE 2020.
Figure created by Christian Kaestner, taken from https://ckaestne.github.io/seai/S2020/
Neural Machine Translation
[Screen snapshot captured on April 5, 2018]
• Overall better than statistical machine translation
• Worse controllability
• Existing translation quality assurance needs reference translations, so it is not applicable online
Translation Quality Assurance
● Key idea: black-box algorithms specialized for common problems
○ No need for reference translation; need only the original sentence and generated
translation
○ Precise problem localization
● Common problems
○ Under-translation
○ Over-translation
Collaborative Work with Tencent
Wang, Zheng, Liu, Zhang, Zeng, Deng, Yang, He, Xie. Detecting Failures of Neural Machine Translation in the Absence of Reference Translations. DSN 2019 Industry Track
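To make the black-box idea concrete, here is a toy sketch of two reference-free checks; the heuristics and the 0.4 length threshold are illustrative and are not the DSN 2019 paper's algorithms.

    def repeated_ngrams(tokens, n=3):
        """Count duplicated n-grams; repetition in the output often signals over-translation."""
        grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        return len(grams) - len(set(grams))

    def check_translation(source_tokens, target_tokens):
        """Flag likely over-/under-translation from the source and its translation alone."""
        issues = []
        if repeated_ngrams(target_tokens) > 0:
            issues.append("possible over-translation (repeated phrases)")
        if len(target_tokens) < 0.4 * len(source_tokens):  # illustrative threshold
            issues.append("possible under-translation (output too short)")
        return issues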
Deployment of Translation Quality Assurance
● Adopted to improve the WeChat translation service (over 1 billion users; serving 12 million translation tasks online)
○ Offline monitoring (regression testing)
○ Online monitoring (real-time selection of the best model)
● Large-scale test data for translation
○ ~130K English / 180K Chinese words/phrases
○ Detected numerous problems in other vendors’ services as well
[Charts: BLEU score improvement; % problems reduction; problem cases in other translation services]
Collaborative Work with Tencent
Wang, Zheng, Liu, Zhang, Zeng, Deng, Yang, He, Xie. Detecting Failures of Neural Machine Translation in the Absence of Reference Translations. DSN 2019 Industry Track.
Figure created by Christian Kaestner, taken from https://ckaestne.github.io/seai/S2020/
Neural Network Architecture → (Training) → Neural Network Model
Existing work targets the NN model: testing, verification, bug detection, ... What about the NN architecture?

Why Neural Network Architecture?
1. Bugs at the model level are difficult to fix: training takes hours, days, weeks, months, ...
2. Bugs in architectures may cause failures in training: data + code (architecture) → training (“magic”) → NN model.
3. Quality assurance needs to be provided for architectures: an architecture vendor serves many developers, and a single NN architecture yields many NN models embedded in many software systems.
Numerical Bugs
Bugs leading to errors in numerical operations, such as “NaN”, “INF”, or crashes during training or inference.
An Example of Numerical Bugs

    y_softmax = tf.nn.softmax(h_fc)          # y_softmax ∈ [0, 1]
    cross_entropy = y_ * tf.log(y_softmax)   # log(0) yields -INF

The computation graph is h_fc → Softmax → Log → Mul (with y_). Given the inferred range h_fc ∈ [-100, 100], y_softmax can underflow to 0, so tf.log may produce -INF. We can use static analysis to infer the range of tensors.
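A common repair for this pattern (a sketch; not necessarily the fix a detection tool would propose) avoids taking the log of a possibly-zero softmax:

    # Fused, numerically stable API (TensorFlow 1.x style, matching the slide's code):
    cross_entropy = tf.nn.softmax_cross_entropy_with_logits(labels=y_, logits=h_fc)
    # Or clip the softmax output away from zero before taking the log:
    cross_entropy = y_ * tf.log(tf.clip_by_value(y_softmax, 1e-10, 1.0))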
Detecting Numerical Bugs in Neural Network Architectures
Pipeline: NN architecture → computation graph → static analysis → check unsafe operations (Log, Exp, ...)
Zhang, Ren, Chen, Xiong, Cheung, Xie. Detecting Numerical Bugs in Neural Network Architectures. ESEC/FSE’20. ACM SIGSOFT Distinguished Paper Award.
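Below is a minimal sketch of the interval-style static analysis behind this pipeline; the abstract domain and API are illustrative, not the paper's tool.

    class Interval:
        """A value range [lo, hi] for a tensor's elements."""
        def __init__(self, lo, hi):
            self.lo, self.hi = lo, hi

    def softmax_range(_: Interval) -> Interval:
        # Softmax outputs always lie in [0, 1]; 0 is reachable via underflow.
        return Interval(0.0, 1.0)

    def check_log(x: Interval) -> None:
        if x.lo <= 0.0:
            print("unsafe: log may receive a non-positive value (-INF/NaN)")

    h_fc = Interval(-100.0, 100.0)    # inferred range of the logits tensor
    check_log(softmax_range(h_fc))    # flags the bug from the example above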
Bugs Detected in Real-World Architectures
Found 11 buggy statements in real-world code repositories. Submitted pull requests, and 3 buggy statements have been repaired by the developers.
Zhang, Ren, Chen, Xiong, Cheung, Xie. Detecting Numerical Bugs in Neural Network Architectures. ESEC/FSE’20. ACM SIGSOFT Distinguished Paper Award.
Open Topics in Intelligence Software Engineering (ISE)
• How to solicit and specify requirements for intelligence software?
• How to tackle the complexity of integrating intelligence software with the rest of the
software system?
• How to define test oracles (or properties) for intelligence software?
• How to design high-quality library/framework APIs for developing intelligence
software?
• How to transfer ISE research results into industrial/open source practice?
• …
(SE ↔ AI) → Practice Impact
Connecting the problem domain and the solution domain, grounded in practice:
• Intelligent Software Engineering
• Intelligence Software Engineering
Thank You!
Q & A
Tao Xie
Peking University
taoxie@pku.edu.cn
https://taoxiease.github.io/