Real-Time Cloud Native Open Source Streaming of Any Data to Apache Solr
Timothy Spann, Developer Advocate
Speaker Bio
DZone Zone Leader and Big Data MVB
@PaasDev
https://github.com/tspannhw
https://www.datainmotion.dev/
https://github.com/tspannhw/SpeakerProfile
https://dev.to/tspannhw
https://sessionize.com/tspann/
https://www.slideshare.net/bunkertor
Developer Advocate
Agenda
Using Apache Pulsar and Apache NiFi, we can parse any document in real time at scale. We receive
a lot of documents via cloud storage, email, social channels and internal document stores. We want
to make all the content and metadata available in Apache Solr for categorization, full-text search,
optimization and combination with other datastores. We will stream not only documents, but also all
REST feeds, logs and IoT data. Once data is produced to Pulsar topics, it can be ingested into Solr
right away through the Pulsar Solr Sink.
Using a number of open source tools, we have created a real-time, scalable data flow that parses
any document. We use Apache Tika for document processing with real-time language detection, Apache
OpenNLP for natural language processing, and Stanford CoreNLP, spaCy and TextBlob for sentiment
analysis. We will walk everyone through creating an open source flow of documents using Apache
NiFi as our integration engine. We can convert PDF, Excel and Word to HTML and/or text. We can also
extract the text to apply sentiment analysis and NLP categorization, generating additional metadata
about our documents. We will also extract and parse images; if they contain text, we can extract it
with TensorFlow and Tesseract.
FLiP Stack
● Apache Flink
● Apache Pulsar
● StreamNative's Flink Connector for Pulsar
● Apache +++
Apache projects are the way for all streaming
use cases.
End to End Streaming Demo Pipeline
[Diagram labels: Enterprise sources, Weather, Stocks, Clickstream, Market data, Machine logs, Social, Errors, Aggregates, Alerts]
https://hub.streamnative.io/connectors/solr-sink/2.5.1
All Data - Anytime - Anywhere - Multi-Cloud - Multi-Protocol
[Diagram labels: Multi-ingest, Merge, Priority]
Powered by Apache Pulsar, StreamNative provides a cloud-native, real-time
messaging and streaming platform to support multi-cloud and hybrid cloud
strategies.
Built for Containers
Cloud Native
StreamNative Cloud
Flink SQL
StreamNative Solution
[Diagram labels: Application Messaging, Data Pipelines, Real-time Contextual Analytics, Tiered Storage, APP Layer, Computing Layer, Storage Layer, StreamNative Platform, IaaS Layer, Micro Service, Notification, Dashboard, Risk Control, Auditing, Payment, ETL]
Apache Pulsar
Apache Pulsar is a Cloud-Native Messaging and
Event-Streaming Platform
Apache Pulsar Overview
Enable Geo-Replicated Messaging
● Pub-Sub
● Geo-Replication
● Pulsar Functions
● Horizontal Scalability
● Multi-tenancy
● Tiered Persistent Storage
● Pulsar Connectors
● REST API
● CLI
● Many clients available
● Four Different Subscription Types
● Multi-Protocol Support
○ MQTT
○ AMQP
○ JMS
○ Kafka
○ ...
What are the Benefits of Pulsar?
● Data Durability
● Scalability
● Geo-Replication
● Multi-Tenancy
● Unified Messaging Model
A Unified Messaging Platform
Message Queuing
Data Streaming
Flink + Pulsar (FLiP)
https://flink.apache.org/2019/05/03/pulsar-flink.html
https://github.com/streamnative/pulsar-flink
https://streamnative.io/en/blog/release/2021-04-20-flink-sql-on-streamnative-cloud
Apache Solr
Apache Solr As a Destination
Apache NiFi
Why Apache NiFi?
• Guaranteed delivery
• Data buffering
- Backpressure
- Pressure release
• Prioritized queuing
• Flow specific QoS
- Latency vs. throughput
- Loss tolerance
• Data provenance
• Supports push and pull models
• Hundreds of processors
• Visual command and control
• Over sixty sources
• Flow templates
• Pluggable/multi-role security
• Designed for extension
• Clustering
• Version Control
Architecture
https://nifi.apache.org/docs/nifi-docs/html/overview.html
Record Processors
https://www.datainmotion.dev/2019/03/advanced-xml-processing-with-apache.html
● XML, CSV, JSON, AVRO and more
● Schemas or Inferred Schemas
● Easily convert between them
● Support SQL with Apache Calcite
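To make "inferred schemas" and "easily convert between them" concrete, here is a minimal sketch in plain Python. It is not how NiFi's record readers and writers are implemented; it only illustrates the idea of reading CSV records, guessing a type per field, and writing the same records as JSON. The sample data and the function names are hypothetical.

```python
import csv
import io
import json


def infer_type(values):
    """Pick the narrowest type (int, double, string) that fits every value."""
    for caster, name in ((int, "int"), (float, "double")):
        try:
            for value in values:
                caster(value)
            return name
        except (ValueError, TypeError):
            continue
    return "string"


def csv_to_json_records(csv_text):
    """Read CSV text, infer a simple schema, and return (schema, JSON string)."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    if not rows:
        return {}, "[]"
    # One inferred type per column, based on all values seen in that column.
    schema = {field: infer_type([row[field] for row in rows]) for field in rows[0]}
    casters = {"int": int, "double": float, "string": str}
    records = [
        {field: casters[schema[field]](row[field]) for field in row} for row in rows
    ]
    return schema, json.dumps(records)


# Hypothetical sample data, loosely modeled on energy sensor readings.
sample = "device,voltage,current\nplug1,120,0.25\nplug2,121,0.5\n"
schema, payload = csv_to_json_records(sample)
print(schema)
print(payload)
```

In NiFi itself this kind of conversion is configured rather than coded, by pairing a record reader with a record writer on a record-aware processor.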
SOLR Connectors
● XML, CSV, JSON, AVRO and more
● Schemas or Inferred Schemas
● Use Records or Raw Text
● Support SQL with Apache Calcite
Apache OpenNLP for Entity Resolution Processor
https://github.com/tspannhw/nifi-nlp-processor
Requires installation of the NAR and the Apache OpenNLP models
(http://opennlp.sourceforge.net/models-1.5/).
This is an unsupported processor that I wrote and contributed to the community.
You can write one too!
Apache OpenNLP with Apache NiFi
https://community.hortonworks.com/articles/80418/open-nlp-example-apache-nifi-processor.html
https://opennlp.apache.org/news/release-190.html
Apache Tika with Apache NiFi
https://community.hortonworks.com/articles/163776/parsing-any-document-with-apache-nifi-15-with-apac.html
https://community.hortonworks.com/articles/81694/extracttext-nifi-custom-processor-powered-by-apach.html
https://community.hortonworks.com/articles/76924/data-processing-pipeline-parsing-pdfs-and-identify.html
https://github.com/tspannhw/nifi-extracttext-processor
https://community.hortonworks.com/content/kbentry/177370/extracting-html-from-pdf-excel-and-word-documents.html
Final Thoughts
streamnative.io
Build Your Own Pulsar - SOLR Integration
https://github.com/tspannhw/FLiP-Energy
bin/pulsar-admin sinks create --tenant public \
  --namespace default \
  --name solr-sink-energy \
  --sink-type solr \
  --sink-config-file conf/solr-sink-energy.yml \
  --inputs energy
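The command above points at conf/solr-sink-energy.yml, which the slide does not show. The sketch below is one guess at what such a file could contain, together with the same command assembled as an argument list. The key names (solrUrl, solrMode, solrCollection, solrCommitWithinMs) follow the Pulsar Solr sink connector documentation as far as I know; the values are hypothetical and should be checked against your own Solr deployment and connector version.

```python
import shlex

# Hypothetical values: a single local Solr node with a collection named "energy".
sink_config = {
    "solrUrl": "http://localhost:8983/solr",
    "solrMode": "Standalone",
    "solrCollection": "energy",
    "solrCommitWithinMs": 100,
}


def render_sink_yaml(configs):
    """Render a flat dict as the 'configs:' section of a sink config file."""
    lines = ["configs:"]
    for key, value in configs.items():
        # Quote strings, leave integers bare.
        rendered = value if isinstance(value, int) else f'"{value}"'
        lines.append(f"  {key}: {rendered}")
    return "\n".join(lines) + "\n"


def sink_create_command(config_path):
    """Build the pulsar-admin invocation from the slide as an argv list."""
    return [
        "bin/pulsar-admin", "sinks", "create",
        "--tenant", "public",
        "--namespace", "default",
        "--name", "solr-sink-energy",
        "--sink-type", "solr",
        "--sink-config-file", config_path,
        "--inputs", "energy",
    ]


yaml_text = render_sink_yaml(sink_config)
command = sink_create_command("conf/solr-sink-energy.yml")
print(yaml_text)
print(shlex.join(command))
```

Writing yaml_text to conf/solr-sink-energy.yml and passing command to subprocess.run would be the natural next step on a machine with a Pulsar installation; the sketch stops short of that so it runs anywhere.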
Build Your Own Pulsar - NiFi Integration
PutSolrRecord
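PutSolrRecord is configured in the NiFi UI rather than coded, so as a rough illustration of what ends up at Solr, the sketch below builds the kind of JSON update request that Solr's update handler accepts. It is not NiFi's implementation. The base URL, collection name, and document fields are hypothetical, and the fields would need to exist in your Solr schema (or the collection would need schemaless mode).

```python
import json
import urllib.parse
import urllib.request


def build_solr_update_request(base_url, collection, docs, commit_within_ms=1000):
    """Build (but do not send) a JSON update request for a Solr collection."""
    query = urllib.parse.urlencode({"commitWithin": commit_within_ms})
    url = f"{base_url.rstrip('/')}/{collection}/update?{query}"
    body = json.dumps(docs).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical documents; field names are placeholders for illustration.
docs = [
    {"id": "plug1-0001", "device": "plug1", "voltage": 120, "current": 0.25},
    {"id": "plug2-0001", "device": "plug2", "voltage": 121, "current": 0.5},
]

request = build_solr_update_request("http://localhost:8983/solr", "energy", docs)
print(request.full_url)
print(request.data.decode("utf-8"))
# To actually index the documents against a running Solr:
# urllib.request.urlopen(request)
```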
Connect with the Community & Stay Up-To-Date
● Join the Pulsar Slack channel - Apache-Pulsar.slack.com
● Follow @streamnativeio and @apache_pulsar on Twitter
● Subscribe to Monthly Pulsar Newsletter for major news, events, project updates,
and resources in the Pulsar community
● https://www.datainmotion.dev/2020/04/building-search-indexes-with-apache.html
● https://github.com/tspannhw/nifi-solr-example
● https://github.com/streamnative/pulsar-flink
● https://www.linkedin.com/pulse/2021-schedule-tim-spann/
● https://github.com/tspannhw/SpeakerProfile/blob/main/2021/talks/20210729_HailHydrate!FromStreamtoLake_TimSpann.pdf
● https://streamnative.io/en/blog/release/2021-04-20-flink-sql-on-streamnative-cloud
● https://docs.streamnative.io/cloud/stable/compute/flink-sql
● https://pulsar.apache.org/docs/en/client-libraries-websocket/
Deeper Content
@PaasDev
https://www.pulsardeveloper.com/
timothyspann
Pulsar Summit Asia
November 20-21, 2021
Contact us at partners@pulsar-summit.org to become a sponsor or partner
Thank You
