Algorithm Name Detection
in Computer Science Research Papers
Information Retrieval & Extraction Course
IIIT HYDERABAD
Submitted To: Prof. Vasudev Verma
Submission By: Team 41
Allaparthi Sriteja [201302139]
Deeksha Singh Thakur [201505627]
Sneh Gupta [201302201]
Aim of the Project
Process the contents of the research document
List the names of the algorithms discussed in the paper
Help users find research papers in a specific domain without actually
opening and reading each of them.
Extraction of Algorithm Names from Research Papers
Converting PDF to Text
Input : A research paper in PDF format.
Output : The same paper converted to plain-text format.
Processing : Using PDFMiner
pdf2txt.py -O myoutput -o myoutput/myfile.text -t text myfile.pdf
Usage:
pdf2txt.py [options] filename.pdf
Options: -o output file name
-t output format (text/html/xml/tag[for Tagged PDFs])
-O dirname (triggers extraction of images from PDF into directory)
Named Entity Recognition
Input : Research paper in text format.
Output : Noun phrases (NNP and NN tokens)
Processing :
Sentence tokenization
Merge words divided at the end of a line [e.g., "divi- sion" → "division"]
Remove the part before the Abstract and after the References.
Find the citation sentences and extract them.
POS-tag those sentences.
Extract the NNP and NN tokens; combine NNPs occurring adjacent to each other in a sentence.
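The de-hyphenation and NNP-merging steps above can be sketched in plain Python. The merging function assumes NLTK-style `(word, tag)` pairs as produced by `nltk.pos_tag`; the function names are our own:

```python
import re


def merge_hyphenated(text):
    """Re-join words split across line breaks, e.g. 'divi-\\nsion' -> 'division'."""
    return re.sub(r"(\w+)-\s*\n\s*(\w+)", r"\1\2", text)


def extract_noun_tokens(tagged_sentence):
    """From an NLTK-style [(word, tag), ...] sentence, collect NN tokens and
    merge runs of adjacent NNP/NNPS tokens into single named entities."""
    entities, run = [], []
    for word, tag in tagged_sentence:
        if tag in ("NNP", "NNPS"):
            run.append(word)          # extend the current proper-noun run
        else:
            if run:                   # a run just ended: emit it as one entity
                entities.append("".join(run).lower())
                run = []
            if tag == "NN":
                entities.append(word.lower())
    if run:
        entities.append("".join(run).lower())
    return entities
```

For example, the tagged sentence `[("Random","NNP"), ("Forest","NNP"), ("is","VBZ"), ("a","DT"), ("classifier","NN")]` yields the entities `["randomforest", "classifier"]`, matching the adjacent-NNP merging described above.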
Filtration of the Named Entities
Input : Named entities that still include author names, university names, and place names.
Output : The desired named entities, stemmed with the Porter stemmer.
Processing:
Build lists of authors, universities, and places,
and compare the named entities against these lists to filter them out.
Search for the word "algorithm" or "technique" and give more weight to sentences containing
it, since the probability of finding an algorithm name is high in such sentences.
Stem the remaining named entities using the Porter stemmer.
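A sketch of this filtering step, assuming NLTK's `PorterStemmer`; the blocklists here are tiny stand-ins for the full author/university/place lists described above:

```python
import re
from nltk.stem import PorterStemmer

# Stand-in blocklists; the project builds much larger ones.
AUTHORS = {"smith", "kumar"}
PLACES = {"hyderabad", "stanford"}
ORGS = {"iiit", "acm"}
BLOCKLIST = AUTHORS | PLACES | ORGS


def filter_and_stem(named_entities, sentence=""):
    """Drop entities found in the blocklists, weight the rest higher if their
    source sentence mentions 'algorithm' or 'technique', and Porter-stem them."""
    stemmer = PorterStemmer()
    boost = 2.0 if re.search(r"\b(algorithm|technique)\b", sentence.lower()) else 1.0
    kept = []
    for ne in named_entities:
        if ne.lower() in BLOCKLIST:
            continue                  # author / university / place: filter out
        kept.append((stemmer.stem(ne.lower()), boost))
    return kept
```

For example, `filter_and_stem(["clustering", "Hyderabad", "analysis"], "a clustering algorithm")` drops `Hyderabad` and returns `[("cluster", 2.0), ("analysi", 2.0)]` (the weight value 2.0 is an illustrative choice, not specified in the slides).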
Phase II
Input : Named Entities from Research Papers
- From each research paper in the corpus, we obtain a set of named entities.
- These NEs are filtered for:
  author names, geographical locations, organization names, dataset names
BUT THE DATA STILL CONTAINS NOISE!
Example noisy tokens that survive filtering:
'neighborhood', 'sparselinearmethod', 'movi', 'slim', 'tabl',
'matrixfactor', 'hoslim', 'ratingpredict'
TASK :
Separate noisy data from the names of actual algorithms
Using WORD2VEC
from the Gensim library
Gensim is a FREE Python library that allows you to
- build and import word2vec models
- determine the similarity between words in the model
- determine the top-N most similar words to a given word
WORD2VEC MODEL :
The word2vec model under consideration contains -
word2vec word vectors
trained on ~4.3 lakh (~430,000) computer science papers, 3.7B tokens
A 300-dimensional vector representation of every one-word algorithm name
Used as model['word'] → 300-dimensional float vector
Classifying the tokens :
Form two lists (manually, by going through some papers) -
true positives [containing names of actual computer science algorithms]
false positives [the most common noise components in each paper].
Compare each named entity extracted from a paper with these lists of TPs and FPs
and compute the similarity between them. If the similarity between a word and some
word in the TP list is greater than a threshold value (0.4 in our case), classify
it as a TP, otherwise as an FP.
TOKEN LISTS (Porter-stemmed, lowercase):
TRUE POSITIVES:
'svm', 'knn', 'neuralnetwork', 'decisiontree', 'lda', 'backprop',
'spade', 'search', 'plsa', 'machinelearn', 'cluster', 'randomforest',
'network', 'markov', 'reinforcementlearn', 'cart', 'regressiontre'
FALSE POSITIVES:
'concept', 'dataset', 'database', 'approach', 'method', 'success',
'algorithm', 'analysi', 'model'
Decision rule (classify the token as an FP when):
max(model.similarity(token, tp) for tp in true_positives) < max(model.similarity(token, fp) for fp in false_positives)
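The classification rule can be sketched in plain Python. The three-dimensional vectors and token names below are toy stand-ins for the real 300-dimensional word2vec model; only the decision logic (best-TP similarity must clear the 0.4 threshold and beat the best-FP similarity) reflects the method above:

```python
import math

# Stand-in embeddings; in the project these come from the trained word2vec model.
MODEL = {
    "svm":      [0.9, 0.1, 0.0],
    "cluster":  [0.8, 0.3, 0.1],
    "newtoken": [0.85, 0.2, 0.05],
    "dataset":  [0.0, 0.1, 0.9],
    "approach": [0.1, 0.0, 0.95],
}
TRUE_POSITIVES = ["svm", "cluster"]
FALSE_POSITIVES = ["dataset", "approach"]


def cosine(u, v):
    """Cosine similarity of two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)


def classify(token, threshold=0.4):
    """Label a token TP (algorithm name) if its best similarity to the TP list
    clears the threshold and beats its best similarity to the FP list."""
    best_tp = max(cosine(MODEL[token], MODEL[t]) for t in TRUE_POSITIVES)
    best_fp = max(cosine(MODEL[token], MODEL[f]) for f in FALSE_POSITIVES)
    return "TP" if best_tp > threshold and best_tp > best_fp else "FP"
```

With these toy vectors, `classify("newtoken")` lands near the TP cluster and is labeled a TP, while `classify("dataset")` falls to the FP side.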
