OCR++: A Robust Framework for Information Extraction from Scholarly Articles
Project Description :

Obtaining structured data from documents is necessary to support retrieval tasks. Various scholarly organizations and companies deploy information extraction tools in their production environments. Google Scholar, Microsoft Academic Search, ResearchGate, CiteULike and others provide academic search engine facilities. European Publication Server (EPO), ResearchGate and Mendeley use GROBID for header extraction and analysis. A similar utility named SVMHeaderParse is deployed by CiteSeerX for header extraction. The majority of the developed tools either make use of commercial (non open source) libraries or are very specific to some literary writing formats. Also, through a comprehensive literature survey, we find comparatively less research in document structure analysis than in metadata and bibliography extraction from scientific documents. The main challenges lie in the inherent errors in OCR processing and the diverse formatting styles adopted by different publishing venues. We believe that a key strategy to tackle this problem is to analyze research articles from different publishers to identify generic patterns and rules specific to various information extraction tasks.

We introduce OCR++, an open source hybrid framework to extract textual information from scholarly articles, such as (i) metadata: title, author names, affiliation and e-mail, (ii) structure: section headings and body text, table and figure headings, URLs and footnotes, and (iii) bibliography: citation instances and references. We analyze a diverse set of scientific articles written in English to understand generic writing patterns and formulate rules to develop this hybrid framework. Extensive evaluations show that the proposed framework outperforms the existing state-of-the-art tools by a large margin in structural information extraction, along with improved performance in metadata and bibliography extraction tasks, both in terms of accuracy (around 50% improvement) and processing time (around 52% improvement).
A user experience study conducted with the help of 30 researchers reveals that the researchers found this system to be very helpful. The result of the framework can be exported as a whole into structured TEI-encoded documents.

OCR++ is an extraction framework for scholarly articles, written entirely in Python. The framework takes a PDF article as input, 1) converts the PDF file to an XML format, 2) processes the XML file to extract useful information, and 3) exports the output as structured TEI-encoded documents. We use the open source tool pdf2xml to convert PDF files into rich XML files. Each token in the PDF file is annotated with rich metadata, namely x and y coordinates, font size, font weight, font style, etc. We leverage this rich information present in the XML files to perform the extraction tasks. Although each extraction task is performed using machine learning models as well as handwritten rules/heuristics, we only include the better performing scheme in our framework. The current version of OCR++ is deployed at our research group server. The present infrastructure consists of a single CentOS instance. We make the entire source code publicly available. We also plan to make the annotated dataset publicly available in the near future.
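To make the token-level processing concrete, here is a minimal sketch in Python of how the positional and font metadata could drive a handwritten rule. It is not the OCR++ implementation: the XML layout (`DOCUMENT`, `PAGE`, `TOKEN` elements with `x`, `y`, `font-size` and `bold` attributes) is an assumed, simplified stand-in for pdf2xml output, and the rule "the title is the run of largest-font tokens on the first page" is only an illustrative heuristic. The helper names `load_tokens` and `extract_title` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumed, simplified pdf2xml-style output. Element and attribute names
# are illustrative and may differ from what the real tool emits.
SAMPLE_XML = """
<DOCUMENT>
  <PAGE number="1">
    <TOKEN x="90.0" y="72.0" font-size="18.0" bold="yes">OCR++:</TOKEN>
    <TOKEN x="150.0" y="72.0" font-size="18.0" bold="yes">A</TOKEN>
    <TOKEN x="165.0" y="72.0" font-size="18.0" bold="yes">Robust</TOKEN>
    <TOKEN x="90.0" y="94.0" font-size="18.0" bold="yes">Framework</TOKEN>
    <TOKEN x="90.0" y="130.0" font-size="11.0" bold="no">Mayank</TOKEN>
    <TOKEN x="130.0" y="130.0" font-size="11.0" bold="no">Singh</TOKEN>
    <TOKEN x="90.0" y="170.0" font-size="9.0" bold="no">Abstract</TOKEN>
  </PAGE>
</DOCUMENT>
"""


def load_tokens(xml_text):
    """Parse the XML into a flat list of token dictionaries."""
    root = ET.fromstring(xml_text.strip())
    tokens = []
    for page in root.iter("PAGE"):
        page_no = int(page.get("number", "1"))
        for tok in page.iter("TOKEN"):
            tokens.append({
                "page": page_no,
                "text": (tok.text or "").strip(),
                "x": float(tok.get("x", "0")),
                "y": float(tok.get("y", "0")),
                "font_size": float(tok.get("font-size", "0")),
                "bold": tok.get("bold") == "yes",
            })
    return tokens


def extract_title(tokens):
    """Illustrative rule: join the largest-font tokens on page 1."""
    first_page = [t for t in tokens if t["page"] == 1 and t["text"]]
    if not first_page:
        return ""
    largest = max(t["font_size"] for t in first_page)
    title_tokens = [t for t in first_page
                    if abs(t["font_size"] - largest) < 0.1]
    # Reading order: top to bottom, then left to right.
    title_tokens.sort(key=lambda t: (t["y"], t["x"]))
    return " ".join(t["text"] for t in title_tokens)


if __name__ == "__main__":
    print(extract_title(load_tokens(SAMPLE_XML)))
```

A rule like this relies only on layout cues, which is why it can generalize across publishers that place the title in the largest type; real articles would need extra checks (for example, large-font journal banners) that this sketch leaves out.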

Project Details :
  • Date : Oct 12, 2016
  • Innovator : Mayank Singh
  • Team Members : Barnopriyo Barua, Priyank Palod, Manvi Garg, Sidhartha Satapathy, Samuel Bushi, Kumar Ayush, Krishna Sai Rohith, Tulasi Gamidi
  • Guide Name : Dr. Pawan Goyal and Dr. Animesh Mukherjee
  • University : Indian Institute of Technology Kharagpur
  • Submission Year : 2017
  • Category : Computer science, Information technology & related fields