Abstract Details

ID / Submission Time

274 / 2017-01-31 14:26:19

Title

Implementation of PDF Crawler using Boolean inverted index and n-gram model

Keywords

keyword, key-phrase, inverted index, n-gram

Topics and Special Sessions

All Topics

Status

Full Paper Accepted

Authors

Snehal Kadwe / Yeshwantrao Chavan College of Engineering

Shrikant Ardhapurkar / Yeshwantrao Chavan College of Engineering

Abstract

Nowadays, many users store their information in PDF documents, and retrieving such documents is a formidable task. To address this problem, a PDF crawler is implemented. PDF documents are retrieved using the keywords and key-phrases present in them. Keyword extraction is based on a Boolean inverted index, whereas key-phrases are extracted using an n-gram algorithm. Pre-processing of a PDF document begins by assigning a term frequency (TF) to every word in it, and each document is mapped to a unique ID (docID). Once term frequencies have been assigned, the keyword with the highest count is extracted and stored in the database as a (docID, keyword) pair in the inverted index. Key-phrases are extracted from the text as n-grams. The inverted index makes the PDF crawler faster by grouping together the documents that contain the same keyword. This helps reduce storage space and shortens the time required to retrieve a document.
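The pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes the text has already been extracted from each PDF, uses an in-memory dictionary in place of the database, and applies no stop-word filtering. All function names (tokenize, top_keywords, build_inverted_index, boolean_and, ngrams) and the sample documents are hypothetical.

```python
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def top_keywords(text, k=1):
    """Return the k tokens with the highest term frequency (TF)."""
    counts = Counter(tokenize(text))
    return [word for word, _ in counts.most_common(k)]


def build_inverted_index(docs, k=1):
    """Map each top keyword to the set of docIDs that contain it.

    `docs` is a dict of docID -> already-extracted text (an assumption;
    PDF parsing itself is outside this sketch).
    """
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for keyword in top_keywords(text, k):
            index[keyword].add(doc_id)
    return index


def boolean_and(index, *terms):
    """Boolean AND query: docIDs whose postings contain every term."""
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings) if postings else set()


def ngrams(tokens, n):
    """Return all contiguous n-token phrases as candidate key-phrases."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Hypothetical sample data standing in for extracted PDF text.
docs = {
    1: "inverted index maps index terms to documents",
    2: "crawler fetches pdf files and the crawler parses them",
    3: "index cards and index pages",
}
index = build_inverted_index(docs)
matches = boolean_and(index, "index")
phrases = ngrams(tokenize("Boolean inverted index"), 2)
```

Because documents 1 and 3 share the same top keyword, they end up in a single postings set, so a keyword lookup touches one index entry instead of scanning every document. A real system would likely also filter stop words before counting, since otherwise a word such as "the" can become the highest-frequency term.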