
Crawler for Competitive Intelligence

Essay  •  August 5, 2012  •  Research Paper  •  3,275 Words (14 Pages)


ABSTRACT

Competitive intelligence can be understood as the activity of gathering, analyzing, and distributing information about products, customers, competitors, and any other aspect of the environment needed to support executives and managers in making strategic decisions for an organization. The goal of Competitive Intelligence (CI), a sub-area of Knowledge Management, is to monitor a firm's external environment to obtain information relevant to its decision-making process [1]. In this paper we present a focused web crawler which collects pages from the Web using different graph-search algorithms and is the main component of a CI tool. A focused (or topical) crawler is a web crawler that attempts to download only web pages relevant to a pre-defined topic or set of topics. The performance of a focused crawler depends mostly on the richness of links in the specific topic being searched, and focused crawling usually relies on a general web search engine for providing starting points [2]. The CI crawler collects web pages from sites specified by the user and applies indexing and categorization analysis to the documents collected, thus providing the user with an up-to-date view of those pages.

FEATURES:-

* Crawls web pages using any one of three algorithms — FCFS, FICA, or Dynamic PageRank — each explained later in the paper.

* Follows web pages that get redirected:- Many URLs redirect to another page with a new address, which makes it hard for search engines to locate the actual page. We solve this problem by analyzing the "meta" tag and its 'http-equiv' attribute.

* Resolves relative URLs to absolute URLs:- Some web pages contain links that are not absolute. We solve this problem by converting each relative URL to its corresponding absolute URL before crawling it.

* Extracts the useful text content from web pages:- As the crawler visits pages, it saves them in ".txt" format in a local repository. We strip the HTML tags, JavaScript, and other irrelevant content from each page to extract only the relevant textual information.

* Avoids "ad servers" while crawling:- Some URLs on a web page point to advertisements, which are useless for a focused crawler to visit. We use a list of ad servers (available on the internet), which can be updated dynamically in our program, to block visits to such pages.

* Crawls in a focused way, visiting pages pertaining to a specific topic:- A CI crawler should crawl only pages that are relevant to the topic entered by the user, and avoid pages that fall outside it. We are currently working on this aspect and plan to analyze the "title" tag and keywords of each web page to check whether the page is relevant.

* Can also be used as a search engine when integrated with Lucene, an indexer written in Java.
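Two of the features above — detecting "meta refresh" redirects via the 'http-equiv' attribute and resolving relative URLs to absolute ones — can be sketched in a few lines. The sketch below is illustrative Python, not the paper's implementation; the function names and the regular expressions are our own assumptions.

```python
import re
from urllib.parse import urljoin

def find_meta_refresh(html, base_url):
    """Detect a <meta http-equiv="refresh"> redirect and return the
    absolute target URL, or None if the page does not redirect."""
    m = re.search(
        r'<meta[^>]+http-equiv=["\']?refresh["\']?[^>]*'
        r'content=["\'][^"\';]*;\s*url=([^"\'>]+)',
        html, re.IGNORECASE)
    return urljoin(base_url, m.group(1).strip()) if m else None

def resolve_links(html, base_url):
    """Extract href targets and resolve relative URLs to absolute ones."""
    hrefs = re.findall(r'href=["\']([^"\']+)["\']', html, re.IGNORECASE)
    return [urljoin(base_url, h) for h in hrefs]
```

For example, a page at `http://example.com/old/` carrying `<meta http-equiv="refresh" content="0; url=/new/home.html">` would be recognized as redirecting to `http://example.com/new/home.html`, and a relative link such as `about.html` on the same page resolves to `http://example.com/old/about.html`.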

INTRODUCTION

One of the most obvious trends in business today is increased competition. One reason is that the world is now a single marketplace; additionally, distribution channels like the internet make it possible for anyone to enter that global marketplace. As a result, the rate of change in business is increasing rapidly — for example, internet usage was reported at the time to double every 100 days. Anyone who expects to stay current and survive in this fast-paced competitive environment must know what the competition is doing. So how does one monitor the competition in this age of information overload? The answer is Competitive Intelligence. The best way to implement CI is to focus on the recent developments of competitors in the respective field, and one must continually monitor critical issues in order to compete [3]. In this paper, we present a focused crawler, an integral part of a CI tool, which serves this purpose. We have developed the focused web crawler using three different algorithms, any of which the user can choose according to his or her requirements: if the crawler is used for a search engine, the Dynamic PageRank or FICA algorithm is suitable, while if the crawler is used for CI, the FCFS (breadth-first) algorithm should be used, as the number of domains crawled will be relatively small.

Breadth-First Search with First Come First Served (FCFS) priority:-

We can picture the Web as a directed graph, with nodes representing web pages and edges representing the links between them. In this algorithm, web pages are crawled in breadth-first order, with higher priority given to pages in the order of their first occurrence [4].
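The FCFS order falls out of using a FIFO queue as the crawl frontier. The following is a minimal sketch, assuming a `get_links(url)` helper that stands in for fetching a page and parsing its hyperlinks (that helper is our assumption, not a function from the paper):

```python
from collections import deque

def bfs_crawl(seed, get_links, max_pages=50):
    """Crawl breadth-first with FCFS priority: pages are visited in the
    order their URLs were first discovered."""
    frontier = deque([seed])          # FIFO queue => first come, first served
    seen, order = {seed}, []
    while frontier and len(order) < max_pages:
        url = frontier.popleft()      # oldest discovered, crawled next
        order.append(url)
        for link in get_links(url):
            if link not in seen:      # visit each page at most once
                seen.add(link)
                frontier.append(link)
    return order
```

On a toy graph where page A links to B and C, and B links to C and D, the crawl order is A, B, C, D — pages at distance one from the seed before pages at distance two.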

Fast Intelligent Crawling Algorithm (FICA), based on Dijkstra's algorithm [5]:-

This algorithm is also based on breadth-first search, but priority is assigned according to the distance of each about-to-be-crawled page from the root (the starting URL). This distance is logarithmic, and the algorithm is dynamic, in the manner of Dijkstra's shortest-path algorithm.
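One plausible reading of this is a priority-queue crawl that always expands the frontier page with the smallest accumulated distance from the seed, Dijkstra-style. The sketch below uses `log(1 + out-degree)` of the parent page as the edge weight — that particular weight is our illustrative assumption, since the text only states that the distance is logarithmic:

```python
import heapq
import math

def fica_crawl(seed, get_links, max_pages=50):
    """Dijkstra-style crawl: pop the frontier page with the smallest
    accumulated (logarithmic) distance from the starting URL."""
    dist = {seed: 0.0}
    heap = [(0.0, seed)]
    order = []
    while heap and len(order) < max_pages:
        d, url = heapq.heappop(heap)
        if d > dist.get(url, float("inf")):
            continue                          # stale heap entry, skip
        order.append(url)
        links = get_links(url)
        w = math.log(1 + len(links)) if links else 0.0
        for link in links:
            nd = d + w                        # relax edge, Dijkstra-style
            if nd < dist.get(link, float("inf")):
                dist[link] = nd
                heapq.heappush(heap, (nd, link))
    return order
```

Unlike plain FCFS, pages reachable through "short" (low-weight) paths are crawled before pages that were discovered earlier but sit behind heavier links.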

Dynamic PageRank algorithm:-

Google uses the PageRank formula for ranking pages and ordering its results accordingly. We use the same formula to set the crawling priority: the method applies PageRank to the web pages seen so far and crawls the pages with the highest rank first.
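The standard iterative PageRank computation over the subgraph seen so far can be sketched as follows; a dynamic-PageRank crawler would recompute these scores as the known subgraph grows and pick the unvisited page with the highest score next. The damping factor 0.85 is the value conventionally used for PageRank, assumed here rather than taken from the paper:

```python
def pagerank(graph, damping=0.85, iters=50):
    """Iterative PageRank over the pages seen so far.
    `graph` maps each known page to the pages it links to."""
    nodes = list(graph)
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}        # uniform initial rank
    for _ in range(iters):
        new = {u: (1 - damping) / n for u in nodes}
        for u, links in graph.items():
            out = [v for v in links if v in graph]  # only pages we know
            if out:
                share = damping * rank[u] / len(out)
                for v in out:
                    new[v] += share           # u passes rank to its links
            else:
                for v in nodes:               # dangling page: spread evenly
                    new[v] += damping * rank[u] / n
        rank = new
    return rank
```

For two pages that link only to each other, the scores converge to 0.5 each; in general the ranks sum to 1 and pages with more (and better-ranked) in-links score higher.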

ARCHITECTURE

The Crawler for CI has the following components:-

URL Opener:- This component opens the web page so that the parser can extract its links and the retriever can save it.

URL Retriever:- This component saves the web page opened by the URL Opener into a local repository for further analysis.

Web Page Filterer:- This component removes the irrelevant HTML and other associated tags to extract the useful text.

Parser:- This component visits all the hyperlinks of a web page in the order dictated by the algorithm in use.
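Wired together, the four components form a simple loop. The sketch below is a minimal illustration of that pipeline, not the authors' code: `fetch(url)` stands in for the URL Opener (e.g. an HTTP GET), the returned dict plays the role of the URL Retriever's local repository, `strip_tags` is the Web Page Filterer, and the link-extraction loop is the Parser (here with plain FCFS ordering):

```python
import re
from collections import deque
from urllib.parse import urljoin

def strip_tags(html):
    """Web Page Filterer: drop <script> blocks and tags, keep the text."""
    html = re.sub(r'<script[^>]*>.*?</script>', ' ', html,
                  flags=re.IGNORECASE | re.DOTALL)
    text = re.sub(r'<[^>]+>', ' ', html)
    return ' '.join(text.split())

def crawl(seed, fetch, max_pages=10):
    """URL Opener -> URL Retriever -> Web Page Filterer -> Parser."""
    repository, frontier, seen = {}, deque([seed]), {seed}
    while frontier and len(repository) < max_pages:
        url = frontier.popleft()
        html = fetch(url)                      # URL Opener
        repository[url] = strip_tags(html)     # Retriever + Filterer
        for href in re.findall(r'href=["\']([^"\']+)["\']', html):
            link = urljoin(url, href)          # Parser: resolve + enqueue
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return repository
```

In a real crawler the repository would be the on-disk ".txt" store described in the features section, and the frontier discipline would be swapped for FICA or Dynamic PageRank as needed.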

ALGORITHMS USED

The Web can be thought of as a connected, directed graph, with web pages represented by the nodes and the hyperlinks represented

...

...
