RE: [smila-user] Crawler and link analysis

Hi,

Filters are evaluated in combination with the <CrawlScope> tag and also with the FollowLinks attribute of the <Seeds> tag.


In connection with the CrawlScope:

If a link matches the configured CrawlScope, only the Unselect filters are checked.
If a link doesn't match it, the Select filters are checked.

In your case the CrawlScope "Broad" matches every link, so the Select filters are never used.


In connection with the Seeds FollowLinks attribute:

Follow -> If an Unselect filter matches, the link is only analyzed (i.e. it is spidered, but not stored in the index).
    Select filters are not used.
NoFollow -> If an Unselect filter matches, the link is not spidered at all!

FollowLinksWithCorrespondingSelectFilter ->
Pages that match both "Select" and "Unselect" filters will be indexed, and everything else that matches will be analyzed.


What this means for your case:

If A and B are on the same domain/host, you should use a CrawlScope of type Domain, Host, or Path.
Google links should then not be spidered.
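
A rough sketch of that first option, based on the configuration from your mail. It assumes that "Host" is the exact spelling the crawler accepts for the scope type and that A and B really share a host, so please check it against the crawler schema before relying on it:

```xml
<!-- Assumption: "Host" is an accepted CrawlScope Type; "Domain" or "Path" may fit your site better. -->
<CrawlScope Type="Host"></CrawlScope>

<Seeds FollowLinks="Follow">
  <!-- "A" is the placeholder seed from the original mail. -->
  <Seed> A </Seed>
</Seeds>
```

No Select filter is needed here, because links inside the scope are only checked against Unselect filters.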

You can also use the FollowLinks="NoFollow" mode and explicitly forbid Google with an Unselect filter.
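
This second option might look roughly like the following. The regular expression is only an illustration of how Google could be excluded, not a tested pattern:

```xml
<CrawlScope Type="Broad"></CrawlScope>

<Seeds FollowLinks="NoFollow">
  <!-- "A" is the placeholder seed from the original mail. -->
  <Seed> A </Seed>
</Seeds>
<Filters>
  <!-- Illustrative pattern: links matching it should not be spidered in NoFollow mode. -->
  <Filter Type="RegExp" Value=".*google\.com.*" WorkType="Unselect"/>
</Filters>
```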


Also, the following line in log4j.properties should produce more logging output from the web crawler:
log4j.logger.org.eclipse.smila.connectivity.framework.crawler.web=DEBUG

Hope this helps.

Sebastian


> -----Original Message-----
> From: smila-user-bounces@xxxxxxxxxxx [mailto:smila-user-bounces@xxxxxxxxxxx] On Behalf Of Patrick Pekczynski
> Sent: Saturday, August 14, 2010 9:51 AM
> To: smila-user@xxxxxxxxxxx
> Subject: [smila-user] Crawler and link analysis
> 
> Dear all,
> 
> I played around a bit with the SMILA crawling facilities, especially with the web-crawling component.
> 
> If I want to crawl a site A where A has links to B and to google.com:
> 
>   A -> B
>   A -> google.com
> 
> and I setup a web-crawler as follows:
> 
> <CrawlScope Type= "Broad"></CrawlScope>
> 
> <Seeds FollowLinks="Follow">
> <Seed> A </Seed>
> </Seeds>
> <Filters>
> <Filter Type="RegExp" Value=".*B.*" WorkType="Select"/>
> </Filters>
> 
> I would expect the crawler to start at site A and then ONLY follow B, but instead it also crawls google.com.
> 
> I also tried WorkType="Unselect" instead, which, though a bit counterintuitive, is recommended in the crawler documentation.
> But although the crawler should then only follow "some matching Unselect filters", it crawls not only B but also google.com.
> 
> My question now is, whether someone can show me what I am doing wrong or how to setup such a scenario correctly (starting at A
> and ONLY following links matching some pattern B)
> 
> Thanks for your help
> 
> Kind regards,
> 
> Patrick
> 
> 
> 
> 
> --
> Patrick Pekczynski
> Lilienweg 11
> D - 66773 Schwalbach-Elm
> eMail: pekczynski@xxxxxxxxxxxxxx