RE: [smila-user] Crawler and link analysis
- From: Sebastian Voigt <svoigt@xxxxxxx>
- Date: Mon, 16 Aug 2010 14:14:20 +0200
Filters are used in connection with the <CrawlScope> tag and also with the <Seeds FollowLinks> tag.
In connection with the CrawlScope:
If a link matches the configured CrawlScope, only the Unselect filters are checked.
If a link doesn't match, the Select filters are checked.
In your case the CrawlScope Broad matches every link, so the Select filters are never used.
In connection with the Seeds FollowLinks setting:
Follow -> If an Unselect filter matches, the link is only analyzed (meaning it will be spidered, but will not be stored in the index).
Select filters are not used.
NoFollow -> If an Unselect filter matches, the link will not be spidered!
Pages that match both "Select" and "Unselect" filters will be indexed, and everything else that matches will be analyzed.
What this means for your case:
If A and B are on the same domain/host, you should use a CrawlScope of type Domain, Host or Path.
Google links should not be spidered in that case.
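As a rough sketch of that option, assuming the element layout from your own snippet (I have not checked this against the crawler schema), the configuration could look like this. Host is just one of the three scope types mentioned above, and A stands for your seed URL:

```xml
<!-- Sketch only: element nesting copied from the quoted configuration, not verified. -->
<!-- Restrict the crawl to the seed's host, so links to other hosts fall outside the scope. -->
<CrawlScope Type="Host"></CrawlScope>
<Seeds FollowLinks="Follow">
  <Seed> A </Seed>
</Seeds>
```

With no Select filter configured, links outside the scope have nothing to match against, which is the behaviour you are after for google.com.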
You can also use the FollowLinks="NoFollow" mode and explicitly forbid Google with an Unselect filter.
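A minimal sketch of this NoFollow variant, again reusing the element layout from your snippet. The regular expression for Google is only an illustrative guess at a suitable pattern, and the placement of <Filters> inside <Seeds> is an assumption:

```xml
<!-- Sketch only: structure mirrors the quoted configuration, not verified against the schema. -->
<CrawlScope Type="Broad"></CrawlScope>
<Seeds FollowLinks="NoFollow">
  <Seed> A </Seed>
  <Filters>
    <!-- Hypothetical pattern: with NoFollow, links matching an Unselect filter are not spidered. -->
    <Filter Type="RegExp" Value=".*google\.com.*" WorkType="Unselect"/>
  </Filters>
</Seeds>
```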
Also, the following line in log4j.properties should produce more logging information for the web crawler.
Hope this helps.
> -----Original Message-----
> From: smila-user-bounces@xxxxxxxxxxx [mailto:smila-user-bounces@xxxxxxxxxxx] On Behalf Of Patrick Pekczynski
> Sent: Saturday, August 14, 2010 9:51 AM
> To: smila-user@xxxxxxxxxxx
> Subject: [smila-user] Crawler and link analysis
> Dear all,
> I played a bit around with the SMILA crawling facilities, especially with the WEB-crawling component.
> If I want to crawl a site A where A has links to B and to google.com (A -> B, A -> google.com)
> and I set up a web crawler as follows:
> <CrawlScope Type="Broad"></CrawlScope>
> <Seeds FollowLinks="Follow">
>   <Seed> A </Seed>
>   <Filters>
>     <Filter Type="RegExp" Value=".*B.*" WorkType="Select"/>
>   </Filters>
> </Seeds>
> I would expect the crawler to start at site A and then ONLY follow B, but instead it also crawls google.com.
> I also tried WorkType="Unselect" instead, which, though a bit counterintuitive, is recommended in the crawler documentation.
> But although the crawler should then only follow "some matching Unselect filters", it crawls not only B but also google.com.
> My question now is whether someone can show me what I am doing wrong or how to set up such a scenario correctly (starting at A
> and ONLY following links matching some pattern B).
> Thanks for your help
> Kind regards,
> Patrick Pekczynski
> Lilienweg 11
> D - 66773 Schwalbach-Elm
> eMail: pekczynski@xxxxxxxxxxxxxx