[smila-user] Crawler and link analysis
- From: Patrick Pekczynski <pekczynski@xxxxxxxxxxxxxx>
- Date: Sat, 14 Aug 2010 09:50:50 +0200
- Delivered-to: smila-user@eclipse.org
Dear all,
I have been experimenting with SMILA's crawling facilities, in particular the web crawler component.
Suppose I want to crawl a site A that links to B and to google.com:
A -> B
A -> google.com
and I set up a web crawler as follows:
<CrawlScope Type="Broad"></CrawlScope>
<Seeds FollowLinks="Follow">
  <Seed> A </Seed>
</Seeds>
<Filters>
  <Filter Type="RegExp" Value=".*B.*" WorkType="Select"/>
</Filters>
I would expect the crawler to start at site A and then ONLY follow B, but instead it also crawls google.com.
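To make the expectation concrete, here is a minimal Python sketch of the "Select" semantics I have in mind. It is not SMILA code and makes no claim about how the crawler actually evaluates filters; the function name select_links and the example URLs are hypothetical, and I am assuming the regular expression has to match the whole URL.

```python
import re


def select_links(links, pattern):
    """Return only the links whose full URL matches the pattern.

    Hypothetical model of the expected 'Select' behaviour, not SMILA's API.
    """
    regex = re.compile(pattern)
    return [link for link in links if regex.fullmatch(link)]


# Hypothetical outgoing links found on site A.
links_on_a = ["http://site-B.example/page.html", "http://google.com/"]

# Same pattern as in the Filter element above.
followed = select_links(links_on_a, r".*B.*")
print(followed)
```

Under this model only the link to B would be followed, and google.com would be dropped because its URL does not match the pattern.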
I also tried WorkType="Unselect" instead, which seems a bit counterintuitive but is what the crawler documentation recommends.
According to the documentation, the crawler should then follow only links that match "some matching Unselect filters", yet it still crawls google.com as well as B.
My question is whether someone can show me what I am doing wrong, or how to set up such a scenario correctly (starting at A and following ONLY links that match some pattern B).
Thanks for your help
Kind regards,
Patrick
--
Patrick Pekczynski
Lilienweg 11
D - 66773 Schwalbach-Elm
eMail: pekczynski@xxxxxxxxxxxxxx