[smila-user] Crawler and link analysis

Dear all,

I have been playing around with the SMILA crawling facilities, especially the web crawler component.

If I want to crawl a site A where A has links to B and to google.com
A -> B
A -> google.com

and I set up a web crawler as follows:

<CrawlScope Type="Broad"></CrawlScope>

<Seeds FollowLinks="Follow">
<Seed> A </Seed>
</Seeds>
<Filters>
<Filter Type="RegExp" Value=".*B.*" WorkType="Select"/>
</Filters>

I would expect the crawler to start at site A and then ONLY follow B, but instead it also crawls google.com.
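
To make the expected behaviour concrete, here is a minimal sketch in Python of the filtering I have in mind. It is not SMILA code: the function name select_links and the example URLs are made up for illustration, and it assumes that a "Select" filter keeps a link only if the pattern matches the whole link.

```python
import re


def select_links(links, pattern):
    """Return only the links that the pattern matches in full.

    Hypothetical helper: it illustrates the behaviour I expect from a
    Select filter, not how SMILA actually implements it.
    """
    regex = re.compile(pattern)
    return [link for link in links if regex.fullmatch(link)]


# Links found on site A (placeholder URLs, assumed for this example).
links_on_a = ["http://B.example/page", "http://google.com/"]

# Only the link containing "B" should remain in the crawl queue.
print(select_links(links_on_a, r".*B.*"))
```

With this logic the google.com link would be dropped before it is ever queued, which is what I expected the crawler to do.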

I also tried WorkType="Unselect" instead, which, although a bit counterintuitive, is recommended in the crawler documentation.
But although the crawler should then only follow "some matching Unselect filters", it crawls not only B but also google.com.

My question is whether someone can show me what I am doing wrong, or how to set up such a scenario correctly (starting at A and ONLY following links that match some pattern B).

Thanks for your help

Kind regards,

Patrick




-- Patrick Pekczynski Lilienweg 11 D - 66773 Schwalbach-Elm eMail: pekczynski@xxxxxxxxxxxxxx