[smila-user] Crawler and link analysis

From: Patrick Pekczynski <pekczynski@xxxxxxxxxxxxxx>
Date: Sat, 14 Aug 2010 09:50:50 +0200

Dear all,

I have been playing around a bit with the SMILA crawling facilities, especially the web crawling component.

Suppose I want to crawl a site A, where A has links to B and to google.com:
A -> B
A -> google.com

I set up the web crawler as follows:

<CrawlScope Type="Broad"></CrawlScope>

<Seeds FollowLinks="Follow">
  <Seed> A </Seed>
</Seeds>
<Filters>
  <Filter Type="RegExp" Value=".*B.*" WorkType="Select"/>
</Filters>

I would expect the crawler to start at site A and then ONLY follow B, but instead it also crawls google.com.
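
To make the expected behaviour concrete, here is a minimal Java sketch of the "Select" semantics I have in mind. It is not SMILA's implementation: the class LinkFilterSketch, the method selectLinks, and the example URLs are hypothetical and exist only for illustration. The sketch assumes the filter is a full-match test of each outgoing link against the regular expression, so that only matching links are kept.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical illustration only; this is NOT how SMILA is implemented.
final class LinkFilterSketch {

    // Keep only the links whose whole URL matches the given regular expression,
    // which is the behaviour expected from a "Select" filter in the question.
    static List<String> selectLinks(List<String> links, String regex) {
        Pattern pattern = Pattern.compile(regex);
        return links.stream()
                .filter(link -> pattern.matcher(link).matches())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Placeholder URLs standing in for the outgoing links found on site A.
        List<String> linksOnA = Arrays.asList(
                "http://B.example/index.html",
                "http://google.com/");
        System.out.println(selectLinks(linksOnA, ".*B.*"));
    }
}
```

Under that assumption, the google.com link does not match .*B.* and would never be queued for crawling.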

I also tried WorkType="Unselect" instead, which, although a bit counterintuitive, is what the crawler documentation recommends.
But although the crawler should then only follow "some matching Unselect filters", it still crawls not only B but also google.com.

My question is: can someone show me what I am doing wrong, or how to set up such a scenario correctly (starting at A and following ONLY links that match some pattern B)?

Thanks for your help

Kind regards,

Patrick




--
Patrick Pekczynski
Lilienweg 11
D - 66773 Schwalbach-Elm
eMail: pekczynski@xxxxxxxxxxxxxx


Follow-Ups:
- RE: [smila-user] Crawler and link analysis
  - From: Sebastian Voigt
