Re: [smila-user] Persistent unique IDs in SMILA


From: Jürgen Schumacher <waeller@xxxxxxxxx>
Date: Fri, 17 Aug 2012 13:17:09 +0200

Hi,

ad a) That depends on where the data comes from. For example, the WebCrawler creates the record ID from the crawl job name and the URL of the crawled resource, so as long as the URL is unique enough for you, the record ID should be, too. In each job run, a crawler generates the same record ID for the same resource; otherwise delta checking would not be possible. That is how I understand requirement 3, so is that OK?
ad b) No, the fingerprint is usually something like a last-modification date or a content hash (as in the WebCrawler). So it may be the same for objects from different locations, and it changes if the content of an object changes.
ad c) Yes, probably the same problems as b).
ad d) Correct, GUIDs would not satisfy 4.

So it looks to me that a) would be OK.
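
The difference between a) and b) can be sketched in a few lines of Python. This is only an illustration under assumptions: the functions `record_id` and `fingerprint`, the job name `crawlWeb`, and the ID layout are hypothetical and are not SMILA's actual implementation or format. The point is that the ID depends only on job name and URL, while the fingerprint depends only on content.

```python
import hashlib

def record_id(job_name: str, url: str) -> str:
    """Hypothetical record ID built only from the crawl job name and the URL."""
    return f"{job_name}:{url}"

def fingerprint(content: bytes) -> str:
    """Hypothetical delta-checking fingerprint: a hash of the content only."""
    return hashlib.sha256(content).hexdigest()

url = "http://example.org/a.html"

# The same resource seen in two job runs: the ID stays the same ...
id_run1 = record_id("crawlWeb", url)
id_run2 = record_id("crawlWeb", url)

# ... while the fingerprint changes as soon as the content changes.
fp_run1 = fingerprint(b"old content")
fp_run2 = fingerprint(b"new content")

# A different resource with identical content gets the same fingerprint,
# which is why the fingerprint is not usable as a unique ID.
id_other = record_id("crawlWeb", "http://example.org/b.html")
fp_other = fingerprint(b"old content")
```

Comparing `id_run1` with `id_run2`, and `fp_run1` with `fp_other`, shows why the record ID fits requirements 3 and 4 and the fingerprint does not fit requirement 1.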

Jürgen.

On 17.08.2012 09:58, Bjoern Decker wrote:

Dear SMILAs,

Within the CUbRIK platform, each content object gets a persistent unique ID for identification.

This ID has the following requirements:

1.     It needs to be unique within an instance of the CUbRIK platform or, even better, worldwide. (The ID is used to allocate the content object within the CUbRIK platform.)

2.     It needs to be calculated locally on a node, i.e., with no central component that provides these IDs.

3.     It needs to be unique across job runs.

4.     For the same object (from the same location with the same metadata), it should be the same.

Initially, I came up with the following solutions:

a)     Use the record ID created during job runs (does this fulfill 3?)

b)    Use the fingerprint from delta indexing (does this fulfill 1?)

c)     Create a hash from the document metadata (which might be the same as b))

d)    Use GUIDs (however, this would not satisfy 4.)
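
As a side note on d), the following sketch shows why random GUIDs conflict with requirement 4, and how a name-based (version 5) UUID behaves differently. It uses only Python's standard `uuid` module; the namespace name `cubrik.example.org` and the helper names `persistent_id` and `random_id` are made up for illustration and are not part of SMILA or CUbRIK.

```python
import uuid

# Hypothetical per-platform namespace, derived from a made-up DNS name.
PLATFORM_NS = uuid.uuid5(uuid.NAMESPACE_DNS, "cubrik.example.org")

def random_id() -> uuid.UUID:
    """Random (version 4) GUID: computed locally, but different on every call."""
    return uuid.uuid4()

def persistent_id(location: str) -> uuid.UUID:
    """Name-based (version 5) UUID: computed locally and repeatable,
    because it is a hash of the namespace and the object's location."""
    return uuid.uuid5(PLATFORM_NS, location)

loc = "http://example.org/a.html"
same_again = persistent_id(loc) == persistent_id(loc)   # repeatable per location
random_differs = random_id() != random_id()             # not repeatable
```

Whether a name-based UUID is unique enough still depends on the location string being unique, just as with the URL-based record ID.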

But maybe you have a better suggestion for this.

Best wishes

Björn
