Re: [smila-dev] search record: group by vs. faceting

Hi,

>> drive this even further. The question is: do we want to spec it (filtering) in that detail as a general convention or shall we leave this to impl. of integrated search technologies?
> No I don't think we should specify too much. It would be just guessing. So rather: Keep it simple, let's focus on what we (you ;-) need today. If there is some really fancy feature next year that doesn't fit it, we can extend the specification then.

I don't think I posed my question properly here. Solr already supports different kinds of faceting, so the Solr impl. will spec these on its impl. page. In contrast, we have the more generic convention that all impls should (or could) adhere to. So the question was: shall we extend the generic convention to cover all Solr capabilities, or shall we keep it simpler than what can be done and spec'ed with Solr? I think we should keep the general spec fairly simple and limited to the most common use cases. In this case I would limit the generic faceting config to the common enum case and keep the range stuff on the Solr side only. Later we can generalize this too if we feel like it.


# Topic facetName:
> - We don't use "-" in parameter names yet, I think, so this should rather be "facetName".
noted

>So "normally" you specify the attribute to use for faceting and the 
>attribute name will be used as the facet-name, too
Roger that. I like yours better and will do it like so. Thx

> This would make it possible to have different facetings for a single attribute.
Exactly. That is why I introduced it. Solr isn't quite there yet either, but they are working on it.

> If the facet parameter does not contain an "attribute" because the faceting algorithm does not use a single attribute or whatever, the "facetName" would be required, of course. ...
Correct!

> I just wanted to make sure that the result record can still contain the standard "records" list of ungrouped results - if the search engine can produce it, of course:
Ah, now I get your comment. Yes, that was my intention anyway, but I didn't spell it out.

# Topic group by

> I'm a bit concerned about the <Map key="${group-name}"> level. I can see that there may be use cases for it, but it makes the usage a bit inconvenient in most use cases, where only one grouping is used. Could we make it optional?
Yes, that is possible and I will do it like that then.

> _ asMainResult needs no _ prefix
That still stems from an older structure where I didn't have the ${group-name} yet; it will be dropped now.

And yes, we need the parameter, or rather it makes a lot of sense when using group by for the special use case of duplicate removal, where only the best result per distinct attribute value should survive, because then you are not really interested in the groups themselves.
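
To illustrate what duplicate removal via group by amounts to, here is a minimal Python sketch. It is not SMILA or Solr code: the record shape and the "score" field are assumptions made up for the example, and a real engine would do this inside the index rather than on a result list.

```python
from typing import Any


def best_per_group(records: list[dict[str, Any]], attribute: str) -> list[dict[str, Any]]:
    """Keep only the highest-scoring record per distinct value of `attribute`.

    Assumes every record carries a numeric "score" (hypothetical field name).
    Records that lack `attribute` all fall into one group keyed by None.
    """
    best: dict[Any, dict[str, Any]] = {}
    for rec in records:
        key = rec.get(attribute)
        if key not in best or rec["score"] > best[key]["score"]:
            best[key] = rec
    # Return the survivors ordered by ranking, like an ungrouped result list.
    return sorted(best.values(), key=lambda r: r["score"], reverse=True)


hits = [
    {"id": "a", "type": "tv", "score": 0.9},
    {"id": "b", "type": "tv", "score": 0.7},
    {"id": "c", "type": "radio", "score": 0.8},
]
deduped = best_per_group(hits, "type")  # keeps "a" and "c", drops "b"
```

The caller only sees a flat list again, which is why the groups themselves are of no interest in this use case.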

> I just think that JSON is more convenient for describing examples to human readers.
... if you are used to it. ATM I still read XML more fluently than JSON ;)


OK great. We will do the legwork in our Solr 3.5 impl (both the Solr-specific and the generic SMILA search record modifications) and then migrate this for the 1.1 version, when the CQs are through.

Thomas Menzel @ brox IT-Solutions GmbH


-----Original Message-----
From: smila-dev-bounces@xxxxxxxxxxx [mailto:smila-dev-bounces@xxxxxxxxxxx] On Behalf Of Jürgen Schumacher
Sent: Friday, 13 January 2012 10:07
To: Smila project developer mailing list
Subject: Re: [smila-dev] search record: group by vs. faceting

Hi,

Thomas wrote:
> > should not define two structures for very similar things, but rather 
> > try to create one structure that supports all 
> > "grouping/faceting/clustering" use cases
> As I said above and mentioned in my initial mail, faceting and grouping/clustering are two fundamentally different things...

Thanks, I got it (now ;-). It's ok to have both. Anyway, as far as parameters or result structures are similar we should use the same stuff to represent them. But that's OK now in your examples.

> As you can see in the examples I have extended the faceting to support 
> ranges and also the filtering of selected facet values.  One could 
> drive this even further. The question is: do we want to spec it (filtering) in that detail as a general convention or shall we leave this to impl. of integrated search technologies?

No I don't think we should specify too much. It would be just guessing. So rather: Keep it simple, let's focus on what we (you ;-) need today. If there is some really fancy feature next year that doesn't fit it, we can extend the specification then.

> Attached you will find some sample XMLs that spec both query and result side.

Looks ok to me, basically. I'm a bit concerned about the <Map key="${group-name}"> level. I can see that there may be use cases for it, but it makes the usage a bit inconvenient in most use cases, where only one grouping is used. Could we make it optional? So in most use cases this would be sufficient (and it's very similar to the faceting parameters)

<Val key="query">tv</Val>
<Seq key="groupby">
  <Map>
    <Val key="attribute">type</Val>
    <Val key="maxcount" type="long">10</Val>
    ...
  </Map>
  <Map>
    <Val key="attribute">size</Val>
    <Val key="maxcount" type="long">10</Val>
    ...
  </Map>
</Seq>

while in more sophisticated use cases your proposal could be used?
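
As a rough sketch of how a consumer could read the simplified form, the following Python snippet parses the example and returns one parameter dict per grouping. The enclosing <Map> root without a namespace, the function name read_groupby, and treating type="long" as an integer are assumptions for illustration only; the "..." placeholders are left out.

```python
import xml.etree.ElementTree as ET

# The simplified request from above, wrapped in a root <Map> (an assumption)
# so that it is a well-formed XML document.
REQUEST = """
<Map>
  <Val key="query">tv</Val>
  <Seq key="groupby">
    <Map>
      <Val key="attribute">type</Val>
      <Val key="maxcount" type="long">10</Val>
    </Map>
    <Map>
      <Val key="attribute">size</Val>
      <Val key="maxcount" type="long">10</Val>
    </Map>
  </Seq>
</Map>
"""


def read_groupby(xml_text: str) -> list[dict]:
    """Return one dict of parameters per <Map> in the "groupby" sequence."""
    root = ET.fromstring(xml_text.strip())
    groupings = []
    for group_map in root.findall("./Seq[@key='groupby']/Map"):
        params = {}
        for val in group_map.findall("Val"):
            text = val.text or ""
            # Values marked type="long" are converted to int, the rest stay strings.
            params[val.get("key")] = int(text) if val.get("type") == "long" else text
        groupings.append(params)
    return groupings


groupings = read_groupby(REQUEST)
# -> [{'attribute': 'type', 'maxcount': 10}, {'attribute': 'size', 'maxcount': 10}]
```

With only one grouping the sequence simply has a single entry, so no extra naming level is needed.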

On the faceting examples: I suppose you are more familiar with the possible options here, so I cannot discuss these in detail. Just one thing:

You write in facetby.xml:

    <!-- facet-name defines the key in facets result map. Internal use of this value depends on search
      technology but it is likely to correspond to an attribute name. More advanced faceting features
      might not though... -->
    <Val key="facet-name">type</Val>

And later:

    <Val key="facet-name">size-gap</Val>
    <Val key="attribute">size</Val>

- We don't use "-" in parameter names yet, I think, so this should rather be "facetName".
- As the default use case is "faceting using attributes", I think it would be nicer to represent this in the "normal" parameter structure. So "normally" you specify the attribute to use for faceting and the attribute name will be used as the facet-name, too, so the first example could be just

<Map>
  <Val key="attribute">type</Val>
  <Val key="type">enum</Val>
  ...

If you want to, you can still add a facetName:

<Map>
  <Val key="facetName">type-enum</Val>
  <Val key="attribute">type</Val>
  <Val key="type">enum</Val>
  ...

which then would be used in the result as the key of the sequence instead of the attribute name:

<Map key="facets">
  <Seq key="type-enum">
    ...
  </Seq>
  <Seq key="size-gap">     
    ...
  </Seq>
  ...
</Map>

This would make it possible to have different facetings for a single attribute. Of course the client needs to remember which faceting is based on which attribute, if the key is not the attribute name. But I suppose that's not a real problem (:

If the facet parameter does not contain an "attribute" because the faceting algorithm does not use a single attribute or whatever, the "facetName" would be required, of course. The faceting algorithm would rather be specified by the "type" parameter anyway, instead of the name, or did I get this wrong? 

On the other side: If we don't have a real need for the "facetName" parameter now, it should be left out. Let's keep it simple.
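
If we do keep "facetName", the naming rule described above could be sketched like this in Python. This is only an illustration of the convention, not an existing SMILA API; the function name facet_result_key is made up, and the facet parameters are modelled as plain dicts.

```python
def facet_result_key(facet: dict) -> str:
    """Return the key under which this facet's values appear in the "facets" result map."""
    if "facetName" in facet:
        # An explicit name always wins.
        return facet["facetName"]
    if "attribute" in facet:
        # Default case: faceting on an attribute, named after that attribute.
        return facet["attribute"]
    # No attribute and no name: there is nothing to derive the key from.
    raise ValueError('facet parameter needs "facetName" when it has no "attribute"')


facets = [
    {"attribute": "type", "type": "enum"},
    {"facetName": "type-enum", "attribute": "type", "type": "enum"},
]
keys = [facet_result_key(f) for f in facets]  # -> ["type", "type-enum"]
```

Both entries facet on the same attribute but end up under different keys in the result, which is the "different facetings for a single attribute" case.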

>> I assume that the $maxcount most relevant results would still be listed as "records" as in an "ungrouped" search additionally, at least optionally?
> Hm, not quite understanding your comment here. Do you want to have one 
> of the grouped results be returned redundantly in the normal results, i.e. a main group that is selected on its hit count?
> If no: plz explain further, especially what you mean by: the $maxcount 
> most relevant results

I just wanted to make sure that the result record can still contain the standard "records" list of ungrouped results - if the search engine can produce it, of course:

{
  "count": 1234,
  "records": [...], // first 10 results (or whatever "maxcount" was set to in the request) ordered by ranking
  "groups": [...] // grouping result
}
 
> Anyhow, I have provided the option "_asMainResult" to define the main group.

I'm not sure if this is really necessary, but if you need it, it's OK with me.

Btw, parameters in the search request record do not need "_" prefixes, as there should be no attribute names (as defined in the index schema) on the top level, but they are placed in a map under "query" (the query can be either written as a single query string (as in the "default search") or as a query record (as in the "advanced search")).

>> attribute values vs. keys & dynamic groups
> I will go with your proposal.

Fine (:

> Not relevant anymore now, but I'm wondering if we should have one 
> serialization format dictate the design...

No, it should not, of course, and it doesn't. I just think that JSON is more convenient for describing examples to human readers. It's equivalent to the XML representation in any case.


Cheers,
Juergen.
_______________________________________________
smila-dev mailing list
smila-dev@xxxxxxxxxxx
https://dev.eclipse.org/mailman/listinfo/smila-dev

