Re: [emf-dev] Streaming Object & List


From: Ed Merks <merks@xxxxxxxxxx>
Date: Wed, 28 Jun 2006 01:33:04 -0400
Delivered-to: emf-dev@xxxxxxxxxxx
List-archive: <http://eclipse.org/pipermail/emf-dev>
List-help: <mailto:emf-dev-request@eclipse.org?subject=help>
List-subscribe: <https://dev.eclipse.org/mailman/listinfo/emf-dev>, <mailto:emf-dev-request@eclipse.org?subject=subscribe>
List-unsubscribe: <https://dev.eclipse.org/mailman/listinfo/emf-dev>, <mailto:emf-dev-request@eclipse.org?subject=unsubscribe>

Yang,

My comments are below.

This kind of discussion would be more useful (and seen by more people) using the EMF newsgroup rather than the EMF mailing list.

Ed Merks/Toronto/IBM@IBMCA
mailto: merks@xxxxxxxxxx
905-413-3265 (t/l 969)

"Yang ZHONG" <leiwang.yangzhong@xxxxxxxxx>
Sent by: emf-dev-bounces@xxxxxxxxxxx

06/27/2006 09:17 PM

Please respond to
Eclipse Modelling Framework <emf-dev@xxxxxxxxxxx>

To	emf-dev@xxxxxxxxxxx
cc
Subject	[emf-dev] Streaming Object & List

XML has been extremely popular, and its memory binding (programming model) is hard to ignore. Current memory bindings, such as JavaBean, Service Data Objects and Eclipse Modeling Framework, have room to improve efficiency by streaming data.

XML isn't just an in-memory binding programming model, and streaming is a well-known solution that also has its own limitations. Many models involve cross references to objects that are somewhere far away in the stream, and in the worst case the processing will be most efficiently implemented with a full in-memory model.

Index

. Many people know DOM is much less efficient than SAX

There are lazy DOM implementations that parse the serialization as required to answer the API queries.
. Current memory bindings are as inefficient as DOM even if SAXed

Lazy instantiation schemes have their own inefficiencies and saying that all implementations will be just as inefficient as some simple hypothetical DOM implementation is a bold assertion.
. While SAX/DOM pushes, StAX pulls which offers an opportunity to load on demand

The EStore API supports on demand loading. So do proxies, and in EMF 2.2 there is support for containment proxies to allow this to be used anywhere.
. Loading on demand improves efficiency, a lot
.. ZERO cost scenario

I don't believe there is such a thing as zero cost. I.e., there is no free lunch in this universe.
... Execution Path
... Update only
... (Collection) Append only
.. Lower cost scenario
. Streaming Object
. Streaming List
. Modeling Frameworks
.. Modeling-neutral StreamReader
.. JavaBean
.. Service Data Objects
.. Eclipse Modeling Framework
. Implementation
.. Load on demand
.. StreamObject & StreamList injection
.. Code Generation (static object)
.. Dynamic object
.. Concurrent access

. Many people know DOM is much less efficient than SAX

Comparing DOM with SAX is comparing apples and oranges. Depending on your processing needs, one will better serve those needs than another and typically DOM uses SAX to populate the data.

The Document Object Model (org.w3c.dom) must be fully populated before it is available, which costs both time and space (memory). From that perspective it is much less efficient than the Simple API for XML (org.xml.sax).

There are lazy DOM implementations that build the DOM on demand from a cached byte stream.

. Current memory bindings are as inefficient as DOM even if SAXed

There isn't just one DOM implementation that provides a basis for making this sweeping statement and I don't understand what an even less efficient non-SAXed implementation would look like.

Current memory bindings such as JavaBean, SDO and EMF are also fully populated before they are available, even if SAX or StAX is used to populate data from XML into the in-memory data structure. From that perspective they are as inefficient as DOM, because of the cost in both time and space (memory).

People have used and are using EMF to build models on demand by parsing the bytes lazily. Containment proxies in EMF 2.2 support this approach in general, and the EStore API also supports this.

. While SAX/DOM PUSHes, StAX PULLs which offers an opportunity to load on demand

There's no reason an API can't support on demand fetching of data. Certainly CDO and Teneo (EMFT projects) back the data in a database and rely on this on demand capability.

The Streaming API for XML (javax.xml.stream) works in the opposite direction from SAX/DOM in terms of what drives the processing. A SAX/DOM parser drives the processing and PUSHes data from XML to handlers or directly into an in-memory data structure, whereas StAX processing is driven by demand, and demand PULLs data out of the XML. It offers an opportunity to load on demand if the memory bindings themselves drive the StAX processing.

If each pull involves a disk IO access, it may end up being far less efficient if ultimately all the data is pulled anyway. I.e., different problems have different optimal solutions.
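
As an illustration of the pull model discussed above, here is a minimal sketch using javax.xml.stream. The class name PullOnDemand and the method firstValue are hypothetical and not part of the proposal; the sketch only shows that the consumer decides when to stop pulling, so content after the demanded element is never tokenized.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper, for illustration only.
class PullOnDemand {
    /** Pulls events until the first element named localName, then stops reading. */
    static String firstValue(String xml, String localName) {
        try {
            XMLStreamReader reader =
                XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
            try {
                while (reader.hasNext()) {
                    if (reader.next() == XMLStreamConstants.START_ELEMENT
                            && localName.equals(reader.getLocalName())) {
                        // Reads up to the matching end tag; nothing after it is pulled.
                        return reader.getElementText();
                    }
                }
                return null; // element not present
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Whether stopping early pays off depends on where the demanded data sits in the stream and on how much of the rest is eventually pulled anyway.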

. Loading on demand improves efficiency, a lot

I typically don't believe any performance assertion that's not backed by a complete implementation whose completeness can be verified.

.. ZERO cost scenario

I simply cannot buy zero cost. There needs to be a cache somewhere, setting up the link to that cache will not be free, and accessing the cache will not be free.

... Execution Path

execute (Order order,Product fromUpStream)
{
if( order.paid() )
{
fromUpStream.get...
fromUpStream.set...
toDownStream( fromUpStream);
}
else
toDownStream( fromUpStream); /* "fromUpStream" does NOT need to be read and parsed at all,
the data can be DIRECTLY PIPED to down stream,
NEITHER time NOR space(memory) cost at all */
}

What if the thing that needs updating is upstream from the data that will be used to update it?

... Update only

<complexType name="Product">
<sequence>
<element name="Property1" type="int"/>
<element name="Property2" type="float" maxOccurs="unbounded"/>
...
<element name="Property100" type="date"/>
</sequence>
</complexType>
Given that definition and this instance:
<Product>
<Property1>1</Property1>
<Property2>2.1</Property2>
...
<Property2>2.2000000</Property2>
</Product>
and this code:
execute (Product fromUpStream)
{
fromUpStream.setProperty100( "2006-06-25");
toDownStream( fromUpStream);
}
"fromUpStream" does NOT need to be read and parsed at all, the data can be PIPED to down stream with "<Property100>2006-06-25</Property100>" inserted, NEITHER time NOR space(memory) cost at all.

It's always possible to construct a scenario that will be optimal for a given design. For example, if all I need to do is return a value stored as an attribute on the root element, clearly I don't need to read beyond that root element and building a full in memory representation will be far from optimal.

A more interesting scenario is, given above same instance and this code:
execute (Product fromUpStream)
{
fromUpStream.setProperty1( 3); // Property1 is declared as xsd:int
toDownStream( fromUpStream);
}
"fromUpStream" does NOT need to be read and parsed at all, the data can be PIPED to down stream with "1" ignored and replaced with "3", NEITHER time NOR space(memory) cost at all.

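The update-only scenarios above can be approximated with the StAX event API. This is a sketch under stated assumptions, not the proposed StreamObject mechanism: the class PipeWithUpdate and the method replaceText are hypothetical, and the target element is assumed to hold simple text content only. Note that the input is still tokenized event by event; what is avoided is building an in-memory object model for the data that is merely passed through.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.stream.XMLEventFactory;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

// Hypothetical helper, for illustration only.
class PipeWithUpdate {
    /** Copies xml to the result, replacing the text of every element named localName. */
    static String replaceText(String xml, String localName, String newText) {
        try {
            XMLEventReader in =
                XMLInputFactory.newInstance().createXMLEventReader(new StringReader(xml));
            StringWriter out = new StringWriter();
            XMLEventWriter writer = XMLOutputFactory.newInstance().createXMLEventWriter(out);
            XMLEventFactory events = XMLEventFactory.newInstance();
            boolean inside = false; // true while between the target's start and end tags
            while (in.hasNext()) {
                XMLEvent event = in.nextEvent();
                if (event.isStartElement()
                        && localName.equals(event.asStartElement().getName().getLocalPart())) {
                    inside = true;
                    writer.add(event);
                    writer.add(events.createCharacters(newText)); // the updated value
                } else if (inside && event.isEndElement()) {
                    inside = false;
                    writer.add(event);
                } else if (!inside) {
                    writer.add(event); // everything else is piped through unchanged
                }
                // Events inside the target element (its old text) are dropped.
            }
            writer.close();
            return out.toString();
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
    }
}
```
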
... (Collection) Append only

Given above definition and this instance:
<Product>
<Property2>2.1</Property2>
...
<Property2>2.2000000</Property2>
</Product>
and this code:
execute (Product fromUpStream)
{
fromUpStream.getProperty2().add( 2.2000001f);
toDownStream( fromUpStream);
}
"fromUpStream" does NOT need to be read and parsed at all, the data can be PIPED to down stream with "<Property2>2.2000001</Property2>" inserted, NEITHER time NOR space(memory) cost at all.

.. Lower cost scenario

Many people know XML is string (human readable) based, while memory binding is binary.

Personally I don't agree that XML is human readable. It's only just barely human readable and is full of baggage that mostly benefits the machine reader. Human readable languages designed as such aren't as obtuse as XML.

The binding has TWO stages:
1. READ literal string out of XML
2. PARSE the literal string to binary
Parsing always costs some time, and sometimes space (memory) as well, depending on the complexity of the type and the algorithm.

Given above definition and this instance:
<Product>
<Property1>3</Property1>
<Property2>2.0</Property2>
<Property2>2.1</Property2>
...
<Property2>2.2000000</Property2>
<Property100>2006-06-25</Property100>
</Product>
and this code:
execute (Product fromUpStream)
{
fromUpStream.getProperty2().get( 1);
fromUpStream.getProperty1();
toDownStream( fromUpStream);
}
Since Property2[1] is demanded, the XML instance can be read through "<Property2>2.1</Property2>" and the literal string ("2.1") can be parsed into memory before returning the binary (float). The literal string ("2.1") itself can also be weakly cached to speed up XML exporting if Property2[1] is not changed afterwards.

The rest of "fromUpStream" does NOT need to be read and parsed at all; it can be PIPED to down stream, so both time and space (memory) are spared, sometimes a lot.

Since the XML processing is streaming rather than random access, the data ahead of Property2[1] is read and its literal strings can be stored. Parsing is NOT required right away, however: the parsing time, and any parsing space (memory), is spared if the value is NEVER demanded.
Later on, whenever Property1 or Property2[0] is demanded, the stored literal string can be parsed into memory before returning the binary. The literal string storage can then become a weak cache to speed up XML exporting as long as the property is not changed. Any further change to the property invalidates the weak cache, which releases the space (memory) proactively.
The cached literal strings can save some XML exporting time without sacrificing space (memory), since the references are weak (Java). The stored literal strings of properties whose values are never demanded can also save some XML exporting time; whether they gain or lose space (memory) depends on the case, since some binaries are smaller than their literal representation while others are larger.

As I said, there are definitely scenarios that are highly amenable to streaming, but not every scenario is so amenable.

Property accesses include "isSet" and "unset", besides "get" and "set". While "get" demands reading and parsing, "isSet" only needs reading and can defer parsing which may never be demanded.

Yes. So a good lazy parsing implementation could exploit that.
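
The read/parse split and the weak literal cache described above can be sketched as a small holder class. The name LazyFloat and its members are hypothetical, chosen for illustration only; parseCount exists just to make the deferred parsing observable.

```java
import java.lang.ref.WeakReference;

// Hypothetical holder for one float-typed property value, for illustration only.
class LazyFloat {
    private String literal;               // strongly held until the first parse
    private WeakReference<String> cached; // weak literal cache after parsing
    private float value;
    private boolean parsed;
    int parseCount;                       // instrumentation for the example

    LazyFloat(String literal) { this.literal = literal; }

    /** Needs only the READ stage: no parsing happens here. */
    boolean isSet() { return parsed || literal != null; }

    /** Triggers the PARSE stage on first demand only. */
    float get() {
        if (!parsed) {
            if (literal == null) throw new IllegalStateException("value is not set");
            value = Float.parseFloat(literal.trim());
            parseCount++;
            cached = new WeakReference<>(literal); // keep the literal only weakly
            literal = null;
            parsed = true;
        }
        return value;
    }

    /** A change invalidates the cached literal. */
    void set(float newValue) {
        value = newValue;
        parsed = true;
        literal = null;
        cached = null;
    }

    /** Export reuses the literal when it is still available, else re-serializes. */
    String toXmlText() {
        if (!parsed) return literal;
        String text = cached == null ? null : cached.get();
        return text != null ? text : Float.toString(value);
    }
}
```

An unparsed value exports its original literal untouched, so a value that is never demanded is never parsed.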

. Streaming Object

Loading on demand is driven by the memory binding. However, streaming reads may reach other data before the demanded data, so the stream reader (StreamReader) needs to notify the binding of the literal strings it has reached but which have not been demanded yet.
Here's the protocol which can be used to communicate:
interface StreamObject<Type,Property,C>
{
Object get (int propertyID); // StreamList
Type getType();
List<Property> getInstanceProperties();
C getContainer(); // StreamObject<Type,Property,?>

void set (StreamReader<Type,Property> reader);
StreamObject<Type,Property,?> createUnlessRead (int propertyID,QName typeXSI,Type type);
void setUnlessRead (int propertyID,String stringPropertyValue);
void setLiteralValue (int propertyID,QName typeXSI,String value);
Object parseLiteralValue (int propertyID,QName typeXSI,String value,Type type);
}

. Streaming List

I'm not sure I understand. Perhaps an executable prototype would be more convincing about functional completeness and would support measurements to back up the assertions.

Loading on demand is driven by StreamObject. However, StreamReader may reach other values of a maxOccurs>1 property before the demanded one, so the StreamReader needs to notify the list of the literal strings it has reached but which have not been demanded yet.
Here's the protocol which can be used to communicate:
interface StreamList<Type>
{
void addStreamValue (Object value);
void addLiteralValue (QName typeXSI,String value);
Object parseLiteralValue (QName typeXSI,String value,Type type);
}
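
One possible reading of the StreamList protocol above is sketched here. The interface is repeated so the example is self-contained; the class LazyFloatList, its Literal holder, the get and size methods, and parseCount are assumptions made for illustration and are not part of the posted prototype.

```java
import java.util.ArrayList;
import java.util.List;
import javax.xml.namespace.QName;

interface StreamList<Type> {
    void addStreamValue(Object value);
    void addLiteralValue(QName typeXSI, String value);
    Object parseLiteralValue(QName typeXSI, String value, Type type);
}

// Hypothetical implementation for a list of xsd:float values.
class LazyFloatList implements StreamList<Class<?>> {
    /** A value that has been read from the stream but not parsed yet. */
    private static final class Literal {
        final QName typeXSI;
        final String text;
        Literal(QName typeXSI, String text) { this.typeXSI = typeXSI; this.text = text; }
    }

    private final List<Object> slots = new ArrayList<>(); // Float or Literal
    int parseCount;                                        // instrumentation for the example

    public void addStreamValue(Object value) { slots.add(value); }

    public void addLiteralValue(QName typeXSI, String value) {
        slots.add(new Literal(typeXSI, value)); // store now, parse only on demand
    }

    public Object parseLiteralValue(QName typeXSI, String value, Class<?> type) {
        parseCount++;
        return Float.valueOf(value.trim()); // this sketch ignores typeXSI and type
    }

    /** Parses the slot at index on first demand and replaces the literal. */
    Float get(int index) {
        Object slot = slots.get(index);
        if (slot instanceof Literal) {
            Literal literal = (Literal) slot;
            slot = parseLiteralValue(literal.typeXSI, literal.text, Float.class);
            slots.set(index, slot);
        }
        return (Float) slot;
    }

    int size() { return slots.size(); }
}
```

With this shape, a reader that passes Property2[0] on the way to Property2[1] only stores the first literal; it is parsed later, and only if it is ever demanded.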

. Modeling Frameworks

.. Modeling-neutral StreamReader

There are many modeling frameworks. For StreamReader to support as many of them as possible, here's a modeling framework adapter protocol:
interface ModelingFramework<Type,Property>
{
Type type (Property property);
boolean many (Property property);
Collection getAliasNames (Property property);
Class getInstanceClass (Type type);
List<Property> properties (Type type);

Property element (String space,String name);
Object getNameSpace (Property property);
Object getLocalName (Property property);
enum PropertyKind
{
ELEMENT,
ATTRIBUTE,
OTHER
}
PropertyKind kind (Property property);

int property (Type type,List<Property> properties,String space,String name,boolean element);

StreamObject<Type,Property,?> create (String space,String name);
StreamObject<Type,Property,?> create (Type type);
}

.. JavaBean

class JavaBeans implements ModelingFramework<Class,PropertyDescriptor>
{
public final/*many*/ Class type (PropertyDescriptor property)
{
return property.getPropertyType ();
}
public boolean many (PropertyDescriptor property)
{
return List.class.isAssignableFrom( type( property));
}
public Collection getAliasNames (PropertyDescriptor property)
{//TODO cache
return Collections.singleton( property.getName());
}
public Class getInstanceClass (Class type)
{
return type;
}
public List<PropertyDescriptor> properties (Class type)
{//TODO cache
try
{
return Arrays.asList( Introspector.getBeanInfo( type).getPropertyDescriptors());
}
catch(IntrospectionException e)
{
return Collections.emptyList(); // introspection failed: report no properties
}
}
public StreamObject<Class,PropertyDescriptor,?> create (Class type)
{
try
{
return (StreamObject<Class,PropertyDescriptor,?>)type.newInstance();
}
catch(Exception e)
{}
return null;
}
}

.. Service Data Objects

class SDO implements ModelingFramework<Type,Property>
{
public Type type (Property property)
{
return property.getType();
}
public boolean many (Property property)
{
return property.isMany();
}
public Collection getAliasNames (Property property)
{
return property.getAliasNames();
}
public Class getInstanceClass (Type type)
{
return type.getInstanceClass();
}
public List<Property> properties (Type type)
{
return type.getProperties();
}
public Property element (String space,String name)
{
return XSDHelper.INSTANCE.getGlobalProperty( space, name, true);
}
public final Object getNameSpace (Property property)
{
return XSDHelper.INSTANCE.getNamespaceURI ( property);
}
public final Object getLocalName (Property property)
{
return XSDHelper.INSTANCE.getLocalName( property);
}
public PropertyKind kind (Property property)
{
return XSDHelper.INSTANCE.isElement( property)
? PropertyKind.ELEMENT
: XSDHelper.INSTANCE.isAttribute( property)
? PropertyKind.ATTRIBUTE
: PropertyKind.OTHER;
}
public StreamObject<Type,Property,?> create (String space,String name)
{
return (StreamObject<Type,Property,?>)DataFactory.INSTANCE.create( space, name);
}
public StreamObject<Type,Property,?> create (Type type)
{
return (StreamObject<Type,Property,?>)DataFactory.INSTANCE.create( type);
}
}

.. Eclipse Modeling Framework

class EMF implements ModelingFramework<EClassifier,EStructuralFeature>
{
public EClassifier type (EStructuralFeature property)
{
return property.getEType();
}
public boolean many (EStructuralFeature property)
{
return property.isMany();
}
public Collection getAliasNames (EStructuralFeature property)
{//TODO cache
return Collections.singleton( property.getName());
}
public Class getInstanceClass (EClassifier type)
{
return type.getInstanceClass();
}
public List<EStructuralFeature> properties (EClassifier type)
{
return ((EClass)type).getEAllStructuralFeatures();
}
public EStructuralFeature element (String space,String name)
{
return ExtendedMetaData.INSTANCE.getElement ( space, name);
}
public final Object getNameSpace (EStructuralFeature property)
{
return ExtendedMetaData.INSTANCE.getNamespace( property);
}
public final Object getLocalName (EStructuralFeature property)
{
return ExtendedMetaData.INSTANCE.getName( property);
}
public PropertyKind kind (EStructuralFeature property)
{
switch( ExtendedMetaData.INSTANCE.getFeatureKind( property) )
{
case ExtendedMetaData.ELEMENT_FEATURE:
return PropertyKind.ELEMENT;
case ExtendedMetaData.ATTRIBUTE_FEATURE:
return PropertyKind.ATTRIBUTE;
}
return PropertyKind.OTHER;
}
public int property (EClassifier type,List<EStructuralFeature> properties,String space,String name,boolean element)
{
final EStructuralFeature property = element
? ExtendedMetaData.INSTANCE.getElement( (EClass)type, space, name)
: ExtendedMetaData.INSTANCE.getAttribute( (EClass)type, space, name);
return null == property
? -1
: property.getFeatureID();
}
public StreamObject<EClassifier,EStructuralFeature,?> create (String space,String name)
{
return (StreamObject<EClassifier,EStructuralFeature,?>)PackageFactory.create( space, name);
}
public StreamObject<EClassifier,EStructuralFeature,?> create (EClassifier type)
{
return (StreamObject<EClassifier,EStructuralFeature,?>)EcoreUtil.create( (EClass)type);
}
}

. Implementation

.. Load on demand

class ProductImpl implements StreamObject
{
public void set (StreamReader stream)
{
reader = stream;
}
protected StreamReader reader/* = null*/;

public int getProperty1()
{
if( Property1_not_read() )
return Property1 = reader.loadPropertyValue( this, Property1_ID);
if( WeakReference.class == literalValue.getClass () ) // parsed
return Property1;
Property1 = parse( literalValue);
literalValue = new WeakReference( literalValue);
return Property1;
}
protected int Property1;
protected Object literalValue;
}

.. StreamObject & StreamList injection

For existing code, if changing it to support StreamObject & StreamList isn't desired, injection may be utilized.

.. Code Generation (static object)

Code can be regenerated, or new code can be generated, to support StreamObject & StreamList.

.. Dynamic object

Memory bindings such as Service Data Objects and Eclipse Modeling Framework enable dynamic objects besides the static ones (CodeGen). Their implementations can be extended to support StreamObject & StreamList.

.. Concurrent access

Since StreamObject & StreamList load on demand, synchronization may be necessary for concurrent accesses. Multiple objects may also be loading from one stream, so the synchronization may need to take the single shared stream into account.
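
A minimal sketch of that synchronization concern follows, with a deliberately simplified stream. All names here (SharedStream, LazyValue, ConcurrentDemand) are hypothetical. The sketch shows only two things: each lazy value guards its own load-once state, and the shared stream guards its read position. It does not model delivering earlier data to other objects when one object reads ahead.

```java
import java.util.Arrays;
import java.util.Iterator;

// Stand-in for a single stream shared by several lazily loaded objects.
class SharedStream {
    private final Iterator<String> literals;
    private int reads;

    SharedStream(String... values) { literals = Arrays.asList(values).iterator(); }

    synchronized String next() { reads++; return literals.next(); }

    synchronized int reads() { return reads; }
}

// One lazily loaded value; the lock order is always value first, then stream.
class LazyValue {
    private final SharedStream stream;
    private String literal;

    LazyValue(SharedStream stream) { this.stream = stream; }

    synchronized String get() {
        if (literal == null) {
            literal = stream.next(); // load once, even under concurrent demand
        }
        return literal;
    }
}

class ConcurrentDemand {
    /** Demands one value from several threads and reports how often the stream was read. */
    static int readsAfterConcurrentGets(int threadCount) {
        SharedStream stream = new SharedStream("2.1", "2.2");
        LazyValue value = new LazyValue(stream);
        Thread[] threads = new Thread[threadCount];
        for (int i = 0; i < threadCount; i++) {
            threads[i] = new Thread(value::get);
            threads[i].start();
        }
        for (Thread thread : threads) {
            try {
                thread.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return stream.reads();
    }
}
```

Without the synchronized get, two threads could both see the value as unloaded and advance the shared stream twice, which would hand the second literal to the wrong demand.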

Does this extend to support data that exists in multiple separate streams, or data backed by a database or other non-XML sources?

You're more than welcome to comment.
And if you happen to find it interesting, I can also post/wiki the prototype.
Help will be appreciated very much, especially in areas such as code injection, a JavaBean ModelingFramework implementation conforming to JAXB, and test cases demonstrating the performance gain from loading on demand.

EMF is not intended to support JAXB. It's intended to be much more general, to cover non-XML data and to deal with multiple resource models, since most data doesn't come from just a single XML file. The discussion is interesting, but I'm confused about how this generalizes and where you expect this discussion to go. It seems you are trying to define yet another programming model; one that's particularly (or perhaps only) suited to streamed XML processing.

--

Yang ZHONG
[attachment "StreamingObject&List.HTML" deleted by Ed Merks/Toronto/IBM]
_______________________________________________
emf-dev mailing list
emf-dev@xxxxxxxxxxx
https://dev.eclipse.org/mailman/listinfo/emf-dev

References:
- [emf-dev] Streaming Object & List
  - From: Yang ZHONG
