Re: [emf-dev] Streaming Object & List
Yang,
My <em>comments</em> are
below.
This kind of discussion would be more
useful (and seen by more people) using the EMF newsgroup rather than the
EMF mailing list.
Ed Merks/Toronto/IBM@IBMCA
mailto: merks@xxxxxxxxxx
905-413-3265 (t/l 969)
"Yang ZHONG"
<leiwang.yangzhong@xxxxxxxxx>
Sent by: emf-dev-bounces@xxxxxxxxxxx
06/27/2006 09:17 PM
Please respond to: Eclipse Modelling Framework <emf-dev@xxxxxxxxxxx>
To: emf-dev@xxxxxxxxxxx
cc:
Subject: [emf-dev] Streaming Object & List
XML is extremely popular, so its memory bindings (programming models)
are hard to ignore. Current memory bindings such as JavaBean, Service
Data Objects and Eclipse Modeling Framework have room to improve efficiency
by streaming data.
<em>XML isn't just an in-memory binding programming
model, and streaming is a well-known solution that also has its own limitations.
Many models involve cross references to objects that are somewhere
far away in the stream, and in the worst-case scenario the processing will
be most efficiently implemented with a full in-memory model.</em>
Index
. Many people know DOM is much less efficient than SAX
<em>There are lazy DOM implementations that parse
the serialization as required to answer the API queries.</em>
. Current memory bindings are as inefficient as DOM even if SAXed
<em>Lazy instantiation schemes have their own inefficiencies
and saying that all implementations will be just as inefficient as some
simple hypothetical DOM implementation is a bold assertion.</em>
. While SAX/DOM pushes, StAX pulls which offers an opportunity to load
on demand
<em>The EStore API supports on demand loading. So
do proxies, and in EMF 2.2 there is support for containment proxies to
allow this to be used anywhere.</em>
. Loading on demand improves efficiency, a lot
.. ZERO cost scenario
<em>I don't believe there is such a thing as zero
cost. I.e., there is no free lunch in this universe.</em>
... Execution Path
... Update only
... (Collection) Append only
.. Lower cost scenario
. Streaming Object
. Streaming List
. Modeling Frameworks
.. Modeling-neutral StreamReader
.. JavaBean
.. Service Data Objects
.. Eclipse Modeling Framework
. Implementation
.. Load on demand
.. StreamObject & StreamList injection
.. Code Generation (static object)
.. Dynamic object
.. Concurrent access
. Many people know DOM is much less efficient than SAX
<em>Comparing DOM with SAX is comparing apples and
oranges. Depending on your processing needs, one will better serve
those needs than another and typically DOM uses SAX to populate the data.</em>
The Document Object Model (org.w3c.dom) is fully populated
before it becomes available, which costs both time and space (memory). From that
perspective it is much less efficient than the Simple API for XML (org.xml.sax).
<em>There are lazy DOM implementations that build
the DOM on demand from a cached byte stream.</em>
. Current memory bindings are as inefficient as DOM even
if SAXed
<em>There isn't just one DOM implementation that
provides a basis for making this sweeping statement and I don't understand
what an even less efficient non-SAXed implementation would look like.</em>
Current memory bindings such as JavaBean, SDO and EMF
are also fully populated before they become available, even if SAX or even StAX
is used to populate data from XML into the in-memory data structure. From that
perspective they are as inefficient as DOM, because of the cost in both time and
space (memory).
<em>People have used and are using EMF to build models
on demand by parsing the bytes lazily. Containment proxies in EMF 2.2 support
this approach in general, and the EStore API also supports this.</em>
. While SAX/DOM PUSHes, StAX PULLs which offers an opportunity
to load on demand
<em>There's no reason an API can't support on demand
fetching of data. Certainly CDO and Teneo (EMFT projects) back the
data in a database and rely on this on demand capability.</em>
The Streaming API for XML (javax.xml.stream) works in the
opposite direction from SAX/DOM in terms of what drives the processing. With SAX/DOM,
the parser drives the processing and PUSHes data from XML to handlers or directly
into an in-memory data structure; StAX processing is driven by demand, and that demand
PULLs data out of the XML. This offers an opportunity to load on demand if the memory
bindings themselves drive the StAX processing.
<em>If each pull involves a disk IO access, it may
end up being far less efficient if ultimately all the data is pulled anyway.
I.e., different problems have different optimal solutions.</em>
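A minimal sketch of the pull style described above, using the standard javax.xml.stream API. The class name `PullOnDemand` and the event counter are hypothetical and exist only for illustration; the sketch stops pulling as soon as the demanded element has been read, so later siblings are never tokenized by this code.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

/** Hypothetical helper: pulls StAX events only until the demanded element is found. */
final class PullOnDemand {
    /** Number of events pulled by the last call (illustration only, not thread-safe). */
    static int eventsPulled;

    /** Returns the text of the first element with the given local name, or null if absent. */
    static String firstText(String xml, String localName) {
        eventsPulled = 0;
        try {
            XMLStreamReader reader =
                XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
            try {
                while (reader.hasNext()) {
                    int event = reader.next();
                    eventsPulled++;
                    if (event == XMLStreamConstants.START_ELEMENT
                            && localName.equals(reader.getLocalName())) {
                        // Stop here: nothing after this element is pulled.
                        return reader.getElementText();
                    }
                }
                return null;
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            throw new IllegalStateException("malformed XML", e);
        }
    }
}
```

Whether this saves anything in practice depends on how the underlying reader buffers its input, which is the caveat raised in the reply above.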
. Loading on demand improves efficiency, a lot
<em>I typically don't believe any performance assertion
that's not backed by a complete implementation whose completeness can be
verified.</em>
.. ZERO cost scenario
<em>I simply cannot buy zero cost. There needs
to be a cache somewhere, setting up the link to that cache will not be
free, and accessing the cache will not be free.</em>
... Execution Path
execute (Order order,Product fromUpStream)
{
    if( order.paid() )
    {
        fromUpStream.get...
        fromUpStream.set...
        toDownStream( fromUpStream);
    }
    else
        toDownStream( fromUpStream); /* "fromUpStream" does NOT need to be read and parsed at all,
                                        the data can be DIRECTLY PIPED to down stream,
                                        NEITHER time NOR space(memory) cost at all */
}
<em>What if the thing that needs updating is upstream
from the data that will be used to update it?</em>
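A minimal sketch of the pass-through path above, assuming the upstream data has already been buffered as bytes. `LazyPayload` and `ExecutionPath` are hypothetical names, decoding to a String stands in for real parsing, and `trim()` stands in for the get/set work of the paid branch. The buffer itself is still a cost, in line with the point that a cache has to live somewhere.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

/** Hypothetical wrapper: keeps the raw bytes and decodes them only on first access. */
final class LazyPayload {
    private final byte[] raw;   // buffered upstream bytes
    private String decoded;     // stands in for a parsed object graph
    private boolean modified;
    int decodeCount;            // illustration only

    LazyPayload(byte[] raw) {
        this.raw = raw;
    }

    String get() {
        if (decoded == null) {
            decoded = new String(raw, StandardCharsets.UTF_8);
            decodeCount++;
        }
        return decoded;
    }

    void set(String value) {
        decoded = value;
        modified = true;
    }

    /** Writes the untouched raw bytes unless the payload was modified. */
    void writeTo(OutputStream out) {
        try {
            out.write(modified ? decoded.getBytes(StandardCharsets.UTF_8) : raw);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}

final class ExecutionPath {
    static void execute(boolean paid, LazyPayload fromUpStream, OutputStream toDownStream) {
        if (paid) {
            // Only this branch touches the content, so only this branch decodes it.
            fromUpStream.set(fromUpStream.get().trim());
        }
        fromUpStream.writeTo(toDownStream);
    }
}
```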
... Update only
<complexType name="Product">
  <sequence>
    <element name="Property1" type="int"/>
    <element name="Property2" type="float" maxOccurs="unbounded"/>
    ...
    <element name="Property100" type="date"/>
  </sequence>
</complexType>
Given that definition and this instance:
<Product>
  <Property1>1</Property1>
  <Property2>2.1</Property2>
  ...
  <Property2>2.2000000</Property2>
</Product>
and this code:
execute (Product fromUpStream)
{
    fromUpStream.setProperty100( "2006-06-25");
    toDownStream( fromUpStream);
}
"fromUpStream" does NOT need to be read and parsed at all, the
data can be PIPED to down stream with "<Property100>2006-06-25</Property100>"
inserted, NEITHER time NOR space(memory) cost at all.
<em>It's always possible to construct a scenario
that will be optimal for a given design. For example, if all I need to
do is return a value stored as an attribute on the root element, clearly
I don't need to read beyond that root element and building a full in memory
representation will be far from optimal.</em>
A more interesting scenario is, given above same instance and this code:
execute (Product fromUpStream)
{
    fromUpStream.setProperty1( "3");
    toDownStream( fromUpStream);
}
"fromUpStream" does NOT need to be read and parsed at all, the
data can be PIPED to down stream with "1" ignored and replaced
with "3", NEITHER time NOR space(memory) cost at all.
... (Collection) Append only
Given above definition and this instance:
<Product>
  <Property2>2.1</Property2>
  ...
  <Property2>2.2000000</Property2>
</Product>
and this code:
execute (Product fromUpStream)
{
    fromUpStream.getProperty2().add( 2.2000001);
    toDownStream( fromUpStream);
}
"fromUpStream" does NOT need to be read and parsed at all, the
data can be PIPED to down stream with "<Property2> 2.2000001</Property2>"
inserted, NEITHER time NOR space(memory) cost at all.
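A minimal sketch of the append-only case, assuming the instance is available as a string and its closing tag is spelled literally as given. `AppendOnly` is a hypothetical name. The existing content is copied verbatim and never parsed, but it is still scanned for the closing tag, so the cost is low rather than strictly zero. The sketch also assumes no later sibling (such as Property100) has to follow the appended element.

```java
/** Hypothetical helper: splices a new element in front of the final closing tag. */
final class AppendOnly {
    static String append(String xml, String closingTag, String newElement) {
        int at = xml.lastIndexOf(closingTag);
        if (at < 0) {
            throw new IllegalArgumentException("closing tag not found: " + closingTag);
        }
        // Everything before the closing tag is passed through unchanged.
        return xml.substring(0, at) + newElement + xml.substring(at);
    }
}
```

A truly streaming variant would have to hold back the tail of the stream until the closing tag has been seen.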
.. Lower cost scenario
Many people know XML is string (human readable) based,
while memory binding is binary.
<em>Personally I don't agree that XML is human readable.
It's only just barely human readable and is full of baggage that
mostly benefits the machine reader. Human readable languages designed
as such aren't as obtuse as XML.</em>
The binding has TWO stages:
1. READ literal string out of XML
2. PARSE the literal string to binary
Parsing always costs some time, and sometimes space (memory) as well, depending
on the complexity and the algorithm.
Given above definition and this instance:
<Product>
  <Property1>3</Property1>
  <Property2>2.0</Property2>
  <Property2>2.1</Property2>
  ...
  <Property2>2.2000000</Property2>
  <Property100>2006-06-25</Property100>
</Product>
and this code:
execute (Product fromUpStream)
{
    fromUpStream.getProperty2 ().get( 1);
    fromUpStream.getProperty1();
    toDownStream( fromUpStream);
}
Since Property2[1] is demanded, the XML instance can be read through "<Property2>2.1</Property2>"
and the literal string (" 2.1") can be parsed into memory before
returning the binary(float). The literal string (" 2.1") itself
can also be weakly cached to speed up XML exporting if no more change to
Property2[1].
The rest of "fromUpStream" does NOT need to be
read and parsed at all; it can be PIPED to down stream, so both time and
space (memory) are spared, sometimes a lot.
Since the XML processing is streaming rather than random
access, the data ahead of Property2[1] is read and its literal strings
can be stored; however, parsing is NOT required right away, so parsing time
(and space, if any) is spared if the values are NEVER demanded.
Later on, whenever Property1 or Property2[0] is demanded, the stored
literal string can be parsed into memory before returning the binary value.
The literal string storage can then become a weak cache to speed up XML
exporting if the property does not change again. Any further change to the
property can invalidate the weak cache to release space (memory) proactively.
The cached literal strings can save some XML export time without
sacrificing space (memory), since the references are weak (Java). The stored literal
strings (of properties whose values are never demanded) can also save
some export time; whether they gain or lose space (memory) is case by
case, since some binary values are smaller than their literal representation while
others are larger.
<em>As I said, there are definitely scenarios that
are highly amenable to streaming, but not every scenario is so amenable.</em>
Property accesses include "isSet" and "unset",
besides "get" and "set". While "get" demands
reading and parsing, "isSet" only needs reading and can defer
parsing which may never be demanded.
<em>Yes. So a good lazy parsing implementation
could exploit that.</em>
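A minimal sketch of the two-stage binding discussed above for a single float property, assuming the literal string has already been read from the stream. `LazyFloatProperty` and all of its members are hypothetical. "isSet" looks only at whether a literal or a value exists, "get" parses on first demand and then keeps the literal behind a weak reference for export, and "set" invalidates that cache.

```java
import java.lang.ref.WeakReference;

/** Hypothetical lazily parsed property: READ is done up front, PARSE waits for a get. */
final class LazyFloatProperty {
    private String literal;               // strong reference until the first parse
    private WeakReference<String> cached; // weak literal kept only to speed up export
    private float value;
    private boolean parsed;
    int parseCount;                       // illustration only

    void setLiteral(String text) {
        literal = text;
        cached = null;
        parsed = false;
    }

    boolean isSet() {
        return parsed || literal != null; // never triggers parsing
    }

    float get() {
        if (!parsed) {
            if (literal == null) {
                throw new IllegalStateException("property is not set");
            }
            value = Float.parseFloat(literal.trim());
            parseCount++;
            parsed = true;
            cached = new WeakReference<>(literal);
            literal = null;
        }
        return value;
    }

    void set(float newValue) {
        value = newValue;
        parsed = true;
        literal = null;
        cached = null;                    // the old literal no longer matches
    }

    /** Returns text for export, reusing the original literal when it is still available. */
    String export() {
        if (!parsed) {
            if (literal == null) {
                throw new IllegalStateException("property is not set");
            }
            return literal;               // never demanded, so never parsed
        }
        String text = cached == null ? null : cached.get();
        return text != null ? text : Float.toString(value);
    }
}
```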
. Streaming Object
Loading on demand is driven by the memory binding. However,
the stream reader may reach other data before the demanded item, so the
StreamReader needs to notify the binding of the literal strings it has reached
that are not demanded yet.
Here's the protocol which can be used to communicate:
interface StreamObject<Type,Property,C>
{
    Object get (int propertyID); // StreamList
    Type getType();
    List<Property> getInstanceProperties();
    C getContainer(); // StreamObject<Type,Property,?>
    void set (StreamReader<Type,Property> reader);
    StreamObject<Type,Property,?> createUnlessRead (int propertyID,QName typeXSI,Type type);
    void setUnlessRead (int propertyID,String stringPropertyValue);
    void setLiteralValue (int propertyID,QName typeXSI,String value);
    Object parseLiteralValue (int propertyID,QName typeXSI,String value,Type type);
}
. Streaming List
<em>I'm not sure I understand. Perhaps an executable
prototype would be more convincing about functional completeness and would
support measurements to back up the assertions.</em>
Loading on demand is driven by the StreamObject. However, the StreamReader
may reach other values of a maxOccurs>1 property before the demanded one, so
the StreamReader needs to notify the list of the literal strings it has reached
that are not demanded yet.
Here's the protocol which can be used to communicate:
interface StreamList<Type>
{
    void addStreamValue (Object value);
    void addLiteralValue (QName typeXSI,String value);
    Object parseLiteralValue (QName typeXSI,String value,Type type);
}
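A minimal sketch of how a list could hold reached-but-not-demanded literals, assuming float values only. `LazyFloatList` is a hypothetical, simplified stand-in: it borrows the method names addLiteralValue and addStreamValue from the protocol above but drops the QName and Type parameters, so it is not an implementation of StreamList itself.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical list whose slots hold either a literal String or an already parsed Float. */
final class LazyFloatList {
    private final List<Object> slots = new ArrayList<>();
    int parseCount; // illustration only

    /** Called for a value the reader passed on its way to the demanded one. */
    void addLiteralValue(String literal) {
        slots.add(literal);
    }

    /** Called for a value that is already in binary form. */
    void addStreamValue(Float value) {
        slots.add(value);
    }

    int size() {
        return slots.size();
    }

    float get(int index) {
        Object slot = slots.get(index);
        if (slot instanceof String) {
            // Parse only the demanded slot; neighbours stay as literals.
            slot = Float.valueOf(((String) slot).trim());
            slots.set(index, slot);
            parseCount++;
        }
        return (Float) slot;
    }
}
```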
. Modeling Frameworks
.. Modeling-neutral StreamReader
There are many Modeling Frameworks. In order for the StreamReader
to support as many of them as possible, here's a Modeling Framework adapter
protocol:
interface ModelingFramework<Type,Property>
{
    Type type (Property property);
    boolean many (Property property);
    Collection getAliasNames (Property property);
    Class getInstanceClass (Type type);
    List<Property> properties (Type type);
    Property element (String space,String name);
    Object getNameSpace (Property property);
    Object getLocalName (Property property);
    enum PropertyKind
    {
        ELEMENT,
        ATTRIBUTE,
        OTHER
    }
    PropertyKind kind (Property property);
    int property (Type type,List<Property> properties,String space,String name,boolean element);
    StreamObject<Type,Property,?> create (String space,String name);
    StreamObject<Type,Property,?> create (Type type);
}
.. JavaBean
class JavaBeans implements ModelingFramework<Class,PropertyDescriptor>
{
    public final/*many*/ Class type (PropertyDescriptor property)
    {
        return property.getPropertyType ();
    }
    public boolean many (PropertyDescriptor property)
    {
        return List.class.isAssignableFrom( type( property));
    }
    public Collection getAliasNames (PropertyDescriptor property)
    {//TODO cache
        return Collections.singleton( property.getName());
    }
    public Class getInstanceClass (Class type)
    {
        return type;
    }
    public List<PropertyDescriptor> properties (Class type)
    {//TODO cache
        try
        {
            return Arrays.asList( Introspector.getBeanInfo( type).getPropertyDescriptors());
        }
        catch(IntrospectionException e)
        {}
        return Collections.EMPTY_LIST;
    }
    public StreamObject<Class,PropertyDescriptor,?> create (Class type)
    {
        try
        {
            return (StreamObject<Class,PropertyDescriptor,?>)type.newInstance();
        }
        catch(Exception e)
        {}
        return null;
    }
}
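A minimal, runnable sketch of the introspection step the JavaBean adapter above relies on, using the standard java.beans API. `BeanProperties` and its nested `Product` bean are hypothetical and exist only for illustration. One detail worth noting: `Introspector.getBeanInfo(type)` with a single argument also reports the "class" pseudo-property inherited from Object, so the sketch passes `Object.class` as the stop class.

```java
import java.beans.IntrospectionException;
import java.beans.Introspector;
import java.beans.PropertyDescriptor;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class BeanProperties {
    /** Hypothetical sample bean mirroring two properties of the Product type. */
    public static class Product {
        private int property1;
        private List<Float> property2;
        public int getProperty1() { return property1; }
        public void setProperty1(int value) { property1 = value; }
        public List<Float> getProperty2() { return property2; }
        public void setProperty2(List<Float> value) { property2 = value; }
    }

    /** Maps each bean property name to whether it is list-valued ("many"). */
    static Map<String, Boolean> manyByName(Class<?> type) {
        Map<String, Boolean> result = new LinkedHashMap<>();
        try {
            // The stop class keeps Object's "class" pseudo-property out of the result.
            PropertyDescriptor[] descriptors =
                Introspector.getBeanInfo(type, Object.class).getPropertyDescriptors();
            for (PropertyDescriptor descriptor : descriptors) {
                Class<?> propertyType = descriptor.getPropertyType();
                result.put(descriptor.getName(),
                           propertyType != null && List.class.isAssignableFrom(propertyType));
            }
        } catch (IntrospectionException e) {
            throw new IllegalStateException(e);
        }
        return result;
    }
}
```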
.. Service Data Objects
class SDO implements ModelingFramework<Type,Property>
{
    public Type type (Property property)
    {
        return property.getType();
    }
    public boolean many (Property property)
    {
        return property.isMany();
    }
    public Collection getAliasNames (Property property)
    {
        return property.getAliasNames();
    }
    public Class getInstanceClass (Type type)
    {
        return type.getInstanceClass();
    }
    public List<Property> properties (Type type)
    {
        return type.getProperties();
    }
    public Property element (String space,String name)
    {
        return XSDHelper.INSTANCE.getGlobalProperty( space, name, true);
    }
    public final Object getNameSpace (Property property)
    {
        return XSDHelper.INSTANCE.getNamespaceURI ( property);
    }
    public final Object getLocalName (Property property)
    {
        return XSDHelper.INSTANCE.getLocalName( property);
    }
    public PropertyKind kind (Property property)
    {
        return XSDHelper.INSTANCE.isElement( property)
            ? PropertyKind.ELEMENT
            : XSDHelper.INSTANCE.isAttribute( property)
                ? PropertyKind.ATTRIBUTE
                : PropertyKind.OTHER;
    }
    public StreamObject<Type,Property,?> create (String space,String name)
    {
        return (StreamObject<Type,Property,?>)DataFactory.INSTANCE.create( space, name);
    }
    public StreamObject<Type,Property,?> create (Type type)
    {
        return (StreamObject<Type,Property,?>)DataFactory.INSTANCE.create( type);
    }
}
.. Eclipse Modeling Framework
class EMF implements ModelingFramework<EClassifier,EStructuralFeature>
{
    public EClassifier type (EStructuralFeature property)
    {
        return property.getEType();
    }
    public boolean many (EStructuralFeature property)
    {
        return property.isMany();
    }
    public Collection getAliasNames (EStructuralFeature property)
    {//TODO cache
        return Collections.singleton( property.getName());
    }
    public Class getInstanceClass (EClassifier type)
    {
        return type.getInstanceClass();
    }
    public List<EStructuralFeature> properties (EClassifier type)
    {
        return ((EClass)type).getEAllStructuralFeatures();
    }
    public EStructuralFeature element (String space,String name)
    {
        return ExtendedMetaData.INSTANCE.getElement ( space, name);
    }
    public final Object getNameSpace (EStructuralFeature property)
    {
        return ExtendedMetaData.INSTANCE.getNamespace( property);
    }
    public final Object getLocalName (EStructuralFeature property)
    {
        return ExtendedMetaData.INSTANCE.getName( property);
    }
    public PropertyKind kind (EStructuralFeature property)
    {
        switch( ExtendedMetaData.INSTANCE.getFeatureKind ( property) )
        {
            case ExtendedMetaData.ELEMENT_FEATURE:
                return PropertyKind.ELEMENT;
            case ExtendedMetaData.ATTRIBUTE_FEATURE:
                return PropertyKind.ATTRIBUTE;
        }
        return PropertyKind.OTHER;
    }
    public int property (EClassifier type,List<EStructuralFeature> properties,String space,String name,boolean element)
    {
        final EStructuralFeature property = element
            ? ExtendedMetaData.INSTANCE.getElement( (EClass)type, space, name)
            : ExtendedMetaData.INSTANCE.getAttribute( (EClass)type, space, name);
        return null == property
            ? -1
            : property.getFeatureID();
    }
    public StreamObject<EClassifier,EStructuralFeature,?> create (String space,String name)
    {
        return (StreamObject<EClassifier,EStructuralFeature,?>)PackageFactory.create( space, name);
    }
    public StreamObject<EClassifier,EStructuralFeature,?> create (EClassifier type)
    {
        return (StreamObject<EClassifier,EStructuralFeature,?>)EcoreUtil.create( (EClass)type);
    }
}
. Implementation
.. Load on demand
class ProductImpl implements StreamObject
{
    public void set (StreamReader stream)
    {
        reader = stream;
    }
    protected StreamReader reader/* = null*/;
    public int getProperty1()
    {
        if( Property1_not_read() )
            return Property1 = reader.loadPropertyValue( this, Property1_ID);
        if( WeakReference.class == literalValue.getClass () ) // parsed
            return Property1;
        Property1 = parse( literalValue);
        literalValue = new WeakReference( literalValue);
        return Property1;
    }
    protected int Property1;
    protected Object literalValue;
}
.. StreamObject & StreamList injection
For existing code, if changing it to support StreamObject &
StreamList isn't desired, injection may be utilized.
.. Code Generation (static object)
Code can be regenerated, or new code can be generated,
to support StreamObject & StreamList.
.. Dynamic object
Memory bindings such as Service Data Objects and Eclipse
Modeling Framework, enable dynamic objects besides the static ones (CodeGen).
Their implementation can be extended to support StreamObject & StreamList.
.. Concurrent access
Since StreamObject & StreamList load on demand,
synchronization may be necessary for concurrent access. And since multiple
objects may be loading from one stream, the synchronization may need
to take the shared stream into account.
<em>Does this extend to support data that exists
in multiple separate streams, or data backed by a database or other non-XML
sources?</em>
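A minimal sketch of the synchronization concern for objects that load from one shared stream, assuming a single lock object per stream. `SynchronizedLazy` and `SharedStreamDemo` are hypothetical names, and the `Supplier` stands in for reading from the stream. It covers only the single-stream case and says nothing about multiple streams or database-backed data.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

/** Hypothetical lazy holder; every holder fed by the same stream shares one lock. */
final class SynchronizedLazy<T> {
    private final Object streamLock;
    private final Supplier<T> loader; // stands in for reading from the shared stream
    private T value;
    private boolean loaded;

    SynchronizedLazy(Object streamLock, Supplier<T> loader) {
        this.streamLock = streamLock;
        this.loader = loader;
    }

    T get() {
        synchronized (streamLock) {   // serializes all access to the shared stream
            if (!loaded) {
                value = loader.get();
                loaded = true;
            }
            return value;
        }
    }
}

final class SharedStreamDemo {
    /** Starts the given number of threads that all demand the same value; returns the load count. */
    static int loadCountWith(int threads) {
        AtomicInteger loads = new AtomicInteger();
        SynchronizedLazy<String> lazy =
            new SynchronizedLazy<>(new Object(), () -> "value#" + loads.incrementAndGet());
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(lazy::get);
            workers[i].start();
        }
        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return loads.get();
    }
}
```

Holding one lock per stream is the simplest choice; it also means that a slow load blocks every other object fed by that stream.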
You're more than welcome to comment.
If you find this interesting, I can also post/wiki the prototype.
Help would be very much appreciated, especially in areas such as code injection,
a JavaBean ModelingFramework implementation conforming to JAXB, and test cases
demonstrating the performance gain from loading on demand.
<em>EMF is not intended to support JAXB. It's
intended to be much more general, to cover non-XML data and to deal with
multiple resource models, since most data doesn't come from just a single
XML file. The discussion is interesting, but I'm confused about how this
generalizes or where you expect this discussion to go. It seems you
are trying to define yet another programming model; one that's particularly
(or perhaps only) suited to streamed XML processing.</em>
--
Yang ZHONG
[attachment "StreamingObject&List.HTML" deleted by Ed Merks/Toronto/IBM]
_______________________________________________
emf-dev mailing list
emf-dev@xxxxxxxxxxx
https://dev.eclipse.org/mailman/listinfo/emf-dev