Channel: Planet Apache

Ian Boston: OmniTI Report into OAE: Corrections


The OmniTI technical report into OAE that was written in February 2012 has finally been published on the Sakai OAE Collaboration site. It was written after a 3 day “deep dive” by Theo Schlossnagle and Ciprian Tutu, whom I have never met or had any communication with. Neither Theo nor Ciprian is connected to the project, and OmniTI was paid to do this work partially on my recommendation.

The code base being considered is based on Apache Sling with some additional bundles (Nakamura) and a content storage layer (SparseMap) intended to address shallow, wide content hierarchies as used by Flickr, Google Docs etc., as opposed to the Enterprise Content Management hierarchies which Apache Sling and Apache Jackrabbit address. Most of the comments relate to Nakamura and SparseMap, although one or two relate to Sling.

Reading the report, it is easy to assume that the information on which it is based is correct and therefore that the recommendations are also correct. Were the information correct, I would have no argument with the recommendations. However, some of the information on which this report was based is a mixture of factual error, lack of understanding and in some cases FUD. This makes the report deeply flawed. I understand from sources that the first pages of the report were massaged by the customer for political effect, although I have no evidence to prove that. Even so, I agree with the conclusion of the section on SparseMap. If the core team using the code are not prepared to use current releases, or take ownership of that code, or talk to the developer who wrote it, then they should not be using it.

For those who have an interest in knowing where the information is correct and where it is not, I will go through each point in turn to put the record straight. I believe I have the knowledge to do this as I wrote 99% of SparseMap, 60% of Nakamura, a small amount of Apache Sling as a committer and PMC member and I have been a long time user and supporter of Apache Jackrabbit. I was not present at any of the meetings.

I left this project in June 2011 after the Sakai LA conference, following a disagreement with the then project manager over the way in which the project was being managed, which was preventing me from doing my job as architect and server team lead. In October 2011, at the request of Nicolaas Matthijs, the UI/UX lead, I fixed a bug that the server team had been struggling with for some time. This resulted in a planned (on Skype iChat, there is a transcript) and vicious public attack on me by the server team lead, the server team and some of the UI team, orchestrated by the then project manager. The transcript shows that the attack was designed to ensure I never got involved again. Feeling responsible for the survival of this project, I continued to actively develop SparseMap, producing releases in time (often through the night and over weekends) to be included in OAE until shortly after February 2012, when it became obvious those releases would never be used even though they contained important bug fixes and some performance enhancements. Although I offered to talk at the time this deep dive was done, and to provide whatever pointers I could, I was not asked a single question. For a project that has Open in its title, I find that odd. All of this is history, my version of events, and I have moved on. 10 months on I feel it is now safe to respond without causing any damage to the project.

I remain involved with the Sakai Foundation as deputy chair of the board of directors, although this blog post contains my own views as an individual. If I have mentioned a name above, it is because I have no argument with them, their ability or integrity.

The numbers below are paragraph numbers from the report, which can also be found here: omniti-sakai-oae-report-20120223

2.1.2: states multiple JDBC backends are supported. Multiple backends are not supported, and the MongoDB backend is not JDBC. Earlier posts on this blog strongly recommend using only the RDBMS backend and then only PostgreSQL. Unfortunately some involved in OAE were determined to use other databases or wanted to use NoSQL stores, but they have never been actively supported or tuned.

2.1.3: True based on correct information.

2.2.1: ACL representation. Incorrect. ACLs are not stored and used relationally. They are loaded into memory and stored as compiled bitmaps in a tree cache that is populated and invalidated as required. In tests, 99% of ACL resolutions happen by a single pointer-based retrieval of the bitmap from memory followed by a small set of bitmap operations. The approach is similar to that used by Jackrabbit, except that the ACEs are stored as bitmaps in longs to reduce space and processing requirements.
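To make the idea of ACEs stored as bitmaps in longs concrete, here is a minimal sketch. It is not the SparseMap code: the class name, the permission constants, the flat map standing in for the tree cache, and the rule that a deny overrides a grant are all assumptions made for illustration.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative only: class and method names are hypothetical, not the SparseMap API.
final class AclBitmapSketch {
    // One bit per permission, packed into a single long.
    static final long CAN_READ = 1L;
    static final long CAN_WRITE = 1L << 1;
    static final long CAN_DELETE = 1L << 2;

    // Compiled entries keyed by "path|principal"; the value is the effective bitmap.
    // A flat map stands in for the tree cache described in the text.
    private final Map<String, Long> compiled = new ConcurrentHashMap<>();

    private static String key(String path, String principal) {
        return path + "|" + principal;
    }

    // Compile grant and deny sets into one bitmap. Deny wins here (an assumption).
    void put(String path, String principal, long granted, long denied) {
        compiled.put(key(path, principal), granted & ~denied);
    }

    // Drop a compiled entry when the underlying ACL changes.
    void invalidate(String path, String principal) {
        compiled.remove(key(path, principal));
    }

    // One map lookup, then a mask-and-compare on the long.
    boolean isAllowed(String path, String principal, long required) {
        Long bitmap = compiled.get(key(path, principal));
        return bitmap != null && (bitmap & required) == required;
    }
}
```

The point of the sketch is that a permission check costs one in-memory lookup plus a couple of bitwise operations, with no relational query on the hot path.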

2.2.2: Sparse data represented in DB tables: Incorrect. The data is represented in serialised binary format, with a limited set of relational search tables providing SQL query capability to retrieve it. Once the nature of the data was understood, custom queries and a custom relational model would have been built to ensure that the indexing and query mechanism was efficient. This is done entirely in the configuration files and, assuming there are no bugs, works without code changes. The version of Nakamura under review was only using the parent child relational queries. A decision was made, incorrectly, to use Solr as the only query engine some time after October 2011.

2.3.1: Ehcache replicates key value pairs over the entire cluster. Incorrect, in fact totally incorrect. Ehcache does not replicate key value pairs over the cluster, and to do so would be suicidal. The Ehcache configuration files clearly show that only 1 cache is replicated: the cache that contains 5×20 bytes of encryption keys per server. That cache is replicated once every 5 minutes per server. A handful of other caches are invalidated. Invalidation is either generational invalidation or key invalidation. Cluster tests which I did showed invalidation traffic is generally low.
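For readers unfamiliar with the two invalidation styles mentioned above, the following is a minimal single-JVM sketch of the general technique. It is not the Nakamura or Ehcache implementation, and all names are hypothetical. Key invalidation removes one entry; generational invalidation bumps a counter so that every older entry is treated as stale on its next read. In a cluster, only the key name or the generation bump would need to travel between nodes, never the cached values.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Illustrative only: hypothetical names, not the Nakamura or Ehcache API.
final class GenerationalCacheSketch {
    // Each entry remembers the generation in which it was written.
    private static final class Entry {
        final long generation;
        final String value;

        Entry(long generation, String value) {
            this.generation = generation;
            this.value = value;
        }
    }

    private final AtomicLong generation = new AtomicLong();
    private final Map<String, Entry> entries = new ConcurrentHashMap<>();

    void put(String key, String value) {
        entries.put(key, new Entry(generation.get(), value));
    }

    // Returns null for a missing entry or one written in an older generation.
    String get(String key) {
        Entry e = entries.get(key);
        if (e == null) {
            return null;
        }
        if (e.generation != generation.get()) {
            entries.remove(key, e); // lazily evict the stale entry
            return null;
        }
        return e.value;
    }

    // Key invalidation: drop exactly one entry.
    void invalidateKey(String key) {
        entries.remove(key);
    }

    // Generational invalidation: one counter bump makes all current entries stale.
    void invalidateAll() {
        generation.incrementAndGet();
    }
}
```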

2.3.2: True, however any deployer worth their salt would not deploy this application without mod_expires, mod_headers and perhaps mod_cache configured correctly.

2.3.3: True. The standard deployment of OAE is to use mod_proxy_http and ideally run the proxy in event mode.

2.4.1: Solr in the same JVM. Incorrect. The intention was never to run Solr in the same JVM. Running Solr embedded was there to allow developers to run the server without having to spend time on configuration and was never intended for anything other than play deployments. So true, but gives the wrong impression.

2.4.2: True, although this has been partially addressed by Mark Triggs using a bitmap representation of the principals that can read an item (it is incorrect to say ACLs are embedded into Solr; only the read principals are). It's also true that this application, using Solr as the only query engine, stresses Solr well beyond what it was intended for.

2.4.3: Information correct and recommendations are correct, although the search space size is more acute than noted since, with Solr as the only query engine, the cardinality of many key sets is far higher than all the words on the planet.

2.4.4: Partially correct. Dates are stored as a compact representation of calendar with timezone. Where they need to be searched in SparseMap they are converted to a suitable form for searching, and where they get indexed they are stored in a form suitable for Solr to query. It's possible that the code base has changed since what I have said was true, as the Solr indexing underwent active development between October 2011 and the report. The recommendation is correct.

2.4.5: True.

2.5.1: True for the version of SparseMap that OAE was using at the time of this report, but that version was 2 or 3 versions out of date and has all of the problems reported in this blog. Load tests that were available at the time or shortly afterwards proved that later versions of SparseMap were drop-in compatible and made the information on which this recommendation is based incorrect.

2.5.2: True due to the pattern of use of the entities in question predicated by the UI. The recommendations overlook the fact that SparseMap is built to shard on keyIDs out of the box.
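As a generic illustration of what sharding on a key ID means, the sketch below maps a key to one of N shards by hashing it. This is only the textbook hash-and-modulo technique under assumed names; the actual sharding scheme SparseMap uses is not described here and may well differ.

```java
// Illustrative only: a generic hash-based shard selector, not SparseMap's scheme.
final class KeyShardSketch {
    // Maps a key ID to a shard index in the range [0, shardCount).
    static int shardFor(String keyId, int shardCount) {
        if (shardCount <= 0) {
            throw new IllegalArgumentException("shardCount must be positive");
        }
        // floorMod keeps the result non-negative even for negative hash codes.
        return Math.floorMod(keyId.hashCode(), shardCount);
    }
}
```

Because the shard is derived from the key alone, every read and write for a given key lands on the same shard without any lookup table.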

2.6.1: Incorrect. The code base is heavily instrumented already with JMX both from Apache Sling and from Nakamura itself. The released version of SparseMap at the time had full JMX counters on most of its layers with details on the caching performance. ehcache has JMX stats as well. Most of these counters are accessible by higher performance routes than JMX, although for the low level of Nakamura performance (40 concurrent users IIRC) at the time JMX was more than adequate.

2.6.2: True

2.7.1: True, although Daniel Parry at the University of Cambridge was running a puppetized cluster. This was documented in public a few months before and had been running on hardware at Cambridge since October 2011. Others were doing the same.

2.7.2: Incorrect. The migration framework at the time was built to perform live data migration. The UI at the time was being developed to understand earlier versions of data and migrate as required. When a new version of Word comes out, Microsoft does not (thankfully) ask the Linux ext file system maintainers to write a migration script. OAE and Sparsemap are no different.

2.7.3: Partially true, there were scripts, but no one was using them. Alan Berg and I worked on JMeter tests with loading scripts in November 2010. Those load scripts were maintained until I left the project.

2.7.4: True

2.8.1: True, although I have to point out that reducing the number of requests and the interdependencies between the requests is absolutely critical to the success of this project. I regularly try to use an instance of OAE in the US and in the UK from Sydney and they are unusable, giving a white screen hiding a requireJS timeout on some pages. This application, with packet latencies over 150ms, in the state it was in when the report was written, is unusable. It’s a pity this issue was buried so deep, although the issue has been largely resolved by recent work.

2.8.2: True and yes, 4 lines of Apache httpd config is enough to do this.

2.8.3: True, but this exposes a fundamental problem with the whole system that is not being addressed. The update rate to objects required to maintain counts is so high that in a large installation the write traffic will be unsustainable. For “large”, read 10K active users (not large at all).

2.9.1: True

2.10.1: Not true. If it was true, then Apache Sling and Adobe CQ5 might not work. Sling does URL resolution by in-memory pointer to a tree of hash maps which are cached. The URL resolver was re-written slightly to cope with very large numbers of content types about 2 years ago. If there really is a problem here, then a bug needs to be filed with Apache Sling.

2.10.2: Incorrect. The QoS filter is a modified version of the QoS filter in Jetty 6. Apache Sling does not support continuations and would struggle if it was used in a server running WebSocket or CometD, as it would hog the available event processing threads. The QoS filter (probably a bad name) is a request throttle to limit the number of concurrent requests on a group of URLs at any one time, so that the Sling part of the server does not try to service 30K concurrent requests when a CometD or WebSocket endpoint in the same JVM might be more than happy to deal with that number. The same could be achieved in some httpd front ends.
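The throttling idea can be sketched with a plain counting semaphore. This is not the Jetty or Nakamura filter code: the class and method names are hypothetical, and the sketch simply blocks the calling thread for a bounded time, whereas a real servlet filter would wrap the filter chain and might suspend the request instead.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

// Illustrative only: a blocking throttle with hypothetical names, not the Jetty QoS filter.
final class RequestThrottleSketch {
    private final Semaphore permits;
    private final long maxWaitMillis;

    RequestThrottleSketch(int maxConcurrent, long maxWaitMillis) {
        this.permits = new Semaphore(maxConcurrent, true); // fair: first come, first served
        this.maxWaitMillis = maxWaitMillis;
    }

    // Runs the handler if a slot frees up within the wait budget,
    // otherwise returns the caller-supplied rejection value.
    <T> T handle(Supplier<T> handler, T rejected) {
        boolean acquired;
        try {
            acquired = permits.tryAcquire(maxWaitMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
            return rejected;
        }
        if (!acquired) {
            return rejected;
        }
        try {
            return handler.get();
        } finally {
            permits.release(); // always free the slot, even if the handler throws
        }
    }

    int availableSlots() {
        return permits.availablePermits();
    }
}
```

One such throttle per group of URLs caps the work reaching that part of the server, while other endpoints in the same JVM remain unconstrained.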

2.10.3: True

2.10.4: At the time of the report I would have said it was possible to make Sakai OAE multi tenant and it had been a long held desire of mine. However, since working on other projects I now think that the design of OAE is fundamentally flawed from a multi-tenancy point of view. I agree with the recommendation, but expect it will be too costly to achieve.

2.10.5: True

2.10.6: Hmm, incorrect. There is no Chat API in Nakamura. There is a Chat API in Sakai 2.x which is a totally different code base. There is a Discussion and messaging API in Nakamura that might have been abused as chat.

2.10.7: SparseMap is thread safe and concurrent. Many hours were spent testing in YourKit under load to identify thread blocking and concurrency issues. Earlier blog posts report that work. It was written as a result of encountering concurrency issues with Jackrabbit. I hasten to add that Jackrabbit is also 100% thread safe but was not designed for high volume writes with large numbers of ACLs. I should also add that the problems were encountered in October 2010 and Jackrabbit addressed some of the issues we encountered in later work.

2.10.8: Yes, why not.

2.10.9: Hmm, the recommendation is something the URG should respond to, not me, although I have heard the answer many times before.

If you have read this and have an issue with anything I have said, I will be happy to discuss it here in the comments, or if you would prefer not to do that in public then just email me. If I have said anything you consider libelous, defamatory or offensive, then please let me know and I will listen to your concerns. Causing offense is not my intent, correcting mistakes is.


