
Friday, October 31, 2014

Capturing continued – building a fully functional demo

Another way of using the capturing features of Roxy is to build a demo quickly: generate a basic search app with the Application Builder, capture that with Roxy, and then republish it after customizing it any way you like. I gave a live demonstration of this at the MarkLogic User Group Benelux meetup 'Stop horsing around with NoSQL — How to Build a NoSQL Demo to Impress Your Boss'. This article gives the full details of that demo.

Capturing and redeploying an Application Builder app basically comes down to the following steps:

  1. Create and deploy a Roxy REST project
  2. Use Roxy MLCP features to ingest data
  3. Use Roxy to deploy the analyze-data tool
  4. Create indexes using the analyze-data tool
  5. Create an App-Builder project for the content db of the Roxy project
  6. Run through the App-Builder wizard and deploy
  7. Use Roxy capture to capture the app-builder project
  8. Customize the app-builder code
  9. Redeploy to the Roxy project

And that is it. Nine steps to build a fully running demo, after which you can customize it any way you like, and redeploy as much and anywhere you like.

I’ll run through some details briefly, point to other blog articles of mine that are relevant, and wrap up with links to code, data and recordings of the live demo.

Other capture parameters


My previous blog article ‘Capturing MarkLogic applications with Roxy’ discusses capturing MarkLogic applications in general. It also mentions creating a new Roxy project, and various commands useful for capturing an Application Builder application.

In addition, there is one capture flag I have not mentioned before that could be of interest. It is another way to capture configurations, but instead of grabbing everything at once, it only captures the databases and servers matching the current Roxy project. Just use --ml-config instead of --full-ml-config:

./ml local capture --ml-config

This capture command runs faster as it captures much less, and because the resulting ml-config file is much smaller, it is often easier to find the settings you are looking for. It also allows targeting specific databases and servers using the extra parameters --databases and --servers:

./ml local capture --ml-config --databases=App-Services,Extensions,Fab --servers=App-Services

Loading data


Using the MLCP features of Roxy was also discussed briefly in my previous article. MLCP allows ingesting data directly from compressed archives, which is very convenient. You can also apply a transform while ingesting. The demo material contains runnable examples of both. Below is a small example of what such a command line could look like:

./ml local mlcp import -input_file_path ../sample-data/horse-racing/ -input_compressed \
  -transform_module /ingest/ingest-events-with-geo.xqy -transform_namespace http://marklogic.com/demo
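
The transform module referenced above is part of the demo material. For anyone new to MLCP transforms, the expected shape of such a module is roughly the following. This is only a minimal sketch, not the actual ingest-events-with-geo.xqy; the envelope wrapping is a made-up example of an enrichment:

xquery version "1.0-ml";
module namespace demo = "http://marklogic.com/demo";

(: MLCP calls this function once per document; $content holds the "uri" and "value" keys :)
declare function demo:transform(
  $content as map:map,
  $context as map:map
) as map:map*
{
  let $doc := map:get($content, "value")
  return (
    (: hypothetical enrichment: wrap the original document in an envelope :)
    map:put($content, "value", document { <envelope>{ $doc/node() }</envelope> }),
    $content
  )
};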

Analyzing data


The analyze-data tool is described in one of my earlier blog articles, ‘Analyze your data!’. It is just a very crude tool, and far from flawless, but it can often give you a jumpstart and get you going quickly.

Customizing app-builder projects


A brief word on customizing app-builder projects: if you intend to push the changes back into the modules database generated by the wizard itself, make sure to put any customizations into the /application/custom/ folder. Not that there is much reason to push files back there, since you can push them anywhere with Roxy.

Doing so makes sure not only that you can go back to the wizard, make some changes, and redeploy, but also that you can recapture the app with Roxy. Roxy will refresh the files in its src/ folder, but will leave src/application/custom/ untouched. That means you can go back and forth between the wizard and your own customizations as many times as you like!

The demo material


The demo material consists of:

  • The cheatsheet (slides)
  • The cheats (a set of files)
  • The Geonames country info
  • The horse racing sample data
  • The DBPedia triples

The cheatsheet contains a brief intro, and a long list of all commands and steps you need to go through to run the entire demo yourself. Note that you need to scroll down on some slides so you don't miss steps!

The cheats consist of a set of various files; it would have taken too much time to type everything live myself. The full demo already took about an hour and a half with all the cheats.

The latter three provide all of the pieces of data that I used for the demo. I use the Geonames country info for a crude way to show a geospatial map with markers. The horse racing data is the core information. The triples are used to pull in extra info from DBPedia, and use that to add some semantics to the demo.

The recordings


You can watch the recording of the entire live demo with the following two links (we had a break roughly half-way). The quality is not perfect, but it should be understandable:


Have fun!

Thursday, October 9, 2014

Capturing MarkLogic applications with Roxy

Ever wanted to automate deployment of an existing MarkLogic application? Or regretted that you didn’t start off with Roxy for a MarkLogic project? The capture feature of Roxy will help with that!

I was recently asked to migrate a large database with data from one demo server to another. That is a simple task with the MLCP copy command. It even allows you to add collections, permissions, migrate to a different database root, etc. Unfortunately, I was asked to not only migrate the database, but the associated app servers as well, and impose a permission structure that allowed access to the data from a REST api instance. I decided to capture all relevant details in a Roxy project, and use that to automate deployment to the target environment.

Initializing a Roxy project


What you need first is an empty Roxy project structure. I’ll be assuming a project name myapp, running against MarkLogic 7. Get hold of the ml script if you don’t have it yet (you can download it from https://github.com/marklogic/roxy/tree/dev), and run the following command:

ml new myapp --server-version=7 --branch=dev --app-type=rest

This will create a Roxy project in a subfolder called myapp. It takes the dev branch of Roxy to utilize the latest cutting edge features, which we will need for the capture functionality that we are going to use here. It also creates a REST-type project, which gives you the emptiest Roxy project structure you can currently get with the ml new command.

Setting up environments


The next step is to add details about the relevant environments. In my case it concerned a development and a production environment, so I used the pre-existing dev and prod environment labels. You can add your own as well; just edit the environments property in deploy/build.properties. Then create an environment-specific properties file; for dev you create deploy/dev.properties. I typically put the following lines in such a file:

user=gjosten
password= 

app-port=8058
xcc-port=8059

content-forests-per-host=3

dev-server=mydev.server.com

Roxy will ask for the password if you keep it empty. Note: make sure that the name of the server property matches the environment. So it is dev-server for dev.properties, but prod-server for prod.properties.

Capturing ml-config


Once this is done, you are ready to take the first step in capturing MarkLogic settings, and code. Just run the following command:

./ml dev capture --full-ml-config

Replace ‘dev’ with the appropriate environment. The above command will create a new file named deploy/ml-config-dev.xml. It will contain a list of all app servers, databases, amps, users, roles, etc. from the specified environment. You don't want to bootstrap all of that, and luckily Roxy normally ignores the new file. Go into this file, isolate all parts that are relevant for your application, and copy these over to deploy/ml-config.xml.

You probably want to replace the default parts generated by Roxy, but put them next to each other first. Roxy can use placeholders to insert values from the properties files. If you matched your project name with the (partial) names of databases and app servers, you could decide to copy some placeholders over. One useful case is to use the app-port and xcc-port placeholders to get different ports per environment. Add more properties to the properties files if you have additional app servers.
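
For example, an app server definition in ml-config.xml can contain placeholders like these, which Roxy replaces with values from the active properties file (just a two-line excerpt for illustration):

<http-server-name>@ml.app-name</http-server-name>
<port>@ml.app-port</port>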

You could in theory replace the entire ml-config.xml with the captured ml-config, but usually that is not advisable. In case you do start off with the captured one, make sure to remove the XML processing-instruction at the top.

Testing ml-config


A large benefit from Roxy here is that you can easily do some dry runs against a local VM, or your own laptop. Run the following command to create app-servers, databases, and anything else you selected on your local environment:

./ml local bootstrap

Tweak the ml-config until bootstrap runs flawlessly. Then open the Admin interface, and verify that everything looks complete and is running correctly. Once there, you are ready to bootstrap the target environment; that is just a matter of running bootstrap against a different environment.

Capturing modules and REST extensions


Just capturing and deploying the ml-config will likely not result in a fully functioning application. It very likely depends on additional code, like modules or REST extensions. Roxy provides two additional capture functions to get hold of those. If you have a more traditional application, not using the more recent REST api, you can run this:

./ml dev capture --modules-db=mymodules

Replace ‘mymodules‘ with the appropriate modules database name. All files in that database will be written to the src/ folder.

If your app is in fact a REST api instance, like applications generated with the App Builder, use this command instead:

./ml dev capture --app-builder=myappserver

Replace ‘myappserver’ with the name of the app-server that is the REST api instance. This will capture modules into the src/ folder, but also isolate REST transforms, REST extensions, and REST options into the rest-api/ folder.

Roxy by default assumes there is just one project-specific modules database. There are ways to deploy multiple sets of sources to different modules databases. But you might consider capturing those in separate projects. That is probably easier.

Testing deploying modules


You are close to having reproduced an entire MarkLogic application with just a few commands! Test the capture of modules and REST extensions by deploying them locally:

./ml local deploy modules

This will deploy both src/ and all REST artifacts. After this you should be able to go to the newly created app servers and see running applications! Repeat the above against the target environment to get them up and running there as well.

Copying documents


The last step in the process is of course copying the contents of the document databases, and maybe also schemas and triggers. MLCP is a very useful tool for that. You can either use separate MLCP export and import commands (you can use ml {env} mlcp for that!), or use MLCP copy to transfer directly between source and target. Unfortunately, you can't use ml {env} mlcp for the latter (yet).
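
For reference, a direct copy with MLCP itself could look something like this (hosts, ports, and credentials are placeholders; you run the mlcp script directly rather than through Roxy):

mlcp.sh copy -input_host mydev.server.com -input_port 8059 -input_username admin -input_password secret \
  -output_host myprod.server.com -output_port 8059 -output_username admin -output_password secret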

Good luck!

Thursday, June 19, 2014

Analyze your data!

Now that I am part of the MarkLogic Vanguard team, I regularly face data sets that are unknown to me. Time is of the essence, particularly within this team, where we often need to create a working demo in a matter of weeks, sometimes days. The shape and quality of the data often play an important role. Knowing what kind of information is hidden in the data allows you to use it within the demo. This applies not only to Vanguard and demos, but to everyone and every project in which data plays an important role.

I created a little MarkLogic REST extension that has helped me get a good, first impression from XML data sets quickly, and helps create an initial set of indexes in MarkLogic as well. It is available free of charge: https://gist.github.com/grtjn/1aba4eb364de9268fb5f.


Intro


The idea for such a tool arose years back, in my first years working with MarkLogic. I wasn't working for MarkLogic back then, but the situation was similar. I often faced (relatively) unknown data sets, and had to ‘dig in’ to get acquainted with them. Documentation was (and is) usually lacking, or not yet available. A good understanding of the data at hand can help a lot with, for instance, assessments, time estimates, and making educated guesses whether (complex) transformations will be necessary or not.

I also felt that creating indexes in MarkLogic was rather cumbersome. We use Roxy (http://github.com/marklogic/roxy) a lot within Vanguard; no surprise, as the two founders of that project are lead members of Vanguard. It provides a convenient way to define index configuration (and many other MarkLogic settings) in XML, and push it all with a single command. But you still have to write the index definitions yourself, often with a lot of repeated and unknown namespaces and such. I'd rather have a little tool to help me with that.

Why a REST extension? We tend to use JavaScript on top of the MarkLogic REST API. I therefore packaged the tool as a MarkLogic REST extension. I also deliberately kept it single-file, to make it an easy and lightweight drop-in.

Deploying the tool


There are various ways to download and deploy the REST extension. To get going, download analyze-data.xqy from the Gist linked above. After that you either:
  • Copy the downloaded file into the Roxy folder for REST extensions, and deploy it:

    cp analyze-data.xqy rest-api/ext/
    ml local deploy modules

  • Or use Curl (replace myuser, mypass, and 8123 with appropriate values):

    curl --anyauth --user myuser:mypass -X PUT -i -H "Content-type: application/xquery" -d@"./analyze-data.xqy" http://localhost:8123/v1/config/resources/analyze-data
You can now access it at http://localhost:8123/v1/resources/analyze-data (again, replace 8123 with the appropriate value).

Running the tool


The tool does various counts. The first part does counts over the entire database, which can take a while depending on the size of your database. The second part takes a random set of (by default) 20 files and performs analysis on those. The final part allows creation of indexes.

When you open the tool (have a minute of patience, it will be doing all sorts of counts for you!) you will see something like this:


It immediately reveals various details. It gives the total number of documents, and counts for each of the main document types supported by MarkLogic: XML, Text, and Binary. It also provides a full list of discovered collections (requires the collection lexicon to be enabled), as well as the top 1000 directories (requires the uri lexicon to be enabled), giving doc counts for every entry it lists. The last global count is by root element, see the next image.
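
There is no magic involved, by the way. The collection counts, for instance, boil down to a simple lexicon call, roughly like this (a sketch of the approach, not the tool's literal code, assuming the collection lexicon is enabled):

for $collection in cts:collections((), "frequency-order")
return concat($collection, ": ", cts:frequency($collection))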



Sample analysis


Based on the randomly chosen sample set, it provides insight into namespaces known to the system and occurring within the sample set, a list of unique element paths, and a list of unique paths to any element or attribute containing character data (‘value’ paths). Each path is accompanied by an averaged count.



Element path counts can be useful for spotting interesting container elements. High numbers high up in the tree are usually an indication of that.

Value path counts can be useful to investigate data completeness. If for instance a certain attribute occurs much less often on an element than other attributes, then either it is an attribute to mark special cases (useful to know!), or it is an indication that your data source is providing incomplete data. The latter typically occurs when you are receiving data from multiple (independent) sources.

Note: empty elements and attributes are excluded from value paths at the moment.


Indexes


The last part is the most interesting, though. Based on the value paths from the sample set, it evaluates each path across the sample set, takes the first 50 values of each, and displays the top 3 values. It also guesses the data type of each path, based on the topmost value.



There is a checkbox displayed next to each value path (this also applies to the Sample value paths section!). Simply mark the paths that appear worth indexing to you, and scroll down to the Create selected indexes button. This will create element and attribute range indexes for you. The code also contains functionality to create path range indexes, but the MarkLogic App Builder currently doesn't support those, so I made the element and attribute indexes the default.
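
Behind that button, creating such an index comes down to a call against the Admin API. A minimal sketch of what that looks like (not the tool's actual code; the database, namespace, and element names here are made up):

xquery version "1.0-ml";
import module namespace admin = "http://marklogic.com/xdmp/admin"
  at "/MarkLogic/admin.xqy";

let $config := admin:get-configuration()
let $index  := admin:database-range-element-index(
  "string", "http://marklogic.com/demo", "venue",
  "http://marklogic.com/collation/", fn:false())
let $config := admin:database-add-range-element-index(
  $config, xdmp:database("myapp-content"), $index)
return admin:save-configuration($config)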

At the very bottom, a list of existing indexes is shown. There are checkboxes next to these as well, to allow removing them.



The last line of the page displays the ‘elapsed-time’ printed as xs:dayTimeDuration.

Monday, April 14, 2014

Vanguard here I come!

Some may have heard the news: I joined MarkLogic as of the first of April. No joke! To be more precise, I joined the MarkLogic Vanguard team. The name of the team is derived from its military meaning:
“The vanguard is the leading part of an advancing military formation.” http://en.wikipedia.org/wiki/Vanguard
It is a rather popular word. Wikipedia lists the name being used for ships, aircraft, and satellites, but also for company names, schools, sports, and political parties. Not to mention the numerous books, movies, and toys that have Vanguard in their name or title. Google actually suggests the term is derived from the French word avant-garde, which only adds to the classiness of the word.

About the team: it operates as an international team supporting MarkLogic Sales Engineers around the globe in creating compelling demos and proofs of concept. That puts the team close to the frontier. But the team also has the ambition to set an example: not only to make the demos look good, but also to write compelling code. Even more reason it deserves such a name!

The team uses the RObust XQuerY (ROXY) Framework to accelerate their work. To describe it very briefly:

“Roxy is a lightweight XQuery application development framework. It includes:
  • Application configuration management
  • A lightweight MVC framework
  • A unit testing framework”
https://developer.marklogic.com/code
It is available on Github, and has become pretty popular among MarkLogic developers. No surprise, as it can accelerate development and deployment of a MarkLogic application enormously. If you take the cutting-edge ‘dev’ branch, you can even ‘capture’ an existing App Builder app, and start extending it with REST extensions very easily with the help of Roxy.

I am lucky to be on a team that includes the two biggest contributors to Roxy, not coincidentally also its inventors: Dave Cassel and Paxton Hare. For those interested in learning more about Roxy, make sure to check out the documentation and all the blogs that are available online. There are plenty of contributions and references on various personal blogs; most have been gathered on the Roxy wiki:
https://github.com/marklogic/roxy/wiki/Tutorials
Stay tuned for more on my adventures at Vanguard. I am very much looking forward to it myself!

Thursday, February 16, 2012

XMLPrague 2012, day one and two

(originally posted on XMLHolland as day one and day two)


Cross-pollination, ER, and Pigs wearing Lipstick
a report on behalf of XMLAmsterdam by @grtjn

XMLPrague is great in so many ways that it is impossible to describe it properly. I will jot down some thoughts and observations here anyhow, hoping to spark some good and funny memories among those who were there, and hopefully give the others at least some impression of what they have missed.

Be there next time!


Before I start, I have to say this: Prague is absolutely one of my top favorite cities. I have been there four times (twice for XMLPrague, and twice by bicycle), and it still hasn't lost any of its charm. Yes, it was slippery. Yes, it was (very) cold. And yes, some complained about coffee, queues, draft, power supply, and blocked views. But that doesn't make Prague any less beautiful, nor XMLPrague any less interesting. On the contrary, it just added to the experience!

For those who never were..

About XMLPrague in general, for those who never attended it (so far): this conference is quite different from all others I have attended. It is not an ordinary set of talks with questions afterwards. It is also about discussing, sharing thoughts (both with speakers and among the audience, even during each talk thanks to Twitter and the live TwitterWall), debating, inventing new ideas, starting new initiatives, making fun with/about each other, etc. It really feels much more like one big collective than just a bunch of geeks who happen to share some interests (or not).

Why is it so different, you might ask? Good question, no accurate answer. I don't think XMLPrague is unique in this respect, by the way. Surely there are other communities as cohesive as this one. One contributing factor must be the decline in XML adepts this community has faced over the years. The remaining people are the more persevering – or nostalgic – ones. Anyhow, the community is surprisingly tight despite the fact that it is quite literally scattered across all continents.

The general theme of day 1..

The topics at XMLPrague have always been of a high level in various ways: highly technical, talks about the progress of the standards themselves, highly advanced topics, and – not least – high quality, not to mention all those renowned speakers! At this year’s conference it seems to go even further. Topics really go beyond standards; it is all about crossing ‘borders’, building bridges. Making different techniques work together, for better (the A-team – weird quadruple, but they always succeed) or worse (chimera, pigs wearing lipstick – ugly). Standards learning or lending from each other (cross-pollination). Boldly going where no one has gone before… – well, kinda.

Opening keynote..

The conference starts off with the opening keynote by Jeni Tennison. Her talk is about the fact that standards are often competing over the same space. HTML, JSON, XML, and RDF are all seeking Web dominance, for instance. Others try to take the best of some and merge them together, building bridges or chimeras. These certainly fill a need, but the result isn't always pretty, nor does it always work. Jeni says that instead of competing or merging, they should be made to work together, coexisting as they are. Each standard has its niche; they should each be used for what they are best at. It often requires just a little bit of glue to make it work.

(Jeni uses the project legislation.gov.uk she currently works on as example, in which she glues the four standards together through URLs.)

Morning sessions..

The remaining morning sessions, as well as the discussion panel just before lunch, more or less extend (or ‘contradict’) on this statement.

Eric van der Vlist looks back over the period from when the XML hype started until now. XML was said to be thé data-interchange format, and was flexible enough to be applied to anything. XML advocates (quite literally) tried this, but things got overhyped. XHTML, for instance, has never become the success that some had envisioned. The biggest problem is that XHTML and HTML aren't compatible (syntax-wise at least). Also, newer and more successful ideas like Web 2.0, HTML5, and JSON have overtaken XML and XHTML. But, says Eric, XML isn't just about syntax. It is still built on top of a strong and flexible data model, and has lots of tooling. He suggests we should consider allowing data structures like triples and JSON into the data model. Unfortunately, he doesn't say how to prevent it from becoming a chimera itself.

Robin Berjon teams up with Norm Walsh to show us there are good ideas in both HTML and XML. Looking beyond the syntax, they discuss ideas in which they take features from one domain and apply it to the other. You can already use XML together with CSS to show a web page. You can even include script elements in the XML. SVG is already mostly supported as part of HTML, but with HTML-like syntax. It is also possible to utilize the broad support of JavaScript to bring ideas from the XML domain into HTML. There is pdf.js, which might be altered to accept XSL-FO as input. There are JavaScript implementations for XSLT (Saxon-CE for instance), and XProc (written by Vojtech Toman). Web developers usually dislike the pointy brackets though. Robin suggests using CSS selectors to make transformations more accessible to them. He also suggests adding Schematron-like validation features to CSS.

As if the previous talks hadn't stirred the crowd enough yet, Anne van Kesteren surely did. In his relatively short presentation, he pretty much suggested dropping the strict XML well-formedness requirement, and allowing HTML/SGML syntax again. This was a good lead-in to the panel discussion that followed, in which convergence between XML and HTML was discussed. The panel consisted of all previous speakers, as well as Steven Pemberton. To summarize briefly: some argued that you can't drop well-formedness in general; it helps in the editing process, for instance. Others argued that the end user, the one looking at web pages, shouldn't be ‘punished’ for the mistakes of developers. It was suggested to apply a Postel's law approach. Before the end of day 2, a new W3C working group was formed to address the idea of improved XML Error Recovery.

Afternoon..

After a lunch of mashed potatoes with schnitzel, the topics become less philosophical. Vojtech Toman starts with support for non-XML data in XProc. The XProc WG has looked at the need to handle such data within XProc pipelines. That is currently pretty much impossible without implementation-specific extensions. By adding a simple content-type attribute on inputs and outputs, and adding some extension functions and steps, the XProc processor ‘knows’ how to flow non-XML data appropriately. It will do conversion where appropriate. Vojtech shows an example in which he converts an image from PNG to JPEG just by specifying the appropriate input and output type. The idea for support for non-XML data was generally well received.

Next was a talk by George Bina. He talked about NVDL, which is a standard for handling validation of XML with mixed namespaces. It allows different parts to be validated with different schema types (RNG, XML Schema, etc). The ISO standard also provides a few sophisticated features that allow detailed control over how each part of the document should be validated. Bina shows how XProc and XSLT are used to implement NVDL support in oXygen.

After Poster presentations and a coffee break, the conference continues with a more delicate matter again: JSON. Jonathan shows some quotes telling that XQuery was meant to be a universal query language. But JSON people dislike XML. That is why a new query language is being proposed that builds a bridge: JSONiq. There are two syntaxes: XQ-- and XQ++. The former is a stripped XQuery syntax, with just support for JSON structures. It could help JSON-minded people to leverage the power of XQuery (and thus possibly that of XML databases). The latter is based on full XQuery syntax, but extended with JSON constructors and expressions. It allows XQuery-minded people to interact with JSON applications.

Norm Walsh presents another way to make life easier for JSON people, and others that don’t like to learn XQuery. Corona is an open source project to disclose many MarkLogic features as a REST interface. It allows for responses in both JSON and XML. It provides features like CRUD, and Search, but also allows management of all kinds of indexes and search facets. It also allows you to upload transformations in XSLT and XQuery, that can be applied in later requests.

Final presentation of day one is presented by Steven Pemberton. He talks about the history of XForms, and the new features of XForms 2.0. XForms 1.0 didn’t work out well, but the standard became Turing Complete with XForms 1.1. That proved its value. XForms 2.0 brings support for XPath 2.0, and AVTs. It also supports non-XML data as input, JSON in particular. Steven explains this was easiest to do when JSON was simply mapped to XML. This is not trivial, but doable. He explains which mapping is being used in XForms, which is different from existing ones. The audience makes remarks about the yet-another-JSON-mapping, but Steven explains it is for XForms internal use only. End users don’t need to know about it.

Dinner and demojam..

Most of the attendees attend the social dinner in the Cloister on top of the hill on ‘the other side’ of Prague. Good and plentiful, as is the beer. Around nine, 10 contestants (including me) prepare for the demojam, sponsored by MarkLogic. Norm keeps a strict eye on the clock, as the contestants demo or dance their full 5 minutes. The applaud-o-meter helps the jury come to a verdict. Robert Broersma with “XSLT for hipsters” ties with Gerrit Iemske with “floodit.xsl”. Norm generously grants both an iPad 2.
PS: I’ll blog more details about my demojam app ‘Mark my Tweet’ on my personal blog soon.

Theme of day 2..

The theme of the second day is mostly about newest features in XML standards, very advanced usages of them, extending coverage for XML standards, and bridging between worlds.

Morning sessions..

Sharon Adler is supposed to do the opening of the second day, but unfortunately she can’t make it due to personal health. Instead, Jonathan Robie and Michael Kay get extra time to talk about the current status of various standards. A brief summary:

XPath and XQuery 3.0 are in Last Call. The addition of dynamic function calls, inline functions, windowing in FLWOR, try/catch, higher-order functions and such is mostly known. A new string concatenation operator, support for EQNames, and outer join in FLWOR are new to me. I’m guessing the SQL people will love that concat operator.

XML Schema 1.1 has reached the Proposed Recommendation stage. It has a ton of new possibilities that lift most of the unwanted limitations. It includes conditional type assignment, allowing elements in multiple substitution groups, open content models, and more. Most notable was perhaps the addition of assertions, inspired by Schematron. Personally, that sounds like a feature that could be very popular, but it could turn out to be a chimera as well.

XSLT 3.0 makes slower progress. Its streaming features require a lot of research, not something W3C is intended to do. Apart from streaming it also includes features partly inspired from XQuery, like try/catch, iterate, evaluate, and such. Things I hadn’t yet heard about: matching templates on atomic values, accumulative counting in a for-each, breaking out of it, and packaging. Packaging takes modularization of code a step further, adding more control on visibility and dependencies. There will also be support for maps, and functions that can convert between JSON and maps. The latter sounds like a nice A-team approach to me!

Adam Retter continues with a presentation on RESTful XQuery after the coffee break. He shows that while most XQuery databases allow building RESTful web applications, all of them rely on extensions and implementation-specific strategies. He instead proposes using function annotations based on JSR-311 to control exposure, and letting the XQuery processor take care of the request handling. You would only need to specify the accepted method, the URL pattern (including parameters), and the input/output content types for each function that needs to be exposed. The idea is very well received, and Liam suggests W3C should perhaps pick it up. Yes, please!

Alain Couthures follows with a presentation on supporting XQuery in the browser, by transforming it to JavaScript on the client side using XSLT 1.0. He argues that interpreting it with JavaScript would have been slow, while XSLT processing in the browser is fast. He elaborates on how he is building XQuery support into XSLTForms through XQueryX using YAPP, BNFs, and some uhm.. quite complex XSLT templates. Personally, I would be interested to compare performance with, for instance, Saxon-CE.

Afternoon sessions..

After a good lunch of rice and sauce (or pasta with cheese sauce) – in which I get entangled in a loud discussion about American politics that I allegedly have caused – we continue with vegetables. Evan Lenz presents the idea of a transformation language derived from XQuery, but altered to support expressing template-based processing of content, in an effort to bring the best of both worlds together. He calls his language Carrot, because of the use of the hmm.. caret sign. Code expressed in this language could be supported natively or (in theory) be translated into either XSLT or XQuery. He gives a brief demonstration in which he uses the online Rex parser by Gunther Rademacher to create an XQuery parser for Carrot. From there he transforms the parsed tree into XSLT. The audience seems to like the idea, but some debate the chosen syntax. I could imagine that it feels a bit like yet-another-transform-language to some.

John Snelson, a colleague of Evan, continues on the same topic, but presents a different strategy. Instead of creating a new or meta language for transformations, he suggests using annotated functions. The functions serve as the template bodies; the annotations specify the matching conditions. In a swirling Prezi he also demonstrates that it is possible to use the new XQuery 3.0 function features to implement the matching algorithms, and the before-mentioned Rex parser to create a parser for the match patterns. I'm afraid quite a few in the audience lost track due to the fancy tumbling, sliding, and zooming of his Prezi, but his demo does show how much the expressiveness of XQuery improves with its latest features.

After the last coffee break of this conference, we have just two presentations left. The first is by Charles Foster. He presents his work on XQJ. Contrary to for instance JDBC, XQJ is an API that explicitly bridges between Java and XQuery. He argues that techniques like Hibernate are suboptimal for marshaling complex object structures to flat/tabular relational database structures. The idea of XQJ is that your code talks to a façade. Method invocations get automatically relayed to the other side. You either have a façade at the Java-side, in which case Java-classes and methods are generated from XQuery code. Or a façade at the XQuery-side, in which case XQuery functions are generated from Java code. There are implementations for MarkLogic, eXist and Sedna.

The last talk is presented by Lorenzo Bossi. He talks about the difficulties of maintaining sites like those owned by 7pixel. These include web shops with compare features that contain many items. Maintaining so many items requires a collaborative editing approach, like a wiki. This also includes maintaining the structure or templates behind the items on those sites. Such template changes can easily result in invalid documents. Lorenzo shows how a good update strategy can help: by checking document changes caused by template updates before committing them, problems can be detected at an early stage.

Closing keynote..

The XMLPrague conference is traditionally closed in unparalleled ways by Michael Sperberg-McQueen. Trying to summarize it is daunting, but this article isn’t complete without an attempt:

Michael draws parallels between John Amos Comenius and XML. Comenius was a religious man, the last bishop of the Unity of the Brethren, but exiled. He was proclaimed a founder of modern education, and wrote many books, but his books were burnt, some lost forever. His legacy is still highly valued, but it has to grow on you. The same goes for XML: it is also verbose, and you have to experience it to appreciate it. Religions tend to try to rule out deviations. Some say XML tries to do the same, fails at it, and has therefore failed as a whole. But XML hasn't really failed, nor did Comenius really try to convert other people. Other formats, like many binary formats, have their purpose. And though XML hasn't taken over the world, it is used in more places than people are aware of. The XML formats of Word, Excel, and Open Office are exemplary of that. But XML is used in much less obvious areas as well, like forms for writing speeding tickets in North Carolina.

Michael talks about the NOTATIONs in DTDs. They allow referring to data that isn't in SGML/XML format. It was never the intent to disallow other formats, but to have them coexist. This is why, for instance, media types in HTML are such a success. That pluralism is also seen in NVDL. XML tooling is a different case. XProc, XQuery and XForms do try to achieve universality by including support for non-XML formats like binary and JSON. Michael warns that these attempts might become pigs wearing lipstick, but also sees great value in them. And even though supporting XQuery and transformations through JavaScript or CSS seems far-fetched, why not, if browsers won't support them otherwise and people want them?

Michael explains that Comenius' religion was all about tolerance. Comenius also stood for universal education. He didn't achieve his goals, but such things take a long time. Michael thinks the XML community has been, and is, striving for similar goals, and that XMLPrague helps to get a step closer. These warm thoughts concluded this year's conference.

XMLAmsterdam and more..

For those who couldn’t come to XMLPrague this year, come to:

XMLAmsterdam 2012, September 19th, 2012

The call for papers will open soon. More details via http://twitter.com/xmlamsterdam and http://www.xmlamsterdam.com/!

I also collected a bunch of links to slides, proceedings, photos, and other blogs. In order of appearance:

  • XMLPrague 2012 sessions and slides: http://www.xmlprague.cz/2012/sessions.html
  • XMLPrague 2012 video archive: http://www.xmlprague.cz/2012/files/video-archive-1.html
  • Conference photos taken by Thomas White: http://www.flickr.com/photos/thomas-white/collections/72157629283332045/
  • Detailed notes on the conference by Inigo Surguy: http://67bricks.com/xmlprague2012/xmlprague.html
  • The newest W3C Working Group, XML-ER: http://www.w3.org/community/xml-er/
  • Blog article about the conference by Pieter Masereeuw: http://www.xmlholland.nl/content/congresverslag-xml-praag


Thursday, October 20, 2011

XQuery Novelties Revisited

(This is a translation of my article in the Dutch printed magazine <!ELEMENT.)

The latest news on XQuery [1] was presented by me at the XML Holland conference [2] of 2010. All well, but what is the use of XQuery? And why use XQuery over alternatives, XML-related or not? I'll try to answer these questions in this article, and explain why the (relatively new) extensions to XQuery are so interesting.

What is the use of XQuery?

XQuery [3] stands for XML Query Language [4]. That already tells the essence. It is a language to select subsets and substructures from a large set of XML files. The result can be manipulated into something that is suitable to be used in, for example, a subsequent process, or to show in a web browser. XPath [5] is used a lot in XQuery.


All XML standards have their own scope; I'll name a few. XSLT [6] is a language for transforming XML into some other format. XPointer [7] is an extension of XPath to address nodes more accurately within XML fragments, or even subparts of nodes. XLink [8] is a standard to define relationships. XInclude [9] is a standard to compose multiple pieces of XML into one, using for instance XLink relationships. And XProc [10] is a standard that describes how XML documents should be processed to get to a desired end result; it is itself expressed in XML, and describes the process step by step, also called XML Pipelines. Within XProc you use, among others, the XQuery, XSLT, and XInclude languages (and thus indirectly XPath, XPointer, and XLink as well) to express what exactly needs to be done within each step.

All these standards are tied together. They are related, and depend on each other. The overlap between some of the mentioned XML standards is summarized quite well in the next image that you can also find at W3Schools [11]:



XQuery vs. XSLT

XQuery originally has a rather specific goal: extract XML fragments from a large(r) collection. This is very different from XSLT, which focuses on transforming XML documents into other XML documents, HTML documents, or even documents of other formats.

You would think it should be pretty clear when and why you should use which standard. Yet we often hear the question whether it is best to use XSLT or XQuery. The point is that these two languages, more than the other ones, have considerable overlap. There are many tasks you can do in XSLT that you can also do in XQuery, and vice versa. Although this question is in some ways unjustified and not always important, I'll discuss it in a little more detail below.

If you can tackle something in multiple ways, and both ways do it with similar ease, there is no real reason for rejecting either of the two. Yet you will see that some people prefer XQuery. The syntax of XQuery is much more compact, because it is not expressed in XML as XSLT is. On the other hand, XSLT is based on a different principle, making certain structural changes, for instance, much easier. In this sense, which of the two someone will use for a given task comes down mainly to personal taste and the specific challenges of the task at hand.
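
To give an idea of that compactness: a task like listing book titles sorted by author is a FLWOR expression of a few lines in XQuery, whereas XSLT needs at least a stylesheet wrapper and a template around the equivalent logic. A generic example, not tied to any particular data set:

for $book in doc("books.xml")//book
order by $book/author
return <li>{ string($book/title) }, by { string($book/author) }</li>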


However, XQuery is often used in combination with databases. That affects the balance. Firstly, XSLT fans aren't always the same people who deal with databases, and vice versa; XSLT is more common in the area of document conversions. Secondly, databases entail additional challenges, often of an entirely different order of magnitude. XQuery has extensions that provide help in those areas, but there are no (official) XSLT extensions for that, and there is no real need for them either.

And that is why comparing XQuery and XSLT is so difficult, and therefore usually futile.

XQuery relatively unknown

The fact that XSLT has existed much longer than XQuery also affects the balance. In the beginning people did not have much choice. Later on, people got used to the quickly maturing XSLT, while XQuery remained a working draft for quite some years. The idea for an "XML Query Language" arose along with the emergence of XML, but it took long before it became a W3C Recommendation. XQuery is still relatively new, compared to XSLT and XPath.

One reason for this is that, after the launch of XPath in 1999, people soon became aware that such a language could be largely based on XPath. That resulted in the first Working Drafts of both XQuery 1.0 and XPath 2.0 in 2001. XSLT could, and should, of course also benefit, so the XSLT 2.0 Working Draft was initiated at the same time. The Recommendations of these three were released more or less simultaneously. We are talking about 2007 by then, six years later!

So, XQuery has been a Recommendation only since 2007, while XSLT and XPath have been Recommendations since 1999, and were pretty popular from the start. XQuery is still catching up with XSLT and XPath. In addition, XML was booming business back then; innovations in XML standards have slowed down since, while new ideas like JSON [12] and NoSQL [13] are getting all the attention.

XQuery needed to catch up with XML databases as well. Various kinds of XML databases emerged after the advent of XML, but the idea of a generic query language didn't appear until several years later. The fact that XQuery only reached Recommendation status in recent years slowed broad support in commercial database products in the years before. A few large parties like IBM were involved in XQuery early on; other parties such as Oracle followed only years later. It was likewise with commercial XML databases: there were some early adopters, but most of them preferred to wait and see which way the cat would jump.

Relation with databases

The fact that XQuery is used so often in combination with databases, is no coincidence. It is obvious to want to put large collections of XML in an (XML) database. Databases are designed for large-scale storage and efficient extraction. It fits the purpose of XQuery perfectly.

And that's no coincidence either. XQuery (indirectly) emerged out of database languages like SQL. The first ideas for storing XML in databases arose with the advent of XML. Initially people mainly (ab)used relational databases for this. However, languages like SQL are not equipped to handle XML, so many extensions and variations arose. By the time the XSLT and XPath Recommendations were a fact, people realized that there was a need for a generic query language as well. This resulted in the Quilt [14] language in 2000, which was renamed to XQuery after adoption by the W3C.

The following chart, which I borrowed from the sheets of a curriculum about XML and databases [15] (see ch. 10), shows briefly how various database languages merged into XQuery.


That is why it is no coincidence that XQuery and databases go so well together: XQuery is mainly designed and developed for use with databases. The W3C has explicitly chosen not to limit it to databases only, though, making it more general purpose.

Relation with database functionality

Development around XQuery hasn’t stood still during all those years, though. There are quite a number of extensions to XQuery, which significantly increase the power of XQuery. Part of them find their origin in the application of XQuery to databases.

Ronald Bourret has a very informative website on which XML and databases [16] are discussed elaborately. He mentions some basic features that every database must support. Some of the more important ones are:
  • Efficient storage and extraction
  • (Full Text) Search
  • Transactional updates
  • Data integrity and triggers
  • Parallel processing and access
  • Security and crash recovery
  • Version control of data
Storage is of course inherent to databases. A good database also provides facilities for concurrent access and updates, security, and crash recovery. Extraction is covered by XQuery 1.0, search by the Full Text standard, and updates by the Update Facility standard. And there are extensions for data integrity and versioning as well, though as yet unofficial. More on that in the following part.

Extensions on XQuery

XQuery 1.0 relies on XPath 2.0; it is in fact an extension of it. Even XPath, however powerful itself, has certain limitations. It is a language for addressing substructures; it is not really designed for searching. XQuery itself doesn't provide the right functionality for searching either. It is designed for retrieval and processing. Therefore, an extension to these languages was developed: the "XQuery and XPath Full Text 1.0 [17]" standard, which became a W3C Recommendation [18] in March this year.
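
The heart of that extension is the contains text expression, which adds real text search semantics (stemming, wildcards, proximity, and so on) to XPath and XQuery. A generic illustration, not tied to any particular data set:

//book[abstract contains text "usability" using stemming]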


XQuery is meant for extraction and processing, not for applying changes. Another extension that became a W3C Recommendation in March this year is the "XQuery Update Facility 1.0 [19]" standard. This extension does allow applying (permanent) changes to XML structures.
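
It introduces a handful of updating expressions, for example (generic illustrations):

insert node <reviewed>true</reviewed> as last into doc("book.xml")/book,
replace value of node doc("book.xml")/book/price with 9.99,
delete node doc("book.xml")/book/draft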


Regarding data integrity, an (unofficial) proposal [20] was presented at the XML Prague 2010 conference [21]. This extension allows embedding declarations of data collections, indexes, and data constraints within your XQuery code. Instead of having to mess around with database configurations, these declarations become part of the application code itself. This makes maintenance much easier: all relevant details are gathered in one spot, and within control of the developers themselves. They would not even need to know much about the database that is actually being used.


Versioning is commonly used for Content Management, but is also used for other purposes such as traceability. Another (unofficial) proposal [22] presented at XML Prague 2010 covers versioning. It is a bit technical, and goes quite deep, but it provides some interesting features. According to the Update Facility standard, all mutations are collected in a so-called ‘Pending Update List’. At the end of an updating script, the result of all mutations in that script is committed (stored). This extension describes the idea of preserving all these ‘commit’ moments. To do this effectively, the proposal mentions something called ‘Pending Update List compositions’. These commit moments provide a full history of the XML. Two new XPath ‘axes’ are added, allowing navigation through the full history as an integral part of XPath navigation.


Storing all this versioning data requires a lot of disk space, but disk space is so cheap these days that cost is no longer a problem.

Beyond Scope

But XQuery goes even further. There are currently two extensions that go way beyond database functionality.

The successor of XQuery 1.0 is being developed as we speak: XQuery 1.1, or actually XQuery 3.0 [23], which currently has W3C Working Draft status. This successor adds a number of features that significantly enhance the expressiveness of the language, such as try/catch constructs, output statements, and group by within a for loop. It also allows calling functions dynamically; in other words, functions as a data type. This takes XQuery to a whole new level.
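
A small, contrived taste of what those features look like in XQuery 3.0 syntax (an illustration only, combining inline functions, dynamic calls, group by, and try/catch):

let $double := function($x) { $x * 2 }      (: inline function, bound to a variable :)
return
  try {
    for $n in 1 to 6
    group by $even := ($n mod 2 = 0)        (: group by in a FLWOR :)
    return <group even="{ $even }">{ $n ! $double(.) }</group>
  }
  catch * { "something went wrong: " || $err:description }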

And as if that were not enough, a standard called "XQuery Scripting Extension 1.0 [24]" is being developed as well. This extension adds several new features that almost make it a (procedural) programming language, for instance a while loop, redefinition of variables, and an exit statement. It also builds on top of the XQuery Update Facility standard and allows cumulative (sequential) updates.


All of this makes XQuery very suitable as a ‘scripting’ language, allowing it to compete with languages such as JSP, ASP, and PHP. In fact, when speaking of web applications, it can compete with languages like Java and .Net equally well. It is not for nothing that the W3C states:

“XQuery is replacing proprietary middleware languages and Web Application development languages. XQuery is replacing complex Java or C++ programs with a few lines of code…” http://www.w3.org/XML/Query/ [25]

Note: that is an observation, not an opinion!

Programming Language

XQuery 3.0 and the Scripting Extension lift XQuery to a higher level. They give it the appearance of a real programming language. It is no surprise that the W3C observes that database-specific programming languages are being replaced by XQuery more and more. XQuery is ideally suited as a language for database access, but thanks to these latest enhancements it goes further: XQuery is the glue that can bring all application layers together. It is also powerful enough to support well-known Design Patterns [26] without much trouble. Not only the well-known Model-View-Controller [27] pattern, but also other useful patterns, such as Observer, Strategy, and others [28].

It is easiest to refer to the application that two of my (former) colleagues and I made for a programming contest [29] to show the real power of XQuery. The goal was simple: create an appealing, well put together application on top of XQuery. The result was Socialito [30]: a ‘Social Media Dashboard’, in which tweets and other information from your Twitter account are displayed in a highly organized and customizable manner. The user interface uses HTML and JavaScript (jQuery [31]), but apart from that it uses XQuery exclusively. The data is stored using the XML structure of Twitter itself.

In short, XQuery is no longer just for "querying XML". In XQuery, you can develop application logic and application layers all together. That makes it the core of your entire application. This goes way further than any other XML standard.

Learn more?

Anyone interested in learning more, and keen to see practical applications of XQuery, is kindly invited to sign up for the XML Amsterdam conference on Wednesday the 26th of October at the Regardz Planetarium in Amsterdam. Several open standards will be discussed, and there will be multiple presentations on XQuery.
  1. latest news on XQuery: http://xmlholland.nl/sites/default/files/Geert Josten-XMLHolland2010.pdf
  2. XML Holland conference: http://www.xmlholland.nl/jaarcongres
  3. XQuery: http://www.w3.org/TR/xquery/
  4. XML Query Language: http://www.w3.org/XML/Query/
  5. XPath: http://www.w3.org/TR/xpath20/
  6. XSLT: http://www.w3.org/TR/xslt20/
  7. XPointer: http://www.w3.org/TR/xptr-framework/
  8. XLink: http://www.w3.org/TR/xlink11/
  9. XInclude: http://www.w3.org/TR/xinclude/
  10. XProc: http://www.w3.org/TR/xproc/
  11. W3Schools: https://www.w3schools.com/xml/xpath_intro.asp
  12. JSON: http://en.wikipedia.org/wiki/JSON
  13. NoSQL: http://en.wikipedia.org/wiki/NoSQL
  14. Quilt: http://xml.coverpages.org/quilt_euro.html
  15. XML and databases: http://www.inf.uni-konstanz.de/dbis/teaching/ws0708/xml/
  16. XML and databases: http://www.rpbourret.com/xml/XMLAndDatabases.htm
  17. XQuery and XPath Full Text 1.0: http://www.w3.org/TR/xpath-full-text-10/
  18. W3C Recommendation: http://www.w3.org/TR/
  19. XQuery Update Facility 1.0: http://www.w3.org/TR/xquery-update-10/
  20. proposal: http://www.xmlprague.cz/2010/presentations/Matthias Brantner Extending_XQuery_with_Collections_Indexes_and_Integrity_Constraints.pdf
  21. XML Prague 2010 conference: http://www.xmlprague.cz/2010/index.html
  22. proposal: http://www.xmlprague.cz/2010/sessions.html
  23. XQuery 3.0: http://www.w3.org/TR/xquery-30/
  24. XQuery Scripting Extension 1.0: http://www.w3.org/TR/xquery-sx-10/
  25. http://www.w3.org/XML/Query/: http://www.w3.org/XML/Query/
  26. Design Patterns: http://en.wikipedia.org/wiki/Design_pattern_(computer_science)
  27. Model-View-Controller: http://code.google.com/p/xqmvc/
  28. Observer, Strategy and others: http://patterns.28msec.com/
  29. Programming contest: http://www.28msec.com/contest/results
  30. Socialito: http://socialito.my28msec.com/
  31. JQuery: http://jquery.com/
  32. Personal blog: http://grtjn.blogspot.com/
  33. Company blog: http://www.daidalos.nl/blogs/blog/author/Geert/

About the author

Geert Josten joined Daidalos as an IT consultant in July 2000. His interests are wide, but he is most active as a content engineer with an emphasis on XML and related standards. He has followed the XML standards from the very beginning and actively contributes to the XML community. Geert is also active as a Web and Java developer. Read more articles by him on his personal blog [32] and the company blog [33].

Tuesday, June 28, 2011

Tweet analysis with XQuery: the highlights of #mluc11

You can learn a lot about trends by watching how they evolve and spread. I spent a few words on that in my recent blog article ‘How many tweets are necessary to create a trending topic?’ (sorry, in Dutch). In it, I use XQuery to analyze tweets about the announcement of the upcoming merger of my company with another. In this article I will show the code I used, and apply it to a different set of tweets: (public) tweets mentioning ‘mluc11’. I will apply a similar analysis to discover the highlights of the conference, and its most interesting contributors on Twitter.

The basic idea is quite simple:
  • Gather data
  • Convert to some convenient format
  • Make it searchable
  • Apply some statistics and do calculations
  • Draw graphs and conclusions

Gathering data


Twitter has quite an elaborate API, but the most useful part for this purpose – the search API – falls short. To analyze tweets you need to be able to look back over at least the past month, most likely even longer. Twitter search, however, only returns tweets from the past few days, which limits its usefulness quite a lot.

Twitter search does come with an RSS feed option, though. That is what I used to collect a little less than 600 tweets mentioning ‘mluc11’ by anyone with a public timeline. Add ‘rpp=100’ as a parameter to get the maximum number of items returned per call:


I added this RSS feed a few months ago to the RSS agent I had closest at hand: Microsoft Outlook. Not my personal favorite, but I have to use it anyway (company policy).

Convert to some convenient format


From here I had two options:
  1. Extract the URLs and connect with Twitter API to retrieve the original tweet in XML
  2. Just use the RSS information, and write that as XML

Retrieving the original tweets as XML has the benefit that you can optionally include markers for mentions, urls, and hashtags, and get additional details about them included in the same call as well. However, you need to go through a tricky OAuth process to get access to the API.

For the purpose of this analysis, all necessary information is already available in the RSS information, which I already had at hand. So I decided to skip the hassle of accessing the API and use a bit of VBA code to write my collected RSS feed messages to XML:

Sub writeTweets()
    ' Serializes every RSS message in the current Outlook folder as a <tweet> element
    Dim item As PostItem
    Dim tweets As String
    Dim url As String

    tweets = "<?xml version=""1.0"" encoding=""windows-1252""?>" & vbCrLf
    tweets = tweets & "<tweets>" & vbCrLf
    For Each item In ActiveExplorer.CurrentFolder.Items
        ' The tweet url hides behind the HYPERLINK field at the end of the message body;
        ' "Artikel weergeven..." is the Dutch Outlook label for "View article..."
        url = Replace(Right(item.Body, Len(item.Body) - InStrRev(item.Body, "HYPERLINK") - 10), """Artikel weergeven...", "")

        ' Escape & and < in the subject to keep the output well-formed XML
        tweets = tweets & "<tweet url=""" & url & """><from>" & item.SenderName & "</from><subject>" & Replace(Replace(item.Subject, "&", "&amp;"), "<", "&lt;") & "</subject><stamp>" & item.ReceivedTime & "</stamp></tweet>" & vbCrLf
    Next item
    tweets = tweets & "</tweets>" & vbCrLf

    Call WriteToFile("c:\tmp\tweets.xml", tweets)
End Sub

Sub WriteToFile(path As String, text As String)
    ' Write a string to disk using a plain VBA file handle
    Dim fnum As Long
    fnum = FreeFile()
    Open path For Output As fnum
    Print #fnum, text
    Close #fnum
End Sub

Make sure the mail folder containing the RSS messages is your current folder. Hit Alt + F11 in Microsoft Outlook to open the macro editor, paste the macros in, and press F5 to run the writeTweets macro. The macro produces something like this:

<tweets>
<tweet url="http://twitter.com/jpcs/statuses/58818516685553665">
<from>jpcs (John Snelson)</from>
<subject>RT @StephenBuxton: Spent last 2 weeks reviewing and rehearsing talks for #MLUC11 (San Francisco April 26-29). Great content - it's going to be a great show!</subject>
<stamp>15-4-2011 11:06:41</stamp>
</tweet>
<tweet url="http://twitter.com/jpcs/statuses/58818417511243776">
<from>jpcs (John Snelson)</from>
<subject>RT @SCUEngineering: Interested in non-relational database technologies? "MarkLogic InsideHack 2011" (4/28 - FREE, SF) http://mluc11-insidehack.eventbrite.com/</subject>
<stamp>15-4-2011 11:06:18</stamp>
</tweet>

A link to the full copy of tweets as XML can be found at the end of this article.

Make it searchable


Since I chose to write the RSS information to XML, I am lacking markers for urls, hashtags, and mentions within the tweet text. Moreover, it is worthwhile to apply some additional enrichments to get more out of the analysis. Also, when you execute the above code you will notice that the stamp format coming from VBA does not follow the xs:dateTime format, which is inconvenient. We will fix all of this first; it shouldn’t take much code. Note: the following shows only the most important parts of the code. A link to the full code can be found at the end of this article.

Let’s start with the sender; assume $t holds an individual tweet. The from element ($t/from) contains both the user id and the user’s full name, as can be seen in the earlier XML sample. The user id comes first, with the full name between parentheses. You can separate the two with a bit of regular expression:

let $user-id := lower-case(replace($t/from, '^([^ \(]+) \(([^\)]+)\)', '$1'))
let $user-name := replace($t/from, '^([^ \(]+) \(([^\)]+)\)', '$2')
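For example, applied to the sender value of the first sample tweet, these expressions split it as follows:

replace('jpcs (John Snelson)', '^([^ \(]+) \(([^\)]+)\)', '$1')  (: 'jpcs' :)
replace('jpcs (John Snelson)', '^([^ \(]+) \(([^\)]+)\)', '$2')  (: 'John Snelson' :)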

The timestamp needs reformatting. The stamp follows the Dutch locale, in which the day comes first, the month second, and the year third. The string also needs a T between date and time, and a Z at the end to pretend we care about time zones. It also needs some extra leading zeros. I used the following regular expressions to fix that:

let $stamp := replace($t/stamp, '^(\d)-(\d+)-', '0$1-$2-')
let $stamp := replace($stamp, '^(\d+)-(\d)-', '$1-0$2-')
let $stamp := replace($stamp, ' (\d):', ' 0$1:')
let $stamp := replace($stamp, '^(\d+)-(\d+)-(\d+) (\d+:\d+:\d+)$', '$3-$2-$1T$4Z')
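Wrapped in a small helper (the function name is mine, just for illustration), the same four replaces turn the stamp of the first sample tweet into a proper xs:dateTime string:

declare function local:fix-stamp($stamp as xs:string) as xs:string {
  (: pad day, month and hour with a leading zero, then swap into ISO order :)
  let $stamp := replace($stamp, '^(\d)-(\d+)-', '0$1-$2-')
  let $stamp := replace($stamp, '^(\d+)-(\d)-', '$1-0$2-')
  let $stamp := replace($stamp, ' (\d):', ' 0$1:')
  return replace($stamp, '^(\d+)-(\d+)-(\d+) (\d+:\d+:\d+)$', '$3-$2-$1T$4Z')
};

local:fix-stamp('15-4-2011 11:06:41')  (: returns '2011-04-15T11:06:41Z' :)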

Identifying mentions within the subject takes a bit more effort, but is still relatively easy with the analyze-string function. Matches are wrapped in a user element; non-matches are passed through:

let $subject :=
       for $x in fn:analyze-string($t/subject, '@[a-zA-Z0-9_]+')/*
       let $id := lower-case(substring-after($x/text(), '@'))
       return
               if ($x/self::*:match) then
                      <user id="{$id}">{$x/text()}</user>
               else
                      $x/text()

The same method is used for hashtags and urls, but with slightly different regular expressions of course. Check out the full code listing to see how I fixed those.
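As an illustration, the hashtag variant would look roughly like this, applied to the raw subject for simplicity; the regular expression is my assumption of what the full listing uses:

let $tagged :=
       for $x in fn:analyze-string($t/subject, '#[a-zA-Z0-9_]+')/*
       let $id := lower-case(substring-after($x/text(), '#'))
       return
               if ($x/self::*:match) then
                      <tag id="{$id}">{$x/text()}</tag>
               else
                      $x/text()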

I need to mention one more thing about the urls, though. Urls are usually shortened to save characters. Twitter records the real url as well, but since we rely on the RSS data, we lack that information. I used the MarkLogic Server function xdmp:http-get() to resolve the shortened urls to real urls; other processors likely provide alternatives. It essentially comes down to this line, which falls back to the shortened url in case the HTTP GET fails:

let $url := (try { xdmp:http-get($short-url)//*:location/text() } catch ($ignore) { () }, $short-url)[1]

If you look at the full listing, you will notice that I added more. I implemented a primitive caching mechanism to ensure the code doesn’t resolve the same url more than once. I also preloaded the cache, both to spare you most of the (slow) internet access and because shortened urls tend to stop working after some time.
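Such a cache can be as simple as a map:map that memoizes resolved urls. Below is a minimal sketch under my own assumptions (including the names); it is not the mechanism from the full listing:

declare variable $url-cache := map:map();

declare function local:resolve($short-url as xs:string) as xs:string {
  if (map:contains($url-cache, $short-url)) then
    (: already resolved once, reuse the cached result :)
    map:get($url-cache, $short-url)
  else
    let $long := (
      try { xdmp:http-get($short-url)//*:location/text() } catch ($ignore) { () },
      $short-url
    )[1]
    return (
      map:put($url-cache, $short-url, string($long)),
      string($long)
    )
};

local:resolve('http://mluc11-insidehack.eventbrite.com/')  (: no redirect, so returned as-is :)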

We are almost there. It can be very interesting to make a distinction between tweets, retweets, and replies. I search for the use of ‘RT’ and ‘@’ to do so:

let $is-retweet := matches($t/subject, '^RT ')
let $is-commented-retweet := matches($t/subject, ' RT ')
let $is-reply := matches($t/subject, '^@')

I make one extra distinction: I noticed that different people can tweet identical messages. I suspect tweet buttons on websites are the cause of that. I count the first as a real tweet, and all subsequent ones as a kind of retweet, by marking them as duplicates:

let $is-duplicate := exists($tweets/tweet[$t >> .][subject eq $t/subject])

The above expression checks whether the current tweet $t is preceded by any other tweet ($tweets/tweet[$t >> .]) with an identical subject value (subject eq $t/subject).
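Putting the flags together, the classification can drive the name of the enriched element. The mapping below is my reconstruction based on the output sample further down, not a verbatim quote of the full listing:

let $name :=
       if ($is-duplicate) then 'duplicate'
       else if ($is-retweet) then 'retweet'
       else if ($is-commented-retweet) then 'commented-retweet'
       else if ($is-reply) then 'reply'
       else 'tweet'
return
       element { $name } {
               attribute url { string($t/@url) },
               <from id="{$user-id}" name="{$user-name}">{string($t/from)}</from>,
               <subject>{$subject}</subject>,
               <stamp>{$stamp}</stamp>
       }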

The result you get after these enhancements should look more or less like this:

<tweets>
<retweet url="http://twitter.com/jpcs/statuses/58818516685553665">
<from id="jpcs" name="John Snelson">jpcs (John Snelson)</from>
<subject>RT <user id="stephenbuxton">@StephenBuxton</user>: Spent last 2 weeks reviewing and rehearsing talks for <tag id="mluc11">#MLUC11</tag> (San Francisco April 26-29). Great content - it's going to be a great show!</subject>
<stamp>2011-04-15T11:06:41Z</stamp>
</retweet>
<retweet url="http://twitter.com/jpcs/statuses/58818417511243776">
<from id="jpcs" name="John Snelson">jpcs (John Snelson)</from>
<subject>RT <user id="scuengineering">@SCUEngineering</user>: Interested in non-relational database technologies? "MarkLogic InsideHack 2011" (4/28 - FREE, SF) <url href="http://mluc11-insidehack.eventbrite.com/">http://mluc11-insidehack.eventbrite.com/</url></subject>
<stamp>2011-04-15T11:06:18Z</stamp>
</retweet>


Apply some statistics and do calculations


First, ask yourself what you would like to know about the tweets. Personally, I am interested in two things about the MLUC11 conference:
  • What were the highlights? (according to its tweets)
  • Who is telling the most interesting things about it? (and is most worth following)
Second, these questions need to be translated into something measurable. For instance: the length, volume, start, end, and climax of the trend as a whole. Also: the initiator, the top contributors, tweeters that are influential (have large networks), the tags and urls that were mentioned, and which of them the most. You could even look at geographical aspects of the trend, provided sufficient information about geographical locations is available. Most of it comes down to simply counting, and ordering by count. That really is pretty much it. The ‘RSS’ data is raw, but the enriched data makes the analysis rather easy.

Let’s start with a straightforward question: who contributed the most? It is not about counting Twitter activity alone, but about counting original tweets in particular. This is where the distinction between tweets and retweets comes into play. I classified each tweet into one of five categories before, but will compress that into two again for my analysis:
  1. Tweets: tweets, commented retweets, and replies
  2. Retweets: uncommented retweets, and duplicates
You could argue about those duplicates, but they are not very original anyhow. Let’s not start on that. ;-)

So, to find out who contributed most, we need to take the full list of unique tweeters, loop over them while counting the tweets (and optionally retweets) they sent, order them by tweet count, and take the top n:

let $users := distinct-values($tweets/*/(from/@id | subject/user/@id))
let $users-facet := (
       for $user in $users

       let $user-tweets := $tweets/*[not(self::retweet or self::duplicate)][from/@id = $user]
       let $user-retweets := $tweets/*[self::retweet or self::duplicate][from/@id = $user]

       let $tweet-count := count($user-tweets)
       let $retweet-count := count($user-retweets)
       let $count := $tweet-count + $retweet-count

       order by $tweet-count descending, $retweet-count descending, $user

       return
               <user id="{$user}" count="{$count}" tweets="{$tweet-count}" retweets="{$retweet-count}">@{$user}</user>
)
let $top5-users := $users-facet[1 to 5]

This gives the answer to who contributed most. I call it a facet, since it requires calculations similar to those needed for faceted search. Note: MarkLogic Server has built-in functionality to retrieve such facet information from its indexes, which I chose not to use to keep this article (mostly) engine-independent. It also made it easier for me to fiddle around a bit first. The results are discussed in the next section.
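For comparison, with the enriched tweets loaded into a MarkLogic database and an element-attribute range index on from/@id, a frequency-ordered lexicon lookup gives a similar top 5 straight from the indexes. A minimal sketch, not part of the original listing, and note that it counts all activity rather than original tweets only:

(: assumes a string range index on the id attribute of the from element :)
let $ids := cts:element-attribute-values(xs:QName('from'), xs:QName('id'), (), 'frequency-order')
for $id in $ids[1 to 5]
return <user id="{$id}" count="{cts:frequency($id)}">@{$id}</user>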

Next question: who was most influential? To do this properly, it would be best to analyze the follower network of each user and include that in a calculation of the exposure of all of someone’s tweets. But that would involve the Twitter API again. Besides, a larger network doesn’t guarantee that a larger number of people actually reads the tweets. I therefore prefer to analyze how many people found someone’s tweets interesting. That can be measured quite easily by counting the number of retweets of that person’s tweets. You could also look at the number of times someone is mentioned, which includes not only retweets, but also replies and other kinds of mentions. The code to calculate the top mentions follows the same pattern as the users facet; hardly worth mentioning, though a sketch follows below.
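A sketch of such a mentions facet, shaped after the users facet above (my reconstruction, not the original listing; the full code may use different tie-breaking):

let $mentioned := distinct-values($tweets/*/subject/user/@id)
let $mentions-facet := (
       for $user in $mentioned

       let $in-tweets := $tweets/*[not(self::retweet or self::duplicate)][subject/user/@id = $user]
       let $in-retweets := $tweets/*[self::retweet or self::duplicate][subject/user/@id = $user]

       let $tweet-count := count($in-tweets)
       let $retweet-count := count($in-retweets)
       let $count := $tweet-count + $retweet-count

       order by $count descending, $user

       return
               <user id="{$user}" count="{$count}" tweets="{$tweet-count}" retweets="{$retweet-count}">@{$user}</user>
)
let $top5-mentions := $mentions-facet[1 to 5]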

We continue with topics and sites: which were the most popular (and therefore the most interesting)? The approach is roughly the same. There is one additional catch, though. Some people tend to rave about particular things. So instead of just counting tweets and retweets, I also count the unique senders of both, and use that as the first order-by criterion:

let $urls := distinct-values($tweets/*/subject/url/@full)
let $urls-facet := (
       for $url in $urls

       let $url-tweets := $tweets/*[not(self::retweet or self::duplicate)][subject/url/@full = $url]
       let $url-retweets := $tweets/*[self::retweet or self::duplicate][subject/url/@full = $url]

       let $tweet-sender-count := count(distinct-values($url-tweets/from))
       let $retweet-sender-count := count(distinct-values($url-retweets/from))
       let $sender-count := $tweet-sender-count + $retweet-sender-count

       let $tweet-count := count($url-tweets)
       let $retweet-count := count($url-retweets)
       let $count := $tweet-count + $retweet-count

       order by $sender-count descending, $tweet-count descending, $retweet-count descending, $url

       return
               <url full="{$url}" long="{$url}" org="{$url}" count="{$count}" tweets="{$tweet-count}" retweets="{$retweet-count}" senders="{$sender-count}" tweet-senders="{$tweet-sender-count}" retweet-senders="{$retweet-sender-count}">{$url}</url>
)
let $top5-urls := $urls-facet[1 to 5]

The hashtag approach is essentially identical to the urls approach, so I’ll skip it here. This should be enough to answer which tags and urls were the most popular.

Last question: who brought the most interesting urls and hashtags forward? Finding the answer requires combining facets: take the top n from both, and count how often each tweeter occurs in them. I could also simply have counted how many urls and tags someone tweeted, but this highlights the people who tweeted the urls and tags that the others found most interesting. The code is not much different from the rest; look at the full code listing for the details, or at the sketch below. I excluded the MLUC11 hashtag, since practically all tweets contain it.
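Roughly, the combination could look like the sketch below. It assumes a $tags-facet built the same way as $urls-facet (my naming), and leaves out the exclusion of the mluc11 tag:

let $top-urls := $urls-facet[1 to 5]/@full
let $top-tags := $tags-facet[1 to 5]/@id
let $hot-tweets := $tweets/*[subject/url/@full = $top-urls or subject/tag/@id = $top-tags]
let $contributors-facet := (
       for $user in distinct-values($hot-tweets/from/@id)

       let $user-tweets := $hot-tweets[not(self::retweet or self::duplicate)][from/@id = $user]
       let $user-retweets := $hot-tweets[self::retweet or self::duplicate][from/@id = $user]

       order by count($user-tweets) descending, count($user-retweets) descending, $user

       return
               <user id="{$user}" tweets="{count($user-tweets)}" retweets="{count($user-retweets)}">@{$user}</user>
)
let $top5-contributors := $contributors-facet[1 to 5]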

Draw graphs and conclusions


Now finally, what *are* those highlights of MLUC11, and who *are* the most interesting contributors?

Well, these were the highlights, according to my analysis:

The top 5 urls:
  1. http://mluc11-insidehack.eventbrite.com/
                   (from unique senders: 13, tweets: 7, retweets: 11)
  2. http://blogs.marklogic.com/2011/04/15/followanyday-mluc11-developer-lounge-labs/
                   (from unique senders: 11, tweets: 4, retweets: 7)
  3. http://developer.marklogic.com/events/mluc11-labs-and-lounge-schedule#talks
                   (from unique senders: 9, tweets: 1, retweets: 8)
  4. http://developer.marklogic.com/media/mluc11-talks/XSLT-basedWebsitesOnMarkLogic.pdf
                   (from unique senders: 8, tweets: 1, retweets: 7)
  5. http://newsletter.marklogic.com/2011/04/live-from-mluc11/
                   (from unique senders: 5, tweets: 3, retweets: 2)
The first url refers to a sub-event of the MLUC11 conference, targeted at developers who want to get acquainted with MarkLogic Server, or who would like to ask tricky questions of some of the experts.
The second url is a blog post by Pete Aven in which he looks forward to the MLUC11 conference, mentioning a few highlights and giving a brief description of many ‘followanyday’ ML experts and enthusiasts.
The third url points to the MLUC11 schedule.
The fourth url is one particularly awesome presentation about using XSLT within MarkLogic Server to rapidly develop (dynamic) websites.
The fifth is an official news page from Mark Logic announcing the start of the MLUC11 conference. It contains references to Twitter hashtags and Facebook. To the right there is also an interesting list of ‘related posts’. ;-)

The top 5 hashtags:
  1. #marklogic (from unique senders: 26, tweets: 22, retweets: 16)
  2. #mluc+mluc12 (from unique senders: 7, tweets: 1, retweets: 6)
  3. #followanyday (from unique senders: 6, tweets: 1, retweets: 5)
  4. #tech (from unique senders: 5, tweets: 5, retweets: 6)
  5. #mluc11burrito (from unique senders: 4, tweets: 3, retweets: 1)
The mention of hashtags like #marklogic, #mluc, and #mluc12 is not very surprising. The #followanyday hashtag is used to attract attention; it was used together with the followanyday blog post url. The #tech hashtag is used by non-tech people (I guess), in an attempt to push Mark Logic into a particular category. The #mluc11burrito hashtag was used to bring mluc11 visitors together for a burrito dinner.

The top 5 tweeters:
  1. @mdubinko (sent tweets: 71, retweets: 1)
  2. @peteaven (sent tweets: 48, retweets: 8)
  3. @lisabos (sent tweets: 34, retweets: 1)
  4. @mattlyles (sent tweets: 14, retweets: 1)
  5. @ronhitchens (sent tweets: 13, retweets: 1)
The top 5 mentions:
  1. @mdubinko (in tweets: 6, retweets: 38)
  2. @marklogic (in tweets: 15, retweets: 20)
  3. @peteaven (in tweets: 7, retweets: 28)
  4. @hunterhacker (in tweets: 11, retweets: 5)
  5. @lisabos (in tweets: 2, retweets: 9)
The top tweeters and mentions are for the most part no surprise. I had already predicted in one of my tweets that Micah Dubinko had been retweeted the most. Just keep in mind that he also tweeted the most about MLUC11, by far. Pete Aven and Lisa Bos tweeted a lot too, so no surprise to see them in the mentions top 5 as well. Whoever is behind the MarkLogic account did well, and got mentioned second most with ‘just’ 12 tweets and 10 retweets. I tweeted more than mattlyles and ronhitchens, but most of mine were retweets (by far), while those two made quite a number of original statements of their own. That is why they rank higher than I do. Last but not least: Jason Hunter, aka hunterhacker, one of the experts that also spoke at Mark Logic, got mentioned quite a lot. But that is what you hope to achieve when you are a speaker, right? I'd say he deserves to be in the top 5!

Last but not least, the top 5 most prominent contributors (based on top 5 urls and hashtags):
  1. @marklogic (tweets: 7, retweets: 3)
  2. @dscape (tweets: 5, retweets: 6)
  3. @eedeebee (tweets: 4, retweets: 2)
  4. @peteaven (tweets: 3, retweets: 4)
  5. @contentnation (tweets: 3, retweets: 2)
The MarkLogic account obviously scores high, as it should. Nuno Job (aka @dscape), Eric Bloch (aka @eedeebee), and Pete Aven are doing well as Mark Logic experts. John Blossom (aka Content Nation) was very fond of the #tech hashtag; I'm not sure, but he could have been the sole contributor to that hashtag.

To be honest, this wasn't my most brilliant analysis. Changing the sort order can make quite a difference, and including a larger top n of urls and tags has a large influence on the contributors top 5 as well. But making a brilliant analysis wasn’t really my point; I just hope you enjoy the bits of code I shared.

Don't forget to download the full code listing. It adds some interactivity as well!

        https://raw.github.com/grtjn/utilities/master/analyze-tweets.xqy