Thursday, October 20, 2011

XQuery Novelties Revisited

(This is a translation of my article in the Dutch printed magazine <!ELEMENT.)

I presented the latest news on XQuery [1] at the 2010 XML Holland conference [2]. All well and good, but what is the use of XQuery? And why use XQuery rather than the alternatives, XML-related or not? I’ll try to answer these questions in this article, and explain why the (relatively new) extensions to XQuery are so interesting.

What is the use of XQuery?

XQuery [3] stands for XML Query Language [4]. That name already captures the essence: it is a language for selecting subsets and substructures from a large set of XML files. The result can be manipulated into something that is suitable for use in, for example, a subsequent process, or for display in a web browser. XPath [5] plays a large role within XQuery.
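
To give a first impression, here is a minimal sketch of an XQuery query; the collection name and element names are made up for illustration.

xquery version "1.0";

(: select cheap books from a (hypothetical) collection and return a small summary :)
for $book in collection("books")/book
where xs:decimal($book/price) lt 20
order by $book/title
return
  <cheap-book title="{$book/title}" price="{$book/price}"/>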


All XML standards have their own scope; I’ll name a few. XSLT [6] is a language for transforming XML into some other format. XPointer [7] is an extension of XPath to address nodes more accurately within XML fragments, or even subparts of nodes. XLink [8] is a standard for defining relationships. XInclude [9] is a standard for composing multiple pieces of XML into one, using for instance XLink relationships. And XProc [10] is a standard for describing how XML documents should be processed to reach a desired end result. It is itself expressed in XML, and describes the process step by step; such descriptions are also called XML pipelines. Within XProc you use, among others, the XQuery, XSLT, and XInclude languages (and thus indirectly XPath, XPointer, and XLink as well) to express exactly what needs to be done within each step.

All these standards are tied together. They are related, and depend on each other. The overlap between some of the mentioned XML standards is summarized quite well in the following image, which you can also find at W3Schools [11]:



XQuery vs. XSLT

XQuery originally had a rather specific goal: extract XML fragments from a large(r) collection. This is very different from XSLT, which focuses on transforming XML documents into other XML documents, HTML documents, or even documents in other formats.

You would think it should be pretty clear when and why you should use which standard. Yet we often hear the question whether it is best to use XSLT, or best to use XQuery. The point is that these two languages, more than the other ones, have considerable overlap. There are many tasks you can do in XSLT that you can also do in XQuery, and vice versa. Although this question is in some ways unjustified and not always important, I'll discuss it in a little more detail below.

If you can tackle something in multiple ways, and both ways do it with similar ease, there is no real reason for rejecting either of the two. Yet you will see that some people prefer XQuery. The syntax of XQuery is much more compact, because it is not expressed in XML as XSLT is. On the other hand, XSLT is based on a different principle, which makes certain structural changes, for instance, much easier. In this sense it mainly comes down to personal taste and the specific challenges of the task at hand which of the two someone will use for a given task.


However, XQuery is often used in combination with databases. That affects the balance. Firstly, XSLT fans aren’t always the same people who deal with databases, and vice versa; XSLT is more common in the area of document conversions. Secondly, databases entail additional challenges, often of an entirely different order of magnitude. XQuery has extensions that provide help in those areas, while there are no (official) XSLT extensions for this, and no real need for them either.

And that is why comparing XQuery and XSLT is so difficult, and therefore usually futile.

XQuery relatively unknown

The fact that XSLT has existed much longer than XQuery also affects the balance. In the beginning, people did not have much choice. Later on, people got used to the quickly maturing XSLT, while XQuery remained a Working Draft for quite a few years. The idea for an "XML Query Language" arose along with the emergence of XML, but it took a long time before it became a W3C Recommendation. XQuery is still relatively new compared to XSLT and XPath.

One reason for this is that, after the launch of XPath in 1999, people soon became aware that such a language could be largely based on XPath. That resulted in the first Working Drafts of both XQuery 1.0 and XPath 2.0 in 2001. XSLT could, and should, of course also benefit; the XSLT 2.0 Working Draft was initiated at the same time. The Recommendations of these three were released more or less simultaneously. By then we are talking about 2007, that is, six years later!

So, XQuery has been a Recommendation only since 2007, while XSLT and XPath have been Recommendations since 1999 and were pretty popular from the start. XQuery is still catching up on XSLT and XPath. In addition, XML was booming business back then. Innovations in XML standards have since slowed down, while new ideas like JSON [12] and NoSQL [13] are getting all the attention.

XQuery has had to catch up in the XML database world as well. Various kinds of XML databases emerged after the advent of XML, but the idea of a generic query language didn’t appear until several years later. The fact that XQuery reached Recommendation status only in recent years slowed broad support in commercial database products in the years before. A few large parties like IBM were involved in XQuery early on; other parties such as Oracle followed only years later. It was the same with commercial XML databases: there were some early adopters, but most of them preferred to wait and see which way the cat would jump.

Relation with databases

The fact that XQuery is used so often in combination with databases is no coincidence. It is natural to want to put large collections of XML in an (XML) database. Databases are designed for large-scale storage and efficient extraction. That fits the purpose of XQuery perfectly.

And that's no coincidence either. XQuery (indirectly) emerged out of database languages like SQL. The first ideas for storing XML in databases arose with the advent of XML. Initially people mainly (ab)used relational databases. However, languages like SQL are not equipped to handle XML, so many extensions and variations arose. By the time the XSLT and XPath Recommendations were a fact, people realized that there was a need for a generic query language as well. This resulted in the Quilt [14] language in 2000, which was renamed to XQuery after adoption by the W3C.

The following chart, which I borrowed from the slides of a course about XML and databases [15] (see ch. 10), briefly shows how various database languages merged into XQuery.


That is why it is no coincidence that XQuery and databases go so well together: XQuery was mainly designed and developed for use with databases. The W3C explicitly chose not to limit it to databases only, making it more general purpose.

Relation with database functionality

Development around XQuery hasn’t stood still during all those years, though. There are quite a number of extensions to XQuery, which significantly increase its power. Some of them find their origin in the application of XQuery to databases.

Ronald Bourret has a very informative website on which XML and databases [16] are discussed in depth. He mentions some basic features that every database must support. Some of the more important ones are:
  • Efficient storage and extraction
  • (Full Text) Search
  • Transactional updates
  • Data integrity and triggers
  • Parallel processing and access
  • Security and crash recovery
  • Version control of data
Storage is of course inherent to databases. A good database also provides facilities for concurrent access and updates, security, and crash recovery. Extraction is covered by XQuery 1.0, search by the Full Text standard, and updates by the Update Facility standard. And there are extensions for data integrity and versioning as well, though still unofficial. More on that in the following part.

Extensions on XQuery

XQuery 1.0 relies on XPath 2.0; it is in fact an extension of it. Even XPath, however powerful in itself, has certain limitations. It is a language for addressing substructures; it is not really designed for searching. XQuery itself doesn’t provide the right functionality for searching either: it is designed for retrieval and processing. Therefore, an extension to these languages was developed: the "XQuery and XPath Full Text 1.0 [17]" standard, which became a W3C Recommendation [18] in March this year.
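
To give an impression, here is a minimal sketch of what the Full Text extension adds: a contains text condition with a match option, and relevance scoring. The collection and element names are made up.

for $article score $relevance in
    collection("articles")/article[. contains text "xml database" using stemming]
order by $relevance descending
return $article/title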


XQuery is meant for extraction and processing, not for applying changes. Another extension which became a W3C Recommendation in March this year is the "XQuery Update Facility 1.0 [19]" standard. This is an extension that does allow applying (permanent) changes to XML structures.
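
A minimal sketch of what such an updating query looks like (the document URI and element names are made up):

(: add a status element and correct a title, applied as one set of updates :)
insert node <status>reviewed</status> as last into doc("article.xml")/article,
replace value of node doc("article.xml")/article/title with "XQuery Novelties Revisited"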


Regarding data integrity, an (unofficial) proposal [20] was presented at the XML Prague 2010 conference [21]. This extension allows embedding declarations of data collections, indexes, and data constraints within your XQuery code. Instead of having to mess around with database configurations, these declarations become part of the application code itself. This makes maintenance much easier: all relevant details are gathered in one spot, under the control of the developers themselves. They would not even need to know much about the database that is actually being used.


Versioning is commonly used for Content Management, but is also used for other purposes such as traceability. Another (unofficial) proposal [22] presented at XML Prague 2010 covers versioning. It is a bit technical, and goes quite deep, but it provides some interesting features. According to the Update Facility standard, all mutations are collected in a so-called ‘Pending Update List’. At the end of an updating script, the results of all mutations in that script are committed (stored). This extension describes the idea of preserving all these ‘commit’ moments. To do this effectively, the proposal mentions something called ‘Pending Update List compositions’. These commit moments provide a full history of the XML. Two new XPath ‘axes’ are added, allowing navigation through the full history as an integral part of XPath navigation.


Storing all this versioning data requires a lot of disk space, but disk space is so cheap these days that cost is no longer a real problem.

Beyond Scope

But XQuery goes even further. There are currently two extensions that go way beyond database functionality.

The successor of XQuery 1.0 is being developed as we speak: XQuery 1.1, or actually XQuery 3.0 [23], which currently has the W3C Working Draft status. This successor adds a number of features that significantly enhance the expressiveness, such as try/catch constructs, output statements, and group by within a FLWOR expression. It also allows calling functions dynamically, in other words: functions as a data type. This takes XQuery to a whole new level.
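
A small sketch of some of these features: an inline function passed around as a value, group by, and try/catch. The syntax follows the Working Draft; the data and the function are made up.

xquery version "3.0";

(: an inline function, treated as a value and invoked dynamically :)
let $format := function($name as xs:string, $count as xs:integer) as xs:string
  { concat($name, " (", $count, ")") }
return
  try {
    for $tweet in collection("tweets")/tweet
    let $user := string($tweet/from)
    group by $user
    order by count($tweet) descending
    return $format($user, count($tweet))
  } catch * {
    "something went wrong"
  }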

And as if that were not enough, a standard called "XQuery Scripting Extension 1.0 [24]" is being developed as well. This extension adds several new features that almost make it a (procedural) programming language, for instance: a while loop, assignment of variables, and an exit statement. It also builds on top of the XQuery Update Facility standard and allows cumulative (sequential) updates.


All of this makes XQuery very suitable as a ‘scripting’ language, allowing it to compete with languages such as JSP, ASP, and PHP. In fact, when it comes to web applications it can compete with languages like Java and .NET equally well. It is not for nothing that the W3C states:

“XQuery is replacing proprietary middleware languages and Web Application development languages. XQuery is replacing complex Java or C++ programs with a few lines of code…” http://www.w3.org/XML/Query/ [25]

Note: that is an observation, not an opinion!

Programming Language

XQuery 3.0 and the Scripting Extension lift XQuery to a higher level: they give it the appearance of a real programming language. It is no surprise that the W3C states that database-specific programming languages are being replaced by XQuery more and more. XQuery is ideally suited as a language for database access, but thanks to these latest enhancements it goes further. XQuery is the glue that can bring all application layers together. It is also powerful enough to support well-known Design Patterns [26] without much trouble: not only the well-known Model-View-Controller [27] pattern, but also other useful patterns, such as Observer, Strategy, and others [28].

To show the real power of XQuery, it is easiest to point to the application that two of my (former) colleagues and I made for a programming contest [29]. The goal was simple: create an appealing, well put together application built on XQuery. The result was Socialito [30]: a ‘Social Media Dashboard’ in which tweets and other information from your Twitter account are displayed in a highly organized and customizable manner. The user interface uses HTML and JavaScript (jQuery [31]), but apart from that it uses XQuery exclusively. The data is stored using Twitter’s own XML structure.

In short, XQuery is no longer just for "querying XML". In XQuery you can develop application logic and application layers all together, which makes it the core of your entire application. This goes way further than any other XML standard.

Learn more?

Anyone interested in learning more, and keen to see practical applications of XQuery, is kindly invited to sign up for the XML Amsterdam conference on Wednesday, October 26th, at the Regardz Planetarium in Amsterdam. Several open standards will be discussed, and there will be multiple presentations on XQuery.
  1. latest news on XQuery: http://xmlholland.nl/sites/default/files/Geert Josten-XMLHolland2010.pdf
  2. XML Holland conference: http://www.xmlholland.nl/jaarcongres
  3. XQuery: http://www.w3.org/TR/xquery/
  4. XML Query Language: http://www.w3.org/XML/Query/
  5. XPath: http://www.w3.org/TR/xpath20/
  6. XSLT: http://www.w3.org/TR/xslt20/
  7. XPointer: http://www.w3.org/TR/xptr-framework/
  8. XLink: http://www.w3.org/TR/xlink11/
  9. XInclude: http://www.w3.org/TR/xinclude/
  10. XProc: http://www.w3.org/TR/xproc/
  11. W3Schools: https://www.w3schools.com/xml/xpath_intro.asp
  12. JSON: http://en.wikipedia.org/wiki/JSON
  13. NoSQL: http://en.wikipedia.org/wiki/NoSQL
  14. Quilt: http://xml.coverpages.org/quilt_euro.html
  15. XML and databases: http://www.inf.uni-konstanz.de/dbis/teaching/ws0708/xml/
  16. XML and databases: http://www.rpbourret.com/xml/XMLAndDatabases.htm
  17. XQuery and XPath Full Text 1.0: http://www.w3.org/TR/xpath-full-text-10/
  18. W3C Recommendation: http://www.w3.org/TR/
  19. XQuery Update Facility 1.0: http://www.w3.org/TR/xquery-update-10/
  20. proposal: http://www.xmlprague.cz/2010/presentations/Matthias Brantner Extending_XQuery_with_Collections_Indexes_and_Integrity_Constraints.pdf
  21. XML Prague 2010 conference: http://www.xmlprague.cz/2010/index.html
  22. proposal: http://www.xmlprague.cz/2010/sessions.html
  23. XQuery 3.0: http://www.w3.org/TR/xquery-30/
  24. XQuery Scripting Extension 1.0: http://www.w3.org/TR/xquery-sx-10/
  25. http://www.w3.org/XML/Query/: http://www.w3.org/XML/Query/
  26. Design Patterns: http://en.wikipedia.org/wiki/Design_pattern_(computer_science)
  27. Model-View-Controller: http://code.google.com/p/xqmvc/
  28. Observer, Strategy and others: http://patterns.28msec.com/
  29. Programming contest: http://www.28msec.com/contest/results
  30. Socialito: http://socialito.my28msec.com/
  31. JQuery: http://jquery.com/
  32. Personal blog: http://grtjn.blogspot.com/
  33. Company blog: http://www.daidalos.nl/blogs/blog/author/Geert/

About the author

Geert Josten joined Daidalos as an IT consultant in July 2000. His interests are wide-ranging, but he is most active as a content engineer with an emphasis on XML and related standards. He has followed the XML standards from the very beginning and actively contributes to the XML community. Geert is also active as a web and Java developer. Read more articles by him on his personal blog [32] and the company blog [33].

Tuesday, June 28, 2011

Tweet analysis with XQuery: the highlights of #mluc11

You can learn a lot about trends by watching how they evolve and spread. I wrote a few words about that in my recent blog article ‘How many tweets are necessary to create a trending topic?’ (sorry, in Dutch). There, I use XQuery to analyze tweets about the announcement of the upcoming merger of my company with another. In this article I will show the code I used, and apply it to a different set of tweets: (public) tweets mentioning ‘mluc11’. I will apply a similar analysis to discover the highlights of the conference, and its most interesting contributors on Twitter.

The basic idea is quite simple:
  • Gather data
  • Convert to some convenient format
  • Make it searchable
  • Apply some statistics and do calculations
  • Draw graphs and conclusions

Gathering data


Twitter has quite an elaborate API, but the most useful part for this purpose, the Search API, falls short. To analyze tweets you need to be able to look back over at least the past month, most likely even longer. Twitter search, however, only returns tweets from the past few days, which limits its usefulness quite a lot.

Twitter search does come with an RSS feed option, though. That is what I used to collect a little less than 600 tweets mentioning ‘mluc11’ by anyone with a public timeline. Add ‘rpp=100’ as a parameter to get the maximum number of items returned per call:


I added this RSS feed a few months ago to the RSS agent I had closest at hand: Microsoft Outlook. Not my personal favorite, but I have to use it anyway (company policy).

Convert to some convenient format


From here I had two options:
  1. Extract the URLs and connect with Twitter API to retrieve the original tweet in XML
  2. Just use the RSS information, and write that as XML

Retrieving the original tweets as XML has the benefit that you can optionally include markers for mentions, urls, hashtags, and get additional details about them included in the same call as well. You need to go through a tricky OAuth process to get access to the API, however.

For the purpose of the analysis, all necessary information is already available in the RSS data, which I already had at hand. So, I decided to skip the hassle of accessing the API and use a bit of VBA code to write my collected RSS feed messages to XML:

Sub writeTweets()
    Dim item As PostItem
    Dim tweets As String
    Dim url As String
    Dim stamp As String
   
    tweets = "<?xml version=""1.0"" encoding=""windows-1252""?>" & vbCrLf
    tweets = tweets & "<tweets>" & vbCrLf
    For Each item In ActiveExplorer.CurrentFolder.Items
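        ' The RSS message body ends with "HYPERLINK <tweet url>" followed by the Dutch
        ' link text "Artikel weergeven..." ("View article..."); strip both to keep the bare url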
        url = Replace(Right(item.Body, Len(item.Body) - InStrRev(item.Body, "HYPERLINK") - 10), """Artikel weergeven...", "")

        tweets = tweets & "<tweet url=""" & url & """><from>" & item.SenderName & "</from><subject>" & Replace(Replace(item.Subject, "&", "&amp;"), "<", "&lt;") & "</subject><stamp>" & item.ReceivedTime & "</stamp></tweet>" & vbCrLf
    Next item
    tweets = tweets & "</tweets>" & vbCrLf
   
    Call WriteToFile("c:\tmp\tweets.xml", tweets)
End Sub

Sub WriteToFile(path As String, text As String)
    Dim fnum As Long
    fnum = FreeFile()
    Open path For Output As fnum
    Print #fnum, text
    Close #fnum
End Sub

Make sure the mail folder containing the RSS messages is your current folder. Hit Alt + F11 in Microsoft Outlook to open the macro editor, paste the macros in, and use F5 to run the writeTweets macro. The macro results in something like this:

<tweets>
<tweet url="http://twitter.com/jpcs/statuses/58818516685553665">
<from>jpcs (John Snelson)</from>
<subject>RT @StephenBuxton: Spent last 2 weeks reviewing and rehearsing talks for #MLUC11 (San Francisco April 26-29). Great content - it's going to be a great show!</subject>
<stamp>15-4-2011 11:06:41</stamp>
</tweet>
<tweet url="http://twitter.com/jpcs/statuses/58818417511243776">
<from>jpcs (John Snelson)</from>
<subject>RT @SCUEngineering: Interested in non-relational database technologies? "MarkLogic InsideHack 2011" (4/28 - FREE, SF) http://mluc11-insidehack.eventbrite.com/</subject>
<stamp>15-4-2011 11:06:18</stamp>
</tweet>

A link to the full copy of tweets as XML can be found at the end of this article.

Make it searchable


Since I chose to write the RSS information to XML, I lack markers for urls, hashtags, and mentions within the tweet text. Moreover, it is worthwhile to apply some additional enrichments to get more out of the analysis. Also, when you execute the above code you will notice that the stamp format coming from VBA does not follow the xs:dateTime format, which is inconvenient. We will fix all of this first. It shouldn’t take much code. Note: the following part shows only the most important parts of the code. A link to the full code can be found at the end of this article.

Let’s start with the sender; assume $t holds an individual tweet. The from element ($t/from) contains both the user id and the user’s full name, as can be seen in the earlier XML sample. The user id comes first, the full name follows between parentheses. You can separate the two with a bit of regular expression:

let $user-id := lower-case(replace($t/from, '^([^ \(]+) \(([^\)]+)\)', '$1'))
let $user-name := replace($t/from, '^([^ \(]+) \(([^\)]+)\)', '$2')

The timestamp needs reformatting. The stamp follows Dutch localization in which day comes first, month second, year third. The string also needs a T between date and time, and a Z at the end to pretend we care about time zones. It also needs some extra leading zeros. I used the following regular expressions to fix that:

let $stamp := replace($t/stamp, '^(\d)-(\d+)-', '0$1-$2-')
let $stamp := replace($stamp, '^(\d+)-(\d)-', '$1-0$2-')
let $stamp := replace($stamp, ' (\d):', ' 0$1:')
let $stamp := replace($stamp, '^(\d+)-(\d+)-(\d+) (\d+:\d+:\d+)$', '$3-$2-$1T$4Z')

Identifying mentions within the subject takes a bit more effort, but is still relatively easy with the analyze-string function. Matches are wrapped in a user element, non-matches are passed through:

let $subject :=
       for $x in fn:analyze-string($t/subject, '@[a-zA-Z0-9_]+')/*
       let $id := lower-case(substring-after($x/text(), '@'))
       return
               if ($x/self::*:match) then
                      <user id="{$id}">{$x/text()}</user>
               else
                      $x/text()

The same method is used for hashtags and urls, but with slightly different regular expressions of course. Check out the full code listing to see how I fixed those.

I need to mention one more bit about the urls, though. Urls are usually shortened to save characters. Twitter records the real url as well, but since we rely on the RSS data, we lack that information. I used the MarkLogic Server function xdmp:http-get() to resolve the shortened urls to real urls (other processors likely provide alternatives). It essentially comes down to this line, which falls back to the shortened url in case the HTTP GET fails:

let $url := (try { xdmp:http-get($short-url)//*:location/text() } catch ($ignore) { () }, $short-url)[1]

If you look at the full listing, you will notice that I added more. I implemented a primitive caching mechanism to ensure the code doesn’t resolve the same url more than once. I also preloaded the cache, to save you from most of the (slow) internet access, and because shortened urls tend to stop working after some time.

We are almost there. It can be very interesting to make a distinction between tweets, retweets, and replies. I search for the use of ‘RT’ and ‘@’ to do so:

let $is-retweet := matches($t/subject, '^RT ')
let $is-commented-retweet := matches($t/subject, ' RT ')
let $is-reply := matches($t/subject, '^@')

I make one extra distinction: I noticed that different people can tweet identical messages. I suspect tweet buttons on websites are the cause of that. I count the first as a real tweet, and all subsequent ones as a kind of retweet, by marking them as duplicates:

let $is-duplicate := exists($tweets/tweet[$t >> .][subject eq $t/subject])

The above expression checks whether the current tweet $t is preceded by any other tweet ($tweets/tweet[$t >> .]) with an identical subject value (subject eq $t/subject).

The result you get after these enhancements should look more or less like this:

<tweets>
<retweet url="http://twitter.com/jpcs/statuses/58818516685553665">
<from id="jpcs" name="John Snelson">jpcs (John Snelson)</from>
<subject>RT <user id="stephenbuxton">@StephenBuxton</user>: Spent last 2 weeks reviewing and rehearsing talks for <tag id="mluc11">#MLUC11</tag> (San Francisco April 26-29). Great content - it's going to be a great show!</subject>
<stamp>2011-04-15T11:06:41Z</stamp>
</retweet>
<retweet url="http://twitter.com/jpcs/statuses/58818417511243776">
<from id="jpcs" name="John Snelson">jpcs (John Snelson)</from>
<subject>RT <user id="scuengineering">@SCUEngineering</user>: Interested in non-relational database technologies? "MarkLogic InsideHack 2011" (4/28 - FREE, SF) <url href="http://mluc11-insidehack.eventbrite.com/">http://mluc11-insidehack.eventbrite.com/</url></subject>
<stamp>2011-04-15T11:06:18Z</stamp>
</retweet>


Apply some statistics and do calculations


First, ask yourself what you would like to know about the tweets. Personally, I am interested in two things about the MLUC11 conference:
  • What were the highlights? (according to its tweets)
  • Who is telling the most interesting things about it? (and is most worth following)
Second, these questions need to be translated into something measurable. For instance: length, volume, start, end, and climax of the trend as a whole. Also: the initiator, top contributors, tweeters that are influential (have large networks), the tags and urls that were mentioned, and which of them the most. You could even look at geographical aspects of the trend, provided sufficient information about geographical locations is available. Most of it comes down to simply counting, and ordering by count. That really is pretty much it. The ‘RSS’ data is raw, but the enriched data makes the analysis rather easy.

Let’s start with a straightforward question: who contributed the most? It is not about counting Twitter activity alone, but about counting original tweets in particular. This is where the distinction between tweets and retweets comes into play. I classified each tweet into one of five categories before, but will compress that into two again for my analysis:
  1. Tweets: tweets, commented retweets, and replies
  2. Retweets: uncommented retweets, and duplicates
You could argue about those duplicates, but they are not very original anyhow. Let’s not start on that. ;-)

So, to find out who contributed most, we need to take the full list of unique tweeters, loop over them while counting the tweets (and optionally retweets) sent by them, order them by tweet count, and take the top n:

let $users := distinct-values($tweets/*/(from/@id | subject/user/@id))
let $users-facet := (
       for $user in $users

       let $user-tweets := $tweets/*[not(self::retweet or self::duplicate)][from/@id = $user]
       let $user-retweets := $tweets/*[self::retweet or self::duplicate][from/@id = $user]

       let $tweet-count := count($user-tweets)
       let $retweet-count := count($user-retweets)
       let $count := $tweet-count + $retweet-count

       order by $tweet-count descending, $retweet-count descending, $user

       return
               <user id="{$user}" count="{$count}" tweets="{$tweet-count}" retweets="{$retweet-count}">@{$user}</user>
)
let $top5-users := $users-facet[position() = 1 to 5]

This will give the answer to who contributed most. I call it a facet, since it requires calculations similar to those needed for faceted searching. Note: MarkLogic Server has built-in functionality to retrieve such facet information from its indexes, which I chose not to use to keep this article (mostly) engine-independent. It also made it easier for me to fiddle around a bit first. Results will be discussed in the next section.

Next question: who was most influential? To do this properly, it would be best to analyze the follower network of each user and include that in a calculation of the exposure of all of someone’s tweets. But that would involve the Twitter API again. Besides, a larger network doesn’t guarantee that more people actually read the tweets. I therefore prefer to analyze how many people found someone’s tweets interesting. That can be measured quite easily by counting the number of retweets of that person’s tweets. You could also look at the number of times someone is mentioned, which includes not only retweets, but also replies and other kinds of mentions. The code to calculate the top mentions follows the same pattern as the users facet, so it is hardly worth showing here.

We continue with topics and sites: which were the most popular (and therefore interesting)? The approach is roughly the same. There is one additional catch, though. Some people tend to rave about particular things. So instead of just counting tweets and retweets, I also count the unique senders of both, and use that as the first order-by criterion:

let $urls := distinct-values($tweets/*/subject/url/@full)
let $urls-facet := (
       for $url in $urls

       let $url-tweets := $tweets/*[not(self::retweet or self::duplicate)][subject/url/@full = $url]
       let $url-retweets := $tweets/*[self::retweet or self::duplicate][subject/url/@full = $url]

       let $tweet-sender-count := count(distinct-values($url-tweets/from))
       let $retweet-sender-count := count(distinct-values($url-retweets/from))
       let $sender-count := $tweet-sender-count + $retweet-sender-count

       let $tweet-count := count($url-tweets)
       let $retweet-count := count($url-retweets)
       let $count := $tweet-count + $retweet-count

       order by $sender-count descending, $tweet-count descending, $retweet-count descending, $url

       return
               <url full="{$url}" long="{$url}" org="{$url}" count="{$count}" tweets="{$tweet-count}" retweets="{$retweet-count}" senders="{$sender-count}" tweet-senders="{$tweet-sender-count}" retweet-senders="{$retweet-sender-count}">{$url}</url>
)
let $top5-urls := $urls-facet[position() = 1 to 5]

The hashtag approach is essentially identical to the url approach, so I’ll skip it. This should be enough to answer the question of which tags and urls were the most popular.

Last question: who brought the most interesting urls and hashtags forward? Finding the answer to this requires combining facets: taking the top n from both and counting the occurrences of the tweeters. I could also just have counted how many urls and tags someone tweeted, but this highlights the people who tweeted the urls and tags that the others found most interesting. The code is not much different from the rest; a rough sketch follows below, and the full code listing has the details. I excluded the MLUC11 hashtag, since practically all tweets contain it.
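
The sketch below shows the idea, assuming a $tags-facet built the same way as the $urls-facet above (so treat the element and attribute names as assumptions):

let $top-urls := $urls-facet[position() = 1 to 10]/@full
let $top-tags := $tags-facet[position() = 1 to 10][@id ne 'mluc11']/@id
let $contributors-facet := (
       for $user in distinct-values($tweets/*/from/@id)

       (: tweets by this user that mention one of the top urls or top tags :)
       let $hits := $tweets/*[from/@id = $user]
                             [subject/url/@full = $top-urls or subject/tag/@id = $top-tags]

       let $tweet-count := count($hits[not(self::retweet or self::duplicate)])
       let $retweet-count := count($hits[self::retweet or self::duplicate])

       order by $tweet-count descending, $retweet-count descending, $user

       return
               <user id="{$user}" tweets="{$tweet-count}" retweets="{$retweet-count}">@{$user}</user>
)
let $top5-contributors := $contributors-facet[position() = 1 to 5]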

Draw graphs and conclusions


Now finally, what *are* those highlights of MLUC11, and who *are* the most interesting contributors?

Well, these were the highlights, according to my analysis:

The top 5 urls:
  1. http://mluc11-insidehack.eventbrite.com/
                   (from unique senders: 13, tweets: 7, retweets: 11)
  2. http://blogs.marklogic.com/2011/04/15/followanyday-mluc11-developer-lounge-labs/
                   (from unique senders: 11, tweets: 4, retweets: 7)
  3. http://developer.marklogic.com/events/mluc11-labs-and-lounge-schedule#talks
                   (from unique senders: 9, tweets: 1, retweets: 8)
  4. http://developer.marklogic.com/media/mluc11-talks/XSLT-basedWebsitesOnMarkLogic.pdf
                   (from unique senders: 8, tweets: 1, retweets: 7)
  5. http://newsletter.marklogic.com/2011/04/live-from-mluc11/
                   (from unique senders: 5, tweets: 3, retweets: 2)
The first url refers to a sub-event of the MLUC11 conference, targeted at developers who want to get acquainted with MarkLogic Server, or would like to ask some of the experts tricky questions.
The second url is a blog post by Pete Aven in which he looks ahead to the MLUC11 conference, mentions a few highlights, and gives a brief description of many ‘followanyday’ ML experts and enthusiasts.
The third url points to the MLUC11 schedule.
The fourth url is one particular, but pretty awesome, presentation about using XSLT within MarkLogic Server to rapidly develop (dynamic) websites.
The fifth is an official news page from Mark Logic announcing the start of the MLUC11 conference. It contains references to Twitter hashtags and Facebook. To the right there is also an interesting list of ‘related posts’. ;-)

The top 5 hashtags:
  1. #marklogic (from unique senders: 26, tweets: 22, retweets: 16)
  2. #mluc+mluc12 (from unique senders: 7, tweets: 1, retweets: 6)
  3. #followanyday (from unique senders: 6, tweets: 1, retweets: 5)
  4. #tech (from unique senders: 5, tweets: 5, retweets: 6)
  5. #mluc11burrito (from unique senders: 4, tweets: 3, retweets: 1)
The mention of hashtags like #marklogic, #mluc, and #mluc12 is not very surprising. The #followanyday hashtag is used to attract attention; it was used together with the followanyday blog post url. The #tech hashtag is used by non-tech people (I guess), in an attempt to push Mark Logic into a particular category. The #mluc11burrito hashtag was used to bring mluc11 visitors together for a burrito dinner.

The top 5 tweeters:
  1. @mdubinko (sent tweets: 71, retweets: 1)
  2. @peteaven (sent tweets: 48, retweets: 8)
  3. @lisabos (sent tweets: 34, retweets: 1)
  4. @mattlyles (sent tweets: 14, retweets: 1)
  5. @ronhitchens (sent tweets: 13, retweets: 1)
The top 5 mentions:
  1. @mdubinko (in tweets: 6, retweets: 38)
  2. @marklogic (in tweets: 15, retweets: 20)
  3. @peteaven (in tweets: 7, retweets: 28)
  4. @hunterhacker (in tweets: 11, retweets: 5)
  5. @lisabos (in tweets: 2, retweets: 9)
The top tweeters and mentions are for the most part no surprise. I had already predicted in one of my tweets that Micah Dubinko had been retweeted the most; just keep in mind he tweeted the most about MLUC11, by far. Pete Aven and Lisa Bos tweeted a lot too, so no surprise to see them in the mentions top 5 as well. Whoever is behind the MarkLogic account did well, and came second in the mentions with ‘just’ 12 tweets and 10 retweets. I tweeted more than mattlyles and ronhitchens, but most of mine were retweets (by far), while these two made quite a number of original statements of their own. That is why they rank higher than me. Last but not least: Jason Hunter, aka hunterhacker, one of the Mark Logic experts who also spoke at the conference, got mentioned quite a lot. But that is what you hope to achieve when you are a speaker, right? I'd say he deserves to be in the top 5!

Last but not least, the top 5 most prominent contributors (based on top 5 urls and hashtags):
  1. @marklogic (tweets: 7, retweets: 3)
  2. @dscape (tweets: 5, retweets: 6)
  3. @eedeebee (tweets: 4, retweets: 2)
  4. @peteaven (tweets: 3, retweets: 4)
  5. @contentnation (tweets: 3, retweets: 2)
The MarkLogic account obviously scores high, as it should. Nuno Job (aka @dscape), Eric Bloch (aka @eedeebee), and Pete Aven are doing well as Mark Logic experts. John Blossom (aka Content Nation) was very fond of the #tech hashtag. I'm not sure, but he may well have been the sole contributor to that hashtag.

To be honest, this wasn't my most brilliant analysis. Changing the sort order can make quite a difference. Including a larger top n of urls and tags has a large influence on the contributors top 5 as well. But making a brilliant analysis wasn’t really my point; I just hope you enjoy the bits of code I shared.

Don't forget to download the full code listing. It adds some interactivity as well!

        https://raw.github.com/grtjn/utilities/master/analyze-tweets.xqy


Tuesday, May 17, 2011

PDF to XML conversion with XSLT 2.0

Good old-fashioned Content Engineering can save money

XML has brought us many tools for all kinds of tasks. They make a lot of work easy, but they also make developers lazy. It is harder and harder to find really skilled people these days. New developers haven’t learned by doing things the hard way, even though doing things the hard way still often produces the best results. Here is an example of converting PDF to XML using XSLT 2.0 to provide a cheap solution, done the ‘hard’ way.

One of our customers approached us (I am speaking on behalf of my company, Daidalos, here). They had some PDF documents with pretty clear layouts. Specific parts of these had to be converted to XML on a regular basis. Obviously, they didn’t want to spend much on it, so we had to look for a cheap solution.

The task seemed simple. Visually identifying the interesting parts and classifying the information was easy enough. We were hoping to automate the process, though. This meant we needed a conversion from PDF to XML that preserved sufficient layout detail to distinguish and classify the interesting bits.

There are plenty of tools readily available that help with converting PDF to XML. Most of them cost money, though, and the good ones cost a lot of money. We did not have that much to spend, so we had to settle for one of the cheapest that could do the trick. Most of the cheaper ones are cheap for a good reason: they are either meant for text extraction only (with a bare minimum of layout), or are good at some aspects of the conversion, but not at what we needed: preservation of sufficient layout detail.

One of my colleagues, though, stumbled upon a very small executable called pdf2html, freely available on the internet. It is actually not very good at converting PDF to XML (you need the -xml option), as it occasionally produces ill-formed XML, nor does it recognize a lot in the PDF. But it does two things pretty well:

1.      identify all text runs within the PDF (with a bit of layout information),
2.      add accurate positioning information to the text runs.

It extracts sufficient information to produce HTML and CSS that mimic the exact textual layout of the PDF. The accurate positioning information is all we need to identify and classify the relevant information. It is this tool that we employed for the task.

This bare conversion to XML was only part of the job, though; most of the work still had to be done. Just to give an impression of its output, let me show you a sample. The PDF input looked like this:

The XML output like this:

The output is actually quite messy. The largest issues are:
  • Continuous text is chunked into separate lines
  • Chunked lines cause hyphenated words to be cut in half
  • Lines can be chunked into separate text runs as well
  • No hierarchy in the text structure
  • No advanced formatting, like tables and lists
  • No separation of headers, footers, and margin texts
  • No images or other drawn layout, not even table borders

In short: it looks like a literal translation from the PDF text floats to XML, and text floats only. Making sense out of this mess is a real challenge. And I went for it.

As said, the output has one really valuable asset: the positioning information. This positioning information helps in coping with most of the above problems; it just needs a careful approach. By applying general enhancements you can get a more logical separation of the information. Once there, you can add meaning to the information.

I roughly took the following steps, in the order mentioned, to apply general enhancements to the raw XML output:

1.      Isolating text runs in different zones (like header and footer)
2.      Gathering text runs on the same ‘line’ (ignoring columns here)
3.      Translating indentation to hierarchy (helps finding lists, provides bare table/column handling)
4.      Merging lines to build paragraphs

For the first step you need to manually ‘measure’ the heights of the header and footer. But once these are known, you can separate text runs belonging to headers from those belonging to footers, and from those belonging to the body. I simply wrapped them in new elements named ‘header’, ‘footer’, and ‘body’.

The second step is not very difficult either. Search within each page for text runs that have the same top position, and wrap them in a ‘line’ element. A simple xsl:for-each-group on the ‘top’ attribute did the trick.
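
The actual solution used xsl:for-each-group for this, but purely to illustrate the grouping idea, here is the same step sketched in XQuery 3.0. The element and attribute names mimic typical pdf2html -xml output, so treat them as assumptions.

for $page in doc("raw.xml")/*/page
return
  <page number="{ $page/@number }">{
    (: group text runs that share the same vertical position into one line :)
    for $run in $page/text
    let $top := string($run/@top)
    group by $top
    order by number($top)
    return <line top="{ $top }">{ $run }</line>
  }</page>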

The third step is a bit similar to the second step, but instead of looking at the top position, you look at the ‘left’ attribute of each line. Lines starting further to the right than the ones above are wrapped in nested ‘block’ elements. Having separated the header and footer earlier makes sure their content doesn’t get mixed with text from the body.

At this point the output has already improved much. It more or less looks like this:

The fourth step requires a bit more effort, but involves not much more than joining the contents of adjacent ‘line’ elements. The ‘block’ elements help to identify where paragraphs should start and end. This is also a good opportunity to resolve hyphenation of words: take any hyphen at the end of a line, cut it off, and concatenate it with the following line without adding an extra space in between. Not entirely fail-safe, but the loss is marginal.

With just four small steps, the output can be greatly enhanced, making it much easier to isolate the interesting information, and add meaning to it. From here it takes only about 20 lines of real XSLT code to get to this end result:

In the end I needed only about 800 lines of XSLT code to get to this result, and nearly half of them are empty lines and comments. With roughly 250 more actual lines of code I was able to produce two more XML extracts, one of which involved interpreting complex tabular information.

I’m not saying this approach can crack all your PDF to XML problems. But I do hope to have shown that doing something the ‘hard’ way doesn’t have to be all that hard. And it can produce high-quality results, perhaps even better than those of a more advanced PDF to XML tool.

Saturday, April 23, 2011

About me

It must have been around 1997, while I was still studying for my Master's degree in Computational Chemistry, that I first got acquainted with XML. I started following the XSLT 1.0 standard in that period as well; it was still a Working Draft at the time. My first steps into the internet and XML community quickly followed.
I joined the Dutch IT company Daidalos (http://www.daidalos.nl/) in the year 2000, bringing in XML and XSLT skills as well as general programming skills I had developed in the years before. At that time, Daidalos had a general focus on employing XML technologies, originally applied mostly in the publishing market. These days it is more focused on a few specific strategies and solutions, although it still covers most of the XML technologies, independent of any specific market.
My contributions to Daidalos have always been in or close to the area of Content Engineering. In the early days I learned the proprietary programming languages OmniMark and MetaMorphosis: both very powerful content-processing languages, though very unalike, and neither particularly well known. I also broadened my XML knowledge, learning about most of the XML standards and a lot about all kinds of XML tools and applications. I also followed the rise of XSLT 2.0 and XQuery 1.0 closely.
More recently, around mid-2008, I got acquainted with MarkLogic Server (http://www.marklogic.com/). It's a bit hard to describe in one sentence, but let’s call it a highly scalable XML and search platform. I have had the opportunity to gain in-depth knowledge of this product, and have used it extensively to sharpen my MarkLogic and XQuery skills.
Only recently have I joined XML Holland (http://www.xmlholland.nl/) to help organise its yearly XML conference. Our ambition is to organise a conference similar to the annual XML Prague conferences, with plenty of renowned invited speakers from outside the Netherlands.
Apart from technical stuff, I am also interested in reading books, white chocolate, playing board games, and solving puzzles. My lovely wife and I are the proud parents of a beautiful little baby daughter.