Friday, December 9, 2011

Web Services Reliable Messaging

The OASIS WS-RX Technical Committee recently released the Web Services Reliable Messaging 1.1 specification for public review. As one of the two co-chairs of the committee, I thought this was a good time to provide an introduction to WSRM, an overview of the specification, and some thoughts on how it might be used in real systems. This article is based on the WSRM 1.1 Committee Draft 4, which is available for public review.
Web Services Reliable Messaging (WSRM) is a specification that allows two systems to send messages between each other reliably. The aim of this is to ensure that messages are transferred properly from the sender to the receiver. Reliable Messaging is a complex thing to define, but you can think about WSRM as providing a similar level of guarantee for XML messaging that a JMS system provides in the Java world. There is one key difference though - JMS is a standard API or programming model, with lots of different implementations and wire-protocols underneath it. WSRM is the opposite - a standard wire-protocol with no API or programming model of its own. Instead it composes with existing SOAP-based systems. Later in the article I will address the exact meaning of reliability and what sort of guarantees the specification offers.

Agents

Before I explain the wire protocol, I'd like to explain the way it fits into an existing SOAP interaction. Unlike a queue-based system, WSRM is almost transparent to the existing applications. In a queue-based system, there is an explicit third party (the queue) where the sender must put messages and from which the receiver gets them. In RM, there are handlers or agents that sit inside the client's and server's SOAP processing engines and transfer messages, handle retries and do delivery. These agents aren't necessarily visible at the application level; they simply ensure that the messages get re-transmitted if lost or undelivered. So if, for example, you have set up a SOAP/JMS system to do reliable SOAP messaging, you will have had to define queues, and change the URLs and endpoints of the service to use those queues. In WSRM that isn't necessary, because it fits into the existing HTTP (or other) naming scheme and URLs.
In WSRM there are logically two of these agents - the RM Source (RMS) and the RM Destination (RMD). They may be implemented by one or more handlers in any given SOAP stack.
The RM Source:
  • Requests creation and termination of the reliability contract
  • Adds reliability headers into messages
  • Resends messages if necessary
The RM Destination:
  • Responds to requests to create and terminate a reliability contract
  • Accepts and acknowledges messages
  • (Optionally) drops duplicate messages
  • (Optionally) holds back out-of-order messages until missing messages arrive
It is important not to confuse the Source and Destination with the "service client/requester" and "service server/provider". In a two-way reliable scenario (where both requests and responses are delivered reliably) there will be an RMS and an RMD in the client, and the same in the server.

Wire Protocol

The main concept in WSRM is that of a Sequence. A sequence can be thought of as the "reliability contract" under which the RMS and RMD agree to reliably transfer messages from the sender to the receiver. Each sequence has a lifetime, which could range from being very short (I create a sequence, deliver a few messages, and terminate) to very long. In fact the default maximum number of messages in a sequence is 2^63, which is equivalent to sending 1000 messages a second for the next 292 million years!
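The lifetime figure above is easy to check. The following is a back-of-the-envelope sketch in Python; the 365.25-day year is an assumption made for the calculation, not something the specification defines.

```python
# Rough check of how long a sequence could last at the quoted sending rate.
MAX_MESSAGES = 2 ** 63                      # maximum message number cited above
RATE_PER_SECOND = 1000                      # sending rate used in the text
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60    # assumption: average calendar year

years = MAX_MESSAGES / RATE_PER_SECOND / SECONDS_PER_YEAR
print(f"about {years / 1e6:.0f} million years")
```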
A Sequence is created using a CreateSequence interaction, and terminated when finished with a TerminateSequence interaction.
Example of a CreateSequence message:
<soap:Body>
  <wsrm:CreateSequence>
    <wsrm:AcksTo>
      <wsa:Address>http://Business456.com/serviceA/789</wsa:Address>
    </wsrm:AcksTo>
  </wsrm:CreateSequence>
</soap:Body>
      
Each message in a sequence has a message number, which starts at one and increments by one for each message.
Example of a Sequence Header and message number:
<soap:Header>
  <wsrm:Sequence>
    <wsrm:Identifier>http://Business456.com/RM/ABC</wsrm:Identifier>
    <wsrm:MessageNumber>1</wsrm:MessageNumber>
  </wsrm:Sequence>
</soap:Header>
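To make the numbering rule concrete, here is a minimal Python sketch of how an RM Source might stamp outgoing messages. The class `MessageNumberer` and the function `sequence_header` are hypothetical names invented for this illustration, and `WSRM_NS` is a placeholder: substitute the namespace defined by the version of the specification you are targeting.

```python
import xml.etree.ElementTree as ET

SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope namespace
WSRM_NS = "urn:example:wsrm"  # placeholder, NOT the real WSRM namespace

ET.register_namespace("soap", SOAP_NS)
ET.register_namespace("wsrm", WSRM_NS)


def sequence_header(sequence_id, message_number):
    """Build a SOAP Header element carrying a Sequence block."""
    if message_number < 1:
        raise ValueError("message numbers start at 1")
    header = ET.Element(f"{{{SOAP_NS}}}Header")
    seq = ET.SubElement(header, f"{{{WSRM_NS}}}Sequence")
    ET.SubElement(seq, f"{{{WSRM_NS}}}Identifier").text = sequence_id
    ET.SubElement(seq, f"{{{WSRM_NS}}}MessageNumber").text = str(message_number)
    return header


class MessageNumberer:
    """Hands out headers numbered 1, 2, 3, ... for a single sequence."""

    def __init__(self, sequence_id):
        self.sequence_id = sequence_id
        self.next_number = 1

    def next_header(self):
        header = sequence_header(self.sequence_id, self.next_number)
        self.next_number += 1
        return header


numberer = MessageNumberer("http://Business456.com/RM/ABC")
print(ET.tostring(numberer.next_header(), encoding="unicode"))
```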
The message number is used to acknowledge the message in a SequenceAcknowledgement header.
Example of a SequenceAcknowledgement Header:
<soap:Header>
  <wsrm:SequenceAcknowledgement>
    <wsrm:Identifier>http://Business456.com/RM/ABC</wsrm:Identifier>
    <wsrm:AcknowledgementRange Lower="1" Upper="1" />
    <wsrm:AcknowledgementRange Lower="3" Upper="3" />
  </wsrm:SequenceAcknowledgement>
</soap:Header>
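The acknowledgement above reports messages 1 and 3, which tells the RMS that message 2 is missing. A minimal Python sketch of that bookkeeping follows. The function names `to_ack_ranges` and `missing_messages` are hypothetical, and real implementations will track this state in their own way.

```python
def to_ack_ranges(received):
    """Collapse received message numbers into (lower, upper) ranges."""
    ranges = []
    for number in sorted(set(received)):
        if ranges and number == ranges[-1][1] + 1:
            # extends the current contiguous run
            ranges[-1] = (ranges[-1][0], number)
        else:
            # starts a new run
            ranges.append((number, number))
    return ranges


def missing_messages(ack_ranges, highest_sent):
    """Return the message numbers the source still needs to resend."""
    acked = set()
    for lower, upper in ack_ranges:
        acked.update(range(lower, upper + 1))
    return [n for n in range(1, highest_sent + 1) if n not in acked]


ranges = to_ack_ranges([1, 3])
print(ranges)                        # the two ranges from the example header
print(missing_messages(ranges, 3))   # the gap the RMS should fill
```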

Example One-Way Scenario

Let's walk through a simple example. For simplicity we will add reliability to a one-way interaction so in this case there is just an RMS in the client and just an RMD in the server. After this I'll talk through some of the options.
  • The client wants to send an application message, so the RMS first sends a CreateSequence message to the same URL that the application messages go to.
  • The RMD intercepts the message and responds with a CreateSequenceResponse. This includes the all-important SequenceID, which is the identifier by which this sequence will be known.
  • The RMS now adds a Sequence header into the original application message. This has the SequenceID and the message number (in this case it will be 1).
  • The RMS continues to add incrementing Sequence headers into application messages.
  • The RMD delivers these messages to the server application, maintaining any guarantees that it offers, such as exactly-once and in-order
  • According to its timing policy, at some point the RMD will send SequenceAcknowledgements back to the RMS. When an RMS creates a sequence, it passes an address for acknowledgements (the AcksTo address) to the RMD. In this particular scenario, we will assume that the AcksTo address is the WS-A anonymous URI - which implies you use the transport backchannel. In this case the RMD will send the acknowledgement on the HTTP response channel. Because this is a one-way interaction, there is no SOAP envelope flowing back to the client, so the RMD will create an empty SOAP envelope, add the header, and return it on the HTTP response. The RMS will pick this up before it gets to the client application.
    Note that the acknowledgement isn't just for one message, it acknowledges all the messages successfully received by the RMD.
  • If there are any missing messages, the RMS will resend those
  • Once the RMS has had all the messages that it has sent acknowledged, it can terminate the sequence. To do this it sends a TerminateSequence message to the RMD.
  • The RMD responds to the RMS with a TerminateSequenceResponse.
  • That's all folks!
Actually, spelt out in that level of detail it seems like quite a lot, but if we recap, there were two extra service calls (Create and Terminate), and then a few extra headers floating around. I don't think that is unnecessary overhead. At one point an early draft of the spec had an inline or implicit CreateSequence. Unfortunately, that left the first message in doubt. The current design means that once you have successfully created a sequence, you have a "contract" with the other end to deliver messages. In most implementations, if no TerminateSequence is sent the sequence will be timed out automatically. And of course, you do get extra message flows if messages are lost, as in that case they will have to be resent.
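The walkthrough above can be condensed into a toy simulation. This is only a sketch under simplifying assumptions: the "network" is a direct method call, a lost message is simulated by skipping it on the first attempt, and the names `Destination` and `send_reliably` are invented for the illustration. It is not how any particular WSRM stack is implemented.

```python
class Destination:
    """Toy RMD: stores each message once and reports what it has received."""

    def __init__(self):
        self.sequences = {}   # sequence id -> {message number: payload}
        self._created = 0

    def create_sequence(self):
        self._created += 1
        seq_id = f"seq-{self._created}"   # hypothetical identifier format
        self.sequences[seq_id] = {}
        return seq_id

    def receive(self, seq_id, number, payload):
        # setdefault keeps the first copy, so a resent duplicate is ignored
        self.sequences[seq_id].setdefault(number, payload)

    def acknowledge(self, seq_id):
        # one acknowledgement covers every message received so far
        return set(self.sequences[seq_id])

    def terminate(self, seq_id):
        return self.sequences.pop(seq_id)


def send_reliably(destination, payloads, lost_on_first_try=frozenset()):
    """Toy RMS: number the messages and resend until all are acknowledged."""
    seq_id = destination.create_sequence()
    outbox = dict(enumerate(payloads, start=1))   # message numbers start at 1
    unacked = set(outbox)
    rounds = 0
    while unacked:
        rounds += 1
        for number in sorted(unacked):
            if rounds == 1 and number in lost_on_first_try:
                continue   # simulate a message lost in transit
            destination.receive(seq_id, number, outbox[number])
        unacked -= destination.acknowledge(seq_id)
    return destination.terminate(seq_id), rounds


delivered, rounds = send_reliably(Destination(), ["a", "b", "c"], lost_on_first_try={2})
print(delivered, rounds)
```

In this run message 2 is dropped once, the acknowledgement reveals the gap, and a second round of sending fills it before the sequence is terminated.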
So what could have gone differently? In other words, what options are there?
Well firstly, the acknowledgements don't have to use the backchannel. The RMS can open up its own HTTP port (or other endpoint) to receive acknowledgements on. This is specified in the AcksTo address. If the AcksTo address is the same as the WS-Addressing ReplyTo address, the RMD may piggyback acknowledgements in response messages flowing back to the client in some circumstances.
Secondly, the RMD doesn't have to acknowledge the messages it has received. Instead, if it is missing just one message in a million, it can Nack just the missing message. This is like a prompt to the RMS saying, "I'd really like this missing message."
Thirdly, the RMS could have requested an acknowledgement. Suppose the RMD is set to only acknowledge rarely (minimizing extra bandwidth), but the RMS wants to clean up its store of messages, then it can ask for an acknowledgement by adding an AckRequested header. The RMD will respond immediately with a SequenceAcknowledgement.

Closing a sequence

The other thing that could have been different is that the RMS might, for some reason, decide to shut down the sequence before all messages are delivered. Why? Maybe my server is being closed down and I want to clean up in an orderly manner, or maybe there is one message that... In this case, it's tricky. Once I terminate the sequence, I can't ask for an acknowledgement, because the RMD will have cleared its state. If I ask for an acknowledgement first and then terminate, I might not get a true picture - maybe some extra messages might end up being delivered after I receive the SequenceAcknowledgement but before the Terminate happens. Arggg.
Well, we thought of this. So, we added the ability to Close a sequence. This is an extra interaction that allows the RMS to say that it won't be delivering any more messages. The RMD then responds with a Final sequence acknowledgement showing the ultimate state of delivered messages. After that it's OK to terminate the sequence.
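One way to picture the difference between Close and Terminate is as a small state machine on the RMD side. The sketch below is an assumption-laden illustration (the class `DestinationSequence` and its method names are invented here); it shows why the final acknowledgement is trustworthy: once the sequence is closed, no further messages are accepted.

```python
from enum import Enum


class State(Enum):
    OPEN = "open"
    CLOSED = "closed"
    TERMINATED = "terminated"


class DestinationSequence:
    """Toy RMD-side view of one sequence."""

    def __init__(self):
        self.state = State.OPEN
        self.received = set()

    def accept(self, number):
        """Accept a message only while the sequence is open."""
        if self.state is not State.OPEN:
            return False
        self.received.add(number)
        return True

    def close(self):
        """Stop accepting messages and return the final acknowledgement."""
        if self.state is State.TERMINATED:
            raise RuntimeError("sequence already terminated")
        self.state = State.CLOSED
        return sorted(self.received)

    def terminate(self):
        """Discard all state for the sequence."""
        self.state = State.TERMINATED
        self.received.clear()


seq = DestinationSequence()
seq.accept(1)
seq.accept(3)
final_ack = seq.close()      # message 2 never arrived
late = seq.accept(2)         # a straggler after Close is refused
seq.terminate()
print(final_ack, late)
```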

Request/Response

In the case of request/response, there is very little difference, except that there is a sequence in each direction. The sequences are independent - so there is no linkage between transmission of the messages on one sequence and transmission of the messages on the other sequence. The only "linkage" is that you can optimize the creation of the two sequences by sending an Offer of a return sequence in the outgoing CreateSequence.
Imagine you are a client and it is clear that there will be a two-way reliable connection. In that case the client can create a sequence and Offer it to the server for responses. Effectively this lets you create two sequences in one message exchange. However, after that the sequences are independent: for example you can terminate one and still use the other.
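The Offer optimization can be sketched as follows. This is a loose Python model, not the wire format: the dataclasses, the `Server` class and the `accepted_offer` field are hypothetical names chosen for the illustration, and the identifiers are generated with `uuid` purely as an assumption.

```python
import uuid
from dataclasses import dataclass
from typing import Optional


@dataclass
class CreateSequence:
    acks_to: str
    offer: Optional[str] = None   # identifier the client proposes for the return sequence


@dataclass
class CreateSequenceResponse:
    identifier: str                       # sequence for client-to-server messages
    accepted_offer: Optional[str] = None  # sequence for server-to-client messages


class Server:
    def __init__(self):
        self.inbound = set()    # sequences the server receives requests on
        self.outbound = set()   # sequences the server sends responses on

    def handle_create(self, request):
        identifier = f"urn:uuid:{uuid.uuid4()}"
        self.inbound.add(identifier)
        if request.offer is not None:
            self.outbound.add(request.offer)   # accept the offered return sequence
        return CreateSequenceResponse(identifier, request.offer)

    def terminate(self, identifier):
        self.inbound.discard(identifier)
        self.outbound.discard(identifier)


server = Server()
offered = f"urn:uuid:{uuid.uuid4()}"
response = server.handle_create(CreateSequence("http://Business456.com/serviceA/789", offered))

# One exchange created two sequences; terminating one leaves the other usable.
server.terminate(response.identifier)
print(offered in server.outbound)
```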

Firewall crossing

Most internet users can't just start up an HTTP server on their machines and have other systems connect in. The problem doesn't come with running an HTTP server - that's simple. The real problem comes with getting the packets to your machine. For example, many home users have a broadband router/firewall that performs Network Address Translation. Without complex configuration these will drop all inbound packets. Similarly if I walk into a coffee shop and use the wireless LAN, I have the same problem - my IP address isn't globally accessible. Why do we care? Well, if I just want to do one-way reliable, then this doesn't matter. In fact, in the example above we showed how it works. By piggybacking the acks on the HTTP response flow, everything works just fine. But if I have a request-response flow, things change.
Suppose a response goes missing. The server wants to resend that message to the client. But the client isn't addressable. There is no open connection to resend the message on, and no way of the server opening one. Help!!!

MakeConnection to the rescue

MakeConnection is a simple one-way message that logically flows from the client to the server. By opening up an HTTP connection, this allows the server to respond with any "queued" or waiting messages that need to be transmitted to the client. Effectively the client "polls" the server every once in a while for any waiting messages. If you think about this carefully, you will see that this message flows from the RMD to the RMS, because it is designed for the return (response) path. Effectively the client's RMD is asking the server's RMS if there are any messages waiting. Of course, the client has to identify itself to make this happen. There are two options in MakeConnection. One is to modify the WS-Addressing headers to use a special URI that includes a unique ID. This is really there for complex scenarios. For simpler scenarios, the following approach works well:
  • Client creates a sequence and offers a sequence at the same time
  • Client sends requests, ideally receives response on the backchannel
  • For some reason, some responses are timed out or connections lost
  • Client initiates MakeConnection, passing the Sequence identifier of the offered sequence
  • Server responds with missing message, plus a flag to indicate if more are waiting
  • Once no more messages are waiting the client can terminate the sequences
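The polling steps above can be sketched in a few lines of Python. The `ResponseStore` class and `poll_all` function are hypothetical names, and the in-memory queue stands in for whatever store a real server-side RMS would use; the point is only to show the "one message plus a more-waiting flag" shape of the exchange.

```python
from collections import defaultdict, deque


class ResponseStore:
    """Toy server side: responses that could not be delivered on the backchannel."""

    def __init__(self):
        self._pending = defaultdict(deque)   # offered sequence id -> waiting messages

    def queue(self, seq_id, message):
        self._pending[seq_id].append(message)

    def make_connection(self, seq_id):
        """Return (message or None, whether more messages are waiting)."""
        waiting = self._pending.get(seq_id)
        if not waiting:
            return None, False
        message = waiting.popleft()
        return message, bool(waiting)


def poll_all(store, seq_id):
    """Toy client side: keep polling until the server reports nothing is waiting."""
    received = []
    while True:
        message, more = store.make_connection(seq_id)
        if message is not None:
            received.append(message)
        if not more:
            return received


store = ResponseStore()
store.queue("offered-seq", "response 2")
store.queue("offered-seq", "response 5")
print(poll_all(store, "offered-seq"))
```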

Security

In many ways RM just plugs in with whatever other security model is already in place. However, there are some issues that need watching out for. In particular, there is the possibility of a "sequence attack". In this model, imagine there are two valid "clients" each with a sequence. Both are authorized at the service level, but one of the clients is actually a maverick, and he wants to attack the other sequence. If he can guess (or sniff) the sequence identifier, then he can start a Denial of Service attack, by for example, terminating the sequence. So the RM spec addresses how to associate the sequence with a particular credential or security session. This means that the RM agent can protect against this kind of attack. This is particularly important with MakeConnection, because otherwise an unauthorized user could retrieve messages destined for another system.
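The idea of tying a sequence to a credential can be illustrated with a small Python sketch. It assumes the surrounding security layer has already authenticated the caller and hands the RM agent a principal name; the `SecureDestination` class is invented for the illustration and does not reflect how the spec composes with any particular security mechanism.

```python
class SecureDestination:
    """Toy RMD that remembers which principal created each sequence."""

    def __init__(self):
        self._owners = {}   # sequence id -> principal that created it

    def create_sequence(self, seq_id, principal):
        self._owners[seq_id] = principal

    def is_open(self, seq_id):
        return seq_id in self._owners

    def terminate(self, seq_id, principal):
        # Knowing (or guessing) the identifier is not enough: the caller must
        # also be the principal the sequence was bound to at creation time.
        if seq_id not in self._owners or self._owners[seq_id] != principal:
            raise PermissionError("caller is not bound to this sequence")
        del self._owners[seq_id]


rmd = SecureDestination()
rmd.create_sequence("http://Business456.com/RM/ABC", principal="alice")

try:
    rmd.terminate("http://Business456.com/RM/ABC", principal="mallory")
except PermissionError as error:
    print("rejected:", error)

print(rmd.is_open("http://Business456.com/RM/ABC"))
```

The same check would apply to MakeConnection, so that only the bound principal can retrieve messages waiting on a sequence.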

WSRM Policy

As well as the core spec, the TC has published a Policy Assertion Language for WSRM that can be used with the WS-Policy Framework model. In the previous spec (1.0) the policy model was fairly complex. There were a number of timing parameters that were published in WSRMP. Firstly the TC decided a number of these were "unhelpful" as they tied the parties to using static timing models instead of dynamically adjusting them. Secondly, it was felt that it would be better to have any remaining timing agreed during the CreateSequence. This means that WSRM can be used very successfully without needing to use WS-Policy. Now WS-Policy is simply used to signal whether WSRM is optional or required on a given endpoint.

So what does Reliability mean anyway?

Are you still reading? Congratulations on making it this far! Well we've covered the protocol in a reasonable amount of depth. Now let's step back and see what it actually gives us! The first question that challenges people about WSRM is: "What level of reliability do I get?". And the answer isn't that simple, unfortunately. WSRM was designed as a wire protocol not as an end-to-end application level protocol. There are two reasons for this. One is that the Web Services standards (WS-*) are generally designed to cover the externally visible view of a service and not the implementation, to promote the concept of loose-coupling. The second reason is composability: to provide end-to-end reliability you need to have some kind of transaction manager associated with the application. Because there are other WS-* specifications that cover transactions, and different ways of implementing transactions, it doesn't make sense for this specification to cover that aspect. This is a thorny issue that comes up every time I discuss WSRM with customers or potential users, who are looking for much more of a plug-in replacement for existing messaging systems that tightly integrate with transactional applications.
The guarantee that WSRM - by itself - offers, is simply that the message was successfully transferred from the RMS to the RMD and that the RMD acknowledged it. Different implementations can have different guarantees behind this. For example, Apache Sandesha2, an open source implementation of WSRM, has a pluggable storage manager. This means that you can have a persistent store behind the RMD, so the acknowledgement is only sent when the message has been written to disk. This means that Sandesha can support server failure and restart. The WSO2 Tungsten server supports this model of operation.
The previous specification (WSRM 1.0) specifically talked about delivery assurances such as AtLeastOnce, AtMostOnce, ExactlyOnce and InOrder. However, these assurances are really guarantees between the RMD and the application, not across the wire. So as a committee, we removed these from the specification. We still expect implementations to offer these levels of assurance, but they are part of the implementation not the wire protocol.
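Since these assurances live between the RMD and the application, an implementation is free to provide them however it likes. As one possible illustration, the Python sketch below combines duplicate elimination with in-order delivery by holding back messages until the gap before them is filled. The `InOrderDeliverer` class is a hypothetical name, and the sketch keeps everything in memory, which a production RMD might not.

```python
class InOrderDeliverer:
    """Toy RMD-to-application stage: exactly-once, in-order delivery."""

    def __init__(self, deliver):
        self.deliver = deliver     # callback into the application
        self.next_expected = 1     # message numbers start at 1
        self.held = {}             # out-of-order messages waiting for a gap to fill

    def on_message(self, number, payload):
        if number < self.next_expected or number in self.held:
            return                 # duplicate: already delivered or already held
        self.held[number] = payload
        # release every message that is now contiguous with what was delivered
        while self.next_expected in self.held:
            self.deliver(self.held.pop(self.next_expected))
            self.next_expected += 1


delivered = []
rmd = InOrderDeliverer(delivered.append)
for number, payload in [(2, "b"), (1, "a"), (2, "b"), (3, "c")]:
    rmd.on_message(number, payload)
print(delivered)
```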

Programming model and implications

If you are a JMS or messaging developer, you will be used to learning a programming model (PM) for reliable messaging, such as JMS. So WSRM might come as a shock to you, because it can be used without any new programming model. Of course it's hard to generalize, because each implementation can have its own approach, but the core spec doesn't imply any particular PM. For example, Sandesha allows you to turn on RM. If there is no sequence in place, it automatically creates one, and then when no more messages are being sent, it times out and terminates the sequence. The fact that the RMS and RMD are just "handlers" in the chain of processing also means that there are no new "visible entities" such as queues that need to be configured or that show up in the client code - the RM infrastructure can share the same URIs that the existing Web Service uses. So WSRM can be added into an existing Web Services interaction with no extra application code. (By the way, Sandesha also has a full programming API that gives access to sequences if users wish to hand-code the RM behaviour).
Despite this transparency, it is worth thinking about the implications on coding. Many recent Web Service stacks and APIs, including Microsoft WCF (Indigo), JAX-WS, and Apache Axis2, offer the ability to call a Web service asynchronously (non-blocking). In this model, instead of the client blocking until the response comes back, the client passes a callback object in when it makes the outbound call. Processing then continues on the client thread, and when the response comes back a separate thread passes the response to the callback handler.
This style of programming is very important for WSRM, because it means that even if the server goes down, RM can resend the request and response messages until the response is received. With a blocking call, at some point the client would time out, leaving the reliable response "orphaned" - properly delivered back to the client but without any code available to process it. So in general, if you think you might use RM, it makes sense to write clients using this non-blocking approach. (It's actually good practice anyway: imagine a web application server that is making calls out to a third-party using Web services; if too many requests are blocking waiting for responses the server's thread pool would end up exhausted and the server couldn't handle incoming requests).
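The callback style is not specific to any one stack. As a language-neutral illustration, here is a minimal Python sketch using the standard `concurrent.futures` module; `call_service` is a stand-in for a real (possibly slow, possibly retried) Web service invocation and `invoke_async` is a hypothetical helper, neither of which corresponds to the WCF, JAX-WS or Axis2 APIs.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def call_service(request):
    # stand-in for the remote call; a real one might take a long time to answer
    return f"response to {request}"


def invoke_async(executor, request, callback):
    """Start the call and return at once; the callback runs when the response arrives."""
    future = executor.submit(call_service, request)
    future.add_done_callback(lambda done: callback(done.result()))
    return future


results = []
finished = threading.Event()


def on_response(response):
    results.append(response)
    finished.set()


with ThreadPoolExecutor(max_workers=1) as executor:
    invoke_async(executor, "order-42", on_response)
    # the client thread is free to carry on with other work here

finished.wait(timeout=5)
print(results)
```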

History and differences from the existing 1.0 specification

WS ReliableMessaging dates all the way back to March 2003, when it was originally published. In June 2005 the 1.0 specification was submitted to OASIS for standardization. The current draft reflects a number of changes from the 1.0 spec. Without listing all of them I can summarize the main changes:
  • Namespace changes: Since the specifications have significant changes they are not compatible at the wire-level. The 1.1 spec has a different set of namespaces reflecting the ownership by OASIS.
  • Cleanup: The TC really worked through the specification with a fine-toothed comb, and found many small issues ranging from potential errors to potential problems interoperating.
  • Addition of CloseSequence: As discussed above, there are cases where it is necessary to close an incomplete sequence, and CloseSequence allows that to happen cleanly.
  • Removal of LastMessage: The 1.0 spec had a marker on a message to indicate it was the last message, which was largely superfluous.
  • Improved security composition: The original spec had very specific composition with WS-Security/WS-SecureConversation. The 1.1 spec has a much more flexible approach that also supports composition with SSL/TLS based security sessions.
  • Updated to use the W3C WS-Addressing Recommendation: The 1.1 spec uses the recommended version of WS-Addressing from the W3C.
  • Simplification of WSRM-Policy: The published policy assertion is much simpler - essentially it states whether RM is required or optional. The previous spec had a number of timing parameters which would not allow for dynamic adjustment of the protocol, so they were removed, or moved into the CreateSequence.
  • Support for two-way reliability with firewall crossing: The MakeConnection support was added in the 1.1 spec.

Implementations

There are a number of implementations of the existing WSRM 1.0 specification, including Microsoft WCF (formerly known as Indigo), and Apache Sandesha2. The OASIS WSRX TC hosted an interop based on the last Committee Draft earlier in 2006, and 5 companies turned up with implementations. Although the interop didn't produce 100% coverage, three companies managed to interop fully between their implementations in all scenarios. The TC is hosting a second interop during the public review period, to fully test the implementations on the latest specification. We are also expecting more companies to take part this time.

Summary

In this article we've covered a lot of ground, from the overall model down to the main elements of the wire protocol. There are more complicated scenarios I haven't covered, and I encourage you to read the spec itself to understand the nuances, but I hope it's been useful. I'd like to finish off by looking at some of the potential uses I see for WSRM, and some of the ideas that customers have talked to me about.
  • B2B messaging: A number of people see WSRM playing a key part in business to business links. Many companies are looking for a low-cost simple way of ensuring that orders, invoices, etc. are reliably and securely transmitted over the Internet to partners. WSRM is an ideal technology to provide the reliability for those links.
  • Internal department-to-department or server-to-server links: WSRM is also a very useful protocol inside the enterprise. More and more companies are developing and using Web services and XML communications internally, and as those links become "line-of-business" WSRM will become a key technology to ensure reliability.
  • JMS replacement: Some companies are looking at WSRM as a long-term replacement for existing proprietary JMS systems. The next release of Windows, Vista, will include WSRM support built-in. That makes it tempting if companies currently have to install proprietary JMS clients on many workstations.
  • JMS bridge: You could use WSRM as a standard protocol to bridge between two different JMS implementations. The Apache Synapse open source project is designed to help you do this, amongst other things.
  • Browser-based scenarios and notifications: As AJAX applications get more interesting, the idea of doing reliable messaging directly from a browser becomes pretty exciting, especially if you were building, for example, an AJAX trading application. At least one effort is creating a plug-in for the Firefox browser that supports a SOAP-based AJAX model. RM support is coming and will make it very simple to create reliable AJAX applications. Since AJAX already uses a non-blocking asynchronous approach it is ideally suited to being composed with WSRM. The ability to cross firewalls using the MakeConnection facility also means that RM can be used without the client needing to open ports. This approach can also be used to support subscriptions, where the browser makes a single request (subscribe) and receives multiple responses (notifications) back using MakeConnection.
All in all, I see a bright future for WSRM. It's taken a while to pull together all the companies and the technology into a single approach, but we are making good progress, and the public review of the specification is a major milestone on that path.
