Sometimes it's just hard for the newest-fangled, spangled technology like SOAP and WSDL to beat tried-and-true technologies like Comma Separated Values, and sometimes it is the most benign of examples that seems to prove it.
Consider the following situation: you are searching or listing some potentially large data set, like a file system. You don't know how much stuff you are getting, but when you do find something, you know all you will ever need to know about it. Conversely, this knowledge is atomic: you know nothing until you know everything. Furthermore, getting that information has some non-trivial cost that is basically fixed per item. And for the sake of argument, all of this API is hidden behind some horrible facade that you cannot access (yeah, I could be less abstract, but, you know, NDA this, trade secret that, confidential interface here, yadda yadda yadda). How would you write a SOAP service to access that and get the data?
The obvious solution is to queue all of the results up into a single response. That's all fine and dandy, except that you then get all of the data at one time. That may not be desirable: what if you are showing a table with a row count north of the one-comma range when pretty-printed (i.e., over 1,000 rows)? After five minutes without data, the user/QA engineer/customer thinks the query is hung and kills the app, never mind the fact that the first results were available within two seconds of submission. Essentially we have made the time to first data the same as the time to all data, plus network overhead. Why? Because of well-formed XML constraints: I don't know the SOAP response is valid XML until I receive the close tag, and for an unbounded element I don't know how many items there are until the close tag of the wrapper appears. And then, because of threading issues, common SOAP toolkits just don't deal well with incremental responses anyway.
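To make the well-formedness point concrete, here is a small Python sketch (the data is made up for illustration) contrasting a response cut off mid-stream in each format. The truncated XML yields nothing; the truncated CSV yields every row that has been fully delivered:

```python
import xml.etree.ElementTree as ET

# A SOAP-style response cut off mid-stream: the wrapper never closes.
truncated_xml = "<results><row id='1'/><row id='2'/>"
try:
    ET.fromstring(truncated_xml)
except ET.ParseError:
    print("no rows usable until the close tag arrives")

# The same stream as CSV, also cut off mid-stream.
truncated_csv = "1,alpha\n2,beta\n3,ga"  # third row still in flight
lines = truncated_csv.split("\n")
complete = lines[:-1]  # anything after the last newline is still arriving
rows = [line.split(",") for line in complete]
print(rows)  # the first two rows are already fully usable
```

The XML consumer is stuck at zero rows until the final byte; the CSV consumer can render each row the moment its newline shows up.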
Could we cache the results from the query and fetch them in multiple calls? Then we have to cache them on the server side, which means we need to keep SOAP sessions, handle callback objects, or tolerate other such messiness. And what if the query is aborted? Then how do you deal with leftover data no one ever came to get? LRU eviction could also push the data out before the client gets to it when you have a high number of users or calls, unless you make the caches very big, not to mention the bug potential. And even if we could make a quicker call that returns just a key for each piece of data, we would still have to go back and make N calls. If the overhead of each individual SOAP call is any sizable fraction of the time to fetch the data for a key (or worse yet, a multiple of it), then the first row may arrive in two seconds, but the total time to get all of the data is now 150% to 300% of what it was before, for per-call overheads of 1/2x to 2x the fetch cost respectively.
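The 150%–300% figure falls out of simple arithmetic. A minimal sketch, assuming one batched query costs roughly N times the per-item fetch cost, while N keyed calls each pay the fetch cost plus the SOAP call overhead:

```python
fetch = 1.0  # normalized time to pull one item's data given its key

ratios = []
for overhead in (0.5, 2.0):  # per-call overhead: half vs. double the fetch cost
    # Batched: N * fetch (one call's overhead amortizes away for large N).
    # Keyed:   N * (overhead + fetch).
    ratio = (overhead + fetch) / fetch
    ratios.append(ratio)
    print(f"overhead {overhead}x fetch -> total time {ratio * 100:.0f}% of batched")
```

Running this prints 150% and 300%, matching the range above; the first row arrives fast, but the whole job gets slower by exactly the per-call overhead ratio.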
So how does CSV fit into this? First off, any given row of a CSV file carries full context when taken out of the file at any point (agreeing on column meanings up front is no different from needing a WSDL to describe them). Parsing single lines is a rather trivial state machine that college freshmen could implement as a weekly CS101 lab. You don't need some magic end-of-content marker like a close tag; a record always ends at a newline outside of quotes. Detecting when you are done is as simple as detecting that the connection is closed, and if a user abandons a request mid-stream, the socket errors tell the querying engine to stop producing more data. It's much like feeding a baby: you move the stuff in the jar one spoonful at a time to the baby, and you stop when the baby spits up, gets resistant, or you run out of baby food.
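That freshman-lab state machine is small enough to sketch in full. This is a minimal version in Python that tracks one bit of state (inside or outside quotes) for a single physical line; real CSV dialects add wrinkles like quoted fields spanning newlines, which a streaming reader would handle by applying the same quote state across line boundaries:

```python
def parse_csv_line(line):
    """Split one CSV record into fields: commas outside quotes delimit,
    doubled quotes inside a quoted field mean a literal quote."""
    fields, buf, in_quotes = [], [], False
    i = 0
    while i < len(line):
        c = line[i]
        if in_quotes:
            if c == '"':
                if i + 1 < len(line) and line[i + 1] == '"':
                    buf.append('"')  # escaped quote: "" -> "
                    i += 1
                else:
                    in_quotes = False  # closing quote
            else:
                buf.append(c)
        else:
            if c == '"':
                in_quotes = True  # opening quote
            elif c == ',':
                fields.append(''.join(buf))  # field boundary
                buf = []
            else:
                buf.append(c)
        i += 1
    fields.append(''.join(buf))  # last field runs to end of line
    return fields

print(parse_csv_line('a,"b,c",d'))  # ['a', 'b,c', 'd']
```

The same `in_quotes` flag is what tells a streaming consumer whether a newline ends the record or is just data, which is the whole trick behind "it's always a newline outside of quotes."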
Optimizing web services is much like any other optimization task: there are hot spots where you focus on tweaking the algorithm, or drop down to assembly language if you have to. And when you optimize for raw speed, sometimes you get other pleasant surprises too: the CSV payload is 10x to 100x smaller as well.