codem - blog

Posts Tagged ‘sydphp’

Desk space available (Bapple)

Are you looking for a desk/office space in Surry Hills, 3 minutes from Central station? Codem’s web collaborators Bapple are offering a desk space – here’s the lowdown:

“Offering a desk space to web/creative professionals. We’d also like to work with you if you choose! $86/wk + GST (or $395/mth all inclusive). The space comes with:

  • large desk
  • chair
  • broadband
  • A/C
  • Kitchen and Loo access
  • 24/7 access
  • sunlight + balcony

It’s a clean, bright office with friendly creative/web developer folk and no bills”

desk space available, Surry Hills, Sydney
(Desk on right will be on the left where white table is pictured.)

Contact Pete for more info on 0419144829 or visit the Bapple website

Consuming XML, fast, with PHP and XMLReader

Let’s face it, XML isn’t the lightest of data serialisation formats out there. Consider and compare this:

<alternate_description>something else</alternate_description>

against this, in JSON:

{ "alternate_description": "something else" }

Those repetitive XML tags are really just extra bytes to download and parse. Unfortunately, we sometimes have to consume huge gobs of XML for a project, and for that we have XMLReader, the lesser-known cousin of SimpleXML.

Unlike SimpleXML, which loads the entire document into memory before you can query it, XMLReader “acts as a cursor going forward on the document stream and stopping at each node on the way” (php.net/xmlreader). It works much like a line-by-line CSV parser, but on the nodes of an XML document.
Choosing the right XML parser for the job matters: the wrong choice can cause avoidable performance problems on your server.
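
To make the difference concrete, here is a minimal sketch that reads the same small document both ways. The sample string and variable names are made up for illustration; only the element names follow the example document used later in this post.

```php
<?php
// Hypothetical sample data, standing in for a real feed or import file.
$xml = '<persons>'
     . '<person><name>John</name></person>'
     . '<person><name>Jane</name></person>'
     . '</persons>';

// SimpleXML: the whole document is parsed into an in-memory tree first,
// then navigated with property access.
$doc = simplexml_load_string($xml);
$simpleNames = [];
foreach ($doc->person as $person) {
    $simpleNames[] = (string) $person->name;
}

// XMLReader: a forward-only cursor. Each read() moves to the next node,
// and we only react to the <name> start tags.
$reader = new XMLReader();
$reader->XML($xml);
$readerNames = [];
while ($reader->read()) {
    if ($reader->nodeType === XMLReader::ELEMENT && $reader->name === 'name') {
        $readerNames[] = $reader->readString();
    }
}
$reader->close();

print_r($simpleNames);
print_r($readerNames);
```

Both loops collect the same names. The SimpleXML version is shorter to write, while the XMLReader version never holds more than the current node.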

To illustrate this, I pointed both SimpleXML and XMLReader at the same 190MB XML document via a PHP shell script, ran two tests on each extension and profiled the results. Test one found a node at the start of the file; test two found a node at the end of the file.

The XML document in question contains 21467 records and looks something like this:

<persons>
   <person>
       <name>John</name>
       <!-- other nodes -->
   </person>
   <!-- 21466 more person nodes -->
</persons>

Peak memory usage was measured with the “top” command (%MEM column).
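
The original test script is not reproduced in this post, so the following is only a rough sketch of what such a comparison could look like. The helper names findWithSimpleXml() and findWithXmlReader() are hypothetical, the generated 1000-record file is a small stand-in for the 190MB document, and it assumes every person has exactly one name element. Timing uses microtime(); memory would still need to be watched externally, as with top.

```php
<?php
// Hypothetical helper: load everything with SimpleXML, then scan for the name.
function findWithSimpleXml(string $path, string $name): int
{
    $doc = simplexml_load_file($path);   // parses the entire file up front
    $position = 0;
    foreach ($doc->person as $person) {
        $position++;
        if ((string) $person->name === $name) {
            return $position;
        }
    }
    return 0;
}

// Hypothetical helper: stream with XMLReader and stop at the first match.
function findWithXmlReader(string $path, string $name): int
{
    $reader = new XMLReader();
    $reader->open($path);
    $position = 0;
    $found = 0;
    while ($reader->read()) {
        if ($reader->nodeType === XMLReader::ELEMENT && $reader->name === 'name') {
            $position++;
            if ($reader->readString() === $name) {
                $found = $position;
                break;
            }
        }
    }
    $reader->close();
    return $found;
}

// Build a small stand-in document in a temp file.
$path = tempnam(sys_get_temp_dir(), 'persons');
$handle = fopen($path, 'w');
fwrite($handle, '<persons>');
for ($i = 1; $i <= 1000; $i++) {
    fwrite($handle, "<person><name>person{$i}</name></person>");
}
fwrite($handle, '</persons>');
fclose($handle);

// Test one looks for the first record, test two for the last.
$results = [];
foreach (['person1', 'person1000'] as $target) {
    foreach (['findWithSimpleXml', 'findWithXmlReader'] as $finder) {
        $start = microtime(true);
        $results[$finder][$target] = $finder($path, $target);
        $elapsed = microtime(true) - $start;
        printf("%s(%s): record %d in %.5f seconds\n", $finder, $target, $results[$finder][$target], $elapsed);
    }
}
unlink($path);
```

On a file this small the timings will be close to noise; the gap only becomes visible with documents in the size range discussed here.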

SimpleXML

Test One:
Nodes: 1
Peak Memory Usage: 18%
Processed 190MB of XML in 3.14164 seconds

Test Two:
Nodes: 21467
Peak Memory Usage: 18%
Processed 190MB of XML in 3.20796 seconds

XMLReader

Test One:
Nodes: 1
Peak Memory Usage: 0.3%
Processed 190MB of XML in 0.00128 seconds

Test Two:
Nodes: 21467
Peak Memory Usage: 0.7%
Processed 190MB of XML in 16.4478 seconds

These results give a good indication of what each extension is best suited to.

XMLReader found the first element almost instantly, while SimpleXML took about the same time to find the first and the last element because it parses the whole document either way. The big difference is memory: XMLReader peaked at 0.3% to 0.7% against SimpleXML’s 18%, roughly 25 to 60 times less.
Understandably, XMLReader took a lot longer to find the last node, as it had to process each node in the document until it found a match. A seek() method on the XMLReader class would be useful here to skip unwanted nodes.
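
XMLReader has no seek(), but its next() method moves the cursor past the current node's subtree, which goes part of the way. The sketch below is illustrative only: the id attribute and the bio element are assumptions added for the demo and are not part of the document tested above. As far as I understand, next() still has to parse the bytes it skips, so it saves PHP-level iterations rather than acting as a true seek.

```php
<?php
// Hypothetical data: the id attribute and <bio> subtree are assumptions for this demo.
$xml = '<persons>'
     . '<person id="1"><name>John</name><bio><p>long subtree</p></bio></person>'
     . '<person id="2"><name>Jane</name><bio><p>long subtree</p></bio></person>'
     . '<person id="3"><name>Josh</name><bio><p>long subtree</p></bio></person>'
     . '</persons>';

$reader = new XMLReader();
$reader->XML($xml);

// Walk forward with read() until the cursor sits on the first <person>.
$onPerson = false;
while ($reader->read()) {
    if ($reader->nodeType === XMLReader::ELEMENT && $reader->name === 'person') {
        $onPerson = true;
        break;
    }
}

// Hop from <person> to <person> with next(), without stepping through children.
$visited = 0;
$name = null;
while ($onPerson) {
    $visited++;
    if ($reader->getAttribute('id') === '3') {
        // Only now descend into the matching record to read its <name>.
        while ($reader->read()) {
            if ($reader->nodeType === XMLReader::ELEMENT && $reader->name === 'name') {
                $name = $reader->readString();
                break;
            }
        }
        break;
    }
    // next('person') skips the current subtree and stops at the next <person>,
    // returning false when there are none left.
    $onPerson = $reader->next('person');
}
$reader->close();

echo "Visited {$visited} person nodes, found: {$name}\n";
```

This only helps when the match can be decided from the record's own tag or attributes; if the test depends on a child node, as in the benchmark above, the cursor has to descend anyway.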

Use cases

For simple parsing, such as RSS feed handling and small XML documents, SimpleXML is definitely the way to go. Its easy access to document nodes is a great advantage.
For importing larger documents, XMLReader wins hands-down thanks to its ability to read the document node by node with limited impact on system memory. In fact, you can parse XML documents with XMLReader that are larger than the available system memory.

One final tip: avoid building large data structures while processing large XML documents with XMLReader, as it defeats the purpose of using XMLReader in the first place. Just grab the data needed to perform an operation and move on to the next iteration.
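
One way to follow that tip is to wrap the reader in a generator, so the calling code sees one value at a time and nothing accumulates. This is a sketch under assumptions: streamPersonNames() is a hypothetical helper, and the two-record temp file stands in for a real import.

```php
<?php
// Hypothetical helper: yields each <name> value as the cursor reaches it.
function streamPersonNames(string $path): Generator
{
    $reader = new XMLReader();
    $reader->open($path);
    try {
        while ($reader->read()) {
            if ($reader->nodeType === XMLReader::ELEMENT && $reader->name === 'name') {
                yield $reader->readString();   // hand back one value, keep nothing
            }
        }
    } finally {
        $reader->close();   // runs whether the loop finishes or the caller stops early
    }
}

// Tiny stand-in file for illustration.
$path = tempnam(sys_get_temp_dir(), 'persons');
file_put_contents(
    $path,
    '<persons><person><name>John</name></person><person><name>Alexandra</name></person></persons>'
);

// Process each record as it arrives; only running totals are kept.
$count = 0;
$longest = '';
foreach (streamPersonNames($path) as $name) {
    $count++;
    if (strlen($name) > strlen($longest)) {
        $longest = $name;
    }
}
unlink($path);

echo "{$count} names, longest: {$longest}\n";
```

The loop body could just as easily insert a row into a database or write a line to a CSV file; the point is that memory use stays flat however many records the file holds.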

Other Resources