03.05.2011. 16:17

PHP: Parsing XML Using Regexp

Here's one evergreen task: getting values out of external XML. There are zillions of ways to do this. You can use PHP's XML Parser extension (xml_parser_create() and friends, if you're into BDSM), or SimpleXML, etc, etc. But I prefer an old-school kludge that I made some 10 years ago in Perl, using regular expressions. It's a pretty efficient solution when the XML format is fixed and known, e.g. for a collection of items like:

<items>
   <item>
      <property1>value1</property1>
      <property2>value2</property2>
      ...
   </item>
   <item>
      <property1>value1</property1>
      <property2>value2</property2>
      ...
   </item>
   ...
</items>

All you need is to hit a string with this regexp:

	preg_match_all('/<'.$tag.'>((?:(?!<\/'.$tag.'>).)*)<\/'.$tag.'>/ms', $text, $matches);

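Now everything's in the $matches[1] array.
(Of course, nesting is not supported: the match ends at the first closing tag, so an element nested inside another element of the same name gets cut short. Opening tags that carry attributes are not matched either.)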
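To make that concrete, here is a minimal sketch of the regexp run against a made-up string. The sample input and variable names are assumptions for illustration only, and the preg_quote() call is an extra precaution that the one-liner above doesn't have. The second half shows the nesting caveat in action.

```php
<?php
// Hypothetical sample input; the tag names are made up for illustration.
$text = "<items>\n  <item><name>foo</name></item>\n  <item><name>bar</name></item>\n</items>";
$tag  = 'name';

// Added precaution (not in the original one-liner): escape any regex
// metacharacters in the tag name, including the "/" delimiter.
$t = preg_quote($tag, '/');

// The "s" modifier lets "." match newlines, so multi-line values are captured too.
preg_match_all('/<' . $t . '>((?:(?!<\/' . $t . '>).)*)<\/' . $t . '>/s', $text, $matches);

$values = $matches[1];        // array('foo', 'bar')

// Nesting caveat: the match stops at the first closing tag it meets.
$nested = '<div>outer <div>inner</div> tail</div>';
preg_match_all('/<div>((?:(?!<\/div>).)*)<\/div>/s', $nested, $m);

$nestedValues = $m[1];        // array('outer <div>inner')
```

Only the "s" modifier matters here; "m" changes how ^ and $ behave, and the pattern uses neither.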
So, here's a parser example:

<?php

/**
 * Sample class
 */
class sampleXmlRegexpParser {

	/**
	 * Parse import xml
	 *
	 * @param  string $xml
	 * @return array
	 */
	public function parse($xml) {
		$items = array();
		// Walk each <item> block, not the outer <items> wrapper
		foreach ($this->_getAllWrappedInTag($xml, 'item') as $itemTxt) {
			$fields = array();
			foreach (array(
						 'property1',
						 'property2',
						 //...
						) as $key)
				$fields[$key] = $this->_getTagValue($itemTxt, $key);
			$items[] = $fields;
		}
		return $items;
	}

	/**
	 * Gets value by tag
	 *
	 * @param  string $text
	 * @param  string $tag
	 * @return mixed
	 */
	protected function _getTagValue($text, $tag) {
		if (preg_match('/<'.$tag.'>((?:(?!<\/'.$tag.'>).)*)<\/'.$tag.'>/ms', $text, $matches))
			return $matches[1];
		else
			return null;
	}	

	/**
	 * Extracts all texts surrounded by given tag
	 *
	 * @param string $text
	 * @param string $tag
	 * @return array
	 */
	protected function _getAllWrappedInTag($text, $tag) {
		if (preg_match_all('/<'.$tag.'>((?:(?!<\/'.$tag.'>).)*)<\/'.$tag.'>/ms', $text, $matches))
			return $matches[1];
		else
			return array();
	}	
}
?>
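
For a quick end-to-end check, here's a rough standalone sketch of the same flow, condensed into plain functions so it runs on its own. The function names, the sample XML, and the preg_quote() call are my assumptions for the sake of the example, not part of the class above. It walks each item block and returns null for any property that is missing.

```php
<?php
// Standalone sketch: function-based equivalents of the class methods.
// All function names here are hypothetical.

// Returns the inner text of every <$tag>...</$tag> pair found in $text.
function allWrappedInTag($text, $tag) {
    $t = preg_quote($tag, '/');
    return preg_match_all('/<' . $t . '>((?:(?!<\/' . $t . '>).)*)<\/' . $t . '>/s', $text, $m)
        ? $m[1]
        : array();
}

// Returns the inner text of the first <$tag> pair, or null if there is none.
function tagValue($text, $tag) {
    $found = allWrappedInTag($text, $tag);
    return $found ? $found[0] : null;
}

// Builds one associative array per <item>, keyed by the requested properties.
function parseItems($xml, array $keys) {
    $items = array();
    foreach (allWrappedInTag($xml, 'item') as $itemTxt) {
        $fields = array();
        foreach ($keys as $key) {
            $fields[$key] = tagValue($itemTxt, $key);
        }
        $items[] = $fields;
    }
    return $items;
}

// Made-up input in the format shown at the top of the post.
$xml = <<<XML
<items>
   <item>
      <property1>a1</property1>
      <property2>a2</property2>
   </item>
   <item>
      <property1>b1</property1>
   </item>
</items>
XML;

$result = parseItems($xml, array('property1', 'property2'));
// $result[0] is array('property1' => 'a1', 'property2' => 'a2')
// $result[1] is array('property1' => 'b1', 'property2' => null)
```

Note that values come back raw: entities like &amp;amp; are not decoded, so decode them yourself if the feed uses any.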
Tags: php, regexp, xml