When using PHP’s simplexml_load_string function to parse XML strings, it is common to encounter parsing failures caused by “illegal characters.” These errors typically arise because the XML string contains characters that do not comply with XML standards, such as control characters, unescaped special symbols, or inconsistent encoding. This article will explain the causes of this issue in detail and offer solutions along with example code.
simplexml_load_string is a convenient PHP function for parsing XML strings. When the XML string contains illegal characters, the function returns false and triggers error messages. Illegal characters generally include:
ASCII control characters (0x00 to 0x1F, excluding tab, newline, and carriage return)
Unescaped characters (for example, & not written as &amp;)
An encoding declaration that does not match the actual encoding of the content
Non-UTF-8 content with no encoding declaration
Any of these prevents the XML parser from interpreting the string correctly, so parsing fails.
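Before cleaning anything, it can help to see exactly which control characters are present and where. The following is a minimal sketch; the function name find_illegal_control_chars is hypothetical and not part of PHP or SimpleXML.

```php
<?php
/**
 * Hypothetical helper: list the byte offsets of control characters that
 * XML 1.0 does not allow (everything below 0x20 except tab, LF, and CR).
 *
 * @return array<int, string> map of byte offset => hex code of the offending byte
 */
function find_illegal_control_chars(string $xml): array
{
    $found = [];
    // PREG_OFFSET_CAPTURE records the byte offset at which each match starts
    if (preg_match_all('/[\x00-\x08\x0B\x0C\x0E-\x1F]/', $xml, $matches, PREG_OFFSET_CAPTURE)) {
        foreach ($matches[0] as [$char, $offset]) {
            $found[$offset] = sprintf('0x%02X', ord($char));
        }
    }
    return $found;
}

// Example input with a backspace (0x08) and a NUL byte (0x00)
print_r(find_illegal_control_chars("<root>a\x08b\x00c</root>"));
```

An empty array means the string contains no forbidden control characters, so the failure is more likely caused by escaping or encoding.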
You can remove control characters using a regular expression:
<?php
// Sample input: the backspace character (\x08) is not allowed in XML 1.0
$xmlString = "<root>Hello\x08 World</root>";

// Remove control characters but keep newline (\n), carriage return (\r), and tab (\t)
$cleanXmlString = preg_replace('/[\x00-\x08\x0B\x0C\x0E-\x1F]/', '', $xmlString);

$xml = simplexml_load_string($cleanXmlString);
if ($xml === false) {
    echo "Parsing failed\n";
} else {
    print_r($xml);
}
?>
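The cleanup above can be made stricter by keeping only the code points that the XML 1.0 specification allows in its "Char" production. The sketch below assumes the input is already valid UTF-8; the function name strip_non_xml_chars is hypothetical.

```php
<?php
/**
 * Hypothetical helper: keep only code points allowed by the XML 1.0 "Char"
 * production (tab, LF, CR, U+0020-U+D7FF, U+E000-U+FFFD, U+10000-U+10FFFF).
 *
 * Assumes $xml is valid UTF-8. With the /u modifier, preg_replace() returns
 * null when the subject is not valid UTF-8, so null signals "convert first".
 */
function strip_non_xml_chars(string $xml): ?string
{
    return preg_replace(
        '/[^\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]/u',
        '',
        $xml
    );
}

// NUL and vertical tab are removed; the newline and the multibyte "é" are kept
var_dump(strip_non_xml_chars("a\x00b\x0Bc\xC3\xA9\n"));
```

Because a null result indicates invalid UTF-8, it is usually safer to convert the encoding before calling a helper like this one.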
If the XML content contains unescaped symbols like &, <, or >, they must be escaped first:
<?php
// Sample input with a raw ampersand, which is not well-formed XML
$xmlString = '<note>Tom & Jerry</note>';

// Replace every & with the &amp; entity
$xmlString = str_replace('&', '&amp;', $xmlString);

$xml = simplexml_load_string($xmlString);
if ($xml === false) {
    echo "Parsing failed\n";
} else {
    print_r($xml);
}
?>
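Note: If the XML already contains valid entities such as &amp; or &lt;, a blanket replacement double-escapes them (&amp; becomes &amp;amp;) and changes the data. Apply it only to input known to contain raw ampersands.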
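One way to avoid the double-escaping problem is to escape an ampersand only when it does not already begin an entity or character reference. This is a sketch under that assumption; the function name escape_bare_ampersands is hypothetical.

```php
<?php
/**
 * Hypothetical helper: escape "&" only when it does not already start an
 * entity or character reference (for example &amp;, &#38;, or &#x26;).
 */
function escape_bare_ampersands(string $xml): string
{
    // The negative lookahead skips ampersands followed by "name;", "#digits;", or "#xhex;"
    $escaped = preg_replace(
        '/&(?!(?:[A-Za-z_][A-Za-z0-9_.-]*|#[0-9]+|#x[0-9A-Fa-f]+);)/',
        '&amp;',
        $xml
    );
    // Fall back to the original string if the regex engine reports an error
    return $escaped ?? $xml;
}

echo escape_bare_ampersands('<note>Tom & Jerry &amp; friends &#38; co</note>'), "\n";
// <note>Tom &amp; Jerry &amp; friends &#38; co</note>
```

This leaves anything that looks like an entity untouched, including names XML does not predefine (such as &nbsp;), so those still need separate handling.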
Unless the XML declaration names another encoding, simplexml_load_string assumes the string is UTF-8. If the content is actually in another encoding (e.g., GBK, ISO-8859-1), convert it first:
<?php
$xmlString = file_get_contents('http://gitbox.net/path/to/xmlfile.xml');
$xmlString = mb_convert_encoding($xmlString, 'UTF-8', 'GBK');

$xml = simplexml_load_string($xmlString);
if ($xml === false) {
    echo "Parsing failed\n";
} else {
    print_r($xml);
}
?>
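Converting the bytes is not enough when the document carries an XML declaration such as encoding="GBK", because the parser would then decode the already converted bytes a second time. The sketch below converts the content and rewrites the declaration together. The function name xml_to_utf8 is hypothetical, and the source encoding is supplied by the caller rather than detected.

```php
<?php
/**
 * Hypothetical helper: convert an XML string to UTF-8 and make the XML
 * declaration agree with the new encoding.
 *
 * $sourceEncoding is an assumption supplied by the caller (for example 'GBK');
 * this sketch does not try to detect it.
 */
function xml_to_utf8(string $xml, string $sourceEncoding): string
{
    // Heuristic: bytes that already form valid UTF-8 are left alone
    if (!mb_check_encoding($xml, 'UTF-8')) {
        $xml = mb_convert_encoding($xml, 'UTF-8', $sourceEncoding);
    }

    // Rewrite encoding="..." in the declaration so it matches the converted bytes
    $rewritten = preg_replace(
        '/^(<\?xml[^>]*?encoding=)(["\'])[^"\']*\2/i',
        '${1}${2}UTF-8${2}',
        $xml,
        1
    );
    return $rewritten ?? $xml;
}

$latin1 = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><r>caf\xE9</r>";
echo xml_to_utf8($latin1, 'ISO-8859-1'), "\n";
```

The validity check is only a heuristic: pure ASCII input passes it regardless of the declared encoding, which is harmless, but a wrong $sourceEncoding for non-ASCII input still produces garbled text.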
For better debugging, enable internal error handling:
<?php
libxml_use_internal_errors(true);

$xmlString = '<invalid&xml>';

$xml = simplexml_load_string($xmlString);
if ($xml === false) {
    foreach (libxml_get_errors() as $error) {
        echo "Error: ", $error->message;
    }
    libxml_clear_errors();
} else {
    print_r($xml);
}
?>
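Each entry returned by libxml_get_errors() is a LibXMLError object that also carries a severity level, a line, and a column, which makes the offending character much easier to find. The following sketch collects that information; the function name parse_xml_with_errors and the message format are hypothetical choices.

```php
<?php
/**
 * Hypothetical helper: parse XML and return [SimpleXMLElement|false, string[]].
 * Each message includes the severity, line, and column reported by libxml.
 */
function parse_xml_with_errors(string $xml): array
{
    // Remember the previous setting so the caller's error handling is restored afterwards
    $previous = libxml_use_internal_errors(true);
    libxml_clear_errors();

    $doc = simplexml_load_string($xml);

    $levels = [
        LIBXML_ERR_WARNING => 'warning',
        LIBXML_ERR_ERROR   => 'error',
        LIBXML_ERR_FATAL   => 'fatal',
    ];

    $messages = [];
    foreach (libxml_get_errors() as $error) {
        $messages[] = sprintf(
            '[%s] line %d, column %d: %s',
            $levels[$error->level] ?? 'unknown',
            $error->line,
            $error->column,
            trim($error->message)
        );
    }

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return [$doc, $messages];
}

[$doc, $messages] = parse_xml_with_errors("<root>\n  <item>a & b</item>\n</root>");
foreach ($messages as $message) {
    echo $message, "\n";
}
```

The exact wording of the messages depends on the installed libxml version, so it is better to log them than to match against them.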
Below is a comprehensive example demonstrating how to clean illegal characters, ensure encoding, and catch errors:
<?php
libxml_use_internal_errors(true);

$xmlString = file_get_contents('http://gitbox.net/sample.xml');

// Convert encoding to UTF-8 first (assuming original encoding is GBK)
$xmlString = mb_convert_encoding($xmlString, 'UTF-8', 'GBK');

// Remove illegal control characters (preserve newline, carriage return, tab)
$xmlString = preg_replace('/[\x00-\x08\x0B\x0C\x0E-\x1F]/', '', $xmlString);

$xml = simplexml_load_string($xmlString);

if ($xml === false) {
    echo "Parsing failed, errors are as follows:\n";
    foreach (libxml_get_errors() as $error) {
        echo trim($error->message), "\n";
    }
    libxml_clear_errors();
} else {
    echo "Parsing succeeded:\n";
    print_r($xml);
}
?>
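Because cleaning can alter XML that was already valid, one possible design is to parse the input unchanged first and fall back to cleaning only when that attempt fails. This is a sketch of that idea, not a definitive implementation; the function name load_xml_leniently is hypothetical, and it assumes the input is already UTF-8.

```php
<?php
/**
 * Hypothetical helper: try the input unchanged first and clean it only if
 * the first attempt fails, so valid XML is never modified.
 *
 * @return SimpleXMLElement|false
 */
function load_xml_leniently(string $xml)
{
    $previous = libxml_use_internal_errors(true);

    $doc = simplexml_load_string($xml);
    if ($doc === false) {
        libxml_clear_errors();

        // Drop control characters that XML 1.0 forbids (tab, LF, and CR are kept)
        $cleaned = preg_replace('/[\x00-\x08\x0B\x0C\x0E-\x1F]/', '', $xml);

        // Escape ampersands that do not start an entity or character reference
        $cleaned = preg_replace(
            '/&(?!(?:[A-Za-z_][A-Za-z0-9_.-]*|#[0-9]+|#x[0-9A-Fa-f]+);)/',
            '&amp;',
            $cleaned
        );

        $doc = simplexml_load_string($cleaned);
    }

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $doc;
}

$doc = load_xml_leniently("<root>Tom\x08 & Jerry</root>");
var_dump($doc !== false);
```

Structural problems such as mismatched tags are not repaired by this fallback, so the function still returns false for them.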
When simplexml_load_string parsing fails, first check for illegal control characters and clean them as needed.
Ensure special characters in the XML string are properly escaped.
Make sure the XML string is encoded in UTF-8; convert if necessary.
Use libxml_use_internal_errors(true) to obtain detailed error information, which helps locate problems.
Mastering these techniques can effectively prevent parsing failures caused by illegal characters and make XML parsing more stable and reliable.