Tuesday 18 August 2015

How do you parse and process HTML in PHP

The following are different ways to parse HTML in PHP.

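Use DOM: The DOM extension allows you to operate on XML and HTML documents through the DOM API in PHP 5 and later. Load the markup with loadHTML() before querying it with XPath.
$fullHTML = file_get_contents('http://www.example.com/scrap.php');
// Collect parse warnings from malformed HTML instead of printing them.
libxml_use_internal_errors(true);
$domObj = new DOMDocument();
$domObj->loadHTML($fullHTML);
$xpath = new DOMXPath($domObj);
// Select every div that is a direct child of a div with class "myclass".
$tags = $xpath->query('//div[@class="myclass"]/div');
foreach ($tags as $tag) {
    echo trim($tag->nodeValue), "\n";
}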
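
To try the DOM approach without fetching a remote page, here is a minimal sketch that runs against an inline string. The markup in $html and the variable names are made up for illustration; only the DOMDocument and DOMXPath calls come from the DOM extension.

```php
<?php
// Hypothetical markup standing in for a fetched page.
$html = '<html><body>
  <div class="myclass"><div>First item</div><div>Second item</div></div>
  <div class="other"><div>Ignored</div></div>
</body></html>';

// Collect libxml parse warnings internally instead of printing them.
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

// Gather the text of each div directly inside div.myclass.
$xpath = new DOMXPath($dom);
$items = [];
foreach ($xpath->query('//div[@class="myclass"]/div') as $node) {
    $items[] = trim($node->textContent);
}

print_r($items);
```

Only the two divs nested inside the "myclass" div end up in $items; the div under "other" does not match the XPath expression.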

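Use SimpleXMLElement: The SimpleXML extension provides a simple, easy-to-use toolset for converting XML into an object that can be processed with normal property selectors and array iterators. The input must be well-formed XML (for example, valid XHTML); the constructor throws an exception on malformed markup.
$fullHTML = file_get_contents('http://www.example.com/scrap.php');
// Throws an Exception if $fullHTML is not well-formed XML.
$allData = new SimpleXMLElement($fullHTML);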
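
As a rough illustration of the property selectors and array iterators mentioned above, the sketch below parses a small, well-formed XHTML string. The markup and the variable names are assumptions chosen for the example, not part of any real page.

```php
<?php
// Hypothetical, well-formed XHTML used in place of a fetched page.
$xhtml = '<html><body><div class="myclass"><p>First</p><p>Second</p></div></body></html>';

// The root element (<html>) becomes the object itself.
$doc = new SimpleXMLElement($xhtml);

// Child elements are reached as properties, attributes with array syntax.
$div = $doc->body->div;
$class = (string) $div['class'];

// Repeated child elements can be iterated like an array.
$paragraphs = [];
foreach ($div->p as $p) {
    $paragraphs[] = (string) $p;
}

echo $class, "\n";
print_r($paragraphs);
```

The casts to string are needed because SimpleXML returns element objects rather than plain text.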

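Regular Expressions: A regular expression is a sequence of symbols and characters expressing a string or pattern to be searched for within a longer piece of text. It works for simple, predictable markup but breaks easily on nested or malformed HTML.
$fullHTML = file_get_contents('http://www.example.com/scrap.php');
// The pattern is single-quoted so the double quote inside it does not end the PHP string.
preg_match_all('/<(\w+)(\s+(\w+)\s*=\s*(\'|")(.*?)\4\s*)*\s*(\/>|>)/', $fullHTML, $matches);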
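
For a narrower task, such as pulling link targets out of a snippet, a smaller pattern is often easier to reason about. The sketch below is one possible approach, not a general HTML parser; the sample markup and the variable names are invented for the example.

```php
<?php
// Hypothetical snippet with one double-quoted and one single-quoted href.
$html = '<p><a href="/one">One</a> and <a href=\'/two\'>Two</a></p>';

// Group 1 captures the opening quote, group 2 the attribute value;
// \1 requires the value to end with the same quote character.
$pattern = '/<a\s+[^>]*href\s*=\s*(["\'])(.*?)\1/i';
preg_match_all($pattern, $html, $matches);

// $matches[2] holds every captured href value in document order.
$links = $matches[2];

print_r($links);
```

Unquoted attribute values and markup split in unusual ways would slip past this pattern, which is why the DOM approach is usually the safer choice for arbitrary pages.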

3rd Party Libraries
There are a lot of third-party libraries that can parse your HTML/XHTML. The following are a few well-known ones.
phpQuery: phpQuery is a server-side, chainable, CSS3 selector-driven Document Object Model (DOM) API based on the jQuery JavaScript library. It is written in PHP 5 and also provides a command-line interface (CLI).
Zend_Dom: Zend_Dom provides tools for working with DOM documents and structures.
QueryPath: QueryPath is a PHP library for manipulating XML and HTML.
FluentDOM: FluentDOM provides a jQuery-like fluent XML interface for the DOMDocument in PHP.

Several APIs are also available for scraping websites; a few of them follow.
YQL: The YQL Web Service enables applications to query, filter, and combine data from different sources across the Internet. It has an SQL-like syntax, familiar to any developer with database experience.
ScraperWiki: ScraperWiki's external interface allows you to extract data in the form you want for use on the web or in your own applications. You can also extract information about the state of any scraper.