
How to Search WordPress Blogs Across the Web Using a Custom Widget


Can you find out whether a particular site is built on WordPress? It’s very simple: you just need to look at the source code of any of its web pages. The source contains the text content of the page, along with the HTML tags your browser uses to render it. Look at the contents in detail and you will likely find some WordPress-related directories. For example, view the page source of a site that you know for sure uses WordPress, like ours. You can view the page source of this very blog post. You will find URLs that contain /wp-content/ or /wp-includes/. In fact, you can even find out which theme we use, and some of the plugins.
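As a rough illustration of this manual check, the sketch below scans a page’s HTML for /wp-content/ and /wp-includes/ paths and pulls out the theme and plugin folder names. The function name inspect_wp_source and the sample markup are invented for this example. Real pages vary, so treat it as a starting point rather than a reliable detector.

```php
<?php
// Hypothetical helper: reports WordPress hints found in a page's HTML source.
function inspect_wp_source($html)
{
    // A page is likely built on WordPress if it references these core directories.
    $is_wordpress = (stripos($html, '/wp-content/') !== false)
        || (stripos($html, '/wp-includes/') !== false);

    // Theme and plugin folder names follow /wp-content/themes/ and /wp-content/plugins/.
    preg_match_all('#/wp-content/themes/([^/"\'?\s]+)#i', $html, $themes);
    preg_match_all('#/wp-content/plugins/([^/"\'?\s]+)#i', $html, $plugins);

    return array(
        'is_wordpress' => $is_wordpress,
        'themes'       => array_values(array_unique($themes[1])),
        'plugins'      => array_values(array_unique($plugins[1])),
    );
}

// Sample markup, invented for this example.
$sample = '<link rel="stylesheet" href="http://example.com/wp-content/themes/twentyfourteen/style.css?ver=3.9.1" />'
    . '<script src="http://example.com/wp-content/plugins/contact-form/script.js"></script>';

$info = inspect_wp_source($sample);
```

For the sample markup, $info['themes'] holds the theme folder name and $info['plugins'] holds the plugin folder name.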

When you think about it, you can use this exact logic to detect WordPress sites anywhere, even among search results. For example, you can fetch the source of every page in your search results and check whether it is built on WordPress. And if you automate this filtering process, you can create your very own WordPress Blogs Search Widget.


Creating a Search WordPress Blogs Widget

In our previous articles, ‘Building an Apache Solr Search Plugin for WordPress’ and ‘Configuring the Sphinx Search Plugin for WordPress’, we wrote about creating advanced search plugins for your WordPress site. This article is a bit different. We won’t be implementing search within your own site. Instead, I will show you a technique to limit a web search to sites built on WordPress. We will fetch search results from Google, filter them down to the sites built on WordPress, and display those to the user.

Create the Search Widget

We begin by creating a simple widget in WordPress. Since our widget provides search functionality, it needs a search bar, which is basically a form with a text box and a submit (search) button. When a user enters keywords, you build a search query, which is the search engine URL with the keywords as parameters. For example, if a user enters the words ‘web’ and ‘developers’, your search query would be

[pre]https://www.google.com/search?q=web+developers[/pre]
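One way to build such a query from whatever the user typed is sketched below. The helper name build_search_url is hypothetical, and the start parameter is the same result offset used in the search loop later in this article. The point is that the keywords should be URL-encoded rather than pasted into the URL directly.

```php
<?php
// Hypothetical helper: builds a Google results URL for the given keywords.
// $start is the zero-based offset of the first result (0, 10, 20, ...).
function build_search_url($keywords, $start = 0)
{
    $params = array(
        'q'     => $keywords,
        'start' => (int) $start,
    );
    // http_build_query() URL-encodes the values; spaces become '+'.
    return 'https://www.google.com/search?' . http_build_query($params, '', '&');
}

// prints https://www.google.com/search?q=web+developers&start=0
echo build_search_url('web developers'), "\n";
```

Encoding matters as soon as a user enters characters such as & or +, which would otherwise be read as part of the URL syntax.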

Any additional tags can be added as parameters in your search query. The search itself is performed as follows:

[pre]$search_result_links = array();

// Add a user agent; any browser-like string will do.
// Google might block your IP address if multiple requests are sent
// continuously for the same keywords. In such cases, you can use
// different user agents to perform your search operation.
$my_user_agent = "Mozilla/5.0 (X11; U; Linux x86_64; en-ca) AppleWebKit/531.2+ (KHTML, like Gecko) Version/5.0 Safari/531.2+";

// collect warnings about malformed HTML quietly instead of printing them
libxml_use_internal_errors(true);

// only get the first 50 results (10 per page)
$srp = 0;
while ($srp < 50)
{
    // find page-wise results
    $my_url = "https://www.google.com/search?q=web+developers&start=$srp";
    $c = curl_init($my_url);
    curl_setopt($c, CURLOPT_HEADER, false);
    curl_setopt($c, CURLOPT_USERAGENT, $my_user_agent);
    curl_setopt($c, CURLOPT_FAILONERROR, true);
    curl_setopt($c, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($c, CURLOPT_AUTOREFERER, true);
    curl_setopt($c, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($c, CURLOPT_TIMEOUT, 10);

    // perform the search; curl_exec() returns false on failure
    $html = curl_exec($c);
    if ($html !== false && $html !== '')
    {
        $dom = new DOMDocument();
        $dom->loadHTML($html);

        // get the URL of each search result
        $result_links = $dom->getElementsByTagName('cite');
        foreach ($result_links as $l)
        {
            $title = 'http://' . strip_tags($l->nodeValue);
            array_push($search_result_links, $title);
        }
    }

    $srp += 10;
}[/pre]

Hint: To further optimize the above code, we can execute multiple search requests in parallel using cURL’s multi interface (curl_multi_init).
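A minimal sketch of that idea follows, assuming the cURL extension is available. The function name fetch_pages_in_parallel is made up for this example and is not part of the code above. It simply runs several transfers at once and returns each page body keyed like the input array.

```php
<?php
// Hypothetical helper: fetches several URLs concurrently with cURL's multi interface.
// Returns an array keyed like $urls; a value is false when nothing was received.
function fetch_pages_in_parallel(array $urls, $user_agent = 'Mozilla/5.0', $timeout = 10)
{
    $multi = curl_multi_init();
    $handles = array();

    foreach ($urls as $key => $url) {
        $c = curl_init($url);
        curl_setopt($c, CURLOPT_USERAGENT, $user_agent);
        curl_setopt($c, CURLOPT_FOLLOWLOCATION, true);
        curl_setopt($c, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($c, CURLOPT_TIMEOUT, $timeout);
        curl_multi_add_handle($multi, $c);
        $handles[$key] = $c;
    }

    // Drive all transfers until none are still running.
    $running = 0;
    do {
        $status = curl_multi_exec($multi, $running);
        if ($running > 0) {
            // Wait for network activity instead of spinning in a tight loop.
            curl_multi_select($multi, 1.0);
        }
    } while ($running > 0 && $status === CURLM_OK);

    $pages = array();
    foreach ($handles as $key => $c) {
        $body = curl_multi_getcontent($c);
        $pages[$key] = ($body === null || $body === '') ? false : $body;
        curl_multi_remove_handle($multi, $c);
    }

    return $pages;
}
```

You could pass it the five result-page URLs (start=0 through start=40) in one call and then parse each returned page as before. Keep in mind that sending requests in bursts makes it more likely that the search engine will throttle or block you.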


Find Sites Built on WordPress

If we leave the search query as is, Google will list results from all sites. To separate out just the WordPress blogs, we have to filter the search results. So how do we do this? We detect the use of WordPress in the source (HTML tags) of each search result, using regular expressions. (Just stay with me on this.)

Usually, the stylesheets and scripts in the wp-content folder are enqueued in the head. For example, your theme’s stylesheet would appear as

[pre]<link rel='stylesheet' href='http://example.com/wp-content/themes/twentyfourteen/style.css?ver=3.9.1' type='text/css' media='all' />[/pre]

Obviously, we cannot search for an exact match, because every search result will have a different site name and theme. We can instead perform a generalized search using a regular expression, because the structure remains the same: a link tag with rel='stylesheet' (or rel="stylesheet"), followed by a path to /wp-content/themes/, the theme name, and the file name.

The regular expression for this would be as follows:

[pre]"/<link rel=(\"|')stylesheet(\"|') [^>]+wp-content\/themes/i"[/pre]

Here, we ensure that wp-content is present before the link tag is closed, and ‘i’ indicates case-insensitivity. Note that when / is used as the pattern delimiter, any forward slash inside the pattern must be escaped as \/. You can create similar regular expressions for plugins, script files, or other meta tags. Be sure to test the regular expressions you create using RegExr (a must-be-bookmarked site).
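To give an idea of what such a set of patterns could look like, here is a small sketch. The pattern list and the function name matched_wp_patterns are illustrative assumptions, not a complete or authoritative fingerprint of WordPress. The patterns use # as the delimiter so that the forward slashes in the paths need no escaping.

```php
<?php
// Illustrative patterns only; extend or tighten them for your own needs.
$wp_patterns = array(
    'theme stylesheet' => '#<link rel=("|\')stylesheet("|\') [^>]+wp-content/themes#i',
    'plugin asset'     => '#<(link|script) [^>]+wp-content/plugins#i',
    'core script'      => '#<script [^>]+wp-includes/#i',
    'generator tag'    => '#<meta name=("|\')generator("|\') content=("|\')WordPress#i',
);

// Hypothetical helper: returns the labels of all patterns found in $html.
function matched_wp_patterns($html, array $patterns)
{
    $hits = array();
    foreach ($patterns as $label => $regex) {
        if (preg_match($regex, $html) === 1) {
            $hits[] = $label;
        }
    }
    return $hits;
}

// The stylesheet tag from the example above.
$tag = "<link rel='stylesheet' href='http://example.com/wp-content/themes/twentyfourteen/style.css?ver=3.9.1' type='text/css' media='all' />";
$hits = matched_wp_patterns($tag, $wp_patterns);
```

Returning the matching labels instead of a plain true or false makes it easier to see which pattern fired while you are still tuning the list.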

If you do not want to break your head over creating regular expressions to detect the WordPress CMS, you can use the CMS Detector class. This class determines the software used by a web page by analysing its HTML tags with regular expressions like the ones mentioned above. You can use the class directly, or use its individual regular expressions to filter search results. (I would recommend the CMS Detector class if you want to detect several CMSs, or if the user can choose which CMS to detect.)

Filtering WordPress Only Blogs

Wait, we are not done. We have to connect the two: the search process, and the filtering of results. At the end of our search process, we would have populated the $search_result_links array with the search result URLs. We need to get the contents of each link, and detect if WordPress is being used.

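[pre]$final_results = array();
foreach ($search_result_links as $sr_link) {
    // fetch the page source; file_get_contents() returns false on failure
    $source_content = @file_get_contents($sr_link);
    if ($source_content !== false && is_WordPress_detected($source_content)) {
        array_push($final_results, $sr_link);
    }
}[/pre]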
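If you want this step to be a little more defensive, the sketch below wraps it in two functions. The names fetch_source and filter_links, the timeout value, and the idea of passing the detector in as a callback are all assumptions made for this example. The callback lets you plug in whichever detection approach you choose.

```php
<?php
// Hypothetical helper: fetches a page source with a timeout; returns '' on failure.
function fetch_source($url, $timeout = 10)
{
    $context = stream_context_create(array(
        'http' => array('timeout' => $timeout, 'user_agent' => 'Mozilla/5.0'),
    ));
    $content = @file_get_contents($url, false, $context);
    return ($content === false) ? '' : $content;
}

// Hypothetical helper: keeps only the links whose source the detector accepts.
// $detector is any callable that takes the page source and returns true or false.
function filter_links(array $links, $detector)
{
    $final_results = array();
    foreach ($links as $link) {
        $source = fetch_source($link);
        if ($source !== '' && call_user_func($detector, $source)) {
            $final_results[] = $link;
        }
    }
    return $final_results;
}
```

For example, filter_links($search_result_links, 'is_WordPress_detected') would apply the detection function from this article to every search result.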

I will provide you with two options here: you can use the CMS Detector, or match the content directly against the regular expressions.

Using the CMS Detector

[pre]function is_WordPress_detected($source_content)
{
    $cms_detected = CMS_Detector::process($source_content);
    if (in_array('WordPress', $cms_detected)) {
        return true;
    }
    return false;
}[/pre]

Using Regular Expressions

[pre]function is_WordPress_detected($source_content)
{
    $rval = false;
    // add all your regular expressions
    $wp_regex = array("/<link rel=(\"|')stylesheet(\"|') [^>]+wp-content\/themes/i");
    foreach ($wp_regex as $wpr)
    {
        // match each pattern against the page source passed in
        if (preg_match($wpr, $source_content) === 1) {
            // detected
            $rval = true;
            break;
        }
    }
    return $rval;
}[/pre]


And there you have it: your very own custom search feature, which gives users the option to search only across sites built on WordPress. We did not have to think much. We only had to automate the steps we performed manually. Remember, it’s not rocket science, it’s logic.

Aparna Gawade
