WordPress, https, and Canonical URLs

About a week ago, I added an SSL certificate to the BMOW web site, in preparation for some improved shopping cart features. With an SSL certificate signed by a certificate authority (free from Let’s Encrypt), the site can serve pages using the encrypted https protocol as well as the standard non-encrypted http. Pages encrypted with https will show a padlock icon or something similar in the address bar of most web browsers, and are normally used for handling sensitive content like payment info for a web store. My plan was to continue serving the existing blog pages using http, and use https for the new shopping cart pages. But it’s technically possible to serve any page from the site using https – try it! Just manually edit the URL of this post in your address bar, and change http to https.

The main blog pages aren’t designed to be served with https, however, and they contain embedded non-secure content like images and comment forms that use the http protocol. If you view this post as https, it will work, but your browser will probably display a warning about insecure content. If you try to post a comment, you’ll see a warning about a non-secure form, and if you persist in posting the comment you’ll see an error 403: forbidden message.
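One way to find the embedded content behind those warnings is to scan a page's HTML for hard-coded http:// references. The following is only a rough Python sketch, not something taken from the BMOW site: the names `InsecureRefFinder` and `find_insecure_refs` and the sample HTML are made up for illustration, and it checks just a handful of tags. Ordinary links are deliberately skipped, since following a link to an http page does not count as insecure content on the current page.

```python
from html.parser import HTMLParser

# Tags whose attribute makes the browser load or submit something as part of
# the page. This short list is an assumption for the sketch, not a complete one.
CHECKED = {
    "img": "src",
    "script": "src",
    "iframe": "src",
    "form": "action",
}


class InsecureRefFinder(HTMLParser):
    """Collects (tag, url) pairs for embedded references that hard-code http://."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        wanted = CHECKED.get(tag)
        if wanted is None:
            return  # e.g. <a href>, which is a plain link and not embedded content
        for name, value in attrs:
            if name == wanted and value and value.lower().startswith("http://"):
                self.found.append((tag, value))


def find_insecure_refs(html):
    """Return every hard-coded http:// reference found in the checked tags."""
    finder = InsecureRefFinder()
    finder.feed(html)
    finder.close()
    return finder.found


# Hypothetical page fragment: one insecure image, one local image,
# one ordinary link, and one insecure comment form.
sample = """
<img src="http://www.example.com/photo.jpg">
<img src="/images/local.png">
<a href="http://www.example.com/other-post/">a link</a>
<form action="http://www.example.com/comment" method="post"></form>
"""

for tag, url in find_insecure_refs(sample):
    print(tag, url)
```

Run against the sample, this reports the first image and the comment form, which are the same two kinds of content described above.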

Since nobody ever visits the BMOW site using https, I thought those security warnings didn’t matter, until I discovered that Google has started replacing all of the BMOW links in its search results with https versions of those same links. Someone who searches Google for “KiCad vs Eagle” might see a result with an https link to my post on that topic. Following the link, they’ll get a bunch of security warnings from their browser. And after commenting on the post, they’ll get the dreaded error 403. Oops.

I learned that Google prefers to index pages as https rather than http, if it discovers that a web server supports both. After doing more research I considered a few paths out of this mess:

  • Go full-blown https everywhere on the site. Fix images, comment forms, and other content that use http.
  • Redirect https requests to http versions of the same URL.
  • Use canonical URLs to instruct search engines to index the http versions of pages, not https.

Switching everything to https would be lots of work, and wasn’t the end result I wanted anyway. Redirecting all https requests to http would probably be OK, but seems a little bit drastic, and I’d need to carve out exceptions for the shopping cart and admin pages.

Canonical URLs

Canonical URLs are a nice feature, and I decided to use them to solve this problem. In the header section of any HTML document, you can include a link like this one:

<link rel="canonical" href="http://www.example.com/mypage/" />

and search engines will index the page as http://www.example.com/mypage/, regardless of whether they reached the page as

http://www.example.com/mypage/
https://www.example.com/mypage/
https://www.example.com/mypage/?q=vegemite
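To make that mapping concrete, here is a small Python sketch that collapses the three variants above into the single canonical form. It is not how any search engine actually works. It only illustrates what the canonical link declares. The function name `canonical_url` is made up, and dropping the query string is an assumption that fits this example, since some sites use query strings to identify distinct pages.

```python
from urllib.parse import urlsplit, urlunsplit


def canonical_url(url, scheme="http"):
    """Return url with the scheme forced and the query string and fragment dropped."""
    parts = urlsplit(url)
    # Keep the host and path, replace the scheme, and discard query and fragment.
    return urlunsplit((scheme, parts.netloc, parts.path, "", ""))


variants = [
    "http://www.example.com/mypage/",
    "https://www.example.com/mypage/",
    "https://www.example.com/mypage/?q=vegemite",
]

for v in variants:
    print(canonical_url(v))
```

All three variants print as http://www.example.com/mypage/, the address given in the canonical link.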

WordPress automatically adds canonical URLs to some pages, but not all, so I installed the Yoast SEO plugin to gain more control over canonical URL generation. Yoast added the canonical URLs as expected, but not in the way I needed. If I visited a page on the site using http, then Yoast would generate a canonical URL link beginning with http://. But if I visited a page on the site using https, Yoast would generate a canonical URL link beginning with https://, which was exactly what I didn’t want. I was finally able to force canonical URLs to always start with http:// by inserting this code snippet into my WordPress install’s functions.php:

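function design_canonical( $canonical ) {
  global $post;
  // Only rewrite the canonical URL when the page was requested over https.
  if ( isset( $_SERVER['HTTPS'] ) && $_SERVER['HTTPS'] == "on" ) {
    // Strip the https scheme and host from the permalink, leaving only the path.
    $find = 'https://www.exampledomain.com';
    $replace = '';
    $theurl = str_replace( $find, $replace, get_permalink( $post->ID ) );
    // Rebuild the full URL from that path, forcing the http scheme.
    return site_url( $theurl, 'http' );
  }
  // For http requests, hand back the canonical URL that Yoast SEO passed in.
  // A filter callback that returns nothing would pass null along instead.
  return $canonical;
}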

add_filter( 'wpseo_canonical', 'design_canonical' );

Fixed?

This should be all that’s needed to make Google, Bing, and other search engines use http for indexing all my content. It may take a few days for the Google index to be updated with the new links, but eventually everything will be http. And that should be enough to prevent visitors from accidentally viewing the site content as https, right? Well, maybe not. I had forgotten about the existence of browser plugins like HTTPS Everywhere that attempt to force the use of https wherever they can. So even if Google is no longer sending traffic to https versions of my pages, other sources of https traffic still exist, and those visitors will see all the security warnings and errors I described.

I’m scratching my head, wondering how to proceed. Redirect all https traffic to http, as I’d originally considered? Or leave everything as is, and let HTTPS Everywhere visitors deal with the problems that extension creates? Maybe there’s another simpler solution. It all makes me appreciate how complex the job of a web site admin can really be.

11 Comments so far

  1. Tim Buchheim February 12th, 2016 4:12 pm

    A few months ago I switched the site I run over to HTTPS-only. Trying to access it via unencrypted HTTP just gets you a redirect to the HTTPS site. Of course, 99% of the site had been working just fine with HTTPS before that, so I didn’t need to do much to switch.

    My suggestion: change all links, image URLs, script URLs, etc. to use “/absolute/pathname” format rather than “http://www.bigmessowires.com/absolute/pathname” format. That way they work with either HTTP or HTTPS.

    In some cases relative pathnames make sense. (e.g., if you keep all the images used by a CSS stylesheet in the same directory as the CSS file, then just use relative pathnames.) But there’s usually no good reason to use a full URL including protocol & hostname for anything that’s referring to the same site. Now if you have to use any links to off-site image/style/script resources then you’ll run into problems, as then you do need to specify a protocol & host.

  2. Steve February 12th, 2016 4:56 pm

    I have 9 years and about 500 posts with many hard-coded http:// references, too many to fix by hand. There might be a plugin that could do it for me automatically, but I’m a little gun-shy of doing a mass edit. But my bigger concern is why I should use https in the first place for content that doesn’t need to be encrypted – it just seems wrong. 🙂 What convinced you to make the switch?

  3. Owen Shepherd February 14th, 2016 10:52 am

    One good reason is that there are now many ISPs (even in the US!) which will do forceful ad injection into all pages served over HTTP

  4. Steve February 15th, 2016 4:25 pm

    The canonical URL solution I described above is working, but much too slowly. Google managed to index about 600 https links to this site within a matter of days, but had removed only about 40 of them three days after I added canonical URLs. So I’ve admitted defeat, and implemented a 301 redirect from https to http for all non-secure blog pages and files. In my root-level .htaccess file, I added this:

    <IfModule mod_rewrite.c>
    RewriteEngine On
    RewriteBase /
    RewriteCond %{HTTPS} on
    RewriteRule .* http://%{HTTP_HOST}%{REQUEST_URI} [R=301,L]
    </IfModule>

    Clicking on a BMOW https link in a Google search result will now redirect to the http version of that same page, with no more warnings about insecure content. Works for WordPress permalinks as well as real files, at the root directory and subdirectories. But viewing https pages in my shopping cart section (yet to be made public) also works, and remains https. To be candid, I don’t understand why that works, or the precedence of multiple .htaccess files if there’s one at the root level and others in subdirectories.

  5. GotNoTime February 20th, 2016 5:34 am

    Why not just disable HTTPS and remove the certificate? You’re forcing all the connections to HTTP anyway so you’re not gaining anything but adding extra complication and CPU load.

  6. Steve February 20th, 2016 7:23 am

    I need the certificate for the new shop pages at http://www.bigmessowires.com/shop. The rest of the site doesn’t need HTTPS and would take some work to make it HTTPS-friendly. I’ll likely do that eventually, but not today…

  7. GotNoTime February 20th, 2016 10:48 am

    You could split it into two sites. Everything is nicely separated then and you don’t have troubles with SSL.

    http://www.bigmessowires.com = blog + no SSL.
    shop.bigmessowires.com = shop + SSL.

  8. Chris Combs February 27th, 2016 7:30 am

    You could use protocol-agnostic src/hrefs–just //server.com/stuff instead of http://server.com/stuff. The browser will use the current page’s protocol to retrieve each asset specified this way.

  9. Steve February 27th, 2016 7:53 am

    Yup, that would have been the best way to go if I’d done it from the start. I’m afraid of the clean-up job fixing 9 years of old posts and content with embedded http protocol refs in the URL, but eventually I’ll probably do that.

  10. Jeff Mcneill December 1st, 2016 9:29 pm

    HTTPS-friendly is really just a matter of rewriting all hardcoded URL references (search-and-replace in the database and file system), setting rewrite rules so that inbound http:// requests are changed into https:// requests, and of course a well-configured SSL setup and valid certs.

    Yes, there are many details in here, but the bottom line is that the web is moving to HTTPS with a combination of carrots and sticks by search engines, preferences of some users for greater privacy (that is, greater than none), and greater security.

    It is the future, and the sooner it is implemented (with concomitant growing pains and organic site traffic index learning curve), the better.

  11. Peter October 26th, 2017 11:12 pm
