
Archive for the 'Bit Bucket' Category

WordPress Latin1 and UTF-8, Part 2

Yesterday I wrote about some BMOW blog troubles displaying special characters and international characters, which was apparently triggered by a recent update to MySQL 8 at my web host. Old pages containing special characters like curly quotes, accented letters, or non-Latin characters were suddenly rendering as garbled combinations of random-looking symbols, whereas they were previously OK. If you read the follow-up comments, you saw that I was eventually able to resolve the problem (mostly) by adding these lines to my wp-config.php file:

define('DB_CHARSET', 'latin1');
define('DB_COLLATE', '');

But I didn’t fully understand what those lines changed, or exactly why this problem appeared in the first place. After some digging in the MySQL database, I think I have a slightly better understanding now.

 
Back to Kristian Möller

I returned to the example of Kristian Möller, whose name contains the letter o with umlaut. After the MySQL update, the name was appearing incorrectly as MÃ¶ller. This is what you’d expect to see if the UTF-8 bytes 0xC3 0xB6 for ö were incorrectly interpreted as two separate Latin1 bytes, 0xC3 for Ã and 0xB6 for ¶.
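
The byte math is easy to check. Here’s a quick Python 3 illustration of the two interpretations (this is just the encoding rules at work, not anything specific to WordPress or MySQL):

raw = b'\xc3\xb6'             # the UTF-8 byte sequence for 'ö'
print(raw.decode('utf-8'))    # ö  -- the intended character
print(raw.decode('latin-1'))  # Ã¶ -- the garbled rendering seen on the blog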

Using phpMyAdmin, I was able to connect to the live WordPress DB and examine the wp_comments table where this name is stored. The result for comment_id 233746 is shown above, displaying the author’s name both as text and as raw hex bytes. You can see the hex bytes contain the sequence C3B6, which is the correct UTF-8 byte sequence for ö. That’s great news. It means the text stored in my database is correct, uncorrupted UTF-8. But all is not well – the metadata associated with the table is wrong. It thinks the text is Latin1, and displays it as such in the phpMyAdmin UI. I was able to confirm this by executing the SQL command:

show create table wp_comments

This echoes back the SQL command that was originally used to create this table, way back in 2007. And lo and behold, part of that original SQL command specified CHARSET=latin1. Ever since then, WordPress has been storing UTF-8 text in a Latin1 table and reading it back out again. This is bad practice, but it worked fine for 14 years until the MySQL update earlier this month.

 
Why Does DB_CHARSET latin1 Help?

Defining WordPress’ DB_CHARSET variable to be latin1 sounds like it’s telling WordPress what type of character set is used by the database. But if you think it through, that doesn’t fit the evidence here. If I tell WordPress that my DB data is in Latin1 format, even though as we’ve seen it’s really UTF-8, then I would expect WordPress to convert the data bytes from Latin1 to UTF-8 as it loads them during a page render. That would do exactly the wrong thing; it would cause the very problem that I’m trying to prevent.

I searched for a detailed explanation of precisely what the DB_CHARSET setting does, but couldn’t find one that made sense to me. Most references just say to change the value, without fully explaining what it does.

While I don’t have any strong evidence to support this, my guess is that a MySQL client has a choice of connecting to the MySQL database in Latin1 mode or UTF-8 mode, and this is what DB_CHARSET controls for WordPress. If the client connects as a UTF-8 client but the table is marked as being Latin1, my guess is MySQL automatically translates the data. Normally that would be a good thing, but if UTF-8 data were stored in a table improperly marked as being Latin1, it would cause unwanted and unnecessary character conversions, causing the types of problems I saw on the blog.

 
Why Did This Break Now?

So what changed during the recent MySQL update to suddenly break this? Why did the problem appear now? Initially I suspected the underlying data bytes had become corrupted during the update, but the hex display from phpmyadmin showed the data bytes are OK.

I can’t say for certain whether the problem was caused by exporting and reimporting my database, or whether it’s due to new behavior in MySQL 8. Now that I think about it, I’m not even certain whether the result I saw from that show create table wp_comments was actually the original SQL command from 2007, or the SQL command from eight days ago when the database was migrated to MySQL 8.

If these database tables were always explicitly marked Latin1, going all the way back to 2007, then I think this character set conversion problem would always have happened too. Or at least it would have happened as soon as I updated to my current version of WordPress, instead of when I updated to a new version of MySQL.

One possibility is that with the old database and old version of MySQL, the character set for the database tables wasn’t explicitly defined. It relied on some database-wide default which just happened to be UTF-8, so everything worked when WordPress connected to the DB as a UTF-8 client. Then during the MySQL 8 update, somehow the tables were explicitly set to Latin1 and the problem appeared.

Another possibility is that the tables were already explicitly Latin1, but WordPress was previously connecting to the database as a Latin1 client, so it worked OK. Since my version of WordPress hasn’t changed recently, this would mean the default database connection type for WordPress must somehow come from the database itself, or the database server, and that’s what changed during the MySQL 8 update.

Whatever the explanation, changing DB_CHARSET now seems like only a temporary solution. I still have UTF-8 data stored in tables that say they’re Latin1, which seems likely to cause more problems down the road. If nothing else, it makes the output display incorrectly in the phpMyAdmin UI. A full solution will probably require some more significant database maintenance, but I hope to postpone that for a while.


Non-Breaking Spaces and UTF-8 Madness

A few months back I wrote about some web site troubles with the &nbsp; HTML entity, non-breaking spaces, and UTF-8 encoding. A similar problem has reared its head again, but in a surprising new form that seems to have infected previously-good web pages on this site. If you view the Floppy Emu landing page right now, and scroll down to the product listing details, you’ll see a mass of stray Â characters everywhere, screwing up the item listings and making the whole page look terrible. Further down the page in the Documentation section, the Japanese hiragana label that’s supposed to accompany the Japanese-language manual appears rendered as a series of accented Latin letters and square boxes.

The direct UTF-8 encoding for a non-breaking space (without using the &nbsp; entity) is C2 A0. Â is Unicode character U+00C2, Latin capital letter A with circumflex. So somehow the C2 value is being interpreted as a lone A-with-circumflex character rather than as part of a two-byte sequence for a non-breaking space.
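
A quick Python 3 check makes the mechanism concrete (again, just illustrating the encoding rules, nothing site-specific):

nbsp = '\u00a0'                                 # Unicode non-breaking space
print(nbsp.encode('utf-8'))                     # b'\xc2\xa0' -- the two-byte UTF-8 form
print(nbsp.encode('utf-8').decode('latin-1'))   # 'Â' followed by a non-breaking space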

The strangest part of all this is that the Floppy Emu landing page hasn’t been modified since September 6, and I’m sure it hasn’t looked this way for the past month. In fact the most recent archive.org snapshot from September 8 shows the page looking fine. So what’s going on?

I’ve noticed a related problem with some other pages on the site. The Mac ROM-inator II landing page hasn’t been modified in six months, but it too shows strange encoding problems. The sixth bullet point in the feature list is supposed to say Happy Mac icon is replaced by a color smiling “pirate” Mac, with the word pirate in curly double-quotes. In the most recent archive.org snapshot it looks correct. But if you view the page today, the curly quotes are rendered as a series of accented letters and Euro currency symbols. There are similar problems elsewhere on that page, for example in the second user comment, Kristian Möller’s name is rendered incorrectly.

These problems are visible in all the different browsers I tried. And none of them seem to correspond to any specific change I made in the content of those pages recently.

I’m not a web developer or HTML expert, but I’m wondering if the header of all my HTML pages somehow got broken, and it’s not correctly telling the web browsers to interpret the data as UTF-8. But when I view the page source using the Chrome developer tools, the fourth line looks right:

<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">

And what’s more, Unicode characters in this very blog post appear to render just fine. For example I can copy-paste those hiragana characters from the archive.org snapshot of the Floppy Emu page, and paste them here without any special encoding tricks, and they render correctly:

??????????????

Nevermind, I’m wrong. I can paste those characters into the WordPress editor, and they appear correctly, but as soon as I save a draft or preview the page they turn into a string of question marks. I’m sure that direct copy-and-paste of this text used to work, it’s how I added this text to the Floppy Emu page in the first place.

So if the character set is configured correctly in my pages’ HTML headers (which I could be wrong about), what else could explain this? The alternative is that the text is stored in a broken form in the WordPress database, or is being broken on-the-fly as the page’s HTML is generated. Perhaps some recent WordPress action or update caused the text of previously-written pages to be parsed by a buggy filter with no UTF-8 awareness and then rewritten to the WordPress DB? But I haven’t changed any WordPress settings recently, performed any upgrades, or installed any new plugins. I can’t explain it.

I dimly recall seeing a notice from my web host (Dreamhost) a few weeks ago, saying they were updating something… I think MySQL was being upgraded to a newer version. Maybe that’s a factor here?

Whatever the cause, the important question is how to fix this mess. I could go manually edit all the affected pages, but that would be tedious, and I wouldn’t really be confident the problem would stay fixed. Even if I were willing to make manual fixes, it still wouldn’t fix everything, since some of the broken text is in users’ names and comment text rather than in my own page text. For the moment, I’m stumped until I can figure this out.

 
An Example: Möller

On the ROM-inator page there’s a comment from Kristian Möller. I’m using the explicit HTML entity here for o with umlaut to ensure it renders correctly. Until recently, it also looked correct on the ROM-inator page, but now it appears as MÃ¶ller, again using explicit HTML entities for clarity. So what happened?

The two-byte UTF-8 encoding for ö is 0xC3 0xB6. The one-byte Latin-1 encoding for Ã is 0xC3, and for ¶ it’s 0xB6. So a UTF-8 character is being misinterpreted as two Latin-1 characters, but where exactly is it going wrong? Is the browser misinterpreting the data due to a faulty HTML character encoding meta tag? Is the data stored incorrectly in the WordPress DB? Or is WordPress converting the data on the fly when it retrieves it from the DB to generate the page’s HTML? I did a small test which I think eliminates one of these possibilities.

First I wrote a simple Python 2 program to download and save the web page’s HTML as binary data:

import urllib2
# Download the page and write the response bytes to a local file, untouched.
contents = urllib2.urlopen('http://www.bigmessowires.com/mac-rom-inator-ii/').read()
open('mac-rom-inator-ii.html', 'wb').write(contents)

I think this code will download the bytes from the web server exactly as-is, without doing any kind of character set translations. If urllib2.urlopen().read() does any character translation internally, then I’m in trouble. Next I examined the downloaded file in a hex editor:

This shows that what the web browser renders as Ã¶ is actually a four-byte sequence: 0xC3 0x83 0xC2 0xB6. Those are the two-byte UTF-8 sequences for Ã and ¶. So this doesn’t look like the web browser’s fault, or a problem with the HTML character encoding meta tag. It looks like the data transmitted by the web server was wrong to begin with: it really is sending Ã and ¶ instead of ö.

This leaves two possibilities:

  1. During the MySQL upgrade, 0xC3 0xB6 in the database was misinterpreted as two Latin-1 characters instead of a single UTF-8 character, and was stored UTF-8-ified in the new MySQL DB as 0xC3 0x83 0xC2 0xB6.
  2. 0xC3 0xB6 was copied to the new MySQL DB correctly, but a DB character encoding preference went wrong somewhere, and now WordPress thinks this is Latin-1 data. So when it retrieves 0xC3 0xB6 from the DB, it converts it to 0xC3 0x83 0xC2 0xB6 in the process of generating the page’s final HTML.

I could probably tell which possibility is correct if I used mysqldump to dump the relevant DB table as a binary blob and examine it. But I’m not exactly sure how to do that, and I’m a bit fearful I’d accidentally break the DB somehow.

If possibility #1 is correct, then the DB data is corrupted. There’s not much I could do except restore from a backup, or maybe look for a tool that can reverse the accidental character set conversion, assuming that’s possible.

If possibility #2 is correct, then I might be able to fix this simply by changing the faulty encoding preference. I need to find how to tell WordPress this is UTF-8 data, not Latin-1, so just use it as-is and don’t try to convert it.
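
Either way, the underlying byte transformation is the same, and it’s easy to reproduce outside the database. Here’s a small Python 3 sketch of the conversion and its reversal (just the encoding math, nothing WordPress-specific):

good = 'ö'.encode('utf-8')                     # b'\xc3\xb6' -- what should be served
bad = good.decode('latin-1').encode('utf-8')   # b'\xc3\x83\xc2\xb6' -- the four bytes in the hex dump
print(bad.decode('utf-8'))                     # Ã¶ -- the garbled text seen in the browser
fixed = bad.decode('utf-8').encode('latin-1')  # undoes the damage, at least in simple cases
print(fixed == good)                           # True

If possibility #1 turns out to be the answer, a reversal along these lines might be the starting point for a repair script.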


Transistor Puzzles

I’ve fallen into a deep hole involving current-limiting circuits and current sources, in an attempt to solve a minor problem with the Yellowstone disk controller. In the big scheme of things, this isn’t a very important problem, but it has me intrigued. I’ve created a simulation of a circuit that may solve the problem, but it exhibits some transistor behavior that I don’t understand. Specifically, the base currents of the transistors don’t seem to follow the rules that I thought I understood.

You can see a working simulation of the circuit here.

There are two SPDT switches at the top of the diagram, representing Yellowstone’s two disk connectors. On pin 9 of each connector, some types of drives will have a 10K resistor connected to the +12V supply. This type of drive needs a -12V supply input on pin 9, and it will use about 3 mA at most from -12V. Other types of drives will have a direct connection to +5V on pin 9 (+5V is also provided on pin 11 to all drive types). Yellowstone can either provide an additional +5V supply on pin 9 for this type of drive, or more simply, just leave the pin unconnected.

The basic idea here is to establish a current limiter of about 10-20 mA for pin 9. That’s more than enough for normal operation with the -12V supply, and it prevents dangerously high amounts of current flowing if -12V is directly connected to +5V as in the second type of drive.

Here’s the puzzle: the three NPN transistors have all their bases connected, and all their emitters connected. As shown, they all have the same base-emitter voltage Vbe of 0.668 volts. My mental model is that the base-emitter connection is essentially like a diode. With identical diodes, identical Vbe, and a single shared 470 ohm current-limiting resistor, I would expect the base current Ib to be identical for all three transistors. Yet they’re not. The middle transistor’s Ib is 7.2 mA while the others are only 0.165 mA.

The fact that one transistor is in saturation must be relevant. But the “base-emitter connection is a diode” assumption is failing here, and I can’t explain why. I need to read more transistor theory.


Yellowstone Glitch, Part 7: Maybe a Conclusion

All these Yellowstone glitching mysteries may finally be headed for a conclusion. It looks like there are at least two separate problems with different causes: one causing glitching during a bus cycle and the other causing glitching at the end of a bus cycle.

 
Fighting at End of Bus Cycles?

This one is complicated to explain, so bear with me. You should understand that the Apple IIgs is a 5V system, but Yellowstone uses a 3.3V 74LVC245 to communicate on the data bus. This works because the LVC family is 5V tolerant on its inputs, and its 3.3V output is high enough to be sensed as a valid “high” by 5V logic chips. On the IIgs motherboard there’s a 74HCT245 that handles the computer’s side of transfers to and from the data bus.

Yesterday I noticed something curious: when Yellowstone is driving a 3.3V high value onto the data bus, at the end of the bus cycle the voltage always immediately jumps up to 5V, and stays there for a few hundred nanoseconds until something else puts a value on the bus. What’s doing that? Is there a 5V pull-up resistor on the bus, or something similar? No. When Yellowstone is driving a 0V low value onto the data bus, the voltage remains at 0V after the end of the bus cycle.

I’m not 100 percent sure, but I think at the end of a bus cycle the IIgs is immediately reversing the direction of its 74HCT245. Previously this chip was taking the peripheral card’s data from the peripheral slot data bus and moving it to an internal data bus, but now it begins taking data from the internal data bus and moving it to the peripheral slot data bus. And what data is that? In the first tens of nanoseconds after the direction is reversed, it’s actually the same data that the peripheral card was outputting, now briefly stored in the bus capacitance of the internal data bus.

What happens if the peripheral card’s output driver is a bit slow to turn off at the end of a bus cycle, due to propagation delays on the control signals? Since the peripheral card and the 74HCT245 from the IIgs are both driving the same data onto the bus, normally it should be OK. But for Yellowstone and its 3.3V 74LVC245, it’s not OK. For a time of roughly 15 ns, it causes 5V and 3.3V sources to both try to drive the bus at the same time, resulting in high current flows into the 3.3V supply, and overall badness. This is what I strongly suspect is causing Yellowstone’s end-of-cycle spikes and glitching.

There are several possible solutions:

  1. adjust Yellowstone’s ‘245 turn-off so it happens earlier, before the bus cycle is theoretically over
  2. modify Yellowstone to use a true 5V output driver, so there’s no 3.3V-to-5V conflict
  3. insert series resistors on the data bus to limit the current from 3.3V-to-5V conflict to safe levels

I implemented option 1, and it substantially reduced the spikes at the end of bus cycles. Surprisingly, it didn’t eliminate them completely. It feels strange to disable the ‘245 before the bus cycle is over, because the CPU doesn’t capture the bus value until the very end of the cycle. It seems like it should cause bad data to be read, causing malfunctions. But in practice it appears to work OK, probably thanks to that bus capacitance persisting the data value even after the ‘245 shuts off.

I also implemented option 2, through a bit of board surgery in which I replaced Yellowstone’s 74LVC245 with a dual-supply 74LVC8T245. This almost completely eliminated the spikes at the end of bus cycles, because the voltage on the bus stays constant at 5V after the cycle ends.

I would like to try option 3, but that will take more effort to set up.

 
High Current During Bus Cycles?

The second problem is the one I was chasing initially: spikes and glitches during a bus cycle, at the moment when the 74LVC245 is enabled and begins driving the data bus. I had a theory this was caused by a brief violation of the max output voltage spec of the 74LVC245, when it tries to output 3.3V but finds the bus capacitance is already charged to 5V. So I desoldered the 74LVC245 and replaced it with a 74LVT245, a nearly identical chip but with a higher max output voltage spec above 5V. Unfortunately this did nothing to help the spikes and glitches during bus cycles. Then I replaced the 74LVT245 with a 74LVC8T245, a dual-voltage chip with true 5V I/O on the Apple II side. Again this did nothing to solve the problems during bus cycles.

Based on these two tests, I concluded that violating the max output voltage spec of the 74LVC245 was never a problem, or at least it was never the main problem. The signal spikes are very likely caused by a large amount of current briefly flowing when all the data bus outputs change simultaneously from 1 to 0 or vice-versa. This is a “normal” condition, not a violation, but it’s troublesome. I’ve attempted several board modifications to help meet this sudden current demand, including adding a 10 uF bypass capacitor to the ‘245, and adding extra power and ground wires from the ‘245 straight back to the voltage regulator. None of it seemed to help.

I can’t quite explain this, since I didn’t think there should really be all that much current flowing. I guess I was wrong. But the only solution seems to be finding ways to reduce the current, or spread it out over a longer period of time. That’s what my 10101010 pre-driving trick accomplishes, but there’s plenty more room for reducing the current further.

Some possible options here:

  1. replace the 74LVC245 bidirectional buffer with two unidirectional buffers: an LVC buffer for input and an LS buffer for output
  2. insert series resistors on the data bus to limit the current
  3. something else I’m overlooking

The 74LS245 is an appealing option because the LS family just can’t drive very hard, at least not when outputting a high voltage. But it won’t work as a bidirectional substitute for the 74LVC245, because its 5V outputs (or close enough to 5V) would damage the FPGA. So I’d need to use the LS chip for output only, and use an LVC chip for input. That’s not very appealing. I’m also not sure how well it would reduce the current when driving low voltages instead of high ones. It might still draw too much current, or it might be fine. Isn’t this basically how all 1980s vintage peripheral cards worked? How did they avoid this problem?

Options 1 and 2 should both resolve the problems at the end of the bus cycle too, so that’s good. The other alternatives have more limited application. Adjusting the ‘245 turn-off timing does nothing to help the problems during the bus cycle, nor does using a 74LVC8T245 chip.

 
Unsolved Mysteries

Sadly none of the above can explain why these same problems didn’t appear in revision 1 of Yellowstone. Probably they did, but they weren’t severe enough to cause bit flips and malfunctions, so I never noticed. The only partial explanation I can think of is that revision 2’s RAM chip is to blame. Revision 1 used internal FPGA RAM and didn’t have a separate RAM chip. My guess is that the extra current used by the RAM is exacerbating the problem somehow.

 
Next Steps?

If you’re still reading this wall of text, it’s time to evaluate the possible solutions and make a choice. Let’s start with the simplest option: do nothing (at least from a hardware standpoint). By implementing the 10101010 pre-driving trick, adjusting the ‘245 turn-off timing, and a few other small timing changes, I’ve already improved things enough to get my prototype board working.

Here’s what things look like with only the FPGA logic changes. Same as in part 6, the traces shown are:

  • Blue (Ch4) is address line A1. It’s a 5V input from the Apple II.
  • Yellow (Ch1) is the 3.3V supply voltage, measured at the VCC pin of the ‘245.
  • Cyan (Ch2) is IOSTROBE, and marks the boundaries of the bus cycle.

I didn’t capture GND this time.

Yeah it still doesn’t look great, but it works. If you’ve forgotten how bad everything looked before I made the FPGA logic changes, here’s the headline image again from part 6:

So maybe this is good enough now, without needing further hardware changes? Especially if some of the ringing shown in the scope traces is due to my poor probe setup rather than being a true representation of the signal?

If a hardware change is needed, series resistors are attractive because they’re simple and easy. But what value of resistor? It must be big enough to significantly limit the current in a bus-fighting scenario, but not so big that it results in failure to meet the Vil and Vih specs of the other chips on the data bus that are receiving data.

Let’s say I used 100 ohm series resistors. In a 3.3V-to-5V bus fighting scenario at the end of a bus cycle, that would limit the current to 1.7 / 100 = 17 mA per data bus line, or 136 mA total. That’s still pretty high. Too high, I think.

If I used 500 ohm resistors, it would limit the total current to a much more modest 27.2 mA total, but it would create a new problem. For the 74LS family inputs on the data bus of the Apple IIe, and for as many as six other peripheral cards installed that use 74LS parts, the inputs source 0.2 mA when they’re “receiving” a logical low value. All combined that’s 1.4 mA worst case. With 500 ohm resistors and a current of 1.4 mA, even if Yellowstone could output a true 0.0V logical low value, the LS inputs would see a voltage of 1.4 * 500 = 0.7V, which is almost above the Vil threshold for 74LS family parts. In short, with a fully loaded set of 7 peripheral cards that all use 74LS logic, and 500 ohm resistors, Yellowstone might not work.

There’s some middle ground here. Resistor values from roughly 150 to 500 ohms could probably work to solve both problems, but it’s a narrow enough range that it makes me slightly nervous. Maybe go with 220 or 330 ohms.
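
As a sanity check on those numbers, here’s a quick back-of-the-envelope sweep in Python, using the same worst-case assumptions as above: 1.7V of contention across the resistor on all eight data lines, and seven LS loads each sourcing 0.2 mA at a logical low.

for r in (100, 220, 330, 500):
    fight_ma = 1.7 / r * 1000 * 8   # total bus-fight current across all 8 lines, in mA
    v_low = 7 * 0.0002 * r          # voltage developed across the resistor by LS input current
    print('%d ohms: %.1f mA contention, %.2f V low-level offset' % (r, fight_ma, v_low))

That prints 136 mA and 0.14V at 100 ohms, and 27.2 mA and 0.70V at 500 ohms, matching the figures above; 220 or 330 ohms keeps the contention current down in the tens of milliamps while staying comfortably below the 0.8V Vil limit for LS inputs.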

If a hardware change is needed but series resistors won’t do, then I think the only viable alternative is a two-chip solution with 74LS output and 74LVC input. But I don’t love this solution, for the reasons mentioned previously: extra chips, extra design complexity, and a concern it might not work anyway for driving a logical low voltage. There would be a small amount of additional control complexity too. Each chip would need a separate enable signal from the FPGA, where spare pins are scarce, and extra care would be needed to prevent enabling both at once.

Some other combination of solutions might be possible too, like 74LVC8T245 with series resistors. But I don’t want to go overboard.

As you might expect, I’ve grown very tired of this glitching problem, and my enthusiasm for further tests and experiments is low. My gut says to accept the software-only solution and call it good, or else maybe to add series resistors. But I don’t want to sweep this problem under the rug, only to have it reappear later in rare circumstances I can’t troubleshoot. If you were me, what would you do?


NASA Struggles to Fix Hubble Space Telescope’s 1980s Computer

Did anybody else see this week’s news about problems with Hubble’s 80s-vintage onboard computer and think: *gasp* My moment in life has arrived! Hold my beer.

“Hello, NASA? Did you guys try PR#6? Sometimes the boot disk gets dust on it; ask an astronaut to blow it off and then reboot. Hello? Hello? Why did they hang up?”

The Hubble’s computer is actually an NSSC-1, an 18-bit machine with 64KB memory. The version on Hubble is likely built from a few dozen discrete MSI chips. There are two fully redundant computers on board, and four independent 64KB memory modules, any of which can be enabled or disabled from the ground in the event of a problem. But 30 years of bombardment by cosmic rays will take a toll on the hardware.


When A Space Is Not A Space

I just spent more than two hours troubleshooting a seemingly simple HTML problem. When I copied and pasted a small section of HTML, the web browser displayed the newly-pasted section differently from the original. The horizontal spacing between some of the elements was slightly different, causing the whole page to look wrong. But how could this be? The two HTML sections were identical – the new one was literally a copy of the old one.

This simple-sounding problem defied all my attempts to explain it. I came up with lots of great theories: problems with my CSS classes, or with margins and padding. Mismatched HTML tags. Browser bugs. I tried three different browsers and got the same results in all of them.

Feeling very confused, I looked again at the two sections of HTML in the WordPress editor (text view), and confirmed they were exactly identical. Then I tried Firefox’s built-in web developer tools to view the page’s rendered elements, and compared all their CSS properties. Identical – yet somehow they rendered differently. I used the developer tools to examine the exact HTML received from my web server, checked the two sections again, and verified they were character-for-character identical. Firefox’s “page source” tool also confirmed the two sections were exactly identical.

By this point I was getting ready to blame cosmic rays or voodoo magic. I discovered that any time I copy-pasted any similar HTML section, the newly-pasted section would appear in the browser with the wrong element spacing. How could this possibly be? I then tried the W3C Validator, which found some other problems with my page, but nothing that could explain this behavior. And once again, it confirmed that despite rendering differently in the browser, the two sections of HTML were identical.

Clearly something wasn’t adding up. I used curl to download the web page from my web server, viewed the local copy, and saw the same behavior as before. But when I opened the stored .html document with a hex editor, I finally had my answer. The two sections of HTML were not identical: one section used a different type of space character from the other.

What the hell.

I discovered that the original HTML section contained non-breaking spaces. But instead of being encoded with the &nbsp; entity, they were encoded directly as the raw UTF-8 bytes C2 A0 (Unicode character U+00A0). I’m not sure when or how this happened, but I blame WordPress. When viewing this section in the WordPress HTML editor, the C2 A0 spaces appeared like normal spaces, and copy-pasting the section inside the editor silently converted the non-breaking spaces to normal spaces with hex value 20. So the copied version rendered differently, even though the source HTML appeared to be the same.

This is like the 21st century version of confusing a zero with a capital letter O, yet worse. I wasn’t even aware that non-breaking spaces have a Unicode character value – I thought &nbsp; was the only way to encode them. I changed the HTML back to use &nbsp; and now it all works fine.

I’m surprised at how many different tools failed to reveal this subtle but important difference between types of spaces in the HTML source. The WordPress HTML editor failed to show or correctly handle the difference. The Firefox web developer tools and page source tools failed. The W3C Validator’s source view failed. Curl plus a hex editor was the only way to finally establish the ground truth about the precise contents of the HTML source.
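For next time, a few lines of Python 3 run against the curl-downloaded file would expose these much faster than scrolling through a hex dump (the filename here is just a placeholder):

data = open('downloaded-page.html', 'rb').read()   # file saved by curl; placeholder name
pos = data.find(b'\xc2\xa0')                       # raw UTF-8 non-breaking space
while pos != -1:
    print('raw NBSP at byte offset', pos)
    pos = data.find(b'\xc2\xa0', pos + 1)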

