Clean Up Google Docs' HTML Programmatically
We are often tasked with entering content into the CMSs we build for our clients. Cleaning up the HTML of those documents is always a chore, no matter who puts them together. I figured out that I spend most of my time stripping out the markup that Microsoft Word, OpenOffice, and Google Docs put in there. What I want is simple HTML that will pick up the styling of my Drupal or WordPress theme.
My search for a better way led me to a great script called GoogleDoc2Html, created by Omar AL Zabir. Using Google Docs and the Google Docs Script Editor, this script emails a fairly simple HTML version of the document to the email address associated with your Google account.
You can upload your document to Google Drive and open/convert it with Google Docs, or create a new one. Even with documents created in Google Docs and exported as HTML, I still have to go in and remove things like class="c0 c1" on paragraphs and list items, along with a bunch of <span> tags added throughout.
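If you would rather script that manual cleanup yourself, here is a minimal sketch in Python using only the standard library's html.parser module. It is not the GoogleDoc2Html script and does not reproduce its output; the names DocsHtmlCleaner and clean_docs_html are hypothetical ones I made up for illustration. It assumes the only things to remove are class attributes and <span> wrappers, as described above.

```python
from html import escape
from html.parser import HTMLParser


class DocsHtmlCleaner(HTMLParser):
    """Re-emit HTML, dropping class attributes and unwrapping <span> tags.

    Hypothetical helper for illustration only.
    """

    DROP_ATTRS = {"class"}   # attributes to remove from every tag
    UNWRAP_TAGS = {"span"}   # tags to remove while keeping their contents

    def __init__(self):
        # convert_charrefs=False keeps entities such as &amp; intact
        # so they can be written back out unchanged.
        super().__init__(convert_charrefs=False)
        self.parts = []

    def _open_tag(self, tag, attrs, closing=">"):
        kept = []
        for name, value in attrs:
            if name in self.DROP_ATTRS:
                continue
            # The parser unescapes attribute values, so escape them again.
            kept.append(name if value is None
                        else f'{name}="{escape(value, quote=True)}"')
        return "<" + " ".join([tag] + kept) + closing

    def handle_starttag(self, tag, attrs):
        if tag not in self.UNWRAP_TAGS:
            self.parts.append(self._open_tag(tag, attrs))

    def handle_startendtag(self, tag, attrs):
        if tag not in self.UNWRAP_TAGS:
            self.parts.append(self._open_tag(tag, attrs, closing=" />"))

    def handle_endtag(self, tag):
        if tag not in self.UNWRAP_TAGS:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        self.parts.append(data)

    def handle_entityref(self, name):
        self.parts.append(f"&{name};")

    def handle_charref(self, name):
        self.parts.append(f"&#{name};")


def clean_docs_html(markup):
    """Return markup with class attributes and <span> wrappers removed."""
    cleaner = DocsHtmlCleaner()
    cleaner.feed(markup)
    cleaner.close()
    return "".join(cleaner.parts)


if __name__ == "__main__":
    sample = '<p class="c0 c1"><span class="c2">Hello &amp; welcome</span></p>'
    print(clean_docs_html(sample))  # prints: <p>Hello &amp; welcome</p>
```

One caveat with this approach: if the export expresses bold or italic text through classes on those spans, unwrapping them loses that formatting, so you may need to map specific classes to <strong> or <em> before stripping.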
Here's a screenshot of a document created in Google Docs, exported as HTML, and run through HTMLTidy in Sublime Text:
Here is the same file exported using the GoogleDoc2Html script:
It is not perfect, but there are no classes, no extra spans or divs. It is a much cleaner starting point.
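To check a file for that kind of leftover markup without eyeballing it, a small audit like the following could help. This is only a sketch under my own assumptions: MarkupAudit and audit_markup are hypothetical names, and it simply counts <span> tags, <div> tags, and tags carrying a class attribute using Python's standard html.parser module.

```python
from collections import Counter
from html.parser import HTMLParser


class MarkupAudit(HTMLParser):
    """Count spans, divs, and class attributes (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        # Self-closing tags also reach this method via the default
        # handle_startendtag implementation.
        if tag in ("span", "div"):
            self.counts[tag] += 1
        if any(name == "class" for name, _ in attrs):
            self.counts["class"] += 1


def audit_markup(markup):
    """Return a dict of leftover-markup counts; empty means nothing found."""
    audit = MarkupAudit()
    audit.feed(markup)
    audit.close()
    return dict(audit.counts)


if __name__ == "__main__":
    print(audit_markup('<div class="c1"><p class="c0"><span>Hi</span></p></div>'))
    print(audit_markup('<p>Hi</p>'))
```

An empty result means the file contains none of the three things the audit looks for; it says nothing about other clutter such as inline style attributes.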
To use this wonderful script, open, import or create a document in Google Docs, and follow these instructions:
- Open your document in Google Docs
- Go to the Tools menu, select "Script editor..."
- Copy and paste the GoogleDoc2Html code into the script editor.
- Go to the File menu and Save the script as "GoogleDoc2Html".
- From the Run menu, choose "ConvertGoogleDocToCleanHtml"
- A popup window titled "Authorization required" will appear.
- Click Continue to grant the following permissions:
  - Know who you are on Google
  - View your email address
  - View and manage your documents in Google Drive
  - Send email as you
- You will get an email at your Google Account containing the HTML output of the Google Doc.
That's it. I forked the author's script and slightly changed the HTML output. Do you have another way of cleaning up HTML before you put it into your website? Let me know in the comments.