Top questions and answers on content audits and content strategy

Getting Started

What does Content Auditor do?
Content Auditor is a content analysis tool that makes it easier to crawl sites, collect content inventories, and analyse your content.

All the data is collected in a collaborative space - making it easy for distributed teams to assess and manage content together.

Content Auditor gathers data on the quality of your content for every page, including:
- URL structure
- Title tags
- Reading ease
- Reading time
- Word count
- Content age
- Missing metadata
- Media files
It also gives you tools to assess and manage the quality of your content:
- The Site Report gives you a quick analysis of the overall health of your site’s content.
- The Content Inventory allows you to filter your content by different metrics to easily find and address problem areas. For instance, you might want to filter to find pages that are missing metadata, or which content looks outdated.
How do I start a crawl?

You can add a new crawl using the green “Add a site” button on the "Your sites" page. You can return to the "Your sites" page at any time by clicking the "home" icon at the top left of the screen. After you click "Add a site", a form appears where you can enter the URL and a name for the site you're about to crawl. Note that if you enter a URL that includes a subdirectory (e.g. http://www.example.com/subdirectory), you will be given the option to either crawl the entire site from the root or crawl only within the subdirectory.
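
As a rough illustration of the root-versus-subdirectory choice, the sketch below lists the starting points a crawl could use for a given URL. The function name crawl_scope_options is hypothetical and this is not Content Auditor's actual logic, only a minimal example of the idea.

```python
from urllib.parse import urlparse


def crawl_scope_options(url):
    """Return the possible crawl starting points for a URL (illustrative only)."""
    parsed = urlparse(url)
    root = f"{parsed.scheme}://{parsed.netloc}/"
    # Anything beyond "/" is treated as a subdirectory path.
    path = parsed.path.rstrip("/")
    if not path:
        return [root]
    return [root, f"{parsed.scheme}://{parsed.netloc}{path}/"]


print(crawl_scope_options("http://www.example.com/subdirectory"))
```

A URL with no path yields only the root, so there is no choice to offer in that case.
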
How long does a crawl take?
When you add a site on Content Auditor two things happen. First we crawl the site, then we analyse the data we collected. The first process, the crawl, goes pretty fast. It takes about 10 seconds per page.

The second process, analysis, takes longer. A step that runs at the end of each crawl takes a few minutes regardless of the site's size. Because of this, smaller sites actually take longer per page than larger sites do.

Average crawl times as of Sept 19, 2017:
- A 1 page site takes about 4 minutes
- A 30 page site takes about 15 minutes, or 30 seconds per page
- A 1000 page site takes about 4 hours, or 15 seconds per page
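
The per-page figures above follow from dividing total time by page count. A small sketch of that arithmetic, using only the averages listed here (the helper name seconds_per_page is ours):

```python
# Published averages from the list above: (pages, total minutes).
averages = [(1, 4), (30, 15), (1000, 4 * 60)]


def seconds_per_page(pages, total_minutes):
    """Average seconds spent per page for a whole crawl."""
    return total_minutes * 60 / pages


for pages, minutes in averages:
    print(f"{pages} pages: {seconds_per_page(pages, minutes):.1f} seconds per page")
```

The 1000-page figure works out to 14.4 seconds, which the list rounds to 15.
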
How to crawl dynamically generated content

Background

In general, web content is either "static" or "dynamic". An example of static content might be a "contact us" page that shows contact information. The details might change over time, but usually you'll see the same content there. In contrast, dynamic content changes often, in some cases constantly. An example of dynamic content might be a news site which always shows the most recent articles. Search results are another example of dynamic content. The results change depending on user input and are usually different on each viewing.

Crawling dynamically generated content can get complicated. Consider a list of search results that includes links to related search terms. When a crawler follows a link to related search results it will see a new list of results with new related terms. In this case, the act of crawling the site is in effect generating new pages to crawl. Each set of results might span many pages, each with their own list of related terms. This kind of situation can result in a crawl that includes thousands of pages of search results.

In some cases, dynamically generated websites query external sources for content. Consider for example a site that aggregates results from Twitter related to a given keyword. In cases like this, the number of "pages" on the site is almost limitless. As long as there is more content to aggregate there will be more results to show.

Problematic results

Dynamically generated pages can cause problems for Content Auditor users. For one, you could wind up with a crawl that never stops. To mitigate this, we've limited crawls to a maximum of 100k pages. This keeps crawls from continuing forever, but it still leaves you with a problem. Let's say you expect your site to have about 20k pages but, due to a large number of dynamically generated pages, your crawl is maxing out at 100k pages. In such a case, it's possible that all 100k pages are search results and that none of the pages you actually wanted to audit are included.

The solution

Content Auditor now has a feature that lets you "blacklist" certain URLs when setting up a new crawl. Content Auditor ignores any URLs in the blacklist when it crawls your site.

For example, if you wanted to crawl a site but ignore all search results, you could add "http://www.example.com/search" to the blacklist.

The powerful thing about the blacklist is that it doesn't need complete URLs; it can also process partial patterns.

So instead of "http://www.example.com/search" you could enter "/search" and get the same results. You could also ignore all URLs containing query strings by adding "?" to your blacklist.
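
To make the partial-pattern behaviour concrete, here is a minimal sketch that treats each blacklist entry as a plain substring of the URL. That matching rule is an assumption based on the examples above, not a description of how Content Auditor is implemented, and the function names are hypothetical.

```python
def is_blacklisted(url, blacklist):
    """Return True if any blacklist entry appears anywhere in the URL."""
    return any(pattern in url for pattern in blacklist)


def filter_urls(urls, blacklist):
    """Keep only the URLs that no blacklist entry matches."""
    return [url for url in urls if not is_blacklisted(url, blacklist)]


urls = [
    "http://www.example.com/about",
    "http://www.example.com/search",
    "http://www.example.com/search/page/2",
    "http://www.example.com/news?page=3",
]
print(filter_urls(urls, ["/search", "?"]))
```

Under this assumption a short pattern can match more than intended: "/search" would also exclude a page such as /searching-tips, so more specific patterns are safer.
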
Can I pause or cancel a crawl?

Our crawler doesn’t currently have a pause or cancel function. (If you’d find that a useful feature, let us know.)
Can I restart a crawl?

After you’ve completed a crawl, you can restart or redo the crawl by going into the site’s Configure page. Once there, you can click the “Recrawl site” button to kick off a new crawl and analysis run. As always, Content Auditor will notify you when the crawl is complete.
My crawl has failed. What should I do next?

Try restarting your crawl. Errors can occur for lots of reasons, and may not happen on the second attempt.

Email us and let us know, too! We’d like to try to figure out what’s gone wrong.
How do I view my results?
You can view the results of your crawl in two different ways:
- Site Report: This report gives you a summary of the health of your website content. You can access it by selecting the “Site Report” icon on the Crawled Sites landing page, or by using the navigation at the top of the screen from the site’s Inventory page.
- Inventory Page: The inventory view gives you a full list of all the pages that have been crawled, along with key metrics for each page, such as reading grade, word count, and missing metadata.
How do I download my crawl data?

There are two kinds of data you can export from Content Auditor: inventory data (also called metadata) and content data. The inventory data includes all the metadata about each page that Content Auditor finds when it crawls your site. This includes the URL, the page title, how many issues we found on the page (broken links, for example), and anything else we know about the page. The content data includes all the text found in the main content areas of your site. In order to avoid needless repetition, Content Auditor automatically filters out text that appears on every page, like main navigation or sidebar elements, and collects only the text that appears to be main content.

There are two ways to download your crawl data from Content Auditor:

1. Download from "your sites" list

Click the download icon for the site in question. A pop-up will appear giving you the option to download either the "metadata" or "content". Keep in mind that, depending on the size of your site, the content download can result in a very large file.

2. Download from a site's inventory

Navigate to your site's inventory page, then click the "download CSV" button. Note that this will download the filtered data in your inventory results, so if you apply a filter first you will only download the filtered results. This is useful if you'd like to get a list of all pages with broken links, for example.
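
Once you have the CSV, you can also filter it yourself. The sketch below is a minimal example using Python's standard csv module; the column names "url" and "broken_links" are assumptions made for illustration, so check the header row of your own download for the real names.

```python
import csv
import io


def pages_with_broken_links(csv_file, column="broken_links"):
    """Return the URLs of rows whose broken-link count is above zero."""
    reader = csv.DictReader(csv_file)
    return [row["url"] for row in reader if int(row[column] or 0) > 0]


# A tiny stand-in for a downloaded inventory file (hypothetical columns).
sample = io.StringIO(
    "url,title,broken_links\n"
    "http://www.example.com/,Home,0\n"
    "http://www.example.com/about,About,2\n"
)
print(pages_with_broken_links(sample))
```

With a real download, pass an open file instead of the sample, for example open("inventory.csv", newline="", encoding="utf-8").
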
Why does my site have zero pages?

When Content Auditor crawls your site, it checks to see if you have a "robots.txt" file. The robots.txt file informs web robots like Content Auditor's crawler about which areas of a website should not be processed or scanned. If you can view the site in a browser but Content Auditor reports the site as having zero pages, it's probably being blocked by your robots.txt. To check if this is the case, append "robots.txt" to your site's root URL, like this:

http://www.[your-site-here].com/robots.txt

If the file exists, you will see a number of lines of text describing your site's preferences regarding bot behaviour.

Solution:
The best way to address this is to remove or edit your robots.txt file so that it allows Content Auditor's crawler access. If you'd like to provide access specifically to Content Auditor's crawler, you'll need to add a rule to your robots.txt file that creates an exception for it. To do that, you'll need to know that our crawler calls itself "Scrapy/1.0.4 (+http://scrapy.org)", but a simple "Scrapy" should suffice.
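
As a sketch of what such an exception can look like, the example below embeds a sample robots.txt that blocks all bots except "Scrapy" and checks it with Python's standard urllib.robotparser. The rules shown are one possible configuration rather than a required one, and Python's parser may not interpret every file exactly as our crawler does.

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt: allow the "Scrapy" user agent, block everyone else.
ROBOTS_TXT = """\
User-agent: Scrapy
Allow: /

User-agent: *
Disallow: /
"""


def can_crawl(robots_txt, user_agent, url):
    """Check whether the given user agent may fetch the URL under these rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


crawler = "Scrapy/1.0.4 (+http://scrapy.org)"
print(can_crawl(ROBOTS_TXT, crawler, "http://www.example.com/page"))
print(can_crawl(ROBOTS_TXT, "SomeOtherBot", "http://www.example.com/page"))
```

You can paste your own robots.txt contents into the same function to see whether the crawler's user agent would be allowed.
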
Can Content Auditor scrape content from my site?
Yes, Content Auditor can scrape content from a website.

You can download your content in CSV format by clicking the "download" icon on your sites list, then selecting the "content" option. For large sites it will take a while for Content Auditor to compile the content for download, so please be patient.

Currently this "download content" feature has some shortcomings:
- whitespace is stripped
- text formatting is stripped
- unwanted HTML elements in content
- Excel limitations
Whitespace is stripped
While spaces within sentences remain intact, spaces between structural elements do not. This can cause issues where words run together without a space to separate them.

Text formatting is stripped
The CSV format we're using for data downloads doesn't support the use of rich text formatting. This means that we lose any bold, italic, or other styles that help separate content into chunks. We also lose other structural formatting such as headings, paragraphs, and lists.

Unwanted HTML elements in content
Stripping unwanted HTML elements from your scraped content is a good thing, but our current implementation contains some errors. For example, Content Auditor currently leaves in some less common HTML elements such as <video>.

Excel limitations
This is only an issue if you have page content that's longer than Excel's per-cell limit. If you use Microsoft Excel to view the CSV, you may run across this issue. Excel has a limit of 32,767 characters per cell, which means that if your content is longer than that your data will "get weird". When a cell contains too many characters, Excel tries to place the extra characters into a new cell. New cells created this way aren't formatted properly. The result is a messy spreadsheet with random inconsistencies in formatting. The simplest solution to this issue is to use a different spreadsheet program. My personal preference is OpenOffice, which is entirely free.
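
If you want to know in advance whether a download will hit this problem, a short script can scan the CSV for oversized cells. This is a minimal sketch using Python's standard csv module and Excel's documented per-cell limit of 32,767 characters; the function name and the sample column names are hypothetical.

```python
import csv
import io

EXCEL_CELL_LIMIT = 32_767  # Excel's documented maximum characters per cell.


def oversized_cells(csv_file, limit=EXCEL_CELL_LIMIT):
    """Yield (row_number, column_name, length) for cells longer than the limit."""
    # Raise the csv module's own field limit so very long content can be read.
    csv.field_size_limit(10_000_000)
    reader = csv.DictReader(csv_file)
    for row_number, row in enumerate(reader, start=2):  # the header is row 1
        for column, value in row.items():
            if isinstance(value, str) and len(value) > limit:
                yield row_number, column, len(value)


# A stand-in for a content download with one very long page.
sample = io.StringIO("url,content\nhttp://a/,short\nhttp://b/," + "x" * 40000 + "\n")
print(list(oversized_cells(sample)))
```

Any rows it reports are the ones likely to display badly in Excel.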

What we're planning to do
Our vision is to provide content scraped from your site in a useful format. We've started experimenting with moving from the .CSV format to .ODS, which would allow us to use rich text formatting. We also plan to improve our treatment of whitespace and HTML elements.

Importing Google Analytics data

Before you start...
Importing Google Analytics data is a two-part process.
1. Connect your Content Auditor account to your Google Analytics account.
2. Import analytics data from your Google Analytics account into individual site reports on your Content Auditor account.
Before you start, check the following for the desired site:
- Make sure the site uses Google Analytics.
- Make sure you can access Google Analytics data on the site using the same email address that's on your Content Auditor account. If you don't have access, ask an administrator of that site to grant it.
Part 1: Connect Google Analytics
If this is the first time you're importing Google Analytics data from this site (or if you changed Google Analytics accounts), you’ll have to give the OnPoint Suite website access to your Google Analytics account:
1. Go to your Content Auditor account page (click Account in the top-right corner).
2. Click Connect Google Analytics at the bottom of the Profile tab.
3. When Google asks for access to onpointsuite.ca to view Google Analytics data, allow it.
You’re returned to the Content Auditor account page when done.
Part 2: Import Google Analytics Data
To import or update Google Analytics data for a connected site:
1. If necessary, open the Your sites list for your account. (Click the Content Auditor logo in the top left corner.)
2. Configure parameters for the import:
  1. In the list of audited sites, tap or hover over the down arrow for the desired site and select Configure.
  2. In the Google Analytics section, click Configure Google Analytics.
  3. Fill in the following. If you’re not sure what to enter, talk to the Google Analytics administrator for the target website.
    
    In Account, select the Google Analytics account that you’re using to access the site. This account must sign in with the same email address that you use to sign in to OnPoint Content Auditor.
    
    In Property, select or enter the descriptive Property Name for the Google Analytics property from which you want to import.
    
    In Google Analytics View ID#, enter the View ID for the Google Analytics account. To find the View ID, when signed in to Google Analytics, select the desired Account, Property, and View, then click View Settings. (Most of the time, you’ll use the All Website Data view.)
    
    Select the date range for the imported data. (You can change the date range and re-import analytics data at any time.)
  4. When you’re done, click Save Google Analytics configuration.
3. Click Import data.
It may take a few minutes to import the data. The site’s status will change to “Queued for analysis,” then “Analyzing” as Content Auditor incorporates the imported analytics data into the existing audit. You’ll get an email when the analysis is complete.
View your Google Analytics data
Content Auditor incorporates Google Analytics data into both the Reports and Inventory pages. By default, Content Auditor shows analytics data for all pages. To view analytics measurements for a customized group, select the group from the list in the top left corner.
Reports page

The Reports page displays both average and outlier measurements for the entire site.

Inventory page

When analytics data is available, the Inventory page adds new columns to the list of crawled URLs. OnPoint Content Auditor imports the following analytics data for each page:
- Unique page views
- Total page views
- Average time on page
- Bounce rate
- Exit rate
Tip: You can sort, search, and filter the inventory list based on this analytics data, just like any other field. If you filter down to a specific set of pages you're interested in, you can save it by creating a group. (Click the Selected button, then select Create group from selected.)

Troubleshooting

Why is my site language reported as Unknown?
The Content Auditor crawler finds information on language in the HTML of your pages. For example, if you have a French page, the language tag would look like this:

<html lang="fr">...</html>

If this tag doesn’t exist, the language value may be reported as “Unknown”.
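
To check a page yourself, you can look for the lang attribute on the opening html tag. The sketch below does this with Python's standard html.parser; it illustrates the idea and is not the crawler's actual code, and the names LangFinder and detect_language are ours.

```python
from html.parser import HTMLParser


class LangFinder(HTMLParser):
    """Record the lang attribute of the first <html> tag encountered."""

    def __init__(self):
        super().__init__()
        self.lang = None

    def handle_starttag(self, tag, attrs):
        if tag == "html" and self.lang is None:
            self.lang = dict(attrs).get("lang")


def detect_language(html):
    """Return the declared page language, or "Unknown" if none is declared."""
    finder = LangFinder()
    finder.feed(html)
    return finder.lang or "Unknown"


print(detect_language('<html lang="fr"><body>Bonjour</body></html>'))
print(detect_language("<html><body>Hello</body></html>"))
```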

Language tags are a good thing. Here’s why:
- Accessibility: they signal to the pronunciation engine of screen readers to switch to another language.
- Search engines: they help search engines rank your pages for searches made in the same language.
What is a server and why is it unknown?

Figuring out the server of your site is more of a geeky thing. It can give you hints about the technology that’s powering your website.

As an agency, this can help you figure out if the technology powering a website is one in which you’ve got expertise. It also helps you optimize your approach for the technology you’re working with.

Our crawler finds server information in the HTTP header of your homepage. If it’s not provided, this value is reported as “Unknown”.
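
A minimal sketch of that lookup, assuming the value comes from the standard "Server" response header (the text above does not name the header, so treat that as an assumption). The headers are passed in as a plain dictionary to keep the example self-contained.

```python
def server_from_headers(headers):
    """Return the Server header value, or "Unknown" if it is missing or empty."""
    # HTTP header names are case-insensitive, so normalise before the lookup.
    normalised = {name.lower(): value for name, value in headers.items()}
    return normalised.get("server") or "Unknown"


print(server_from_headers({"Server": "nginx/1.10.3", "Content-Type": "text/html"}))
print(server_from_headers({"Content-Type": "text/html"}))
```
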
Why is my content age reported as Unknown?

Content age is retrieved from the HTTP header of your webpages. If it’s reported as unknown, it means the Content Auditor crawler was unable to find this information.

If a date’s not provided in your HTTP header, this might indicate an issue with your server or your Content Management System (CMS) - both of these should be adding this metadata to your content.
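
As an illustration, the sketch below computes a page's age in days from a "Last-Modified" response header. Which header Content Auditor actually reads is not stated here, so "Last-Modified" is an assumption, and the function name content_age_days is hypothetical.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def content_age_days(headers, now=None):
    """Return the age in whole days from Last-Modified, or None if unavailable."""
    value = headers.get("Last-Modified")
    if not value:
        return None
    try:
        modified = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        # The header is present but not a parseable HTTP date.
        return None
    now = now or datetime.now(timezone.utc)
    return (now - modified).days


headers = {"Last-Modified": "Tue, 19 Sep 2017 12:00:00 GMT"}
print(content_age_days(headers))
```

When the header is missing or malformed the function returns None, which corresponds to the "Unknown" case described above.
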
Why is the audit preview blank?

HTTP (Hypertext Transfer Protocol) is the protocol that enables browsers to connect to the World Wide Web. HTTPS is the secure version of HTTP (the "S" in HTTPS stands for "Secure"). When using HTTPS, all communications between your browser and the website you're viewing are encrypted.

If you try to view insecure web content over a secure connection, it results in a security error. Because Content Auditor uses HTTPS, this security error prevents pages served over HTTP from loading in the preview.

The solution:
If the site you're crawling is available in HTTPS, try re-crawling using the HTTPS version.
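
The sketch below shows the condition in code: a preview is blocked when the app is served over HTTPS and the previewed page over HTTP. It also builds the HTTPS version of a URL to try for the re-crawl. Both function names are hypothetical, and whether the HTTPS version actually exists depends on your site.

```python
from urllib.parse import urlparse, urlunparse


def preview_is_blocked(app_url, page_url):
    """True when a secure app would be embedding an insecure page."""
    return urlparse(app_url).scheme == "https" and urlparse(page_url).scheme == "http"


def to_https(url):
    """Return the same URL with the scheme switched from http to https."""
    parsed = urlparse(url)
    if parsed.scheme == "http":
        parsed = parsed._replace(scheme="https")
    return urlunparse(parsed)


page = "http://www.example.com/about"
print(preview_is_blocked("https://app.example.org/", page))
print(to_https(page))
```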

Inventory Page

How do I filter results?

Just below the Inventory page title, you will see the filters interface.

Searching with the text field will filter by page title and URL. Alternatively you can select any number of filters then click the "Filter inventory" button. Click "Reset all filters" to clear all filters at once.
Why is the page count on the inventory different from what is represented on a Site Report graph?

This occurs when a page's value (reading grade, reading ease, etc.) is equal to the separating value on the respective graph. The site report graphs interpret "between" as "greater than the lower value and less than or equal to the upper value" while the inventory "between" filter performs "greater than or equal to the lower value and less than or equal to the upper value". The result is a rare discrepancy between the inventory and the site report graph.
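
The two definitions of "between" can be sketched as follows. The reading grades and the 8 to 10 range are made-up sample values chosen so that two pages sit exactly on the lower boundary.

```python
def in_report_bucket(value, lower, upper):
    """Site report graphs: greater than lower, less than or equal to upper."""
    return lower < value <= upper


def in_inventory_filter(value, lower, upper):
    """Inventory filter: greater than or equal to lower, up to and including upper."""
    return lower <= value <= upper


reading_grades = [6, 8, 8, 9, 10]  # sample values, two of them on the boundary
report_count = sum(in_report_bucket(v, 8, 10) for v in reading_grades)
inventory_count = sum(in_inventory_filter(v, 8, 10) for v in reading_grades)
print(report_count, inventory_count)
```

The counts differ only because of the pages whose value equals the lower boundary; without such pages both definitions give the same number.
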
How to produce a content inventory in 3 easy steps

A content inventory is a catalog of every piece of content on your website. Producing a content inventory manually can be hell, especially for large sites. Manual inventories can take days or weeks to produce. Luckily, Content Auditor makes producing a content inventory quick and easy. Here's how to do it.

Step 1 – Crawl your site

Log in to Content Auditor and click "Add a site", then enter your site's URL and click "Add this site". Your site is now queued to be crawled. Easy.

Step 2 – Wait a bit ...

Depending on the size of your website the crawl could take a while. Small sites finish quickly but large sites containing many thousands of pages can take hours or even days. On average it takes about 8 hours for Content Auditor to crawl a 10,000-page site. As soon as Content Auditor has crawled and analyzed your site data you will receive an email notifying you that your crawl is complete. Click the link in the email to view your results.

Step 3 – Content inventory

When your crawl is complete, start by checking out the "site report" page for high-level insights. To get a more detailed look at your full inventory, click "Inventory". Here you can find every page on your website, perform searches, filter and sort your data. Download your data as a CSV to bring it into your favourite spreadsheet for more detailed work.
