How To Validate Word Counts for SEO

by Christoph Schiessl on JavaScript

Recently, I wrote an article showing how to extract a list of pages from a sitemap.xml file and another one showing how to fetch each of these pages and extract their titles and meta descriptions. The reason for doing this was to validate the length — the number of characters — of the titles and descriptions for SEO purposes. You see, titles and descriptions are supposed to stay below a certain number of characters for optimal SEO results.

Today, I want to go one step further and validate the same pages' word counts, because there is another piece of advice floating around on the Internet: the number of words on a page should be above a certain threshold. There are, of course, exceptions — highly successful pages with low word counts — but, statistically speaking, pages with higher word counts tend to perform better on search engines. There is no widely accepted consensus on the minimum number of words. Still, the impact on pages with fewer than 1000 words seems particularly noticeable, so I decided to aim for that. This means pages with 1000 words or more are good, and pages with fewer than that should be reconsidered for improvement.

To implement this validation, we first need a JavaScript function that takes an HTML document as input and returns an integer representing the number of words in the document.

Getting the <body> element

So, given an object that implements the Document interface, we must first get its <body> element. I'm working with window.document for now because it's automatically available, but any Document object would work, including those you create yourself (e.g., with DOMParser). Either way, I'm using the querySelector() method to get the document's <body> element — if there is no such element, this method returns null, which we would, of course, have to handle in a real-world application.

const doc = window.document; // or any object implementing the Document interface
const bodyElement = doc.querySelector("body");
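
For completeness, here is a minimal sketch of the DOMParser route mentioned above. The HTML string is purely illustrative, but the resulting object implements the same Document interface and can be queried in the same way:

const parser = new DOMParser();
const htmlString = "<html><body><p>Hello, world!</p></body></html>"; // made-up markup for illustration
const parsedDoc = parser.parseFromString(htmlString, "text/html");
const parsedBody = parsedDoc.querySelector("body"); // works just like on window.document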

Getting the <body>'s text

Once we have a Node object representing the <body> element, we must extract its text, but using its textContent property would not give us an honest result in this context. Instead, it's better to use its innerText property, whose behavior is slightly different because it only returns visible text. So, invisible text — text that is hidden, for instance, by some CSS rule — is not included in the returned String. Since we want to count the words on the page, or in other words, measure the length of the content visitors actually experience, it's better to use the property that is closer to the user's experience when they view the page.

const bodyText = bodyElement.innerText;
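
If you want to see the difference for yourself, here is a small experiment you can run in the browser console. The markup is made up, and the element is temporarily appended to the page because innerText only takes layout into account for rendered elements:

const demo = document.createElement("div");
demo.innerHTML = 'visible <span style="display: none">hidden</span>';
document.body.appendChild(demo); // must be rendered for innerText to consider visibility

console.log(demo.textContent); // "visible hidden" (includes the hidden text)
console.log(demo.innerText);   // "visible" (only the text a visitor actually sees)

demo.remove(); // clean up the temporary element again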

Cleaning up the text

Next, we must clean up the text to strip out all characters that do not constitute a proper word. For that, I want to use the replaceAll() method of String objects with a regular expression. Essentially, I want to count sequences of one or more "word characters", for which we can use JavaScript's built-in character class \w. The official documentation explains this as follows:

Matches any alphanumeric character from the basic Latin alphabet, including the underscore. Equivalent to [A-Za-z0-9_]. For example, /\w/ matches "a" in "apple", "5" in "$5.28", and "3" in "3D".

So, sequences of word characters are what we need to count in the end, but for filtering, we need precisely the opposite. Luckily, JavaScript also has an inverse character class called \W, which matches everything except word characters. The official documentation states:

Matches any character that is not a word character from the basic Latin alphabet. Equivalent to [^A-Za-z0-9_]. For example, /\W/ or /[^A-Za-z0-9_]/ matches "%" in "50%".

Note that \W also matches whitespace characters like spaces and line breaks. So, for the replacement, I replace all sequences of one or more non-word characters with a single space. This approach will come in handy, as you'll see in a minute.

const bodyTextClean = bodyText.replaceAll(/\W+/g, " ");

One last point: the regular expression for replaceAll() must use the global flag because otherwise, replaceAll() throws a TypeError, which would abort the execution of our script.
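
To make this concrete, here is what the cleanup does to a small, made-up input string, plus a demonstration of the TypeError you get without the global flag. Note that apostrophes count as non-word characters too, so a contraction like "It's" ends up split into two words, and trailing punctuation leaves a trailing space behind:

"Hello, world! It's 2024.".replaceAll(/\W+/g, " ");
// => "Hello world It s 2024 "

try {
  "Hello, world! It's 2024.".replaceAll(/\W+/, " ");
} catch (error) {
  console.log(error); // TypeError, because the regular expression is missing the global flag
}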

Splitting the text and counting the number of words

We can use a simple split() to convert the cleaned text to an Array containing all space-separated words. Given the previous step, we can conclude that these words must be sequences of word characters as defined above. There is one caveat: if the page's text starts or ends with non-word characters, the cleaned text starts or ends with a space, and split() then produces empty strings, which we filter out to avoid inflating the count.

const listOfWords = bodyTextClean.split(" ").filter((word) => word.length > 0);
const wordCount = listOfWords.length;

Finally, we take the Array's length property, which is, of course, the number of words we have been looking for.
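
Sticking with the made-up example string from before, the trailing space explains why the filter step is necessary:

"Hello world It s 2024 ".split(" ");
// => ["Hello", "world", "It", "s", "2024", ""] (note the trailing empty string)

"Hello world It s 2024 ".split(" ").filter((word) => word.length > 0).length;
// => 5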

Putting everything together

So, for convenience, we can combine all the steps we have discussed so far into a function for later re-use. The only difference here is that I added an early exit to return a word count of 0 for documents that don't have a <body> element.

function countNumberOfWords(doc) {
  const bodyElement = doc.querySelector("body");
  if (bodyElement === null) return 0;

  const bodyText = bodyElement.innerText;
  const bodyTextClean = bodyText.replaceAll(/\W+/g, " ");
  const listOfWords = bodyTextClean.split(" ").filter((word) => word.length > 0);
  return listOfWords.length;
}

With that said, we can now test our function in the console, using window.document as input, to determine the word count of some page.
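
For example, a call like the following prints the word count of whatever page you currently have open; the exact number will, of course, differ from page to page:

countNumberOfWords(window.document); // e.g., 1226 for the page you are reading right now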

Console Output for window.document

The current page, window.document, has a word count of 1226.

If you have also read my previous articles, it should be straightforward to integrate word counting into the script that validates the titles and descriptions of all pages mentioned in your /sitemap.xml. Anyway, for the sake of completeness, I'm putting the complete code below:

async function extractPageURLsFromSitemapXML() {
  let rawXMLString = await fetch("/sitemap.xml").then((response) => response.text());
  let parsedDocument = (new DOMParser()).parseFromString(rawXMLString, "application/xml");
  let locElements = parsedDocument.querySelectorAll("loc");
  return locElements.values().map((e) => e.textContent.trim()).toArray();
}

function countNumberOfWords(doc) {
  const bodyElement = doc.querySelector("body");
  if (bodyElement === null) return 0;

  const bodyText = bodyElement.innerText;
  const bodyTextClean = bodyText.replaceAll(/\W+/g, " ");
  const listOfWords = bodyTextClean.split(" ").filter((word) => word.length > 0);
  return listOfWords.length;
}

async function extractPageTitleAndDescription(pageURL) {
  let title = null, description = null;

  let response = await fetch(pageURL);
  let rawHTMLString = await response.text();
  let parsedDocument = (new DOMParser()).parseFromString(rawHTMLString, "text/html");

  let titleElement = parsedDocument.querySelector("head > title");
  if (titleElement !== null) { title = titleElement.textContent.trim(); }

  let descriptionElement = parsedDocument.querySelector("head > meta[name=description][content]");
  if (descriptionElement !== null) { description = descriptionElement.getAttribute("content").trim(); }

  return { title, description, wordCount: countNumberOfWords(parsedDocument) };
}

const pagesWithErrors = {};
for (const pageURL of await extractPageURLsFromSitemapXML()) {
  const { title, description, wordCount } = await extractPageTitleAndDescription(pageURL);
  const errors = [];

  if (title === null) { errors.push('title missing'); }
  else if (title.length > 70) { errors.push(`title too long ${title.length}`); }

  if (description === null) { errors.push('description missing'); }
  else if (description.length > 160) { errors.push(`description too long ${description.length}`); }

  if (wordCount < 1000) { errors.push(`word count too small ${wordCount}`); }

  if (errors.length > 0) { pagesWithErrors[pageURL] = errors.join(", "); }
}
console.table(pagesWithErrors);

Console Output showing all pages with errors

List of pages, extracted from the /sitemap.xml of bugfactory.io, that violate at least one SEO rule.

Please don't hesitate to reach out if you have any questions. Thank you for reading, and see you soon!
