A couple of months ago, our customer approached me with an interesting task that appeared not complicated at first glance. I needed to create a simple web app with a Grammarly-like view based on a WYSIWYG HTML editor. This app should count different text units like paragraphs, sentences, words, and characters. The best thing is that despite the specifics of text source: websites, Google Docs, MS Office, email clients, your content will be carefully analyzed for units and displayed in a form convenient for perception.

No prizes for guessing what kind of issue I faced in this case :) External content from most of the resources contains redundant HTML tags, which influences the correctness of calculations. Moreover, messed code makes testing an annoying process. So, eventually, I came up with the solution for cleaning HTML from unnecessary tags and polishing it for further work with text.

Table of contents:

  1. Intro part

  2. The principles of HTML code prettifying

  3. My solution – the TagTide library

  4. Approaches and use cases

  5. An example of Google Docs content cleanup

  6. Kind thanks to colleagues

  7. Useful links

The principles of HTML code prettifying

Let’s think about basic steps to get "editor-friendly" HTML code. I’m going to describe the idea with a demonstrative example. Imagine we have a perfect code that defines paragraphs.

The code could be the following:

Shakespeare produced most of his known works between 1589 and 1613.</p>
His early plays were primarily comedies and histories and are regarded as some of
the best work produced in these genres. He then wrote mainly tragedies until 1608,
among them Hamlet, Romeo and Juliet, Othello, King Lear, and Macbeth, all considered
to be among the finest works in the English language.</p>

The example above contains a couple of paragraphs. The second paragraph has two sentences, and the first one – just one.

Let’s look at the hypothetical messy equivalent of the same content:

<div id="content">
    <div class="sent">Shakespeare produced most of his known works between
        <span style="color: red; font-size: 24px;">1589</span> and
        <span style="color: red">1613</span>.
    </div>
    <div>
        <div class="sent">His early plays were primarily comedies and histories and are regarded
        as some of the best work produced in these genres.</div>
        <div class="sent">He then wrote mainly tragedies until 1608, among them Hamlet, Romeo
        and Juliet, Othello, King Lear, and Macbeth, all considered to be among the finest works
        in the English language.</div>
    <div>
</div>

What issues this messiness brings:

  1. Redundant root div that should be ignored.

  2. Since a div inside another div is correct, but a paragraph inside another paragraph is a mistake, it’s challenging to split the text into paragraphs.

  3. Inline styles uglify whole text perception, but they don’t even influence calculation quality. We should get rid of them for aesthetic reasons.

Our main goal is to transform the messy text into perfect paragraphs. Let’s call this kind of text "editor-friendly". As far as the task contains various logic, we need to split the flow into separate activities. Let’s dig into that!

The first reasonable thing is that we need to ignore root div.

<div class="sent">Shakespeare produced most of his known works between <span style="color:
red; font-size: 24px;">1589</span> and <span style="color: red">1613</span>.</div> <div><div
class="sent">His early plays were primarily comedies and histories and are regarded as some
of the best work produced in these genres.</div> <div class="sent">He then wrote mainly tragedies
until 1608, among them Hamlet, Romeo and Juliet, Othello, King Lear, and Macbeth, all considered
to be among the finest works in the English language.</div><div></div></div>

Looks better. Secondly, we should flatten the second paragraph. It means we should leave only one level of divs (in our case – root):

Shakespeare produced most of his known works between 1589 and 1613.</div> <div>
His early plays were primarily comedies and histories and are regarded as
some of the best work produced in these genres. He then wrote mainly tragedies until
1608, among them Hamlet, Romeo and Juliet, Othello, King Lear, and Macbeth, all
considered to be among the finest works in the English language.</div>

Our text looks much better because the second paragraph looks good. But if we talk about editor-friendly HTML code, we should change divs to pure paragraphs:

<p class="sent">Shakespeare produced most of his known works between 1589 and 1613.</p><p>
His early plays were primarily comedies and histories and are regarded as some of the best
work produced in these genres. He then wrote mainly tragedies until 1608, among them Hamlet,
Romeo and Juliet, Othello, King Lear, and Macbeth, all considered to be among the finest works
in the English language.</p>

Looks almost perfect! We have just minor stuff left. We need to remove all redundant tag attributes ("class" in our case):

<p>Shakespeare produced most of his known works between 1589 and 1613.</p><p>
His early plays were primarily comedies and histories and are regarded as
some of the best work produced in these genres. He then wrote mainly tragedies
until 1608, among them Hamlet, Romeo and Juliet, Othello, King Lear, and
Macbeth, all considered to be among the finest works in the English language.</p>

That’s it. We have 100% editor-friendly HTML code that contains a couple of pure paragraphs!In conclusion, I want to summarize all provided steps that you can use as separate pattern:

  1. Start from a certain point in the original code. You can ignore redundant tags that aren’t important anymore.

  2. Flatten the code by removing redundant inner tags. The main goal of this step is preparation for paragraph splitting.

  3. Change div tags to paragraphs.

  4. Remove redundant attributes.

Technical approaches

At the start of the project, I intended to resolve the principles above via sets of pure regular expressions. As you can guess, this attempt failed. Regular expressions are not friendly with deep-nested lexemes. In this case, by lexemes, I mean nested HTML tags and their attributes. Having thought a little, I decided to use the AST-based approach because I shouldn’t reinvent the wheel in this case. I chose the HTML-parse-stringify library because this solution is minimalistic yet contains all the expected functionality regarding HTML AST processing.

My solution – the TagTide library

So, please, welcome TagTide. The main idea of the solution is a set of patterns you can apply anytime when you need them. Also, I suggest you apply them in sequence, just like the steps above.

Let’s repeat the steps of the above text transformations using TagTide.

import { TagTide } from "tag-tide";

const original = `
<div id="content"><div class="sent">Shakespeare produced most of his known works
between <span style="color: red; font-size: 24px;">1589</span> and <span style="color: red">1613</span>.</div>
<div><div class="sent">His early plays were primarily comedies and histories and are regarded
as some of the best work produced in these genres.</div>
<div class="sent">He then wrote mainly tragedies until 1608, among them Hamlet, Romeo and Juliet,
Othello, King Lear, and Macbeth, all considered to be among the finest works in the English language.</div><div></div>`;
const tagTide = new TagTide(original).startAfter("id", /^content/);
const startedAfter = tagTide.result();

console.log(startedAfter, "
");

tagTide.flatten();
const flattened = tagTide.result();

console.log(flattened, "
");

tagTide.rootParagraphs();
const paragraphs = tagTide.result();

console.log(paragraphs, "
");

tagTide.removeAttributes();
const pureHtml = tagTide.result();

console.log(pureHtml);

Also, TagTide allows you to get the expected result in the shortest way handling the request in sequence by different objects (I use a chain of responsibility pattern):

import { TagTide } from "tag-tide";

const original = `
<div id="content"><div class="sent">Shakespeare produced most of his known works
between <span style="color: red; font-size: 24px;">1589</span> and <span style="color: red">1613</span>.</div>
<div><div class="sent">His early plays were primarily comedies and histories and are regarded
as some of the best work produced in these genres.</div>
<div class="sent">He then wrote mainly tragedies until 1608, among them Hamlet, Romeo and Juliet, Othello,
King Lear, and Macbeth, all considered to be among the finest works in the English language.</div><div></div>`;
const pureHtml = new TagTide(original).startAfter("id", /^content/).flatten().rootParagraphs().removeAttributes().result();

console.log(pureHtml);

Use cases

There is a couple of fundamental approaches how to change your HTML code using TagTide:

  1. Direct changes via traverse;

  2. Changes via the set of patterns.

This part contains different examples illustrating the approaches above.

Change some content in 2nd nesting level

import { El, TagTide } from "tag-tide";

const original = "<div>level 1 <div>level 2 <div>level 3</div></div></div>";
const prettified = new TagTide(original)
.traverse((el: El, level: number) => {
    if (level === 2 && el.content) {
        el.content = `modified ${el.content}`;
    }
})
.result();

console.log(prettified);
Output

<div>level 1 <div>modified level 2 <div>level 3</div></div></div>

Aggregate numeric values from different nesting levels

import { El, TagTide } from "tag-tide";

const original = "
1
2
3
";
let total = 0;
new TagTide(original).traverse((el: El) => {
    if (el.content) {
        total += +el.content.trim();
    }
});

console.log(total);

Output:

`6`

Strip some tags

import { TagTide } from "tag-tide";

const original = "<div>level 1 <div><a href='#'><span>level <i>2</i></span></a> <div>
level 3<br></div></div></div>";
const prettified = new TagTide(original)
.result(['a', 'span', 'i', 'br']);

console.log(prettified);

Output:

<div>level 1 <div>level 2 <div>level 3</div></div></div>

Start after or Start from

This pattern allows us to ignore some parent structures.

import { TagTide } from "tag-tide";

const source = `<body><div class="container-1"><p>content</p></div></body>`;
const result = new TagTide(source)
  .startAfter("class", /^container-d/)
  .result();

console.log(result);

Output:

<p>content</p>

The following code illustrates another approach to set starter-tag. Also, a related tag can be included:

import { TagTide } from "tag-tide";

const source = `<body><div class="container-1"><p>content</p></div></body>`;
const result = new TagTide(source)
  .startFrom("class", /^container-d/)
  .result();

console.log(result);

Output:

<div class="container-1"><p>content</p></div>

Important notes:

  1. In the "Start from" case, the result will include the associated (search) tag.

  2. If no matching tag is found, these patterns don’t affect the result.

Flatten

This pattern will be useful if we need to remove nested tags.

import { TagTide } from "tag-tide";

const original =
"<div>1 <div id='first'>2 <div><span class='foo' style='color: red;'>3</span></div></div></div> middle <div>4 <div>5</div></div>";
const prettified = new TagTide(original).flatten().result();

console.log(prettified);

Output:

<div>1 2 3</div> middle <div>4 5</div>

Also, you can omit any tags that you want except for those you need to keep (ex. <br>, <b>,etc.):

import { TagTide } from "tag-tide";

const original =
  "<div>1 <div id='first'>2 <div><span class='foo' style='color: red;'><a href='1' target='_blank'>3</a></span></div></div></div> middle
<div>4 <div>5</div></div>";
const prettified = new TagTide(original).flatten(['a']).result();

console.log(prettified);

Output:

<div>1 2 <a href="1" target="_blank">3</a></div> middle <div>4 5</div>

Remove attributes

This pattern allows removing all or some attributes through the whole content. In the following example, I’ve removed all attributes:

import { TagTide } from "tag-tide";

const original =
  "<div>1 <div id='first'>2 <div><span class='foo' style='color: red;'>3</span></div></div></div> middle <div>4 <div>5</div></div>";
const prettified = new TagTide(original).flatten().removeAttributes().result();

console.log(prettified);

Output:

<div>1 <div>2 <div><span>3</span></div></div></div> middle <div>4 <div>5</div></div>

In the following example, all attributes have been removed except:

*id attribute in all span tags

  • all style attributes

import { TagTide } from "tag-tide";

const original =
  "<div>1 <div id='first'>2 <div><span id='s1' class='foo' style='color: red;'>3</span></div></div></div>
middle <div style='color: red;'>4 <div>5</div></div>";
const prettified = new TagTide(original).removeAttributes({'span': ['id'], '*': ['style']}).result();

console.log(prettified);

Output:

<div>1 <div>2 <div><span id="s1" style="color: red;">3</span></div></div></div>
middle <div style="color: red;">4 <div>5</div></div>

Root paragraphs

We need to replace the divs in the first level of HTML with paragraphs. If the plain text appears at the first level instead of another tag, it should be enclosed in a paragraph.

The following example shows how this approach works:

import { TagTide } from "tag-tide";

const original = "
start
 middle
finish
";
const prettified = new TagTide(original).rootParagraphs().result();

console.log(prettified);

Output:

<p>start</p><p> middle </p><p>finish</p>

An example of Google Docs content cleanup

The icing on the cake is the case with code that I took from an actual Google Docs document. Not to overload this article, I placed this example of code containing HTML on GitHub.

If you checked the link, I suppose you might be surprised by the volume of HTML in the document. Now let’s imagine you need to work with the text inside. Can you see the text there, by the way? – I guess not really because of lots of trash tags and attributes. The most exciting thing in this story is that using the TagTide script, you can transform code from my example into the following HTML:

<p> Test test test </p><p> Test test test </p><p><a href="http://www.microsoft.com">Test</a> test test</p><br/><br/><br/><p>
12 34 </p><br/><br/><ul><li>Aaa</li><li>Bbb</li><li>Ccc</li></ul><br/><p>The end</p>

And you can repeat the same with the code taken from email clients, MS Word documents, and other sources!

In conclusion, I want to thank:

And in case you’re interested in other solutions for text editing and processing, check my Multi-Highlighting for DraftJS article. There I share how I added an option to highlight paragraphs, phrases, and words in DraftJS with my solution written in TypeScript.