A couple of months ago, our customer approached me with an interesting task that appeared not complicated at first glance. I needed to create a simple web app with a Grammarly-like view based on a WYSIWYG HTML editor. This app should count different text units like paragraphs, sentences, words, and characters. The best thing is that despite the specifics of text source: websites, Google Docs, MS Office, email clients, your content will be carefully analyzed for units and displayed in a form convenient for perception.
No prizes for guessing what kind of issue I faced in this case :) External content from most of the resources contains redundant HTML tags, which influences the correctness of calculations. Moreover, messed code makes testing an annoying process. So, eventually, I came up with the solution for cleaning HTML from unnecessary tags and polishing it for further work with text.
Table of contents:
1. Intro part
2. The principles of HTML code prettifying
3. My solution – the TagTide library
4. Approaches and use cases
5. An example of Google Docs content cleanup
6. Kind thanks to colleagues
7. Useful links
Let's think about basic steps to get "editor-friendly" HTML code. I'm going to describe the idea with a demonstrative example. Imagine we have a perfect code that defines paragraphs.
The code could be the following:
The example above contains a couple of paragraphs. The second paragraph has two sentences, and the first one – just one.
Let's look at the hypothetical messy equivalent of the same content:
1. Redundant root div that should be ignored.
2. Since a div inside another div is correct, but a paragraph inside another paragraph is a mistake, it's challenging to split the text into paragraphs.
3. Inline styles uglify whole text perception, but they don't even influence calculation quality. We should get rid of them for aesthetic reasons.
Our main goal is to transform the messy text into perfect paragraphs. Let's call this kind of text "editor-friendly". As far as the task contains various logic, we need to split the flow into separate activities. Let's dig into that!
The first reasonable thing is that we need to ignore root div.
Looks better. Secondly, we should flatten the second paragraph. It means we should leave only one level of divs (in our case – root):
Our text looks much better because the second paragraph looks good. But if we talk about editor-friendly HTML code, we should change divs to pure paragraphs:
Looks almost perfect! We have just minor stuff left. We need to remove all redundant tag attributes ("class" in our case):
That's it. We have 100% editor-friendly HTML code that contains a couple of pure paragraphs!
In conclusion, I want to summarize all provided steps that you can use as separate pattern:
1. Start from a certain point in the original code. You can ignore redundant tags that aren't important anymore.
2. Flatten the code by removing redundant inner tags. The main goal of this step is preparation for paragraph splitting.
3. Change div tags to paragraphs.
4. Remove redundant attributes.
At the start of the project, I intended to resolve the principles above via sets of pure regular expressions. As you can guess, this attempt failed. Regular expressions are not friendly with deep-nested lexemes. In this case, by lexemes, I mean nested HTML tags and their attributes. Having thought a little, I decided to use the AST-based approach because I shouldn't reinvent the wheel in this case. I chose the HTML-parse-stringify library because this solution is minimalistic yet contains all the expected functionality regarding HTML AST processing.
So, please, welcome TagTide. The main idea of the solution is a set of patterns you can apply anytime when you need them. Also, I suggest you apply them in sequence, just like the steps above.
Let's repeat the steps of the above text transformations using TagTide.
Also, TagTide allows you to get the expected result in the shortest way handling the request in sequence by different objects (I use a chain of responsibility pattern):
There is a couple of fundamental approaches how to change your HTML code using TagTide:
1. Direct changes via `traverse`;
2. Changes via the set of patterns.
This part contains different examples illustrating the approaches above.
This pattern allows us to ignore some parent structures.
The following code illustrates another approach to set starter-tag. Also, a related tag can be included:
1. In the "Start from" case, the result will include the associated (search) tag.
2. If no matching tag is found, these patterns don't affect the result.
This pattern will be useful if we need to remove nested tags.
Also, you can omit any tags that you want except for those you need to keep (ex. <br>, <b>,etc.):
This pattern allows removing all or some attributes through the whole content.
In the following example, I've removed all attributes:
In the following example, all attributes have been removed except:
* `id` attribute in all `span` tags
* all `style` attributes
We need to replace the `divs` in the first level of HTML with `paragraphs`. If the `plain text` appears at the `first level` instead of another tag, it should be enclosed in a `paragraph`.
The following example shows how this approach works:
The icing on the cake is the case with code that I took from an actual Google Docs document. Not to overload this article, I placed this example of code containing HTML on GitHub.
If you checked the link, I suppose you might be surprised by the volume of HTML in the document. Now let's imagine you need to work with the text inside. Can you see the text there, by the way? – I guess not really because of lots of trash tags and attributes. The most exciting thing in this story is that using the TagTide script, you can transform code from my example into the following HTML:
<p> Test test test </p><p> Test test test </p><p><a href="http://www.microsoft.com">Test</a> test test</p><br/><br/><br/><p> 12 34 </p><br/><br/><ul><li>Aaa</li><li>Bbb</li><li>Ccc</li></ul><br/><p>The end</p>
And you can repeat the same with the code taken from email clients, MS Word documents, and other sources!
- Waqas Younas for long and fruitful cooperation
And in case you're interested in other solutions for text editing and processing, check my Multi-Highlighting for DraftJS article. There I share how I added an option to highlight paragraphs, phrases, and words in DraftJS with my solution written in TypeScript.
Let's discuss how Valor Software can help with your project's development needs!