Readability
While working on readability-component, I needed a function that, given a word, returns the number of syllables in that word.
This problem is surprisingly difficult to solve across the many odd cases in English, so a heuristic is generally a good approach.
In the readability-component scenario, the main use case for this syllable counting functionality is that the number of syllables in a piece of text contributes towards the Flesch reading-ease score. Specifically, it contributes through the average number of syllables per word in the text: the higher this contribution, the more difficult the piece of text.
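For reference, the Flesch reading-ease score is computed as

$$206.835 - 1.015\left(\frac{\text{total words}}{\text{total sentences}}\right) - 84.6\left(\frac{\text{total syllables}}{\text{total words}}\right)$$

so a higher average number of syllables per word lowers the score, marking the text as more difficult to read.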
The heuristic within readability-component for computing the number of syllables uses the following rules:
- Vowels consist of "a", "e", "i", "o", "u", "y"
- Isolated single vowels contribute 1 syllable to the word
- Runs of consecutive vowels contribute 1 syllable overall to the word
- If a word ends in an "e" and has at least 2 syllables, remove the "e" and recount
  - This handles the silent "e" in words like "plane"
- If a word ends in an "ed" and has at least 2 syllables, remove the "ed" and recount
  - This handles cases where the "ed" does not contribute a syllable, as in words like "jumped"
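To make these rules concrete, here is a minimal sketch of how they might be implemented. The names are illustrative; this is not necessarily how readability-component implements them.

```typescript
// Vowels per the rules above ("y" included).
const VOWELS = new Set(["a", "e", "i", "o", "u", "y"]);

// Each run of consecutive vowels (including an isolated vowel, a run of
// length 1) contributes one syllable.
function countVowelRuns(s: string): number {
  let runs = 0;
  let inRun = false;
  for (const ch of s) {
    if (VOWELS.has(ch)) {
      if (!inRun) runs += 1;
      inRun = true;
    } else {
      inRun = false;
    }
  }
  return runs;
}

// A sketch of the full rule set; countSyllables is a hypothetical name.
function countSyllables(word: string): number {
  const w = word.toLowerCase();
  let count = countVowelRuns(w);

  if (w.endsWith("ed") && count >= 2) {
    // Non-contributing "ed", e.g. "jumped": drop it and recount.
    count = countVowelRuns(w.slice(0, -2));
  } else if (w.endsWith("e") && count >= 2) {
    // Silent "e", e.g. "plane": drop it and recount.
    count = countVowelRuns(w.slice(0, -1));
  }

  // Every word has at least one syllable.
  return Math.max(count, 1);
}
```

For example, "plane" initially counts two vowel runs ("a" and "e"), then the silent "e" rule drops it to 1 syllable; "jumped" similarly drops from 2 to 1 via the "ed" rule.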
Now, let's step back and think about it.
At one extreme, we could store a dictionary that holds all words and their associated number of syllables. Figuring out the number of syllables for a word would then only be a matter of looking that word up in the stored dictionary.
The opposite end of the spectrum is a completely rules-based approach. Given a word, a series of rules, such as those above, is run against the word, and a number representing its syllable count is output.
Typically, though, the best approach seems to be a hybrid: store a chosen corpus of words mapping to their syllable counts for lookup, then use the rules-based approach on words not contained in this "offline" dictionary.
This is actually how the readability-component works.
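As a sketch, the hybrid lookup could look like the following, building on the countSyllables function above. The dictionary entries shown here are illustrative examples, not the component's actual data.

```typescript
// Words the rules miscount, mapped to their true syllable counts.
const offlineDictionary: Record<string, number> = {
  area: 3, // rules see only two runs: "a" and "ea"
  being: 2, // rules see a single "ei" run
};

// Prefer the stored count; fall back to the rules otherwise.
function syllables(word: string): number {
  const w = word.toLowerCase();
  return offlineDictionary[w] ?? countSyllables(w);
}
```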
Before selecting an offline dictionary, I wanted to understand how the rules above performed on a set of words. For this, I used hyphenation data made available via the Moby Project.
| Data descriptor | Data value |
| --- | --- |
| Number of words for which rules correctly identified syllable count | 127,317 |
| Number of words for which rules incorrectly identified syllable count | 43,898 |
| Number of words for which syllable count was wrong and are common words | 1,063 |
| Number of words in Moby Project corpus | 187,175 |
| Number of words filtered out from Moby Project corpus (word contains hyphen or space, or has > 6 syllables) | 15,960 |
| Number of words within Moby Project corpus used in calculations | 171,215 |
| Percentage of words for which syllable counts were correctly identified by rules | 74.361% |
| Average difference from rules-based syllable count to real syllable count on incorrectly identified words | 1.055 |
The code used to generate this performance test can be found in the readability-component repository.
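As a rough sketch, the measurement behind the numbers above could look like this. It assumes the Moby hyphenation data has already been parsed into word/true-count pairs; the shapes and names are assumptions, not the repository's actual code.

```typescript
interface Entry {
  word: string;
  trueCount: number; // syllable count implied by the hyphenation data
}

function evaluate(entries: Entry[]) {
  let correct = 0;
  let totalError = 0;
  for (const { word, trueCount } of entries) {
    const predicted = countSyllables(word);
    if (predicted === trueCount) {
      correct += 1;
    } else {
      totalError += Math.abs(predicted - trueCount);
    }
  }
  const incorrect = entries.length - correct;
  return {
    // ~74.361% on the filtered Moby corpus
    accuracy: correct / entries.length,
    // ~1.055 syllables on the words the rules got wrong
    avgErrorWhenWrong: incorrect > 0 ? totalError / incorrect : 0,
  };
}
```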
Looking at the data above, the rules do surprisingly well, correctly identifying around 74% of the syllable counts for words in the Moby Project corpus.
And when the rules are wrong, they are on average off by around 1 syllable.
Still, there are around 43,000 words on which the rules were wrong. That's sizable, but those words are all that need attention, since the rules work well everywhere else.
I wanted to trim this down further, as this dictionary would be served in the readability-component JavaScript bundle. With that constraint, I wanted to store locally only what would be most impactful. My theory may not be the best, but it is simple enough: only store words for which the rules were incorrect and which are also considered common words.
For a corpus of common words, I made use of the 10,000 most common English words derived from Google's Trillion Word Corpus.
From the roughly 43,000 cases where the rules were incorrect, I ran those words against the common words list to see which were common. The result was around 1,000 common words.
These are the words that make up readability-component's offline dictionary.
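A sketch of that filtering step, with illustrative names: incorrectWords maps each miscounted word to its true syllable count, and commonWords holds the 10,000 most common words.

```typescript
function buildOfflineDictionary(
  incorrectWords: Map<string, number>,
  commonWords: Set<string>
): Map<string, number> {
  const dictionary = new Map<string, number>();
  for (const [word, trueCount] of incorrectWords) {
    // Keep only miscounted words that are also common.
    if (commonWords.has(word)) {
      dictionary.set(word, trueCount);
    }
  }
  return dictionary;
}
```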
Alternative readability tests
There are more readability formulas than the ones mentioned here, although the Flesch reading-ease is one of the most popular. Another popular one I'll call out is the Dale-Chall readability formula.
In this case, a corpus of words is used that is driven primarily by familiarity: words understood by at least 80% of 4th graders. If a word is not in the corpus, it is considered difficult.
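For reference, the Dale-Chall raw score is computed as

$$0.1579\left(\frac{\text{difficult words}}{\text{total words}}\times 100\right) + 0.0496\left(\frac{\text{total words}}{\text{total sentences}}\right)$$

with 3.6365 added when more than 5% of the words are difficult.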
It should be noted that this readability metric has nothing to do with syllable counting.