Suggestions for SPEAKER

Notes about the plugin SPEAKER for Text-To-Speech conversion. These are based on versions 3.4.2 and 3.4.4. This page will soon be updated to version 4.0.x.

Dated 9 January 2023, updated 13 February
Bernard Bel

1) TRACE FILE

I added 3 patches (referenced as "Patch nnnn Bernard Bel" etc.) to the code in "SpeakerCaster.php" sent to the technical team. The aim is to produce a trace file — for example "post-10938.txt" — located in the uploads/speaker folder:

https://lebonheurestpossible.org/wp-content/uploads/speaker/post-10938.txt

For the time being, I read these files via FTP and a code editor.

➡ A suggestion is to display a link to the trace file in your plugin after the production of a vocal transcription.

As indicated by the "a" option in fopen(), the "post-*.txt" file is produced incrementally. Therefore, it must be deleted before starting a conversion. This is achieved by patch 4, which also deletes any "post-10938-WARNING.txt" file created in the previous attempt. In addition, patch 4 deletes all "tmp-*.mp3" files which remain in the folder in case the production failed due to a time-out (see point 5).
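The clean-up and the incremental trace described above can be sketched as follows. This is only an illustration, not the actual patch: the function names speaker_clean_previous_run() and speaker_trace() are hypothetical, and the folder path is passed in as a parameter rather than taken from the plugin.

```php
<?php
// Hypothetical helper: delete the stale trace file, the warning file and any
// leftover temporary MP3 files before a new conversion of the given post.
function speaker_clean_previous_run(string $dir, int $post_id): array
{
    $candidates = [
        "{$dir}/post-{$post_id}.txt",
        "{$dir}/post-{$post_id}-WARNING.txt",
    ];
    // The "tmp-*.mp3" pattern is the one quoted in the text above.
    $tmp = glob("{$dir}/tmp-*.mp3");
    $deleted = [];
    foreach (array_merge($candidates, $tmp === false ? [] : $tmp) as $file) {
        if (is_file($file) && unlink($file)) {
            $deleted[] = basename($file);
        }
    }
    sort($deleted);
    return $deleted;
}

// Hypothetical helper: append one entry to the trace file. The "a" mode is
// what makes the file grow incrementally during a conversion.
function speaker_trace(string $dir, int $post_id, string $text): void
{
    $handle = fopen("{$dir}/post-{$post_id}.txt", 'a');
    if ($handle !== false) {
        fwrite($handle, $text . "\n");
        fclose($handle);
    }
}
```

Calling the clean-up once at the start of each conversion guarantees that the end of the trace file always belongs to the latest attempt.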

If the conversion fails with "error 3" displayed by the plugin (see point 4), the last part of "post-10938.txt" is the one that caused the process to fail. This makes it easy to identify errors caused by inappropriate regex rules.

2) ERROR 503

I haven't been able to explain the origin of the "error 503" alert displayed (on my installation) when processing a long text. In fact, this alert is meaningless because the process goes on (and generally succeeds) even if the "OK" button has not been clicked.

The alert shows up after approximately 1 minute, and therefore only on long posts. Note that it does not appear in batch processing.

In my (limited) experience of web design, error 503 may signal a PHP syntax error, or (more likely) an attempt to open an unreachable resource. For instance, calling "https://lebonheurestpossible.org/wp-admin/post.php?post=nnnn&action=edit" when "nnnn" is not the id of an existing post, displays an error 503.

3) SOFT HYPHENS

I struggled a very long time with regex rules that did not seem to work. For instance, by default, Google's French TTS pronounces "rebouteux" (bone-setter) as "raybouteu", which sounds utterly wrong. Yet, the rule:

/rebouteux/
rœbouteu

sometimes worked and sometimes didn't. Looking at the trace file, I discovered that "rebouteux" was (sometimes) converted to variants containing invisible soft hyphens (U+00AD) between syllables, so that the preceding rule would not apply.

I wrongly assumed that these soft hyphens (U+00AD, written "&shy;" in HTML) had been inserted by SPEAKER. In reality, they are produced by some WordPress themes, and by plugins such as wp-Typography if line-breaking is activated. In any case, they should be erased before applying regex rules.
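Erasing soft hyphens before the rules are applied could look like the sketch below. The function name strip_soft_hyphens() is hypothetical and this is not the code used by the plugin; it only assumes UTF-8 text in which the soft hyphen appears either as the raw character or as an HTML entity.

```php
<?php
// Hypothetical helper: remove soft hyphens so that regex rules see whole words.
function strip_soft_hyphens(string $text): string
{
    // Raw U+00AD character (UTF-8 encoded).
    $text = str_replace("\u{00AD}", '', $text);
    // Named and numeric HTML entity forms of the same character.
    return str_ireplace(['&shy;', '&#173;', '&#xAD;'], '', $text);
}

$raw = "Le re\u{00AD}bou\u{00AD}teux et le re&shy;bouteux";
$clean = strip_soft_hyphens($raw);

// The rule quoted above now matches both occurrences.
echo preg_replace('/rebouteux/u', 'rœbouteu', $clean), "\n";
```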

➡ This issue has been solved in version 3.4.4.

4) ERROR 3

Earlier, I had experienced two cases of failure to produce the vocal transcription: "Error 3", and an apparent time-out on long posts. In many (yet not all) cases, replacing the "fr-FR-Neural2-D" voice with "fr-FR-Wavenet-E" would bypass this error. However, looking at the trace file revealed that, in most cases, this "error 3" was caused by garbage syntax produced by wrong (yet syntactically correct) regex rules. Looking at the end of the trace file (see point 1) after a defective production allowed me to notice syntax/rule problems and get rid of most "error 3" failures. This is the main reason for creating trace files (point 1).

Still, this issue remains open, since, in a few rare cases, I get an "error 3" alert even though the syntax of the last part of text seems perfectly correct. I haven't yet understood why this happens. In several cases, however, the "error 3" failure disappears when switching from a Neural2 voice to WaveNet or Standard.

Worse, a syntax error in any regex rule causes regex_content_replace() to crash, and even the previous MP3 file is deleted!

➡ This is a serious issue: the plugin should check (and display) error messages returned by preg_replace().
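A possible way of checking rules before they are applied is sketched below. The helper names regex_rule_error() and apply_rules_safely() are hypothetical, the rules are assumed to be stored as a pattern => replacement array, and preg_last_error_msg() requires PHP 8.0 or later. A rule that does not compile is skipped and reported instead of aborting the whole conversion.

```php
<?php
// Hypothetical helper: return null if the pattern compiles, else an error message.
function regex_rule_error(string $pattern): ?string
{
    error_clear_last();
    // preg_match() returns false (and raises a warning) on an invalid pattern.
    if (@preg_match($pattern, '') !== false) {
        return null;
    }
    $last = error_get_last();
    return $last['message'] ?? preg_last_error_msg();
}

// Hypothetical helper: apply each rule in turn, collecting errors in $errors.
function apply_rules_safely(string $text, array $rules, array &$errors): string
{
    foreach ($rules as $pattern => $replacement) {
        $error = regex_rule_error((string) $pattern);
        if ($error !== null) {
            $errors[] = "{$pattern}: {$error}";
            continue; // skip the faulty rule, keep the text unchanged
        }
        $result = preg_replace($pattern, $replacement, $text);
        if ($result === null) {
            // Runtime failure, for example a backtrack limit being reached.
            $errors[] = "{$pattern}: " . preg_last_error_msg();
            continue;
        }
        $text = $result;
    }
    return $text;
}
```

The collected messages could then be displayed to the user, and the previous MP3 file kept untouched whenever the list is not empty.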

5) TIME-OUT

Ignoring "error 503" (point 2), sometimes the process is simply abandoned, which suggests a time-out without warning. In this case, we have an incomplete "post*.txt" trace file in the uploads/speaker folder, along with several "temp-post*.mp3" which have not been glued to create the final audio file, because their list is incomplete. This crowding of the uploads/speaker folder by failed "tmp" MP3 files is a serious problem.

This failure does not always occur in the same parts of the text, which leads me to think it is merely a time-out problem. It occurs after roughly 3 minutes of processing. My guess is that the plugin's "session" with Google times out.

Setting high values for "set_time_limit()" in wp-config.php and/or "php_value max_execution_time" in .htaccess did not change anything.

This is a matter of concern, because TTS on websites is used by visually-impaired people, plus readers wishing to listen to the posts on their smartphones while doing something else such as driving or walking. They don't mind listening to more than 1 hour of audio, as they usually do with radio podcasts. The SPEAKER plugin (along with regex rules) produces a voice rendering which is good enough for extended listening.

➡ The plugin should warn the user when such a time-out occurs. For instance, it could look for "temp-post*.mp3" files, display an alert, and then erase them.
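A minimal sketch of this check is given below. The function names are hypothetical, the folder is passed as a parameter, and the file pattern is the one quoted above; how the message is displayed is left to the plugin.

```php
<?php
// Hypothetical helper: list temporary audio chunks left by an interrupted conversion.
function find_leftover_chunks(string $dir, string $pattern = 'temp-post*.mp3'): array
{
    $files = glob($dir . '/' . $pattern);
    return $files === false ? [] : $files;
}

// Hypothetical helper: build a warning message, then delete the leftovers.
// Returns null when nothing was left behind.
function report_and_purge_leftovers(string $dir): ?string
{
    $files = find_leftover_chunks($dir);
    if ($files === []) {
        return null;
    }
    foreach ($files as $file) {
        unlink($file);
    }
    return sprintf(
        'Conversion appears to have timed out: %d temporary file(s) removed.',
        count($files)
    );
}
```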

➡ Admittedly, the designers probably cannot extend the duration of a Google session, whatever this means. However (when time permits), they could split the process into as many sessions as necessary: keep a pointer to the last part of text that was processed correctly, start a new session from there to send the following parts, and so on. Once all parts have been processed, launch the gluing of "temp" files. This would be an outstanding upgrade!
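The idea of resuming from the last processed part can be illustrated as follows. Everything here is an assumption: the function name, the checkpoint file, and the $synthesize callable that stands in for whatever call produces one temporary MP3 file. The sketch only shows the bookkeeping, not the communication with Google.

```php
<?php
// Hypothetical sketch: process at most $max_per_run parts per call, and record
// the index of the next part in a checkpoint file so a later call can resume.
// Returns true once every part has been processed.
function process_parts_resumably(
    array $parts,
    callable $synthesize,
    string $checkpoint_file,
    int $max_per_run
): bool {
    // Resume from the checkpoint if a previous run left one.
    $next = is_file($checkpoint_file) ? (int) file_get_contents($checkpoint_file) : 0;
    $end = min(count($parts), $next + $max_per_run);

    for ($i = $next; $i < $end; $i++) {
        $synthesize($i, $parts[$i]); // stand-in for producing one temp file
        file_put_contents($checkpoint_file, (string) ($i + 1));
    }

    if ($end >= count($parts)) {
        if (is_file($checkpoint_file)) {
            unlink($checkpoint_file);
        }
        return true;  // all parts done: the temp files can now be glued
    }
    return false;     // more runs are needed
}
```

Each call stays short, so no single run needs to last as long as the observed 3 minutes, and an interrupted run loses at most the part that was in progress.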

6) SSML

I haven't checked all options.

I noticed that when say-as is used to speak a cardinal, the </say-as> tag is often followed by an extra tag, which creates an unwanted break. If the number was followed by an end-of-sentence period, the period falls after that extra tag and is therefore spoken as "period" or "point".

➡ The main missing feature is the conversion of a text fragment to the international phonetic alphabet (IPA or, better, SAMPA). The <phoneme> element is ignored. For instance, phonetic transcription won't be used in the following text:

<phoneme alphabet="ipa" ph="ˈhændbʊk ɒv njuːˈtrɪʃᵊnᵊl ˈvæljuː ɒv fuːdz ɪn ˈkɒmən ˈjuːnɪts">Handbook of Nutritional Value of Foods in Common Units</phoneme>

Another enhancement would be support for the <lexicon> element, which reads a custom lexicon.

7) LONG HTML PARTS

In general, putting a closing HTML tag in the middle of a paragraph creates the risk that it will be used as an end of part for the chunking of text sent to Google. This results in having an unwanted pause and a change back to the default voice. I face this problem (read below) with citations bounded by <blockquote> to which I automatically assign a different voice.

Parts of HTML sent to Google are limited to 4500 chars. The text is chunked at closing HTML tags ("</"). However, an error may occur if a fragment of text without "</" is larger than 4500 chars. In my use of the plugin, this may occur in citations. Citations (inside <blockquote> elements) are automatically assigned a voice different from the main voice. To this effect, I use regex rules changing the voice after <blockquote> and again after </blockquote>. However, this creates a problem if a "</p>" is found in the citation, and if this "</p>" happens to be used as the cutting point for creating the part of text.

If the citation is broken at a "</p>" border, the voice assignment will not occur on the next fragment: the voice returns to the default main voice. I haven't yet figured out regex rules that would assign the citation voice to every paragraph inside a <blockquote>.

To avoid this, I suppress all "</p>" inside citations. For instance, instead of:

<!-- wp:quote {"extUtilities":[]} -->
<blockquote class="wp-block-quote"><!-- wp:paragraph -->
<p>Blah blah 1</p>
<!-- /wp:paragraph -->
<!-- wp:paragraph -->
<p>Blah blah 2</p>
<!-- /wp:paragraph --></blockquote>
<!-- /wp:quote -->

I edit:

<!-- wp:quote {"extUtilities":[]} -->
<blockquote class="wp-block-quote"><!-- wp:paragraph -->
<p>Blah blah 1<br><br>Blah blah 2</p>
<!-- /wp:paragraph --></blockquote>
<!-- /wp:quote -->

Then the following regex rules are applied:

/<blockquote[^>]*>/u
<voice name="fr-FR-Wavenet-C"> … Start citation…
/<\/blockquote>/u
… </voice>End citation…

Consequently, if the citation grows larger than 4500 chars, an overflow will occur in the Google process. The user should be warned against it. This is the task of patch 2 in "SpeakerCaster.php": if some fragments are larger than 4500 chars, a file named "post-xxxx-WARNING.txt" is created in the uploads/speaker folder, which contains a notice and the faulty parts of text. (I see it immediately because I keep checking this folder via FTP.)

➡ I suggest that the plugin detects the presence of this "post-xxxx-WARNING.txt" file and displays a link to its content.
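The check for oversized fragments can be sketched as follows. This is not the code of patch 2: the function names are hypothetical, and the splitting rule (cut after every closing tag) is only the behaviour described above. Lengths are counted in bytes with strlen(), which is a conservative measure for UTF-8 text since a character never takes less than one byte.

```php
<?php
// Hypothetical helper: cut the HTML after each closing tag ("</...>").
function split_on_closing_tags(string $html): array
{
    $fragments = [];
    $start = 0;
    $length = strlen($html);
    while ($start < $length) {
        $open = strpos($html, '</', $start);
        $close = ($open === false) ? false : strpos($html, '>', $open);
        if ($close === false) {
            $fragments[] = substr($html, $start); // tail without a closing tag
            break;
        }
        $fragments[] = substr($html, $start, $close - $start + 1);
        $start = $close + 1;
    }
    return $fragments;
}

// Hypothetical helper: return the fragments that exceed the size limit.
function find_oversized_fragments(string $html, int $limit = 4500): array
{
    return array_values(array_filter(
        split_on_closing_tags($html),
        fn(string $fragment): bool => strlen($fragment) > $limit
    ));
}
```

A non-empty result is what would be written to the "post-xxxx-WARNING.txt" file, and what the plugin could link to after the conversion.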

There is nothing else we can do, technically, in this case. It's up to the editor of the post to break citations into smaller parts.

8) A SUITABLE PLAYER

Current SPEAKER players (up to version 3.4.4) are not suitable for text-to-speech. The following essential features are missing:

Jumping forward/back 10 or 15 seconds
Option to display the player in a "floating" or "sticky" mode, so that its controls remain accessible while scrolling the page

For the time being, I rely on MEKS Audio Player which entirely fulfils these requirements. It sticks at the bottom of the page, offers ±15 second jumps, and is user-friendly for dragging the pointer to a specific time.

9) LANGUAGE-SPECIFIC REPLACEMENT PATTERNS

Regex (regular-expression) rules, named Replacement patterns on SPEAKER's interface, are among the most powerful tools available in this plugin. I've used them for customising optional liaisons in French TTS, adding missing mandatory ones and cutting (a few) prohibited ones. They are also used to spell out abbreviations, for instance reading "et al." as "and colleagues", as an educated reader would do.

A strong limitation is that a unique set of rules (patterns) applies to the entire text. This is troublesome when several voices/languages are used on the same page, with rules used to provide a better pronunciation of a foreign word.

For instance, when the French TTS reads the title of the journal "The Lancet", it produces incomprehensible garbage. This is solved by the following rule:

/The\sLancet/u
ze lent sept

However, if the same rule is applied in a part of text using an English voice, this weird replacement will in turn produce a weird result…

The solution would be to have a set of generic replacement patterns (applicable to all voices/languages) plus sets of specific replacement patterns, one for each language.

A new version of regex_content_replace() needs to be designed to this effect, taking into account the language to select its associated set of rules. Near line 300 of SpeakerCaster.php we would have:

$ssml = $this->specific_regex_content_replace($post_content, $lang_code);
$ssml = apply_filters( 'speaker_before_synthesis', $ssml, $post_id );
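How generic and language-specific sets could be combined is sketched below. This is an illustration only: the function name apply_language_rules() and the two rule arrays are hypothetical, and the plugin's own specific_regex_content_replace() may be organised quite differently.

```php
<?php
// Hypothetical sketch: generic rules apply to every language, while each
// language code has its own additional set of rules.
function apply_language_rules(
    string $content,
    string $lang_code,
    array $generic_rules,
    array $rules_by_lang
): string {
    // Specific rules come after the generic ones; a language without its own
    // set simply gets the generic rules.
    $rules = array_merge($generic_rules, $rules_by_lang[$lang_code] ?? []);
    foreach ($rules as $pattern => $replacement) {
        $result = preg_replace($pattern, $replacement, $content);
        if ($result !== null) { // keep the text unchanged if a rule fails
            $content = $result;
        }
    }
    return $content;
}

// Illustrative rule sets built from the examples given on this page.
$generic = ['/et al\./u' => 'and colleagues'];
$by_lang = ['fr-FR' => ['/The\sLancet/u' => 'ze lent sept']];

echo apply_language_rules('The Lancet, Smith et al.', 'fr-FR', $generic, $by_lang), "\n";
echo apply_language_rules('The Lancet, Smith et al.', 'en-US', $generic, $by_lang), "\n";
```

With these sets, the "The Lancet" replacement is applied only when the French voice is speaking, which avoids the weird result described above in English parts.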

➡ This change was made in version 3.4.4, although the interface does not seem to offer options for language-specific replacement patterns.

10) MULTIPLE PAGES

Posts on multiple pages are not converted entirely: only page 1 is processed.

To pick up the entire page/post, the following instructions could be used:

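define('BASE_WP', "...");   // name of the WordPress database
define('WP_TABLE', "...");  // name of the posts table
$wp_database_host = "...";
$wp_database_user = "...";
$wp_database_pwd = "...";
$bdd_wp = new PDO("mysql:host=" . $wp_database_host . ";dbname=" . BASE_WP, $wp_database_user, $wp_database_pwd);
$bdd_wp->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

// $id is the ID of the post being converted. It is bound as a parameter
// rather than concatenated into the SQL string.
$query = "SELECT post_content FROM " . BASE_WP . "." . WP_TABLE . " WHERE post_status = 'publish' AND ID = :id";
$stmt = $bdd_wp->prepare($query);
$stmt->execute([':id' => (int) $id]);
$ligne = $stmt->fetch(PDO::FETCH_ASSOC);
$stmt->closeCursor();
// fetch() returns false when no published post has this ID.
$post_content = ($ligne !== false) ? $ligne['post_content'] : '';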

➡ Follow-up

lovelysupport replied on 30 January 2023:

Hello, thanks for tirelessly providing details to improve the Speaker. The points 3 (SOFT HYPHENS) and 9 (LANGUAGE-SPECIFIC REPLACEMENT PATTERNS) will be added in the next version of the plugin. The points 6 and 10 will be added in future updates. The points 2, 4, 7 cannot be reproduced or fixed for technical reasons. The points 1, 5, 8 will be considered in the future.

➡ This page is a complement to https://lebonheurestpossible.org/tts-fr/ which can be read with automatic translation.