Transcription Guidelines

Use of Transcripts

In addition to audio, you may provide transcripts as input. See Input Options for the benefits of using transcripts.

For each audio file, the transcript contains the words heard in the audio, written in the native writing system of the language spoken. For example, the audio file

felix_1.wav

has the transcript:

This is actually a story about my grandfather um who fought in the war

Transcripts are optional, but they do help improve the quality of lip sync. When using transcripts, you must use the correct language pack for the specific language or dialect.

Transcription Best Practices

The following are best practices for transcription.

Encoding

All transcripts, regardless of language, should have UTF-8 encoding.
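As a minimal sketch of the encoding rule above, the following Python snippet writes a transcript explicitly as UTF-8 and checks whether an existing file decodes cleanly. The helper names `write_transcript` and `is_valid_utf8` are illustrative assumptions and are not part of the product.

```python
from pathlib import Path


def write_transcript(path, text):
    """Write transcript text to disk, explicitly encoded as UTF-8."""
    Path(path).write_text(text, encoding="utf-8")


def is_valid_utf8(path):
    """Return True if the file's bytes decode cleanly as UTF-8."""
    try:
        Path(path).read_bytes().decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False
```

Passing the encoding explicitly avoids depending on the platform default, which can differ between operating systems.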

Accuracy

The transcription should be as accurate as possible. For best results, transcribe after recording rather than using the original script. For example, if a line in the script was "No, no way" but the actor actually said "Nope, no way", transcribe it as the latter.

However, the system is fairly tolerant of mistakes. If something is mis-transcribed, the auto-correction feature will usually keep the lip sync accurate. (See Phonetic and Acoustic lip sync processing.)

Numerals

Avoid Arabic numerals, since their pronunciation is unpredictable. Spell out numbers as they are spoken. Thus 1701 should be written out as "seventeen oh one", "seventeen hundred and one", or "one thousand seven hundred and one", depending on how it was actually spoken in the audio.
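Because the correct spelling depends on what was actually said, numerals are better flagged for a human to rewrite than converted automatically. A minimal sketch of such a check follows; the function name `find_numerals` is a hypothetical helper, not part of the product.

```python
import re

# Runs of digits, optionally joined by , or . (e.g. 1701, 1,701, 3.5).
DIGIT_RUN = re.compile(r"\d+(?:[.,]\d+)*")


def find_numerals(transcript):
    """Return (position, text) for each run of digits in the transcript."""
    return [(m.start(), m.group()) for m in DIGIT_RUN.finditer(transcript)]
```

For example, `find_numerals("The ship was launched in 1701")` reports the run `1701`, which the transcriber can then replace with the wording heard in the audio.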

Punctuation

Punctuation marks (periods, commas, hyphens at line breaks, etc.) are not required. They do not have to be removed, since they are generally ignored. Note, however, that punctuation adds nothing to the animation in terms of interpreting pauses: pauses are determined directly from the audio.

However, do use punctuation where it is part of the correct spelling of the word, for example:

Apostrophes in contractions (English don’t, French c’est)
Hyphens in compounds, or special forms like French pronoun inversion (a-t-il)

Similarly, include accents, umlauts, and other diacritical marks that are part of the correct spelling of the word (e.g., French très). Doing so improves the accuracy of pronunciation.

Special Characters

Special characters, including $ £ € & % # @ ¥ ₽ °, will be ignored. For best results, write them out using the words that are heard in the audio; for example, "dollars", "and", "at", etc.
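A simple pre-check can locate these characters so they can be written out by hand. The sketch below covers only the characters listed above; the list is introduced with "including", so it may not be exhaustive, and the name `find_special_chars` is a hypothetical helper.

```python
# Only the characters named in the guideline; the real set may be larger.
SPECIAL_CHARS = set("$£€&%#@¥₽°")


def find_special_chars(transcript):
    """Return (position, character) for each listed special character."""
    return [(i, ch) for i, ch in enumerate(transcript) if ch in SPECIAL_CHARS]
```

For example, `find_special_chars("Tom & Jerry owe $5")` flags the `&` and the `$`, which would be rewritten as "and" and "dollars" if that is what the speaker said.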

Initialisms

Initialisms should be written with dots or spaces to separate the letters:

A.B.C.
A. B. C.
A B C

These are all acceptable and are understood to be pronounced by saying the individual letters. Initialisms written without dots or spaces (e.g., ABC) will be interpreted as being pronounced as a word and should be avoided.
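Unseparated initialisms can be found with a simple pattern. The sketch below is a heuristic under stated assumptions: it treats any whole word of two or more uppercase ASCII letters as a candidate, so it will also flag acronyms that really are pronounced as words and should be left alone. The names `find_unseparated_initialisms` and `space_out` are hypothetical helpers.

```python
import re

# A whole word made of two or more uppercase ASCII letters, e.g. ABC.
UNSEPARATED_CAPS = re.compile(r"\b[A-Z]{2,}\b")


def find_unseparated_initialisms(transcript):
    """Return all-caps words that have no dots or spaces between letters."""
    return UNSEPARATED_CAPS.findall(transcript)


def space_out(token):
    """Rewrite a token such as 'ABC' in the accepted 'A B C' form."""
    return " ".join(token)
```

Each flagged word should be reviewed against the audio before applying `space_out`, since only the recording shows whether the letters were spoken individually.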

Letter Repetition

Avoid repeating letters to indicate that a word is sustained, as in "Oh noooooooooo!". Our system determines duration from the audio only, not from spelling.
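Stretched spellings can be flagged by looking for the same letter three or more times in a row. The threshold of three is an assumption chosen so that ordinary double letters (as in "see" or "book") are not flagged, and `find_stretched_words` is a hypothetical helper name.

```python
import re

# A letter followed by two or more copies of itself (three or more in a row).
REPEATED_LETTER = re.compile(r"([^\W\d_])\1{2,}")


def find_stretched_words(transcript):
    """Return whitespace-separated words containing a tripled letter."""
    return [w for w in transcript.split() if REPEATED_LETTER.search(w.lower())]
```

For example, `find_stretched_words("Oh noooooooooo!")` flags `noooooooooo!`, which would be corrected to "no".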

Non-Words (Breath, Emotes, Grunts)

You can try to spell out non-word vocalizations like breaths, grunts, ers, ums, and ahs where they occur, using simple spellings like “ah”, “oh”, “hm”, “aw”. The system will attempt to use that information to pronounce the sounds.

Alternatively, you can omit those sounds from the transcript and let the system interpret the sounds without help. The auto-correct feature automatically switches from the phonetic lip sync system to the acoustic lip sync system when unexpected sounds are detected (see Phonetic and Acoustic lip sync processing). This often leads to better-looking lip sync than spelling out the sounds, because the acoustic system handles a variety of non-linguistic sounds better.

Beyond lip sync, SGX also uses non-speech vocalizations to help drive nonverbal behavior. For example, effort sounds (like grunts) will trigger the auto-mode Effort (see Auto Modes for more information). This does not require any intervention in the transcript.