Audio Evidence Forensics
- Apr 23, 2019

Unlocking The Keys to Analyzing The Known and Unknown Speaker

Updated: Apr 29, 2019

In criminal and family law, an audio forensic analyst may be asked to compare a cell phone recording of an unknown speaker with a known speaker sample and to identify the consistencies and inconsistencies that exist between the two voices. This article discusses how jaw mechanics, tones, volume, voice features, articulations, pitch fluctuations, human acoustical space and the vocal tract combine to produce and manipulate sound. It covers phonetics, breath control, rhythm and timing, speech impediments, the connection and transition of vowels and consonants, consistency in volume, vocal production execution, and the terminology defining specific voice features. It also discusses how the subconscious is engaged in and linked to producing social dialects and speech habits. Finally, it provides examples of practical techniques to use during critical listening analysis that can be supported by reliable findings obtained with EQ and spectrum analysis tools.

I. INTRODUCTION

Audio can be crucial evidence in a family or criminal law case. In some states, audio evidence remains relatively new to the discovery process. However, crime is a growth industry, and police body cam footage and cell phone recordings made by individuals are commonly expected to surface. During the forensic process, when we enhance and clean up audio to make voices more present in the mix, we unmask a speaker’s vocal production and can therefore hear and perceive emotion, style, and impact. Enhancement releases more of what may have been buried by stereo field noise and poor recording quality.

A challenging request is to compare recordings of a known and an unknown speaker to reveal consistencies or inconsistencies between the two voices. A deeper understanding of voice features, knowing what to listen for, and the ability to identify what you are hearing with basic industry standard voice terminology allow us to conduct a critical listening analysis between two voices that can also be supported by visual spectrum analyzer tools. Since voice and speech are interchangeable, the sections of this article cover and cross-reference voice production features, subconscious habits, and analysis methods to consider when comparing a known speaker recording sample with an unknown voice recording. The article describes how to identify some specific voice features by ear and how they are executed. It also presents the terminology necessary to explain these features and deliver a report on the evidence. The article’s intent is to offer practical application information for conducting a reliable voice evidence analysis. The article concludes with a summary.

II. CONDUCTING CRITICAL LISTENING ANALYSIS OF MECHANICS AND PHONETICS

2.1 Understanding Basic Voice & Speech

When one looks at a face, one can see its bone structure: a large or small nose, or a large or small jaw. Our physical make-up includes the nasal cavity, the hard and soft palates, the tongue, the vocal folds, the larynx, the trachea, the esophagus and the vocal tract (Fig. 1).

The individual produces a core sound that is modified by that individual’s physical make-up and mechanics. A voice can be monotone, it can fluctuate in pitch, it can be loud or soft, it can have consistent distortions, it can have a regional accent, it can sound thin or full. It is a fabulous instrument. A person’s speech can be trained or untrained, or it can have impediments that restrict the flexibility needed to make the vocalizations the individual may want to make. For example, it is this analyst’s experience and opinion, from studies conducted, that some speech impaired individuals have great difficulty holding a sound for even a few ticks of a second, which makes singing extremely difficult. Subconsciously, this is a confidence issue as well. People find a great deal of comfort in their own voices until challenged to control them. [1]

Alternatively, it is well known that some people who stutter have become famous singers by working very hard to train their voices and overcome their confidence issues. A talker can have a natural cultural accent or manufacture one, which would require training and practice to execute. Individuals speak in slang and have trendy or cultural rhythmic and tone patterns even while speaking the same language. For example, the Scottish obviously do not sound like the Irish, and so forth. Some voices are remarkably pleasant and gifted and some you want to run away from before your ears are damaged! (Pinch your nose and speak as loudly and monotone as you can.) We can say that everyone is, for the most part, sonically unique due to their individual physiology, mechanics, voice features and execution. However, we cannot rely on such a broad statement alone. “The crucial question for the trier of fact, then, is not whether voices are unique, but whether an examiner can show that he or she can accurately distinguish two similar voices spoken under significantly different conditions, as is almost always the case in forensic settings.” [2] It is in the execution that we find numerous opportunities for analysis.

2.2 Features Relevant to Voice Production

Placement: The headspace contains a front and back space. When the jaw drops open and the soft palate at the top of the throat rises, an acoustical dome is created. The top of the inside of the jaw structure, beneath the nasal cavity, is the hard palate. The front of the face is traditionally called the mask. Air travels up through the vocal folds, causing them to vibrate and produce the core sound. Jaw mechanics, the acoustical dome, and the mask modify the core sound, creating the pitch and the tones of the voice. The voice is essentially placed and executed from the vocal tract. Speakers differ in timing, rhythm, and expression, and these features are often adopted from other speakers: sound habits are subconsciously transferred into our own vocal execution in singing, language and accents, and social dialect trends. Some persons speak very quickly, shrilly and nasally, slur words together, or speak in short breathy blasts and phrases, while others speak with more control, in a more stately or smooth manner. This is crucial during comparison because the unknown voice recording also reflects the speaker’s situation. Is the unknown voice whispering or yelling? Is the speaker crying, or do they have a cold? When the known speaker repeats the same utterances and provides the sample from a controlled environment, we can look for discrepancies or consistencies in the execution. Note: It may be wise to request a long sample of dialog from the known speaker to ascertain features and habits. Here are a couple to look for:

Accent is a term that also describes a technique where the sound fluctuates with emphasis in the weight, heavy or light, of the execution of a short sound. For example, in the phrase “Baby listen to me”, we may see the accents in three ways as follows:

BAby listeN to ME.

Baby, LISten to me.

BaBY LISten TO ME.

This is important because a talker may rhythmically utilize accents predictably and this creates pitch fluctuations that we can see visually.

Tone is best described as color in the voice. For example, a voice high in pitch but remaining rich, full and round sounding represents engagement of a rounded acoustical dome and backspace. A voice that sounds thin and more nasal is projected forward into the mask and the hard palates of the face and a flatter inside surface. You may notice the voice sounds generally up in the front of the face, which makes for more output amplitude or volume due to the projection, while some speakers seem to have their voices trapped behind their jaws, making for less amplitude and a muddier sound. A voice can have a consistent distortion or manufacture a vocal fry. These features and artifacts depend on how sounds are placed in the acoustical dome and the mask and opened and closed through jaw mechanics. Features may reveal whether the person is speaking naturally or adding a forced artifact such as a fry to their voice to sound more compelling or to portray an emotion such as grief or trauma for the purpose of misleading the listener.

Avoid the Bias: It is not for the analyst to consider whether the intent of the speaker is to be fraudulent or misleading. We are not functioning as licensed private investigators in our scope of work. Our concern is only the science of what the audio says; whether that confirms or disproves what the attorney is seeking is for them to decide. Simply, in voice verification, we identify linguistic features, sonic features and vocal production, non-linguistic features such as ethnic background, sex and emotional state, and the unknown speaker’s environment and situation, to describe the voice being analyzed.

Vowel Sounds and Diphthongs: An important voice feature to compare is how a speaker combines vowel sounds. These combinations are called diphthongs, for example, i-e as in like, a-e as in make, or o-u as in mouse. Pay attention to the length of the vowel sounds as one goes into the next, and to any distinguishing characteristics added to these combinations, such as a cry or whine, or an accent giving added weight to the sound. Vowel sounds can also be distorted, meaning that subconsciously they are not fully executed, which suggests a style or a speech impediment.

Consonants: Placement is important to consonants. For example, in a spoken “S” sound you may hear a lisp. The “S” may sound a bit thick in comparison to other speakers who make their consonants sharp and very distinct, which is indicative of a tongue against the teeth and a projection very far forward into the mask, as opposed to a sound trapped more towards the back of the vocal tract. Similarly, with a “D”, a forward placement will register higher in pitch, thin and crisp, whereas a placement towards the back gives it a darker register.

Volume and timing habits: When listening to consonants during a phrase, note that in English it is often natural for a speaker to slur words together. If a consonant is at the end of a word and the following word begins with a vowel, it is natural to blend these two sounds together. Not every single word is robotically separated by a pause. As a simple example, the phrase “I will open the door” can blend like this: “Iwillopenthedoor”. We all do this in speech naturally. However, some people tend to slur words too much, or they mumble, making words that should be separated hard to distinguish. Other people speak more abruptly, sharply or crisply, either delicately or loudly. Some speakers have a habit of cutting off the ends of words, or of consistently decreasing their volume before the end of the phrase by subconsciously closing the jaw and then dumping their remaining air at the very end. Some talkers speak in short phrases or otherwise place frequent hesitations between lines. Gaps are good places for tampering with the evidence through edits.
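Timing habits such as hesitations, gaps, and trailing volume can also be inspected numerically. The following is a minimal sketch, not a forensic tool: it assumes a mono recording already loaded as a list of floating point samples, and the 8 kHz sample rate, 20 ms frame size, threshold, and function names are all illustrative assumptions. It computes a short-time RMS level per frame and reports runs of quiet frames as pauses. A synthetic tone, silence, tone signal stands in for speech.

```python
import math

def frame_rms(samples, frame_len):
    """RMS level of consecutive, non-overlapping frames."""
    levels = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        levels.append(math.sqrt(sum(x * x for x in frame) / frame_len))
    return levels

def find_pauses(levels, threshold, min_frames):
    """Return (first_frame, last_frame) pairs for quiet runs of at least min_frames."""
    pauses, run_start = [], None
    # A sentinel "loud" frame closes any quiet run that reaches the end.
    for i, level in enumerate(levels + [float("inf")]):
        if level < threshold:
            if run_start is None:
                run_start = i
        elif run_start is not None:
            if i - run_start >= min_frames:
                pauses.append((run_start, i - 1))
            run_start = None
    return pauses

RATE = 8000       # assumed sample rate in Hz
FRAME_LEN = 160   # 20 ms frames at 8 kHz

# Synthetic stand-in for speech: 0.5 s tone, 0.25 s silence, 0.5 s tone.
tone = [0.5 * math.sin(2 * math.pi * 200 * n / RATE) for n in range(RATE // 2)]
silence = [0.0] * (RATE // 4)
signal = tone + silence + tone

levels = frame_rms(signal, FRAME_LEN)
pauses = find_pauses(levels, threshold=0.05, min_frames=5)
pause_seconds = [(last - first + 1) * FRAME_LEN / RATE for first, last in pauses]
```

On real evidence the threshold would need to be set relative to the noise floor of the recording, and the same level curve can be read for a speaker's habit of dropping volume at the ends of phrases.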

Pitch: In EQ analysis, pitch and harmonics are visualized over the bandwidth. Some voices alter pitch frequently and some are monotone. Listen to how the voice rises and falls during a critical listening session while conducting a visual analysis. Pitch will change during diphthongs and as the voice moves from the front of the mask to the back of the acoustical dome and the throat. Phonetics and jaw mechanics work together to manipulate and modify the core sound coming from the body.
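How pitch rises and falls over time can be approximated with a frame-by-frame fundamental frequency estimate. The sketch below uses normalized autocorrelation, which is one common textbook approach and is not necessarily what any particular EQ or spectrum tool does internally. The search range of 60 to 400 Hz, the 0.5 voicing threshold, the 8 kHz sample rate, and the function names are assumptions chosen for illustration.

```python
import math

def estimate_f0(frame, rate, fmin=60.0, fmax=400.0, min_score=0.5):
    """Estimate the fundamental frequency (Hz) of one frame.

    Returns None when no lag reaches min_score (silent or weakly periodic frame).
    """
    lag_min = int(rate / fmax)
    lag_max = min(int(rate / fmin), len(frame) - 1)
    best_lag, best_score = None, min_score
    for lag in range(lag_min, lag_max + 1):
        head, tail = frame[:len(frame) - lag], frame[lag:]
        energy = math.sqrt(sum(x * x for x in head) * sum(y * y for y in tail))
        if energy == 0.0:
            continue
        score = sum(x * y for x, y in zip(head, tail)) / energy
        # The small margin keeps the shortest lag when multiples of the period tie.
        if score > best_score + 1e-6:
            best_lag, best_score = lag, score
    return rate / best_lag if best_lag else None

def pitch_track(samples, rate, frame_len):
    """F0 estimate for each consecutive, non-overlapping frame."""
    return [
        estimate_f0(samples[s:s + frame_len], rate)
        for s in range(0, len(samples) - frame_len + 1, frame_len)
    ]

RATE = 8000  # assumed sample rate in Hz
low = [math.sin(2 * math.pi * 100 * n / RATE) for n in range(400)]
gap = [0.0] * 400
high = [math.sin(2 * math.pi * 200 * n / RATE) for n in range(400)]

track = pitch_track(low + gap + high, RATE, frame_len=400)
```

Here the track steps from the lower tone, through an unvoiced frame, to the higher tone. Real speech is far less tidy: autocorrelation estimates are prone to octave errors, so any such track should be checked against what is heard in the critical listening session.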

Distortion: Many men’s voices are easily recognized by the distortion or gravel present in the lower tones. This gives the voice a great deal more weight in the frequency ranges below 1 kHz than in the kHz ranges, and often a broader band across the spectrum. Tenor voices and children’s and women’s voices typically occupy higher ranges, with little to no audibly perceived distortion.

Whispering: When a speaker attempts to disguise the voice, such as a caller on the phone, whispering is common and will raise frequency levels as the solid tones of the voice are diminished, more air is released, and the vocal fold vibrations change. If the speaker is not aware of or capable of significantly altering the voice by imitating another voice or an accent, the analyst may find valuable features and identifiable consistencies to analyze even during a whisper.
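One rough way to see the shift from solid tone to airy, noise-like sound is the zero-crossing rate, which tends to be low for voiced tones and high for breath noise. The sketch below is only an illustration under simplifying assumptions: a pure 150 Hz tone stands in for a voiced vowel, seeded white noise stands in for a whisper, and the 8 kHz sample rate is assumed. Real whispered speech sits somewhere between these two extremes.

```python
import math
import random

def zero_crossing_rate(samples):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(
        1 for a, b in zip(samples, samples[1:]) if (a >= 0.0) != (b >= 0.0)
    )
    return crossings / (len(samples) - 1)

RATE = 8000  # assumed sample rate in Hz

# Stand-in for a voiced vowel: one second of a 150 Hz tone.
voiced = [math.sin(2 * math.pi * 150 * n / RATE) for n in range(RATE)]

# Stand-in for breathy whisper: one second of seeded white noise.
rng = random.Random(0)
whisper_like = [rng.uniform(-1.0, 1.0) for _ in range(RATE)]

zcr_voiced = zero_crossing_rate(voiced)
zcr_whisper = zero_crossing_rate(whisper_like)
```

A measure like this only flags where a recording turns whispered. The features discussed above, such as timing, blending, and consonant placement, still have to be assessed by ear within those passages.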

2.3 Sociolinguistics

The Valley Girl: The Valley Girl phenomenon was an 80s era female social dialect originating from the San Fernando Valley wherein, among other executions and slang terms, sentences and phrases frequently ended with a high rising inflection of tones that trailed up in pitch. Trail-ups imply a question when there is none. Thousands of Southern California valley girls were speaking with these pitch fluctuations in the 80s. “Like, gag me with a spoon”, and “Tohtalaay”. Actress Tracy Christine Nelson claims to have developed “Valleyspeak” for a character she played in the 1982-1983 TV sitcom Square Pegs. Valleyspeak was further catapulted by a Frank Zappa hit entitled “Valley Girl” that contained dialog recorded by his then 14-year-old daughter Moon Zappa. [4][5] Like the surfer talk of Southern California, Valleyspeak exploded into an international fad, and will be remembered as a major contributor to the culture of this era.

When an individual adopts a style, repetition of the vocal production intonations, pitch and rhythm patterns, artifacts and inflections making up the dialect may become subconscious execution. One is creating the sound but not necessarily thinking of the sound to sound mechanics once the dialect is successfully adopted. We see this in singing trends as well. Young singers can adopt styles and combinations of sounds they like through practice, even though they may not be keenly conscious of the mechanical process unless or until they are trained in vocal production. [3]

The vocal fry or the creaky voice: The vocal fry is a current social trend among women. The vocal folds vibrate slightly to produce a low rumble distortion in the voice. Two examples are the international reality star Kourtney Kardashian, who utilizes the fry often, and Christine Blasey Ford, who testified against Supreme Court Justice nominee Brett Kavanaugh at the heavily publicized hearings. Ford’s fry and creaky voice conjured up emotional responses from the public, with the mainstream media proclaiming she was “overwhelmingly convincing”. Feminist movements and celebrities joined in, releasing statements that we must “believe the survivors”. While she read her statement, Ford further executed cries and whines, numerous pitch fluctuations, and guttural blasts which incorporate another technique called the abdominal punch. These productions are not laryngeal pathologies, they are forced: “Ordinarily vocal fry constitutes one of several physiologically available types of voice production on the frequency-pitch continuum and hence, of itself, is not logically classified among the laryngeal pathologies. While the excessive use of fry could result in a diagnosis of voice disorder, this quality is too often heard in normal voices (especially in descending inflections where the voice fundamentally falls below frequencies in the modal register) to be exclusively a disorder.” [6]

2.4 Critical Listening

After the analyst has conducted an enhancement session to bring up and out the recorded dialog, the tedious critical listening process can begin. When comparing voice features between a known and an unknown speaker, apply critical listening and then run an EQ analysis to identify, among other things, the bandwidth and frequency strength that indicate the range where the voice has the most presence. The tonality of the voice will occupy a frequency range during general speaking, and some consistency can be determined, especially if numerous samples are available. Longer samples will reveal the speaker’s timing, habits, impediments if any, volume and features. The amplitude will change, as will be shown on the spectrum, but there should be a concentrated range where the amplitude peaks. Most controlled speakers utilize a consistent volume. Pitch will fluctuate. However, over the time continuum, it is possible and probable to find consistencies that can be measured if the known and unknown speakers are the same, and alternatively inconsistencies if the speakers are not the same. We can use our ears to reveal and then explain what we hear phonetically, mechanically, and acoustically, and then back it up with the visual tools. Here is an example of two speakers saying the same phrase from the EQ tool in Logic Pro:

Fig. 2

This example reflects a male voice with a consistent distortion and a more monotone delivery which was heard to be consistent over time. The delivery was slow and distinct with rounded vowel sounds and blending words was infrequent. This voice overall was more controlled in execution and the pitch fluctuations were not abrupt.

Fig. 3

This example reflects a cleaner and higher pitched male voice with presence and energy concentrated in a narrower band peaking between 500 Hz and 1 kHz. This speaker’s timing was faster: he frequently cut off the ends of his words, often blended and slurred his words, and was not monotone. Even during a whisper, there will be numerous other articulated features that can be discovered through critical listening and measured. The steep drop at about 2 kHz on both samples is what we frequently see in cell phone recordings and lossy compression.
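The band-level reading an EQ analyzer gives, that is, the range where a voice has the most presence, can be approximated by averaging the power spectrum over many frames and summing it per band. The sketch below is a dependency-free illustration that uses a naive DFT (a real workflow would use an FFT library). The band edges, 8 kHz sample rate, frame size, and the three-tone test signal are assumptions for illustration, not measurements from the figures.

```python
import cmath
import math

def power_spectrum(frame):
    """Hann-windowed power spectrum of one frame via a naive DFT (bins 0..N/2)."""
    n = len(frame)
    windowed = [
        x * (0.5 - 0.5 * math.cos(2 * math.pi * i / n)) for i, x in enumerate(frame)
    ]
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(
            x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(windowed)
        )
        spectrum.append(abs(acc) ** 2)
    return spectrum

def average_spectrum(samples, frame_len):
    """Mean power spectrum over consecutive, non-overlapping frames."""
    spectra = [
        power_spectrum(samples[s:s + frame_len])
        for s in range(0, len(samples) - frame_len + 1, frame_len)
    ]
    return [sum(column) / len(spectra) for column in zip(*spectra)]

def band_energies(spectrum, rate, frame_len, bands):
    """Sum spectral power in each (low_hz, high_hz) band; upper edge is exclusive."""
    hz_per_bin = rate / frame_len
    return {
        (low, high): sum(p for k, p in enumerate(spectrum) if low <= k * hz_per_bin < high)
        for low, high in bands
    }

RATE, FRAME_LEN = 8000, 256  # assumed sample rate and analysis frame size

# Synthetic stand-in: a strong 300 Hz component with weaker 1.2 kHz and 3 kHz components.
signal = [
    1.0 * math.sin(2 * math.pi * 300 * n / RATE)
    + 0.2 * math.sin(2 * math.pi * 1200 * n / RATE)
    + 0.05 * math.sin(2 * math.pi * 3000 * n / RATE)
    for n in range(FRAME_LEN * 8)
]

bands = [(0, 500), (500, 1000), (1000, 2000), (2000, 4000)]
energies = band_energies(average_spectrum(signal, FRAME_LEN), RATE, FRAME_LEN, bands)
peak_band = max(energies, key=energies.get)
```

Running the same band summary on the known and unknown samples gives numbers that can sit next to the analyzer screenshots in a report. The upper bands can also show whether a recording has been band-limited by the phone or codec.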

Fig. 4

RX reveals the two samples side by side where speakers offered the same phrase. This is a good visual comparison that can be explained to a jury and to clients in terms that can be more easily understood by the average person. Basically, the orange areas show the length over time of vibrations of voice #1 from Fig. 2 who had the slower speech and distortion. Alternatively, on the right, voice #2 from Fig. 3 shows the presence of energy and pitch fluctuations wholly distinguishable from the other voice even when this recording was moderately degraded due to the environment. These samples are not consistent with each other. Critical listening analysis and the spectrum results should confirm and cross-reference each other on the analyzed samples.
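A side-by-side visual comparison can be accompanied by a simple summary number. One hedged example is the spread of each speaker's pitch in semitones, which separates a monotone delivery from a fluctuating one. The pitch values below are made-up illustrative numbers, not measurements from the samples in the figures, and the function name and 100 Hz reference are assumptions.

```python
import math
import statistics

def semitone_spread(f0_track, reference_hz=100.0):
    """Population standard deviation of a pitch track, in semitones.

    Frames with no pitch estimate (None) are ignored.
    """
    semitones = [12.0 * math.log2(f / reference_hz) for f in f0_track if f]
    return statistics.pstdev(semitones)

# Illustrative, made-up pitch tracks in Hz (None marks an unvoiced frame).
voice_1 = [110.0, 112.0, 109.0, None, 111.0, 110.0, 108.0]   # near-monotone delivery
voice_2 = [150.0, 190.0, 140.0, 210.0, None, 165.0, 230.0]   # fluctuating delivery

spread_1 = semitone_spread(voice_1)
spread_2 = semitone_spread(voice_2)
```

Semitones are used rather than raw Hz so that a low voice and a high voice are compared on the same relative scale. A single statistic like this supports, but does not replace, the sound-by-sound description produced by critical listening.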

III. VOICE FEATURE COMBINATIONS

If enough sample content exists for voice analysis using the protocol herein described, then during critical listening we can identify voice feature combinations and describe how one flows into the next. Describe the overall tone and projection, the execution of the vowel sounds and the consonants, any signature habits as described herein, the timing, the frequencies, distortions, volume, and speech impediments if any exist. In the examples provided herein, the known speaker had perceivable differences from the unknown speaker in the execution of consonants, the tone and length of vowel sounds, and the smoothness of the speech over time; in general, the known voice was more controlled and trained than the voice in the unknown recording. In the unknown recording, the speaker was whispering off and on. When tones were produced, the consistent frequency ranges came into sharper contrast, revealing that the overall rhythm of the unknown voice was faster with blended words and that there were vowel sound distortions, short phrases, and air dumps. In effect, the subconscious habits of both speakers provided content to analyze and discuss.

IV. REPORT WRITING

For forensic report writing in a criminal case, it is important to note that the consistencies and inconsistencies found in the critical listening process are distinguishable from artifacts of the cell phone recordings, the broadband ambient noise of the recording environments, and any compression, and that these did not mask the speech. Note also that the recordings contained frequency content adequate and strong enough to assess and make determinations. Describe the vocal production over the time continuum of the sample, sound by sound. For each sample analyzed and compared, detail the measurable differences or similarities, whichever they may be, for all the features found in the vocal productions. Support those findings with the visual spectral analysis. Begin your report with a bullet point list of what the report will show, followed by a terminology section pertinent to the analysis with references to the universal definitions, then repeat and revise the list at the end, adding your personal conclusion statement. Finally, include an amendment clause in case further audio becomes available for analysis.

V. CONCLUSION

This paper is intended to contribute to a functioning process for working with attorneys and juries by describing the sounds being heard in the samples. The industry standard and universal vocal terminology described herein can be found in numerous online sources, including this author’s own publication referenced herein, which offers a more in-depth look at vocal features and how they work together to become vocal productions. Findings can then be further analyzed and supported through EQ and visual analysis tools that show pitch fluctuations, vibrations, distortion, and timing as well as frequency ranges and amplitude. Analyzing and comparing voices through critical listening analysis and voice feature identification is critical to the current environment of audio forensics in the legal arena.

VI. REFERENCES

[1] The Speech Chain (P. Denes and E. Pinson) The Physics and Biology of Spoken Language 2nd. Edition 1993.

[2] 54 Am Jur Trials (Jordan S. Gruber and Fausto Tito Poza) Voicegram Identification Evidence 1995 §24.

[3] A Singer’s Journey, A Guide to Finding Your Best Vocal (Lauren Comele Morris) 2014 Amazon.

[4] https://en.wikipedia.org/wiki/Valleyspeak.

[5] "Weemawee Yearbook Memories: Tracy Nelson and Claudette Wells", a featurette on the DVD release Square Pegs: The Like, Totally Complete Series ... Totally (Sony Pictures Home Entertainment, 2008).

[6] Modern Techniques of Vocal Rehabilitation Cooper, Morton (1973).