Kokoro-82M/demo/HEARME.txt

Kokoro is a frontier TTS model for its size of 82 million parameters.

On the 25th of December, 2024, Kokoro v0.19 weights were permissively released in full fp32 precision along with 2 voicepacks (Bella and Sarah), all under an Apache 2.0 license.
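As a rough sanity check on what "full fp32 precision" means for download size, the arithmetic can be sketched as below. This assumes the parameter count is exactly 82 million and ignores any file-format overhead, so the real checkpoint size may differ slightly. The function name is illustrative only.

```python
# Rough size of the raw weights for a model of Kokoro's stated scale.
PARAMS = 82_000_000      # "82 million parameters", treated as exact (assumption)
BYTES_PER_FP32 = 4       # one 32-bit float occupies 4 bytes


def weights_size_mb(num_params: int, bytes_per_param: int = BYTES_PER_FP32) -> float:
    """Return the raw weight size in megabytes (1 MB = 10**6 bytes)."""
    return num_params * bytes_per_param / 1_000_000


size_fp32 = weights_size_mb(PARAMS)      # full precision, as released
size_fp16 = weights_size_mb(PARAMS, 2)   # hypothetical half-precision copy
print(f"fp32: {size_fp32:.0f} MB, fp16: {size_fp16:.0f} MB")
```

Under these assumptions the fp32 weights come to about 328 MB, and a half-precision copy would be about half that.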

At the time of release, Kokoro v0.19 was the number 1 ranked model in TTS Spaces Arena. With 82 million parameters trained for under 20 epochs on under 100 total hours of audio, Kokoro achieved a higher Elo rating in this single-voice Arena setting than larger models. Kokoro's ability to top this Elo ladder using relatively little compute and data suggests that the scaling law for traditional TTS models might have a steeper slope than previously expected.

Licenses. Apache 2.0 weights in this repository. MIT inference code. GPLv3 dependency in espeak-ng.

The inference code was originally MIT licensed by the paper author. Note that this card applies only to this model, Kokoro.

Evaluation. Metric: Elo rating. Leaderboard: TTS Spaces Arena.
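For readers unfamiliar with the metric, here is a minimal sketch of the textbook Elo update for one head-to-head vote. This card does not describe how the Arena actually computes its ratings, so the K-factor of 32, the 400-point scale, and the function names are all assumptions made for illustration.

```python
from typing import Tuple


def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A is preferred over B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update(rating_a: float, rating_b: float, score_a: float,
           k: float = 32.0) -> Tuple[float, float]:
    """Return new ratings after one comparison.

    score_a is 1.0 if A won the vote, 0.0 if B won, 0.5 for a tie.
    k = 32 is an assumed K-factor, not a value taken from the Arena.
    """
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - exp_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b


# Two equally rated models; model A wins the vote.
a, b = update(1500.0, 1500.0, 1.0)
print(a, b)
```

With equal starting ratings the winner gains exactly half of K and the loser gives up the same amount, so the total rating in the pool is conserved.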

The voice ranked in the Arena is a 50/50 mix of Bella and Sarah. For your convenience, this mix is included in this repository as af.pt, but you can trivially reproduce it.
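The mix described above amounts to an elementwise average of two voicepacks. The sketch below assumes each voicepack can be treated as a sequence of floats of identical length, and uses short toy lists in place of real data. The names mix_voices, bella, and sarah are hypothetical; with real tensor files the same arithmetic would presumably be done with a tensor library instead.

```python
from typing import List, Sequence


def mix_voices(voice_a: Sequence[float], voice_b: Sequence[float],
               weight_a: float = 0.5) -> List[float]:
    """Blend two voice vectors elementwise; weight_a = 0.5 gives a 50/50 mix."""
    if len(voice_a) != len(voice_b):
        raise ValueError("voicepacks must have the same length")
    if not 0.0 <= weight_a <= 1.0:
        raise ValueError("weight_a must be between 0 and 1")
    weight_b = 1.0 - weight_a
    return [weight_a * x + weight_b * y for x, y in zip(voice_a, voice_b)]


# Toy stand-ins for the two released voicepacks (not real data).
bella = [0.0, 1.0, 2.0]
sarah = [2.0, 3.0, 6.0]

mixed = mix_voices(bella, sarah)
print(mixed)
```

Changing weight_a away from 0.5 would give an uneven blend, which is one way to explore voices between the two released ones.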

Training Details.

Compute: Kokoro was trained on "A100 80GB VRAM instances" rented from Vast.ai. Vast was chosen over other compute providers due to its competitive on-demand hourly rates. The average hourly cost for the A100 80GB VRAM instances used for training was below $1 per hour per GPU, which was around half the quoted rates from other providers at the time.

Data: Kokoro was trained exclusively on permissive non-copyrighted audio data and IPA phoneme labels. Examples of permissive non-copyrighted audio include:

Public domain audio. Audio licensed under Apache, MIT, etc.

Synthetic audio[1] generated by closed[2] TTS models from large providers.

Epochs: Fewer than 20 epochs. Total Dataset Size: Less than 100 hours of audio.

Limitations. Kokoro v0.19 is limited in some ways by its training set and architecture:

Lacks voice cloning capability, likely due to its small training set of under 100 hours.

Relies on external g2p, which introduces a class of g2p failure modes.

Training dataset is mostly long-form reading and narration, not conversation.

At 82 million parameters, Kokoro almost certainly falls to a well-trained 1B+ parameter diffusion transformer, or a many-billion-parameter MLLM like GPT-4o or Gemini 2.0 Flash.

Multilingual capability is architecturally feasible, but training data is almost entirely English.
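To make the g2p limitation listed above more concrete, here is a toy illustration of two common failure modes in grapheme-to-phoneme conversion: out-of-vocabulary words and heteronyms. The lexicon and functions are invented for this sketch and do not reflect how espeak-ng or Kokoro's actual front end behaves.

```python
from typing import Dict, List, Optional

# Tiny hand-written lexicon (hypothetical). "read" has two valid pronunciations.
TOY_LEXICON: Dict[str, List[str]] = {
    "the": ["ðə"],
    "read": ["riːd", "rɛd"],   # present tense vs. past tense
    "book": ["bʊk"],
}


def toy_g2p(word: str) -> Optional[str]:
    """Return the first listed pronunciation, or None if the word is unknown."""
    pronunciations = TOY_LEXICON.get(word.lower())
    return pronunciations[0] if pronunciations else None


def phonemize(text: str) -> List[Optional[str]]:
    """Phonemize whitespace-separated words one at a time, without context."""
    return [toy_g2p(word) for word in text.split()]


result = phonemize("I read the book")
print(result)
```

In this toy run the word "I" is missing from the lexicon and comes back as None, while "read" silently receives its first listed pronunciation whether or not the sentence is in the past tense. Whatever the g2p step gets wrong, the acoustic model downstream has no way to correct.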

Will the other voicepacks be released?

There is currently no release date scheduled for the other voicepacks, but in the meantime you can try them in the hosted demo.

Acknowledgements. yl4579 for architecting StyleTTS 2.

Pendrokar for adding Kokoro as a contender in the TTS Spaces Arena.

Model Card Contact. @rzvzn on Discord.
first commit 2025-01-17 13:42:37 +08:00