Creating Auto Chapters and Timestamps with STT API

by Joseph

When you transcribe a long audio file, you can read the full content as text, but it is still difficult to quickly find the section you want. For long audio such as meeting recordings, lectures, and interviews, a transcript becomes much easier to read when it also includes a table of contents showing when each topic begins.

In this tutorial, we will build a Python example that transcribes an audio file with the RTZR STT API and then generates automatic chapters and timestamps from the transcription result.

This example is based on a sample transcription JSON generated from the National Folk Museum of Korea's Creative Commons licensed video, [Exhibition Library] "The Happiness of That Winter" Gilsang Special Exhibition.

Example Video Source

Title: [Exhibition Library] "The Happiness of That Winter" Gilsang Special Exhibition
Channel: National Folk Museum of Korea
Length: 7 minutes 28 seconds
License: YouTube Creative Commons Attribution license (reuse allowed)

1. Goal and Result

What We Will Build

The goal of this example is to take a long audio file as input and automatically carry out the following steps.

Transcribe the audio file with the RTZR STT API.
Split the transcription result into paragraph-level units.
Convert each paragraph into an embedding vector.
Find points where the context changes significantly and use them as chapter boundaries.
Display the start time of each chapter as a timestamp.
Show one actual utterance that represents each chapter.

The overall flow is as follows.

Audio file
  ↓
RTZR STT API transcription
  ↓
Save transcription JSON
  ↓
Generate paragraph embeddings
  ↓
Detect context change points
  ↓
Generate chapter timestamps and representative utterances

Result Preview

The final result is saved as Markdown and JSON files. In this example, a 7-minute 28-second audio file is divided into 5 chapters. The code generates the chapter start times and representative utterances (labeled 대표 발화, "representative utterance", in the output); the section titles below were added in the article to help readers follow the structure. Each chapter below is followed by the utterances assigned to it.

00:00:12 | Introducing the Exhibition

대표 발화: ...겨울을 맞이하여 길상 특별전, 그 겨울의 행복을 개최합니다.

00:00:12 | 국립민속박물관은 다가오는 겨울을 맞이하여 길상 특별전, 그 겨울의 행복을 개최합니다.
00:00:18 | 이번 전시는 코로나 19팬데믹이 장기화되고 있는 시점에서 관람객 여러분들에게 길상이라는 소재를 통해 긍정과 희망의 메세지를 드리기 위해 기획되었습니다.
00:00:30 | 길상은 좋은 일이 일어날 조짐을 뜻하는 한자어인데요.
00:00:34 | 쉽게 말하면 행복을 바라는 마음을 상징적으로 표현하는 것을 뜻합니다.
00:00:39 | 이번 전시를 통해 행복과 행운이 오기를 바라는 염원과 관련된 우리 문화 요소를 살펴보고, 자신의 행복을 돌아보는 기회를 가지실 수 있기를 바랍니다.

00:00:52 | Structure of the Exhibition

대표 발화: ...행복한 순간을 담은 그림과 사진자료들을 영상으로 담아냈고, 복과 운을 바라는 여러 요소들을 전시하였습니다.

00:00:52 | 전시회 구성은 1부에서 길상과 행복의 의미를 환기시킨 후에 2부와 3부에서 본격적으로 길상의 모습을 살펴볼 수 있도록 했습니다.
00:01:01 | 공간적으로 2부와 3부는 일종의 중전과도 같은 행복의 정원을 두고 연결됩니다.
00:01:07 | 2부에서는 전통적 오복 개념에 기초하여 네 부분으로 나누어 길상을 살펴보고, 3부에서는 근현대 시기의 길상의 변화와 지속을 살펴봅니다.
00:01:17 | 또한 2부의 끝부분에서는 직물과 자수, 나전칠기, 도자와 나무 등 재료에 따른 길상 무늬를 감상할 수 있는 공간도 있고.
00:01:26 | 곳곳에 탑 쌓기와 새점치기, 소원 쓰기 등 다양한 체험 콘텐츠도 마련되어 있습니다.
00:01:42 | 전시의 1부는 길상과 행복의 의미를 일깨워보는 취지로 구성하였습니다.
00:01:47 | 복을 바라는 것은 예나 지금이나 매우 자연스러우면서도 중요한 일이었습니다.
00:01:53 | 마치 쌍한경으로 행복한 순간을 들여다보듯이 행복한 순간을 담은 그림과 사진자료들을 영상으로 담아냈고, 복과 운을 바라는 여러 요소들을 전시하였습니다.
00:02:16 | 이분는 다양한 길상 무늬를 담은 별전으로 시작됩니다.
00:02:20 | 길상은 상징적, 함축적으로 의미를 드러내는 방식이기 때문에 다양한 무늬로서 표현되었는데, 이를 잘 보여주는 것이 바로 동전 모양의 장식품인 별전입니다.
00:02:31 | 시각장애인을 위해 별전의 형태를 손으로 만져볼 수 있도록 확대하여 촉각물로 제작하였습니다.

00:02:39 | Traditional Symbols of Good Fortune

대표 발화: 또한 대표적인 장수의 상징입니다. 출세, 즉 과거에 합격하여 입신 양명하고 부귀해지는 것 또한 옛 사람들이 꼽은 중요한 요소였습니다.

00:02:39 | 장수는 옛 사람들의 가장 큰 관심사였습니다.
00:02:42 | 아이가 태어나면 장수를 기원하며 베네쩌고리를 만들어 입혔고, 사주를 따져 이름을 지었습니다.
00:02:49 | 어르신들에게는 회갑 때 장수를 기원하는 시를 써서 추원하는 모습을 볼 수 있습니다.
00:02:55 | 고양이 역시 70세 노인을 뜻하는 한자와 발음이 같아 장수를 상징하며 올해 사는 10가지인 10장 생.
00:03:02 | 또한 대표적인 장수의 상징입니다.
00:03:07 | 출세, 즉 과거에 합격하여 입신 양명하고 부귀해지는 것 또한 옛 사람들이 꼽은 중요한 요소였습니다.
00:03:15 | 부귀를 상징하는 대표적인 것은 바로 모란입니다.
00:03:18 | 꽃 중에 왕이라 불리는 모란은 부귀영화를 상징하여 생활용품에 장식되었습니다.
00:03:24 | 개와 같은 갑각류는 등갑을 뒤집은 말인 갑등이 1등이라는 발음과 같아서 출세의 상징으로 여겨졌습니다.
00:03:33 | 잉어 역시 등용문 고사에서 유래하여 합격을 의미하는 의미로 문방구 등의 문늬로 많이 활용되었습니다.
00:03:42 | 화목하고 평안함 역시 많은 이들이 꿈꾸는 행복의 조건입니다.
00:03:46 | 가와 만 사성이라 하여 집안이 화목해야 모든 일이 잘된다는 문구는 널리 알려져 있습니다.
00:03:53 | 평안을 상징하는 길상무늬는 쌍쌍이 어우러진 새들 꽃과 나비 등이 대표적입니다.
00:04:01 | 저출산 시대이자 난임과 불임이 많아진 요즘에도 다시 중요해진 행복의 요소로는 자녀를 낳는 곳도 있습니다.
00:04:09 | 옛 사람들이 다산을 상징하는 의미로 사용한 길상무늬들의 특징은 빽빽하게 많은 것을 다산의 의미로 여겼다는 것입니다.
00:04:17 | 예를 들어 석류, 수박, 포도, 오이 가지 등 열매나 씨앗이 많은 것들, 또 물고기처럼 알을 많이 낳는 것 등 모두 다산을 의미했습니다.
00:04:30 | 다양한 재료에 나타나는 길상 무늬들을 감상하실 수 있는 코너도 마련되어 있습니다.
00:04:37 | 한 땀 한 땀 시를 꿰매고 자수로 새긴 꽃과 동물, 조은 문구 등 여러 길상 무늬는 돌이나 혼렛 때 입는 으레복 등에 표현되었습니다.
00:04:47 | 나전칠기와 도자 나무에도 역시 다양한 동식물과 문자, 무늬, 기하학적인 무늬 등 여러 가지의 좋은 뜻을 담아 장식하였습니다.

00:05:11 | Modern Views of Happiness

대표 발화: ...가치에 대한 측면이었지만, 행복에는 즐거움, 만족감 같은 정서적인 측면도 있습니다.

00:05:11 | 앞서 1부와 2부가 전통적인 행복에 대해 다루었다면, 3부는 현대의 행복에 대해 생각해 보는 취지로 마련된 공간입니다.
00:05:20 | 앞서 살펴본 오복은 성취하는 어떤 가치에 대한 측면이었지만, 행복에는 즐거움, 만족감 같은 정서적인 측면도 있습니다.
00:05:30 | 물론 현대에도 같이 해서 행복을 찾는 경향도 여전히 있지만, 마음에서 찾는 작은 행복도 있다는 점을 보여드리고 싶었습니다.
00:05:39 | 또한 3부에서는 현대에 오면서 더욱 두드러진 개념인 행운에 대해서도 다루고 있습니다.
00:05:45 | 우리나라뿐만 아니라 각기 다른 다양한 외국의 행운의 상징들을 모아보았습니다.
00:05:52 | 작은 것에서 행복을 찾고 있는 다양한 행복의 변화상을 느끼셨으면 좋겠습니다.

00:06:01 | Space Design and Closing Remarks

대표 발화: 전시 관람이라는 행위 자체가 행복한 경험이 되기를 바라는 취지에서 공간을 조성했고, 휴식과 관람이 조화를 이룰 수 있도록 구성하였습니다.

00:06:01 | 이번 전시 공간은 행복의 정원이라는 콘셉트로 구성되었습니다.
00:06:05 | 여러 개의 방들과 마당처럼 조성된 공간에서 전시를 보다가 마당으로 나와서 잠시 휴식을 취할 수도 있고, 여러 가지 체험 요소들도 즐길 수 있습니다.
00:06:17 | 전시 관람이라는 행위 자체가 행복한 경험이 되기를 바라는 취지에서 공간을 조성했고, 휴식과 관람이 조화를 이룰 수 있도록 구성하였습니다.
00:06:27 | 큰 벽에 비춰지는 겨울 풍경을 담은 메인 영상을 보면서 휴식을 취해 보시면 좋겠습니다.
00:06:34 | 또한 이번 전시에는 장애인을 위한 콘텐츠들도 마련했습니다.
00:06:38 | 저시력자와 시각장애인을 위한 점자 리플렛 큰 글씨로 주요 유물을 설명하는 대활자 책자를 각 부마다 비치했습니다.
00:06:48 | 별전은 촉각물로 제작하여 그 무늬를 직접 만져보실 수도 있고, 청각장애인을 위해 전시 영상에 자막과 함께 수어 해설 영상도 덧붙였습니다.
00:06:59 | 행복은 어떤 가치를 이루는 것에서 올 수도 있지만, 작고 구체적인 경험에서도 온다고 합니다.
00:07:05 | 이 전시가 다가오는 겨울에 행복한 경험이 되기를 바랍니다.
00:07:09 | 또한 전시를 관람하시면서 자신이 어떤 행복을 바라는지에 대해 생각해보고, 그 답을 찾을 수 있다면 더욱더 좋겠습니다.
00:07:17 | 감사합니다.

Each chapter time uses the start_at value included in the transcription result. Because the STT result already includes timing information, we can use it directly without adding a separate audio alignment step.

2. Project Setup

Project Structure

The project is divided into two major stages: transcription and chapter generation.

stt-chapter-generator/
├── transcribe.py
├── rtzr_openapi_client.py
├── chapterize.py
├── pyproject.toml
├── uv.lock
├── .env.example
├── data/
│   ├── audio/
│   ├── transcripts/
│   └── outputs/

Each file has the following role.

transcribe.py: Reads execution options and saves the RTZR STT API transcription result.
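rtzr_openapi_client.py: A client module that implements the RTZROpenAPIClient flow from the official RTZR documentation as a separate class.
chapterize.py: Reads the transcription JSON and generates chapter timestamps and representative utterances.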
pyproject.toml: Defines project metadata and Python dependencies required for execution.
uv.lock: A lock file used to run the project reproducibly with the same dependency versions.
.env.example: An example of the environment variables used to store API credentials.
data/audio: Stores input audio files.
data/transcripts: Stores STT API transcription result JSON files.
data/outputs: Stores chapter generation results.

Libraries Used

This example uses the following libraries.

requests: Used inside RTZROpenAPIClient to send HTTP requests to the RTZR STT API.
sentence-transformers: Used to convert paragraphs into embedding vectors.
kiwipiepy: Used to extract nouns from Korean utterances.
numpy: Used for vector operations and similarity calculations.

For the text embedding model, this example uses Hugging Face's google/embeddinggemma-300m. The model is loaded with SentenceTransformer from the sentence-transformers library and converts each paragraph into a vector.

When downloading a text embedding model from Hugging Face, license approval may be required depending on the model. If a model is marked as a gated repository, you need to log in to Hugging Face and accept the license terms on the model page before using it.

Environment Setup

This example is written for Python 3.12, and the version is specified in .python-version and pyproject.toml.

Running uv sync creates a .venv based on uv.lock and installs the required libraries.

uv sync

To call the RTZR API, save RTZR_CLIENT_ID and RTZR_CLIENT_SECRET in a .env file.

RTZR_CLIENT_ID=...
RTZR_CLIENT_SECRET=...

When running the scripts later, load the .env file together with the command, for example by passing --env-file .env to uv run.
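Once the .env file is loaded, the credentials are available to the script as ordinary environment variables. The following is a minimal sketch of reading them and failing early when one is missing. The function name load_rtzr_credentials is hypothetical and is not taken from the example repository.

```python
import os


def load_rtzr_credentials() -> tuple[str, str]:
    """Read the RTZR credentials from the environment (hypothetical helper)."""
    names = ("RTZR_CLIENT_ID", "RTZR_CLIENT_SECRET")
    # Collect every missing or empty variable so the error names all of them.
    missing = [name for name in names if not os.environ.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return os.environ["RTZR_CLIENT_ID"], os.environ["RTZR_CLIENT_SECRET"]
```

Checking both variables up front gives a clear error message before any API request is attempted.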

3. Transcribing with the RTZR STT API

API Request Flow

transcribe.py calls the RTZR STT API in the following order.

RTZR_CLIENT_ID / RTZR_CLIENT_SECRET
  ↓
Issue access_token
  ↓
Request audio file transcription
  ↓
Receive transcribe_id
  ↓
Poll until transcription is complete
  ↓
Save transcription JSON

First, the script calls the authentication API and receives an access_token. Then, when it uploads an audio file and requests transcription, the API does not immediately return the full transcript. Instead, it returns a transcribe_id.

Because transcription is processed asynchronously on the server, the client periodically checks the status using this transcribe_id. When transcription is complete, it receives the final transcription result and saves it in the data/transcripts folder.

In this example, the RTZR API call logic is separated into an RTZROpenAPIClient class. transcribe.py reads execution options, builds the configuration, calls the client, creates the transcription job, and saves the result.

RTZROpenAPIClient.token: Issues an access_token using RTZR_CLIENT_ID and RTZR_CLIENT_SECRET.
RTZROpenAPIClient.transcribe_file(): Uploads the audio file and creates a transcription job.
RTZROpenAPIClient.get_transcription(): Retrieves transcription status using the transcribe_id.
RTZROpenAPIClient.wait_for_result(): Polls until transcription is complete and returns the final result.
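The polling step can be sketched independently of the HTTP details. The block below is an illustration, not the RTZROpenAPIClient code itself: it takes any status function as an argument, and the "failed" status name, the poll interval, and the timeout are assumptions. Only the "completed" status appears in the sample result shown in this article.

```python
import time
from typing import Any, Callable


def wait_for_result(
    get_transcription: Callable[[str], dict[str, Any]],
    transcribe_id: str,
    poll_interval: float = 5.0,   # assumed default
    timeout: float = 600.0,       # assumed default
    sleep: Callable[[float], None] = time.sleep,
) -> dict[str, Any]:
    """Poll a status function until the job completes, fails, or times out."""
    deadline = time.monotonic() + timeout
    while True:
        result = get_transcription(transcribe_id)
        status = result.get("status")
        if status == "completed":
            return result
        if status == "failed":  # assumed name of the failure status
            raise RuntimeError(f"Transcription {transcribe_id} failed: {result}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Transcription {transcribe_id} did not finish in {timeout}s")
        sleep(poll_interval)
```

Passing the status function and the sleep function as arguments keeps the loop easy to test without network access.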

Transcribing an Audio File

Place the audio file you want to transcribe in the data/audio folder, then run transcribe.py.

uv run --env-file .env -- python transcribe.py \
  data/audio/gilsang_winter_happiness.wav \
  --model-name sommers \
  --language ko \
  --use-paragraph-splitter \
  --paragraph-max 40 \
  --use-disfluency-filter

By default, the transcription result is saved to the following path. In the next step, we will assume that a sample transcription JSON file in this form is already prepared.

data/transcripts/gilsang_winter_happiness.transcript.json
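The default path follows a simple naming rule: the audio file's stem plus a .transcript.json suffix. A minimal sketch of that rule, assuming a hypothetical helper named default_transcript_path:

```python
from pathlib import Path


def default_transcript_path(audio_path: str, transcripts_dir: str = "data/transcripts") -> Path:
    """Map an audio path to <transcripts_dir>/<audio stem>.transcript.json (hypothetical helper)."""
    return Path(transcripts_dir) / f"{Path(audio_path).stem}.transcript.json"
```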

This example uses paragraph_splitter and disfluency_filter together.

paragraph_splitter is used to split the transcription result into paragraph-level units that are easier to post-process. For automatic chapter generation, moderately split paragraphs are easier to handle than one very long transcript.

disfluency_filter is used to reduce spoken-language expressions with weak semantic value, such as "um", "uh", and repeated expressions. If many spoken-language fillers remain, they can affect paragraph embeddings and keyword extraction results, so reducing unnecessary expressions is helpful during post-processing.
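To show how the command-line options might be turned into a request configuration, here is a rough sketch. The option names come from the command above, but the default values, the configuration key names, and the nested paragraph_splitter.max field are assumptions made for illustration. Check the official RTZR documentation for the exact request schema.

```python
import argparse
from typing import Any


def parse_args(argv: list[str]) -> argparse.Namespace:
    """Parse the options used in the transcribe.py command (defaults are assumed)."""
    parser = argparse.ArgumentParser(description="Transcribe an audio file.")
    parser.add_argument("audio_path")
    parser.add_argument("--model-name", default="sommers")
    parser.add_argument("--language", default="ko")
    parser.add_argument("--use-paragraph-splitter", action="store_true")
    parser.add_argument("--paragraph-max", type=int, default=None)
    parser.add_argument("--use-disfluency-filter", action="store_true")
    return parser.parse_args(argv)


def build_config(args: argparse.Namespace) -> dict[str, Any]:
    """Build a request config dict; key names are assumptions, not a confirmed schema."""
    config: dict[str, Any] = {
        "model_name": args.model_name,
        "language": args.language,
        "use_disfluency_filter": args.use_disfluency_filter,
        "use_paragraph_splitter": args.use_paragraph_splitter,
    }
    # Only send the splitter limit when the splitter is enabled and a limit is given.
    if args.use_paragraph_splitter and args.paragraph_max is not None:
        config["paragraph_splitter"] = {"max": args.paragraph_max}
    return config
```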

Values Used from the Transcription Result

The STT API transcription result contains several pieces of information. In this example, we use only the values needed for chapter generation.

msg: The transcribed utterance text.
start_at: The time when the utterance starts, in milliseconds from the beginning of the audio.
utterances: The list of utterance-level results.

The transcription result roughly looks like this.

{
  "id": "transcribe-id",
  "status": "completed",
  "results": {
    "utterances": [
      {
        "start_at": 52411,
        "duration": 8780,
        "msg": "전시회 구성은 1부에서 길상과 행복의 의미를 환기시킨 후에 2부와 3부에서 본격적으로 길상의 모습을 살펴볼 수 있도록 했습니다.",
        "spk": 0
      },
      {
        "start_at": 61711,
        "duration": 5440,
        "msg": "공간적으로 2부와 3부는 일종의 중전과도 같은 행복의 정원을 두고 연결됩니다.",
        "spk": 0
      }
    ]
  }
}

For automatic chapter generation, msg is used for context analysis, and start_at is used as the chapter timestamp.
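Extracting these two values from the JSON structure above can be sketched as follows. The function shares its name with load_segments() in chapterize.py, but this is an illustrative version rather than the repository code, and the Segment dataclass is a hypothetical container.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class Segment:
    start_at: int  # utterance start time as stored in the JSON
    text: str      # transcribed text taken from "msg"


def load_segments(transcript: dict[str, Any]) -> list[Segment]:
    """Keep only start_at and msg from each utterance, skipping empty text."""
    utterances = transcript.get("results", {}).get("utterances", [])
    segments = []
    for utterance in utterances:
        text = utterance.get("msg", "").strip()
        if text:
            segments.append(Segment(start_at=int(utterance["start_at"]), text=text))
    # Sort by start time so later steps can rely on chronological order.
    return sorted(segments, key=lambda segment: segment.start_at)
```

The input dictionary can be obtained by reading the saved transcript file with json.load.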

For example, when a paragraph is selected as the start of a new chapter, the start_at value of the first utterance included in that paragraph is converted into a format such as 00:02:39.
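The conversion itself is short. This sketch assumes that start_at is in milliseconds and that fractions of a second are truncated, which is consistent with the sample values in this article (12843 is shown as 00:00:12). The function name format_timestamp is hypothetical.

```python
def format_timestamp(start_at_ms: int) -> str:
    """Convert a millisecond offset to HH:MM:SS, truncating fractions of a second."""
    total_seconds = start_at_ms // 1000
    hours, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"
```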

4. Generating Chapters from the Transcription Result

Generating Chapters

Once the transcription result is ready, run chapterize.py.

uv run python chapterize.py data/transcripts/gilsang_winter_happiness.transcript.json

The result files are saved to the following paths.

data/outputs/gilsang_winter_happiness.chapters.json
data/outputs/gilsang_winter_happiness.chapters.md

You can check the Markdown result with the following command.

cat data/outputs/gilsang_winter_happiness.chapters.md

chapterize.py performs three main steps.

Read the transcription result as paragraph-level segments.
Find chapter boundaries using paragraph embeddings.
Select a representative utterance from each chapter.

In chapterize.py, the chapter generation flow is divided into the following functions.

load_segments(): Reads utterances from the transcription result and extracts the text and start time needed for chapter generation.
encode_segments(): Converts each paragraph into an embedding vector.
calculate_c99_boundary_scores(): Calculates boundary scores between paragraphs.
select_ranked_boundaries(): Selects the final chapter boundaries based on the relative rank of boundary scores.
select_representative_segment(): Selects the utterance that represents each chapter.

Finding Chapter Boundaries

chapterize.py converts each paragraph in the transcription result into a text embedding vector. It then compares the similarity between adjacent groups of paragraphs and treats points where the context differs significantly between the paragraphs before and after as chapter boundary candidates.

Reference
This example uses a simplified idea from the C99 algorithm. C99 is a text segmentation algorithm proposed in Freddy Y. Y. Choi's paper, "Advances in domain independent linear text segmentation" (NAACL 2000).

The process is as follows.

Convert transcribed paragraphs into embedding vectors.
Calculate cosine similarity between paragraph vectors and create a similarity matrix.
Following the C99 approach, compare each value with nearby values and convert the matrix into a local rank matrix. A local rank is a score between 0 and 1 that represents the relative rank of a value compared with nearby values.
For every gap between paragraphs, compare up to 5 paragraphs on the left and 5 paragraphs on the right to measure the degree of context change.
Compare the average rank inside the left group, the average rank inside the right group, and the average rank between the two groups.
If the left and right groups are each internally similar, but the value between the two groups is low, the point is treated as a likely context change.
This difference is used as the boundary score, and high-scoring points are selected as chapter boundary candidates.
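The first steps of this process, the similarity matrix and the local rank matrix, can be sketched as follows. Plain Python lists are used so that the block has no dependencies, whereas the project lists numpy for these operations. The function names and the radius parameter are assumptions for illustration. The rank is computed as the fraction of neighboring values that are lower, which follows the C99 idea but is not necessarily identical to the code in chapterize.py.

```python
import math


def cosine_similarity_matrix(vectors: list[list[float]]) -> list[list[float]]:
    """Pairwise cosine similarity between paragraph vectors."""
    norms = [math.sqrt(sum(x * x for x in v)) or 1.0 for v in vectors]
    return [
        [sum(a * b for a, b in zip(u, v)) / (nu * nv) for v, nv in zip(vectors, norms)]
        for u, nu in zip(vectors, norms)
    ]


def local_rank_matrix(sim: list[list[float]], radius: int = 5) -> list[list[float]]:
    """Replace each value with the fraction of nearby values that are lower (0 to 1)."""
    n = len(sim)
    ranks = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            # Neighbors inside a (2 * radius + 1) square window, excluding the cell itself.
            neighbours = [
                sim[r][c]
                for r in range(max(0, i - radius), min(n, i + radius + 1))
                for c in range(max(0, j - radius), min(n, j + radius + 1))
                if (r, c) != (i, j)
            ]
            if neighbours:
                ranks[i][j] = sum(v < sim[i][j] for v in neighbours) / len(neighbours)
    return ranks
```

Ranking against nearby values makes the scores less sensitive to the absolute scale of the similarities, which can vary between different parts of a recording.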

The boundary score can be understood with the following intuition.

Boundary score = average internal rank of left/right groups - average rank between left/right groups

For example, assume that the values calculated around a certain position are as follows.

Average rank inside the left paragraph group: 0.78
Average rank inside the right paragraph group: 0.74
Average rank between the left and right paragraph groups: 0.32

In this case, the average internal rank of the left and right groups is 0.76, and the average rank between the two groups is 0.32.

Boundary score = 0.76 - 0.32 = 0.44

The paragraphs on the left are similar to one another, and the paragraphs on the right are also similar to one another. However, the value between the left and right groups is low, so this point is likely to be a context change candidate.

In other words, the boundary score is not a simple comparison between one paragraph and the next paragraph. It represents how different the paragraph groups before and after a point are. This allows the script to consider the surrounding flow rather than being overly affected by one or two short utterances.
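The group comparison can be written as a small function over a rank matrix. This is a simplified sketch of the formula above, not the calculate_c99_boundary_scores() implementation: the function names are hypothetical, and including the diagonal in the group averages is an assumption.

```python
def mean_block(ranks: list[list[float]], rows: range, cols: range) -> float:
    """Average of the rank values in the block rows x cols."""
    values = [ranks[r][c] for r in rows for c in cols]
    return sum(values) / len(values) if values else 0.0


def boundary_score(ranks: list[list[float]], gap: int, window: int = 5) -> float:
    """Score the gap just before paragraph index `gap` (1 <= gap <= n - 1)."""
    n = len(ranks)
    left = range(max(0, gap - window), gap)    # up to `window` paragraphs on the left
    right = range(gap, min(n, gap + window))   # up to `window` paragraphs on the right
    inside = (mean_block(ranks, left, left) + mean_block(ranks, right, right)) / 2
    between = mean_block(ranks, left, right)
    return inside - between
```

With the values from the worked example, inside is (0.78 + 0.74) / 2 = 0.76 and between is 0.32, so the function would return 0.44.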

Deciding the Number of Chapters

If there are too many chapter boundary candidates, the result becomes overly fragmented. Conversely, if too few candidates are selected, the structure of a long audio file is not represented well.

In this example, not every boundary candidate is used. Instead, candidates with high boundary scores are prioritized. First, boundary scores are sorted in descending order, and only candidates that are relatively high among all candidates are kept. In the code, this relative-rank threshold is fixed at 0.385. To prevent the result from being divided too finely, the upper limit on the number of selectable boundaries is determined based on the total number of characters in the transcript. Finally, to avoid repeatedly selecting positions that are too close to each other, at least 5 paragraph segments are required between boundaries.
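One way to read these rules is sketched below. The interpretation is an assumption: the 0.385 threshold is treated as "keep the top 38.5% of candidates by score", and the upper limit is passed in as max_boundaries because the article does not give the character-count formula. The actual select_ranked_boundaries() in chapterize.py may differ.

```python
def select_ranked_boundaries(
    scores: dict[int, float],
    max_boundaries: int,
    rank_threshold: float = 0.385,  # assumed meaning: fraction of top candidates kept
    min_gap: int = 5,               # minimum distance between boundaries, in segments
) -> list[int]:
    """Pick boundary indices by score rank, a count limit, and a minimum spacing."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    keep = max(1, int(len(ranked) * rank_threshold)) if ranked else 0
    selected: list[int] = []
    for index, _score in ranked[:keep]:
        if len(selected) >= max_boundaries:
            break
        # Skip candidates that sit too close to an already selected boundary.
        if all(abs(index - other) >= min_gap for other in selected):
            selected.append(index)
    return sorted(selected)
```

Processing candidates from the highest score down means that when two candidates are too close together, the stronger one is kept.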

This method works without requiring the user to manually specify the number of chapters. However, if you are working with a video or lecture where the desired number of sections is already known, you can later extend the script by adding an option to specify the number of chapters directly.

Selecting Representative Utterances

From the transcription text of each chapter, one representative utterance is selected using the following criteria.

Run the utterances in the chapter through a morphological analyzer and extract only nouns.
Keep up to 8 nouns that appear multiple times or are relatively important within the chapter as internal keyword candidates.
For each utterance candidate, check whether those keywords are actually included in the original string.
Give a higher score to utterances that include more keywords, treating them as better representatives of the chapter.
If an utterance is too long, extract a shorter excerpt centered around the part containing important keywords.
In the final output, do not expose the keyword list separately; show only a representative utterance that is easy for users to read.

Keyword inclusion is checked against the original string, not inferred by a generative model. For example, if the main keywords of a chapter are 길상, 행복, and 장수, the script counts whether these words are actually included in each utterance and prioritizes utterances that contain more key terms.

This approach keeps the execution structure simple because it does not use a generative model. It also extracts text from the actual transcript, so it does not create new content that was not present in the original audio.
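The keyword counting and scoring steps can be sketched with the standard library alone. In this sketch the nouns are assumed to be extracted beforehand (the project uses kiwipiepy for that), keyword importance is reduced to plain frequency, and the excerpt shortening step is omitted. The function names are hypothetical.

```python
from collections import Counter


def top_keywords(nouns_per_utterance: list[list[str]], limit: int = 8) -> list[str]:
    """Return up to `limit` of the most frequent nouns in a chapter."""
    counts = Counter(noun for nouns in nouns_per_utterance for noun in nouns)
    return [noun for noun, _count in counts.most_common(limit)]


def select_representative(utterances: list[str], keywords: list[str]) -> str:
    """Return the utterance whose original string contains the most keywords."""

    def score(text: str) -> int:
        # Plain substring check against the original transcript text.
        return sum(1 for keyword in keywords if keyword in text)

    # max() returns the first utterance when several share the top score.
    return max(utterances, key=score)
```

Because the score is a substring check on the original text, the selected utterance is always a string that exists in the transcript.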

5. Result

The Markdown result is intended to be read directly by people.

# Chapters: gilsang_winter_happiness

- **00:00:12**
  - 대표 발화: ...겨울을 맞이하여 길상 특별전, 그 겨울의 행복을 개최합니다.
- **00:00:52**
  - 대표 발화: ...행복한 순간을 담은 그림과 사진자료들을 영상으로 담아냈고, 복과 운을 바라는 여러 요소들을 전시하였습니다.
- **00:02:39**
  - 대표 발화: 또한 대표적인 장수의 상징입니다. 출세, 즉 과거에 합격하여 입신 양명하고 부귀해지는 것 또한 옛 사람들이 꼽은 중요한 요소였습니다.
- **00:05:11**
  - 대표 발화: ...가치에 대한 측면이었지만, 행복에는 즐거움, 만족감 같은 정서적인 측면도 있습니다.
- **00:06:01**
  - 대표 발화: 전시 관람이라는 행위 자체가 행복한 경험이 되기를 바라는 취지에서 공간을 조성했고, 휴식과 관람이 조화를 이룰 수 있도록 구성하였습니다.

The JSON result is an array of chapter objects used for later processing. The example below shows only one chapter.

[
  {
    "number": 1,
    "start_at": 12843,
    "start": "00:00:12",
    "representative_text": "...겨울을 맞이하여 길상 특별전, 그 겨울의 행복을 개최합니다.",
    "segment_count": 5
  }
]
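Since the JSON file holds the same information as the Markdown file, one can be rendered from the other. The sketch below rebuilds the Markdown layout shown above from a list of chapter objects. The function name render_markdown is hypothetical, and the layout is inferred from the sample output rather than taken from chapterize.py.

```python
from typing import Any


def render_markdown(name: str, chapters: list[dict[str, Any]]) -> str:
    """Render chapter objects in the Markdown layout shown in this section."""
    lines = [f"# Chapters: {name}", ""]
    for chapter in chapters:
        lines.append(f"- **{chapter['start']}**")
        # "대표 발화" is the representative-utterance label used in the sample output.
        lines.append(f"  - 대표 발화: {chapter['representative_text']}")
    return "\n".join(lines) + "\n"
```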

6. Closing

In this tutorial, we transcribed an audio file with the RTZR STT API and used the transcription result to generate automatic chapters and timestamps.

For long audio content, it is important to provide a structure that lets users jump directly to the section they want, rather than only providing the full transcript. The utterance start times, paragraph-level transcription results, and Disfluency Filter provided by the RTZR STT API help simplify the post-processing pipeline.

Because timestamps can be created without a separate audio alignment step, and because paragraphs with reduced spoken-language noise can be used as input, post-processing tasks such as automatic chapter generation become easier to build. This is especially helpful when building workflows around Korean STT results, where paragraph quality and utterance-level timing directly affect downstream processing.

This example is designed to work without a separate LLM. If needed, you can later extend it by connecting a summarization model or an LLM to generate more natural chapter titles.

The full example code is available in the GitHub repository below.

References

General STT | RTZR STT OpenAPI

The General STT API is an HTTP-based REST API that converts audio files into text.
