Podcast-DB: Feat Triform parallel actions

Podcast-DB: Feat Triform parallel actions


Find out **who’s talking about a company or person** in any Swedish podcast. Also just cool to have this kind of data and database.

Podcast-db landing page

Last week, I decided to undertake a project to challenge myself: Find all Swedish podcasts, download them, and then transcribe them and build a search interface.

In this post, I go over the week journey end-to-end (mostly free time and over the weekend), starting from low infrastructure or any experience around building a podcast search engine. Some highlights:

  • A cluster of 8 Hopper H200 GPUs transcribing over 300,000 podcasts
  • At peak, 25 crawlers are downloading 250 podcasts in parallel (Downloading 9000 podcasts and hour).
  • 12 TB of Audio data in our S3 bucket.
💡
You can play around with a deployed instance of this search engine as a Live demo

Here's a high-level architecture map of the system that will be covered in this post:

Proving ground

I started off by creating a minimal playground to experiment and explore if the idea was possible. I found a website that had done most of the work of collecting all the data called https://mediafacts.se/poddindex. By exploring their internal API calls, I found this endpoint:

https://api.mediafacts.se/api/podcast/v1/podcasts

From that call, we could get a dataset of 2400 podcasts with both name and an id.

{
    "adb8141c-fb56-4092-6f88-08dbd636bc42": "#Fallet",
    "e1e3a139-1b65-4a43-f843-08dbaa32d14e": "#FPLSVERIGE",
    ...
    "866dcf58-058d-4468-b0da-08dd8c879318": "...och bilen går bra?",
}

With that ID, we can get full details:

curl --location 'https://api.mediafacts.se/api/podcast/v1/podcasts/details?id=857538db-0c16-4f7d-b053-08dc85405cb3'
{
  "rssFeedUrl": "https://rss.podplaystudio.com/1477.xml",
  "podcastName": "Fallen jag aldrig glömmer",
  "supplierName": "Podplay",
  "networkName": "Bauer Media",
  "genre": "Thriller/Crime"
}

Main target: rssFeedUrl Keep the rest (supplier, network, genre) for future analysis.

I’m using Supabase to store all data because I already have a Supabase subscription running. Now, from this mapping, I felt that the idea was possible; we just needed a lot of storage and computing to finish this project.

STATUS

Step 1 – Get podcast profiles Success rate: 99.04%

status_code Distribution:
================================================================================
200                   2377 (99.04%)
404                     19 ( 0.79%)
400                      1 ( 0.04%)
410                      1 ( 0.04%)
500                      1 ( 0.04%)
503                      1 ( 0.04%)


Total unique status_code: 6
Null/Missing status_code: 0

Conclusion: Minor data loss; we don't care to fix it now

RSS_request_status_code Distribution:
================================================================================
200                   2315 (96.46%)
404                     77 ( 3.21%)
400                      2 ( 0.08%)
410                      2 ( 0.08%)
500                      2 ( 0.08%)
503                      1 ( 0.04%)
0                        1 ( 0.04%)


Total unique RSS_request_status_code: 7
Null/Missing RSS_request_status_code: 0

Conclusion: Here the data loss is a bit more significant but still enough.

RSS SOURCE DISTRIBUTION

Most feeds come from a few platforms:

Base URL Distribution:
================================================================================
https://feeds.acast.com                             1136 (47.33%)
https://feed.pod.space                               521 (21.71%)
https://api.sr.se                                    272 (11.33%)
https://rss.podplaystudio.com                        150 ( 6.25%)
https://rss.acast.com                                109 ( 4.54%)
https://podcast.stream.schibsted.media                94 ( 3.92%)
https://access.acast.com                              70 ( 2.92%)
https://feed.khz.se                                   20 ( 0.83%)
https://cdn.radioplay.se                              12 ( 0.50%)
http://www.ilikeradio.se                               8 ( 0.33%)
https://www.ilikeradio.se                              8 ( 0.33%)


Total unique base URLs: 11
Null/Missing RSS URLs: 0

Conclusion: This means that the structure is likely the same on all podcasts from each platform. Looking at the data, we can see that most of them follow the same structure with iTunes tags with the same names and titles and links, etc. That consistency makes the ETL pipeline easy to build and maintain.

RSS STRUCTURE

Most of them follow the same iTunes XML layout:

<item>
  <title>705. Lex Birgitta Ed</title>
  <pubDate>Thu, 23 Oct 2025 22:58:22 +0000</pubDate>
  <guid isPermaLink="false">7258a43677f100f9d45ef43f19395a99</guid>
  <link>https://pod.space/alexosigge/705-lex-birgitta-ed</link>
  <itunes:image href="https://assets.pod.space/system/shows/images/397/fef/88-/large/Alex_och_Sigge.jpg"/>
  <description><![CDATA[]]></description>
  <enclosure url="https://media.pod.space/alexosigge/aos705.mp3" type="audio/mpeg"/>
  <itunes:duration>00:59:21</itunes:duration>
</item>