Update: PronounceMe – implementation details

I have written several posts about the PronounceMe experiment - an automatic video and voice generator for English learners. If you missed the previous posts, please review #pronounceMe for more information about the project, the ideas behind it and some statistics. In this post I'd like to focus on the technical implementation, with some diagrams and notable code snippets.

Service diagram

Essentially, the service consists of a number of components:

  • Terms Database - dataset of the words
  • Voice generator - engine which generates human-like voices
  • Picture lookup service - a large component responsible for finding a relevant background picture
  • Video generator - renderer which composes the clips and voices into a video
  • Youtube Uploader - client implementation for the Youtube API
  • Management panel - very basic web-based admin panel for observing the current status, database and statistics
  • Statistics extractor - a job which regularly fetches statistics data

services diagram

The central focus of the service is a Term - one or a few words which need to be pronounced. The service is built around a pipeline which takes the term and transforms it into an uploaded Youtube video.

pipeline diagram

Bear in mind that this project is at the MVP stage. That means there are several compromises I have made to reduce Time To Market. It's never too late to improve the code if the project turns out to be useful.

Technologies used

I could say it uses most-of-the-buzzwords such as Computer Vision, Mesh API, Cloud Services, Docker, Microservices, and it wouldn't be a lie. But we always want to see more specifics:

Generator service:

  • Kotlin - of course, there are no other candidates. Most of the engine is built on it
  • http4k - webserver written in kotlin, for kotlin
  • kotlinx-html - dsl for html. There is no javascript in the project as such
  • kmongo - as a DAO layer for the db
  • java-pixabay - for Pixabay API access; the original project has been abandoned, so I had to fork it

Video generator and renderer

It's a microservice running in a separate container, communicating with the engine via HTTP

  • python3 - because of moviepy
  • moviepy - the best programmatic video generator; I haven't found anything similar for the JVM.
  • imageio - for image manipulation
  • cherrypy - apparently the easiest way to expose a python function as an HTTP REST service


  • jib - jvm docker image generator
  • mongo - default database for experiments/MVP
  • docker - as a runtime
  • docker-compose - for container orchestration
  • docker-machine - for provisioning
  • make - as a frontend for the deployment commands

Cloud API and services:

  • AWS Rekognition - cloud computer vision API. It allows filtering out pictures with faces (they often create a lot of noise) and provides image labeling for choosing the best picture
  • AWS Polly - TTS engine provided by Amazon. Previously I tried the one from MS Azure but the quality wasn't satisfactory. Polly generates a nearly perfect voice with different accents
  • YouTube API - for video uploads and statistics collection
  • AWS EC2 - for hosting
  • PaperTrail - for logs from the docker containers


Well, there were a lot of issues, mostly related to the external services:

1) The YouTube API has limitations - each call counts towards a daily quota (which is 1M units). From my experience, it's only possible to upload about 50 videos a day. Although I'd like to upload way more than the limit allows, that didn't bother me much since the process is automated.
2) moviepy leaks memory heavily. In my experiments, after 10 rendered videos the python process held about 2Gb of RAM. Since it's an MVP I have chosen the simplest solution - just restart the microservice. More precisely, configure docker-swarm to kill it once it consumes too much memory. I believe it's a very practical decision for the given project stage.
3) To make the video stand out from others it has to have a background picture relevant to the term. If the user looks for "how to pronounce tomato", it's more likely that a video with a tomato in the background would be chosen rather than one with a grey background. To find images I used the Pixabay API (if you like the service, don't forget to donate to them too!). For obvious reasons some irrelevant pictures are often returned, so I had to filter them out using Amazon computer vision.
4) ImageMagick policies hurt. It's a great library, but I found it tricky to configure since it has a configuration file where the defaults are very tight. For example, it's impossible to generate a video into the /tmp folder by default. Thanks to docker it's very easy to build an image with the configuration embedded.
5) Apparently, docker-compose has changed the behaviour of container limits, so I had to downgrade the configuration file from version 3.3 to 2.3.
6) I wanted to keep MongoDB outside of the container, on the host machine, for my personal reasons. If you have ever tried to do so you know it's not easy. The container ecosystem pushes the user to use containers only. I ended up binding /var/lib/mongodb/mongod.sock from the host into the container and using jnr-unixsocket to make the mongo client use a unix socket instead of TCP.
7) The Youtube API documentation seems to be very convoluted. I had a hard time understanding how to go from a simple youtube upload to something like "create a playlist if needed and then specify the description along with tags and the location of the video in different languages".

Enjoyable parts

This project is actually quite interesting to work on. It uses many external APIs, works with computer vision(a lot of fun with debugging!), etc

  • kotlin is so nice, as usual. I can't imagine myself using python, which can explode after every single typo, or java, where I would write a few books' worth of code and it still wouldn't be that clean
  • Writing web pages in kotlin with kotlinx-html is really fun. Just think - statically typed html templates!
  • Amazon Rekognition works like magic: I'd say in 90% of cases it sees what I would say about the picture. Prices are very competitive for my use case
  • Sealed classes work really well for the statistics collection and voices description
  • kmongo allows expressing db queries via a statically typed DSL. Like most ORMs it fails on complex constructions, but the performance of the DB communication is never a concern for this project
  • The java-pixabay library was outdated; I made a few PRs but the author has not got back to me. For that reason I continued to work on my fork - ruXlab/pixabay-java-api
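To illustrate the point about sealed classes, here is a minimal sketch of how a voice description could be modelled. All the names here (Voice, Polly, Azure, describe) are hypothetical and made up for this example, they are not taken from the PronounceMe code base.

```kotlin
// Hypothetical model: every name below is illustrative, not the project's real code.
sealed class Voice {
    abstract val languageCode: String

    data class Polly(val voiceId: String, override val languageCode: String) : Voice()
    data class Azure(val voiceName: String, override val languageCode: String) : Voice()
}

// `when` over a sealed class is exhaustive: adding a new Voice subtype
// makes this function fail to compile until the new branch is handled.
fun describe(voice: Voice): String = when (voice) {
    is Voice.Polly -> "Polly voice ${voice.voiceId} (${voice.languageCode})"
    is Voice.Azure -> "Azure voice ${voice.voiceName} (${voice.languageCode})"
}

fun main() {
    println(describe(Voice.Polly("Amy", "en-GB")))
}
```

The nice part is that the compiler keeps track of all the variants, so no `else` branch is needed.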

Code snippets

I'd like to highlight some code used in PronounceMe service

kotlinx-html templates

import kotlinx.html.*
import kotlin.reflect.KProperty1
import kotlin.reflect.full.memberProperties

// Renders any list of objects as a bootstrap table, one column per property
private inline fun <reified T : Any> BODY.dumpListAsTable(
    list: List<T>, fields: Collection<KProperty1<T, *>> = T::class.memberProperties
) = table("table table-striped table-hover") {
    thead {
        tr {
            for (field in fields)
                th { +field.name }
        }
    }
    tbody {
        for (row in list) {
            tr {
                for (field in fields)
                    td { +field.get(row).toString() }
            }
        }
    }
}

private fun youtubePlaylists(req: Request): Response = pageTemplate("Youtube playlists", autoreload = false) {
        listOf(YoutubePlaylist::id, YoutubePlaylist::title, YoutubePlaylist::itemsCount, YoutubePlaylist::description)
        . . . .


Navbar and html body

body {
    nav("navbar navbar-expand-md navbar-dark bg-dark fixed-top") {
        div("collapse navbar-collapse") {
            ul("navbar-nav mr-auto") {
                WebApp.webRoutes.filter { it.verb == Method.GET }.forEach {
                    li("nav-item") {
                        a(it.url, classes = "nav-link") {
                        }
                    }
                }
            }
            ul("navbar-nav mr-auto") {
                li("nav-item") {
                    if (PronounceApp.isRunning.get()) {
                        h2 {
                            span("badge badge-danger badge-secondary") { +"Generator is running" }
                        }
                    } else {
                        form(action = "/forcestart", method = FormMethod.post, classes = "form-inline") {
                            button(classes = "btn btn-success", type = ButtonType.submit) { +"Force start" }
                        }
                    }
                }
                li("nav-item") {
                    a("https://papertrailapp.com/groups/XXXXXXX/events", "_blank", "nav-link button") {
                    }
                }
            }
        }
    }
    main {
        h1 { +title }
    }
}

Result looks like:

Decent design for the private admin panel to be used every once in a few months by a single person, isn't it? :)

Server and routes

Handlers are defined as a list of routes with URL and handler function as expected in http4k

val webRoutes = listOf(
    Route(GET, "/ping", "Ping it") { req -> Response(OK).body("pong") },
    Route(GET, "/stat", "Some stats", this::stat),
    Route(GET, "/events", "Recent events", this::recentEvents),
    Route(GET, "/", "All urls available", this::root),
    Route(GET, "/config", "Runtime config", this::config),
    Route(GET, "/stat_channel", "Channel statistics", this::channelStat),
    Route(GET, "/youtube_playlist", "YT playlists", this::youtubePlaylists),

    Route(POST, "/forcestart", "Force start", this::forceStart),
    Route(POST, "/createplaylist", "Create playlist", this::youtubePlaylistCreate)
)

The webserver itself is literally 3 lines of code

routes(*webRoutes.map { it.url bind it.verb to it.handler }.toTypedArray())

The heart of the image lookup component:

fun findImageForWordWithCandidates(
    word: String,
    category: Category?,
    stopList: List<String>,
    mandatoryList: List<String>? = null,
    allowFaces: Boolean = false,
    pixabyPage: Int = 1
): ImagesWithCandidates? {
    val stopList = stopList.mapTo(HashSet(), String::toLowerCase)
    val mandatoryList = mandatoryList?.mapTo(HashSet(), String::toLowerCase)
    val allImages = imageLookup.searchPixabay(word, category, pixabyPage)
        ?.also { log.info("findImageForWord: got {} images for {}", it.size, word) }
        ?.mapNotNull {
            // fetch pictures locally
            runCatching {
                ImageRuntimeInfo(it.largeImageURL, URL(it.largeImageURL).asCachedFile("pixaby-${it.id}-large"), pixaby = it)
            }
            .onFailure { log.warn("findImageForWord: during image saving", it) }
            .getOrNull() // skip images which failed to download
        }
    val images = allImages
        // filter pics with faces if necessary
        ?.let { if (allowFaces) it else withoutFaces(word, it) }
        ?.also { log.info("findImageForWord: got ${it.size} pics after face filtering") }
        // image labelling
        ?.let { findLabels(word, it) }
        // exclude stop list words
        ?.filterNot { it.normalizedLabels.any { it in stopList } }
        // exclude images without mandatory words
        ?.filter {
            if (mandatoryList == null) true
            else it.normalizedLabels.any { it in mandatoryList }
        }
        ?.also { log.info("findImageForWord: got ${it.size} pics after filtering by label") }
        ?: return null

    if (images.isEmpty()) {
        log.warn("findImageForWord: No eligible images were found for {}", word)
        return null
    }

    val sortedImagesByConfidence = images
        .map {
            // find the best matches by original word
            val labelWithWord = it.labels
                .sortedByDescending { it.confidence }
                .firstOrNull { it.name.contains(word, ignoreCase = true) }
            it to (labelWithWord?.confidence ?: -1.0F)
        }
        // highest confidence first; images without a matching label (-1.0) go last
        .sortedByDescending { it.second }

    log.debug("findImageForWord: ${sortedImagesByConfidence.size} candidates for $word: \n{}",
        sortedImagesByConfidence.joinToString("\n") { "   - ${it.first} with ${it.second} confidence" })

    val firstBestMatch = sortedImagesByConfidence
        .firstOrNull { it.second > 0.0F } // return first by confidence

    log.info("findImageForWord: best match for {} by label in word - {}",
        word, firstBestMatch)

    if (firstBestMatch != null)
        return ImagesWithCandidates(firstBestMatch.first, allImages)

    // we don't have an exact match by word in labels
    val firstMatchByConfidence = sortedImagesByConfidence.firstOrNull()
    log.info("findImageForWord: good match by confidence for {} - {}",
        word, firstMatchByConfidence)

    return ImagesWithCandidates(firstMatchByConfidence?.first, allImages)
}

Clips compose

Pardon me for my python

for idx, _ in enumerate(voice_title_clips):
    prevoice_clip = CompositeVideoClip([static, voice_title_clips[idx]], size=screensize)
    prevoice_clip.duration = pre_voice_pause
    postvoice_clip = prevoice_clip.set_duration(post_voice_pause)
    voice_title_clips[idx] = CompositeVideoClip([static, voice_title_clips[idx]], size=screensize)
    voice_title_clips[idx].duration = voice_clips[idx].duration * voice_repeats + voice_repeats_pause_times * voice_clips[idx].duration
    silence_clip = silence.set_duration(voice_clips[idx].duration * voice_repeats_pause_times)
    voice_title_clips[idx].audio = concatenate_audioclips(intersperse([voice_clips[idx]] * voice_repeats, silence_clip))
    clips = [prevoice_clip, voice_title_clips[idx], postvoice_clip, static.set_duration(pause_between)]
    voice_title_clips[idx] = concatenate_videoclips(clips, padding=-1, method="compose")

What is next

Subscribe to the blog to see where this project goes. Breaking news awaits!

Check out more project updates in the posts grouped under the #pronounceMe hashtag

Functional Kotlin part 4: collections manipulation

This is part 4 of the #kotlin-showoff series, and it's about the standard functions over collections (mostly iterables, to be precise) that allow a developer to express data modification in a clean and functional way.

General convention

Although one might think that Kotlin has inherited all the base collection types from Java, that's not quite true. Kotlin transparently maps existing Java collections into Kotlin using some tricks such as type aliasing. The collections hierarchy in Kotlin makes code even safer by imposing a separation between mutable and immutable data structures. Take a look at the interfaces diagram:

diagram originally posted on kotlinlang.org

Having dedicated interfaces for immutable collections makes expressions purely functional - no need to worry that an API consumer modifies the list on the way or, even worse, attempts to insert into an immutable collection (goodbye UnsupportedOperationException!). Indeed, immutability is enforced at compile time by the contract.
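A tiny sketch of what that compile-time contract looks like in practice. The function name withScala is made up for the example; the point is only that List exposes no mutating methods, so a modified copy has to be created explicitly.

```kotlin
// `withScala` is a hypothetical helper used only for this illustration.
fun withScala(languages: List<String>): List<String> {
    // languages.add("scala") // does not compile: List has no add()

    // An explicit mutable copy is required to change anything
    val copy: MutableList<String> = languages.toMutableList()
    copy.add("scala")
    return copy
}

fun main() {
    val original = listOf("java", "kotlin")
    println(withScala(original)) // [java, kotlin, scala]
    println(original)            // [java, kotlin]
}
```

The caller's list stays untouched because the function could not modify it even if it wanted to.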

A note about Iterable vs Sequence

These are very similar base types, even with the same signatures. Let's take a look:

public interface Sequence<out T> {
    public operator fun iterator(): Iterator<T>
}

public interface Iterable<out T> {
    public operator fun iterator(): Iterator<T>
}

These two base interfaces define the way data will be processed in a chain of calls:

  • Operations on Iterable produce a result immediately, so the full intermediate result is passed between calls in the chain. The result is evaluated eagerly after each step.
  • Operations on Sequence treat the data items coming through as if they were an infinite stream; the closest analogy would be a java8 Stream or an Rx Observable. Items are passed through the chain of calls one by one. The result is evaluated lazily.

For now we focus on Iterable and its descendants (Collection, List, Set, etc.). Luckily, many operations exist for both interfaces with exactly the same signatures.
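The difference in evaluation order is easy to observe by recording every lambda call. This is a small sketch with made-up function names (eagerTrace, lazyTrace); both run the same filter and map chain, once on a List and once on a Sequence.

```kotlin
// Records the order in which the lambdas are invoked on an Iterable (eager)
fun eagerTrace(): List<String> {
    val trace = mutableListOf<String>()
    listOf(1, 2, 3)
        .filter { trace += "filter $it"; it % 2 == 1 }
        .map { trace += "map $it"; it * 10 }
    return trace
}

// Same chain on a Sequence (lazy); nothing runs until the terminal toList()
fun lazyTrace(): List<String> {
    val trace = mutableListOf<String>()
    sequenceOf(1, 2, 3)
        .filter { trace += "filter $it"; it % 2 == 1 }
        .map { trace += "map $it"; it * 10 }
        .toList()
    return trace
}

fun main() {
    println(eagerTrace()) // [filter 1, filter 2, filter 3, map 1, map 3]
    println(lazyTrace())  // [filter 1, map 1, filter 2, filter 3, map 3]
}
```

With the List the whole filter step finishes before map starts, while with the Sequence each item travels through the full chain before the next one is touched.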

Simple list transformations filter, map, forEach

These are probably the most widely used operators, and they do exactly what their names say. The provided function is applied to each element of the collection

val adminNames = users
  .filter { it.isAdmin }
  .map { it.name }

pupils.forEach {
  println("${it.name}: ${it.score}")
}

filter* and map* families

There are many more similar operations in the Kotlin stdlib, giving extra flexibility when you need it:

val userList = users
  .filterNot { it.isBanned }
  .mapTo(HashSet()) { it.userId }
  .mapIndexed { idx, userId -> "#$idx: $userId" }

On many occasions you'll find the same pattern - verb [not] [indexed] [to]. No need to memorise - the names come out intuitively.
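Here is a short sketch of that naming pattern applied to filter and map. The word list and the variable names are invented for the example; the functions themselves (filter, filterNot, filterIndexed, filterTo, mapIndexed) come from the Kotlin stdlib.

```kotlin
val words = listOf("apple", "kiwi", "banana", "fig")

// verb
val shortWords = words.filter { it.length <= 4 }
// verb + Not: inverted predicate
val longWords = words.filterNot { it.length <= 4 }
// verb + Indexed: the lambda also receives the element index
val evenPositions = words.filterIndexed { idx, _ -> idx % 2 == 0 }
// verb + To: results are appended to the given mutable collection
val startingWithB = words.filterTo(mutableSetOf()) { it.startsWith("b") }
// the same suffixes work for map
val numbered = words.mapIndexed { idx, word -> "$idx:$word" }

fun main() {
    println(shortWords)    // [kiwi, fig]
    println(longWords)     // [apple, banana]
    println(evenPositions) // [apple, banana]
    println(startingWithB) // [banana]
    println(numbered)      // [0:apple, 1:kiwi, 2:banana, 3:fig]
}
```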

Operations returning single element: first, last, single, elementAt, get

first and last return first and last elements (obviously).

val firstUser = users.first()
val firstAdminUser = users.first { it.isAdmin }
val lastBannedUser = users.last { it.isBanned }

single returns the only matching element and throws an exception if no element or more than one element in the collection matches the predicate

val oneLove = listOf("java", "kotlin",  "javascript").single { it == "kotlin" }

These operations also have variants which return an alternative value - provided by a closure or null:

val oneLove = languages.singleOrNull { it == "kotlin" }
val tenthWinnerName = user.getOrElse(9) { "NO WINNER" } // indices are zero-based
val secondPerson = user.getOrNull(1)
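To make the failure modes of single explicit, here is a small sketch. The helper describeSingle is made up for this example; the exception types are the ones the stdlib throws for single with a predicate.

```kotlin
val languages = listOf("java", "kotlin", "javascript")

// `describeSingle` is a hypothetical helper showing both ways `single` can fail
fun describeSingle(prefix: String): String =
    try {
        languages.single { it.startsWith(prefix) }
    } catch (e: NoSuchElementException) {
        "no match"            // nothing matched the predicate
    } catch (e: IllegalArgumentException) {
        "more than one match" // the predicate matched several elements
    }

fun main() {
    println(describeSingle("k"))    // kotlin
    println(describeSingle("java")) // more than one match
    println(describeSingle("rust")) // no match

    // The *OrNull and *OrElse variants never throw
    println(languages.singleOrNull { it.startsWith("java") }) // null
    println(languages.getOrElse(10) { "NO WINNER" })          // NO WINNER
    println(languages.getOrNull(1))                           // kotlin
}
```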

Aggregation operations count, average, min, max

Again, these operations intuitively perform aggregations:

val avgScore = pupils.map { it.score }.average()
val topStudent = pupils.maxByOrNull { it.score }
val challengingStudent = pupils.minByOrNull { it.score }

Conditional operations all, none, count, any

val numberOfTopStudents = pupils.count { it.score > 4.5 }
val allPassed = pupils.all { it.score > 2.0 }
val hasNeedleInHaystack = heap.any { it.`object` == NEEDLE } // `object` is a keyword, hence the backticks
val allGood = results.none { it.error != null }

List to Map transformation associate*, groupBy

Both operations produce a Map; they differ in how key collisions are handled. While associate* simply overwrites the existing value for a colliding key, groupBy adds the value to the list of values for that key:

val usersById = users.associate { it.id to it } // result type: Map<UserId, User>
val usersById = users.associateBy { it.id } // same output
val pupilsByScore = pupils.groupBy { it.score } // result type Map<Int, List<Pupil>>
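The collision behaviour is easiest to see on data where a key repeats. A minimal sketch, with a made-up Pupil class and sample data where two pupils share the same score:

```kotlin
// Illustrative data model, not part of any real project
data class Pupil(val name: String, val score: Int)

val pupils = listOf(Pupil("Ann", 5), Pupil("Bob", 4), Pupil("Cid", 5))

// associateBy keeps only the last pupil for each score
val lastByScore: Map<Int, Pupil> = pupils.associateBy { it.score }

// groupBy keeps every pupil, collected into a list per score
val allByScore: Map<Int, List<Pupil>> = pupils.groupBy { it.score }

fun main() {
    println(lastByScore.mapValues { it.value.name })
    // {5=Cid, 4=Bob}
    println(allByScore.mapValues { entry -> entry.value.map { it.name } })
    // {5=[Ann, Cid], 4=[Bob]}
}
```

Ann silently disappears from the associateBy result, so groupBy is the safer choice whenever keys are not guaranteed to be unique.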

Many more

There are many more functional operations over collections available in the Kotlin stdlib, such as fold, reduce, minus(-), plus(+), contains(in), etc:

// result - list of users from both sources
val allUsers = fbUsers + twitterUsers

// result - elements of allUserIds which are not in bannedUsersIds
val activeUserIds = allUserIds - bannedUsersIds

// result - the longest name
val longestName = names.reduce { longest, item -> if (longest.length < item.length) item else longest }

// result - the length of the longest name
val longestLength = names.map(String::length).fold(0) { longest, length -> maxOf(longest, length) }

// result - true if Wally is among the Kotlin lovers
val isWallyLovesKotlin = "Wally" in kotlinLovers

These extension functions are very intuitive and widely used; together they cover most everyday tasks.


Kotlin collection functions provide a lot of flexibility to express your ideas and business logic in a very concise, clear and functional way.

Hopefully you found this article useful. Please check out the other posts under the #kotlin-showoff hashtag

Dynamically typed languages are selling snake oil

I truly believe they are

I hear the same statements and misunderstandings over and over again from people who like dynamically typed languages. Obviously, that sparks a lot of endless conversations and fights between the two camps.

Generally, I avoid conversations about static vs dynamic typing, but every once in a while I drift into one and hear the same statements, all the time. Often both sides just aren't able to listen to each other, and thus the conversation ends up at a dead end.

Read more

JavaScript vs logic

In the programming world we work with logic. Everything relies on it; it's a fundamental part of computers.

If we do 3+4 we always expect to get 7. A call to createDatabase shall not destroy the database. As experience grows, a developer grasps more and more concepts and approaches thanks to past experience and logic. It's a very important part of the programming ecosystem, which helps to grow a skill set without getting another Masters degree or attending classes/courses.

People ended up with very common concepts and gave them names - algorithms, design patterns, data types, naming conventions.

Read more

Nexmo Voice API demo: voicemail app

This article features a voicemail service built using Nexmo Voice APIs and Spring Boot

As a business owner, it's not always easy to handle a huge volume of calls 24/7. On the other hand, each customer is important and deserves to be served well.

To kick off development you can check out the demo repository

What to expect in this tutorial

In this tutorial we build a simple voicemail forwarder where callers are asked to leave a voice message, which is recorded using the Nexmo Voice API and sent to an email address as an attachment.
Example of result:

Read more

Flying to/from London?

London is one of the must-see places in the world. It has a long history and has always been an important economic center. Well, if you have landed on this page you're already considering a trip here.

The UK is an island, hence there are not so many options to get here. I'm sure you are going to take a flight to London. Other options do exist, but they are either time consuming or quite expensive

Here is some good news: London has 5-6 airports (technically only two, the others are just outside of the London metropolitan area). These airports accommodate an enormous number of flights from all over the world, reaching pretty much all possible countries on all continents. As London is a hub for many low cost carriers such as Ryanair, easyjet, Norwegian, etc, it is possible to find tickets for as little as £5-10+ for European destinations and £150+ for North America

Read more

About the iPhone


The text below comes not only from a Linux user but also from an Android developer who, by a twist of fate and out of personal interest, used an iPhone 4s for a month

Welcome to hell

I will try to be restrained and, as far as possible, objective. My main phone, a Sony, has almost broken down - a good excuse to try this outlandish iPhone that enthusiasts queue up for and sell their last kidneys to buy.

Disclaimer 2: I won't judge the speed, because this is a 4s and obviously (with some caveats) iOS as a whole is not to blame here. Although it lags like crazy. On top of that, in our current project we have had our fill of problems as iOS publishers (those who write apps and publish them to the App Store), so some comments will be related to that experience, which 99.999% of users have no idea about

Read more

The Ruxes in London: arrange viewing

Hopefully you have already made a list of the places you are interested in on one or several of the sites from part one. Don't be surprised if half of the places are already taken, especially if you wrote during the weekend. And try to find time for a viewing of the places which haven't been snapped up yet. The longer you put it off, the more likely you are to get a call back (good if they call in advance) saying that someone put down a deposit right before you.

Arrange a viewing

Make yourself a checklist so you don't forget to ask the agent all the questions:

Read more

Asus x202e: ubuntu & win8 dual boot

My old lenovo s10-2 has long outlived its usefulness. More and more I needed to carry a laptop around, a bigger screen and, of course, more power (you won't get far on 2gb and an atom n270 1.6GHz, especially with Java).

I took a long time choosing - I have clear requirements for the hardware and the look, the most important of which are:

  • The arrow keys must be a separate block, not merged with the other keys, preferably with a gap from Shift
  • Enter is easy to hit
  • Screen - no larger than 13"
  • Weight no more than 1.5kg
  • Battery life of at least 4h
  • At least 4gb of memory
  • Processor no weaker than a core i3
  • Preferably a large touchpad
  • Noticeably cheaper than a macbook air
  • Of course, Linux must install without problems (nowadays that is a problem, but still)
  • Must have VGA

Read more

An amazing time!

Look at what's happening around you! People's lives are changing at lightning speed!

Technology is entering everyday life at an unreal pace! It develops so fast that people can't keep up with it.
It has always been this way - any innovation took a long time to reach the masses. But unlike today, there used to be alternatives. Modern tools get people "hooked", leaving no alternative. That's why it's now impossible not to keep learning all your life (to varying degrees, depending on the field)

Look back and see what has changed over 10 years here, and over 20 in developed countries!

Read more