Update: PronounceMe – implementation details

I have written several posts about the PronounceMe experiments, an automatic video and voice generator for English learners. If you missed the previous posts, please review #pronounceMe for more information about the project, the ideas behind it and some statistics. In this post I'd like to focus on the technical implementation, with some diagrams and notable code snippets.

Service diagram

Essentially, the service consists of a number of components:

  • Terms Database - dataset of the words
  • Voice generator - engine which generates human-like voices
  • Picture lookup service - a huge part which is responsible for finding a relevant background picture
  • Video generator - renderer of the video which composes the clips and voices
  • Youtube Uploader - implementation of the Youtube API client
  • Management panel - very basic web-based admin panel for observing the current status, database and statistics
  • Statistics extractor - a job which regularly fetches some statistics data

services diagram

The central focus of the service is a Term - one or a few words which need to be pronounced. The service is built around a pipeline which takes the term and transforms it into an uploaded Youtube video.

pipeline diagram

Bear in mind this project is at the MVP stage. That means I have made several compromises to reduce Time To Market. It's never too late to improve the code if the project proves useful.

Technologies used

I could say it uses most of the buzzwords, such as Computer Vision, Mesh API, Cloud Services, Docker and Microservices, and it wouldn't be a lie. But we always want to see more specifics:

Generator service:

  • Kotlin - of course, there are no other candidates. Most of the engine is built on it
  • http4k - webserver written in kotlin, for kotlin
  • kotlinx-html - dsl for html. There is no javascript in the project as such
  • kmongo - as a DAO layer for the db
  • java-pixabay - for Pixabay API access; the original project has been abandoned, so I had to fork it

Video generator and renderer:

It's a microservice running in a separate container and communicating with the engine via HTTP.

  • python3 - because of moviepy
  • moviepy - the best programmatic video generator; I haven't found anything similar for the JVM
  • imageio - for image manipulation
  • cherrypy - apparently the easiest way to expose a python function via an HTTP REST service

Infrastructure:

  • jib - jvm docker image generator
  • mongo - default database for experiments/MVP
  • docker - as a runtime
  • docker-compose - for container orchestration
  • docker-machine - for provisioning
  • make - as a frontend for the deployment commands

Cloud API and services:

  • AWS Rekognition - cloud computer vision API. It allows filtering out pictures with faces (they often create a lot of noise) and labelling images to choose the best picture
  • AWS Polly - TTS engine provided by Amazon. Previously I tried the one from MS Azure but the quality wasn't satisfying. Polly generates a nearly perfect voice with different accents
  • YouTube API - for video uploads and statistics collection
  • AWS EC2 - for hosting
  • PaperTrail - for logs from the docker containers

Challenges

Well, there were a lot of issues, mostly related to the external services.

1) YouTube API has limitations - each call counts towards the daily quota (which is 1M units). From my experience, it's only possible to upload about 50 videos a day. Although I'd like to upload way more than the limit allows, that didn't bother me much since the process is automated.
2) moviepy heavily leaks memory. In my experiments, after 10 rendered videos the python process held about 2 GB of RAM. Since it's an MVP I chose the simplest solution - just restart the microservice. More precisely, I configured docker-swarm to kill it once it consumes too much memory. I believe it's a very practical decision for the given project stage.
3) To make the video stand out from the others it has to have a background picture relevant to the term. If the user looks for "how to pronounce tomato", it's more likely that a video with a tomato in the background would be chosen rather than one with a grey background. To find images I used the Pixabay API (if you like the service, don't forget to donate to them too!). For obvious reasons some irrelevant pictures are often returned, so I had to filter them out using Amazon computer vision.
4) ImageMagick policies hurt. It's a great library, but I found it tricky to configure since it has a configuration file where the defaults are very tight. For example, it's impossible to generate video into the /tmp folder by default. Thanks to docker it's very easy to build an image with the configuration embedded.
5) Apparently, docker-compose has changed the behaviour of the container limits, so I had to downgrade the configuration file from version 3.3 to 2.3.
6) I wanted to keep MongoDB outside of the container, on the host machine, for my personal reasons. If you have ever tried to do so, you know it's not easy. The container ecosystem pushes the user to use containers only. I ended up binding /var/lib/mongodb/mongod.sock from the host into the container and using jnr-unixsocket to make mongo use a unix socket instead of TCP.
7) The Youtube API documentation seems to be very convoluted. I had a hard time understanding how to go from a simple youtube upload to something like "create a playlist if needed and then specify the description along with tags and the location of the video in different languages".

Enjoyable parts

This project is actually quite interesting to work on. It uses many external APIs, works with computer vision (a lot of fun with debugging!), etc.

  • kotlin is soo nice, as usual. Can't imagine myself using python, which can explode after every single typo, or java, where I would write a few books and it still wouldn't be that clean
  • Writing web pages in kotlin with kotlinx-html is really fun. Just think - statically typed html templates!
  • Amazon Rekognition works like magic; I'd say in 90% of cases it sees what I'd say about the picture. Prices are very competitive for my use case
  • Sealed classes work really well for the statistics collection and voices description
  • kmongo allows expressing db queries via a statically typed DSL. Like most ORMs it fails on complex constructions, but performance of the DB communication is never a concern for this project
  • The java-pixabay library has been outdated; I made a few PRs but the author has not got back to me. For that reason I continued to work on my fork - ruXlab/pixabay-java-api
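To illustrate why the sealed classes mentioned in the list above are a good fit for something like a voice description, here is a minimal sketch. All names in it (Voice, Standard, Accented, describe) are hypothetical and are not taken from the PronounceMe code base; the voice names are only illustrative values.

```kotlin
// Hypothetical model, not the actual PronounceMe classes.
sealed class Voice {
    abstract val language: String

    data class Standard(override val language: String, val name: String) : Voice()
    data class Accented(override val language: String, val name: String, val accent: String) : Voice()
}

// `when` used as an expression over a sealed class must be exhaustive,
// so no `else` branch is needed; adding a new subclass breaks compilation
// here until the new case is handled.
fun describe(voice: Voice): String = when (voice) {
    is Voice.Standard -> "${voice.name} (${voice.language})"
    is Voice.Accented -> "${voice.name} (${voice.language}, ${voice.accent} accent)"
}

fun main() {
    println(describe(Voice.Standard("en-US", "Joanna")))
    println(describe(Voice.Accented("en-GB", "Brian", "British")))
}
```

The compiler-checked `when` is the main benefit: a forgotten case is a build error rather than a runtime surprise.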

Code snippets

I'd like to highlight some of the code used in the PronounceMe service.

kotlinx-html templates


private inline fun <reified T : Any> BODY.dumpListAsTable(
    list: List<T>, fields: Collection<KProperty1<T, *>> = T::class.memberProperties
) = table("table table-striped table-hover") {
    thead {
        tr {
            for (field in fields)
                th { +field.name }
        }
    }
    tbody {
        for (row in list) {
            tr {
                for (field in fields)
                    td { +field.get(row).toString() }
            }
        }
    }
}

// usage: render the playlists as a table with a fixed set of columns
private fun youtubePlaylists(req: Request): Response = pageTemplate("Youtube playlists", autoreload = false) {
    dumpListAsTable(
        youtubeClient.getPlaylists(),
        listOf(YoutubePlaylist::id, YoutubePlaylist::title, YoutubePlaylist::itemsCount, YoutubePlaylist::description)
    )
        . . . .
}

Outputs:

Navbar and html body

body {
    nav("navbar navbar-expand-md navbar-dark bg-dark fixed-top") {
        div("collapse navbar-collapse") {
            ul("navbar-nav mr-auto") {
                WebApp.webRoutes.filter { it.verb == Method.GET }.forEach {
                    li("nav-item") {
                        a(it.url, classes = "nav-link") {
                            +it.description
                        }
                    }
                }
            }
            ul("navbar-nav mr-auto") {
                li("nav-item") {
                    if (PronounceApp.isRunning.get()) {
                        h2 {
                            span("badge badge-danger badge-secondary") { +"Generator is running" }
                        }
                    } else {
                        form(action = "/forcestart", method = FormMethod.post, classes = "form-inline") {
                            button(classes = "btn btn-success", type = ButtonType.submit) { +"Force start" }
                        }
                    }
                }
                li("nav-item") {
                    a("https://papertrailapp.com/groups/XXXXXXX/events", "_blank", "nav-link button") {
                        +"logs"
                    }
                }
            }
        }
    }
    main {
        h1 { +title }
        builder(this@body)
    }
}

Result looks like:

Decent design for the private admin panel to be used every once in a few months by a single person, isn't it? :)

Server and routes

Handlers are defined as a list of routes with URL and handler function as expected in http4k

val webRoutes = listOf(
    Route(GET, "/ping", "Ping it") { req -> Response(OK).body("pong") },
    Route(GET, "/stat", "Some stats", this::stat),
    Route(GET, "/events", "Recent events", this::recentEvents),
    Route(GET, "/", "All urls available", this::root),
    Route(GET, "/config", "Runtime config", this::config),
    Route(GET, "/stat_channel", "Channel statistics", this::channelStat),
    Route(GET, "/youtube_playlist", "YT playlists", this::youtubePlaylists),

    Route(POST, "/forcestart", "Force start", this::forceStart),
    Route(POST, "/createplaylist", "Create playlist", this::youtubePlaylistCreate)
)

The webserver itself is literally 3 lines of code

routes(*webRoutes.map { it.url bind it.verb to it.handler }.toTypedArray())
    .asServer(SunHttp(port))
    .start()

The heart of the image lookup component:

fun findImageForWordWithCandidates(
    word: String,
    category: Category?,
    stopList: List<String>,
    mandatoryList: List<String>? = null,
    allowFaces: Boolean = false,
    pixabyPage: Int = 1
): ImagesWithCandidates? {
    val stopList = stopList.mapTo(HashSet(), String::toLowerCase)
    val mandatoryList = mandatoryList?.mapTo(HashSet(), String::toLowerCase)
    val allImages = imageLookup.searchPixabay(word, category, pixabyPage)
        ?.shuffled()
        ?.also { log.info("findImageForWord: got {} images for {}", it.size, word) }
        ?.mapNotNull {
            // fetch pictures locally
            runCatching {
                ImageRuntimeInfo(it.largeImageURL, URL(it.largeImageURL).asCachedFile("pixaby-${it.id}-large"), pixaby = it)
            }
            .onFailure { log.warn("findImageForWord: during image saving", it) }
            .getOrNull()
        }
    val images = allImages
        // filter pics with faces if necessary
        ?.let { if (allowFaces) it else withoutFaces(word, it) }
        ?.also { log.info("findImageForWord: got ${it.size} pics after face filtering") }
        // image labelling
        ?.let { findLabels(word, it) }
        // exclude stop list words
        ?.filterNot { it.normalizedLabels.any { it in stopList } }
        // exclude images without mandatory words
        ?.filter {
            if (mandatoryList == null) true
            else it.normalizedLabels.any { it in mandatoryList }
        }
        ?.also { log.info("findImageForWord: got ${it.size} pics after filtering by label") }
        ?: return null

    if (images.isEmpty()) {
        log.warn("findImageForWord: No eligible images were found for {}", word)
        return null
    }

    val sortedImagesByConfidence = images
        .map {
            // find the best matches by original word
            val labelWithWord = it.labels
                .sortedByDescending { it.confidence }
                .firstOrNull { it.name.contains(word, ignoreCase = true) }
            it to (labelWithWord?.confidence ?: -1.0F)
        }
        // order the candidates so the most confident match comes first
        .sortedByDescending { it.second }

    log.debug("findImageForWord: ${sortedImagesByConfidence.size} candidates for $word: \n{}",
        sortedImagesByConfidence.joinToString("\n") { "   - ${it.first} with ${it.second} confidence" })

    val firstBestMatch = sortedImagesByConfidence
        .firstOrNull { it.second > 0.0F } // return first by confidence
        ?.first

    log.info("findImageForWord: best match for {} by label in word - {}",
        word, firstBestMatch)

    if (firstBestMatch != null)
        return ImagesWithCandidates(firstBestMatch, allImages)

    // no label contains the word itself, so fall back to the first candidate
    val firstMatchByConfidence = sortedImagesByConfidence.firstOrNull()
    log.info("findImageForWord: good match by confidence for {} - {}",
        word, firstMatchByConfidence)

    return ImagesWithCandidates(firstMatchByConfidence?.first, allImages)
}

Clips compose

Pardon me for my python

for idx, _ in enumerate(voice_title_clips):
    # title over the static background, shown silently before and after the voice
    prevoice_clip = CompositeVideoClip([static, voice_title_clips[idx]], size=screensize)
    prevoice_clip.duration = pre_voice_pause
    postvoice_clip = prevoice_clip.set_duration(post_voice_pause)
    # the same title shown while the voice is playing
    voice_title_clips[idx] = CompositeVideoClip([static, voice_title_clips[idx]], size=screensize)
    voice_title_clips[idx].duration = voice_clips[idx].duration * voice_repeats + voice_repeats_pause_times * voice_clips[idx].duration
    # audio track: the voice repeated voice_repeats times with silence in between
    silence_clip = silence.set_duration(voice_clips[idx].duration * voice_repeats_pause_times)
    voice_title_clips[idx].audio = concatenate_audioclips(intersperse([voice_clips[idx]] * voice_repeats, silence_clip))
    # pre-voice pause, voiced title, post-voice pause, then the plain background for pause_between
    clips = [prevoice_clip, voice_title_clips[idx], postvoice_clip, static.set_duration(pause_between)]
    voice_title_clips[idx] = concatenate_videoclips(clips, padding=-1, method="compose")

What is next

Subscribe to the blog to see where this project goes. Breaking news awaits!

Check out more project updates in the posts grouped under the #pronounceMe hashtag.

Functional Kotlin part 4: collections manipulation

This is part 4 of the #kotlin-showoff series, and it's going to be about the standard functions over collections (mostly iterables, to be precise) which allow a developer to express data modification in a clean and functional way.

General convention

Although one might think that kotlin has inherited all the base collection types from Java, that's not quite true. Kotlin transparently maps existing Java collections into Kotlin by using some tricks such as type aliasing. The collections hierarchy in Kotlin makes code even safer by imposing a separation between mutable and immutable (read-only) data structures. Take a look at the interfaces diagram:

diagram originally posted on kotlinlang.org

Having dedicated interfaces for immutable collections makes expressions purely functional - no need to worry that an api consumer modifies the list on the way or, even worse, attempts to insert into an immutable collection (goodbye UnsupportedOperationException!). Indeed, immutability is enforced at compile time by contract.
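A minimal sketch of that compile-time separation, using only stdlib types. The function name total is made up for the example. Note that List here is a read-only view: the code holding the MutableList reference can still change the underlying data.

```kotlin
// A function that only reads takes the read-only List interface.
fun total(scores: List<Int>): Int = scores.sum()

fun main() {
    val scores: MutableList<Int> = mutableListOf(3, 4, 5)
    scores.add(2)                  // fine: MutableList exposes add()

    val readOnly: List<Int> = scores
    // readOnly.add(1)             // does not compile: List has no add()

    println(total(readOnly))       // 14
}
```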

A note about Iterable vs Sequence

Those are very similar base types, even with the same signatures. Let's take a look:

public interface Sequence<out T> {
    public operator fun iterator(): Iterator<T>
}

public interface Iterable<out T> {
    public operator fun iterator(): Iterator<T>
}

Those two base interfaces define the way data is processed in a chain of calls:

  • Operations on Iterable produce the result immediately, so the full intermediate result is passed between calls in the chain. The result is evaluated eagerly after each step.
  • Operations on Sequence treat the data items coming through as if they were an infinite stream; the closest analogy would be a java8 Stream or an Rx Observable. Items are passed through the chain of calls one by one. The result is evaluated lazily.

For now we focus on Iterable and its descendants (Collection, List, Set, etc.). Luckily, many operations exist for both interfaces with exactly the same signatures.
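The difference in evaluation order can be made visible by recording each call. This is a small sketch; eagerTrace and lazyTrace are names made up for the example.

```kotlin
// Iterable: each step processes the whole collection before the next step starts.
fun eagerTrace(items: List<Int>): List<String> {
    val trace = mutableListOf<String>()
    items
        .filter { trace += "filter $it"; it % 2 == 1 }
        .map { trace += "map $it"; it * 10 }
    return trace
}

// Sequence: every item travels through the whole chain before the next item starts.
fun lazyTrace(items: List<Int>): List<String> {
    val trace = mutableListOf<String>()
    items.asSequence()
        .filter { trace += "filter $it"; it % 2 == 1 }
        .map { trace += "map $it"; it * 10 }
        .toList() // terminal operation: nothing is evaluated before this call
    return trace
}

fun main() {
    println(eagerTrace(listOf(1, 2, 3))) // [filter 1, filter 2, filter 3, map 1, map 3]
    println(lazyTrace(listOf(1, 2, 3)))  // [filter 1, map 1, filter 2, filter 3, map 3]
}
```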

Simple list transformations filter, map, forEach

Those are probably the most widely used operators, and they do exactly what their names say. The provided function is applied to each element of the collection.

val adminNames = users
  .filter { it.isAdmin }
  .map { it.name }

pupils.forEach { 
  println("${it.name}: ${it.score}")
}

filter* and map* families

There are many more similar operations in the Kotlin stdlib, giving extra flexibility when you need it:

val userList = users
  .filterNotNull()
  .filterNot { it.isBanned }
  .mapTo(HashSet()) { it.userId }
  .mapIndexed { idx, userId -> "#$idx: $userId" }

On many occasions you'll find the same pattern - verb [not] [indexed] [to]. No need to memorise - the names come out intuitively.
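A few members of those families side by side, as a small runnable sketch on a made-up list of words:

```kotlin
val words: List<String?> = listOf("alpha", null, "beta", "gamma")

val nonNull = words.filterNotNull()                                    // verb + NotNull
val notBeta = nonNull.filterNot { it == "beta" }                       // verb + Not
val evenPositions = nonNull.filterIndexed { idx, _ -> idx % 2 == 0 }   // verb + Indexed
val lengths = nonNull.mapTo(mutableSetOf<Int>()) { it.length }         // verb + To
val numbered = nonNull.mapIndexed { idx, word -> "#$idx: $word" }      // verb + Indexed

fun main() {
    println(nonNull)        // [alpha, beta, gamma]
    println(notBeta)        // [alpha, gamma]
    println(evenPositions)  // [alpha, gamma]
    println(lengths)        // [5, 4]
    println(numbered)       // [#0: alpha, #1: beta, #2: gamma]
}
```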

Operations returning single element: first, last, single, elementAt, get

first and last return first and last elements (obviously).

val firstUser = users.first()
val firstAdminUser = users.first { it.isAdmin }
val lastBannedUser = users.last { it.isBanned }

single returns the only matching element and throws an exception if no element, or more than one element, in the collection matches the predicate

val oneLove = listOf("java", "kotlin", "javascript").single { it == "kotlin" }

Those operations also have variants which return an alternative value, either provided by a closure or null:

val oneLove = languages.singleOrNull { it == "kotlin" }
val tenthWinnerName = winnerNames.getOrElse(9) { "NO WINNER" } // indices are zero-based
val secondPerson = users.getOrNull(1)
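The difference between the throwing and the non-throwing variants is easiest to see side by side. A small sketch on a made-up list:

```kotlin
val languages = listOf("java", "kotlin", "javascript")

fun main() {
    println(languages.single { it == "kotlin" })               // kotlin
    println(languages.singleOrNull { it.startsWith("java") })  // null, two elements match
    println(languages.firstOrNull { it == "scala" })           // null, nothing matches
    println(languages.getOrElse(10) { "NO LANGUAGE" })         // NO LANGUAGE
    println(languages.getOrNull(1))                            // kotlin

    // single throws when more than one element matches the predicate
    val failure = runCatching { languages.single { it.startsWith("java") } }
    println(failure.exceptionOrNull() is IllegalArgumentException) // true
}
```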

Aggregation operations count, average, min, max

Again, intuitively those operations perform aggregations:

val avgScore = pupils.map { it.score }.average()
val topStudent = pupils.maxByOrNull { it.score } // Kotlin 1.4+, null for an empty list
val challengingStudent = pupils.minByOrNull { it.score }

Conditional operations all, none, count, any

val numberOfTopStudents = pupils.count { it.score > 4.5 }
val allPassed = pupils.all { it.score > 2.0 }
val hasNeedleInHaystack = heap.any { it.item == NEEDLE }
val allGood = results.none { it.error != null }
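A runnable variant of the same checks. The Pupil class and the sample data are made up for the example; the last lines show how these operations behave on an empty collection, which is easy to get wrong.

```kotlin
data class Pupil(val name: String, val score: Double)

val pupils = listOf(Pupil("Ann", 4.8), Pupil("Bob", 3.1), Pupil("Cid", 1.9))
val nobody = emptyList<Pupil>()

fun main() {
    println(pupils.count { it.score > 4.5 })  // 1
    println(pupils.all { it.score > 2.0 })    // false, Cid is below 2.0
    println(pupils.any { it.score > 4.5 })    // true
    println(pupils.none { it.score < 1.0 })   // true

    // On an empty collection all and none are true, any is false
    println(nobody.all { it.score > 2.0 })    // true
    println(nobody.none { it.score > 2.0 })   // true
    println(nobody.any { it.score > 2.0 })    // false
}
```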

List to Map transformation associate*, groupBy

Both operations produce a Map; they differ in how key collisions are handled. While associate* simply overwrites the existing value for the key, groupBy adds the value to the list of values for that key:

val usersById = users.associate { it.id to it } // result type: Map<UserId, User>
val sameUsersById = users.associateBy { it.id } // same output
val pupilsByScore = pupils.groupBy { it.score } // result type: Map<Double, List<Pupil>>
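The collision behaviour shows up as soon as two elements share a key. A small sketch with a made-up User class and a deliberately duplicated id:

```kotlin
data class User(val id: Int, val name: String)

val users = listOf(User(1, "ann"), User(2, "bob"), User(1, "ann-duplicate"))

// associateBy keeps only the last element seen for each key
val byId: Map<Int, User> = users.associateBy { it.id }

// groupBy keeps every element, collected into a list per key
val grouped: Map<Int, List<User>> = users.groupBy { it.id }

fun main() {
    println(byId[1]?.name)               // ann-duplicate
    println(grouped[1]?.map { it.name }) // [ann, ann-duplicate]
    println(grouped[2]?.map { it.name }) // [bob]
}
```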

Many more

There are many more functional operations over collections available in the Kotlin stdlib, such as fold, reduce, minus (-), plus (+), contains (in), etc.:

// result - a list with the users from both sources
val allUsers = fbUsers + twitterUsers

// result - elements of allUserIds which are not in bannedUsersIds
val activeUserIds = allUserIds - bannedUsersIds

// result - the longest name
val longestName = names.reduce { longest, item -> if (longest.length < item.length) item else longest }

// result - the length of the longest name (needs import kotlin.math.max)
val longestLength = names.map(String::length).fold(0, ::max)

// result - if Wally was there
val isWallyLovesKotlin = "Wally" in kotlinLovers
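One practical difference between reduce and fold is the empty collection: reduce has no initial value and throws, while fold simply returns its initial value. A minimal sketch, with function names made up for the example:

```kotlin
import kotlin.math.max

// reduce starts from the first element, so it needs at least one
fun longestName(names: List<String>): String =
    names.reduce { longest, item -> if (longest.length < item.length) item else longest }

// fold starts from the provided initial value, so an empty list is fine
fun longestLength(names: List<String>): Int =
    names.map(String::length).fold(0) { acc, length -> max(acc, length) }

fun main() {
    val names = listOf("Al", "Barbara", "Cy")
    println(longestName(names))          // Barbara
    println(longestLength(names))        // 7
    println(longestLength(emptyList()))  // 0
    println(runCatching { longestName(emptyList()) }.isFailure) // true
}
```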

Those extension functions are very intuitive and widely used; together they can cover most everyday tasks.

Conclusion

Kotlin collection functions provide a lot of flexibility to express your ideas and business logic in a very concise, clear and functional way.

Hopefully you found this article useful; please check out the other posts under the #kotlin-showoff hashtag.

Project update: Alexa London Bus Stop

A while ago I published a post about the first skill for Alexa I developed. Personally, I have used it practically every day since then and I find it very useful. I didn't even bother to check the analytics since, well, it does work for me and I expected people to use it as well if it's useful.

Thanks to my wife, I recently learnt that the London Bus Stop skill is:

  • still in the top 30 skills in the area, judging by the $100 credit for AWS I'm receiving every month;
  • not listed anymore! That fact slipped through the cracks!

Read more

Functional kotlin part 3: scoping functions

In part 3 of this series of posts about kotlin we are going to look into some of the most intensively used kotlin extension functions from the standard library - they allow writing very expressive, safe and functional-looking code.

For folks who got lost on the words "extension functions" - it's a way to attach a function or property to instances of existing classes. For example, val d = 10.twice(). It's very much like a classic Java Util class with a method twice(int), but done in a very clean way. Visually it looks like you're calling a member of the class, but in reality the compiler calls your function, passing the receiver as an argument.
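A minimal sketch of the twice() example from the paragraph above; the isEven property is an extra made up for illustration.

```kotlin
// Extension function: inside the body `this` is the receiver, here the Int it is called on
fun Int.twice(): Int = this * 2

// Extension property: it has no backing field, only a getter
val Int.isEven: Boolean
    get() = this % 2 == 0

fun main() {
    val d = 10.twice()
    println(d)          // 20
    println(d.isEven)   // true
    println(7.isEven)   // false
}
```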

Read more

Functional kotlin part 2: elvis operator

Continuing the #kotlin-showoff series of posts about functional constructions in kotlin, I want to demonstrate the use of the elvis operator.

Essentially, the elvis operator lvalue ?: rexpression returns the left value if it's not null, or evaluates rexpression otherwise. The crazy thing about kotlin is that most constructions are expressions, and that gives another way to express business logic.
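A short sketch of what that looks like in practice. The function names and the default port are made up for the example. Because return and throw are expressions too, they can stand on the right-hand side of ?:.

```kotlin
// Plain fallback value
fun displayName(nickname: String?): String = nickname ?: "anonymous"

// return and throw on the right-hand side of the elvis operator
fun parsePort(raw: String?): Int {
    val text = raw ?: return 8080
    return text.toIntOrNull() ?: throw IllegalArgumentException("Not a port: $text")
}

fun main() {
    println(displayName(null))    // anonymous
    println(displayName("wally")) // wally
    println(parsePort(null))      // 8080
    println(parsePort("443"))     // 443
}
```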

Read more

Functional kotlin part 1: safe calls

For a seasoned Java developer it's very easy to switch to kotlin. Even more, thanks to the great effort of the JetBrains team on java interop, there is no need to wait for a greenfield project to start writing kotlin code. You can start koding straight away, either by implementing new functionality in kotlin or by converting existing classes into the new language with the IntelliJ IDEA automagic converter.

This is the first in a series of posts united by the tag #kotlin-showoff

Read more

Presentation – GCP APIs with kotlin

I was invited to give a talk as part of kotlin/everywhere at the GDG Cloud London meetup on June 8th 2019. Unlike previous talks, in this one I focused on the live coding part after a brief intro to the language.

The demo project I prepared is a web site allowing a user to upload pictures into GCP Storage, automatically annotate the content using the Vision API, and synthesise a voice which describes the content of the image.

Read more

Dynamically typed languages are selling snake oil

I truly believe they are

I hear the same statements and misunderstandings over and over again from people who like dynamically typed languages. Obviously, that sparks a lot of endless conversations and fights between the two camps.

Generally, I avoid conversations about static vs dynamic typing, but every once in a while I drift into one and hear the same statements, all the time. Often both sides just aren't able to listen to each other, and thus the conversation ends up in a dead end.

Read more

Update: PronounceMe

It's been 3 months since I announced the PronounceMe project I was working on at the beginning of 2019.

The initial approach was simple - build and run the MVP, see if it gets some organic traction. MVP included:

  • Written expectations and desirable figures
  • Generator engine - core which renders videos
  • Endless data source - the video production process should never stop
  • Basic internal analytics for the metrics I focus on
  • An autonomously deployed system which restarts itself if something breaks

Read more