A colleague recently needed to transcribe some recorded interviews. They used AWS Transcribe, which outputs a JSON file that is not easy to use directly. Luckily, a few open source tools have popped up to make that output more legible. I helped turn the results into a docx and wanted to document the process for my colleague and anyone else interested.
I found kibaffo33's tool, tscribe, so this is basically a guide to getting it running and using it in its most basic form to process a few files.
I already had Homebrew installed. If you don't have it, go get it now.
Get python3 and pip set up
First, I was on a freshly installed Catalina machine. It ships with python2, but pip complains about that (as it should), so I needed to get python3 installed. Following an OpenSource.com article, I started by installing pyenv:
$ brew install pyenv
The OpenSource.com article suggests using pyenv to install 3.7, but I wanted the latest, so first I checked which versions were available:
$ pyenv install -l | grep 3
I found that 3.8.2 is an acceptable argument, so I did:
$ pyenv install 3.8.2
$ pyenv global 3.8.2
I'm still on bash, so I did this to get bash set up:
$ echo -e 'if command -v pyenv 1>/dev/null 2>&1; then\n eval "$(pyenv init -)"\nfi' >> ~/.bash_profile
Then I closed the terminal window and reopened it so that the bash profile change would take effect.
Finally, to install the package:
$ pip install tscribe
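If you want to double-check the setup before moving on, a minimal sketch like the one below can confirm that the active interpreter is Python 3 and that the package can be found. The function name environment_ready is my own invention for illustration, not part of tscribe.

```python
import importlib.util
import sys


def environment_ready(package="tscribe"):
    """Return True if this is Python 3 and `package` can be imported.

    `environment_ready` is a hypothetical helper name, not a tscribe function.
    """
    on_python3 = sys.version_info.major >= 3
    # find_spec returns None when a top-level package is not installed
    installed = importlib.util.find_spec(package) is not None
    return on_python3 and installed
```

Save it to a file, add print(environment_ready()) at the bottom, and run it with python. If it prints False, the pyenv version is probably not active in this terminal yet, or the pip install went to a different Python.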
Using tscribe to transform AWS json to a docx
I had several JSON files named things like KK.json. To turn one of these into a docx file, the process is to:
Type python to start up the Python command line.
Type import tscribe to load the tscribe library.
Type tscribe.write("KK.json") to get your computer to process the file.
By default, that command writes a docx file named after the Job Name that was given to AWS, so in this screenshot it is KK_CDS.docx.
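With more than a couple of files, typing each one by hand gets tedious. Here is a rough sketch of how the same call could be looped over every JSON file in a folder. It assumes tscribe.write accepts a save_as argument, as I understand it from the tscribe README, and uses that to name each docx after its input file instead of the Job Name. The convert_all function and its writer parameter are hypothetical names of mine, not part of tscribe.

```python
from pathlib import Path


def convert_all(folder=".", writer=None):
    """Convert every .json file in `folder` to a .docx with the same base name.

    `convert_all` and `writer` are illustrative names, not part of tscribe.
    `writer` defaults to tscribe.write; passing another callable lets you
    try the loop without tscribe installed.
    """
    if writer is None:
        import tscribe  # imported here so the function can be defined without it

        writer = tscribe.write

    converted = []
    for json_path in sorted(Path(folder).glob("*.json")):
        docx_path = json_path.with_suffix(".docx")
        # Assumption: save_as overrides the default Job Name based filename
        writer(str(json_path), save_as=str(docx_path))
        converted.append(docx_path.name)
    return converted
```

Calling convert_all() from the Python command line, in the folder that holds the JSON files, should leave a KK.docx next to KK.json and so on for the rest. Note that it will try to process any .json file in the folder, so keep only AWS transcripts in there.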