Python & Redash make you cruise!

Signalflow · 5 min read · Nov 11, 2019

99 Problems, but no more “Signal 9 Error”-blues

In short: This post is about processing large amounts of data with Python on a remote AWS RDS cluster via Pandas and PyMySQL.

The Problem:

I love and often use Redash for various reasons, but this week I had to solve a particular problem that Redash was not designed to handle well (which is totally understandable). My problem was:

How to query a relatively large amount of data from AWS and slice it in various ways (without getting a signal 9 error).

SIGKILL-error

After reading the Redash documentation, I figured out that it was Redash's query runner that ran out of memory and caused the signal 9 error (#2 in the list below):

  1. Your database — Redash passes the query as is and it’s your database’s responsibility to handle your query
  2. Memory of the query runner (that’s us) — there needs to be enough memory to handle the result set. In this case, if the result set is too large, the query might fail and you should see an error.
  3. Network — some really large queries might take a while to download. While downloading the results, the UI might still say that the query is being executed while in fact, it’s just waiting for the data to reach your browser.

The Solution:

In summary, the steps I took were:

  • Connect securely to the AWS server and database
  • Query the RDS MySQL / Aurora read replica
  • Write the output to one or more CSV files

1. Connect to AWS via Secure Shell (SSH)

Server Details

You need exactly the same details as when you add a new data source in the Redash dashboard (view screenshot below), but instead of using your regular AWS RDS host URL (i.e. <unique-id>.<region>.rds.amazonaws.com:3306), you will use localhost and port 3308. Get these details from your friendly sysadmin; we will use them in a bit (get it? "bit" - hahaha! 😇).

SSH Keys

Create a private/public key pair by following this great tutorial by the folks at GitHub / GitLab. You will then have to copy your public key and send it to your sysadmin over a secure channel. The public and private key will pair up and allow you to connect securely to the remote server.
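
If you just want the terminal commands, here is a minimal sketch (assuming an ed25519 key; swap in your own email address):

$ ssh-keygen -t ed25519 -C "joe@example.com"
$ cat ~/.ssh/id_ed25519.pub  # the public key you send to your sysadmin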

SSH Configs

Create or update your config file using nano in the terminal. You will have to create the config file if you don't already have one. It will contain all the necessary details of your AWS server.

# To print:
$ cat ~/.ssh/config
# To edit:
$ nano ~/.ssh/config

Add the following lines to your config-file and save it before exiting.

Host dev-tools
  Hostname 20.200.20.20
  User joe
  ForwardAgent yes
  TCPKeepAlive yes
  StrictHostKeyChecking no
  GlobalKnownHostsFile /dev/null
  UserKnownHostsFile /dev/null
  LocalForward 3308 <unique-id>.<region>.rds.amazonaws.com:3306

After saving your config-file, type in the following command and hit enter. The -N flag tells SSH not to run a remote command, keeping the session open purely for the tunnel defined by the LocalForward line above:

$ ssh -N dev-tools

The command will print a warning confirming that the host's IP address was added to the known hosts list:

....
....
Warning: Permanently added '20.200.20.20' (ECDSA) to the list of known hosts.

White-listing

Typically, you could also white-list a specific IP in the AWS console so it can connect to a specific server over SSL (i.e. notifying AWS that Redash, connecting from IP 52.71.84.157, will connect to it).
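
For reference, the equivalent inbound rule can be added with the AWS CLI; this is a sketch I'm adding (the security-group ID is hypothetical, yours will differ):

$ aws ec2 authorize-security-group-ingress \
    --group-id sg-0123456789abcdef0 \
    --protocol tcp --port 3306 \
    --cidr 52.71.84.157/32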

Troubleshooting:

In the beginning I had some errors and had to troubleshoot the connection. I will add more details later, but essentially, I installed telnet, which helped sort out the error by giving us an output log to work with.

$ brew install telnet
....
....
$ telnet localhost 3308
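
If telnet is not available, netcat performs the same reachability check on the tunnel's local port (an alternative I'm adding, not part of the original setup):

$ nc -vz localhost 3308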

2. Query the RDS MySQL / Aurora Read Replica

Venv

We will use Python within a virtual environment to connect over SSH. You can set up a virtual environment with PyCharm (or your favourite IDE), like below:
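
If you prefer the plain terminal over an IDE, a minimal sketch:

$ python3 -m venv venv
$ source venv/bin/activate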

Install the following two libraries in your venv, either via PyCharm's Preferences > Project Interpreter, or from PyCharm's terminal:

pip install pandas pymysql

Note: There are also alternative libraries you can use in your script, depending on the database type you connect to.

Script

After installing the libs, create the Python script that will connect over SSH to your database (ensure that the details align with the details in your SSH config-file).

#### query.py ####
import pandas as pd
import pymysql

# Connection details: these must point at the local end of the SSH tunnel
host = "127.0.0.1"
port = 3308
dbname = "...."
user = "...."
password = "...."

# Open the connection through the tunnel
conn = pymysql.connect(host=host, user=user, port=port, password=password, database=dbname)

q = '''SELECT *
FROM accounts
LIMIT 10;'''

# Read the result set straight into a dataframe
df = pd.read_sql(q, con=conn)
print(df.head(10))

The UI should look like this:

3. Write the Output to a CSV-file

CSV-creation

Add the following line of code at the bottom of your script and execute it again.

df.to_csv('dataset.csv')

After running query.py, you will have to wait a couple of minutes, after which a .csv file gets created with your query's output. You can also do various other manipulations on the dataframe object (i.e. df) in the same run. For example, with the 3 lines of code below you can create three different CSV samples for further exploratory analysis. This would have taken way longer to create in Excel.

df[:10000].to_csv('10000.csv')
df[10000:50000].to_csv('10kto50k.csv')
df[60000:100000].to_csv('60kto100k.csv')
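
If the full result set is too large to hold in memory at once (the very problem behind the signal 9 error), pandas can also stream it in chunks and append each one to a CSV. A minimal sketch, reusing the conn object from query.py (the filename is just an example):

# Stream the query result in chunks of 50,000 rows
for i, chunk in enumerate(pd.read_sql("SELECT * FROM accounts;", con=conn, chunksize=50000)):
    # Write the header only for the first chunk, then append
    chunk.to_csv('accounts_full.csv', mode='a', header=(i == 0), index=False)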

When you are done running your Python script, you can safely close the connection by hitting CTRL+C, and you will see the following output in the terminal:

....
....
packet_write_wait: Connection to 20.200.20.20 port 22: Broken pipe
$

Conclusion

This tutorial only scratched the surface of what is possible once you can securely access your AWS cluster. You can also:

  • Run way more complex queries with a higher volume of rows, and consider pickling the data for faster processing (see the sketch after this list).
  • Use Python libraries and functions for further analysis: numpy, scipy, scikit-learn, NLTK, etc.
  • Plot your data with matplotlib or plotly.
  • Create reports with Jinja and PDFkit and email them to key stakeholders.
  • Create a simple roll-up reporting data warehouse.
  • Put your queries under git version control.
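
Pickling, as mentioned in the first bullet, keeps the dataframe's dtypes intact and reloads much faster than re-parsing a CSV. A minimal sketch:

# Save the dataframe in pandas' binary pickle format
df.to_pickle('dataset.pkl')
# ...and load it back later without re-running the query
df = pd.read_pickle('dataset.pkl')
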
Time to open up a terminal and surf!

Who in this world does not like feedback?

  • ⚡Claps: If you loved my post, consider showing some support by clapping once or twice. If you really liked the post, consider clapping until you can’t clap no more :)
  • ⚡Comments: Is there something I could improve or perhaps address in a next post? Let your fingers do the talking and please drop me a comment below.
  • ⚡Followers: If you loved this post, please consider following me on Medium and Twitter for future posts on various indie tasks, workflows, and open source tech.
  • ⚡Readers: If you found this article helpful, please feel free to also scan over some of my other workflows.

Written by Signalflow

Insightful tasks to optimise your startup's workflows.
