Leeching an FTP with Python

TL;DR

This script will leech all the files from a folder on an FTP server. It’s especially appropriate for dealing with an enormous number of files – hundreds of thousands or even millions.

My FTP issues

I set up an IP security camera to save an image to my shared-hosting FTP whenever it detected any movement. The camera was cheap, old and not that accurate, so it saved lots of photos – more than half a million.

I needed to delete most of these photos but not all of them. Some of the photos captured interesting moments and I wanted to save these. I couldn’t just delete the folder; I had to download and check every photo. Checking the photos wasn’t as difficult as it might sound. Most of them were either almost identical or completely empty (zero bytes). Downloading them was the problem.

The 10000 limit
If you’re using shared hosting it’s likely that your FTP server is limited. By default most shared FTP servers are probably running Pure-FTPd, which has a default listing limit of 10,000 files. This means you can only list 10,000 files at a time, and you will not even know how many files are in there.
You can probably ask your hosting provider to increase the “ftp recursion limit”, but I’m not sure they will be willing to raise it high enough.
Anyway, it’ll be cumbersome to deal with so many files, and most FTP clients will freeze even trying to list fewer than 10k. I’ve tried a few on different OSs; eventually FileZilla for Windows seemed to be the best. But still, dealing with so many files was an extremely tedious process. And it’s even worse since I can’t tell how many files are in that folder and how many times I will have to repeat this tedious process:

  1. Connect to the server and list the folder – 10 minutes for listing 10,000 files
  2. Even FileZilla failed to list the files from time to time – try again
  3. Move all the listed files to my local folder – ~1-2 hours for very small files
  4. If it fails for some reason – Repeat all steps
  5. If it succeeded – Repeat all steps

Overall, if I wanted to do it manually I would have to repeat that annoying process hundreds of times (including failures).

Python to the rescue

Obviously Python is an amazing scripting language (and beyond). Writing a Python script to leech an FTP is easy and straightforward. I had a working PoC in just a few lines of code after looking at the ftplib docs.
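To give a feel for how little code such a PoC needs, here is a minimal sketch built on ftplib. It is not the actual leacher.py: the function names are made up for illustration, and it assumes that each file is deleted from the server once it has been downloaded, so that the next (capped) listing returns a fresh batch.

```python
import os
from ftplib import error_perm


def leech_batch(ftp, local_dir, delete_remote=True):
    """Download every file the server lists in the current folder.

    `ftp` is a connected, logged-in ftplib.FTP object.
    Returns the number of files downloaded.
    """
    os.makedirs(local_dir, exist_ok=True)
    try:
        names = ftp.nlst()  # the server may cap this, e.g. at 10,000 names
    except error_perm:
        names = []  # assumption: some servers answer 550 for an empty folder
    count = 0
    for name in names:
        if name in (".", ".."):
            continue
        local_path = os.path.join(local_dir, os.path.basename(name))
        with open(local_path, "wb") as fh:
            ftp.retrbinary("RETR " + name, fh.write)
        if delete_remote:
            # Assumption: removing the file lets the next listing show new ones.
            ftp.delete(name)
        count += 1
    return count


def leech_folder(ftp, remote_dir, local_dir):
    """Keep leeching batches until the folder lists as empty."""
    ftp.cwd(remote_dir)
    total = 0
    while True:
        moved = leech_batch(ftp, local_dir)
        if moved == 0:
            return total
        total += moved


# Usage, with placeholder host, credentials and paths:
#   from ftplib import FTP
#   with FTP("ftp.example.com") as ftp:
#       ftp.login("user", "password")
#       print(leech_folder(ftp, "/camera", "photos"))
```

Passing the connection in as a parameter keeps the loop easy to try against a stub object before pointing it at a real server.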

Eventually I added a few extra features like logging and retries. But not too many: it was supposed to be quick and save me time – and it definitely did :)
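Logging and retries can be added with a small wrapper. The sketch below is only an illustration of the idea, not the code from leacher.py; the helper names, the retry count and the delay are all assumptions.

```python
import ftplib
import logging
import time

log = logging.getLogger("leacher")


def setup_logging(path="leacher.log"):
    """Send log records to a file (the file name is taken from the post)."""
    handler = logging.FileHandler(path)
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    log.addHandler(handler)
    log.setLevel(logging.INFO)


def with_retries(action, attempts=3, delay=2.0, sleep=time.sleep):
    """Call action(); on an FTP or network error, log it and try again.

    Re-raises the last error once `attempts` tries have failed.
    """
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except ftplib.all_errors as exc:
            log.warning("attempt %d/%d failed: %s", attempt, attempts, exc)
            if attempt == attempts:
                raise
            sleep(delay)


# Usage, with a hypothetical download function:
#   with_retries(lambda: download_one(ftp, "photo_0001.jpg"))
```

ftplib.all_errors is a tuple of the exceptions ftplib can raise, so unrelated bugs (a TypeError, say) still surface immediately instead of being retried.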

To make it even more robust, one could consider adding things like full tree traversal (leech the whole FTP and not just one folder), multithreading to download over multiple connections simultaneously, and more. All of these were beyond the scope of this script but would be relatively easy to add in Python.

The Supervisor

While the leacher.py script is supposed to run continuously until it leeches the whole folder, it might still break in some cases. And on the Mac the machine goes to sleep and stops the script.

That’s why I added this supervisor.py script which runs the leacher.py again if it breaks and also prevents the system from sleeping on the Mac.
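The idea can be sketched as follows. This is not the actual supervisor.py: it assumes the sleep prevention is done by wrapping the command in macOS’s caffeinate utility, and the function names, restart cap and pause are made up for illustration.

```python
import subprocess
import sys
import time


def build_command(script, platform=sys.platform):
    """Build the command line that runs `script` with this Python."""
    cmd = [sys.executable, script]
    if platform == "darwin":
        # Assumption: `caffeinate -i` keeps the Mac from idle-sleeping
        # for as long as the wrapped command is running.
        cmd = ["caffeinate", "-i"] + cmd
    return cmd


def supervise(script, max_restarts=100, pause=5.0,
              run=subprocess.call, sleep=time.sleep):
    """Re-run `script` until it exits with status 0.

    Returns the number of runs it took; raises if it never succeeds.
    """
    cmd = build_command(script)
    for attempt in range(1, max_restarts + 1):
        if run(cmd) == 0:
            return attempt
        sleep(pause)  # give the server a moment before reconnecting
    raise RuntimeError("%s kept failing after %d runs" % (script, max_restarts))


# Usage (for example as the body of a supervisor script):
#   supervise(sys.argv[1])
```

Capping the number of restarts avoids hammering the server forever if the script fails for a permanent reason such as a wrong password.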

How to

Download both leacher.py and supervisor.py from here. Edit the params at the bottom of leacher.py with your FTP info.

Run like this: python supervisor.py leacher.py

Watch it leech. Check leacher.log for more verbose info.

Summary

Overall the leacher.py script processed about half a million files and saved me a lot of time and annoyance.

Python is awesome!

Encode and Decode URLs in Python for Google Appengine

While developing in Python for Google Appengine you might want to encode or decode URLs. It sounds like a simple task, as it is in many other languages. Somehow in Python 2.5.x, which is the version supported by Appengine, it’s not as straightforward; at least it wasn’t for me. There are tons of solutions, suggestions and examples out there, and not all of them work as expected.

After some trial and error it finally worked:

import urllib

text = 'some text'

# decodeURI: unquote the ASCII bytes, then decode the result as UTF-8
text = urllib.unquote(text.encode('ascii')).decode('utf-8')

# encodeURI: encode to UTF-8 bytes before quoting
text = urllib.quote(text.encode('utf-8'))

Might spare you some time.
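For comparison, here is a sketch of the same two operations on Python 3, where these functions live in urllib.parse and handle UTF-8 by default. The wrapper names encode_uri and decode_uri are made up for illustration, and this does not apply to the Python 2.5 runtime discussed above.

```python
from urllib.parse import quote, unquote


def encode_uri(text):
    """Percent-encode a str; non-ASCII characters are encoded as UTF-8."""
    return quote(text)


def decode_uri(text):
    """Reverse encode_uri; percent-escapes are decoded as UTF-8."""
    return unquote(text)


print(encode_uri("some text"))   # some%20text
print(decode_uri("caf%C3%A9"))   # café
```

Note that quote leaves "/" unescaped by default; pass safe="" if slashes should be encoded as well.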

If you’re going to work with unicode in your Appengine app then you’re in for some other troubles. This presentation, and this article (and its comments) might help a bit.