Project Announcements

One of the biggest issues on any system of mine is a cluttered Downloads folder. In the modern Internet age, we download a ton of files. Checking right now, the contents of my Downloads folder total about 15 GB. Although that probably isn’t much of an issue given how cheap storage is today, it is still troublesome when searching through the files by hand.

When I only ran Windows, I used Cyber-D’s Autodelete to delete my old downloads. This worked perfectly for me, except that it would sometimes delete files I had forgotten I still wanted. But on Windows, I could always find those files in the Recycle Bin.

Fast forward several years, and I now run Linux as my primary operating system. Without doing much research to see if a program already existed, I drafted a “spec” for download-sweeper, a program that would delete old files in the Downloads directory, but also give the user a “grace period” during which removed files could still be recovered.

Sure, this could probably be created in less than 100 lines of C, but I was looking to create a robust and portable solution that would allow me to quickly make changes if I needed to. Thus, I created a clean (~350 lines) Python solution that would do just that.
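
To make the idea concrete, here is a minimal sketch of the "sweep with a grace period" behaviour described above. It is not the actual download-sweeper code (see the GitHub page for that). The function name `sweep`, the holding directory, and both thresholds are assumptions made up for illustration.

```python
import os
import shutil
import time
from pathlib import Path
from typing import Optional

SECONDS_PER_DAY = 24 * 60 * 60


def sweep(downloads: Path, holding: Path, max_age_days: float,
          grace_days: float, now: Optional[float] = None) -> None:
    """Hypothetical sketch: move stale downloads to a holding directory,
    and permanently delete held items once their grace period is over."""
    now = time.time() if now is None else now
    holding.mkdir(parents=True, exist_ok=True)

    # Step 1: purge held items whose grace period has run out.
    # An item's mtime is reset when it is moved in (step 2), so here it
    # measures time spent in holding. Symlinks are ignored for simplicity.
    for entry in list(holding.iterdir()):
        if entry.is_symlink():
            continue
        if now - entry.stat().st_mtime > grace_days * SECONDS_PER_DAY:
            if entry.is_dir():
                shutil.rmtree(entry)
            else:
                entry.unlink()

    # Step 2: move downloads older than max_age_days into holding.
    for entry in list(downloads.iterdir()):
        if entry == holding or entry.is_symlink():
            continue
        if now - entry.stat().st_mtime > max_age_days * SECONDS_PER_DAY:
            target = holding / entry.name
            if target.exists():
                # Naive collision handling, good enough for a sketch.
                target = holding / f"{entry.name}.{int(now)}"
            shutil.move(str(entry), str(target))
            os.utime(target, (now, now))  # start the grace-period clock
```

Calling something like `sweep(Path.home() / "Downloads", Path.home() / ".download-holding", max_age_days=30, grace_days=7)` once a day would give roughly the recycle-bin behaviour I missed from Windows: nothing is really gone until it has sat in the holding directory for a week.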

Further, when integrated with systemd, everything works perfectly. I currently have the application deployed on my laptop, my desktop, and the webserver that’s running this blog. Of course, there aren’t any downloads on the webserver, so there I use it as a “virus/malware quarantining tool”.

You can view the project on its GitHub page, here: https://github.com/brandonio21/download-sweeper

Recently, during University of California, San Diego’s quarterly Beginner’s Programming Competition, we were asked whether we could automate the scoring process. We hand out balloons to contestants who complete a certain number of problems, and since we use HackerRank as our competition management solution, we wanted a way to get contestants’ scores from the HackerRank leaderboard. With HackerRank’s API being outdated and incomplete, this was a perfect case for web scraping.

Basically, web scraping is useful whenever you need information from a webpage but the site offers no official way to access it. A script acts like a normal user browsing the web and “scrapes” the information from the page. Usually, this is not a problem: send an HTTP GET request to the page with a utility like cURL, then parse the returned source code.
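
As a rough illustration of that "GET, then parse" flow, here is a small sketch that uses only the Python standard library instead of cURL. The names `fetch`, `extract_links`, and `LinkCollector` are made up for this example. Collecting links is just a stand-in for whatever data you actually want to pull out of the page.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag in the markup."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html: str) -> list:
    """Parse already-downloaded HTML and return the links found in it."""
    parser = LinkCollector()
    parser.feed(html)
    return parser.links


def fetch(url: str, timeout: float = 10.0) -> str:
    """Send an HTTP GET request and return the response body as text."""
    with urlopen(url, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


if __name__ == "__main__":
    # Works on static markup; no JavaScript is executed at any point.
    print(extract_links('<p><a href="/about">About</a></p>'))
```

Splitting the download from the parsing keeps the parsing step easy to try out on a saved copy of the page before pointing it at a live site.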

The problem arises when the page builds its content dynamically with client-side JavaScript. Tools like cURL do not execute scripts, so they return only the static markup the server sends, without anything that is rendered afterwards. This is where SourceGetter comes into play.

SourceGetter is a simple Node.js servlet running PhantomJS that renders a URL passed as part of the request URL and returns the rendered source code. Thus, if you simply use cURL through the SourceGetter servlet, you will get the rendered code.

In order to use it, take a look at the following examples:

Getting the source code of bpforums.info:
https://brandonio21.com:3000/bpforums.info

Getting the source code of google.com:
https://brandonio21.com:3000/google.com

Basically, the servlet runs on port 3000 of this website, https://brandonio21.com:3000, so all HTTP GET requests need to be sent there. Don’t worry, we’re not tracking any of your data. Everything is serve-and-forget. Here is a bit of useful information:

Getting the rendered source code of a URL

HTTP GET to https://brandonio21.com:3000/URL

Getting the rendered source code of a URL under HTTPS

HTTP GET to https://brandonio21.com:3000/+URL
Putting a “+” in front of the URL indicates that the page should be fetched over HTTPS.

Getting the rendered source code of a web-page with slashes

HTTP GET to https://brandonio21.com:3000/URL+WITH+SLASHES
Simply replace the slashes in the URL with “+” signs. A trailing slash is dropped, as in the examples below.

Examples

Getting the rendered source for http://hackerrank.com/contests/ucsd-wic-bpc-wi15/leaderboard/
HTTP GET to https://brandonio21.com:3000/hackerrank.com+contests+ucsd-wic-bpc-wi15+leaderboard

Getting the rendered source for https://hackerrank.com/contests/ucsd-wic-bpc-wi15/leaderboard/

HTTP GET to https://brandonio21.com:3000/+hackerrank.com+contests+ucsd-wic-bpc-wi15+leaderboard
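
If you would rather not build these request URLs by hand, the rules above can be sketched as a small Python helper. The function name `build_sourcegetter_url` is made up for illustration and is not part of SourceGetter itself. This post does not say how query strings or fragments should be encoded, so the sketch rejects them rather than guess.

```python
from urllib.parse import urlsplit

SOURCEGETTER_BASE = "https://brandonio21.com:3000"


def build_sourcegetter_url(target: str, base: str = SOURCEGETTER_BASE) -> str:
    """Hypothetical helper: turn a page URL into a SourceGetter request URL."""
    # A bare host such as "google.com" is treated as plain HTTP.
    if "://" not in target:
        target = "http://" + target
    parts = urlsplit(target)
    if parts.scheme not in ("http", "https"):
        raise ValueError(f"unsupported scheme: {parts.scheme!r}")
    if parts.query or parts.fragment:
        # Assumption: the encoding of these is not described, so refuse.
        raise ValueError("query strings and fragments are not handled")

    # Drop leading and trailing slashes, then swap the rest for "+".
    path = parts.path.strip("/")
    location = parts.netloc + ("/" + path if path else "")
    encoded = location.replace("/", "+")

    # A leading "+" asks for the page to be fetched over HTTPS.
    prefix = "+" if parts.scheme == "https" else ""
    return f"{base}/{prefix}{encoded}"


if __name__ == "__main__":
    print(build_sourcegetter_url(
        "https://hackerrank.com/contests/ucsd-wic-bpc-wi15/leaderboard/"))
```

The resulting URL can then be requested with cURL or any HTTP client. Note that a page URL that already contains a “+” would be ambiguous under this scheme.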

Hopefully you enjoy this product. If you have any questions or comments, please ask!

brandonio21

Observations about life and software engineering


Introducing download-sweeper

Announcing SourceGetter
