Announcing SourceGetter

Recently, during the University of California, San Diego’s quarterly Beginner’s Programming Competition, we were asked whether the scoring process could be automated. We hand out balloons to contestants who complete a certain number of problems, and since we use HackerRank to manage the competition, we wanted a way to pull contestants’ scores from the HackerRank leaderboard. HackerRank’s API was outdated and incomplete, which made the leaderboard a natural candidate for web scraping.

Web scraping is useful whenever you need information from a webpage but the site offers no official way to access it. You write a script that acts like a normal user browsing the web and “scrapes” the information from the page. Usually this is straightforward: send an HTTP GET request to the page with a utility like cURL, then parse the returned source code.
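As a rough illustration of that GET-then-parse workflow, here is a minimal Python sketch that uses only the standard library. The names `fetch_source`, `extract_title`, and `TitleParser` are made up for this example, and pulling out the page title simply stands in for whatever information you actually want to scrape.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class TitleParser(HTMLParser):
    """Collects the text inside the first <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self._done = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_title(html):
    """Parse a page's source code and return its title text."""
    parser = TitleParser()
    parser.feed(html)
    return parser.title.strip()


def fetch_source(url, timeout=30):
    """Plain HTTP GET, like cURL: returns the source as served, no JavaScript is run."""
    with urlopen(url, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

Something like `extract_title(fetch_source("http://example.com"))` would then perform the whole scrape in one line.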

The problem arises when the page is rendered dynamically on the client, for example with JavaScript. Tools like cURL do not execute scripts or wait for the page to render; they return only the static source sent by the server. This is where SourceGetter comes into play.
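To make the problem concrete, here is a small Python sketch built around a made-up page (it is not HackerRank's real markup). The table rows only exist after a browser runs the script, so a parser looking at the static source finds none. `STATIC_SOURCE`, `RowCounter`, and `count_rows` are names invented for this illustration.

```python
from html.parser import HTMLParser

# Hypothetical page: the table body is empty in the served source and is
# only filled in once a browser executes the script.
STATIC_SOURCE = """
<html><body>
<table id="leaderboard"><tbody></tbody></table>
<script>
  var body = document.querySelector("#leaderboard tbody");
  var row = document.createElement("tr");
  row.textContent = "alice 5";
  body.appendChild(row);
</script>
</body></html>
"""


class RowCounter(HTMLParser):
    """Counts <tr> start tags that are present in the markup itself."""

    def __init__(self):
        super().__init__()
        self.rows = 0

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows += 1


def count_rows(html):
    counter = RowCounter()
    counter.feed(html)
    return counter.rows


# The static source contains no rows, because the script never ran.
print(count_rows(STATIC_SOURCE))
```

A scraper fed this source sees an empty leaderboard, which is exactly the situation a rendering step is meant to fix.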

SourceGetter is a simple Node.js servlet running PhantomJS. It renders the URL passed to it in the request path and returns the rendered source code. Thus, if you point cURL at the SourceGetter servlet instead of at the page itself, you get the rendered code.

In order to use it, take a look at the following examples:

Getting the source code of bpforums.info:
https://brandonio21.com:3000/bpforums.info

Getting the source code of google.com:
https://brandonio21.com:3000/google.com

The servlet runs on port 3000 of this website, https://brandonio21.com:3000, so all HTTP GET requests must be sent there. Don’t worry, we’re not tracking any of your data; everything is serve-and-forget. Here is a bit of useful information:

 

Getting the rendered source code of a URL

HTTP GET to https://brandonio21.com:3000/URL

Getting the rendered source code of a URL under HTTPS

HTTP GET to https://brandonio21.com:3000/+URL
Putting a “+” in front of the URL indicates that the web page should be fetched over HTTPS.

Getting the rendered source code of a URL containing slashes

HTTP GET to https://brandonio21.com:3000/URL+WITH+SLASHES
Simply replace the slashes with “+” signs.
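The three rules above can be combined into a small helper. The following Python sketch is one possible way to build the request URL; the function name `sourcegetter_url` is made up for this example. Dropping the trailing slash is inferred from the examples in this post, and query strings are ignored because the post does not say how the servlet treats them.

```python
from urllib.parse import urlsplit

SOURCEGETTER_BASE = "https://brandonio21.com:3000/"


def sourcegetter_url(target):
    """Translate a normal URL into the form the SourceGetter servlet expects."""
    # Treat scheme-less input such as "google.com" as plain HTTP.
    parts = urlsplit(target if "://" in target else "http://" + target)

    # A leading "+" asks the servlet to fetch the page over HTTPS.
    prefix = "+" if parts.scheme == "https" else ""

    # Assumption: leading and trailing slashes are dropped, as in the post's examples.
    path = parts.path.strip("/")
    location = parts.netloc + ("/" + path if path else "")

    # Every remaining slash becomes a "+" sign.
    return SOURCEGETTER_BASE + prefix + location.replace("/", "+")


print(sourcegetter_url("bpforums.info"))
```

The resulting URL can then be requested with cURL or any other HTTP client, just like an ordinary page.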

 

Examples

Getting the rendered source for http://hackerrank.com/contests/ucsd-wic-bpc-wi15/leaderboard/
HTTP GET to https://brandonio21.com:3000/hackerrank.com+contests+ucsd-wic-bpc-wi15+leaderboard

 

Getting the rendered source for https://hackerrank.com/contests/ucsd-wic-bpc-wi15/leaderboard/

HTTP GET to https://brandonio21.com:3000/+hackerrank.com+contests+ucsd-wic-bpc-wi15+leaderboard

 

Hopefully you find SourceGetter useful. If you have any questions or comments, please ask!
