In which, I am annoyed by copy protection

I’ve mentioned a few times recently that my wife has been coming up with reasons for me to create scripts.  Mostly, it’s because she wants local copies of fan-translated Chinese novels.

Personally, I don’t have a lot of interest in the content… but it’s great motivation.  I want her to be happy, I get to learn new stuff, I justify the existence of the new server that materialized in my server closet a couple of months ago, it all works out.

The last tool I made is one that crawls a web site and generates a list of links from the site.  She can import this list into a program on her iPad and it downloads all of the linked pages as local copies.  I’m not certain how they’re stored and I have no idea how to back them up or view them outside of this specific program, but I’m not really invested and as long as she’s happy then I’m happy.

Not long after I created this tool, she told me she was having trouble downloading links from a specific fan translator’s site.  I took a quick look at the site and could immediately understand why – it was purposefully obtuse.  There was virtually no HTML – rather, everything was being rendered via JavaScript.  They REALLY don’t want their content to be read anywhere but their approved page.

I looked at this, and it looked like, well, work.  Which, whenever possible, I try to avoid.

So I told her that, well, I was very sorry but there was not going to be a way to get the content from this page.

And we left it at that.

Then, a few days later, I was cleaning up some chaff in my development folder and noticed the temporary files that had been created when I was trying to parse the offending page.

And, for no real reason, I opened one up in a text editor.  I just wanted to get a second look at what it was doing.

And it made me angry.  

Like, fan translations are inherently illegal.  You can’t “own” them.  It’s polite to acknowledge that translating something from one language into another is an awful lot of effort, and to appreciate the fans that do this, but it doesn’t convey any sort of ownership of the original work.

The pages on this site were incredibly wasteful, and all of it done in the name of trying to prevent anyone from downloading the text.  We’re talking about a single page with 20k of text on it weighing in at 335k, and the specific translated novel that she wanted to read from this site running to more than 80 of these files.  It was just a crazy amount of wasted space and excess processing, and the more I looked at it the more I wanted to break it just for spite.

So, I did.

The first thing I tried was using curl or wget to download the site’s pages, but that just gave me the raw page source, which wasn’t very helpful.
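
To give a sense of it (the URL here is a placeholder, not the actual site), a plain fetch like either of these just hands back the page source, with the readable text hidden behind the JavaScript:

# rough sketch with a placeholder URL - you get the raw source, not the rendered chapter
curl -s "https://example.com/novel/chapter-1" -o chapter-1.html

# wget equivalent
wget -q -O chapter-1.html "https://example.com/novel/chapter-1"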

Enter Lynx.

Lynx, for people who may not have ever had to browse the web over a telnet connection – I imagine that is most people – is a text-only web browser designed for non-graphical connections.  It has a couple of interesting ways in which it can be used, as well – you can use it to get a list of all links on a page, and you can tell it to download the contents of a page and save it as a text file.
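
Both of those are just command-line switches; a quick sketch (again with a placeholder URL) looks something like this:

# list every link on a page, one per line, without the [1]-style numbering
lynx -dump -listonly -nonumbers "https://example.com/novel/"

# render a page as plain text, without the trailing list of link references
lynx -dump -nolist "https://example.com/novel/chapter-1"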

It’s also open-source!  So if, for some reason, a site were to try to detect it by user agent and block it, it’s easy to tell it to pretend that it’s Chrome or something.
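
Lynx has a -useragent switch for exactly that, so something along these lines (the browser string is just an example) would do it:

# masquerade as a desktop Chrome build; the exact string is arbitrary
lynx -dump -useragent="Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36" "https://example.com/novel/"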

Once I decided that it would make a good tool for the task, about an hour of shell scripting gave me a script that blows the site’s silly protection mechanisms out of the water and gives me a large text file containing the entire text of the fan translation.

And I feel very satisfied.

I will share the script here, though I warn you in advance that the only guarantee I give you is that it “works on my machine!”  If it does not work for you, or does something horrible, then I do not accept any responsibility.

#!/bin/bash
# getbody.sh - gets the content of a web page and its subpages and outputs it all into a text file.
# only crawls one level deep
# Takes a URL as input. If a second command line parameter is specified, it only follows links that include that parameter as a substring
# If the second command line parameter is the word "links" then it does not crawl, just prints the links to stdout and exits

# Check for a command line parameter. No validation, we're sending this to lynx as is.
if [ -z "$1" ]; then
echo "Usage: getbody.sh <url>"
exit 1;
fi

url="${1}"

# set up a unique filename with a time stamp. I should probably do this for my temp files as well, but lazy.
filename="web_scrape_$(date +"%Y%m%d%H%M%S").txt"

echo "Processing site: $url"

if [ -z "$2" ]; then
lynx $url -dump -hiddenlinks=ignore -listonly -nonumbers > raw_list_of_links.xyzzy
else
if [ "$2"="links" ]; then
lynx $url -dump -hiddenlinks=ignore -listonly -nonumbers
exit 1;
else
lynx $url -dump -hiddenlinks=ignore -listonly -nonumbers | grep "$2" > raw_list_of_links.xyzzy
fi
fi

# drop anything with a ? in it because I probably don't want whatever it returns
grep -v "?" raw_list_of_links.xyzzy > list_of_links.xyzzy

touch "$filename"
while read -r url; do
  echo "$url"
  lynx "$url" -dump -nolist >> "$filename"
done < list_of_links.xyzzy

rm list_of_links.xyzzy
rm raw_list_of_links.xyzzy
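
By way of example (the URL and the filter string are made up, not the real site), running it looks something like this:

# just print every link found on the index page
./getbody.sh "https://example.com/novel/" links

# crawl only the links containing "chapter" and dump their text into one file
./getbody.sh "https://example.com/novel/" chapter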

 
