Medicine for the Sky: Screen-scraping a .NET site with App Engine/Java

My impetus for developing this was the desire to populate a hidden page on my blog with live data from a pair of funds, namely an ebay merchant account used as a rollover fund for donations, and a 529 account from an Ohio provider.

There is an excellent API for ebay merchant accounts that I was able to integrate quickly with App Engine. The blog page performs a poor man's AJAX call to an appspot.com servlet, which in turn queries the ebay API for a current balance, and in the end the dollar amount is returned to the blogspot XMLHttpRequest, where a simple DOM .innerHTML replace populates the browser window with the balance. Here's a quick rundown of the code:

Blogger hooks

Rollover account balance: <span id="pcfBalance">??</span><br />
<script>
var xmlhttp = new XMLHttpRequest();
xmlhttp.onreadystatechange = function() {
  if(xmlhttp.readyState == 4){
    document.getElementById('pcfBalance').innerHTML = xmlhttp.responseText;
  }
};
xmlhttp.open("GET","http://[my application].appspot.com/pcfbal",true);
xmlhttp.send(null);
</script>

App Engine servlet

public void doGet(HttpServletRequest req, HttpServletResponse resp)
    throws IOException {

  HTTPRequest hreq = new HTTPRequest(new URL("https://api-3t.paypal.com/nvp"), HTTPMethod.POST);
  hreq.setPayload("METHOD=GetBalance&USER=[my user id]&PWD=[my password]&VERSION=94.0".getBytes());

  URLFetchService f = URLFetchServiceFactory.getURLFetchService();
  HTTPResponse hresp = f.fetch(hreq);
  
  String response = URLDecoder.decode(new String(hresp.getContent()), "UTF-8");
  Matcher m = Pattern.compile("L_AMT0=(\\d+\\.\\d+)").matcher(response);
  String output = null;
  
  if (m.find()) {
    output = "$".concat(m.group(1));
  } else {
    output = "Error getting balance.";
  }
  
  resp.setContentType("text/html; charset=UTF-8");
  resp.addHeader("Access-Control-Allow-Origin", "http://cautery.blogspot.com");
  resp.getWriter().println(output);
}

So, easy enough for ebay (except for the hardcoded credentials in the code, which is insanity if Google's code download prevention ever goes on the blink). The 529 account, on the other hand, is with College Advantage, which doesn't provide a developer API, so the integration with my balance page had to resort to screen-scraping.

Their customer website is written in .NET, so along the way I needed to decode some of what's going on under the hood. On the login page, this form is presented:

When you enter your credentials and submit the form, the next page contains the account balance:

Based on that, I assumed that I would need my servlet to use URLFetch to GET the login page, grab a session cookie, post my login credentials, and scan the return page for a dollar amount. I quickly found out that it was going to be a bit more tricky.

In the HTML of the login page, the form is plainly visible:

<form name="aspnetForm" method="post" action="login.aspx" onsubmit="javascript:return WebForm_OnSubmit();" id="aspnetForm">

A simple post back to the same ASP page, with presumably a javascript validation step. The inputs were spread out in the code, but were easily enumerated with Chrome's devloper console:

> var inputs = document.forms[0].getElementsByTagName('input')
<- undefined
> for (var i in inputs) console.log(inputs[i])
  <input type="hidden" name="__EVENTTARGET" id="__EVENTTARGET" value>
  <input type="hidden" name="__EVENTARGUMENT" id="__EVENTARGUMENT" value>
  <input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value="/wEPDwUKLTk2NDcyNzc3N ... fMckMweSG6gqi57QQ0=">
  <input type="hidden" name="__EVENTVALIDATION" id="__EVENTVALIDATION" value="/wEWBwKZz/jaCALlu9TcCQ ... hwKmiI2E1Qp5JtZ5s">
  <input type="text" class="text-box">
  <input name="ctl00$cphContent$Login1$UserName" type="text" value="(username)" maxlength="50" id="ctl00_cphContent_Login1_UserName" tabindex="1" class="watermark">
  <input id="password-fake" class="watermark" type="text" value="(password)" tabindex="2" style="display: inline;">
  <input name="ctl00$cphContent$Login1$Password" type="password" id="ctl00_cphContent_Login1_Password" tabindex="3">
  <input id="ctl00_cphContent_Login1_RememberMe" type="checkbox" name="ctl00$cphContent$Login1$RememberMe">
  <input type="image" name="ctl00$cphContent$Login1$LoginImageButton" id="ctl00_cphContent_Login1_LoginImageButton" src="images/redesign/sign-in-btn.png" alt="Click here to login" onclick="javascript:WebForm_DoPostBackWithOptions(new WebForm_PostBackOptions("ctl00$cphContent$Login1$LoginImageButton", "", true, "Login1", "", false, false))" style="color:#284775;border-width:0px;">
  10
  function item() { [native code] }
<- undefined

This was when I first had my "aw crap" moment. The ASP page was littered with things I don't particularly care for, not the least of which is having 8 more form inputs than is necessary. It had the __VIEWSTATE and __EVENTVALIDATION objects, used by ASP.NET to manage state (since posts happen to the same URL) and add some extra security to prevent the client adding extra form fields. It had the long .NET ct100 input names, and also two layers of inputs for the password, one confusingly titled "password-fake". And lastly, the postback call, omnipresent in ASP applications.

After trying different combinations of inputs, I found that I could not programmatically submit the form as-is and get to the next step. The same login page was always returned to me. Puzzled, I decided to see what my browser actually submitted, and use Chrome's net-internals packet capturing console to catch raw HTTP requests. When I did this, I saw some more things which were irritants, but ultimately found what I was missing: copious javascript manipulation of form elements - both deleting and adding them.

The following is a somewhat truncated series of raw requests and responses showing the entire path to getting the sought-after account balance.

Step 1 - Chrome sends a GET request for the login page

GET /cas/login.aspx HTTP/1.1
Host: www.collegeadvantage.com
Connection: keep-alive
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
User-Agent: Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.57 Safari/537.17
Accept-Encoding: gzip,deflate,sdch
Accept-Language: en-US,en;q=0.8
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.3

Step 2 - Server sends the login form, plus a session cookie

HTTP/1.1 200 OK
Cache-Control: no-cache, no-store
Pragma: no-cache
Content-Length: 28017
Content-Type: text/html; charset=utf-8
Expires: -1
Server: Microsoft-IIS/7.5
X-AspNet-Version: 2.0.50727
Set-Cookie: ASP.NET_SessionId=dhevgljw5210rfyuczyq0p3o; path=/; HttpOnly
X-Powered-By: ASP.NET
Date: Thu, 21 Mar 2013 00:56:16 GMT

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head id="ctl00_headEnrollmentMater"><title>
...rest of login page

Step 3 - After I enter my credentials and submit the form, Chrome posts this:

POST /cas/login.aspx HTTP/1.1
Host: www.collegeadvantage.com
Connection: keep-alive
Content-Length: 2598
Cache-Control: max-age=0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Origin: https://www.collegeadvantage.com
User-Agent: Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.57 Safari/537.17
Content-Type: application/x-www-form-urlencoded
Referer: https://www.collegeadvantage.com/cas/login.aspx
Accept-Encoding: gzip,deflate,sdch
Accept-Language: en-US,en;q=0.8
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.3
Cookie: ASP.NET_SessionId=dhevgljw5210rfyuczyq0p3o; __utma=229128...etc + other Google Analytics cookies

__EVENTTARGET=
&__EVENTARGUMENT=
&__VIEWSTATE=%2FwEPDwUKLTk2NDcyNzc3N...etc
&__EVENTVALIDATION=%2FwEWBwKZz%2FjaC...etc
&ctl00%24cphContent%24Login1%24UserName=[user name]
&ctl00%24cphContent%24Login1%24Password=[password]
&ctl00%24cphContent%24Login1%24LoginImageButton.x=0
&ctl00%24cphContent%24Login1%24LoginImageButton.y=0

The obvious things that stick out here are the last two form elements. In the form, there was a LoginImageButton element, but here that is removed, and replaced with the button object's .x and .y fields. Javascript from either an onload event or the validation function must have done this, and I couldn't imagine it was intentional. Astonishingly, both of these form elements are necessary to continue. Leaving either (or both) of them out results in the login silently failing, and returning you to the login form again.

Step 4 - Server sends redirect to account details page

HTTP/1.1 302 Found
Cache-Control: no-cache, no-store
Pragma: no-cache
Content-Length: 28352
Content-Type: text/html; charset=utf-8
Expires: -1
Location: /cas/AccountDetails.aspx
Server: Microsoft-IIS/7.5
X-AspNet-Version: 2.0.50727
Set-Cookie: UserInfo=userName=; expires=Thu, 21-Mar-2013 00:56:24 GMT; path=/
Set-Cookie: OTTA_AUTHX=864D0BB4752D1...etc
X-Powered-By: ASP.NET
Date: Thu, 21 Mar 2013 00:56:23 GMT

<html><head><title>Object moved</title></head><body>
<h2>Object moved to <a href="%2fcas%2fAccountDetails.aspx">here</a>.</h2>
</body></html>

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head id="ctl00_headEnrollmentMater"><title>...

This was another "what the hell?" moment. First, why is there a doctype tag and an additional HTML page under the default IIS "Object moved" page? If your browser doesn't accept the redirect in the HTTP header, what's it going to do with the mess that follows?

Second, why on earth would the server be setting additional cookies on the redirect? After comparing that with another .NET site I frequent (my bank, Huntington) I saw that it was essentially doing the same thing. A little research showed that setting a secondary authorization cookie is a common security measure against session hijacking using the "session fixation" attack.

Since App Engine's URLFetch service isn't a full-fledged browser, this last bit actually caused me some problems. When I got to this step in my code, the URLFetch Service kept throwing convertApplicationException errors (in fact it was when these started showing up that I realized I had finally found the right combination of headers and form elements).

The problem was the session cookie in the redirect. URLFetch defaults to following redirects, but cookie management (in my version of the Java toolkit, anyway) is left to the programmer by means of setHeader or addHeader calls. So, if my hypothesis of what was happening is accurate, URLFetch followed the redirect, but did not have the OTTA_AUTHX session cookie, which led to the exception by virtue of a redirect loop.

The fix was pretty simple, just adding the doNotFollowRedirects option to the HTTPRequest object, allowing a chance to parse the headers and add the new cookie to the followup GET request.

Step 5 - Chrome sends a GET request for the account details page, including both the session and secondary authorization cookies.

GET /cas/AccountDetails.aspx HTTP/1.1
Host: www.collegeadvantage.com
Connection: keep-alive
Cache-Control: max-age=0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
User-Agent: Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.57 Safari/537.17
Referer: https://www.collegeadvantage.com/cas/login.aspx
Accept-Encoding: gzip,deflate,sdch
Accept-Language: en-US,en;q=0.8
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.3
Cookie: ASP.NET_SessionId=dhevgljw5210rfyuczyq0p3o; __utma=2291283...etc; UserInfo=userName=; OTTA_AUTHX=864D0BB4752D1CBE7E55843B...etc

Step 6 - Account details page is returned, containing the account balance in a simple HTML table. Since that's the only thing we care about at this point, I've omitted the headers and the rest of the document.

<td align="center">$25.07</td>

The total can be easily parsed by a regular expression, as it's the only thing on the page in $##.## format.

Initially I intended to have this servlet function the same way as the one for ebay - pulling live data with each call. However, since I'm screen-scraping a page that can change at any time, and the calling Google app server may have different IP addresses (which could look like fraud or unauthorized access to College Advantage), that's two good reasons to minimize the number of calls I make to the site.

I opted instead to split this into two parts: One that was invoked by cron daily, logging on once, pulling the balance, and putting it in a database, and another that the blogspot page would invoke, which simply queried the database for the most recent balance.

Here is the first servlet of the pair, invoked daily by cron:

public class CADatabaseWriter extends HttpServlet {
  private static final Logger log = Logger.getLogger(CADatabaseWriter.class.getName());
  
  public void doGet(HttpServletRequest req, HttpServletResponse resp) throws IOException {
    // 1 - Download login form, grab ASP session ID cookie and hidden form fields
    HTTPRequest hreq = new HTTPRequest(new URL("https://www.collegeadvantage.com/cas/login.aspx"), HTTPMethod.GET);
    URLFetchService f = URLFetchServiceFactory.getURLFetchService();
    HTTPResponse hresp = f.fetch(hreq);

    List<HTTPHeader> headers = hresp.getHeaders();
    String cookie = "";
    for (int i = 0; i < headers.size(); i++) {
      if (headers.get(i).getName().equals("Set-Cookie")) {
        cookie = getCookie(headers.get(i).getValue(), "ASP.NET_SessionId");
        break;
      }
    }

    String[] lines = new String(hresp.getContent()).split("\n");
    Pattern p = Pattern.compile("input type=\"hidden\" name=\"(.+?)\".+value=\"(.+?)\"");
    StringBuffer formdata = new StringBuffer();
    for (int i = 0; i < lines.length; i++) {
      Matcher m = p.matcher(lines[i]);
      if (m.find()) {
        cat(formdata, m.group(1), m.group(2));
      }
    }
    cat(formdata, "ctl00$cphContent$Login1$UserName", "(hardcoded user id)");
    cat(formdata, "ctl00$cphContent$Login1$Password", "(hardcoded password");
    // Yes, I know, brazenly insecure
    cat(formdata, "ctl00$cphContent$Login1$LoginImageButton.x", "67");
    cat(formdata, "ctl00$cphContent$Login1$LoginImageButton.y", "14");


    // 2 - POST login request using data scraped from GET request
    hreq = new HTTPRequest(new URL("https://www.collegeadvantage.com/cas/login.aspx"), HTTPMethod.POST, doNotFollowRedirects());
    hreq.setHeader(new HTTPHeader("Origin", "https://www.collegeadvantage.com"));
    hreq.setHeader(new HTTPHeader("Content-Type", "application/x-www-form-urlencoded"));
    hreq.setHeader(new HTTPHeader("Referer", "https://www.collegeadvantage.com/cas/login.aspx"));
    hreq.setHeader(new HTTPHeader("Cookie", cookie));

    hreq.setPayload(formdata.toString().getBytes());
    hresp = f.fetch(hreq);
    headers = hresp.getHeaders();


    // 3 - Pull additional OTTA_AUTHX cookie from redirect page,
    // add to final GET request for account data
    for (int i = 0; i < headers.size(); i++) {
      if (headers.get(i).getName().equals("Set-Cookie")) {
        String cookieStr = headers.get(i).getValue(); 
        cookie = cookie + "; " +
        getCookie(cookieStr, "OTTA_AUTHX");
        break;
      }
    }

    hreq = new HTTPRequest(new URL("https://www.collegeadvantage.com/cas/AccountDetails.aspx"), HTTPMethod.GET);
    hreq.setHeader(new HTTPHeader("Referer", "https://www.collegeadvantage.com/cas/login.aspx"));
    hreq.setHeader(new HTTPHeader("Cookie", cookie));

    hresp = f.fetch(hreq);

    // 4 - Parse current balance
    String content = new String(hresp.getContent());
    Matcher m = Pattern.compile(">(\\$\\d+\\.\\d+)<").matcher(content);
    
    if (m.find()) {
      Entity balance = new Entity("balance");
      balance.setProperty("balance", m.group(1));
      balance.setProperty("date", new Date());
      DatastoreService datastore = DatastoreServiceFactory.getDatastoreService();
      datastore.put(balance);
      log.warning("Success");
    } else {
      log.severe("Error retrieving balance");
    }
  }

  private String getCookie(String cookieStr, String key) {
    String value = key + "=";
    Pattern p = Pattern.compile(key + "=(.+?);");
    Matcher m = p.matcher(cookieStr);
    if (m.find()) value = value.concat(m.group(1));
    return value;
  }
  
  private void cat(StringBuffer sb, String key, String value)
               throws UnsupportedEncodingException {
    if (sb.length() > 0) sb.append(('&'));
    String k = URLEncoder.encode(key, "UTF-8");
    String v = URLEncoder.encode(value, "UTF-8");
    sb.append(k).append('=').append(v);
  }
}

This is mapped as /cad in the web.xml file, along with this security constraint, which allows only admins (and cron) to call the servlet:

<security-constraint>
    <web-resource-collection>
        <url-pattern>/cad</url-pattern>
    </web-resource-collection>
    <auth-constraint>
        <role-name>admin</role-name>
    </auth-constraint>
</security-constraint>

I also needed to add a cron.xml file to WEB-INF to specify the schedule:

<?xml version="1.0" encoding="UTF-8"?>
<cronentries>
  <cron>
    <url>/cad</url>
    <description>Get daily balance from College Advantage</description>
    <schedule>every day 06:00</schedule>
    <timezone>America/New_York</timezone>
  </cron>
</cronentries>

That was parsed by App Engine and showed up as this item in "Cron Jobs" in the App Engine UI:

In a serendipitous coincidence, the next day at 6am the collegeadvatage.com login servlet was down, looking like this to a browser:

...and this to the App Engine log:

The top entry showing a success is one I ran manually after the site was repaired, so I was able to see that my error branching was placed fairly well. It's easy to spot as the manually invoked one because it is immediately preceded by a 302 redirect, meaning the security restraint was being provided, and Google was checking to make sure my account was both logged in and an administrator. On the following morning, the 6am job ran successfully, which correlates to the 10am balance (UTC) shown in the entity table here.

What's that? Yes, a deposit did just happen to clear before that last datastore entry was added.

The servlet that displays the current balance back to blogspot is much simpler, just querying the datastore for the most recent entity, and returning it's "balance" field:

public class CABalServlet  extends HttpServlet{
  public void doGet(HttpServletRequest req, HttpServletResponse resp) throws IOException {
    DatastoreService datastore = DatastoreServiceFactory.getDatastoreService();
    Query q = new Query("balance").addSort("date", Query.SortDirection.DESCENDING);
    Entity e = datastore.prepare(q).asList(FetchOptions.Builder.withLimit(1)).get(0);
    resp.setContentType("text/html; charset=UTF-8");
    resp.addHeader("Access-Control-Allow-Origin", "http://cautery.blogspot.com");
    resp.getWriter().println(e.getProperty("balance"));
  }
}

This debugging process was a good reminder to me why being versed in the underlying protocols and bits on the wire is so important. In the ASP.NET world, a lot of what's going on under the hood is abstracted and obfuscated with runat=server includes, giving the programmer no indication of the madness going on in the HTML. Of course, if I'm going to complain about that, I should have a lot to say about GWT, which I have avoided lately out of a desire to code UI pieces in raw HTML. (What GWT loses in the enlightenment category regarding abstracting too much away from the coder, I think it makes up for by attempting to be cross-browser compliant with a large array of user controls. But wrong thinking is wrong thinking, no matter the larger intent.)

Medicine for the Sky

Saturday, March 23, 2013

Screen-scraping a .NET site with App Engine/Java

No comments:

Post a Comment

Feeds worth following

Archive