legislator.rb


Download this example’s Ruby code to run locally.

The cat.rb example goes over the basics of using Pupa.rb, and bill.rb covers how to relate objects and how to separate scraping tasks for different types of data. This example explains how to select among different scraping methods at run time, for example to scrape legislators with a different method depending on the legislative term. This is particularly useful if a data source changes format from year to year.
```
require 'pupa'

require 'nokogiri'
```

parl.gc.ca uses ASP.NET forms, so we need bigger guns.

```
require 'mechanize'

class LegislatorProcessor < Pupa::Processor
```


The data source publishes information from different parliaments in different formats. We override scraping_task_method to select the method used to scrape legislators according to the parliament.
```
  def scraping_task_method(task_name)
    if task_name == :people
```

If the task is to scrape people and a parliament is given, we select a method according to the parliament.

```
      if @options.key?('parliament')
        if @options['parliament'].to_i >= 36
          "scrape_people_36th_to_date"
        else
          "scrape_people_1st_to_35th"
        end
```


If no parliament is given, we assume the parliament is recent, as it is more common to scrape current data than historical data.
```
      else
        "scrape_people_36th_to_date"
      end
```

For any other scraping task, we use scraping_task_method’s default behavior.
```
    else
      super
    end
  end
```
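
The selection logic above can be tried in isolation. The following is a minimal sketch that does not use Pupa.rb: `people_method_for` is a hypothetical stand-in for the `:people` branch of `scraping_task_method`, and it assumes the criteria arrive as a hash with string keys and string values, as the `to_i` call above suggests.

```ruby
# Hypothetical stand-in for the :people branch of scraping_task_method.
# It takes the options hash directly instead of reading @options.
def people_method_for(options)
  if options.key?('parliament')
    # Assumed: values are strings, so convert before comparing.
    if options['parliament'].to_i >= 36
      'scrape_people_36th_to_date'
    else
      'scrape_people_1st_to_35th'
    end
  else
    # No parliament given: assume recent data is wanted.
    'scrape_people_36th_to_date'
  end
end

puts people_method_for({'parliament' => '37'})
puts people_method_for({'parliament' => '12'})
puts people_method_for({})
```

Only the 12th parliament in this sketch selects the historical method; the other two calls select the recent one.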

A helper method to put name components in a typical order.

```
  def swap_first_last_name(name)
    name.strip.match(/\A([^,]+?), ([^(]+?)(?: \(.+\))?\z/)[1..2].
      reverse.map{|component| component.strip.squeeze(' ')}.join(' ')
  end
```
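
To see what the helper does, here is a small sketch that copies it outside the class and runs it on a couple of inputs. The sample names are made up for illustration and are not taken from the data source.

```ruby
# The helper, copied outside the class so it can run on its own.
def swap_first_last_name(name)
  name.strip.match(/\A([^,]+?), ([^(]+?)(?: \(.+\))?\z/)[1..2].
    reverse.map{|component| component.strip.squeeze(' ')}.join(' ')
end

# Illustrative inputs (assumed, not real data).
puts swap_first_last_name('Doe, Jane')
# A trailing parenthetical is dropped and runs of spaces are collapsed.
puts swap_first_last_name(' Doe,  Jane  Q. (Hon.) ')
```

Note that a name without a comma does not match the pattern, so `match` returns `nil` and the helper raises a `NoMethodError`. That is acceptable here only if every row in the source uses the "Last, First" form.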

```
  def scrape_people_36th_to_date
    url = 'http://www.parl.gc.ca/MembersOfParliament/MainMPsCompleteList.aspx?TimePeriod=Historical&Language=E'
    doc = if @options.key?('parliament')
```

Since we aren’t using the default Faraday HTTP client, we manually configure the Mechanize client to use Pupa.rb’s logger.

```
      client = Mechanize.new
      client.log = Pupa::Logger.new('mechanize', level: @level)
      page = client.get(url)
      page.form['MasterPage$MasterPage$BodyContent$PageContent$Content$ListCriteriaContent$ListCriteriaContent$ucComboParliament$cboParliaments'] = @options['parliament']
      page.form.submit.parser
    else
      get(url)
    end

    doc.css('#MasterPage_MasterPage_BodyContent_PageContent_Content_ListContent_ListContent_grdCompleteList tr:gt(1)').each do |row|
      person = Pupa::Person.new
      person.name = swap_first_last_name(row.at_css('td:eq(1)').text)
      dispatch(person)
    end
  end

  def scrape_people_1st_to_35th
    list_url = 'http://www.parl.gc.ca/Parlinfo/Lists/Members.aspx?Language=E'
    page_url = 'http://www.parl.gc.ca/Parlinfo/Lists/Members.aspx?Language=E&Parliament=%s&Riding=&Name=&Party=&Province=&Gender=&New=False&Current=False&First=False&Picture=False&Section=False&ElectionDate='
    doc = get(list_url)
    value = doc.at_xpath("//select[@id='ctl00_cphContent_cboParliamentCriteria']/option[starts-with(.,'#{@options['parliament']}')]/@value").value
    doc = get(page_url % value)

    doc.css('tr:gt(1)').each do |row|
      person = Pupa::Person.new
      person.name = swap_first_last_name(row.at_css('td:eq(1)').text)
      dispatch(person)
    end
  end
end

LegislatorProcessor.add_scraping_task(:people)
```


To pass criteria for selecting a scraping method when running the processor, call legislator.rb using the pattern:
```
ruby legislator.rb [options] -- [criteria]
```
So, for example, to scrape and import legislators from the 37th parliament:
```
ruby legislator.rb -- parliament 37
```
Or, to scrape but not import legislators from the 12th parliament:
```
ruby legislator.rb --action scrape -- parliament 12
```
```
runner = Pupa::Runner.new(LegislatorProcessor)
runner.run(ARGV)
```
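
To make the `[options] -- [criteria]` pattern concrete, here is a rough sketch of how such an argument list could be split into flags and a criteria hash. This is an assumption for illustration, not Pupa.rb's actual runner code: `split_criteria` is a hypothetical helper, and it assumes the criteria come as alternating keys and values.

```ruby
# Hypothetical helper: everything before "--" is treated as flags,
# everything after it as alternating key/value pairs.
def split_criteria(argv)
  index = argv.index('--')
  return [argv.dup, {}] if index.nil?

  flags = argv.take(index)
  # each_slice(2) pairs keys with values; values stay strings.
  criteria = argv.drop(index + 1).each_slice(2).to_h
  [flags, criteria]
end

flags, criteria = split_criteria(%w[--action scrape -- parliament 12])
p flags
p criteria
```

Under this assumption, the parliament reaches the processor as the string `'12'`, which would explain why `scraping_task_method` calls `to_i` before comparing it to 36.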

Tired of scraping and importing data? See organization.rb to learn how to transform scraped data with Pupa.rb.