  • bill.rb

  • ¶

    Download this example’s Ruby code to run locally.

    The cat.rb example covers the basics of using Pupa.rb. This example shows how to relate objects to one another and how to split scraping into separate tasks, one per type of data.

    require 'pupa'
    
    require 'nokogiri'
  • ¶

    Defines a new class to model legislative bills. In this example, we simply scrape the number and name of each bill and associate the bill with a sponsor and a legislative body.

    class Bill
      include Pupa::Model
    
      attr_accessor :number, :name, :sponsor_id, :organization_id
      attr_reader :sponsor, :organization
  • ¶

    When saving scraped objects to a database, Pupa.rb uses these foreign keys to derive an evaluation order, so that the objects a bill refers to are handled before the bill itself.

      foreign_key :sponsor_id, :organization_id
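  • ¶

    To make the idea of an evaluation order concrete, here is a minimal sketch, separate from bill.rb, that uses Ruby's standard TSort module to order records so that each one comes after the records its foreign keys point to. The DependencyGraph class and the record names are hypothetical and chosen only for illustration; Pupa.rb's own implementation may differ.

```ruby
require 'tsort'

# Hypothetical helper: maps each record to the records its foreign keys reference.
class DependencyGraph
  include TSort

  def initialize(dependencies)
    @dependencies = dependencies
  end

  # TSort calls this to enumerate every node in the graph.
  def tsort_each_node(&block)
    @dependencies.each_key(&block)
  end

  # TSort calls this to enumerate the nodes that `node` depends on.
  def tsort_each_child(node, &block)
    @dependencies.fetch(node, []).each(&block)
  end
end

graph = DependencyGraph.new(
  'bill'         => ['person', 'organization'],
  'person'       => [],
  'organization' => []
)

# Dependencies are listed before the records that reference them.
save_order = graph.tsort
puts save_order.inspect
```

    In this sketch the bill always sorts after the person and organization it references, which is the property an evaluation order needs.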
  • ¶

    Sometimes, you may not know the ID of an existing foreign object, but you may have other information to identify it. In that case, put the information you have in a property named after the foreign key without the _id suffix: for example, sponsor for sponsor_id. Before saving the object to the database, Pupa.rb will use this information to identify the foreign object.

      foreign_object :sponsor, :organization
  • ¶

    We want to dump all properties, including foreign objects, to JSON after scraping. However, we do not want to import foreign objects into MongoDB. Pupa.rb automatically excludes foreign objects during import.

      dump :number, :name, :sponsor_id, :organization_id, :sponsor, :organization
  • ¶

    Overrides the sponsor= and organization= setters to add the _type property automatically, so that the processor does not have to add it each time.

      def sponsor=(sponsor)
        @sponsor = {_type: 'pupa/person'}.merge(sponsor)
      end
    
      def organization=(organization)
        @organization = {_type: 'pupa/organization'}.merge(organization)
      end
    
      def fingerprint
        to_h.slice(:number)
      end
    
      def to_s
        name
      end
    end
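  • ¶

    The setters above rely on how Hash#merge resolves duplicate keys. The following standalone sketch shows that behavior in isolation; the with_default_type helper, the sample name and the 'custom/type' value are hypothetical and not part of Pupa.rb.

```ruby
# The default goes first, so keys supplied by the caller take precedence.
def with_default_type(type, properties)
  {_type: type}.merge(properties)
end

# The caller's properties are kept and the default _type is added.
sponsor = with_default_type('pupa/person', {name: 'Jane Doe'})
puts sponsor.inspect

# A caller that supplies its own :_type overrides the default.
override = with_default_type('pupa/person', {_type: 'custom/type', name: 'Jane Doe'})
puts override.inspect
```

    Note that the default uses the symbol key :_type. A caller passing the string key '_type' would not override it, since Ruby treats the two as different keys.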
  • ¶

    Scrapes legislative information about the Parliament of Canada.

    class ParliamentOfCanada < Pupa::Processor
  • ¶

    Instead of defining a single scrape_objects method to perform all the scraping, we define a scraping task for each type of data we want to scrape: people, organizations and bills.

    This will let us later, for example, run each task on a different schedule. Bill data is updated more frequently than person data; we would therefore run the bills task more frequently.

    See the scraping_task_method documentation for more information on the naming of scraping methods.

      def scrape_people
        doc = get('http://www.parl.gc.ca/MembersOfParliament/MainMPsCompleteList.aspx?TimePeriod=Historical&Language=E')
        doc.css('#MasterPage_MasterPage_BodyContent_PageContent_Content_ListContent_ListContent_grdCompleteList tr:gt(1)').each do |row|
          person = Pupa::Person.new
          person.name = row.at_css('td:eq(1)').text.match(/\A([^,]+?), ([^(]+?)(?: \(.+\))?\z/)[1..2].
            reverse.map{|component| component.strip.squeeze(' ')}.join(' ')
  • ¶

    Some bills omit sponsors’ middle names, so we add an alternate name that omits any middle names.

          components = person.name.split(' ')
          person.add_name("#{components.first} #{components.last}")
          dispatch(person)
        end
      end
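  • ¶

    The name handling in scrape_people can be tried on its own, without fetching the page. This sketch reuses the same regular expression and the same first-and-last logic; the normalize_name and short_name helpers and the sample input are hypothetical, and the real page content may be formatted differently.

```ruby
# Reorders "Last, First Middle (suffix)" into "First Middle Last".
# Raises NoMethodError if the text does not fit the pattern, as the original chain would.
def normalize_name(text)
  pattern = /\A([^,]+?), ([^(]+?)(?: \(.+\))?\z/
  text.match(pattern)[1..2].reverse.map { |component| component.strip.squeeze(' ') }.join(' ')
end

# Keeps only the first and last components, dropping any middle names.
def short_name(name)
  components = name.split(' ')
  "#{components.first} #{components.last}"
end

name = normalize_name('Doe, Jane Mary (Hon.)')
puts name
puts short_name(name)
```

    The optional parenthesized suffix is discarded, and the short form is what add_name stores as the alternate name.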
  • ¶

    Hardcodes the top-level organizations within Parliament.

      def scrape_organizations
        parliament = Pupa::Organization.new(name: 'Parliament of Canada')
        dispatch(parliament)
    
        house_of_commons = Pupa::Organization.new(name: 'House of Commons', parent_id: parliament._id)
        dispatch(house_of_commons)
    
        senate = Pupa::Organization.new(name: 'Senate', parent_id: parliament._id)
        dispatch(senate)
      end
    
      def scrape_bills
        doc = get('http://www.parl.gc.ca/LegisInfo/Home.aspx?language=E&ParliamentSession=41-1&Mode=1&download=xml')
        doc['Bills']['Bill'].each do |row|
  • ¶

    Skips Senate bills (those with the prefix S), since we currently scrape only Members of Parliament and so would have no person to match a senator sponsor against.

          next if row['BillNumber']['prefix'] == 'S'
    
          bill = Bill.new
          bill.number = row['BillNumber']['prefix'] + row['BillNumber']['number']
          bill.name = row['BillTitle']['Title'].find{|x| x['language'] == 'en'}['__content__']
  • ¶

    Here, we tell the Bill everything we know about the sponsor and the legislative body. Pupa.rb will later determine which objects match the given information.

          name = row['SponsorAffiliation']['Person']['FullName']
          bill.sponsor = {
            '$or' => [
              {'name' => name},
              {'other_names.name' => name},
            ],
          }
          bill.organization = {
            name: row['BillNumber']['prefix'] == 'C' ? 'House of Commons' : 'Senate',
          }
          dispatch(bill)
        end
      end
    end
    
    ParliamentOfCanada.add_scraping_task(:bills)
    ParliamentOfCanada.add_scraping_task(:organizations)
    ParliamentOfCanada.add_scraping_task(:people)
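  • ¶

    The sponsor selector in scrape_bills is written in MongoDB query syntax: it asks for a person whose name, or any of whose alternate names, equals the sponsor's full name. The sketch below mimics that condition in plain Ruby over in-memory hashes, purely to illustrate what the selector expresses. The sample records and the matches_sponsor? helper are hypothetical, and this is not how Pupa.rb or MongoDB evaluates the query.

```ruby
# Hypothetical person records with a primary name and a list of alternate names.
people = [
  {'name' => 'Jane Mary Doe', 'other_names' => [{'name' => 'Jane Doe'}]},
  {'name' => 'John Smith', 'other_names' => []},
]

# Mimics {'$or' => [{'name' => name}, {'other_names.name' => name}]}:
# a record matches if its primary name or any alternate name equals `name`.
def matches_sponsor?(person, name)
  person['name'] == name ||
    person.fetch('other_names', []).any? { |other| other['name'] == name }
end

matches = people.select { |person| matches_sponsor?(person, 'Jane Doe') }
puts matches.map { |person| person['name'] }.inspect
```

    This is why scrape_people adds an alternate name without middle names: a bill that lists only the short form can still be matched to the right person.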
  • ¶

    By default, if you run bill.rb, it will perform all scraping tasks and import all the scraped objects into the database. Use the --action and --task switches to control the processor’s behavior.

    runner = Pupa::Runner.new(ParliamentOfCanada)
    runner.run(ARGV)
  • ¶

    Ready for more? Check out the next example: legislator.rb.