organization.rb

Download this example’s Ruby code to run locally.

The cat.rb, bill.rb and legislator.rb examples show you how to scrape and import data. This example shows you how to transform scraped data.
```
require 'pupa'

require 'csv'
```

We’re going to scrape organizations and output them as CSV, which we can then upload to the Open Knowledge Foundation’s Public Bodies project.
```
class PublicBodyProcessor < Pupa::Processor
```

This transformation task will write a CSV row for each scraped organization. You can name transformation tasks whatever you like.

```
  def csv
    puts CSV.generate_line %w(
      title
      abbr
      key
      category
      parent
      parent_key
      description
      url
      jurisdiction
      jurisdiction_code
      source
      source_url
      address
      contact
      email
      tags
    )
```
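Each row written after this header has to list its values in the same order and in the same number as the header columns, with nil for any column that has no value. A minimal sketch of how CSV.generate_line treats such a row, using a shortened header and a made-up organization (neither comes from the real data):

```ruby
require 'csv'

# Hypothetical, shortened header for illustration only.
HEADER = %w(title abbr key category)

# A made-up row in header order; nil means "no value for this column".
def example_row
  ['Example Department', nil, 'abc123', 'department']
end

# generate_line returns one CSV line, including the trailing newline.
puts CSV.generate_line(HEADER)
puts CSV.generate_line(example_row)
```

The nil is written as an empty field, so the columns to its right stay aligned with the header.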

organizations is a lazy enumerator of all scraped organizations, so we’ll see a CSV row printed as soon as an organization is scraped.

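```
    organizations.each do |organization|
      # One value per header column, in header order.
      puts CSV.generate_line [
        organization.name,                      # title
        nil,                                    # abbr
        organization._id,                       # key
        organization.classification,            # category
        nil,                                    # parent
        nil,                                    # parent_key
        nil,                                    # description
        nil,                                    # url
        'New Brunswick',                        # jurisdiction
        'ocd-division/country:ca/province:nb',  # jurisdiction_code
        organization.sources[0][:note],         # source
        organization.sources[0][:url],          # source_url
        organization.contact_details.address,   # address
        organization.extras[:contact_detail],   # contact (key set in scrape_organizations)
        organization.contact_details.email,     # email
        nil,                                    # tags
      ]
    end
  end
```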
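To see why rows appear while scraping is still in progress, here is a small sketch of lazy enumeration in plain Ruby. It is not Pupa's implementation: fake_scrape and interleaved_log are hypothetical stand-ins, and the names they yield are made up.

```ruby
# Hypothetical stand-in for a scraping task: yields one name at a time
# and records each "scrape" in the given log.
def fake_scrape(log)
  Enumerator.new do |yielder|
    %w(first second third).each do |name|
      log << "scraped #{name}"
      yielder << name
    end
  end.lazy
end

# Consumes the enumerator and records when each item is "printed".
def interleaved_log
  log = []
  fake_scrape(log).each { |name| log << "printed #{name}" }
  log
end

puts interleaved_log
```

The log alternates between "scraped" and "printed" entries, because each item is handed to the consumer before the next one is produced.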

To keep this example short, we’ll just scrape the departments and agencies of the Government of New Brunswick.

```
  def scrape_organizations
    url = 'http://www1.gnb.ca/cnb/DsS/display-e.asp?typyofPublicBodyID=1'
    doc = get(url)

    doc.xpath('//table[4]//table').each do |table|
      organization = Pupa::Organization.new
      organization.name = table.at_xpath('.//u').text
      address = table.text.strip[/\A#{Regexp.escape(organization.name)}(.+?)(?=Co-ordinator:|Email:|Phone:|Fax:)/m, 1].gsub(/[[:space:]]{2,}/, "\n").strip
      email = clean(table.at_xpath('.//a/@href').value).sub(/\Amailto:/, '')
      contact_detail = table.at_xpath('.//u[text()="Co-ordinator"]').next.text.sub(/\A: /, '')
      organization.add_contact_detail('address', address)
      organization.add_extra(:contact_detail, contact_detail)
      organization.add_contact_detail('email', email)
      organization.add_source(url, note: 'New Brunswick Directory of Public Bodies')
      dispatch(organization)
    end
  end
end

PublicBodyProcessor.add_scraping_task(:organizations)

runner = Pupa::Runner.new(PublicBodyProcessor)
```
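The address extraction in scrape_organizations is the densest line in this example, so it may help to try it on plain text. The sketch below applies the same regular expression to a made-up string shaped like one table's flattened text. SAMPLE_TEXT and extract_address are hypothetical names, and unlike the scraper the helper returns nil when the pattern doesn't match.

```ruby
# Made-up text, shaped like the flattened text of one table on the page.
SAMPLE_TEXT = "Example Agency\n  123 Example Street\n   Fredericton, NB\n  Co-ordinator: Jane Doe\n  Email: jane@example.com"

# Returns the text between the organization's name and the first contact
# label, with each run of two or more whitespace characters collapsed
# into a single newline. Returns nil if the pattern doesn't match.
def extract_address(text, name)
  pattern = /\A#{Regexp.escape(name)}(.+?)(?=Co-ordinator:|Email:|Phone:|Fax:)/m
  captured = text.strip[pattern, 1]
  return nil if captured.nil?
  captured.gsub(/[[:space:]]{2,}/, "\n").strip
end

puts extract_address(SAMPLE_TEXT, 'Example Agency')
```

The lazy (.+?) stops at the first label matched by the lookahead, so text from "Co-ordinator:" onward never ends up in the address.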

Registers the csv action, so that we can run it with:

```
ruby organization.rb --action csv > output.csv
```

```
runner.add_action(name: 'csv', description: 'Output organizations as CSV')
runner.run(ARGV)
```

You’ve won at Pupa.rb! Explore the class and method documentation to learn how to do even more with Pupa.rb.