Skyscraper

Skyscraper 0.3
Ascending to the next floor

Daniel Janus
clojureD, 2020-02-29

Scraping flow

            
(def seed
  [{:url "https://conference.news"
    :processor :homepage}])

Scraping flow: output of :homepage

            
[{:title ":clojureD held in Berlin"
  :art-id "1"
  :url "https://conference.news/article/1"
  :processor :article}
 {:title "People enjoying themselves at :clojureD"
  :art-id "2"
  :url "https://conference.news/article/2"
  :processor :article}
 {:title "Awesome talks all around"
  :art-id "3"
  :url "https://conference.news/article/3"
  :processor :article}]

Scraping flow: final output

            
[{:title ":clojureD held in Berlin"
  :art-id "1"
  :body "This is the text of the first article."
  :num-comments "31"}
 {:title "People enjoying themselves at :clojureD"
  :art-id "2"
  :body "Here comes the second text."
  :num-comments "41"}
 {:title "Awesome talks all around"
  :art-id "3"
  :body "And a third one."
  :num-comments "59"}]

Processor

            
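;; Assumes `select`, `attr` and `text` are referred from Reaver
;; and `string` is an alias for clojure.string.
(defprocessor :homepage
  :cache-template "index"
  :process-fn (fn [doc ctx]            ; doc: parsed page, ctx: current context
                (for [x (select doc "ul.articles li a")
                      :let [url (attr x :href)]]
                  {:title (text x)
                   :url url            ; next page to fetch
                   :art-id (last (string/split url #"/"))
                   :processor :article})))  ; processor for that page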
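
The :art-id expression in the processor can be tried on its own at the REPL. A minimal sketch, where `url->art-id` is a hypothetical helper name and not part of Skyscraper: it splits the URL on slashes and keeps the last segment.

```clojure
(require '[clojure.string :as string])

;; Hypothetical helper: the last path segment of a URL serves as the article id.
(defn url->art-id [url]
  (last (string/split url #"/")))

(url->art-id "https://conference.news/article/2")
;; => "2"
```

The id stays a string, which matches the `art_id text` column shown later in the schema.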

Running it

            
(scrape seed)

New in 0.3

Processor, again

            
(defprocessor :homepage
  :cache-template "index"
  :skyscraper.db/columns [:title :art-id]   ; context keys stored as columns
  :skyscraper.db/key-columns [:art-id]      ; columns that identify a row
  :process-fn (fn [doc ctx]
                (for [x (select doc "ul.articles li a")
                      :let [url (attr x :href)]]
                  {:title (text x)
                   :url url
                   :art-id (last (string/split url #"/"))
                   :processor :article})))

Running it, again

            
(scrape! seed
         :db-file "conf.sqlite")

Lo and behold

            
sqlite> .schema
CREATE TABLE homepage (id integer primary key,
                       parent integer,
                       title text,
                       art_id text);
CREATE TABLE article (id integer primary key,
                      parent integer,
                      body text,
                      num_comments text);

Lo and behold

            
sqlite> select title, num_comments
        from article
          join homepage on article.parent = homepage.id;

title = Awesome talks all around
num_comments = 59

title = :clojureD held in Berlin
num_comments = 31

title = People enjoying themselves at :clojureD
num_comments = 41

Happy scraping!
/nathell/skyscraper

Daniel Janus
@nathell