PoshCode Archive  Update of "scrape-script"

Overview

Artifact ID: 3cf84fa5c6b322d8d577643aa44870043227b5c64ef3e457dd0f13e9f0b022ef
Page Name: scrape-script
Date: 2018-07-12 14:00:14
Original User: mario
Mimetype: text/x-markdown
Parent: c78295372cb0597f10a64f23da25eb9deeb04fb9828b732d24595be2cd250691
Next: 5a9298e9633e287321907729c53589675e9e3f19eb0ca240378a8a2cb4305d77
Content

scrape script

  • It first generates a URL list from the IA search API, so that only the interesting content pages get retrieved
  • The extraction script then converts src/* to target/*, populating an open fossil repo right away
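
For illustration only, the URL-list step could look roughly like the sketch below. It is not the actual scrape script. It assumes that "IA" means the Internet Archive and that the Wayback CDX endpoint is the search API in question; the `poshcode.org/*` pattern and both function names are hypothetical. No network request is made here: the code only builds the query URL and turns an already downloaded JSON result into snapshot URLs that could be handed to wget.

```python
from urllib.parse import urlencode

# Assumption: the Wayback Machine CDX endpoint stands in for the "IA search API".
CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def build_cdx_query(url_pattern, mimetype="text/html"):
    """Return a CDX query URL listing successful captures for url_pattern."""
    params = [
        ("url", url_pattern),                # e.g. "poshcode.org/*" (hypothetical)
        ("output", "json"),                  # JSON rows instead of plain text
        ("filter", "statuscode:200"),        # skip redirects and error pages
        ("filter", "mimetype:" + mimetype),  # keep content pages only
        ("collapse", "urlkey"),              # one capture per distinct URL
    ]
    return CDX_ENDPOINT + "?" + urlencode(params)


def rows_to_snapshot_urls(rows):
    """Convert CDX JSON rows (first row is the header) into Wayback URLs."""
    if not rows:
        return []
    header, *records = rows
    ts = header.index("timestamp")
    orig = header.index("original")
    return [
        "https://web.archive.org/web/{}/{}".format(rec[ts], rec[orig])
        for rec in records
    ]
```

The resulting list could be written one URL per line and passed to wget via its `-i` option, with `--wait` providing the delay mentioned below.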

Now, I wouldn't recommend doing this again. It takes hours at least. (Mostly due to the wget delay, of course.)

If you want to set up your own instance, either download the /tarball or just clone the repo:

 fossil clone http://fossil.include-once.org/poshcode/ mypc.fsl
 fossil open mypc.fsl
 fossil ui

(I realize some people might be upset about the embedded metadata comments. But those are clearly easier to remove than to add.)