- Jun 2024
-
-
- Mar 2024
-
storage.courtlistener.com
-
By hacking WorldCat.org, scraping and harvesting OCLC’s valuable WorldCat
The complaint equates “hacking” with “scraping and harvesting”.
Whether the two are equivalent is a matter of some debate, notably in the recent LLM web scraping cases.
-
-
www.arnoldventures.org
- Nov 2022
-
github.com
-
If you are going to crawl sites you better use Ferrum or Vessel because you crawl, not test.
-
- Oct 2022
- Apr 2022
-
forum.newsblur.com
-
https://forum.newsblur.com/t/is-apify-the-best-scraper-for-sites-without-rss/9179
RSS scraper tools:
- Apify: https://apify.com/
- RSSHub: https://github.com/DIYgod/RSSHub
- RSS Bridge: https://github.com/RSS-Bridge/rss-bridge
- Five Filters: https://createfeed.fivefilters.org/
- AWS release notes feed: https://dyn.tedder.me/rss/aws-release-notes.xml
- Far Side: https://dyn.tedder.me/rss/farside/daily.json
List of others here: https://tedder.me/generated_news_feeds/
-
- Nov 2021
- Jul 2020
-
hackersandslackers.com
-
github.com
-
-
github.com
-
-
github.com
-
Source for: https://apify.com/page-analyzer
-
- Jun 2020
-
psyarxiv.com
-
Westrupp, E., Greenwood, C., Fuller-Tyszkiewicz, M., Berkowitz, T., Hagg, L., & Youssef, G. J. (2020). Text Mining of Reddit Posts: Using Latent Dirichlet Allocation to Identify Common Parenting Issues [Preprint]. PsyArXiv. https://doi.org/10.31234/osf.io/cw54u
-
- May 2020
-
scrapism.lav.io Scrapism
-
Facebook began as a (horny) web scraping project, as did Google and all other search engines.
Facebook... errrr.
-
Scrapism
Trying to understand how to scrape data (damn I hate that phrase... it makes me think of some kind of test for colon cancer or something). This pertains to #clubcovid.
-
- Jul 2018
-
webcitation.org WebCite
-
Archiving service with an emphasis on scholarly publishing.
-
-
ageofshitlords.com
-
Archiving pages that block it.
-
- Apr 2018
-
-
"The problem: the automated web browsing tools they want to use (commonly called “web scrapers”) are prohibited by the targeted websites’ terms of service, and the CFAA has been interpreted by some courts as making violations of terms of service a crime."
-
Good news for anyone who uses the Internet as a source of information: A district court in Washington, D.C. has ruled that using automated tools to access publicly available information on the open web is not a computer crime
-
-
www.seleniumhq.org
-
-
petewarden.com
-
Pingback: Legality of Extracting Publicly Available User-Generated Content – PromptCloud
Pingback: How to Scrape Facebook Posts for Free Content Ideas
Pingback: Facebook data harvesting—what you need to know (From Phys.org) – Peter Schwartz
important readings
-
Google doesn’t use the facebook API to scrape facebook; they just scrape it.
really?
-
This is an extremely important case to remember. It has implications for all Fb users who want to own their past.
-
-
github.com
-
-
github.com
-
warcreate.com
-
-
www.cs.odu.edu
-
WAIL in Electron.
-
-
www.cs.odu.edu
-
The author of the defunct ArchiveFacebook addon.
-
-
www.digitalpreservation.gov
-
benbernardblog.com
-
Need proof? In Linkedin v. Doe Defendants, Linkedin is suing between 1-100 people who anonymously scraped their website. And for what reasons are they suing those people? Let's see: Violation of the Computer Fraud and Abuse Act (CFAA). Violation of California Penal Code. Violation of the Digital Millennium Copyright Act (DMCA). Breach of contract. Trespass. Misappropriation.
Linkedin lawsuit -- terrifying
-
-
www.octoparse.com
- Mar 2018
-
scrapinghub.com
-
Turn websites into structured data.
-
- Sep 2017
-
Local file
-
First, we view technology evolution as a three-stage cyclical process of adoption, appropriation, and repossession. Users drive adoption. Users and providers alternatively drive appropriation and repossession, as users lead appropriation, while providers react when reclaiming the resulting innovations. Second, we identify three appropriation modes—baroquize, creolize, and cannibalize—that represent increasing degrees of power contestation by users. And third, we identify three repossession modes—co-opt, combine, and block—that represent increasingly antagonistic reactions by providers and mirror users’ appropriation strategies.
Treating the document as a tree is a fixed initial convention, meant to get some movement going in the development of the platform and the dynamics around it, but that convention can become movable later (as noted in the first text on Grafoscopio). Rhizomatic or labyrinthine texts like those found in Latin American literature (Cortazar, Borges) could be built with Grafoscopio once the initial convention shifts. This would mean going through the successive phases and even "cannibalizing" Grafoscopio in the end, with the advantage that the tensions between providers and users are not as strong, since the users are the ones providing themselves with technology and changing it along the way. The points of tension arise when the political character of its uses becomes apparent, for example when doing web scraping that violates a website's terms of use (cite the Twitter case).
-
- Jul 2017
-
blog.scrapinghub.com
-
We shouldn’t have to create open data by scraping websites. This information should be already available, easily accessed and provided in a machine-readable format from the original providers, be they city councils or transportation companies. However, until there’s another option, we’ll always have scraping.
-