Web scraping 1M pages

Employer
SONTO-PL
Description

I need a script, written in any technology, that quickly downloads about 1M domains/URLs.

quick download of about 1M domains

list of domains: attached or at http://notki.vot.pl/1m-sites.zip (the top 1M domains according to the Alexa ranking)

run from the command line: the first argument is the file containing the list of domains or URLs, the second argument is the SQLite database file to which the content will be written, e.g. "python script.py 1m-sites database" reads the list of domains/URLs from 1m-sites.txt and writes to database.sqlite
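The command-line interface described above could be sketched roughly as follows. This is only an illustration: the helper names `with_default_suffix` and `parse_args` are hypothetical, and the rule that ".txt" and ".sqlite" are appended when the extension is omitted is an assumption read from the example invocation.

```python
import argparse
from pathlib import Path


def with_default_suffix(name: str, suffix: str) -> Path:
    """Append `suffix` only when the argument has no extension of its own."""
    path = Path(name)
    return path if path.suffix else path.with_suffix(suffix)


def parse_args(argv=None):
    """Return (list_path, database_path) from the two positional arguments."""
    parser = argparse.ArgumentParser(
        description="Download every domain/URL in a list into an SQLite database."
    )
    parser.add_argument("url_list", help="file with one domain or URL per line")
    parser.add_argument("database", help="SQLite file that receives the pages")
    args = parser.parse_args(argv)
    # Assumption: ".txt" / ".sqlite" are implied when the extension is omitted.
    return (
        with_default_suffix(args.url_list, ".txt"),
        with_default_suffix(args.database, ".sqlite"),
    )
```

With this sketch, `parse_args(["1m-sites", "database"])` yields the paths 1m-sites.txt and database.sqlite, while arguments that already carry an extension are left unchanged.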

database structure - table "pages" with columns: datetime | url (as given in the list) | urlfinal (URL actually downloaded after redirects) | content | code (status code returned by the server; 0 if, for example, the domain does not exist)
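One possible way to create and fill the "pages" table with Python's built-in sqlite3 module is shown below. The column types, the UTC ISO 8601 timestamp, and the function names `open_database` and `save_page` are assumptions made for the sketch; only the table name and column names come from the description.

```python
import sqlite3
from datetime import datetime, timezone

# Column names follow the posting; the types are an assumption.
SCHEMA = """
CREATE TABLE IF NOT EXISTS pages (
    datetime TEXT NOT NULL,     -- when the download attempt happened
    url      TEXT NOT NULL,     -- entry exactly as it appears in the list
    urlfinal TEXT,              -- URL actually downloaded after redirects
    content  BLOB,              -- raw response body
    code     INTEGER NOT NULL   -- server status code, 0 when there was no response
)
"""


def open_database(path):
    """Open (or create) the database file and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    conn.commit()
    return conn


def save_page(conn, url, urlfinal, content, code):
    """Insert one download result; the caller decides when to commit."""
    fetched_at = datetime.now(timezone.utc).isoformat()
    conn.execute(
        "INSERT INTO pages (datetime, url, urlfinal, content, code) "
        "VALUES (?, ?, ?, ?, ?)",
        (fetched_at, url, urlfinal, content, code),
    )
```

Committing once per batch of rows rather than once per row is usually much faster in SQLite, which matters at the scale of 1M pages.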

use of fast download techniques, such as multithreading or asynchronous requests (or other approaches), with configurable settings
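As a rough illustration of the multithreading option, the sketch below uses only the Python standard library. The names `to_url`, `fetch`, and `download_all`, the default of 64 workers, the 10 second timeout, and the choice of "http://" for bare domains are all assumptions; an asynchronous client would be an equally valid approach.

```python
import urllib.error
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def to_url(entry: str) -> str:
    """Turn a bare domain into a URL; the http:// default is an assumption."""
    entry = entry.strip()
    return entry if "://" in entry else "http://" + entry


def fetch(url: str, timeout: float = 10.0):
    """Return (final_url, body, code); code 0 means no response was received."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.geturl(), resp.read(), resp.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status such as 404 or 500.
        return err.geturl(), err.read(), err.code
    except Exception:
        # DNS failure, timeout, malformed URL and similar: recorded as code 0.
        return None, b"", 0


def download_all(entries, workers: int = 64, fetch_one=fetch):
    """Yield (entry, final_url, body, code) in input order using a thread pool."""
    entries = [e.strip() for e in entries if e.strip()]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(fetch_one, map(to_url, entries))
        for entry, (final_url, body, code) in zip(entries, results):
            yield entry, final_url, body, code
```

The `workers` parameter is the configurable part. For a list of 1M entries it would probably be worth feeding the pool in batches, since `pool.map` queues every task up front.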

browser simulation: at a minimum, requests should identify themselves with a browser-like User-Agent header
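A minimal sketch of the browser identification, again with the standard library only. The header values below are illustrative examples, not values taken from the posting, and `BROWSER_HEADERS` and `build_request` are hypothetical names.

```python
import urllib.request

# Example values only; any current browser's headers could be substituted.
BROWSER_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:121.0) "
        "Gecko/20100101 Firefox/121.0"
    ),
    "Accept": "text/html,application/xhtml+xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.5",
}


def build_request(url, headers=None):
    """Build a request carrying browser-like headers; `headers` overrides defaults."""
    merged = {**BROWSER_HEADERS, **(headers or {})}
    return urllib.request.Request(url, headers=merged)
```

The resulting object can be passed to `urllib.request.urlopen` in place of a plain URL string.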

Published
on 2024-01-08
Category
Required skills:
web scraping

Offers sent (13)

sjd
2 deals
on 2024-01-11
c
databases
html5
+ 5 more
Budget
Negotiable
Copyright
-
Expires in
30 days
