Job description
Need a script to quickly download 1M domains/urls, written in any technology.
quick download of about 1M domains
list of domains: attached or at http://notki.vot.pl/1m-sites.zip (this is the top 1M domains according to Alexa ranking)
run from the command line, where the first argument will be a list of domains or urls, the second argument will be the sqlite database file to which the content will be written. e.g. "python script.py 1m-sites database" retrieves a list of domains/urls from 1m-sites.txt and writes to database.sqlite
database structure - table "pages": datetime | url (from the list) | urlfinal (url actually downloaded after redirects) | content | code (returned server code, if, for example, the domain does not exist returns code 0)
the use of fast download techniques, such as the use of multithreading or asynchronous (or other ways) with the option of configuration
simulation of the browser, so that there is, for example, at least user-agent identification