Web scraping 1M pages

Employer
SONTO-PL
Description

Need a script, written in any technology, to quickly download about 1M domains/URLs.

quick download of about 1M domains

list of domains: attached, or at http://notki.vot.pl/1m-sites.zip (the top 1M domains according to the Alexa ranking)

run from the command line: the first argument is the list of domains or URLs, the second argument is the SQLite database file the content is written to. E.g. "python script.py 1m-sites database" reads the list of domains/URLs from 1m-sites.txt and writes to database.sqlite
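
A minimal sketch of that command-line interface in Python, under a few assumptions that the posting does not state: an argument without an extension gets ".txt" or ".sqlite" appended (inferred from the example), bare domains are fetched over "http://", and the `--workers` and `--timeout` flags are hypothetical names for the configurable options.

```python
import argparse
from pathlib import Path


def resolve_path(name, default_suffix):
    """Append default_suffix when the argument has no extension (assumed rule)."""
    path = Path(name)
    return path if path.suffix else path.with_suffix(default_suffix)


def normalize_url(entry):
    """Turn a bare domain into a URL; assumes plain http:// as the default scheme."""
    entry = entry.strip()
    return entry if "://" in entry else "http://" + entry


def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Download pages into SQLite.")
    parser.add_argument("url_list", help="file with one domain or URL per line")
    parser.add_argument("database", help="SQLite file to write results to")
    # Hypothetical tuning flags; the posting only asks for "configuration".
    parser.add_argument("--workers", type=int, default=100)
    parser.add_argument("--timeout", type=float, default=10.0)
    args = parser.parse_args(argv)
    args.url_list = resolve_path(args.url_list, ".txt")
    args.database = resolve_path(args.database, ".sqlite")
    return args
```

With this sketch, "python script.py 1m-sites database" would resolve to 1m-sites.txt and database.sqlite, as in the example above.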

database structure: table "pages" with columns datetime | url (from the list) | urlfinal (URL actually downloaded after redirects) | content | code (status code returned by the server; 0 if, for example, the domain does not exist)
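
The table could be created and filled roughly as follows. This is a sketch using Python's built-in sqlite3 module; the column types, the ISO 8601 UTC timestamp format, and the helper names `open_database` and `save_page` are assumptions, since the posting only lists the column names.

```python
import sqlite3
from datetime import datetime, timezone

# Column names follow the posting; the types are an assumption.
SCHEMA = """
CREATE TABLE IF NOT EXISTS pages (
    datetime TEXT NOT NULL,     -- time of the download attempt (UTC, ISO 8601)
    url      TEXT NOT NULL,     -- entry exactly as it appears in the list
    urlfinal TEXT,              -- URL actually downloaded after redirects
    content  BLOB,              -- raw response body, NULL on failure
    code     INTEGER NOT NULL   -- server status code, 0 if no response
)
"""


def open_database(path):
    """Open (or create) the SQLite file and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    conn.commit()
    return conn


def save_page(conn, url, urlfinal, content, code):
    """Insert one result row; the caller decides when to commit."""
    conn.execute(
        "INSERT INTO pages (datetime, url, urlfinal, content, code) "
        "VALUES (?, ?, ?, ?, ?)",
        (datetime.now(timezone.utc).isoformat(), url, urlfinal, content, code),
    )
```

Committing once per batch of rows rather than once per row is usually much faster in SQLite, which matters at about 1M inserts.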

use of fast download techniques, such as multithreading or asynchronous I/O (or other approaches), with configurable settings
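
One possible way to do this is a thread pool with a configurable number of workers. The sketch below uses Python's standard concurrent.futures module; `download_all`, its `fetch` parameter (any callable that downloads one URL), and the default of 50 workers are illustrative assumptions, and an asyncio-based design would be an equally valid choice.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def download_all(urls, fetch, workers=50):
    """Call fetch(url) for every URL on a pool of `workers` threads.

    Results are yielded in completion order, not in input order.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(fetch, url) for url in urls]
        for future in as_completed(futures):
            yield future.result()
```

Because downloading is I/O-bound, threads spend most of their time waiting on the network, so a worker count well above the number of CPU cores is typical. For a list of about 1M entries it may be worth feeding the pool in batches so that not all futures are held in memory at once.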

browser simulation: at minimum, requests should identify themselves with a browser User-Agent
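
A minimal sketch of a single download that sends a browser-like User-Agent, using only Python's standard urllib. The User-Agent string, the 10-second timeout, and the function names are illustrative assumptions; the returned tuple mirrors the url, urlfinal, content, and code columns, with code 0 when no server response was received.

```python
import http.client
import urllib.error
import urllib.request

# Example browser-like identification; any current browser string would do.
USER_AGENT = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0 Safari/537.36"
)


def build_request(url):
    """Create a request that carries the browser-like User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})


def fetch(url, timeout=10.0):
    """Return (url, urlfinal, content, code); code is 0 if nothing answered."""
    try:
        with urllib.request.urlopen(build_request(url), timeout=timeout) as resp:
            # geturl() reports the final URL after any redirects were followed.
            return url, resp.geturl(), resp.read(), resp.status
    except urllib.error.HTTPError as err:
        # The server answered with an error status (e.g. 404); keep its code.
        return url, err.geturl(), err.read(), err.code
    except (urllib.error.URLError, http.client.HTTPException, OSError, ValueError):
        # DNS failure, refused connection, timeout, malformed URL, and so on.
        return url, None, None, 0
```

urllib follows redirects by default, which is what makes the urlfinal value available without extra work.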

Published
on 2024-01-08
Category
Required skills:
web scraping
Attachment 1

Offers submitted (13)

Budget
Negotiable
Copyright
-
Valid for
30 days
