Category Archives: WebHarvy

Web Scraping from Command Line

WebHarvy supports command line arguments so that you can run the software directly from the command line. This allows you to run WebHarvy from script or batch files, or to invoke it via code from your own applications.

To learn more, read: Running WebHarvy Web Scraper from Command Line
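As a rough illustration of invoking the program from your own code, here is a minimal Python sketch. The install path is a hypothetical placeholder, and the sketch deliberately assumes nothing about which arguments WebHarvy accepts: they are passed through exactly as supplied, so take the real ones from the command line documentation linked above.

```python
import subprocess
from typing import List, Sequence

# Hypothetical install path (an assumption); adjust to your own installation.
WEBHARVY_EXE = r"C:\Program Files\WebHarvy\WebHarvy.exe"


def build_command(args: Sequence[str], exe: str = WEBHARVY_EXE) -> List[str]:
    """Return the argument vector: the executable first, then the arguments as given."""
    return [exe, *args]


def run_webharvy(args: Sequence[str], exe: str = WEBHARVY_EXE) -> int:
    """Launch the program, wait for it to exit, and return its exit code."""
    completed = subprocess.run(build_command(args, exe), check=False)
    return completed.returncode
```

Passing the arguments as a list rather than a single string avoids shell quoting problems when paths contain spaces.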

Schedule scraping tasks

WebHarvy comes with a built-in scheduler that lets you schedule your scraping tasks. The scheduler window can be opened from the Mine menu.

WebHarvy Scheduler


The scheduler enables you to run scraping tasks periodically – daily, weekly or monthly.

Learn more about the WebHarvy Scheduler

Download and try the free 15-day evaluation version of WebHarvy Web Data Extraction Software.

WebHarvy v2.0 Released!

The new features in the 2.0 update are:

  • Built-in scheduler for running scraping tasks – (know more)
  • Command Line Options – (know more)
  • MySQL Support for exporting scraped data – (know more)
  • Option to scrape sub text of selected text – (know more)
  • Updated proxy settings – (know more)
    • Supports proxies which require authentication
    • Supports importing proxies from CSV/Text files
  • Option to resume mining from where it stopped/aborted
  • Option to auto-save captured data at regular intervals – (know more)
  • Option to automatically inject pauses while mining (prevents IP blocking) – (know more)
  • Major improvements in mining
  • Minor changes
    • Number of pages & records mined are always displayed in Miner window’s status strip
    • Fixed bug related to capturing images where image text is empty
    • Updated capturing email addresses
    • Record numbers displayed inside captured data grid view in Miner window
    • Option to cancel preview generation for large index page data

You may download the latest version of WebHarvy Web Scraper from http://www.webharvy.com/download.html.

How to scrape text following a heading using WebHarvy?

In the latest update of WebHarvy, the Visual Web Scraping Software, the newly introduced ‘Capture following text’ option allows you to capture the text, block, or paragraph that follows a heading within a webpage.

On many websites the data to be scraped is not located at the same position on every page, but is guaranteed to appear under a given heading (for example, “Technical Details” or “Product Specification”). Sometimes the text under a given heading cannot be selected as a single item during configuration. In such scenarios the ‘Capture following text’ option in the capture window will prove helpful.

How to?

While in configuration mode, click on the heading and select the ‘Capture following text’ option in the capture window. Provide a suitable name for the field and click OK. In the preview pane you will see the captured text that follows the heading.
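To illustrate the general idea of locating data by its heading rather than by its position, here is a minimal Python sketch using only the standard library. It is not WebHarvy's implementation; the function name `text_following` and the simplification of returning the next non-empty piece of text after the heading are assumptions made for the example.

```python
from html.parser import HTMLParser
from typing import List, Optional


class _TextCollector(HTMLParser):
    """Collects the non-empty text chunks of a page in document order."""

    def __init__(self) -> None:
        super().__init__()
        self.chunks: List[str] = []
        self._skip_depth = 0  # greater than zero while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse runs of whitespace
        if text and not self._skip_depth:
            self.chunks.append(text)


def text_following(html: str, heading: str) -> Optional[str]:
    """Return the first text chunk after the chunk containing `heading`, or None."""
    collector = _TextCollector()
    collector.feed(html)
    collector.close()
    wanted = heading.casefold()
    # The last chunk is excluded from the search because nothing follows it.
    for i, chunk in enumerate(collector.chunks[:-1]):
        if wanted in chunk.casefold():
            return collector.chunks[i + 1]
    return None
```

The same call works on pages where the heading sits at different positions, which is the situation described above.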

Refer to http://www.webharvy.com/tour1.html#ScrapeFollowingText for more details.

WebHarvy Web Scraper V1.5.0.26 released

The latest version (V1.5.0.26) of WebHarvy Visual Web Scraper is available for download. The changes in this update are:

  • New option: ‘Capture following text’ added in capture form.
  • Web Miner has been improved to handle even HTML errors of target websites.
  • Allows exporting scraped data while mining is paused.
  • For CSV, TSV exports, column names are added as the first row.
  • Option to input keywords in CSV format.
  • Option to manually set page load timeout value in application settings.

The ‘Capture following text’ feature helps to scrape text following a given heading within the page. This feature is useful when the data to be scraped does not occur at a fixed position within the page, but is guaranteed to follow a heading (for example, ‘Product Details:’ or ‘Specification’).

The option to manually set the page load timeout value from the settings window helps to scrape data from websites with slow response times or from those which employ AJAX.

We recommend that you download and try the free 15-day evaluation version.

How to scrape data anonymously?

WebHarvy Web Scraper allows you to scrape data from remote websites anonymously with the help of proxy servers. This prevents remote web servers from blocking or blacklisting your computer’s IP address.

WebHarvy lets you specify either a single proxy server address or a list of proxy server addresses through which the remote website will be scraped. If you provide a list of proxy server addresses, WebHarvy will cycle through the list periodically.
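The idea of cycling through a proxy list on a timer can be sketched in a few lines of Python. This is only an illustration of the concept, not how WebHarvy implements it; the class name `ProxyRotator`, the interval, and the proxy addresses are all hypothetical.

```python
import time
from typing import Callable, Sequence


class ProxyRotator:
    """Selects a proxy from a list, moving to the next one after each interval."""

    def __init__(
        self,
        proxies: Sequence[str],
        interval_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        if not proxies:
            raise ValueError("at least one proxy address is required")
        if interval_seconds <= 0:
            raise ValueError("interval_seconds must be positive")
        self._proxies = list(proxies)
        self._interval = interval_seconds
        self._clock = clock
        self._start = clock()

    def current(self) -> str:
        """Return the proxy for the current interval, wrapping around the list."""
        elapsed = self._clock() - self._start
        index = int(elapsed // self._interval) % len(self._proxies)
        return self._proxies[index]


# Hypothetical addresses, for illustration only.
rotator = ProxyRotator(["10.0.0.1:8080", "10.0.0.2:8080"], interval_seconds=60)
proxy_for_next_request = rotator.current()
```

Deriving the index from elapsed time, instead of advancing a counter on each request, keeps the rotation tied to the interval regardless of how many requests are made.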

Please follow this link to know more about this feature.

Download the WebHarvy Web Scraper FREE trial!

How to scrape search results data for a list of input keywords?

In most cases the data to be scraped is the result of performing a search from the main page of the website. Often you need to extract data from the search results for a whole list of input keywords.

The ‘Keyword Scraping’ feature of WebHarvy allows you to perform this task with ease. You can specify a list of input keywords and WebHarvy will automatically scrape data from the search results corresponding to each keyword in the specified list.
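Conceptually, keyword scraping amounts to running the same extraction once per keyword and collecting the results together. The Python sketch below shows that loop only; `scrape_keywords` and the `scrape_one` callback are hypothetical names, and the callback stands in for whatever actually performs the search and extraction.

```python
from typing import Callable, Dict, Iterable, List

Record = Dict[str, str]


def scrape_keywords(
    keywords: Iterable[str],
    scrape_one: Callable[[str], List[Record]],
) -> List[Record]:
    """Run the same scrape for every keyword and tag each record with its keyword."""
    results: List[Record] = []
    for keyword in keywords:
        cleaned = keyword.strip()
        if not cleaned:
            continue  # ignore blank entries in the input list
        for record in scrape_one(cleaned):
            results.append({"keyword": cleaned, **record})
    return results
```

Tagging each record with the keyword that produced it keeps the combined output traceable once the results of several searches are merged into a single table.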

Please follow this link to know more about ‘Keyword based Scraping’.

Video Demonstration: Keyword based Scraping

We recommend that you download and try the evaluation version of our Web Scraper to learn more about its features.

WebHarvy Web Scraper: Scrape data from sections and sub sections within webpages

The ‘category scraping’ feature of WebHarvy allows you to scrape, with a single configuration, a list of links which lead to similarly formatted pages within a website. This helps to scrape data from sections and subsections listed under the main page of a website.

Please follow this link to know more about Category Scraping.

Category Scraping: Video demonstration

You may download and try the free evaluation version of WebHarvy, the visual Web Scraper software, from http://www.webharvy.com/download.html.

WebHarvy V1.4.0.20 Released

The latest update of WebHarvy (version 1.4.0.20) has gone live and is available for download at www.webharvy.com/download.html.

Changes:

  • [New Feature] Keyword based Scraping: Allows you to run the same configuration for a set of input keywords (Read more: http://www.webharvy.com/tour71.html)
  • Edit Configuration: Allows you to edit an already saved WebHarvy configuration XML file (Read more: http://www.webharvy.com/tour41.html)
  • Option to contact us (WebHarvy Support) directly from the application (See Help menu)
  • Option to check for new updates directly from the application (See Help menu)
  • Miner performance improvement : Web mining performance while following links from the main page has been improved
  • Minor improvements and bug fixes
    • Miner window remembers its last position/size/state
    • Issue with Auto Scroll fixed
    • Issue with loading ‘Next Page’ and ‘Following Links’ in certain scenarios while mining has been fixed
    • Issue which resulted in application crash while parsing HTML of certain websites has been fixed

Web Scrape Anonymously

WebHarvy allows you to scrape websites anonymously via proxy servers. You can either configure WebHarvy to scrape through a single proxy server or to use a list of proxy server addresses which are cycled automatically after a specified time interval.

Scrape via Proxy Servers

You may download the 15-day evaluation copy of WebHarvy Web Scraper from http://www.webharvy.com/download.html.
