How to make a crawler run every day at a fixed time (using stock data as an example)

矫情吗;* 2022-07-11 08:40
  • Analyzing the data to crawl
  • Packet capture
  • Framework
  • model
  • main
  • util
  • parse
  • db
  • The problems
  • The solution

    • job
    • jobmain

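Recently, someone copied my blog posts and uploaded them directly to platforms such as Baidu Wenku.
This post is an original work and is intended for technical learning only. Copying it and uploading it to platforms such as Baidu Wenku without permission is prohibited. If you repost it, please cite the address (link) of this post.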

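Analyzing the data to crawl

This post uses data from Eastmoney (东方财富网) as its example. It is for technical learning only; please do not abuse it. The data to crawl are the automobile-sector and petroleum-sector stock lists on Eastmoney, at the following addresses: http://quote.eastmoney.com/center/list.html#28002481_0_2
http://quote.eastmoney.com/center/list.html#28002464_0_2
The screenshot below shows the data format.

[Screenshot: data format of the stock list page]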

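Packet capture

The first step in writing a crawler is to capture the network traffic, that is, to find the real address the data is requested from; I covered this in earlier posts. For why this post is designed the way it is, see my column on crawler principles and related basics: http://blog.csdn.net/column/details/14269.html.
[Screenshot: captured network request]

The capture above shows the real request address and the request method. The response is a JSON-style array wrapped in a JavaScript variable assignment, as shown below:
[Screenshot: the returned data]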

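Framework

The framework (package layout) used in this post is shown below:
[Screenshot: project package structure]

db: holds the database access files: MyDataSource (registers the JDBC driver and sets the database user name and password) and MYSQLControl (connects to the database and performs inserts, updates, table creation, and so on).

model: wraps the data as objects; put plainly, it holds the attributes of the records I want to work with. If this is unclear, see my earlier post on a simple web crawler (http://blog.csdn.net/qy20115549/article/details/52203722).

parse: parses the content fetched by util. HTML is usually parsed with Jsoup; JSON data can be parsed with regular expressions or with fastjson. I recommend fastjson because it is simple and quick to use.

main: the program entry point and the core of the crawler: it fetches the data, executes the database statements, and stores the data.

job: the job to be executed on a schedule.

jobmain: the controller, which decides when the job runs; in this post, once a day. The stock market closes at 3 p.m. each trading day, so the job is set to start crawling at some point after 3 p.m.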

model

model wraps the data I want to crawl, such as the current date, the stock ID, the stock name, and the stock price, as in the following class:

    package model;

    /**
     * @author Qian Yang, School of Management, Hefei University of Technology
     * @email 1563178220@qq.com
     */
    public class ExtMarketOilStockModel {
        private String date;                        // trading date, yyyy-MM-dd
        private String stock_id;                    // stock code
        private String stock_name;                  // stock name
        private float stock_price;                  // latest price
        private float stock_change;                 // price change (amount)
        private float stock_range;                  // price change (ratio)
        private float stock_amplitude;              // amplitude (ratio)
        private int stock_trading_number;           // trading volume
        private int stock_trading_value;            // turnover; int overflows above about 2.1 billion
        private float stock_yesterdayfinish_price;  // previous close
        private float stock_todaystart_price;       // today's open
        private float stock_max_price;              // today's high
        private float stock_min_price;              // today's low
        private float stock_fiveminuate_change;     // five-minute change (ratio)
        private String craw_time;                   // time the record was crawled

        public String getDate() { return date; }
        public void setDate(String date) { this.date = date; }

        public String getStock_id() { return stock_id; }
        public void setStock_id(String stock_id) { this.stock_id = stock_id; }

        public String getStock_name() { return stock_name; }
        public void setStock_name(String stock_name) { this.stock_name = stock_name; }

        public float getStock_price() { return stock_price; }
        public void setStock_price(float stock_price) { this.stock_price = stock_price; }

        public float getStock_change() { return stock_change; }
        public void setStock_change(float stock_change) { this.stock_change = stock_change; }

        public float getStock_range() { return stock_range; }
        public void setStock_range(float stock_range) { this.stock_range = stock_range; }

        public float getStock_amplitude() { return stock_amplitude; }
        public void setStock_amplitude(float stock_amplitude) { this.stock_amplitude = stock_amplitude; }

        public int getStock_trading_number() { return stock_trading_number; }
        public void setStock_trading_number(int stock_trading_number) { this.stock_trading_number = stock_trading_number; }

        public int getStock_trading_value() { return stock_trading_value; }
        public void setStock_trading_value(int stock_trading_value) { this.stock_trading_value = stock_trading_value; }

        public float getStock_yesterdayfinish_price() { return stock_yesterdayfinish_price; }
        public void setStock_yesterdayfinish_price(float stock_yesterdayfinish_price) { this.stock_yesterdayfinish_price = stock_yesterdayfinish_price; }

        public float getStock_todaystart_price() { return stock_todaystart_price; }
        public void setStock_todaystart_price(float stock_todaystart_price) { this.stock_todaystart_price = stock_todaystart_price; }

        public float getStock_max_price() { return stock_max_price; }
        public void setStock_max_price(float stock_max_price) { this.stock_max_price = stock_max_price; }

        public float getStock_min_price() { return stock_min_price; }
        public void setStock_min_price(float stock_min_price) { this.stock_min_price = stock_min_price; }

        public float getStock_fiveminuate_change() { return stock_fiveminuate_change; }
        public void setStock_fiveminuate_change(float stock_fiveminuate_change) { this.stock_fiveminuate_change = stock_fiveminuate_change; }

        public String getCraw_time() { return craw_time; }
        public void setCraw_time(String craw_time) { this.craw_time = craw_time; }
    }

main

The main method should be kept as simple as possible, so I wrote it as follows. The comments in the code make it easy to follow.

    package navi.main;

    import java.util.ArrayList;
    import java.util.List;

    import db.MYSQLControl;
    import model.ExtMarketOilStockModel;
    import parse.ExtMarketOilStockParse;

    /**
     * @author Qian Yang, School of Management, Hefei University of Technology
     * @email 1563178220@qq.com
     */
    public class ExtMarketOilStockMain {
        public static void main(String[] args) throws Exception {
            List<String> urloillist = new ArrayList<String>();
            List<String> urlcarlist = new ArrayList<String>();
            // The petroleum sector has only two pages, so there are two URLs
            String url1 = "http://nufm.dfcfw.com/EM_Finance2014NumericApplication/JS.aspx?type=CT&cmd=C.BK04641&sty=FCOIATA&sortType=C&sortRule=-1&page=1&pageSize=20&js=var%20quote_123%3d{rank:[(x)],pages:(pc)}&token=7bc05d0d4c3c22ef9fca8c2a912d779c&jsName=quote_123&_g=0.13204790262127375";
            String url2 = "http://nufm.dfcfw.com/EM_Finance2014NumericApplication/JS.aspx?type=CT&cmd=C.BK04641&sty=FCOIATA&sortType=C&sortRule=-1&page=2&pageSize=20&js=var%20quote_123%3d{rank:[(x)],pages:(pc)}&token=7bc05d0d4c3c22ef9fca8c2a912d779c&jsName=quote_123&_g=0.6972178580603532";
            urloillist.add(url1);
            urloillist.add(url2);
            for (String url : urloillist) {
                // Fetch and parse one page
                List<ExtMarketOilStockModel> oilstocks = ExtMarketOilStockParse.parseurl(url);
                // Store that page's records
                MYSQLControl.insertoilStocks(oilstocks);
            }
            // The automobile sector has 6 pages, so there are 6 URLs (page=1..6);
            // the original loop condition i < 6 stopped at page 5
            for (int i = 1; i <= 6; i++) {
                String urli = "http://nufm.dfcfw.com/EM_Finance2014NumericApplication/JS.aspx?type=CT&cmd=C.BK04811&sty=FCOIATA&sortType=C&sortRule=-1&page=" + i + "&pageSize=20&js=var%20quote_123%3d{rank:[(x)],pages:(pc)}&token=7bc05d0d4c3c22ef9fca8c2a912d779c&jsName=quote_123&_g=0.23492960370783944";
                urlcarlist.add(urli);
            }
            for (String url : urlcarlist) {
                // Fetch and parse one page
                List<ExtMarketOilStockModel> carstocks = ExtMarketOilStockParse.parseurl(url);
                // Store that page's records
                MYSQLControl.insertcarStocks(carstocks);
            }
        }
    }
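The page URLs above differ only in the board code (`cmd`) and the `page` number, so they can be generated instead of being written out by hand. The following is a minimal sketch, assuming a hypothetical helper class `StockUrls` that is not part of the original project; the query string is copied from the URLs above, and the `_g` value is simply reused from the first one.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of the original project.
class StockUrls {
    // %s is the board code, %d the page number; literal percent signs are doubled for String.format.
    private static final String TEMPLATE =
            "http://nufm.dfcfw.com/EM_Finance2014NumericApplication/JS.aspx"
            + "?type=CT&cmd=%s&sty=FCOIATA&sortType=C&sortRule=-1&page=%d&pageSize=20"
            + "&js=var%%20quote_123%%3d{rank:[(x)],pages:(pc)}"
            + "&token=7bc05d0d4c3c22ef9fca8c2a912d779c&jsName=quote_123&_g=0.13204790262127375";

    // Returns one URL per page, for pages 1..pages inclusive.
    static List<String> pageUrls(String boardCmd, int pages) {
        List<String> urls = new ArrayList<>();
        for (int page = 1; page <= pages; page++) {
            urls.add(String.format(TEMPLATE, boardCmd, page));
        }
        return urls;
    }

    public static void main(String[] args) {
        // Automobile sector: six pages
        for (String url : pageUrls("C.BK04811", 6)) {
            System.out.println(url);
        }
    }
}
```

Using an inclusive upper bound (`page <= pages`) keeps the page count and the loop in agreement, which is easy to get wrong with hand-written bounds.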

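util

There are three files here: HTTPUtils, TimeUtils (a class I use often, mainly for date conversions such as String to Date and getting the current time), and UumericalUtil (a class that rounds a float to a given number of decimal places).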

    package util;

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStream;
    import java.io.InputStreamReader;
    import java.net.URL;
    import java.net.URLConnection;

    /**
     * @author Qian Yang, School of Management, Hefei University of Technology
     * @email 1563178220@qq.com
     */
    public abstract class HTTPUtils {
        // Requests data from the server and returns the body (HTML, JSON, etc.) as a String.
        // Returns an empty string if the request fails.
        public static String getRawHtml(String personalUrl) throws InterruptedException, IOException {
            URL url = new URL(personalUrl);
            URLConnection conn = url.openConnection();
            conn.setConnectTimeout(3000);
            InputStream in = null;
            try {
                in = conn.getInputStream();
            } catch (IOException e) {
                // Report the failure instead of swallowing it silently
                e.printStackTrace();
            }
            // Convert the response stream to a String
            return convertStreamToString(in);
        }

        // Converts an InputStream to a String (line breaks are dropped)
        public static String convertStreamToString(InputStream is) throws IOException {
            if (is == null) {
                return "";
            }
            StringBuilder sb = new StringBuilder();
            // try-with-resources closes the reader and the underlying stream
            try (BufferedReader reader = new BufferedReader(new InputStreamReader(is, "utf-8"))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    sb.append(line);
                }
            } catch (IOException e) {
                e.printStackTrace();
            }
            return sb.toString();
        }
    }

The following class handles conversions between various time formats; you can reuse it in your own projects.

    package util;

    import java.text.DateFormat;
    import java.text.DecimalFormat;
    import java.text.ParseException;
    import java.text.SimpleDateFormat;
    import java.util.ArrayList;
    import java.util.Calendar;
    import java.util.Date;
    import java.util.List;

    /**
     * @author Qian Yang, School of Management, Hefei University of Technology
     * @email 1563178220@qq.com
     */
    public class TimeUtils {
        public static void main(String[] args) throws ParseException {
            String time = getMonth("2002-1-08 14:50:38");
            System.out.println(time);
            System.out.println(getDay("2002-1-08 14:50:38"));
            System.out.println(TimeUtils.parseTime("2016-05-19 19:17", "yyyy-MM-dd HH:mm"));
        }

        // Returns the current time formatted with the given pattern
        public static String GetNowDate(String formate) {
            return new SimpleDateFormat(formate).format(new Date());
        }

        // Returns the yyyy-MM part of a time string, or null if it cannot be parsed
        public static String getMonth(String time) {
            return reformat(time, "yyyy-MM");
        }

        // Returns the yyyy-MM-dd part of a time string, or null if it cannot be parsed
        public static String getDay(String time) {
            return reformat(time, "yyyy-MM-dd");
        }

        // Parses the leading part of the string with the pattern and formats it back
        private static String reformat(String time, String pattern) {
            SimpleDateFormat sdf = new SimpleDateFormat(pattern);
            try {
                return sdf.format(sdf.parse(time));
            } catch (ParseException e) {
                e.printStackTrace();
                return null;
            }
        }

        public static Date parseTime(String inputTime) throws ParseException {
            return new SimpleDateFormat("yyyy-MM-dd HH:mm:ss").parse(inputTime);
        }

        public static String dateToString(Date date, String type) {
            DateFormat df = new SimpleDateFormat(type);
            return df.format(date);
        }

        public static Date parseTime(String inputTime, String timeFormat) throws ParseException {
            return new SimpleDateFormat(timeFormat).parse(inputTime);
        }

        public static Calendar parseTimeToCal(String inputTime, String timeFormat) throws ParseException {
            Calendar calendar = Calendar.getInstance();
            calendar.setTime(new SimpleDateFormat(timeFormat).parse(inputTime));
            return calendar;
        }

        // Number of whole 24-hour periods between the two calendars
        public static int getDaysBetweenCals(Calendar cal1, Calendar cal2) throws ParseException {
            return (int) ((cal2.getTimeInMillis() - cal1.getTimeInMillis()) / (1000 * 24 * 3600));
        }

        // Converts epoch milliseconds to a Date
        public static Date parseTime(long inputTime) {
            return new Date(inputTime);
        }

        // Converts epoch milliseconds to "yyyy-MM-dd HH:mm:ss"
        public static String parseTimeString(long inputTime) {
            return new SimpleDateFormat("yyyy-MM-dd HH:mm:ss").format(new Date(inputTime));
        }

        // Converts "yyyyMMddHHmmss" to "yyyy-MM-dd HH:mm:ss"; returns null if it cannot be parsed
        public static String parseStringTime(String inputTime) {
            String date = null;
            try {
                Date date1 = new SimpleDateFormat("yyyyMMddHHmmss").parse(inputTime);
                date = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss").format(date1);
            } catch (ParseException e) {
                e.printStackTrace();
            }
            return date;
        }

        // The twelve months of one year as "yyyyMM" (no separator)
        public static List<String> YearMonth(int year) {
            List<String> yearmouthlist = new ArrayList<String>();
            DecimalFormat dfInt = new DecimalFormat("00");
            for (int i = 1; i < 13; i++) {
                yearmouthlist.add(year + dfInt.format(i));
            }
            return yearmouthlist;
        }

        // Every month from startyear to finistyear inclusive, as "yyyy-MM"
        public static List<String> YearMonth(int startyear, int finistyear) {
            List<String> yearmouthlist = new ArrayList<String>();
            DecimalFormat dfInt = new DecimalFormat("00");
            for (int i = startyear; i < finistyear + 1; i++) {
                for (int j = 1; j < 13; j++) {
                    yearmouthlist.add(i + "-" + dfInt.format(j));
                }
            }
            return yearmouthlist;
        }

        // Every day of the given year as "yyyy-MM-dd".
        // The original loop skipped the 1st of each month; it is included here.
        public static List<String> TOAllDay(int year) {
            List<String> daylist = new ArrayList<String>();
            SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd");
            Calendar cal = Calendar.getInstance();
            for (int month = 0; month < 12; month++) { // Calendar months start at 0
                cal.clear();
                cal.set(year, month, 1);
                int count = cal.getActualMaximum(Calendar.DAY_OF_MONTH);
                for (int day = 1; day <= count; day++) {
                    cal.set(Calendar.DAY_OF_MONTH, day);
                    daylist.add(sdf.format(cal.getTime()));
                }
            }
            return daylist;
        }

        // Returns yesterday's date as "yyyy-MM-dd"
        public static String getyesterday() {
            Calendar cal = Calendar.getInstance();
            cal.add(Calendar.DATE, -1);
            return new SimpleDateFormat("yyyy-MM-dd").format(cal.getTime());
        }
    }
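On Java 8 and later, the date strings the crawler needs can also be produced with `java.time`, whose formatters are immutable and thread-safe, unlike `SimpleDateFormat`. This is a minimal sketch, assuming a hypothetical class `CrawlDates`; the `Clock` parameter is there only so the result can be checked against a fixed instant.

```java
import java.time.Clock;
import java.time.LocalDate;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

// Hypothetical helper for illustration; not part of the original project.
class CrawlDates {
    private static final DateTimeFormatter DAY = DateTimeFormatter.ofPattern("yyyy-MM-dd");
    private static final DateTimeFormatter SECOND = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss");

    // Today's date in the clock's time zone
    static String today(Clock clock) {
        return LocalDate.now(clock).format(DAY);
    }

    // Yesterday's date in the clock's time zone
    static String yesterday(Clock clock) {
        return LocalDate.now(clock).minusDays(1).format(DAY);
    }

    // Current date and time, to the second
    static String now(Clock clock) {
        return LocalDateTime.now(clock).format(SECOND);
    }

    public static void main(String[] args) {
        Clock clock = Clock.systemDefaultZone();
        System.out.println(today(clock) + " / " + yesterday(clock) + " / " + now(clock));
    }
}
```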

This class rounds numbers to a given number of decimal places; stock prices, for example, are kept to two decimal places.

    package util;

    import java.math.BigDecimal;
    import java.math.RoundingMode;
    import java.text.DecimalFormat;

    /**
     * @author Qian Yang, School of Management, Hefei University of Technology
     * @email 1563178220@qq.com
     */
    public class UumericalUtil {
        // Rounds f half-up to the given number of decimal places
        public static float FloatTO(float f, int number) {
            // Build the BigDecimal from the decimal string so that the float's
            // binary representation error does not affect the rounding
            BigDecimal b = new BigDecimal(Float.toString(f));
            return b.setScale(number, RoundingMode.HALF_UP).floatValue();
        }

        // Formats an int with at least two digits, e.g. 7 becomes "07"
        public static String NumberTO(int number) {
            return new DecimalFormat("00").format(number);
        }
    }
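One detail is worth checking when a `float` is rounded through `BigDecimal`: the `BigDecimal(double)` constructor takes the exact binary value, which for a float such as `1.005f` lies slightly below 1.005, so half-up rounding goes down. Building the `BigDecimal` from the decimal string avoids that. The sketch below illustrates the difference; `RoundingDemo` is a hypothetical name.

```java
import java.math.BigDecimal;
import java.math.RoundingMode;

// Hypothetical demo class for illustration; not part of the original project.
class RoundingDemo {
    // Rounds using the float's exact binary value
    static BigDecimal roundBinary(float f, int scale) {
        return new BigDecimal(f).setScale(scale, RoundingMode.HALF_UP);
    }

    // Rounds using the float's shortest decimal representation
    static BigDecimal roundDecimal(float f, int scale) {
        return new BigDecimal(Float.toString(f)).setScale(scale, RoundingMode.HALF_UP);
    }

    public static void main(String[] args) {
        System.out.println(roundBinary(1.005f, 2));   // prints 1.00
        System.out.println(roundDecimal(1.005f, 2));  // prints 1.01
    }
}
```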

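parse

parse uses Jsoup or other tools to parse the fetched content, wraps the parsed records in a List, and passes them back up to the main method. Here I simply use the most basic approach, string splitting. The following is the data for one page, which is the kind of data to be parsed: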

    var quote_123={rank:[
    "2,002662,京威股份,15.62,0.38,2.49%,2.95,10294,15948185,15.24,15.28,15.65,15.20,-,-,-,-,-,-,-,-,0.00%,0.62,0.17,33.47",
    "2,002536,西泵股份,13.15,0.32,2.49%,3.74,26558,34710121,12.83,12.88,13.27,12.79,-,-,-,-,-,-,-,-,0.00%,0.99,0.87,41.09",
    "1,600741,华域汽车,16.22,0.39,2.46%,2.59,215140,346480560,15.83,15.85,16.26,15.85,-,-,-,-,-,-,-,-,0.12%,1.23,0.75,8.59",
    "1,601689,拓普集团,29.74,0.68,2.34%,3.20,36329,107964394,29.06,29.06,29.94,29.01,-,-,-,-,-,-,-,-,-0.20%,1.34,2.13,34.32",
    "1,603306,华懋科技,33.87,0.74,2.23%,4.50,9251,31242113,33.13,33.14,34.20,32.71,-,-,-,-,-,-,-,-,-0.03%,0.72,1.25,29.60",
    "1,601799,星宇股份,37.40,0.80,2.19%,3.80,5522,20477010,36.60,36.40,37.50,36.11,-,-,-,-,-,-,-,-,0.03%,0.86,0.23,28.43",
    "1,603166,福达股份,14.02,0.29,2.11%,2.91,47265,66170428,13.73,13.80,14.14,13.74,-,-,-,-,-,-,-,-,0.21%,0.96,3.15,95.59",
    "2,002190,成飞集成,32.44,0.66,2.08%,2.99,25213,81219488,31.78,31.63,32.58,31.63,-,-,-,-,-,-,-,-,0.03%,0.86,0.73,93.58",
    "1,600213,亚星客车,14.77,0.30,2.07%,3.46,18878,27820060,14.47,14.52,14.88,14.38,-,-,-,-,-,-,-,-,-0.07%,0.64,0.86,55.39",
    "2,300432,富临精工,21.28,0.43,2.06%,4.70,28707,60945368,20.85,20.60,21.58,20.60,-,-,-,-,-,-,-,-,-0.14%,1.29,2.07,50.58",
    "2,300375,鹏翎股份,21.25,0.42,2.02%,3.94,11367,24164157,20.83,20.83,21.45,20.63,-,-,-,-,-,-,-,-,-0.14%,0.83,1.44,30.27",
    "2,002363,隆基机械,11.47,0.22,1.96%,2.49,33946,38796837,11.25,11.27,11.55,11.27,-,-,-,-,-,-,-,-,0.00%,0.80,0.88,61.45",
    "1,600469,风神股份,11.55,0.22,1.94%,3.09,38444,44305565,11.33,11.33,11.63,11.28,-,-,-,-,-,-,-,-,0.09%,0.67,0.68,27.07",
    "2,002454,松芝股份,12.98,0.24,1.88%,2.83,27839,36056020,12.74,12.70,13.06,12.70,-,-,-,-,-,-,-,-,0.00%,1.17,0.87,25.84",
    "2,002488,金固股份,14.79,0.27,1.86%,2.48,29002,42872475,14.52,14.52,14.88,14.52,-,-,-,-,-,-,-,-,0.00%,0.72,0.75,-",
    "2,002284,亚太股份,13.18,0.24,1.85%,3.32,61756,81198133,12.94,12.87,13.30,12.87,-,-,-,-,-,-,-,-,0.30%,1.10,0.90,58.15",
    "1,603788,宁波高发,35.97,0.64,1.81%,3.40,6719,24160418,35.33,35.21,36.33,35.13,-,-,-,-,-,-,-,-,0.03%,0.59,1.37,34.10",
    "2,000957,中通客车,14.36,0.25,1.77%,2.69,59696,85581415,14.11,14.07,14.45,14.07,-,-,-,-,-,-,-,-,0.00%,0.79,1.25,13.99",
    "2,300304,云意电气,52.12,0.90,1.76%,5.70,179330,922614032,51.22,50.38,52.83,49.91,-,-,-,-,-,-,-,-,-0.04%,1.12,9.35,108.58",
    "2,002607,亚夏汽车,10.03,0.17,1.72%,4.16,27760,27878904,9.86,9.89,10.19,9.78,-,-,-,-,-,-,-,-,-0.30%,0.97,1.03,57.87"
    ],pages:6}

    package parse;

    import java.util.ArrayList;
    import java.util.List;

    import model.ExtMarketOilStockModel;
    import util.HTTPUtils;
    import util.TimeUtils;
    import util.UumericalUtil;

    /**
     * @author Qian Yang, School of Management, Hefei University of Technology
     * @email 1563178220@qq.com
     */
    public class ExtMarketOilStockParse {
        public static List<ExtMarketOilStockModel> parseurl(String url) throws Exception {
            List<ExtMarketOilStockModel> list = new ArrayList<ExtMarketOilStockModel>();
            String html = HTTPUtils.getRawHtml(url);
            // An empty or unexpected response (e.g. after a failed request) yields no records
            int start = html.indexOf("rank:");
            if (start < 0) {
                return list;
            }
            // Cut out the array between "rank:" and ",pages"
            String jsonarra = html.substring(start + "rank:".length()).split(",pages")[0];
            // Each record is a quoted string; split at the closing quote plus comma
            String[] stocks = jsonarra.split("\",");
            for (String stock : stocks) {
                String record = stock.replace("[\"", "").replace("\"", "").replace("]", "");
                System.out.println(record);
                // Split the record once instead of once per field
                String[] fields = record.split(",");
                if (fields.length < 22) {
                    continue; // not a complete record
                }
                ExtMarketOilStockModel model = new ExtMarketOilStockModel();
                model.setDate(TimeUtils.GetNowDate("yyyy-MM-dd"));
                model.setStock_id(fields[1]);
                model.setStock_name(fields[2]);
                // A price of "-" means no quote; the numeric fields then keep their default of 0
                if (!fields[3].equals("-")) {
                    model.setStock_price(Float.parseFloat(fields[3]));                  // price
                    model.setStock_change(Float.parseFloat(fields[4]));                 // change (amount)
                    model.setStock_range(percentToRatio(fields[5]));                    // change (percent)
                    model.setStock_amplitude(percentToRatio(fields[6]));                // amplitude (percent)
                    model.setStock_trading_number(Integer.parseInt(fields[7]));         // volume
                    model.setStock_trading_value(Integer.parseInt(fields[8]));          // turnover
                    model.setStock_yesterdayfinish_price(Float.parseFloat(fields[9]));  // previous close
                    model.setStock_todaystart_price(Float.parseFloat(fields[10]));      // open
                    model.setStock_max_price(Float.parseFloat(fields[11]));             // high
                    model.setStock_min_price(Float.parseFloat(fields[12]));             // low
                    model.setStock_fiveminuate_change(percentToRatio(fields[21]));      // five-minute change
                }
                model.setCraw_time(TimeUtils.GetNowDate("yyyy-MM-dd HH:mm:ss"));
                list.add(model);
            }
            return list;
        }

        // Converts a percentage such as "2.49%" to a ratio rounded to 4 decimal places (0.0249)
        private static float percentToRatio(String percent) {
            return UumericalUtil.FloatTO((float) (Float.parseFloat(percent.replace("%", "")) * 0.01), 4);
        }
    }
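As an alternative to chained `split` calls, the records and the page count can be pulled out with regular expressions from `java.util.regex`. The sketch below is an illustration under assumptions: `RankExtractor` is a hypothetical class, the response is assumed to keep the `var quote_123={rank:[...],pages:N}` shape of the sample data above, and the sample string in `main` is shortened, with placeholder names instead of the real stock names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of the original project.
class RankExtractor {
    // One quoted record, e.g. "2,002662,...,33.47"
    private static final Pattern RECORD = Pattern.compile("\"([^\"]*)\"");
    // The trailing page count, e.g. pages:6
    private static final Pattern PAGES = Pattern.compile("pages:(\\d+)");

    // Returns each quoted record split into its comma-separated fields.
    static List<String[]> records(String response) {
        List<String[]> out = new ArrayList<>();
        Matcher m = RECORD.matcher(response);
        while (m.find()) {
            out.add(m.group(1).split(","));
        }
        return out;
    }

    // Returns the page count, or 0 if the response has no "pages:" field.
    static int pages(String response) {
        Matcher m = PAGES.matcher(response);
        return m.find() ? Integer.parseInt(m.group(1)) : 0;
    }

    public static void main(String[] args) {
        // Shortened sample; NAME1 and NAME2 stand in for the stock names
        String sample = "var quote_123={rank:[\"2,002662,NAME1,15.62,0.38,2.49%\","
                + "\"1,600741,NAME2,16.22,0.39,2.46%\"],pages:6}";
        for (String[] fields : records(sample)) {
            System.out.println(fields[1] + " " + fields[2] + " " + fields[3]);
        }
        System.out.println("pages = " + pages(sample));
    }
}
```

Reading the page count from the response would also let the crawler decide how many pages to request, rather than hard-coding the number of pages per sector.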

db

The db package contains two Java files, MyDataSource and MYSQLControl, whose roles were described above.

    package db;

    import javax.sql.DataSource;

    import org.apache.commons.dbcp2.BasicDataSource;

    /**
     * @author Qian Yang, School of Management, Hefei University of Technology
     * @email 1563178220@qq.com
     */
    public class MyDataSource {
        public static DataSource getDataSource(String connectURI) {
            BasicDataSource ds = new BasicDataSource();
            // MySQL JDBC driver (Connector/J 5.x name; Connector/J 8 uses com.mysql.cj.jdbc.Driver)
            ds.setDriverClassName("com.mysql.jdbc.Driver");
            ds.setUsername("root");   // MySQL user name
            ds.setPassword("112233"); // MySQL login password
            ds.setUrl(connectURI);
            return ds;
        }
    }

    package db;

    import java.sql.SQLException;
    import java.sql.Statement;
    import java.util.List;

    import javax.sql.DataSource;

    import org.apache.commons.dbutils.QueryRunner;
    import org.apache.commons.dbutils.ResultSetHandler;
    import org.apache.commons.dbutils.handlers.BeanListHandler;
    import org.apache.commons.dbutils.handlers.ColumnListHandler;
    import org.apache.commons.dbutils.handlers.ScalarHandler;
    import org.apache.commons.logging.Log;
    import org.apache.commons.logging.LogFactory;

    import model.ExtMarketOilStockModel;

    /**
     * @author Qian Yang, School of Management, Hefei University of Technology
     * @email 1563178220@qq.com
     */
    public class MYSQLControl {
        static final Log logger = LogFactory.getLog(MYSQLControl.class);
        static DataSource ds = MyDataSource.getDataSource("jdbc:mysql://127.0.0.1:3306/datacollection");
        static QueryRunner qr = new QueryRunner(ds);

        // Executes an update statement (INSERT, UPDATE, DELETE, DDL)
        public static void executeUpdate(String sql) {
            try {
                qr.update(sql);
            } catch (SQLException e) {
                logger.error(e);
            }
        }

        // Runs a query and returns a single value (first column of the first row)
        public static Object getScalaBySQL(String sql) {
            ResultSetHandler<Object> h = new ScalarHandler<Object>(1);
            Object obj = null;
            try {
                obj = qr.query(sql, h);
            } catch (SQLException e) {
                e.printStackTrace();
            }
            return obj;
        }

        // Runs a query and maps each row to a bean of the given type
        public static <T> List<T> getListInfoBySQL(String sql, Class<T> type) {
            List<T> list = null;
            try {
                list = qr.query(sql, new BeanListHandler<T>(type));
            } catch (SQLException e) {
                e.printStackTrace();
            }
            return list;
        }

        // Runs a query and returns the values of one column
        public static List<Object> getListOneBySQL(String sql, String id) {
            List<Object> list = null;
            try {
                list = qr.query(sql, new ColumnListHandler<Object>(id));
            } catch (SQLException e) {
                e.printStackTrace();
            }
            return list;
        }

        // Inserts petroleum-sector records; returns the number of rows inserted
        public static int insertoilStocks(List<ExtMarketOilStockModel> oilstocks) {
            return insertStocks("ext_market_oil_stock", oilstocks, "Petroleum");
        }

        // Inserts automobile-sector records; returns the number of rows inserted
        public static int insertcarStocks(List<ExtMarketOilStockModel> carstocks) {
            return insertStocks("ext_market_car_stock", carstocks, "Automobile");
        }

        // Shared batch insert. This approach could still be optimized.
        private static int insertStocks(String table, List<ExtMarketOilStockModel> stocks, String label) {
            Object[][] params = new Object[stocks.size()][17];
            for (int i = 0; i < stocks.size(); i++) {
                ExtMarketOilStockModel s = stocks.get(i);
                params[i][0] = s.getDate();
                params[i][1] = s.getStock_id();
                params[i][2] = s.getStock_name();
                params[i][3] = s.getStock_price();
                params[i][4] = s.getStock_change();
                params[i][5] = s.getStock_range();
                params[i][6] = s.getStock_amplitude();
                params[i][7] = s.getStock_trading_number();
                params[i][8] = s.getStock_trading_value();
                params[i][9] = s.getStock_yesterdayfinish_price();
                params[i][10] = s.getStock_todaystart_price();
                params[i][11] = s.getStock_max_price();
                params[i][12] = s.getStock_min_price();
                params[i][13] = s.getStock_fiveminuate_change();
                params[i][14] = s.getCraw_time();
                params[i][15] = null; // the last two columns are left NULL
                params[i][16] = null;
            }
            int c = 0; // number of successful inserts
            try {
                // Target table and one placeholder per column
                int[] sum = qr.batch("INSERT INTO `datacollection`.`" + table + "` VALUES (?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?)", params);
                for (int n : sum) {
                    if (n != Statement.EXECUTE_FAILED) {
                        c++;
                    }
                }
                System.out.println(label + " data stored");
            } catch (SQLException e) {
                System.out.println(e);
            }
            return c;
        }
    }

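With that, the crawler is in principle complete; just run the main method. The figure below shows part of the data obtained by the main method.

[Screenshot: partial output of the main method]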

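The problems

Problem 1: stock data like this is published every Monday through Friday. How can the program fetch it automatically at a fixed time each day, instead of being run by hand every day?

Problem 2: the market does not open on holidays, yet the page still shows data, and the display carries no time label. How should this case be handled?

First, let's take a look at my database design.

[Screenshot: database table design]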

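The solution

Quartz is used here to run the program on a schedule, which addresses the first problem (http://blog.csdn.net/qy20115549/article/details/52723907).
For the second problem, that is, how to tell that the market was closed on a given day, the approach is to draw three stocks at random from the database, taken from the most recent stored date. For example, if today is Saturday, January 21, three stocks dated January 20 are drawn at random, and their prices are compared with today's prices for the same stock IDs. If all three prices are identical, that day is judged to be a holiday: prices have not moved, so the data need not be inserted into the database.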
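The comparison described above can be isolated into a small pure function, which makes it easy to test without a database. This is a minimal sketch under assumptions: `HolidayCheck` and its parameter names are hypothetical, prices are keyed by stock ID, and two prices count as equal when they differ by less than 0.001 (an assumed tolerance that avoids exact `float` comparison).

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper for illustration; not part of the original project.
class HolidayCheck {
    // Returns true only if every sampled stock has the same price today as on the last stored date.
    static boolean looksClosed(Map<String, Float> lastPrices, Map<String, Float> todayPrices) {
        if (lastPrices.isEmpty()) {
            return false; // nothing stored yet, so there is nothing to compare against
        }
        for (Map.Entry<String, Float> e : lastPrices.entrySet()) {
            Float today = todayPrices.get(e.getKey());
            if (today == null || Math.abs(today - e.getValue()) >= 0.001f) {
                return false; // missing or moved price: treat as a trading day
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Illustrative values only
        Map<String, Float> last = new HashMap<>();
        last.put("002662", 15.62f);
        last.put("600741", 16.22f);
        last.put("601689", 29.74f);

        Map<String, Float> today = new HashMap<>(last);
        System.out.println(looksClosed(last, today)); // all three unchanged

        today.put("600741", 16.30f);
        System.out.println(looksClosed(last, today)); // one price moved
    }
}
```

Requiring all sampled prices to match means that a single stock whose price happens not to move, for example because it is suspended, is not enough on its own to classify the day as a holiday.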

job

    package job;

    import java.util.ArrayList;
    import java.util.List;

    import org.quartz.Job;
    import org.quartz.JobExecutionContext;
    import org.quartz.JobExecutionException;

    import db.MYSQLControl;
    import model.ExtMarketOilStockModel;
    import parse.ExtMarketOilStockParse;

    /**
     * @author Qian Yang, School of Management, Hefei University of Technology
     * @email 1563178220@qq.com
     */
    public class ExtMarketOilStockJob implements Job {
        @Override
        public void execute(JobExecutionContext arg0) throws JobExecutionException {
            // Draw three stocks at random from the most recent stored date;
            // they are used to decide whether today is a holiday
            List<ExtMarketOilStockModel> randomlist = MYSQLControl.getListInfoBySQL("select stock_id,stock_price,stock_change from ext_market_oil_stock where date = (select date from ext_market_oil_stock order by date desc limit 1) order by rand() limit 3", ExtMarketOilStockModel.class);
            List<String> urloillist = new ArrayList<String>();
            List<String> urlcarlist = new ArrayList<String>();
            String url1 = "http://nufm.dfcfw.com/EM_Finance2014NumericApplication/JS.aspx?type=CT&cmd=C.BK04641&sty=FCOIATA&sortType=C&sortRule=-1&page=1&pageSize=20&js=var%20quote_123%3d{rank:[(x)],pages:(pc)}&token=7bc05d0d4c3c22ef9fca8c2a912d779c&jsName=quote_123&_g=0.13204790262127375";
            String url2 = "http://nufm.dfcfw.com/EM_Finance2014NumericApplication/JS.aspx?type=CT&cmd=C.BK04641&sty=FCOIATA&sortType=C&sortRule=-1&page=2&pageSize=20&js=var%20quote_123%3d{rank:[(x)],pages:(pc)}&token=7bc05d0d4c3c22ef9fca8c2a912d779c&jsName=quote_123&_g=0.6972178580603532";
            urloillist.add(url1);
            urloillist.add(url2);
            // Fetch every petroleum page first, so the comparison sees all stocks
            // regardless of which page a sampled stock is on
            List<ExtMarketOilStockModel> oilstocks = new ArrayList<ExtMarketOilStockModel>();
            for (String url : urloillist) {
                try {
                    oilstocks.addAll(ExtMarketOilStockParse.parseurl(url));
                } catch (Exception e) {
                    e.printStackTrace();
                }
            }
            // Count how many sampled stocks still have the same price as last time
            int judge = 0;
            if (randomlist != null) {
                for (ExtMarketOilStockModel last : randomlist) {
                    for (ExtMarketOilStockModel now : oilstocks) {
                        if (now.getStock_id().equals(last.getStock_id())
                                && Math.abs(now.getStock_price() - last.getStock_price()) < 0.001f) {
                            judge++;
                            break;
                        }
                    }
                }
            }
            // All three unchanged: treat today as a holiday and store nothing
            if (judge == 3) {
                return;
            }
            if (!oilstocks.isEmpty()) {
                MYSQLControl.insertoilStocks(oilstocks);
            }
            // The automobile sector has 6 pages (page=1..6)
            for (int i = 1; i <= 6; i++) {
                String urli = "http://nufm.dfcfw.com/EM_Finance2014NumericApplication/JS.aspx?type=CT&cmd=C.BK04811&sty=FCOIATA&sortType=C&sortRule=-1&page=" + i + "&pageSize=20&js=var%20quote_123%3d{rank:[(x)],pages:(pc)}&token=7bc05d0d4c3c22ef9fca8c2a912d779c&jsName=quote_123&_g=0.23492960370783944";
                urlcarlist.add(urli);
            }
            for (String url : urlcarlist) {
                try {
                    // Insert only pages that were fetched successfully
                    List<ExtMarketOilStockModel> carstocks = ExtMarketOilStockParse.parseurl(url);
                    MYSQLControl.insertcarStocks(carstocks);
                } catch (Exception e) {
                    e.printStackTrace();
                }
            }
        }
    }

jobmain

As shown below, the job is scheduled for 20:39 (8:39 p.m.) every Monday through Friday, so the data is crawled every weekday.

    package jobmain;

    import static org.quartz.CronScheduleBuilder.cronSchedule;
    import static org.quartz.JobBuilder.newJob;
    import static org.quartz.TriggerBuilder.newTrigger;

    import java.text.SimpleDateFormat;
    import java.util.Date;

    import org.quartz.CronTrigger;
    import org.quartz.JobDetail;
    import org.quartz.Scheduler;
    import org.quartz.SchedulerFactory;
    import org.quartz.impl.StdSchedulerFactory;

    import job.ExtMarketOilStockJob;

    /**
     * @author Qian Yang, School of Management, Hefei University of Technology
     * @email 1563178220@qq.com
     */
    public class ExtMarketOilStockJobMain {
        public void go() throws Exception {
            // First, obtain a reference to a Scheduler
            SchedulerFactory sf = new StdSchedulerFactory();
            Scheduler sched = sf.getScheduler();
            // Jobs can be scheduled before sched.start() is called
            JobDetail job = newJob(ExtMarketOilStockJob.class).withIdentity("stockjob", "stockgroup").build();
            // Run the job at 20:39 every Monday through Friday
            // (cron fields: second minute hour day-of-month month day-of-week)
            CronTrigger trigger = newTrigger().withIdentity("stocktrigger", "stockgroup").withSchedule(cronSchedule("0 39 20 ? * MON-FRI")).build();
            Date ft = sched.scheduleJob(job, trigger);
            SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss SSS");
            System.out.println(job.getKey() + " has been scheduled to run at: " + sdf.format(ft) + " and will repeat according to: " + trigger.getCronExpression());
            sched.start();
        }

        public static void main(String[] args) throws Exception {
            ExtMarketOilStockJobMain maingo = new ExtMarketOilStockJobMain();
            maingo.go();
        }
    }

Running the class in jobmain is all it takes to crawl the data at a fixed time every weekday.
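To see what a "weekdays at a fixed time" rule amounts to, the next firing time can be computed with `java.time` alone. This is an illustrative sketch and not a description of how Quartz works internally; `NextRun` is a hypothetical class, and the 20:39 time mirrors the cron expression `0 39 20 ? * MON-FRI` used above.

```java
import java.time.DayOfWeek;
import java.time.LocalDateTime;
import java.time.LocalTime;

// Hypothetical helper for illustration; not part of the original project.
class NextRun {
    // Returns the first Monday-to-Friday occurrence of runAt that is strictly after now.
    static LocalDateTime next(LocalDateTime now, LocalTime runAt) {
        LocalDateTime candidate = now.toLocalDate().atTime(runAt);
        if (!candidate.isAfter(now)) {
            candidate = candidate.plusDays(1); // today's slot has already passed
        }
        while (candidate.getDayOfWeek() == DayOfWeek.SATURDAY
                || candidate.getDayOfWeek() == DayOfWeek.SUNDAY) {
            candidate = candidate.plusDays(1); // skip the weekend
        }
        return candidate;
    }

    public static void main(String[] args) {
        System.out.println("Next run: " + next(LocalDateTime.now(), LocalTime.of(20, 39)));
    }
}
```

A weekday-only rule still fires on public holidays that fall between Monday and Friday, which is why the job itself needs the price comparison described earlier.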
