clscripts

Repository for computational linguistics scripts (bash, python, octave, R, etc.).

View the Project on GitHub leolca/clscripts

This is the documentation webpage for clscripts, a collection of computational linguistics scripts written in several languages (bash, python, octave, R, C, etc.).

Table of Contents

  1. wordcounttfl.sh
    count the occurrences of words in a text file
  2. entropy.py
    estimate the Shannon entropy from a list of counts
  3. heapslaw.py
    extract the vocabulary size at different lengths of a text file
  4. vgc.py
    vocabulary growth curve data
  5. wordslengthdist.sh
    estimate the word length distribution
  6. surroundingcontext.sh
    surrounding context of a word
  7. wordposition.sh
    get word location in a file
  8. wordchart.sh
    word chart
  9. simons.py
    Simon model of constant vocabulary growth through the addition of new words
  10. swnwdensity.sh
    new word density in a sliding window
  11. windowentropy.sh
    entropy in sliding windows of text
  12. windowindex.py
    start and end indexes of windows
  13. getwindow.sh
    start and end index
  14. downloadGutenbergTop100in30days.sh
    download the top 100 ebooks from Gutenberg
  15. downloadTheFederalistPapers.sh
    download the Federalist Papers
  16. wordfreq.awk
    word frequency
  17. ngram
    compute n-grams of characters or words
  18. tokenize.sh
    tokenizer